Numerical linear algebra

System information (for reproducibility)

In [1]:
versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin21.4.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 4

Introduction

  • Topics in numerical algebra:

    • BLAS
    • solve linear equations $\mathbf{A} \mathbf{x} = \mathbf{b}$
    • regression computations $\mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y}$
    • eigen-problems $\mathbf{A} \mathbf{x} = \lambda \mathbf{x}$
    • generalized eigen-problems $\mathbf{A} \mathbf{x} = \lambda \mathbf{B} \mathbf{x}$
    • singular value decompositions $\mathbf{A} = \mathbf{U} \Sigma \mathbf{V}^T$
    • iterative methods for numerical linear algebra
  • Except for the iterative methods, most of these numerical linear algebra tasks are implemented in the BLAS and LAPACK libraries. They form the building blocks of most statistical computing tasks (optimization, MCMC).

  • All high-level languages (R, Matlab, Julia) call BLAS and LAPACK for numerical linear algebra.

    • Julia offers more flexibility by exposing interfaces to many BLAS/LAPACK subroutines directly. See documentation.
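As a small illustration of this flexibility, here is a minimal sketch (with a hypothetical positive definite matrix S) of calling a LAPACK subroutine directly; potrf! computes the Cholesky factor in place and agrees with the high-level cholesky interface.

using LinearAlgebra

S = let X = randn(5, 5); X' * X + 5I end     # small positive definite matrix (hypothetical example)
F, info = LinearAlgebra.LAPACK.potrf!('U', copy(S))  # in-place Cholesky via the LAPACK wrapper
@assert info == 0                            # info == 0 signals success
UpperTriangular(F) ≈ cholesky(Symmetric(S)).U  # matches the high-level interface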

BLAS

  • BLAS stands for basic linear algebra subprograms.

  • See netlib for a complete list of standardized BLAS functions.

  • There are many implementations of BLAS.

    • Netlib provides a reference implementation.
    • Matlab uses Intel's MKL (Math Kernel Library). The MKL implementation is the gold standard on the market. It is not open source, but the compiled library is free for Linux and macOS.
    • Julia uses OpenBLAS. OpenBLAS is the best open-source implementation.
  • There are 3 levels of BLAS functions.

| Level | Example Operation | Name | Dimension | Flops |
|-------|-------------------|------|-----------|-------|
| 1 | $\alpha \gets \mathbf{x}^T \mathbf{y}$ | dot product | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $2n$ |
| 1 | $\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}$ | axpy | $\alpha \in \mathbb{R}$, $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $2n$ |
| 2 | $\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}$ | gaxpy | $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$ | $2mn$ |
| 2 | $\mathbf{A} \gets \mathbf{A} + \mathbf{y} \mathbf{x}^T$ | rank-one update | $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$ | $2mn$ |
| 3 | $\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}$ | matrix multiplication | $\mathbf{A} \in \mathbb{R}^{m \times p}$, $\mathbf{B} \in \mathbb{R}^{p \times n}$, $\mathbf{C} \in \mathbb{R}^{m \times n}$ | $2mnp$ |
  • Typical BLAS functions support single precision (S), double precision (D), complex (C), and double complex (Z).
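To make the levels concrete, here is a minimal sketch (with small hypothetical vectors and matrices) of calling Julia's BLAS wrappers at each level; the double precision (D) routines are selected automatically by the Float64 element type.

using LinearAlgebra

n = 5
x, y = rand(n), rand(n)
A, B, C = rand(n, n), rand(n, n), zeros(n, n)

BLAS.dot(n, x, 1, y, 1)                  # level 1: dot product, 2n flops
BLAS.axpy!(2.0, x, y)                    # level 1: y <- y + 2.0 * x
BLAS.gemv!('N', 1.0, A, x, 1.0, y)       # level 2: y <- y + A * x (gaxpy)
BLAS.ger!(1.0, y, x, A)                  # level 2: A <- A + y * x' (rank-one update)
BLAS.gemm!('N', 'N', 1.0, A, B, 1.0, C)  # level 3: C <- C + A * B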

Examples

The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.

Some operations appear to be level-3 but are in fact level-2.

Example 1. A common operation in statistics is column scaling or row scaling $$ \begin{eqnarray*} \mathbf{A} &=& \mathbf{A} \mathbf{D} \quad \text{(column scaling)} \\ \mathbf{A} &=& \mathbf{D} \mathbf{A} \quad \text{(row scaling)}, \end{eqnarray*} $$ where $\mathbf{D}$ is diagonal. For example, in generalized linear models (GLMs), the Fisher information matrix takes the form
$$ \mathbf{X}^T \mathbf{W} \mathbf{X}, $$ where $\mathbf{W}$ is a diagonal matrix with the observation weights on the diagonal.

Column and row scalings are essentially level-2 operations!
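For instance, here is a minimal sketch (with a hypothetical design matrix X and weight vector w) of forming $\mathbf{X}^T \mathbf{W} \mathbf{X}$ without ever materializing $\mathbf{W}$ as a dense matrix.

using LinearAlgebra

n, p = 1000, 5
X = randn(n, p)
w = rand(n)                          # observation weights

info_dense = X' * diagm(w) * X       # forms a dense n-by-n W: wasteful level-3 work
info_diag  = X' * (Diagonal(w) * X)  # Diagonal dispatch: O(np) row scaling, then X' * (WX)
info_bcast = X' * (w .* X)           # broadcasting achieves the same row scaling

info_diag ≈ info_dense && info_bcast ≈ info_dense   # all three agree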

In [2]:
using BenchmarkTools, LinearAlgebra, Random

Random.seed!(123) # seed
n = 2000
A = rand(n, n) # n-by-n matrix
d = rand(n)  # n vector
D = Diagonal(d) # diagonal matrix with d as diagonal
Out[2]:
2000×2000 Diagonal{Float64, Vector{Float64}}:
 0.919181   ⋅         ⋅         ⋅        …   ⋅         ⋅          ⋅ 
  ⋅        0.426019   ⋅         ⋅            ⋅         ⋅          ⋅ 
  ⋅         ⋅        0.746586   ⋅            ⋅         ⋅          ⋅ 
  ⋅         ⋅         ⋅        0.819201      ⋅         ⋅          ⋅ 
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅          ⋅ 
  ⋅         ⋅         ⋅         ⋅        …   ⋅         ⋅          ⋅ 
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅          ⋅ 
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅          ⋅ 
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅          ⋅ 
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅          ⋅ 
  ⋅         ⋅         ⋅         ⋅        …   ⋅         ⋅          ⋅ 
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅          ⋅ 
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅          ⋅ 
 ⋮                                       ⋱                       
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅          ⋅ 
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅          ⋅ 
  ⋅         ⋅         ⋅         ⋅        …   ⋅         ⋅          ⋅ 
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅          ⋅ 
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅          ⋅ 
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅          ⋅ 
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅          ⋅ 
  ⋅         ⋅         ⋅         ⋅        …   ⋅         ⋅          ⋅ 
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅          ⋅ 
  ⋅         ⋅         ⋅         ⋅           0.572021   ⋅          ⋅ 
  ⋅         ⋅         ⋅         ⋅            ⋅        0.0403848   ⋅ 
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅         0.141618
In [3]:
Dfull = convert(Matrix, D) # convert to full matrix
Out[3]:
2000×2000 Matrix{Float64}:
 0.919181  0.0       0.0       0.0       …  0.0       0.0        0.0
 0.0       0.426019  0.0       0.0          0.0       0.0        0.0
 0.0       0.0       0.746586  0.0          0.0       0.0        0.0
 0.0       0.0       0.0       0.819201     0.0       0.0        0.0
 0.0       0.0       0.0       0.0          0.0       0.0        0.0
 0.0       0.0       0.0       0.0       …  0.0       0.0        0.0
 0.0       0.0       0.0       0.0          0.0       0.0        0.0
 0.0       0.0       0.0       0.0          0.0       0.0        0.0
 0.0       0.0       0.0       0.0          0.0       0.0        0.0
 0.0       0.0       0.0       0.0          0.0       0.0        0.0
 0.0       0.0       0.0       0.0       …  0.0       0.0        0.0
 0.0       0.0       0.0       0.0          0.0       0.0        0.0
 0.0       0.0       0.0       0.0          0.0       0.0        0.0
 ⋮                                       ⋱                       
 0.0       0.0       0.0       0.0          0.0       0.0        0.0
 0.0       0.0       0.0       0.0          0.0       0.0        0.0
 0.0       0.0       0.0       0.0       …  0.0       0.0        0.0
 0.0       0.0       0.0       0.0          0.0       0.0        0.0
 0.0       0.0       0.0       0.0          0.0       0.0        0.0
 0.0       0.0       0.0       0.0          0.0       0.0        0.0
 0.0       0.0       0.0       0.0          0.0       0.0        0.0
 0.0       0.0       0.0       0.0       …  0.0       0.0        0.0
 0.0       0.0       0.0       0.0          0.0       0.0        0.0
 0.0       0.0       0.0       0.0          0.572021  0.0        0.0
 0.0       0.0       0.0       0.0          0.0       0.0403848  0.0
 0.0       0.0       0.0       0.0          0.0       0.0        0.141618
In [4]:
# this is calling BLAS routine for matrix multiplication: O(n^3) flops
# this is SLOW!
@benchmark $A * $Dfull
Out[4]:
BenchmarkTools.Trial: 42 samples with 1 evaluation.
 Range (minmax):   97.589 ms202.120 ms   GC (min … max): 0.00% … 0.15%
 Time  (median):     116.057 ms                GC (median):    0.00%
 Time  (mean ± σ):   121.254 ms ±  21.703 ms   GC (mean ± σ):  1.77% ± 2.70%

   ▁▄ █▁   ▁█ ▄▁ ▁▄                    ▄                         
  ▆██▆██▆▆▆██████▆▆▆▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
  97.6 ms          Histogram: frequency by time          202 ms <

 Memory estimate: 30.52 MiB, allocs estimate: 2.
In [5]:
# dispatch to special method for diagonal matrix multiplication.
# columnwise scaling: O(n^2) flops
@benchmark $A * $D
Out[5]:
BenchmarkTools.Trial: 426 samples with 1 evaluation.
 Range (minmax):   9.075 ms19.795 ms   GC (min … max):  0.00% … 42.32%
 Time  (median):      9.685 ms               GC (median):     0.00%
 Time  (mean ± σ):   11.717 ms ±  3.066 ms   GC (mean ± σ):  18.21% ± 18.97%

   █▆▅                           ▅▅▂▁▂▁          
  ▇████▇▅█▅▆▁▅▆▅▅▁▁▄▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▆██████▅▄▅▁▅▄▄▄ ▇
  9.08 ms      Histogram: log(frequency) by time      17.6 ms <

 Memory estimate: 30.52 MiB, allocs estimate: 4.
In [6]:
# in-place: avoids allocating space for the result
# rmul!: computes the matrix-matrix product A * D, overwriting A, and returns the result
@benchmark rmul!($A, $D)
Out[6]:
BenchmarkTools.Trial: 429 samples with 1 evaluation.
 Range (minmax):   5.184 ms21.888 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     11.604 ms               GC (median):    0.00%
 Time  (mean ± σ):   11.679 ms ±  4.733 ms   GC (mean ± σ):  0.00% ± 0.00%

    ██▆█▃▃ ▁         ▅      ▄▁  ▃     ▁           ▁           
  ▆▆██████▇█▇▆▁▃▁▃▃▃▅█▆▃▇▄█▇██▆▇█▄▄▃▆▆█▆▃▄▇▅▄▃▃▃▄▇█▄▄▃▃▃█▃▅▃ ▄
  5.18 ms         Histogram: frequency by time        21.5 ms <

 Memory estimate: 96 bytes, allocs estimate: 2.

Note: In R or Matlab, diag(d) will create a full (dense) matrix. Be cautious when using the diag function: do we really need a full diagonal matrix?

In [7]:
using RCall

R"""
d <- runif(5)
diag(d)
"""
Out[7]:
RObject{RealSxp}
          [,1]      [,2]      [,3]      [,4]       [,5]
[1,] 0.3830245 0.0000000 0.0000000 0.0000000 0.00000000
[2,] 0.0000000 0.5793777 0.0000000 0.0000000 0.00000000
[3,] 0.0000000 0.0000000 0.2344167 0.0000000 0.00000000
[4,] 0.0000000 0.0000000 0.0000000 0.4607606 0.00000000
[5,] 0.0000000 0.0000000 0.0000000 0.0000000 0.01739585

Example 2. The inner product between two matrices $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times n}$ is often written as $$ \text{trace}(\mathbf{A}^T \mathbf{B}), \text{trace}(\mathbf{B} \mathbf{A}^T), \text{trace}(\mathbf{A} \mathbf{B}^T), \text{ or } \text{trace}(\mathbf{B}^T \mathbf{A}). $$ These appear to be level-3 operations (matrix multiplication with $O(m^2n)$ or $O(mn^2)$ flops).

In [8]:
Random.seed!(123)
n = 2000
A, B = randn(n, n), randn(n, n)

# slow way to evaluate this thing
@benchmark tr(transpose($A) * $B)
Out[8]:
BenchmarkTools.Trial: 36 samples with 1 evaluation.
 Range (minmax):  117.910 ms172.937 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     138.821 ms                GC (median):    0.00%
 Time  (mean ± σ):   140.982 ms ±  14.105 ms   GC (mean ± σ):  1.78% ± 2.50%

           █▃          ▃     ▃        ▃    █ ▃                ▃  
  ▇▁▁▁▇▁▇▇▁██▁▁▇▇▁▇▇▇▁▇█▇▇▇█▁▁▇▁▁▇▁▁█▇▁▁▁█▁█▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁█ ▁
  118 ms           Histogram: frequency by time          173 ms <

 Memory estimate: 30.52 MiB, allocs estimate: 2.

But $\text{trace}(\mathbf{A}^T \mathbf{B}) = \langle \text{vec}(\mathbf{A}), \text{vec}(\mathbf{B}) \rangle$. The latter is a level-2 operation with $O(mn)$ flops.

In [9]:
@benchmark dot($A, $B)
Out[9]:
BenchmarkTools.Trial: 1330 samples with 1 evaluation.
 Range (minmax):  2.419 ms 31.788 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     3.662 ms                GC (median):    0.00%
 Time  (mean ± σ):   3.727 ms ± 993.942 μs   GC (mean ± σ):  0.00% ± 0.00%

                    ▁▂▃▄▂▆▄▅▄▆▅▃▂▂   ▁                      
  ▃▂▁▂▂▁▂▁▂▂▂▃▃▃▄▄▆▇██████████████▇▇██▆▅▄▃▄▃▄▄▂▃▃▂▃▂▁▂▂▂▁▂ ▄
  2.42 ms         Histogram: frequency by time        5.02 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

Example 3. Similarly $\text{diag}(\mathbf{A}^T \mathbf{B})$ can be calculated in $O(mn)$ flops.

In [10]:
# slow way to evaluate this thing: O(n^3)
@benchmark diag(transpose($A) * $B)
Out[10]:
BenchmarkTools.Trial: 39 samples with 1 evaluation.
 Range (minmax):  107.914 ms186.092 ms   GC (min … max): 0.00% … 5.03%
 Time  (median):     130.929 ms                GC (median):    0.00%
 Time  (mean ± σ):   130.798 ms ±  16.272 ms   GC (mean ± σ):  1.95% ± 2.72%

   ▃█   ▃   █        █    ▃▃   █   ▃                            
  ▇██▁▇▇█▇▁▇█▁▁▇▁▇▇▁█▇▁▇▇██▁▇▁█▇▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  108 ms           Histogram: frequency by time          186 ms <

 Memory estimate: 30.53 MiB, allocs estimate: 3.
In [11]:
# smarter: O(n^2)
@benchmark Diagonal(vec(sum($A .* $B, dims=1)))
Out[11]:
BenchmarkTools.Trial: 459 samples with 1 evaluation.
 Range (minmax):   8.016 ms18.824 ms   GC (min … max):  0.00% … 39.49%
 Time  (median):      9.045 ms               GC (median):     0.00%
 Time  (mean ± σ):   10.880 ms ±  3.106 ms   GC (mean ± σ):  19.35% ± 19.71%

  ▃█    ▇                                                      
  ████▇▇█▆▄▃▂▂▂▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▃▅▆▅▅▄▂▄▃▂▂▃▇▃▅▃▄▁▃▂▁▂ ▃
  8.02 ms         Histogram: frequency by time        17.2 ms <

 Memory estimate: 30.53 MiB, allocs estimate: 5.

To get rid of the intermediate array allocation altogether, we can write a double loop or use the dot function column by column (see the sketch after the benchmark below).

In [12]:
using LoopVectorization

function diag_matmul!(d, A, B)
    m, n = size(A)
    @assert size(B) == (m, n) "A and B should have same size"
    fill!(d, 0)
    @avx for j in 1:n, i in 1:m
        d[j] += A[i, j] * B[i, j]
    end
#     for j in 1:n
#         @views d[j] = dot(A[:, j], B[:, j])
#     end
    Diagonal(d)
end

d = zeros(eltype(A), size(A, 2))
@benchmark diag_matmul!($d, $A, $B)
┌ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]
└ @ Base loading.jl:1423
Out[12]:
BenchmarkTools.Trial: 1113 samples with 1 evaluation.
 Range (minmax):  3.406 ms  7.880 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.288 ms                GC (median):    0.00%
 Time  (mean ± σ):   4.468 ms ± 797.065 μs   GC (mean ± σ):  0.00% ± 0.00%

    ▁▅█▃▅▆▂▂             ▂                                     
  ▄▄█████████▆▇▆▆▆▅▆▆▇██▇█▆▇▄▄▆▆▅▅▄▄▃▃▄▃▅▅▄▄▅▄▃▄▃▃▃▃▃▃▁▃▁▃▂ ▄
  3.41 ms         Histogram: frequency by time         6.6 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
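For reference, here is the per-column dot variant sketched in the comments above (assuming the same d, A, and B as before); it also allocates nothing beyond the pre-allocated output, though it may not match the @avx loop in speed.

using LinearAlgebra

function diag_matmul_dot!(d, A, B)
    @assert size(A) == size(B) "A and B should have same size"
    for j in axes(A, 2)
        @views d[j] = dot(A[:, j], B[:, j])   # one level-1 dot product per column
    end
    Diagonal(d)
end

d2 = similar(d)
diag_matmul_dot!(d2, A, B) ≈ diag_matmul!(d, A, B)   # agrees with the @avx version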

Memory hierarchy and level-3 fraction

The key to high performance is effective use of the memory hierarchy. This is true on all architectures.

  • Flop count is not the sole determinant of algorithm efficiency. Another important factor is data movement through the memory hierarchy.

  • Numbers everyone should know
| Operation | Time |
|-----------|------|
| L1 cache reference | 0.5 ns |
| L2 cache reference | 7 ns |
| Main memory reference | 100 ns |
| Read 1 MB sequentially from memory | 250,000 ns |
| Read 1 MB sequentially from SSD | 1,000,000 ns |
| Read 1 MB sequentially from disk | 20,000,000 ns |

Source: https://gist.github.com/jboner/2841832

  • For example, the Xeon X5650 CPU has a theoretical throughput of 128 DP GFLOPS but a maximum memory bandwidth of 32 GB/s.

  • Can we keep CPU cores busy with enough deliveries of matrix data and ship the results to memory fast enough to avoid backlog?
    Answer: use high-level BLAS as much as possible.

| BLAS | Dimension | Mem. Refs. | Flops | Ratio |
|------|-----------|------------|-------|-------|
| Level 1: $\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}$ | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $3n$ | $2n$ | 3:2 |
| Level 2: $\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}$ | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, $\mathbf{A} \in \mathbb{R}^{n \times n}$ | $n^2$ | $2n^2$ | 1:2 |
| Level 3: $\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}$ | $\mathbf{A}, \mathbf{B}, \mathbf{C} \in \mathbb{R}^{n \times n}$ | $4n^2$ | $2n^3$ | 2:n |
  • Higher-level BLAS (level 3 or 2) routines make more effective use of the arithmetic logic units (ALUs) by keeping them busy: the surface-to-volume effect.
    See the Dongarra slides.

  • A distinction between LAPACK and LINPACK (older versions of R used LINPACK) is that LAPACK makes use of higher-level BLAS as much as possible (usually by smart partitioning) to increase the so-called level-3 fraction.

  • To appreciate the effort that goes into an optimized BLAS implementation such as OpenBLAS (evolved from GotoBLAS), see the Quora question, especially the video. The bottom line is:

Get familiar with (good implementations of) BLAS/LAPACK and use them as much as possible.
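As a practical aside, Julia (≥ 1.7) lets us check which BLAS it is linked against and how many threads that BLAS uses; the output will vary by machine.

using LinearAlgebra

BLAS.get_config()        # lists the loaded BLAS/LAPACK libraries (OpenBLAS by default)
BLAS.get_num_threads()   # number of threads the BLAS library uses
# BLAS.set_num_threads(4)  # tune if needed; extra threads mainly benefit level-3 calls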

Effect of data layout

  • Data layout in memory affects algorithmic efficiency too. It is much faster to move contiguous chunks of data than to retrieve/write scattered data.

  • Storage mode: column-major (Fortran, Matlab, R, Julia) vs row-major (C/C++).

  • A cache line is the minimum amount of data that can be transferred between cache and memory.

    • x86 CPUs: 64 bytes
    • ARM CPUs: 32 bytes

  • Accessing a column-major-stored matrix by rows ($ij$ looping) causes lots of cache misses.

  • Take matrix multiplication as an example $$ \mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}, \quad \mathbf{A} \in \mathbb{R}^{m \times p}, \mathbf{B} \in \mathbb{R}^{p \times n}, \mathbf{C} \in \mathbb{R}^{m \times n}. $$ Assume the storage is column-major, such as in Julia. There are 6 variants of the algorithms according to the order in the triple loops.

    • jki or kji looping:
      # inner most loop
        for i = 1:m
            C[i, j] = C[i, j] + A[i, k] * B[k, j]
        end
      
    • ikj or kij looping:
      # inner most loop        
        for j = 1:n
            C[i, j] = C[i, j] + A[i, k] * B[k, j]
        end
      
    • ijk or jik looping:
      # inner most loop        
        for k = 1:p
            C[i, j] = C[i, j] + A[i, k] * B[k, j]
        end
      
  • We pay attention to the innermost loop, where the vector calculation occurs. The associated stride when accessing the three matrices in memory (assuming column-major storage) is
| Variant | A Stride | B Stride | C Stride |
|---------|----------|----------|----------|
| $jki$ or $kji$ | Unit | 0 | Unit |
| $ikj$ or $kij$ | 0 | Non-Unit | Non-Unit |
| $ijk$ or $jik$ | Non-Unit | Unit | 0 |

Clearly, the variants $jki$ and $kji$ are preferred.
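Before the triple-loop benchmarks, here is a simpler, self-contained illustration of the same unit-stride principle (a hypothetical column-sum versus row-sum example, not part of the benchmarks below).

using BenchmarkTools, Random

Random.seed!(123)
M = rand(2000, 2000)

function sum_by_columns(M)    # innermost index i runs down a column: unit stride
    s = zero(eltype(M))
    @inbounds for j in axes(M, 2), i in axes(M, 1)
        s += M[i, j]
    end
    s
end

function sum_by_rows(M)       # innermost index j jumps across columns: stride = 2000
    s = zero(eltype(M))
    @inbounds for i in axes(M, 1), j in axes(M, 2)
        s += M[i, j]
    end
    s
end

@btime sum_by_columns($M)
@btime sum_by_rows($M)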

In [13]:
"""
    matmul_by_loop!(A, B, C, order)

Overwrite `C` by `A * B`. `order` indicates the looping order for triple loop.
"""
function matmul_by_loop!(A::Matrix, B::Matrix, C::Matrix, order::String)
    
    m = size(A, 1)
    p = size(A, 2)
    n = size(B, 2)
    fill!(C, 0)
    
    if order == "jki"
        @inbounds for j = 1:n, k = 1:p, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "kji"
        @inbounds for k = 1:p, j = 1:n, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "ikj"
        @inbounds for i = 1:m, k = 1:p, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "kij"
        @inbounds for k = 1:p, i = 1:m, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "ijk"
        @inbounds for i = 1:m, j = 1:n, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "jik"
        @inbounds for j = 1:n, i = 1:m, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
end

using Random

Random.seed!(123)
m, p, n = 2000, 100, 2000
A = rand(m, p)
B = rand(p, n)
C = zeros(m, n);
  • $jki$ and $kji$ looping:
In [14]:
using BenchmarkTools

@benchmark matmul_by_loop!($A, $B, $C, "jki")
Out[14]:
BenchmarkTools.Trial: 68 samples with 1 evaluation.
 Range (minmax):  69.088 ms84.099 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     73.383 ms               GC (median):    0.00%
 Time  (mean ± σ):   74.120 ms ±  3.892 ms   GC (mean ± σ):  0.00% ± 0.00%

   ▅▂█▂      ▂▂▅         ▂   ▂    ▂                            
  ▅██████▁▁▅▅███▁▅▅▅▅▅▅█▅█▁█▅▅▁▅█▅▅█▅▅▁▅▁▁▅▁▁▁▅▁▁▁▅▁▁▁▁▁▅▁▅ ▁
  69.1 ms         Histogram: frequency by time          84 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
In [15]:
@benchmark matmul_by_loop!($A, $B, $C, "kji")
Out[15]:
BenchmarkTools.Trial: 12 samples with 1 evaluation.
 Range (minmax):  439.908 ms466.728 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     450.413 ms                GC (median):    0.00%
 Time  (mean ± σ):   453.000 ms ±   9.185 ms   GC (mean ± σ):  0.00% ± 0.00%

  █ █          █     █  █     █  █                  ██ █  
  █▁█▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁██▁▁▁▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁█ ▁
  440 ms           Histogram: frequency by time          467 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
  • $ikj$ and $kij$ looping:
In [16]:
@benchmark matmul_by_loop!($A, $B, $C, "ikj")
Out[16]:
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (minmax):  2.553 s  2.570 s   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.562 s               GC (median):    0.00%
 Time  (mean ± σ):   2.562 s ± 12.348 ms   GC (mean ± σ):  0.00% ± 0.00%

                              ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.55 s         Histogram: frequency by time        2.57 s <

 Memory estimate: 0 bytes, allocs estimate: 0.
In [17]:
@benchmark matmul_by_loop!($A, $B, $C, "kij")
Out[17]:
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (minmax):  2.785 s  2.823 s   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.804 s               GC (median):    0.00%
 Time  (mean ± σ):   2.804 s ± 27.003 ms   GC (mean ± σ):  0.00% ± 0.00%

                               ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.79 s         Histogram: frequency by time        2.82 s <

 Memory estimate: 0 bytes, allocs estimate: 0.
  • $ijk$ and $jik$ looping:
In [18]:
@benchmark matmul_by_loop!($A, $B, $C, "ijk")
Out[18]:
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (minmax):  502.145 ms564.525 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     524.902 ms                GC (median):    0.00%
 Time  (mean ± σ):   527.456 ms ±  21.776 ms   GC (mean ± σ):  0.00% ± 0.00%

  ██     █ █ █                    █      ██   █               █  
  ██▁▁▁▁▁█▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁██▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  502 ms           Histogram: frequency by time          565 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
In [19]:
@benchmark matmul_by_loop!($A, $B, $C, "jik")
Out[19]:
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (minmax):  504.969 ms525.714 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     509.849 ms                GC (median):    0.00%
 Time  (mean ± σ):   512.127 ms ±   6.036 ms   GC (mean ± σ):  0.00% ± 0.00%

  ▁       ▁   █▁        ▁                ▁                  ▁  
  █▁▁▁▁▁▁▁█▁▁▁██▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  505 ms           Histogram: frequency by time          526 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
  • Julia wraps a BLAS library for matrix multiplication. We see that the BLAS library wins hands down (multi-threading, Strassen algorithm, higher level-3 fraction by block outer product).
In [20]:
@benchmark mul!($C, $A, $B)
Out[20]:
BenchmarkTools.Trial: 637 samples with 1 evaluation.
 Range (minmax):  7.231 ms 12.005 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.430 ms                GC (median):    0.00%
 Time  (mean ± σ):   7.838 ms ± 847.058 μs   GC (mean ± σ):  0.00% ± 0.00%

  █▇▆▂▂▂▂▃▃ ▂▁▁ ▁ ▁  ▁                                        
  ████████████▁███▆██▇▆▆▅▇▇█▅▅▇▄▅▅▇▄▅▇▅▄▄▆█▄▄▇▁▅▇▄▆▄▁▇▆▄▅▁▅ █
  7.23 ms      Histogram: log(frequency) by time      10.7 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
In [21]:
# direct call of BLAS wrapper function
@benchmark LinearAlgebra.BLAS.gemm!('N', 'N', 1.0, $A, $B, 0.0, $C)
Out[21]:
BenchmarkTools.Trial: 625 samples with 1 evaluation.
 Range (minmax):  7.213 ms 11.916 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.524 ms                GC (median):    0.00%
 Time  (mean ± σ):   7.981 ms ± 969.501 μs   GC (mean ± σ):  0.00% ± 0.00%

  ▁▆█▇▃▃▄▂▁▁▁ ▂▁                                             
  ██████████████▇▆▇▇█▆▅▆▇▅▆▇▅▄▆▆▆▅▅█▄▆█▇▄▁▅▄▄▆▇▅▄▄▄▄▄▆▄▅▄▅▄ █
  7.21 ms      Histogram: log(frequency) by time      11.5 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

Exercise: Annotate the triple loop in matmul_by_loop! with @avx and benchmark again.

BLAS in R

  • Tip for R users: the standard R distribution from CRAN uses a very outdated BLAS/LAPACK library.
In [22]:
using RCall

R"""
library(dplyr)
library(bench)
bench::mark($A %*% $B) %>%
  print(width = Inf)
""";
┌ Warning: RCall.jl: 
│ Attaching package: ‘dplyr’
│ 
│ The following objects are masked from ‘package:stats’:
│ 
│     filter, lag
│ 
│ The following objects are masked from ‘package:base’:
│ 
│     intersect, setdiff, setequal, union
│ 
└ @ RCall /Users/huazhou/.julia/packages/RCall/6kphM/src/io.jl:172
# A tibble: 1 × 13
  expression               min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 `#JL`$A %*% `#JL`$B    257ms    284ms      3.52    30.5MB     3.52     2     2
  total_time result                memory             time          
    <bch:tm> <list>                <list>             <list>        
1      568ms <dbl [2,000 × 2,000]> <Rprofmem [2 × 3]> <bench_tm [2]>
  gc              
  <list>          
1 <tibble [2 × 3]>
┌ Warning: RCall.jl: Warning: Some expressions had a GC in every iteration; so filtering is disabled.
└ @ RCall /Users/huazhou/.julia/packages/RCall/6kphM/src/io.jl:172
  • Re-building R from source with OpenBLAS or MKL will immediately boost linear algebra performance in R. Google "build R using MKL" to get started. Similarly, we can build Julia with MKL.

  • Matlab uses MKL. Usually it's very hard to beat Matlab in terms of linear algebra.
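One quick way to check which BLAS/LAPACK an R installation is linked against is sessionInfo(), which (in R ≥ 3.4) reports the library paths; shown here through RCall as elsewhere in these notes.

using RCall

R"""
sessionInfo()   # the BLAS: and LAPACK: lines show which libraries R is using
"""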

Avoid memory allocation: some examples

  1. Transposing a matrix is an expensive memory operation.
    • In R, the command
      t(A) %*% x
      
      will first transpose A and then perform the matrix multiplication, causing unnecessary memory allocation (see the short aside after this list).
    • Julia is smart enough to avoid transposing the matrix when possible.
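A related aside: in R, crossprod(A, x) computes $\mathbf{A}^T \mathbf{x}$ without explicitly forming the transpose, avoiding the extra allocation (hypothetical A and x below, created inside the R session).

using RCall

R"""
A <- matrix(rnorm(500 * 500), 500, 500)
x <- rnorm(500)
all.equal(drop(t(A) %*% x), drop(crossprod(A, x)))
"""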
In [23]:
using Random, LinearAlgebra, BenchmarkTools
Random.seed!(123)

n = 1000
A = rand(n, n)
x = rand(n);
In [24]:
typeof(transpose(A))
Out[24]:
Transpose{Float64, Matrix{Float64}}
In [25]:
fieldnames(typeof(transpose(A)))
Out[25]:
(:parent,)
In [26]:
# same data in transpose(A) and the original matrix A
pointer(transpose(A).parent), pointer(A)
Out[26]:
(Ptr{Float64} @0x00007fa5c7000000, Ptr{Float64} @0x00007fa5c7000000)
In [27]:
# dispatch to BLAS
# does *not* actually transpose the matrix
@benchmark transpose($A) * $x
Out[27]:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (minmax):   91.285 μs374.502 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     122.037 μs                GC (median):    0.00%
 Time  (mean ± σ):   138.212 μs ±  33.536 μs   GC (mean ± σ):  0.00% ± 0.00%

           ▁█▄                                                  
  ▁▁▁▁▁▁▂▃▅████▇▄▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁ ▂
  91.3 μs          Histogram: frequency by time          231 μs <

 Memory estimate: 7.94 KiB, allocs estimate: 1.
In [28]:
# pre-allocate result
out = zeros(size(A, 2))
@benchmark mul!($out, transpose($A), $x)
Out[28]:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (minmax):   93.211 μs915.734 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     146.899 μs                GC (median):    0.00%
 Time  (mean ± σ):   187.108 μs ±  95.276 μs   GC (mean ± σ):  0.00% ± 0.00%

    ▆█▇▅▄▃▄▄▃▄▄▃▃▃▄▃▃▂▂▂▁▁▁▁▁▂▁▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁              ▂
  ▆▇█████████████████████████████████████████████████▇▇▇▇▆▆▇▅ █
  93.2 μs       Histogram: log(frequency) by time        490 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
In [29]:
# or call BLAS wrapper directly
@benchmark LinearAlgebra.BLAS.gemv!('T', 1.0, $A, $x, 0.0, $out)
Out[29]:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (minmax):   91.027 μs417.073 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     115.689 μs                GC (median):    0.00%
 Time  (mean ± σ):   130.632 μs ±  31.795 μs   GC (mean ± σ):  0.00% ± 0.00%

        ▃▅██▄▄▄▂▁ ▁▁▁▁▂▂▂▁▁▁▂▂▂▂▁▁▁▁▁▁▁▁▁▁                    ▂
  ▄▄▆▅▆████████████████████████████████████████▇▇█▇▇▇▇▆▆▆▆▆▄▆ █
  91 μs         Histogram: log(frequency) by time        246 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
  2. Broadcasting in Julia achieves vectorized code without creating intermediate arrays.

    Suppose we want to calculate the elementwise maximum of the absolute values of two large arrays. In R or Matlab, the command

    max(abs(X), abs(Y))
    

    will create two intermediate arrays and then one result array.

In [30]:
using RCall
Random.seed!(123)
X, Y = rand(1000, 1000), rand(1000, 1000)

R"""
library(dplyr)
library(bench)
bench::mark(max(abs($X), abs($Y))) %>%
  print(width = Inf)
""";
# A tibble: 1 × 13
  expression                           min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 max(abs(`#JL`$X), abs(`#JL`$Y))   6.23ms   6.81ms      134.    15.3MB     69.0
  n_itr  n_gc total_time result    memory             time           
  <int> <dbl>   <bch:tm> <list>    <list>             <list>         
1    35    18      261ms <dbl [1]> <Rprofmem [2 × 3]> <bench_tm [53]>
  gc               
  <list>           
1 <tibble [53 × 3]>

In Julia, dot operations are fused so no intermediate arrays are created.

In [31]:
# no intermediate arrays created, only result array created
@benchmark max.(abs.($X), abs.($Y))
Out[31]:
BenchmarkTools.Trial: 2266 samples with 1 evaluation.
 Range (minmax):  1.266 ms8.482 ms   GC (min … max):  0.00% … 74.34%
 Time  (median):     1.810 ms              GC (median):     0.00%
 Time  (mean ± σ):   2.200 ms ± 1.435 ms   GC (mean ± σ):  20.51% ± 21.67%

  ▃▃▂▇▇▅▂▁                                           ▁▁▁   ▁
  ████████▆▃▁▁▁▄▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▇█▇▆▆▆████▇█ █
  1.27 ms     Histogram: log(frequency) by time     7.36 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.

Pre-allocating the result array gets rid of memory allocation altogether.

In [32]:
# no memory allocation at all!
Z = zeros(size(X)) # zero matrix of same size as X
@benchmark $Z .= max.(abs.($X), abs.($Y)) # .= (vs =) is important!
Out[32]:
BenchmarkTools.Trial: 3528 samples with 1 evaluation.
 Range (minmax):  1.158 ms  3.392 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.342 ms                GC (median):    0.00%
 Time  (mean ± σ):   1.405 ms ± 230.774 μs   GC (mean ± σ):  0.00% ± 0.00%

    █▅▃▄▃▂ ▁                                                   
  ▂████████████▆▆▄▅▄▄▃▃▃▃▃▂▃▂▂▃▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  1.16 ms         Histogram: frequency by time        2.26 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
  3. Views (view or @view) avoid creating an extra copy of the matrix data.
In [33]:
Random.seed!(123)
A = randn(1000, 1000)

# sum entries in a sub-matrix
@benchmark sum($A[1:2:500, 1:2:500])
Out[33]:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (minmax):   67.632 μs  8.946 ms   GC (min … max):  0.00% … 94.81%
 Time  (median):     307.151 μs                GC (median):     0.00%
 Time  (mean ± σ):   347.393 μs ± 500.222 μs   GC (mean ± σ):  10.34% ±  7.02%

                     █▆▄▄▂▁▂▁▁                                 
  ▅▄▄▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▆██████████▇▆▆▅▄▄▄▃▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁ ▃
  67.6 μs          Histogram: frequency by time          648 μs <

 Memory estimate: 488.36 KiB, allocs estimate: 2.
In [34]:
# view avoids creating a separate sub-matrix
@benchmark sum(@view $A[1:2:500, 1:2:500])
Out[34]:
BenchmarkTools.Trial: 7035 samples with 1 evaluation.
 Range (minmax):  628.360 μs 1.365 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     683.832 μs               GC (median):    0.00%
 Time  (mean ± σ):   707.902 μs ± 66.509 μs   GC (mean ± σ):  0.00% ± 0.00%

  ▄    ▆▁ █▁▁▇▃▂▁▂▁▁▁ ▁▁                                     ▂
  █▁▅▁▅██▆███████████████████▇▇██▇█▇█▇▇▆▆█▇▇▇▇▆▇▇▇▆▇▆▆▆▅▆▅▆▆ █
  628 μs        Histogram: log(frequency) by time       998 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

The @views macro can also be useful in some operations.

In [35]:
@benchmark @views sum($A[1:2:500, 1:2:500])
Out[35]:
BenchmarkTools.Trial: 7070 samples with 1 evaluation.
 Range (minmax):  628.356 μs 1.556 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     682.459 μs               GC (median):    0.00%
 Time  (mean ± σ):   704.399 μs ± 60.192 μs   GC (mean ± σ):  0.00% ± 0.00%

  ▃     ▅  ▁ ▆▂▂▂▁▂▁▁▁▁  ▁                                   ▁
  █▁▄▃▃▄█▄▇███████████████████▇█▇▇█▇▇▇▇▇▇▆▆▆▇▇▇▇▇▆▇▅▆▄▅▅▄▄▅▄ █
  628 μs        Histogram: log(frequency) by time       975 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.