A list of "easy" linear systems

Consider $\mathbf{A} \mathbf{x} = \mathbf{b}$, $\mathbf{A} \in \mathbb{R}^{n \times n}$, or the matrix inverse if you really need it. $\mathbf{A}$ can be huge. Keep massive data in mind: the 1000 Genomes Project, Netflix, Google PageRank, finance, spatial statistics, ... We should be alert to the many "easy" linear systems listed below.

Don't blindly use `A \ b` and `inv` in Julia or the `solve` function in R. Don't waste computing resources on a bad choice of algorithm!

Diagonal matrix

Diagonal $\mathbf{A}$: $n$ flops (one division per entry). Use the `Diagonal` type in Julia.

In [1]:
using BenchmarkTools

srand(280)
n = 1000
A = diagm(randn(n)) # a full matrix
b = randn(n)

@which A \ b
Out[1]:
\(A::AbstractArray{T,2} where T, B::Union{AbstractArray{T,1}, AbstractArray{T,2}} where T) at linalg/generic.jl:820
In [2]:
# check `istril(A)` and `istriu(A)`, then call `Diagonal(A) \ b`
@benchmark A \ b
Out[2]:
BenchmarkTools.Trial: 
  memory estimate:  15.95 KiB
  allocs estimate:  5
  --------------
  minimum time:     729.697 μs (0.00% GC)
  median time:      743.364 μs (0.00% GC)
  mean time:        778.562 μs (0.06% GC)
  maximum time:     2.414 ms (62.58% GC)
  --------------
  samples:          6392
  evals/sample:     1
In [3]:
# O(n) computation, no extra array allocation
@benchmark Diagonal(A) \ b
Out[3]:
BenchmarkTools.Trial: 
  memory estimate:  15.95 KiB
  allocs estimate:  5
  --------------
  minimum time:     3.791 μs (0.00% GC)
  median time:      4.701 μs (0.00% GC)
  mean time:        5.935 μs (14.60% GC)
  maximum time:     267.824 μs (95.83% GC)
  --------------
  samples:          10000
  evals/sample:     8

Bidiagonal, tridiagonal, and banded matrices

Bidiagonal, tridiagonal, or banded $\mathbf{A}$: Band LU, band Cholesky, ... roughly $O(n)$ flops.

In [4]:
srand(280) 

n  = 1000
dv = randn(n)
ev = randn(n - 1)
b  = randn(n)
# symmetric tridiagonal matrix
A  = SymTridiagonal(dv, ev)
Out[4]:
1000×1000 SymTridiagonal{Float64}:
 0.126238   0.618244    ⋅        …    ⋅          ⋅          ⋅      
 0.618244  -2.34688    1.10206        ⋅          ⋅          ⋅      
  ⋅         1.10206    1.91661        ⋅          ⋅          ⋅      
  ⋅          ⋅        -0.447244       ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅        …    ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅        …    ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
 ⋮                               ⋱                                 
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅        …    ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅        …    ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅            0.709662    ⋅          ⋅      
  ⋅          ⋅          ⋅            0.542815  -0.363838    ⋅      
  ⋅          ⋅          ⋅           -0.363838  -1.04034    0.321148
  ⋅          ⋅          ⋅             ⋅         0.321148  -0.276537
In [5]:
# convert to a full matrix
Afull = full(A)

# LU decomposition (2/3) n^3 flops!
@benchmark Afull \ b
Out[5]:
BenchmarkTools.Trial: 
  memory estimate:  7.65 MiB
  allocs estimate:  8
  --------------
  minimum time:     12.552 ms (0.00% GC)
  median time:      13.446 ms (0.00% GC)
  mean time:        13.404 ms (4.16% GC)
  maximum time:     16.647 ms (0.00% GC)
  --------------
  samples:          373
  evals/sample:     1
In [6]:
# specialized algorithm for tridiagonal matrix, O(n) flops
@benchmark A \ b
Out[6]:
BenchmarkTools.Trial: 
  memory estimate:  23.89 KiB
  allocs estimate:  6
  --------------
  minimum time:     12.511 μs (0.00% GC)
  median time:      14.714 μs (0.00% GC)
  mean time:        17.326 μs (8.28% GC)
  maximum time:     2.081 ms (97.27% GC)
  --------------
  samples:          10000
  evals/sample:     1

Triangular matrix

Triangular $\mathbf{A}$: $n^2$ flops to solve the linear system by forward or backward substitution.

In [7]:
srand(280)

n = 1000
A = tril(randn(n, n))
b = randn(n)

# check istril() then triangular solve
@benchmark A \ b
Out[7]:
BenchmarkTools.Trial: 
  memory estimate:  7.95 KiB
  allocs estimate:  2
  --------------
  minimum time:     762.721 μs (0.00% GC)
  median time:      946.558 μs (0.00% GC)
  mean time:        1.051 ms (0.02% GC)
  maximum time:     7.309 ms (0.00% GC)
  --------------
  samples:          4721
  evals/sample:     1
In [8]:
# triangular solve directly; save the cost of istril()
@benchmark LowerTriangular(A) \ b
Out[8]:
BenchmarkTools.Trial: 
  memory estimate:  7.97 KiB
  allocs estimate:  3
  --------------
  minimum time:     320.558 μs (0.00% GC)
  median time:      380.211 μs (0.00% GC)
  mean time:        430.062 μs (0.07% GC)
  maximum time:     4.834 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Block diagonal matrix

Block diagonal: suppose $n = \sum_i n_i$. Factorizing the whole matrix costs $O\left( (\sum_i n_i)^3 \right)$ flops, whereas factorizing block by block costs only $O\left( \sum_i n_i^3 \right)$ flops; see the blockwise solve sketch after the output below.

Julia has a `blkdiag` function that generates a sparse matrix. Anyone interested in writing a BlockDiagonal.jl package?

In [9]:
srand(280)

B  = 10 # number of blocks
ni = 100
A  = blkdiag([sprandn(ni, ni, 0.01) for b in 1:B]...)
Out[9]:
1000×1000 SparseMatrixCSC{Float64,Int64} with 969 stored entries:
  [31  ,    1]  =  2.07834
  [53  ,    1]  =  -1.11883
  [58  ,    1]  =  -0.66448
  [14  ,    4]  =  1.11793
  [96  ,    5]  =  1.22813
  [81  ,    8]  =  -0.919643
  [48  ,    9]  =  1.0185
  [49  ,    9]  =  -0.996332
  [15  ,   10]  =  1.30841
  [28  ,   10]  =  -0.818757
  ⋮
  [956 ,  987]  =  -0.900804
  [967 ,  987]  =  -0.438788
  [971 ,  991]  =  0.176756
  [929 ,  992]  =  -1.17384
  [974 ,  993]  =  1.59235
  [967 ,  994]  =  0.542169
  [994 ,  995]  =  0.627832
  [998 ,  997]  =  0.60382
  [935 ,  998]  =  0.342675
  [947 ,  998]  =  0.482228
  [975 , 1000]  =  0.991598
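
To make the flop savings concrete, here is a minimal sketch (not part of the original demo; block sizes are illustrative) that solves a block-diagonal system block by block with dense blocks:

In [ ]:
srand(280)

B, ni = 10, 100                    # number of blocks, block size
# diagonally boosted blocks so each block system is well conditioned
blocks = [randn(ni, ni) + (2 * ni) * eye(ni) for _ in 1:B]
bvec   = randn(B * ni)
x      = similar(bvec)
# factorize each n_i × n_i block separately: sum_i n_i^3, not (sum_i n_i)^3, flops
for i in 1:B
    idx    = ((i - 1) * ni + 1):(i * ni)
    x[idx] = blocks[i] \ bvec[idx]
end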

Kronecker product

Use $$ \begin{eqnarray*} (\mathbf{A} \otimes \mathbf{B})^{-1} &=& \mathbf{A}^{-1} \otimes \mathbf{B}^{-1} \\ (\mathbf{C}^T \otimes \mathbf{A}) \text{vec}(\mathbf{B}) &=& \text{vec}(\mathbf{A} \mathbf{B} \mathbf{C}) \end{eqnarray*} $$ (the first identity requires nonsingular $\mathbf{A}$ and $\mathbf{B}$) to avoid forming, and computing with, the potentially huge Kronecker product $\mathbf{A} \otimes \mathbf{B}$.
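
A quick numerical check of the vec identity (a sketch using only Base functions; the explicit `kron` is formed here solely for comparison):

In [ ]:
srand(280)

m, p, q = 30, 40, 50
A = randn(m, p)
B = randn(p, q)
C = randn(q, m)
# (Cᵀ ⊗ A) vec(B) = vec(A B C); the right-hand side never forms the m² × pq Kronecker product
vecnorm(kron(C', A) * vec(B) - vec(A * B * C))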

Sparse matrix

Sparsity: use sparse matrix decompositions or iterative methods.

  • The most easily recognizable structure. Familiarize yourself with the sparse matrix computation tools in Julia, Matlab, R (Matrix package), MKL (sparse BLAS), ... as much as possible.
In [10]:
srand(280)

n = 1000
# a sparse pd matrix, about 0.5% non-zero entries
A = sprand(n, n, 0.002)
A = A + A' + (2n) * I
b = randn(n)
Afull = full(A)
countnz(A) / length(A) # sparsity
Out[10]:
0.005096
In [11]:
# dense matrix-vector multiplication
@benchmark Afull * b
Out[11]:
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     100.168 μs (0.00% GC)
  median time:      186.690 μs (0.00% GC)
  mean time:        201.362 μs (0.00% GC)
  maximum time:     2.580 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
In [12]:
# sparse matrix-vector multiplication
@benchmark A * b
Out[12]:
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     6.943 μs (0.00% GC)
  median time:      7.599 μs (0.00% GC)
  mean time:        8.241 μs (2.32% GC)
  maximum time:     483.364 μs (95.82% GC)
  --------------
  samples:          10000
  evals/sample:     5
In [13]:
# dense Cholesky decomposition
@benchmark cholfact(Afull)
Out[13]:
BenchmarkTools.Trial: 
  memory estimate:  7.63 MiB
  allocs estimate:  8
  --------------
  minimum time:     9.110 ms (0.00% GC)
  median time:      11.609 ms (0.00% GC)
  mean time:        11.827 ms (8.51% GC)
  maximum time:     68.000 ms (85.06% GC)
  --------------
  samples:          422
  evals/sample:     1
In [14]:
# sparse Cholesky decomposition
@benchmark cholfact(A)
Out[14]:
BenchmarkTools.Trial: 
  memory estimate:  1.33 MiB
  allocs estimate:  53
  --------------
  minimum time:     3.326 ms (0.00% GC)
  median time:      3.434 ms (0.00% GC)
  mean time:        3.631 ms (0.99% GC)
  maximum time:     7.742 ms (0.00% GC)
  --------------
  samples:          1375
  evals/sample:     1
In [15]:
# solve via dense Cholesky
xchol = cholfact(Afull) \ b
@benchmark cholfact(Afull) \ b
Out[15]:
BenchmarkTools.Trial: 
  memory estimate:  7.64 MiB
  allocs estimate:  10
  --------------
  minimum time:     10.132 ms (0.00% GC)
  median time:      11.326 ms (0.00% GC)
  mean time:        11.478 ms (5.48% GC)
  maximum time:     16.560 ms (0.00% GC)
  --------------
  samples:          435
  evals/sample:     1
In [16]:
# solve via sparse Cholesky
xcholsp = cholfact(A) \ b
vecnorm(xchol - xcholsp)
Out[16]:
4.020546956752686e-18
In [17]:
@benchmark cholfact(A) \ b
Out[17]:
BenchmarkTools.Trial: 
  memory estimate:  1.36 MiB
  allocs estimate:  64
  --------------
  minimum time:     3.825 ms (0.00% GC)
  median time:      3.928 ms (0.00% GC)
  mean time:        4.094 ms (0.92% GC)
  maximum time:     6.442 ms (0.00% GC)
  --------------
  samples:          1220
  evals/sample:     1
In [18]:
# sparse solve via conjugate gradient
using IterativeSolvers

xcg, = cg(A, b)
vecnorm(xcg - xchol)
Out[18]:
0.022750177100322466
In [19]:
@benchmark cg(A, b)
Out[19]:
BenchmarkTools.Trial: 
  memory estimate:  33.30 KiB
  allocs estimate:  27
  --------------
  minimum time:     31.181 μs (0.00% GC)
  median time:      34.950 μs (0.00% GC)
  mean time:        40.390 μs (8.18% GC)
  maximum time:     2.927 ms (94.92% GC)
  --------------
  samples:          10000
  evals/sample:     1

Easy plus low rank

Easy plus low rank: $\mathbf{U}, \mathbf{V} \in \mathbb{R}^{n \times r}$, $r \ll n$. Woodbury formula $$ (\mathbf{A} + \mathbf{U} \mathbf{V}^T)^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{U} (\mathbf{I}_r + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U})^{-1} \mathbf{V}^T \mathbf{A}^{-1}, $$ which inverts a small $r \times r$ matrix instead of an $n \times n$ one.

In [20]:
using BenchmarkTools, WoodburyMatrices

srand(280)
n = 1000
r = 5

A = Diagonal(rand(n))
B = randn(n, r)
D = Diagonal(rand(r))
b = randn(n)
# W = A + B * D * B'
W = SymWoodbury(A, B, D)
Wfull = Symmetric(full(W))
Out[20]:
1000×1000 Symmetric{Float64,Array{Float64,2}}:
  1.8571     0.513107    0.872146   …   0.764278    -0.241331    0.54921 
  0.513107   4.57505    -0.636972      -1.86465     -1.92237    -1.72569 
  0.872146  -0.636972    4.81387        1.99357      1.99337     3.66327 
 -0.516414  -0.996711   -0.0919924      0.262832     0.612402    0.621834
  0.193686   1.68244    -0.770028      -0.723437    -1.4868     -1.32247 
  1.6567     0.0634435  -0.901968   …  -0.241872    -0.0356772  -0.39826 
  0.553996  -0.274515    2.21265        0.219437     2.20382     2.60902 
  0.402356   1.89288    -1.13032       -0.771441    -1.96862    -1.93483 
 -1.07744   -1.63881     1.78016        0.96551      1.7292      1.91326 
 -2.21617   -2.90695    -2.55971       -0.47867      0.855389   -0.933916
  1.29975    0.779828    4.12459    …   1.87358      0.737112    2.84136 
 -0.80833    1.44882     1.67581       -0.139063    -0.107873    0.818132
 -2.32469   -4.83109    -2.31796       -0.0346402    2.65564     0.591075
  ⋮                                 ⋱                                    
  0.334995   0.0914829   1.60026        0.0937123    1.40804     1.92919 
  0.390722   2.36157    -1.58383       -1.92435     -1.3291     -1.88114 
  1.19218    0.845478    1.9362     …  -0.0890571    1.36046     2.01013 
  0.915526   0.889395   -0.606443      -0.428052    -0.817555   -1.2017  
 -0.778472   2.1444      1.50696        0.00689644  -1.28104    -0.141234
  0.275366  -0.25866     0.416593       0.481534     0.0874531   0.316543
 -0.289749  -1.16146     0.550825       0.698152     0.789054    0.949917
  0.439213  -0.379216    1.0951     …   0.626399     0.624574    0.8568  
 -0.592534  -0.235847   -1.11586       -0.601497    -0.0651787  -0.573737
  0.764278  -1.86465     1.99357        3.21548      0.932773    1.97505 
 -0.241331  -1.92237     1.99337        0.932773     3.43991     2.8539  
  0.54921   -1.72569     3.66327        1.97505      2.8539      4.85234 
In [21]:
# solve via Cholesky
@benchmark cholfact(Wfull) \ b
Out[21]:
BenchmarkTools.Trial: 
  memory estimate:  7.64 MiB
  allocs estimate:  9
  --------------
  minimum time:     8.913 ms (0.00% GC)
  median time:      9.119 ms (0.00% GC)
  mean time:        9.485 ms (6.82% GC)
  maximum time:     12.387 ms (15.54% GC)
  --------------
  samples:          527
  evals/sample:     1
In [22]:
# solve using Woodbury formula
@benchmark W \ b
Out[22]:
BenchmarkTools.Trial: 
  memory estimate:  76.16 KiB
  allocs estimate:  26
  --------------
  minimum time:     19.164 μs (0.00% GC)
  median time:      24.648 μs (0.00% GC)
  mean time:        34.932 μs (24.29% GC)
  maximum time:     4.617 ms (98.44% GC)
  --------------
  samples:          10000
  evals/sample:     1
In [23]:
# multiplication without using Woodbury structure
@benchmark Wfull * b
Out[23]:
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     193.856 μs (0.00% GC)
  median time:      203.398 μs (0.00% GC)
  mean time:        205.489 μs (0.00% GC)
  maximum time:     531.061 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
In [24]:
# multiplication using Woodbury structure
@benchmark W * b
Out[24]:
BenchmarkTools.Trial: 
  memory estimate:  24.06 KiB
  allocs estimate:  5
  --------------
  minimum time:     4.910 μs (0.00% GC)
  median time:      6.397 μs (0.00% GC)
  mean time:        8.787 μs (20.79% GC)
  maximum time:     374.038 μs (96.19% GC)
  --------------
  samples:          10000
  evals/sample:     7

Easy plus border

Easy plus border: For $\mathbf{A}$ pd and $\mathbf{V}$ full row rank, $$ \begin{pmatrix} \mathbf{A} & \mathbf{V}^T \\ \mathbf{V} & \mathbf{0} \end{pmatrix}^{-1} = \begin{pmatrix} \mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{V}^T (\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T)^{-1} \mathbf{V} \mathbf{A}^{-1} & \mathbf{A}^{-1} \mathbf{V}^T (\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T)^{-1} \\ (\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T)^{-1} \mathbf{V} \mathbf{A}^{-1} & - (\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T)^{-1} \end{pmatrix}. $$
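
A quick numerical sanity check of the bordered-inverse formula (a sketch; sizes and names are illustrative):

In [ ]:
srand(280)

n, r = 100, 5
A = randn(n, n); A = A * A' + n * eye(n)   # positive definite
V = randn(r, n)                            # full row rank (almost surely)
M = [A V'; V zeros(r, r)]
S = V * (A \ V')                           # small r × r matrix V A⁻¹ Vᵀ
# compare the top-left block of M⁻¹ with the formula
TL = inv(A) - (A \ V') * (S \ (V / A))
vecnorm(inv(M)[1:n, 1:n] - TL)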

Orthogonal matrix

Orthogonal $\mathbf{A}$: at most $n^2$ flops, since $\mathbf{A}^{-1} = \mathbf{A}^T$ and solving reduces to one matrix-vector multiplication. Permutation matrices, Householder matrices, Jacobi (Givens) rotations, ... take even less.
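
For example, a Householder matrix $\mathbf{H} = \mathbf{I} - 2 \mathbf{v} \mathbf{v}^T / \mathbf{v}^T \mathbf{v}$ is symmetric and orthogonal, so $\mathbf{H}^{-1} = \mathbf{H}$ and a solve costs only $O(n)$ flops. A minimal sketch (the explicit $\mathbf{H}$ is formed only to verify):

In [ ]:
srand(280)

n = 1000
v = randn(n)
b = randn(n)
# solve H x = b: since H⁻¹ = H, simply x = H b, computed in O(n) without forming H
x = b - (2 * dot(v, b) / dot(v, v)) * v
H = eye(n) - (2 / dot(v, v)) * (v * v')   # explicit H, for verification only
vecnorm(H * x - b)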

Toeplitz matrix

Toeplitz systems: $$ \mathbf{T} = \begin{pmatrix} r_0 & r_1 & r_2 & r_3 \\ r_{-1} & r_0 & r_1 & r_2 \\ r_{-2} & r_{-1} & r_0 & r_1 \\ r_{-3} & r_{-2} & r_{-1} & r_0 \end{pmatrix}. $$ $\mathbf{T} \mathbf{x} = \mathbf{b}$, where $\mathbf{T}$ is pd and Toeplitz, can be solved in $O(n^2)$ flops: the Durbin algorithm (Yule-Walker equations), the Levinson algorithm (general $\mathbf{b}$), and the Trench algorithm (inverse). These matrices occur in autoregressive models and econometrics; a sketch of the Durbin algorithm follows.
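
A minimal sketch of the Durbin algorithm (following Golub and Van Loan) for the Yule-Walker system $\mathbf{T} \mathbf{y} = -(r_1, \ldots, r_n)^T$ with $r_0 = 1$; the AR(1) autocorrelations used for the test are an illustrative choice:

In [ ]:
# Durbin algorithm: solve T y = -(r_1,...,r_n)ᵀ in O(n²) flops,
# where T[i,j] = r_|i-j| is symmetric pd Toeplitz and r_0 = 1
function durbin(r::Vector{Float64})
    n = length(r)
    y = zeros(n)
    y[1] = -r[1]
    β, α = 1.0, -r[1]
    for k in 1:n-1
        β = (1 - α^2) * β
        α = -(r[k + 1] + dot(r[k:-1:1], y[1:k])) / β
        y[1:k] = y[1:k] + α * y[k:-1:1]
        y[k + 1] = α
    end
    return y
end

srand(280)
n, ρ = 200, 0.6
r = [ρ^k for k in 1:n]                 # AR(1) autocorrelations: pd Toeplitz
T = [ρ^abs(i - j) for i in 1:n, j in 1:n]
vecnorm(durbin(r) - T \ (-r))          # compare with the O(n³) dense solve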

Circulant matrix

Circulant systems: Toeplitz matrix with wraparound $$ C(\mathbf{z}) = \begin{pmatrix} z_0 & z_4 & z_3 & z_2 & z_1 \\ z_1 & z_0 & z_4 & z_3 & z_2 \\ z_2 & z_1 & z_0 & z_4 & z_3 \\ z_3 & z_2 & z_1 & z_0 & z_4 \\ z_4 & z_3 & z_2 & z_1 & z_0 \end{pmatrix}, $$ solvable by FFT-type algorithms in $O(n \log n)$ flops; related transforms include the DCT (discrete cosine transform) and DST (discrete sine transform).
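
Since the DFT diagonalizes any circulant matrix, $C(\mathbf{z}) \mathbf{x} = \mathbf{b}$ reduces to elementwise division in the frequency domain. A minimal sketch (the diagonal boost on `z` is an illustrative choice to keep the system well conditioned):

In [ ]:
srand(280)

n = 1000
z = randn(n); z[1] += n              # first column of C(z); boost keeps fft(z) away from 0
b = randn(n)
# eigenvalues of C(z) are fft(z), so x = ifft(fft(b) ./ fft(z)) in O(n log n) flops
x = real(ifft(fft(b) ./ fft(z)))
# verify against the explicitly formed circulant matrix
C = [z[mod(i - j, n) + 1] for i in 1:n, j in 1:n]
vecnorm(C * x - b)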

Vandermonde matrix

Vandermonde matrix: arises in interpolation and approximation problems, $$ \mathbf{V}(x_0,\ldots,x_n) = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_0 & x_1 & \cdots & x_n \\ \vdots & \vdots & & \vdots \\ x_0^n & x_1^n & \cdots & x_n^n \end{pmatrix}. $$ $\mathbf{V} \mathbf{x} = \mathbf{b}$ or $\mathbf{V}^T \mathbf{x} = \mathbf{b}$ can be solved in $O(n^2)$ flops by Björck-Pereyra-type algorithms; a sketch for the dual (interpolation) system follows.
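
A sketch of the Björck-Pereyra algorithm for the dual system $\mathbf{V}^T \mathbf{a} = \mathbf{f}$ (polynomial interpolation) via Newton's divided differences, with the points relabeled $x_1, \ldots, x_n$ for 1-based indexing:

In [ ]:
# Björck-Pereyra: solve V(x_1,...,x_n)ᵀ a = f in O(n²) flops,
# where V[i,j] = x[j]^(i-1); a holds the interpolating polynomial's coefficients
function vandermonde_dual(x::Vector{Float64}, f::Vector{Float64})
    n = length(x)
    a = copy(f)
    for k in 1:n-1                     # forward sweep: divided differences
        for i in n:-1:k+1
            a[i] = (a[i] - a[i - 1]) / (x[i] - x[i - k])
        end
    end
    for k in n-1:-1:1                  # backward sweep: Newton form -> monomial coefficients
        for i in k:n-1
            a[i] -= a[i + 1] * x[k]
        end
    end
    return a
end

srand(280)
n = 10                                 # Vandermonde matrices are ill conditioned; keep n small
x = sort(randn(n))
f = randn(n)
V = [x[j]^(i - 1) for i in 1:n, j in 1:n]
vecnorm(vandermonde_dual(x, f) - V' \ f)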

Cauchy-like matrix

Cauchy-like matrices: matrices satisfying the displacement equation $$ \Omega \mathbf{A} - \mathbf{A} \Lambda = \mathbf{R} \mathbf{S}^T, $$ where $\Omega = \text{diag}(\omega_1,\ldots,\omega_n)$ and $\Lambda = \text{diag}(\lambda_1,\ldots, \lambda_n)$. $O(n)$ flops for LU and QR.

Structured-rank matrix

Structured-rank problems: semiseparable matrices (LU and QR take $O(n)$ flops), quasiseparable matrices, ...

In [25]:
versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)