A list of "easy" linear systems

Consider $\mathbf{A} \mathbf{x} = \mathbf{b}$ with $\mathbf{A} \in \mathbb{R}^{n \times n}$, or, if you prefer, computing the matrix inverse. $\mathbf{A}$ can be huge. Keep massive data in mind: the 1000 Genomes Project, Netflix, Google PageRank, finance, spatial statistics, ... We should stay alert to the many kinds of linear systems that are easy to solve.

Don't blindly use `A \ b` and `inv` in Julia, or the `solve` function in R. Don't waste computing resources with a bad choice of algorithm!

Diagonal matrix

Diagonal $\mathbf{A}$: $n$ flops (divide elementwise by the diagonal). Use the `Diagonal` type in Julia.

In [1]:
using BenchmarkTools

srand(280)
n = 1000
A = diagm(randn(n)) # a full matrix
b = randn(n)

@which A \ b
Out[1]:
\(A::AbstractArray{T<:Any,2}, B::Union{AbstractArray{T<:Any,1},AbstractArray{T<:Any,2}}) at linalg/generic.jl:349
In [2]:
# the generic `\` checks `istril(A)` and `istriu(A)`, then dispatches to a triangular solve
@benchmark A \ b
Out[2]:
BenchmarkTools.Trial: 
  memory estimate:  15.97 KiB
  allocs estimate:  5
  --------------
  minimum time:     862.959 μs (0.00% GC)
  median time:      1.123 ms (0.00% GC)
  mean time:        1.500 ms (0.09% GC)
  maximum time:     8.415 ms (28.06% GC)
  --------------
  samples:          3278
  evals/sample:     1
In [3]:
# O(n) solve exploiting the Diagonal structure
@benchmark Diagonal(A) \ b
Out[3]:
BenchmarkTools.Trial: 
  memory estimate:  15.98 KiB
  allocs estimate:  6
  --------------
  minimum time:     11.530 μs (0.00% GC)
  median time:      15.592 μs (0.00% GC)
  mean time:        18.934 μs (7.33% GC)
  maximum time:     2.677 ms (97.20% GC)
  --------------
  samples:          10000
  evals/sample:     1

Bidiagonal, tridiagonal, and banded matrices

Bidiagonal, tridiagonal, or banded $\mathbf{A}$: band LU, band Cholesky, ... roughly $O(n)$ flops for a fixed bandwidth.

In [4]:
srand(280) 

n  = 1000
dv = randn(n)
ev = randn(n - 1)
b  = randn(n)
# symmetric tridiagonal matrix
A  = SymTridiagonal(dv, ev)
Out[4]:
1000×1000 SymTridiagonal{Float64}:
 0.126238   0.618244    ⋅        …    ⋅          ⋅          ⋅      
 0.618244  -2.34688    1.10206        ⋅          ⋅          ⋅      
  ⋅         1.10206    1.91661        ⋅          ⋅          ⋅      
  ⋅          ⋅        -0.447244       ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅        …    ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅        …    ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
 ⋮                               ⋱                                 
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅        …    ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅        …    ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅            0.709662    ⋅          ⋅      
  ⋅          ⋅          ⋅            0.542815  -0.363838    ⋅      
  ⋅          ⋅          ⋅           -0.363838  -1.04034    0.321148
  ⋅          ⋅          ⋅             ⋅         0.321148  -0.276537
In [5]:
# convert to a full matrix
Afull = full(A)

# LU decomposition (2/3) n^3 flops!
@benchmark Afull \ b
Out[5]:
BenchmarkTools.Trial: 
  memory estimate:  7.65 MiB
  allocs estimate:  8
  --------------
  minimum time:     12.699 ms (0.00% GC)
  median time:      12.888 ms (0.00% GC)
  mean time:        13.249 ms (5.58% GC)
  maximum time:     15.763 ms (22.91% GC)
  --------------
  samples:          377
  evals/sample:     1
In [6]:
# specialized algorithm for tridiagonal matrix, O(n) flops
@benchmark A \ b
Out[6]:
BenchmarkTools.Trial: 
  memory estimate:  23.97 KiB
  allocs estimate:  9
  --------------
  minimum time:     15.968 μs (0.00% GC)
  median time:      18.489 μs (0.00% GC)
  mean time:        23.130 μs (9.54% GC)
  maximum time:     4.355 ms (98.65% GC)
  --------------
  samples:          10000
  evals/sample:     1

Triangular matrix

Triangular $\mathbf{A}$: $n^2$ flops via forward or back substitution.

In [7]:
srand(280)

n = 1000
A = tril(randn(n, n))
b = randn(n)

# `\` first checks `istril(A)`, then dispatches to a triangular solve
@benchmark A \ b
Out[7]:
BenchmarkTools.Trial: 
  memory estimate:  7.95 KiB
  allocs estimate:  2
  --------------
  minimum time:     825.379 μs (0.00% GC)
  median time:      1.068 ms (0.00% GC)
  mean time:        1.383 ms (0.00% GC)
  maximum time:     5.308 ms (0.00% GC)
  --------------
  samples:          3547
  evals/sample:     1
In [8]:
# triangular solve directly
@benchmark LowerTriangular(A) \ b
Out[8]:
BenchmarkTools.Trial: 
  memory estimate:  7.97 KiB
  allocs estimate:  3
  --------------
  minimum time:     330.742 μs (0.00% GC)
  median time:      360.519 μs (0.00% GC)
  mean time:        373.948 μs (0.12% GC)
  maximum time:     2.113 ms (79.74% GC)
  --------------
  samples:          10000
  evals/sample:     1

Block diagonal matrix

Block diagonal: Suppose $n = \sum_i n_i$. Factorizing the full matrix costs $O((\sum_i n_i)^3)$ flops, whereas factorizing block by block costs only $O(\sum_i n_i^3)$ flops.

Julia has a blkdiag function that generates a sparse matrix. Anyone interested in writing a BlockDiagonal.jl package? A block-by-block solve is sketched after the output below.

In [9]:
srand(280)

B  = 10 # number of blocks
ni = 100
A  = blkdiag([sprandn(ni, ni, 0.01) for b in 1:B]...)
Out[9]:
1000×1000 sparse matrix with 969 Float64 nonzero entries:
	[31  ,    1]  =  2.07834
	[53  ,    1]  =  -1.11883
	[58  ,    1]  =  -0.66448
	[14  ,    4]  =  1.11793
	[96  ,    5]  =  1.22813
	[81  ,    8]  =  -0.919643
	[48  ,    9]  =  1.0185
	[49  ,    9]  =  -0.996332
	[15  ,   10]  =  1.30841
	[28  ,   10]  =  -0.818757
	⋮
	[956 ,  987]  =  -0.900804
	[967 ,  987]  =  -0.438788
	[971 ,  991]  =  0.176756
	[929 ,  992]  =  -1.17384
	[974 ,  993]  =  1.59235
	[967 ,  994]  =  0.542169
	[994 ,  995]  =  0.627832
	[998 ,  997]  =  0.60382
	[935 ,  998]  =  0.342675
	[947 ,  998]  =  0.482228
	[975 , 1000]  =  0.991598
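To exploit the structure in a solve, work block by block. A minimal sketch, assuming equally sized blocks as above; the diagonal shift `10I` is illustrative and just makes each random block comfortably invertible:

srand(280)

B, ni  = 10, 100
blocks = [sprandn(ni, ni, 0.01) + 10I for i in 1:B]
Abd    = blkdiag(blocks...)
b      = randn(B * ni)

x = similar(b)
for i in 1:B
    idx = ((i - 1) * ni + 1):(i * ni)
    # each diagonal block is factorized and solved independently:
    # O(ni^3) flops per block instead of O(n^3) for the whole matrix
    x[idx] = full(blocks[i]) \ b[idx]
end
vecnorm(Abd * x - b) # should be tiny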

Kronecker product

$$ \begin{eqnarray*} (\mathbf{A} \otimes \mathbf{B})^{-1} &=& \mathbf{A}^{-1} \otimes \mathbf{B}^{-1} \\ (\mathbf{C}^T \otimes \mathbf{A}) \text{vec}(\mathbf{B}) &=& \text{vec}(\mathbf{A} \mathbf{B} \mathbf{C}). \end{eqnarray*} $$
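These identities let us avoid forming the Kronecker product altogether: solving $(\mathbf{C}^T \otimes \mathbf{A}) \text{vec}(\mathbf{X}) = \text{vec}(\mathbf{B})$ is equivalent to solving $\mathbf{A} \mathbf{X} \mathbf{C} = \mathbf{B}$ for $\mathbf{X}$, which needs only two $n \times n$ solves instead of one $n^2 \times n^2$ solve. A minimal sketch (the sizes and random data are illustrative):

srand(280)
n = 50
A, B, C = randn(n, n), randn(n, n), randn(n, n)

# naive: form the n² × n² Kronecker product and solve -- O(n^6) flops
x1 = kron(C', A) \ vec(B)
# structured: X = A⁻¹ B C⁻¹ via two n × n solves -- O(n^3) flops
x2 = vec(A \ (B / C))
vecnorm(x1 - x2) # should be near machine precision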

Sparse matrix

Sparsity: use a sparse matrix decomposition or an iterative method.

  • Sparsity is the easiest structure to recognize. Familiarize yourself with the sparse matrix computation tools in Julia, Matlab, R (the Matrix package), MKL (sparse BLAS), ... as much as possible.
In [10]:
srand(280)

n = 1000
# a sparse pd matrix, about 0.5% non-zero entries
A = sprand(n, n, 0.002)
A = A + A' + (2n) * I
b = randn(n)
Afull = full(A)
countnz(A) / length(A)
Out[10]:
0.005096
In [11]:
# dense matrix-vector multiplication
@benchmark Afull * b
Out[11]:
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     104.680 μs (0.00% GC)
  median time:      113.368 μs (0.00% GC)
  mean time:        121.991 μs (0.00% GC)
  maximum time:     883.279 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
In [12]:
# sparse matrix-vector multiplication
@benchmark A * b
Out[12]:
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     7.386 μs (0.00% GC)
  median time:      8.020 μs (0.00% GC)
  mean time:        9.298 μs (5.49% GC)
  maximum time:     704.186 μs (97.68% GC)
  --------------
  samples:          10000
  evals/sample:     4
In [13]:
# dense Cholesky decomposition
@benchmark cholfact(Afull)
Out[13]:
BenchmarkTools.Trial: 
  memory estimate:  7.63 MiB
  allocs estimate:  8
  --------------
  minimum time:     9.102 ms (0.00% GC)
  median time:      9.297 ms (0.00% GC)
  mean time:        9.668 ms (7.72% GC)
  maximum time:     15.955 ms (0.00% GC)
  --------------
  samples:          515
  evals/sample:     1
In [14]:
# sparse Cholesky decomposition
@benchmark cholfact(A)
Out[14]:
BenchmarkTools.Trial: 
  memory estimate:  1.33 MiB
  allocs estimate:  53
  --------------
  minimum time:     3.328 ms (0.00% GC)
  median time:      3.348 ms (0.00% GC)
  mean time:        3.435 ms (0.68% GC)
  maximum time:     5.019 ms (8.21% GC)
  --------------
  samples:          1450
  evals/sample:     1
In [15]:
# solve via dense Cholesky
xchol = cholfact(Afull) \ b
@benchmark cholfact(Afull) \ b
Out[15]:
BenchmarkTools.Trial: 
  memory estimate:  7.64 MiB
  allocs estimate:  10
  --------------
  minimum time:     9.729 ms (0.00% GC)
  median time:      11.628 ms (0.00% GC)
  mean time:        12.500 ms (6.75% GC)
  maximum time:     28.351 ms (15.33% GC)
  --------------
  samples:          399
  evals/sample:     1
In [16]:
# solve via sparse Cholesky
xcholsp = cholfact(A) \ b
vecnorm(xchol - xcholsp)
Out[16]:
4.020546956752686e-18
In [17]:
@benchmark cholfact(A) \ b
Out[17]:
BenchmarkTools.Trial: 
  memory estimate:  1.36 MiB
  allocs estimate:  65
  --------------
  minimum time:     3.818 ms (0.00% GC)
  median time:      4.016 ms (0.00% GC)
  mean time:        4.354 ms (0.65% GC)
  maximum time:     7.580 ms (14.44% GC)
  --------------
  samples:          1140
  evals/sample:     1
In [18]:
# sparse solve via conjugate gradient
using IterativeSolvers

xcg, = cg(A, b)
vecnorm(xcg - xchol)
Out[18]:
4.416452077969768e-18
In [19]:
@benchmark cg(A, b)
Out[19]:
BenchmarkTools.Trial: 
  memory estimate:  262.66 KiB
  allocs estimate:  44
  --------------
  minimum time:     88.647 μs (0.00% GC)
  median time:      108.999 μs (0.00% GC)
  mean time:        153.431 μs (17.26% GC)
  maximum time:     4.921 ms (82.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Easy plus low rank

Easy plus low rank: $\mathbf{U}, \mathbf{V} \in \mathbb{R}^{n \times r}$, $r \ll n$. Woodbury formula $$ (\mathbf{A} + \mathbf{U} \mathbf{V}^T)^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{U} (\mathbf{I}_r + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U})^{-1} \mathbf{V}^T \mathbf{A}^{-1}, $$ so a solve needs only easy solves with $\mathbf{A}$ plus one small $r \times r$ system.

In [20]:
using BenchmarkTools, WoodburyMatrices

srand(280)
n = 1000
r = 5

A = Diagonal(rand(n))
B = randn(n, r)
D = Diagonal(rand(r))
b = randn(n)
# W = A + B * D * B'
W = SymWoodbury(A, B, D)
Wfull = Symmetric(full(W))
Out[20]:
1000×1000 Symmetric{Float64,Array{Float64,2}}:
  1.8571     0.513107    0.872146   …   0.764278    -0.241331    0.54921 
  0.513107   4.57505    -0.636972      -1.86465     -1.92237    -1.72569 
  0.872146  -0.636972    4.81387        1.99357      1.99337     3.66327 
 -0.516414  -0.996711   -0.0919924      0.262832     0.612402    0.621834
  0.193686   1.68244    -0.770028      -0.723437    -1.4868     -1.32247 
  1.6567     0.0634435  -0.901968   …  -0.241872    -0.0356772  -0.39826 
  0.553996  -0.274515    2.21265        0.219437     2.20382     2.60902 
  0.402356   1.89288    -1.13032       -0.771441    -1.96862    -1.93483 
 -1.07744   -1.63881     1.78016        0.96551      1.7292      1.91326 
 -2.21617   -2.90695    -2.55971       -0.47867      0.855389   -0.933916
  1.29975    0.779828    4.12459    …   1.87358      0.737112    2.84136 
 -0.80833    1.44882     1.67581       -0.139063    -0.107873    0.818132
 -2.32469   -4.83109    -2.31796       -0.0346402    2.65564     0.591075
  ⋮                                 ⋱                                    
  0.334995   0.0914829   1.60026        0.0937123    1.40804     1.92919 
  0.390722   2.36157    -1.58383       -1.92435     -1.3291     -1.88114 
  1.19218    0.845478    1.9362     …  -0.0890571    1.36046     2.01013 
  0.915526   0.889395   -0.606443      -0.428052    -0.817555   -1.2017  
 -0.778472   2.1444      1.50696        0.00689644  -1.28104    -0.141234
  0.275366  -0.25866     0.416593       0.481534     0.0874531   0.316543
 -0.289749  -1.16146     0.550825       0.698152     0.789054    0.949917
  0.439213  -0.379216    1.0951     …   0.626399     0.624574    0.8568  
 -0.592534  -0.235847   -1.11586       -0.601497    -0.0651787  -0.573737
  0.764278  -1.86465     1.99357        3.21548      0.932773    1.97505 
 -0.241331  -1.92237     1.99337        0.932773     3.43991     2.8539  
  0.54921   -1.72569     3.66327        1.97505      2.8539      4.85234 
In [21]:
# solve via Cholesky
@benchmark cholfact(Wfull) \ b
Out[21]:
BenchmarkTools.Trial: 
  memory estimate:  7.64 MiB
  allocs estimate:  9
  --------------
  minimum time:     8.638 ms (0.00% GC)
  median time:      9.277 ms (0.00% GC)
  mean time:        10.303 ms (7.86% GC)
  maximum time:     24.259 ms (17.84% GC)
  --------------
  samples:          484
  evals/sample:     1
In [22]:
# solve using Woodbury formula
@benchmark W \ b
Out[22]:
BenchmarkTools.Trial: 
  memory estimate:  76.19 KiB
  allocs estimate:  27
  --------------
  minimum time:     21.532 μs (0.00% GC)
  median time:      27.018 μs (0.00% GC)
  mean time:        37.297 μs (16.91% GC)
  maximum time:     3.759 ms (98.48% GC)
  --------------
  samples:          10000
  evals/sample:     1
In [23]:
# multiplication without using Woodbury structure
@benchmark Wfull * b
Out[23]:
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     178.935 μs (0.00% GC)
  median time:      197.530 μs (0.00% GC)
  mean time:        206.288 μs (0.00% GC)
  maximum time:     442.637 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
In [24]:
# multiplication using Woodbury structure
@benchmark W * b
Out[24]:
BenchmarkTools.Trial: 
  memory estimate:  24.13 KiB
  allocs estimate:  7
  --------------
  minimum time:     4.865 μs (0.00% GC)
  median time:      6.401 μs (0.00% GC)
  mean time:        9.800 μs (21.91% GC)
  maximum time:     556.756 μs (97.87% GC)
  --------------
  samples:          10000
  evals/sample:     6

Easy plus border

Easy plus border: For $\mathbf{A}$ pd and $\mathbf{V}$ full row rank, $$ \begin{pmatrix} \mathbf{A} & \mathbf{V}^T \\ \mathbf{V} & \mathbf{0} \end{pmatrix}^{-1} = \begin{pmatrix} \mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{V}^T (\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T)^{-1} \mathbf{V} \mathbf{A}^{-1} & \mathbf{A}^{-1} \mathbf{V}^T (\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T)^{-1} \\ (\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T)^{-1} \mathbf{V} \mathbf{A}^{-1} & - (\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T)^{-1} \end{pmatrix}. $$
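In other words, the bordered system requires only easy solves with $\mathbf{A}$ plus one small system involving the Schur complement $\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T$. A minimal sketch via block elimination, taking a diagonal $\mathbf{A}$ for concreteness (the data are illustrative):

srand(280)
n, m = 1000, 5
A = Diagonal(rand(n) + 1) # an "easy" pd matrix
V = randn(m, n)           # full row rank border
b = randn(n)
c = randn(m)

# block elimination: solves with A plus one m × m system
AinvVt = A \ V'           # cheap since A is diagonal
S = V * AinvVt            # Schur complement V A⁻¹ Vᵀ
y = S \ (V * (A \ b) - c)
x = A \ (b - V' * y)

# compare against a dense solve of the full bordered system
M = [full(A) V'; V zeros(m, m)]
vecnorm(M \ [b; c] - [x; y]) # should be tiny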

Orthogonal matrix

Orthogonal $\mathbf{A}$: at most $O(n^2)$ flops, since $\mathbf{A}^{-1} = \mathbf{A}^T$ and the solve reduces to the matrix-vector multiplication $\mathbf{x} = \mathbf{A}^T \mathbf{b}$. Permutation matrices, Householder matrices, Jacobi rotations, ... take even less.
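A minimal sketch (the orthogonal matrix is extracted from a QR decomposition of a random matrix purely for illustration):

srand(280)
n = 1000
Q = full(qrfact(randn(n, n))[:Q]) # an n × n orthogonal matrix
b = randn(n)

# no factorization needed: x = Qᵀb solves Q * x = b in ~2n² flops
x = Q' * b
vecnorm(Q * x - b) # should be near machine precision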

Toeplitz matrix

Toeplitz systems: $$ \mathbf{T} = \begin{pmatrix} r_0 & r_1 & r_2 & r_3 \\ r_{-1} & r_0 & r_1 & r_2 \\ r_{-2} & r_{-1} & r_0 & r_1 \\ r_{-3} & r_{-2} & r_{-1} & r_0 \end{pmatrix}. $$ $\mathbf{T} \mathbf{x} = \mathbf{b}$, where $\mathbf{T}$ is pd and Toeplitz, can be solved in $O(n^2)$ flops: the Durbin algorithm (Yule-Walker equations), the Levinson algorithm (general $\mathbf{b}$), and the Trench algorithm (inverse). These matrices occur in autoregressive models and econometrics.
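A minimal sketch of the Durbin recursion (Golub and Van Loan, Algorithm 4.7.1), which solves the Yule-Walker system $\mathbf{T} \mathbf{y} = -(r_1, \ldots, r_n)^T$, where $\mathbf{T}$ is the symmetric pd Toeplitz matrix with first row $(1, r_1, \ldots, r_{n-1})$, in $O(n^2)$ flops; the function name and the AR(1)-style test data are illustrative:

# Durbin's algorithm: solve T * y = -r for symmetric pd Toeplitz T
function durbin(r::Vector{Float64})
    n = length(r)
    y = zeros(n)
    y[1] = -r[1]
    β, α = 1.0, -r[1]
    for k in 1:(n - 1)
        β *= 1 - α^2
        α = -(r[k + 1] + dot(r[k:-1:1], y[1:k])) / β
        y[1:k] += α * y[k:-1:1]
        y[k + 1] = α
    end
    y
end

# quick check against a dense solve
r = 0.5 .^ (1:100) # AR(1)-like autocorrelations
T = [i == j ? 1.0 : r[abs(i - j)] for i in 1:100, j in 1:100]
vecnorm(durbin(r) - T \ (-r)) # should be near machine precision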

Circulant matrix

Circulant systems: a Toeplitz matrix with wraparound, $$ C(\mathbf{z}) = \begin{pmatrix} z_0 & z_4 & z_3 & z_2 & z_1 \\ z_1 & z_0 & z_4 & z_3 & z_2 \\ z_2 & z_1 & z_0 & z_4 & z_3 \\ z_3 & z_2 & z_1 & z_0 & z_4 \\ z_4 & z_3 & z_2 & z_1 & z_0 \end{pmatrix}, $$ can be solved in $O(n \log n)$ flops by FFT-type algorithms, including the DCT (discrete cosine transform) and DST (discrete sine transform).
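Because the eigenvectors of a circulant matrix form the discrete Fourier basis, with eigenvalues given by the FFT of the first column, $C(\mathbf{z}) \mathbf{x} = \mathbf{b}$ reduces to two FFTs and an elementwise division. A minimal sketch (the random data are illustrative):

srand(280)
n = 1000
z = randn(n) # first column of the circulant matrix C(z)
b = randn(n)

# O(n log n) solve: divide in the Fourier domain
x = real(ifft(fft(b) ./ fft(z)))

# compare against a dense solve on the explicitly formed circulant matrix
C = [z[mod(i - j, n) + 1] for i in 1:n, j in 1:n]
vecnorm(C * x - b) # should be tiny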

Vandermonde matrix

Vandermonde matrix: arising in interpolation and approximation problems, $$ \mathbf{V}(x_0,\ldots,x_n) = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_0 & x_1 & \cdots & x_n \\ \vdots & \vdots & & \vdots \\ x_0^n & x_1^n & \cdots & x_n^n \end{pmatrix}. $$ $\mathbf{V} \mathbf{x} = \mathbf{b}$ or $\mathbf{V}^T \mathbf{x} = \mathbf{b}$ can be solved in $O(n^2)$ flops, e.g., by the Björck-Pereyra algorithm.
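A minimal sketch of the Björck-Pereyra idea for the dual system $\mathbf{V}^T \mathbf{a} = \mathbf{f}$ (polynomial interpolation): a forward sweep of Newton divided differences followed by a backward sweep converting to monomial coefficients, $O(n^2)$ flops in total. The function name and test data are illustrative:

# solve V' * a = f in O(n^2) flops, with V as defined above
function vandermonde_dual(x::Vector{Float64}, f::Vector{Float64})
    m = length(x)
    a = copy(f)
    # forward sweep: Newton divided differences
    for k in 2:m, i in m:-1:k
        a[i] = (a[i] - a[i - 1]) / (x[i] - x[i - k + 1])
    end
    # backward sweep: Newton form -> monomial coefficients
    for k in (m - 1):-1:1, i in k:(m - 1)
        a[i] -= x[k] * a[i + 1]
    end
    a
end

x = collect(linspace(0, 1, 8))
f = randn(8)
V = [x[j]^(i - 1) for i in 1:8, j in 1:8] # Vandermonde as defined above
vecnorm(vandermonde_dual(x, f) - V' \ f)  # should be tiny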

Cauchy-like matrix

Cauchy-like matrices: $$ \Omega \mathbf{A} - \mathbf{A} \Lambda = \mathbf{R} \mathbf{S}^T, $$ where $\Omega = \text{diag}(\omega_1,\ldots,\omega_n)$ and $\Lambda = \text{diag}(\lambda_1,\ldots, \lambda_n)$. $O(n^2)$ flops for LU and QR.

Structured-rank matrix

Structured-rank problems: semiseparable matrices (LU and QR take $O(n)$ flops), quasiseparable matrices, ...