Iterative Methods for Solving Linear Equations

Introduction

So far we have considered direct methods for solving linear equations.

  • Direct methods (flops fixed a priori) vs iterative methods:
    • Direct method (GE/LU, Cholesky, QR, SVD): good for dense, small or moderate sized $\mathbf{A}$
    • Iterative methods (Jacobi, Gauss-Seidal, SOR, conjugate-gradient, GMRES): good for large, sparse, structured linear system, parallel computing, warm start

PageRank problem (HW3)

  • $\mathbf{A} \in \{0,1\}^{n \times n}$ the connectivity matrix of webpages with entries $$ \begin{eqnarray*} a_{ij} = \begin{cases} 1 & \text{if page $i$ links to page $j$} \\ 0 & \text{otherwise} \end{cases}. \end{eqnarray*} $$ $n \approx 10^9$ in May 2017.

  • $r_i = \sum_j a_{ij}$ is the out-degree of page $i$.

  • Larry Page imagines a random surfer wandering on internet according to following rules:

    • From a page $i$ with $r_i>0$
      • with probability $p$, (s)he randomly chooses a link on page $i$ (uniformly) and follows that link to the next page
      • with probability $1-p$, (s)he randomly chooses one page from the set of all $n$ pages (uniformly) and proceeds to that page
    • From a page $i$ with $r_i=0$ (a dangling page), (s)he randomly chooses one page from the set of all $n$ pages (uniformly) and proceeds to that page

The process defines a Markov chain on the space of $n$ pages. Stationary distribution of this Markov chain gives the ranks (probabilities) of each page.

  • Stationary distribution is the top left eigenvector of the transition matrix $\mathbf{P}$ corresponding to eigenvalue 1. Equivalently it can be cast as a linear equation.

  • Gene Golub: Largest matrix computation in world.

  • GE/LU will take $2 \times (10^9)^3/3/10^{12} \approx 6.66 \times 10^{14}$ seconds $\approx 20$ million years on a tera-flop supercomputer!

  • Iterative methods come to the rescue.

Jacobi method

Solve $\mathbf{A} \mathbf{x} = \mathbf{b}$.

  • Jacobi iteration: $$ x_i^{(t+1)} = \frac{b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(t)} - \sum_{j=i+1}^n a_{ij} x_j^{(t)}}{a_{ii}}. $$

  • With splitting: $\mathbf{A} = \mathbf{L} + \mathbf{D} + \mathbf{U}$, Jacobi iteration can be written as $$ \mathbf{D} \mathbf{x}^{(t+1)} = - (\mathbf{L} + \mathbf{U}) \mathbf{x}^{(t)} + \mathbf{b}, $$ i.e., $$ \mathbf{x}^{(t+1)} = - \mathbf{D}^{-1} (\mathbf{L} + \mathbf{U}) \mathbf{x}^{(t)} + \mathbf{D}^{-1} \mathbf{b} = - \mathbf{D}^{-1} \mathbf{A} \mathbf{x}^{(t)} + \mathbf{x}^{(t)} + \mathbf{D}^{-1} \mathbf{b}. $$

  • One round costs $2n^2$ flops with an unstructured $\mathbf{A}$. Gain over GE/LU if converges in $o(n)$ iterations. Saving is huge for sparse or structured $\mathbf{A}$. By structured, we mean the matrix-vector multiplication $\mathbf{A} \mathbf{v}$ is fast ($O(n)$ or $O(n \log n)$).

Gauss-Seidel method

  • Gauss-Seidel (GS) iteration: $$ x_i^{(t+1)} = \frac{b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(t+1)} - \sum_{j=i+1}^n a_{ij} x_j^{(t)}}{a_{ii}}. $$

  • With splitting, $(\mathbf{D} + \mathbf{L}) \mathbf{x}^{(t+1)} = - \mathbf{U} \mathbf{x}^{(t)} + \mathbf{b}$, i.e., $$ \mathbf{x}^{(t+1)} = - (\mathbf{D} + \mathbf{L})^{-1} \mathbf{U} \mathbf{x}^{(t)} + (\mathbf{D} + \mathbf{L})^{-1} \mathbf{b}. $$

  • GS converges for any $\mathbf{x}^{(0)}$ for symmetric and pd $\mathbf{A}$.

  • Convergence rate of Gauss-Seidel is the spectral radius of the $(\mathbf{D} + \mathbf{L})^{-1} \mathbf{U}$.

Successive over-relaxation (SOR)

  • SOR: $$ x_i^{(t+1)} = \omega \left( b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(t+1)} - \sum_{j=i+1}^n a_{ij} x_j^{(t)} \right) / a_{ii} + (1-\omega) x_i^{(t)}, $$ i.e., $$ \mathbf{x}^{(t+1)} = (\mathbf{D} + \omega \mathbf{L})^{-1} [(1-\omega) \mathbf{D} - \omega \mathbf{U}] \mathbf{x}^{(t)} + (\mathbf{D} + \omega \mathbf{L})^{-1} (\mathbf{D} + \mathbf{L})^{-1} \omega \mathbf{b}. $$

  • Need to pick $\omega \in [0,1]$ beforehand, with the goal of improving convergence rate.

Conjugate gradient method

  • Conjugate gradient and its variants are the top-notch iterative methods for solving huge, structured linear systems.

  • Solving $\mathbf{A} \mathbf{x} = \mathbf{b}$ is equivalent to minimizing the quadratic function $\frac{1}{2} \mathbf{x}^T \mathbf{A} \mathbf{x} - \mathbf{b}^T \mathbf{x}$.

  • To do later, when we study optimization.

Kershaw's results for a fusion problem.

Method Number of Iterations
Gauss Seidel 208,000
Block SOR methods 765
Incomplete Cholesky conjugate gradient 25

Numerical examples

The IterativeSolvers.jl package implements many commonly used iterative solvers.

In [1]:
using BenchmarkTools, IterativeSolvers

srand(280)

n = 4000
# a sparse pd matrix, about 0.5% non-zero entries
A = sprand(n, n, 0.002)
A = A + A' + (2n) * I
@show typeof(A)
b = randn(n)
Afull = full(A)
@show typeof(Afull)
# actual sparsity level
countnz(A) / length(A)
typeof(A) = SparseMatrixCSC{Float64,Int64}
typeof(Afull) = Array{Float64,2}
Out[1]:
0.00425125
In [2]:
# dense matrix-vector multiplication
@benchmark Afull * b
Out[2]:
BenchmarkTools.Trial: 
  memory estimate:  31.33 KiB
  allocs estimate:  2
  --------------
  minimum time:     5.310 ms (0.00% GC)
  median time:      7.455 ms (0.00% GC)
  mean time:        7.456 ms (0.00% GC)
  maximum time:     12.503 ms (0.00% GC)
  --------------
  samples:          664
  evals/sample:     1
In [3]:
# sparse matrix-vector multiplication
@benchmark A * b
Out[3]:
BenchmarkTools.Trial: 
  memory estimate:  31.33 KiB
  allocs estimate:  2
  --------------
  minimum time:     79.649 μs (0.00% GC)
  median time:      85.236 μs (0.00% GC)
  mean time:        96.287 μs (1.80% GC)
  maximum time:     2.318 ms (94.31% GC)
  --------------
  samples:          10000
  evals/sample:     1
In [4]:
# dense solve via Cholesky
xchol = cholfact(Afull) \ b
@benchmark cholfact(Afull) \ b
Out[4]:
BenchmarkTools.Trial: 
  memory estimate:  122.10 MiB
  allocs estimate:  11
  --------------
  minimum time:     286.107 ms (4.42% GC)
  median time:      293.181 ms (4.39% GC)
  mean time:        308.517 ms (5.01% GC)
  maximum time:     377.636 ms (3.77% GC)
  --------------
  samples:          17
  evals/sample:     1
In [5]:
# Jacobi solution is fairly close to Cholesky solution
xjacobi, = jacobi(A, b)
@show vecnorm(xjacobi - xchol)
vecnorm(xjacobi - xchol) = 9.531929722637207e-10
Out[5]:
9.531929722637207e-10
In [6]:
@benchmark jacobi(A, b)
Out[6]:
BenchmarkTools.Trial: 
  memory estimate:  367.49 MiB
  allocs estimate:  32020
  --------------
  minimum time:     1.023 s (8.27% GC)
  median time:      1.084 s (8.39% GC)
  mean time:        1.100 s (9.56% GC)
  maximum time:     1.235 s (10.20% GC)
  --------------
  samples:          5
  evals/sample:     1
In [7]:
# Gauss-Seidel solution is fairly close to Cholesky solution
xgs, = gauss_seidel(A, b)
vecnorm(xgs - xchol)
Out[7]:
2.7213775308664887e-10
In [8]:
@benchmark gauss_seidel(A, b)
Out[8]:
BenchmarkTools.Trial: 
  memory estimate:  122.25 MiB
  allocs estimate:  20
  --------------
  minimum time:     835.558 ms (1.57% GC)
  median time:      887.137 ms (1.49% GC)
  mean time:        900.582 ms (2.45% GC)
  maximum time:     1.012 s (1.70% GC)
  --------------
  samples:          6
  evals/sample:     1
In [9]:
# symmetric SOR with ω=0.75
xsor, = ssor(A, b, 0.75)
vecnorm(xsor - xchol)
Out[9]:
7.468068025961207e-9
In [10]:
@benchmark sor(A, b, 0.75)
Out[10]:
BenchmarkTools.Trial: 
  memory estimate:  122.68 MiB
  allocs estimate:  49
  --------------
  minimum time:     3.998 s (2.44% GC)
  median time:      4.064 s (1.34% GC)
  mean time:        4.064 s (1.34% GC)
  maximum time:     4.129 s (0.28% GC)
  --------------
  samples:          2
  evals/sample:     1
In [11]:
# conjugate gradient
xcg, = cg(A, b)
vecnorm(xcg - xchol)
Out[11]:
7.904644616768145e-17
In [12]:
@benchmark cg(A, b)
Out[12]:
BenchmarkTools.Trial: 
  memory estimate:  846.56 KiB
  allocs estimate:  65
  --------------
  minimum time:     506.442 μs (0.00% GC)
  median time:      553.696 μs (0.00% GC)
  mean time:        683.236 μs (10.74% GC)
  maximum time:     3.829 ms (74.87% GC)
  --------------
  samples:          7246
  evals/sample:     1