versioninfo()
Conjugate gradient (CG) is the premier iterative method for solving large, structured linear systems $\mathbf{A} \mathbf{x} = \mathbf{b}$, where $\mathbf{A}$ is positive definite (pd).
Earlier we discussed Jacobi, Gauss-Seidel, and successive over-relaxation (SOR) as classical iterative solvers. They are rarely used in practice due to their slow convergence. Kershaw's results for a fusion problem illustrate the difference:
Method | Number of Iterations |
---|---|
Gauss-Seidel | 208,000 |
Block SOR methods | 765 |
Incomplete Cholesky conjugate gradient | 25 |
Hestenes and Stiefel (1952), Methods of conjugate gradients for solving linear systems, Journal of Research of the National Bureau of Standards.
A set of vectors $\{\mathbf{p}^{(0)},\ldots,\mathbf{p}^{(l)}\}$ is said to be conjugate with respect to $\mathbf{A}$ if $$ \begin{eqnarray*} \mathbf{p}^{(i)T} \mathbf{A} \mathbf{p}^{(j)} = 0, \quad \text{for all } i \ne j. \end{eqnarray*} $$ For example, the eigenvectors of $\mathbf{A}$ are conjugate to each other. Why?
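A brief answer to the "Why?" (this verification is added here, not in the original text): for a symmetric pd $\mathbf{A}$ we may choose an orthogonal eigenbasis, so for eigenvectors with $\mathbf{A} \mathbf{p}^{(j)} = \lambda_j \mathbf{p}^{(j)}$, $$ \begin{eqnarray*} \mathbf{p}^{(i)T} \mathbf{A} \mathbf{p}^{(j)} = \lambda_j \, \mathbf{p}^{(i)T} \mathbf{p}^{(j)} = 0, \quad i \ne j. \end{eqnarray*} $$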
Conjugate direction method: Given a set of conjugate vectors $\{\mathbf{p}^{(0)},\ldots,\mathbf{p}^{(l)}\}$, at iteration $t$ we search along the conjugate direction $\mathbf{p}^{(t)}$: $$ \begin{eqnarray*} \mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} + \alpha^{(t)} \mathbf{p}^{(t)}, \end{eqnarray*} $$ where $$ \begin{eqnarray*} \alpha^{(t)} = - \frac{\mathbf{r}^{(t)T} \mathbf{p}^{(t)}}{\mathbf{p}^{(t)T} \mathbf{A} \mathbf{p}^{(t)}} \end{eqnarray*} $$ is the optimal step length and $\mathbf{r}^{(t)} = \mathbf{A} \mathbf{x}^{(t)} - \mathbf{b}$ is the residual.
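A one-line derivation of the step length (added for completeness): minimizing $\phi(\mathbf{x}) = \frac 12 \mathbf{x}^T \mathbf{A} \mathbf{x} - \mathbf{b}^T \mathbf{x}$ along the ray $\mathbf{x}^{(t)} + \alpha \mathbf{p}^{(t)}$ and setting the derivative to zero gives $$ \begin{eqnarray*} \frac{d}{d\alpha} \phi(\mathbf{x}^{(t)} + \alpha \mathbf{p}^{(t)}) = \mathbf{p}^{(t)T} \mathbf{r}^{(t)} + \alpha \, \mathbf{p}^{(t)T} \mathbf{A} \mathbf{p}^{(t)} = 0 \quad \Longrightarrow \quad \alpha^{(t)} = - \frac{\mathbf{r}^{(t)T} \mathbf{p}^{(t)}}{\mathbf{p}^{(t)T} \mathbf{A} \mathbf{p}^{(t)}}. \end{eqnarray*} $$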
Theorem: In conjugate direction method, $\mathbf{x}^{(t)}$ converges to the solution in at most $n$ steps.
Intuition: in the coordinate system defined by the conjugate directions, the quadratic objective decouples into $n$ one-dimensional problems, each of which is solved exactly by a single line search.
Conjugate gradient method. Idea: generate $\mathbf{p}^{(t)}$ using only $\mathbf{p}^{(t-1)}$ $$ \begin{eqnarray*} \mathbf{p}^{(t)} = - \mathbf{r}^{(t)} + \beta^{(t)} \mathbf{p}^{(t-1)}, \end{eqnarray*} $$ where $\beta^{(t)}$ is determined by the conjugacy condition $\mathbf{p}^{(t-1)T} \mathbf{A} \mathbf{p}^{(t)} = 0$ $$ \begin{eqnarray*} \beta^{(t)} = \frac{\mathbf{r}^{(t)T} \mathbf{A} \mathbf{p}^{(t-1)}}{\mathbf{p}^{(t-1)T} \mathbf{A} \mathbf{p}^{(t-1)}}. \end{eqnarray*} $$
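The short algebra behind $\beta^{(t)}$ (added for completeness): substituting the update into the conjugacy condition, $$ \begin{eqnarray*} 0 = \mathbf{p}^{(t-1)T} \mathbf{A} \mathbf{p}^{(t)} = - \mathbf{p}^{(t-1)T} \mathbf{A} \mathbf{r}^{(t)} + \beta^{(t)} \mathbf{p}^{(t-1)T} \mathbf{A} \mathbf{p}^{(t-1)} \quad \Longrightarrow \quad \beta^{(t)} = \frac{\mathbf{r}^{(t)T} \mathbf{A} \mathbf{p}^{(t-1)}}{\mathbf{p}^{(t-1)T} \mathbf{A} \mathbf{p}^{(t-1)}}, \end{eqnarray*} $$ using the symmetry of $\mathbf{A}$ in the last step.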
CG algorithm (preliminary version):
Given $\mathbf{x}^{(0)}$, initialize $\mathbf{r}^{(0)} \gets \mathbf{A} \mathbf{x}^{(0)} - \mathbf{b}$, $\mathbf{p}^{(0)} \gets - \mathbf{r}^{(0)}$, $t \gets 0$.
While $\mathbf{r}^{(t)} \ne \mathbf{0}$:
1. $\alpha^{(t)} \gets - \frac{\mathbf{r}^{(t)T} \mathbf{p}^{(t)}}{\mathbf{p}^{(t)T} \mathbf{A} \mathbf{p}^{(t)}}$
2. $\mathbf{x}^{(t+1)} \gets \mathbf{x}^{(t)} + \alpha^{(t)} \mathbf{p}^{(t)}$
3. $\mathbf{r}^{(t+1)} \gets \mathbf{A} \mathbf{x}^{(t+1)} - \mathbf{b}$
4. $\beta^{(t+1)} \gets \frac{\mathbf{r}^{(t+1)T} \mathbf{A} \mathbf{p}^{(t)}}{\mathbf{p}^{(t)T} \mathbf{A} \mathbf{p}^{(t)}}$
5. $\mathbf{p}^{(t+1)} \gets - \mathbf{r}^{(t+1)} + \beta^{(t+1)} \mathbf{p}^{(t)}$
6. $t \gets t + 1$
Remark: The initial conjugate direction $\mathbf{p}^{(0)} \gets - \mathbf{r}^{(0)}$ is crucial.
Theorem: With the CG algorithm, (i) the residuals $\mathbf{r}^{(t)}$ are mutually orthogonal, (ii) the search directions $\mathbf{p}^{(t)}$ are mutually conjugate with respect to $\mathbf{A}$, and (iii) both $\{\mathbf{r}^{(0)},\ldots,\mathbf{r}^{(t)}\}$ and $\{\mathbf{p}^{(0)},\ldots,\mathbf{p}^{(t)}\}$ lie in the Krylov subspace $\text{span}\{\mathbf{r}^{(0)}, \mathbf{A}\mathbf{r}^{(0)}, \ldots, \mathbf{A}^t \mathbf{r}^{(0)}\}$. Hence $\mathbf{x}^{(t)}$ converges to the solution in at most $n$ steps.
CG algorithm (economical version): using the identities $\mathbf{r}^{(t+1)} = \mathbf{r}^{(t)} + \alpha^{(t)} \mathbf{A} \mathbf{p}^{(t)}$, $\alpha^{(t)} = \frac{\mathbf{r}^{(t)T} \mathbf{r}^{(t)}}{\mathbf{p}^{(t)T} \mathbf{A} \mathbf{p}^{(t)}}$, and $\beta^{(t+1)} = \frac{\mathbf{r}^{(t+1)T} \mathbf{r}^{(t+1)}}{\mathbf{r}^{(t)T} \mathbf{r}^{(t)}}$, each iteration reuses the product $\mathbf{A} \mathbf{p}^{(t)}$ and thus saves one matrix-vector multiplication relative to the preliminary version (see the Julia sketch below).
Computation cost per iteration is one matrix vector multiplication: $\mathbf{A} \mathbf{p}^{(t)}$.
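A minimal Julia sketch of this economical iteration (an illustration of the recurrences above, not the implementation used in the examples below, which call the cg solver from IterativeSolvers.jl):

```julia
using LinearAlgebra

# Economical CG sketch for positive definite A: one A * p product per iteration;
# the residual r = A * x - b is updated recursively instead of being recomputed.
function cg_sketch(A, b; x = zero(b), tol = 1e-8, maxiter = length(b))
    r = A * x - b             # initial residual (gradient of the quadratic)
    p = -r                    # initial search direction
    rsold = dot(r, r)
    for t in 1:maxiter
        Ap = A * p            # the only matrix-vector multiplication per iteration
        α = rsold / dot(p, Ap)
        x += α * p            # x^(t+1) = x^(t) + α^(t) p^(t)
        r += α * Ap           # r^(t+1) = r^(t) + α^(t) A p^(t)
        rsnew = dot(r, r)
        sqrt(rsnew) < tol && break
        p = -r + (rsnew / rsold) * p   # β^(t+1) = rsnew / rsold
        rsold = rsnew
    end
    return x
end

# quick sanity check on a small positive definite system
n = 100
G = randn(n, n)
Apd = G' * G + n * I          # positive definite test matrix
bvec = randn(n)
norm(cg_sketch(Apd, bvec) - Apd \ bvec)
```

In exact arithmetic the loop terminates in at most $n$ iterations, in agreement with the theorem above.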
Consider the PageRank problem: $\mathbf{A}$ has dimension $n \approx 10^{10}$ but is highly structured (sparse + low rank), so each matrix-vector multiplication costs only $O(n)$.
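To illustrate such a structured multiplication (a toy sketch with hypothetical dimensions and a generic sparse-plus-rank-one matrix, not the actual PageRank matrix):

```julia
using SparseArrays, LinearAlgebra

# Toy sketch of an O(n) multiplication by a structured matrix A = S + u * v'
# (sparse + rank one); A itself is never formed.
n = 10^5
S = sprandn(n, n, 1e-4)       # sparse part, about 10 nonzeros per row
u, v = randn(n), randn(n)     # rank-one part
x = randn(n)

y = S * x + u * (v' * x)      # A * x in O(nnz(S) + n) flops
```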
Theorem: If $\mathbf{A}$ has $r$ distinct eigenvalues, $\mathbf{x}^{(t)}$ converges to the solution $\mathbf{x}^*$ in at most $r$ steps.
Summary: the conjugate gradient method solves $\mathbf{A} \mathbf{x} = \mathbf{b}$, or equivalently minimizes $\frac 12 \mathbf{x}^T \mathbf{A} \mathbf{x} - \mathbf{b}^T \mathbf{x}$, using one matrix-vector multiplication per iteration and, in exact arithmetic, at most $n$ iterations.
Two important bounds for conjugate gradient algorithm:
Let $\lambda_1 \le \cdots \le \lambda_n$ be the ordered eigenvalues of a pd $\mathbf{A}$.
$$
\begin{eqnarray*}
\|\mathbf{x}^{(t+1)} - \mathbf{x}^*\|_{\mathbf{A}}^2 &\le& \left( \frac{\lambda_{n-t} - \lambda_1}{\lambda_{n-t} + \lambda_1} \right)^2 \|\mathbf{x}^{(0)} - \mathbf{x}^*\|_{\mathbf{A}}^2 \\
\|\mathbf{x}^{(t+1)} - \mathbf{x}^*\|_{\mathbf{A}} &\le& 2 \left( \frac{\sqrt{\kappa(\mathbf{A})}-1}{\sqrt{\kappa(\mathbf{A})}+1} \right)^{t+1} \|\mathbf{x}^{(0)} - \mathbf{x}^*\|_{\mathbf{A}},
\end{eqnarray*}
$$
where $\kappa(\mathbf{A}) = \lambda_n/\lambda_1$ is the condition number of $\mathbf{A}$.
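A quick back-of-the-envelope reading of the second bound (the numbers here are chosen purely for illustration): with $\kappa(\mathbf{A}) = 10^4$ the contraction factor per iteration is $$ \begin{eqnarray*} \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} = \frac{99}{101} \approx 0.98, \end{eqnarray*} $$ so roughly $\log 10 / \log(101/99) \approx 115$ iterations are needed per decimal digit of accuracy, whereas $\kappa = 10^2$ gives a factor of $9/11 \approx 0.82$ and only about 12 iterations per digit. Reducing the condition number, which is what pre-conditioning does, pays off directly.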
Messages: the first bound says that if the eigenvalues of $\mathbf{A}$ occur in $r$ distinct clusters, CG will approximately solve the problem after about $r$ steps; the second bound says that convergence is fast when the condition number $\kappa(\mathbf{A})$ is close to 1, which motivates pre-conditioning.
Pre-conditioning: Change of variables $\widehat{\mathbf{x}} = \mathbf{C} \mathbf{x}$ via a nonsingular $\mathbf{C}$ and solve $$ (\mathbf{C}^{-T} \mathbf{A} \mathbf{C}^{-1}) \widehat{\mathbf{x}} = \mathbf{C}^{-T} \mathbf{b}. $$ Choose $\mathbf{C}$ such that the transformed matrix $\mathbf{C}^{-T} \mathbf{A} \mathbf{C}^{-1}$ has a small condition number or clustered eigenvalues, while keeping the cost of applying the pre-conditioner low.
Preconditioned CG does not make use of $\mathbf{C}$ explicitly, but rather the matrix $\mathbf{M} = \mathbf{C}^T \mathbf{C}$.
Preconditioned CG (PCG) algorithm:
Given $\mathbf{x}^{(0)}$ and a pre-conditioner $\mathbf{M}$, initialize $\mathbf{r}^{(0)} \gets \mathbf{A} \mathbf{x}^{(0)} - \mathbf{b}$, solve $\mathbf{M} \mathbf{y}^{(0)} = \mathbf{r}^{(0)}$ for $\mathbf{y}^{(0)}$, set $\mathbf{p}^{(0)} \gets - \mathbf{y}^{(0)}$, $t \gets 0$.
While $\mathbf{r}^{(t)} \ne \mathbf{0}$:
1. $\alpha^{(t)} \gets \frac{\mathbf{r}^{(t)T} \mathbf{y}^{(t)}}{\mathbf{p}^{(t)T} \mathbf{A} \mathbf{p}^{(t)}}$
2. $\mathbf{x}^{(t+1)} \gets \mathbf{x}^{(t)} + \alpha^{(t)} \mathbf{p}^{(t)}$
3. $\mathbf{r}^{(t+1)} \gets \mathbf{r}^{(t)} + \alpha^{(t)} \mathbf{A} \mathbf{p}^{(t)}$
4. Solve $\mathbf{M} \mathbf{y}^{(t+1)} = \mathbf{r}^{(t+1)}$ for $\mathbf{y}^{(t+1)}$
5. $\beta^{(t+1)} \gets \frac{\mathbf{r}^{(t+1)T} \mathbf{y}^{(t+1)}}{\mathbf{r}^{(t)T} \mathbf{y}^{(t)}}$
6. $\mathbf{p}^{(t+1)} \gets - \mathbf{y}^{(t+1)} + \beta^{(t+1)} \mathbf{p}^{(t)}$
7. $t \gets t + 1$
Remark: The only extra cost in the pre-conditioned CG algorithm is solving the linear system $\mathbf{M} \mathbf{y} = \mathbf{r}$ in each iteration.
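For concreteness, a hand-rolled PCG sketch in Julia (an illustration only; the examples below instead pass a pre-conditioner to IterativeSolvers.jl via the Pl keyword). The pre-conditioner M is assumed to support the backslash solve M \ r, e.g. a Diagonal matrix or a factorization object:

```julia
using LinearAlgebra

# PCG sketch: M is any pre-conditioner supporting the solve M \ r.
function pcg_sketch(A, b, M; x = zero(b), tol = 1e-8, maxiter = length(b))
    r = A * x - b
    y = M \ r                 # extra pre-conditioner solve
    p = -y
    ry = dot(r, y)
    for t in 1:maxiter
        Ap = A * p
        α = ry / dot(p, Ap)
        x += α * p
        r += α * Ap
        norm(r) < tol && break
        y = M \ r             # pre-conditioner solve in each iteration
        rynew = dot(r, y)
        p = -y + (rynew / ry) * p
        ry = rynew
    end
    return x
end

# usage sketch with a diagonal (Jacobi) pre-conditioner M = diag(A)
n = 100
G = randn(n, n)
Apd = G' * G + n * I
bvec = randn(n)
Mjac = Diagonal(diag(Apd))
norm(pcg_sketch(Apd, bvec, Mjac) - Apd \ bvec)
```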
Pre-conditioning is more an art than a science. Some common choices include the incomplete Cholesky factorization (used below), incomplete LU, and diagonal (Jacobi) pre-conditioners. The Preconditioners.jl package wraps a number of commonly used preconditioners.
We use the Wathen matrix (sparse and positive definite) as a test matrix.
using BenchmarkTools, MatrixDepot, IterativeSolvers, LinearAlgebra, SparseArrays
# Wathen matrix of dimension 30401 x 30401
A = matrixdepot("wathen", 100)
using UnicodePlots
spy(A)
# density: proportion of nonzero entries
count(!iszero, A) / length(A)
# rhs
b = ones(size(A, 1))
# solve Ax=b by CG
xcg = cg(A, b);
@benchmark cg($A, $b)
Compute the incomplete Cholesky preconditioner:
using Preconditioners
@time p = CholeskyPreconditioner(A, 2)
Pre-conditioned conjugate gradient:
# solve Ax=b by PCG
xpcg = cg(A, b, Pl=p)
# same answer?
norm(xcg - xpcg)
# PCG is >10 fold faster than CG
@benchmark cg($A, $b, Pl=$p)
We learned about CG/PCG, which solve $\mathbf{A} \mathbf{x} = \mathbf{b}$ with $\mathbf{A}$ pd. Other commonly used Krylov subspace methods include:
MINRES (minimum residual method): symmetric indefinite $\mathbf{A}$.
Bi-CG (bi-conjugate gradient): unsymmetric $\mathbf{A}$.
Bi-CGSTAB (Bi-CG stabilized): improved version of Bi-CG.
GMRES (generalized minimum residual method): current de facto method for unsymmetric $\mathbf{A}$. E.g., PageRank problem.
Lanczos method: top eigen-pairs of a large symmetric matrix.
Arnoldi method: top eigen-pairs of a large unsymmetric matrix.
Lanczos bidiagonalization algorithm: top singular triplets of large matrix.
LSQR: least squares problem $\min \|\mathbf{y} - \mathbf{X} \beta\|_2^2$. Algebraically equivalent to applying CG to the normal equations $(\mathbf{X}^T \mathbf{X} + \lambda^2 \mathbf{I}) \beta = \mathbf{X}^T \mathbf{y}$ (see the note following this list).
LSMR: least squares problem $\min \|\mathbf{y} - \mathbf{X} \beta\|_2^2$. Algebraically equivalent to applying MINRES to the normal equations $(\mathbf{X}^T \mathbf{X} + \lambda^2 \mathbf{I}) \beta = \mathbf{X}^T \mathbf{y}$.
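A brief note (added here, not from the original text) on why these solvers work with $\mathbf{X}$ directly rather than forming the normal equations: since $$ \begin{eqnarray*} \kappa(\mathbf{X}^T \mathbf{X}) = \kappa(\mathbf{X})^2, \end{eqnarray*} $$ explicitly forming $\mathbf{X}^T \mathbf{X}$ squares the condition number, while LSQR and LSMR only require the products $\mathbf{X} \mathbf{v}$ and $\mathbf{X}^T \mathbf{u}$.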
MATLAB provides iterative solvers for linear systems (pcg, bicg, bicgstab, gmres, ...), for top eigen-pairs and singular values (eigs, svds, ...), and pre-conditioners (cholinc, luinc, ...).
Get familiar with the reverse communication interface (RCI) for utilizing iterative solvers, which accepts either an explicit matrix or a function handle:
x = gmres(A, b)
x = gmres(@Afun, b)
eigs(A)
eigs(@Afun)
In Julia, eigs and svds are provided by the Arpack.jl package (numerical examples later), and the IterativeSolvers.jl package implements CG and related solvers; see the Wathen matrix CG example above and the least squares examples below.
Least squares examples:
using BenchmarkTools, IterativeSolvers, LinearAlgebra, Random, SparseArrays
Random.seed!(280) # seed
n, p = 10000, 5000
X = sprandn(n, p, 0.001) # sparse matrix with density 0.001 and iid standard normal nonzero entries
β = ones(p)
y = X * β + randn(n)
β̂_qr = X \ y
# least squares by QR
@benchmark $X \ $y
β̂_lsqr = lsqr(X, y)
@show norm(β̂_qr - β̂_lsqr)
# least squares by lsqr
@benchmark lsqr($X, $y)
β̂_lsmr = lsmr(X, y)
@show norm(β̂_qr - β̂_lsmr)
# least squares by lsmr
@benchmark lsmr($X, $y)
In many applications, it is advantageous to define a linear map instead of forming the actual (sparse) matrix. For a linear map, we only need to specify how it multiplies a vector ($\mathbf{A} \mathbf{x}$) and how its adjoint multiplies a vector ($\mathbf{A}^T \mathbf{x}$). The LinearMaps.jl package is exactly for this purpose and interfaces nicely with IterativeSolvers.jl, Arnoldi.jl, and other iterative solver packages.
As an application, consider the differencing operator that takes differences between neighboring pixels, assuming periodic boundary conditions.
using LinearMaps, IterativeSolvers
# Overwrite y with A * x
# left difference assuming periodic boundary conditions
function leftdiff!(y::AbstractVector, x::AbstractVector)
N = length(x)
length(y) == N || throw(DimensionMismatch())
@inbounds for i in 1:N
y[i] = x[i] - x[mod1(i - 1, N)]
end
return y
end
# Overwrite y with A' * x
# minus right difference
function mrightdiff!(y::AbstractVector, x::AbstractVector)
N = length(x)
length(y) == N || throw(DimensionMismatch())
@inbounds for i in 1:N
y[i] = x[i] - x[mod1(i + 1, N)]
end
return y
end
# define linear map
D = LinearMap{Float64}(leftdiff!, mrightdiff!, 100; ismutating=true)
Linear maps can be used like a regular matrix.
@show size(D)
v = ones(size(D, 2)) # vector of all 1s
@show D * v
@show D' * v;
If we form the corresponding dense matrix, it will look like
Matrix(D)
If we form the corresponding sparse matrix, it will look like
using SparseArrays
sparse(D)
using UnicodePlots
spy(sparse(D))
Compute the top singular values using an iterative method (Arnoldi).
using Arpack
Arpack.svds(D, nsv=3)
using LinearAlgebra
# check solution against the direct method for SVD
svdvals(Matrix(D))
Compute the top eigenvalues of the Gram matrix D'D using an iterative method (Arnoldi).
Arpack.eigs(D'D, nev=3, which=:LM)
Chapter 5 of Numerical Optimization by Jorge Nocedal and Stephen Wright (1999).
Sections 11.3-11.5 of Matrix Computations by Gene Golub and Charles Van Loan (2013).