Conjugate gradient (CG) is the method of choice for iteratively solving large, structured linear systems $\mathbf{A} \mathbf{x} = \mathbf{b}$, where $\mathbf{A}$ is positive definite (pd).
Earlier we talked about Jacobi, Gauss-Seidel, and successive over-relaxation (SOR) as the classical iterative solvers. They are rarely used in practice due to slow convergence.
Kershaw's results for a fusion problem:

| Method | Number of Iterations |
|---|---|
| Gauss-Seidel | 208,000 |
| Block SOR methods | 765 |
| Incomplete Cholesky conjugate gradient | 25 |
Hestenes and Stiefel (1952), Methods of conjugate gradients for solving linear systems, *Journal of Research of the National Bureau of Standards*.
A set of nonzero vectors $\{\mathbf{p}^{(0)},\ldots,\mathbf{p}^{(l)}\}$ is said to be conjugate with respect to $\mathbf{A}$ if $$ \begin{eqnarray*} \mathbf{p}^{(i)T} \mathbf{A} \mathbf{p}^{(j)} = 0, \quad \text{for all } i \ne j. \end{eqnarray*} $$
Conjugate direction method: Given a set of conjugate vectors $\{\mathbf{p}^{(0)},\ldots,\mathbf{p}^{(l)}\}$, at iteration $t$, we search along the conjugate direction $\mathbf{p}^{(t)}$ $$ \begin{eqnarray*} \mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} + \alpha^{(t)} \mathbf{p}^{(t)}, \end{eqnarray*} $$ where $$ \begin{eqnarray*} \alpha^{(t)} = - \frac{\mathbf{r}^{(t)T} \mathbf{p}^{(t)}}{\mathbf{p}^{(t)T} \mathbf{A} \mathbf{p}^{(t)}} \end{eqnarray*} $$ is the optimal step length and $\mathbf{r}^{(t)} = \mathbf{A} \mathbf{x}^{(t)} - \mathbf{b}$ is the residual (the gradient of the quadratic objective) at $\mathbf{x}^{(t)}$.
Theorem: In the conjugate direction method, $\mathbf{x}^{(t)}$ converges to the solution $\mathbf{x}^*$ in at most $n$ steps, regardless of the starting point $\mathbf{x}^{(0)}$.
Intuition: conjugacy amounts to a change of variables in which $\mathbf{A}$ becomes diagonal, so the quadratic objective decouples and can be minimized along one transformed coordinate direction at a time; each step permanently eliminates one component of the error. A small numerical illustration follows.
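Below is a minimal Julia sketch of the conjugate direction method, using the eigenvectors of $\mathbf{A}$ as one convenient conjugate set (they are $\mathbf{A}$-conjugate because they are orthogonal). The function name, the random test problem, and all defaults are my own illustration, not part of the notes.

```julia
using LinearAlgebra

# Conjugate direction method: sweep once through a set of A-conjugate directions
# (the columns of P), taking the optimal step length along each.
function conj_direction(A, b, P; x = zeros(length(b)))
    for t in 1:size(P, 2)
        p = P[:, t]
        r = A * x - b                        # residual (gradient) at current iterate
        α = -dot(r, p) / dot(p, A * p)       # optimal step length along p
        x += α * p
    end
    return x
end

n = 5
B = randn(n, n)
A = B' * B + n * I                           # random positive definite matrix
b = randn(n)
P = eigvecs(Symmetric(A))                    # eigenvectors are mutually A-conjugate
x = conj_direction(A, b, P)
norm(A * x - b)                              # ≈ 0 after n steps (up to rounding)
```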
Conjugate gradient method. Idea: generate $\mathbf{p}^{(t)}$ using only $\mathbf{p}^{(t-1)}$ $$ \begin{eqnarray*} \mathbf{p}^{(t)} = - \mathbf{r}^{(t)} + \beta^{(t)} \mathbf{p}^{(t-1)}, \end{eqnarray*} $$ where $\beta^{(t)}$ is determined by the conjugacy condition $\mathbf{p}^{(t-1)T} \mathbf{A} \mathbf{p}^{(t)} = 0$ $$ \begin{eqnarray*} \beta^{(t)} = \frac{\mathbf{r}^{(t)T} \mathbf{A} \mathbf{p}^{(t-1)}}{\mathbf{p}^{(t-1)T} \mathbf{A} \mathbf{p}^{(t-1)}}. \end{eqnarray*} $$
CG algorithm (preliminary version): starting from $\mathbf{x}^{(0)}$ with $\mathbf{r}^{(0)} = \mathbf{A} \mathbf{x}^{(0)} - \mathbf{b}$ and $\mathbf{p}^{(0)} = -\mathbf{r}^{(0)}$, repeat the conjugate direction update for $\mathbf{x}^{(t+1)}$, recompute the residual $\mathbf{r}^{(t+1)} = \mathbf{A} \mathbf{x}^{(t+1)} - \mathbf{b}$, and generate the next direction $\mathbf{p}^{(t+1)}$ from $\mathbf{r}^{(t+1)}$ and $\mathbf{p}^{(t)}$ as above, until $\mathbf{r}^{(t)} = \mathbf{0}$.
Theorem: With the CG algorithm, the residuals are mutually orthogonal, $\mathbf{r}^{(t)T} \mathbf{r}^{(j)} = 0$ for $j < t$, and the search directions $\{\mathbf{p}^{(0)},\ldots,\mathbf{p}^{(t)}\}$ are mutually conjugate with respect to $\mathbf{A}$. Hence CG is a conjugate direction method and terminates in at most $n$ steps.
CG algorithm (economical version): using these orthogonality relations, the step length and conjugacy coefficient simplify to $\alpha^{(t)} = \mathbf{r}^{(t)T} \mathbf{r}^{(t)} / (\mathbf{p}^{(t)T} \mathbf{A} \mathbf{p}^{(t)})$ and $\beta^{(t+1)} = \mathbf{r}^{(t+1)T} \mathbf{r}^{(t+1)} / (\mathbf{r}^{(t)T} \mathbf{r}^{(t)})$, and the residual can be updated recursively as $\mathbf{r}^{(t+1)} = \mathbf{r}^{(t)} + \alpha^{(t)} \mathbf{A} \mathbf{p}^{(t)}$. This saves one matrix-vector multiplication per iteration.
Computation cost per iteration is thus one matrix-vector multiplication, $\mathbf{A} \mathbf{p}^{(t)}$.
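The following is a minimal Julia sketch of the economical CG iteration just described; the function name `cg_econ`, the convergence tolerance, and the default arguments are my own choices for illustration.

```julia
using LinearAlgebra

# Economical CG: one A*p per iteration, residual updated recursively.
function cg_econ(A, b; x = zeros(length(b)), tol = 1e-10, maxiter = length(b))
    r = A * x - b                      # residual r = A x - b
    p = -r
    rsold = dot(r, r)
    for t in 1:maxiter
        Ap = A * p                     # the single matrix-vector product per iteration
        α = rsold / dot(p, Ap)         # simplified optimal step length
        x += α * p
        r += α * Ap                    # recursive residual update (no extra A * x)
        rsnew = dot(r, r)
        sqrt(rsnew) < tol && break
        p = -r + (rsnew / rsold) * p   # next conjugate direction
        rsold = rsnew
    end
    return x
end
```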
Consider the PageRank problem: $\mathbf{A}$ has dimension $n \approx 10^{10}$ but is highly structured (sparse + low rank), so each matrix-vector multiplication costs only $O(n)$ flops.
Theorem: If $\mathbf{A}$ has $r$ distinct eigenvalues, $\mathbf{x}^{(t)}$ converges to the solution $\mathbf{x}^*$ in at most $r$ steps.
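A quick numerical check of this result, using the `cg_econ` sketch above on a test matrix that I construct with exactly three distinct eigenvalues (the construction is mine, for illustration only):

```julia
using LinearAlgebra

n = 100
Q = Matrix(qr(randn(n, n)).Q)                         # random orthogonal matrix
λ = [fill(1.0, 34); fill(10.0, 33); fill(100.0, 33)]  # r = 3 distinct eigenvalues
A = Symmetric(Q * Diagonal(λ) * Q')
b = randn(n)
x3 = cg_econ(A, b; maxiter = 3)                       # run exactly r = 3 CG steps
norm(A * x3 - b)                                      # ≈ 0 up to rounding error
```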
In summary, the conjugate gradient method solves $\mathbf{A} \mathbf{x} = \mathbf{b}$, or equivalently minimizes $\frac{1}{2} \mathbf{x}^T \mathbf{A} \mathbf{x} - \mathbf{b}^T \mathbf{x}$, using only matrix-vector multiplications with $\mathbf{A}$.
Two important bounds for the conjugate gradient algorithm:
Let $\lambda_1 \le \cdots \le \lambda_n$ be the ordered eigenvalues of a pd $\mathbf{A}$.
$$
\begin{eqnarray*}
\|\mathbf{x}^{(t+1)} - \mathbf{x}^*\|_{\mathbf{A}}^2 &\le& \left( \frac{\lambda_{n-t} - \lambda_1}{\lambda_{n-t} + \lambda_1} \right)^2 \|\mathbf{x}^{(0)} - \mathbf{x}^*\|_{\mathbf{A}}^2 \\
\|\mathbf{x}^{(t+1)} - \mathbf{x}^*\|_{\mathbf{A}} &\le& 2 \left( \frac{\sqrt{\kappa(\mathbf{A})}-1}{\sqrt{\kappa(\mathbf{A})}+1} \right)^{t} \|\mathbf{x}^{(0)} - \mathbf{x}^*\|_{\mathbf{A}},
\end{eqnarray*}
$$
where $\kappa(\mathbf{A}) = \lambda_n/\lambda_1$ is the condition number of $\mathbf{A}$.
Messages: if the eigenvalues of $\mathbf{A}$ occur in $r$ distinct clusters, CG approximately solves the problem after about $r$ steps; and the convergence rate is governed by $\sqrt{\kappa(\mathbf{A})}$, so ill-conditioned systems converge slowly. This motivates pre-conditioning.
Pre-conditioning: Change of variables $\widehat{\mathbf{x}} = \mathbf{C} \mathbf{x}$ via a nonsingular $\mathbf{C}$ and solve $$ (\mathbf{C}^{-T} \mathbf{A} \mathbf{C}^{-1}) \widehat{\mathbf{x}} = \mathbf{C}^{-T} \mathbf{b}. $$ Choose $\mathbf{C}$ such that the transformed matrix $\mathbf{C}^{-T} \mathbf{A} \mathbf{C}^{-1}$ has a smaller condition number or more clustered eigenvalues than $\mathbf{A}$, while linear systems involving $\mathbf{M} = \mathbf{C}^T \mathbf{C}$ remain cheap to solve.
Preconditioned CG does not make use of $\mathbf{C}$ explicitly, but rather the matrix $\mathbf{M} = \mathbf{C}^T \mathbf{C}$.
Preconditioned CG (PCG) algorithm: same iteration as CG, but while $\mathbf{r}^{(t)} \ne \mathbf{0}$ each step additionally solves $\mathbf{M} \mathbf{y}^{(t)} = \mathbf{r}^{(t)}$ and uses $\mathbf{y}^{(t)}$ in place of $\mathbf{r}^{(t)}$ when forming the search directions; see the sketch below.
Remark: The only extra cost in the pre-conditioned CG algorithm is the need to solve the linear system $\mathbf{M} \mathbf{y} = \mathbf{r}$ in each iteration.
Pre-conditioning is more of an art than a science. Some choices include the diagonal (Jacobi) pre-conditioner, incomplete Cholesky factorization, banded or block approximations of $\mathbf{A}$, and pre-conditioners derived from knowledge of the underlying problem.
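Below is a minimal Julia sketch of PCG with a diagonal (Jacobi) pre-conditioner; the function name `pcg_sketch` and all defaults are my own illustration, not the notes' implementation.

```julia
using LinearAlgebra

# PCG: identical to CG except for the solve M y = r in each iteration.
function pcg_sketch(A, b, M; x = zeros(length(b)), tol = 1e-10, maxiter = length(b))
    r = A * x - b
    y = M \ r                          # pre-conditioner solve M y = r
    p = -y
    ry = dot(r, y)
    for t in 1:maxiter
        Ap = A * p
        α = ry / dot(p, Ap)
        x += α * p
        r += α * Ap
        norm(r) < tol && break
        y = M \ r                      # the only extra cost per iteration
        rynew = dot(r, y)
        p = -y + (rynew / ry) * p
        ry = rynew
    end
    return x
end

# usage with the Jacobi pre-conditioner M = Diagonal(diag(A)):
# x = pcg_sketch(A, b, Diagonal(diag(A)))
```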
We learned about CG/PCG, which solve $\mathbf{A} \mathbf{x} = \mathbf{b}$ with $\mathbf{A}$ pd. Other commonly used Krylov subspace methods include:
MINRES (minimum residual method): symmetric indefinite $\mathbf{A}$.
Bi-CG (bi-conjugate gradient): unsymmetric $\mathbf{A}$.
Bi-CGSTAB (Bi-CG stabilized): improved version of Bi-CG.
GMRES (generalized minimum residual method): current de facto method for unsymmetric $\mathbf{A}$, e.g., the PageRank problem; see the GMRES sketch after this list.
Lanczos method: top eigen-pairs of a large symmetric matrix.
Arnoldi method: top eigen-pairs of a large unsymmetric matrix.
Lanczos bidiagonalization algorithm: top singular triplets of a large matrix.
LSQR: damped least squares problem $\min \|\mathbf{y} - \mathbf{X} \beta\|_2^2 + \lambda^2 \|\beta\|_2^2$ (with $\lambda = 0$ giving ordinary least squares). Algebraically equivalent to applying CG to the normal equation $(\mathbf{X}^T \mathbf{X} + \lambda^2 \mathbf{I}) \beta = \mathbf{X}^T \mathbf{y}$, but numerically more stable.
LSMR: same damped least squares problem. Algebraically equivalent to applying MINRES to the normal equation $(\mathbf{X}^T \mathbf{X} + \lambda^2 \mathbf{I}) \beta = \mathbf{X}^T \mathbf{y}$, but numerically more stable.
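As a quick illustration of GMRES mentioned above, here is a hedged sketch using IterativeSolvers.jl on an unsymmetric sparse system; the test matrix below is my own construction, not from the notes.

```julia
using IterativeSolvers, LinearAlgebra, Random, SparseArrays

Random.seed!(280)
n = 5_000
A = sprandn(n, n, 1e-3) + 10I      # unsymmetric sparse matrix, shifted well away from singularity
b = randn(n)
x = gmres(A, b)                    # GMRES from IterativeSolvers.jl
norm(A * x - b)                    # residual norm
```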
MATLAB provides iterative linear solvers `pcg`, `bicg`, `bicgstab`, `gmres`, ..., iterative eigen/singular value solvers `eigs`, `svds`, ..., and pre-conditioners `cholinc`, `luinc`, ...
Get familiar with the reverse communication interface (RCI) for utilizing iterative solvers: instead of passing the matrix `A` itself, you can supply a function handle that returns the matrix-vector product $\mathbf{A} \mathbf{v}$, e.g., `x = gmres(A, b)` vs. `x = gmres(@Afun, b)`, or `eigs(A)` vs. `eigs(@Afun)`.
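A hedged Julia analogue: solvers in IterativeSolvers.jl only need the action $\mathbf{v} \mapsto \mathbf{A} \mathbf{v}$, which can be supplied through the LinearMaps.jl package (my addition; the matrix-free operator below is purely illustrative).

```julia
using IterativeSolvers, LinearAlgebra, LinearMaps

n = 1_000
# matrix-free action of the symmetric pd tridiagonal matrix with 3 on the
# diagonal and -1 on the off-diagonals
Afun(x) = 3x .- [x[2:end]; 0] .- [0; x[1:end-1]]
A = LinearMap(Afun, n; issymmetric = true, isposdef = true)
b = randn(n)
x = cg(A, b)            # CG never forms A explicitly, only calls Afun
norm(A * x - b)
```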
In Julia, the IterativeSolvers.jl package implements many of these iterative solvers.
CG numerical examples:
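A hedged CG example using IterativeSolvers.jl's `cg`; the sparse test problem below is my own construction, not from the notes.

```julia
using IterativeSolvers, LinearAlgebra, Random, SparseArrays

Random.seed!(280)
n = 10_000
B = sprandn(n, n, 1e-3)
A = B' * B + 10I              # sparse, symmetric positive definite
b = randn(n)
x_cg = cg(A, b)               # conjugate gradient
norm(A * x_cg - b)            # residual norm
```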
Least squares examples:
using BenchmarkTools, IterativeSolvers, LinearAlgebra, Random, SparseArrays
Random.seed!(280) # seed
n, p = 1000, 500
X = sprandn(n, p, 0.01)
β = ones(p)
y = X * β + randn(n)
β̂_qr = X \ y
@benchmark X \ y
β̂_lsqr = lsqr(X, y)
@show norm(β̂_qr - β̂_lsqr)
@benchmark lsqr(X, y)
β̂_lsmr = lsmr(X, y)
@show norm(β̂_qr - β̂_lsmr)
@benchmark lsmr(X, y)