Consider maximizing a log-likelihood function $L(\theta)$, $\theta \in \Theta \subset \mathbb{R}^p$.
Gradient (or score) of $L$: $$ \nabla L(\theta) = \begin{pmatrix} \frac{\partial L(\theta)}{\partial \theta_1} \\ \vdots \\ \frac{\partial L(\theta)}{\partial \theta_p} \end{pmatrix} $$
Differential of $L$: $$ dL(\theta) = [\nabla L(\theta)]^T $$
Hessian of $L$:
$$
\nabla^2 L(\theta) = d^2 L(\theta) = \begin{pmatrix}
\frac{\partial^2 L(\theta)}{\partial \theta_1 \partial \theta_1} & \dots & \frac{\partial^2 L(\theta)}{\partial \theta_1 \partial \theta_p} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 L(\theta)}{\partial \theta_p \partial \theta_1} & \dots & \frac{\partial^2 L(\theta)}{\partial \theta_p \partial \theta_p}
\end{pmatrix}
$$
Observed information matrix: $-d^2L(\theta) = -\nabla^2 L(\theta)$, the negative Hessian of the log-likelihood.
Newton's method was originally developed for finding roots of nonlinear equations $f(\mathbf{x}) = \mathbf{0}$ (KL 5.4).
Newton's method, aka Newton-Raphson method, is considered the gold standard for its fast (quadratic) convergence $$ \frac{\|\mathbf{\theta}^{(t+1)} - \mathbf{\theta}^*\|}{\|\mathbf{\theta}^{(t)} - \mathbf{\theta}^*\|^2} \to \text{constant}. $$
Idea: iterative quadratic approximation.
Taylor expansion around the current iterate $\mathbf{\theta}^{(t)}$ $$ L(\mathbf{\theta}) \approx L(\mathbf{\theta}^{(t)}) + dL(\mathbf{\theta}^{(t)}) (\mathbf{\theta} - \mathbf{\theta}^{(t)}) + \frac 12 (\mathbf{\theta} - \mathbf{\theta}^{(t)})^T d^2L(\mathbf{\theta}^{(t)}) (\mathbf{\theta} - \mathbf{\theta}^{(t)}) $$ and then maximize the quadratic approximation.
To maximize the quadratic function, we equate its gradient to zero $$ \nabla L(\theta^{(t)}) + [d^2L(\theta^{(t)})] (\theta - \theta^{(t)}) = \mathbf{0}_p, $$ which suggests the next iterate $$ \begin{eqnarray*} \theta^{(t+1)} &=& \theta^{(t)} - [d^2L(\theta^{(t)})]^{-1} \nabla L(\theta^{(t)}) \\ &=& \theta^{(t)} + [-d^2L(\theta^{(t)})]^{-1} \nabla L(\theta^{(t)}). \end{eqnarray*} $$ We call this naive Newton's method.
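As a minimal sketch (the example and all names are illustrative, not from the notes), here is the naive Newton iteration for a one-dimensional log-likelihood: Poisson with rate $\theta$, where $L(\theta) = S \log\theta - n\theta$ and the MLE is $\bar{y}$.

```python
# Naive Newton's method theta <- theta - [d2L]^{-1} dL, 1-D sketch.
def newton_max(grad, hess, theta, tol=1e-10, maxiter=50):
    """Iterate until the Newton step is negligible."""
    for _ in range(maxiter):
        step = grad(theta) / hess(theta)
        theta = theta - step
        if abs(step) < tol:
            break
    return theta

# Poisson log-likelihood L(theta) = S*log(theta) - n*theta (toy data).
y = [2, 3, 4, 1, 5]
n, S = len(y), sum(y)
grad = lambda t: S / t - n        # dL/dtheta
hess = lambda t: -S / t**2        # d^2L/dtheta^2 (negative: L concave)

theta_hat = newton_max(grad, hess, theta=2.0)
print(theta_hat)                  # converges to mean(y) = 3.0
```

Starting from $\theta^{(0)}=2$, the iterates $2 \to 2.667 \to 2.963 \to 3.000$ illustrate the quadratic convergence: the number of correct digits roughly doubles each step.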
Some issues with the (naive) Newton's iteration:
- The Hessian $d^2L(\theta^{(t)})$ can be expensive to compute and invert at each iteration.
- Away from the optimum, $-d^2L(\theta^{(t)})$ may fail to be positive definite, so the Newton direction need not be an ascent direction.
- Even along an ascent direction, the full Newton step may overshoot and decrease $L$; a step length is needed.
In summary, a Newton-type algorithm iterates according to $$ \boxed{ \theta^{(t+1)} = \theta^{(t)} + s [\mathbf{A}^{(t)}]^{-1} \nabla L(\theta^{(t)}) = \theta^{(t)} + s \Delta \theta^{(t)} } $$ where $\mathbf{A}^{(t)}$ is a pd approximation of $-d^2L(\theta^{(t)})$ and $s$ is a step length.
For strictly concave $L$, $-d^2L(\theta^{(t)})$ is always positive definite. Line search is still needed to guarantee convergence.
Line search strategies: step-halving ($s=1,1/2,\ldots$), golden section search, cubic interpolation, Armijo rule, ... Note the Newton direction $\Delta \theta^{(t)}$ only needs to be calculated once. The cost of line search lies mainly in objective function evaluations.
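Step-halving is the simplest of these strategies; a minimal sketch (function and variable names are illustrative) showing that the direction is fixed while only the objective is re-evaluated:

```python
# Step-halving line search: try s = 1, 1/2, 1/4, ... until the
# objective increases along the fixed search direction.
def step_halving(L, theta, direction, max_halvings=30):
    L0, s = L(theta), 1.0
    for _ in range(max_halvings):
        candidate = [t + s * d for t, d in zip(theta, direction)]
        if L(candidate) > L0:
            return candidate, s
        s /= 2.0
    return theta, 0.0   # no improvement found; keep current iterate

# Concave toy objective with maximum at the origin; the proposed
# direction overshoots, so one halving is needed.
L = lambda th: -sum(t * t for t in th)
theta_new, s = step_halving(L, [1.0, 1.0], [-3.0, -3.0])
print(s, theta_new)   # s = 0.5, theta_new = [-0.5, -0.5]
```

The full step $s=1$ lands at $(-2,-2)$ with a worse objective, so the search accepts $s=1/2$.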
How to approximate $-d^2L(\theta)$? More of an art than science. Often requires problem specific analysis.
Taking $\mathbf{A} = \mathbf{I}$ leads to the method of steepest ascent, aka gradient ascent.
Fisher's scoring method: replace $-d^2L(\theta)$ by the expected Fisher information matrix $$ \mathbf{I}(\theta) = \mathbf{E}[-d^2L(\theta)] = \mathbf{E}[\nabla L(\theta) dL(\theta)] \succeq \mathbf{0}_{p \times p}, $$ which is psd under exchangeability of expectation and differentiation.
Therefore the Fisher's scoring algorithm iterates according to $$ \boxed{ \theta^{(t+1)} = \theta^{(t)} + s [\mathbf{I}(\theta^{(t)})]^{-1} \nabla L(\theta^{(t)})}. $$
Let's consider a concrete example: logistic regression.
The goal is to predict whether a credit card transaction is fraud ($y_i=1$) or not ($y_i=0$). Predictors ($\mathbf{x}_i$) include: time of transaction, last location, merchant, ...
$y_i \in \{0,1\}$, $\mathbf{x}_i \in \mathbb{R}^{p}$. Model $y_i \sim \text{Bernoulli}(p_i)$.
Logistic regression. Density $$ \begin{eqnarray*} f(y_i|p_i) &=& p_i^{y_i} (1-p_i)^{1-y_i} \\ &=& e^{y_i \ln p_i + (1-y_i) \ln (1-p_i)} \\ &=& e^{y_i \ln \frac{p_i}{1-p_i} + \ln (1-p_i)}, \end{eqnarray*} $$ where $$ \begin{eqnarray*} \mathbf{E} (y_i) = p_i &=& \frac{e^{\mathbf{x}_i^T \beta}}{1+ e^{\mathbf{x}_i^T \beta}} \quad \text{(mean function, inverse link function)} \\ \mathbf{x}_i^T \beta &=& \ln \left( \frac{p_i}{1-p_i} \right) \quad \text{(logit link function)}. \end{eqnarray*} $$
Given data $(y_i,\mathbf{x}_i)$, $i=1,\ldots,n$, the log-likelihood is $$ L_n(\beta) = \sum_{i=1}^n \left[ y_i \mathbf{x}_i^T \beta - \ln \left( 1 + e^{\mathbf{x}_i^T \beta} \right) \right]. $$
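Fisher scoring for logistic regression (which coincides with Newton's method here, as the logit link is canonical) can be sketched as follows, with simulated data and step length $s=1$ (all names and the data-generating setup are illustrative):

```python
import numpy as np

# Fisher scoring (= Newton for the canonical logit link) on
# simulated logistic regression data.
rng = np.random.default_rng(0)
n, p = 200, 3
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, p - 1))])
beta_true = np.array([0.5, 1.0, -1.0])
y = rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))

beta = np.zeros(p)
for _ in range(25):
    eta = X @ beta
    mu = 1 / (1 + np.exp(-eta))        # mean function p_i
    score = X.T @ (y - mu)             # gradient of log-likelihood
    W = mu * (1 - mu)                  # Bernoulli variances
    info = X.T @ (W[:, None] * X)      # Fisher information X' W X
    step = np.linalg.solve(info, score)
    beta = beta + step                 # step length s = 1
    if np.linalg.norm(step) < 1e-10:
        break
print(beta)   # close to beta_true for moderate n
```

At convergence the score vanishes, i.e. $\mathbf{X}^T(\mathbf{y}-\boldsymbol{\mu}) = \mathbf{0}$, the familiar normal-equation form of the logistic MLE.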
Let's consider the more general class of generalized linear models (GLM).
Family | Canonical Link | Variance Function |
---|---|---|
Normal | $\eta=\mu$ | 1 |
Poisson | $\eta=\log \mu$ | $\mu$ |
Binomial | $\eta=\log \left( \frac{\mu}{1 - \mu} \right)$ | $\mu (1 - \mu)$ |
Gamma | $\eta = \mu^{-1}$ | $\mu^2$ |
Inverse Gaussian | $\eta = \mu^{-2}$ | $\mu^3$ |
$Y$ belongs to an exponential family with density $$ p(y|\theta,\phi) = \exp \left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi) \right\}. $$
Score, Hessian, information: with $\mu_i = \mathbf{E}(y_i)$, $\sigma_i^2 = \text{Var}(y_i)$, and $\eta_i = \mathbf{x}_i^T \beta$, $$ \begin{eqnarray*} \nabla L_n(\beta) &=& \sum_{i=1}^n \frac{(y_i - \mu_i) \mu_i'(\eta_i)}{\sigma_i^2} \mathbf{x}_i \\ \mathbf{I}(\beta) = \mathbf{E}[-d^2 L_n(\beta)] &=& \sum_{i=1}^n \frac{[\mu_i'(\eta_i)]^2}{\sigma_i^2} \mathbf{x}_i \mathbf{x}_i^T. \end{eqnarray*} $$
Fisher scoring method $$ \beta^{(t+1)} = \beta^{(t)} + s [\mathbf{I}(\beta^{(t)})]^{-1} \nabla L_n(\beta^{(t)}) $$ is an IRWLS (iteratively reweighted least squares) with weights $w_i = [\mu_i'(\eta_i)]^2/\sigma_i^2$ and working responses $z_i = \eta_i + \frac{y_i - \mu_i}{\mu_i'(\eta_i)}$.
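The equivalence between one Fisher scoring step and one weighted least squares solve can be checked numerically; a minimal sketch for logistic regression on hypothetical toy data (for the logit link, $\mu'(\eta) = \sigma^2 = \mu(1-\mu)$, so $w_i = \mu_i(1-\mu_i)$):

```python
import numpy as np

# One IRWLS step vs. one Fisher scoring step for logistic regression.
X = np.array([[1.0, 0.2], [1.0, -1.0], [1.0, 0.5], [1.0, 1.5]])
y = np.array([1.0, 0.0, 1.0, 0.0])
beta = np.array([0.1, -0.2])

eta = X @ beta
mu = 1 / (1 + np.exp(-eta))
w = mu * (1 - mu)              # w_i = [mu_i'(eta_i)]^2 / sigma_i^2
z = eta + (y - mu) / w         # working response

# Weighted least squares on (z, X) with weights w:
beta_irwls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
# Fisher scoring step with s = 1:
beta_fs = beta + np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - mu))
print(np.allclose(beta_irwls, beta_fs))   # True: the updates coincide
```

Algebraically, $(\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{z} = \beta + (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{y}-\boldsymbol{\mu})$, which is exactly the scoring update.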
For the canonical link, $\theta = \eta$, the second term of the Hessian vanishes and the Hessian coincides with the Fisher information matrix. Convex problem 😄 $$ \text{Fisher's scoring = Newton's method}. $$
Non-canonical link, non-convex problem 😞 $$ \text{Fisher's scoring algorithm} \ne \text{Newton's method}. $$ Example: Probit regression (binary response with probit link). $$ \begin{eqnarray*} y_i &\sim& \text{Bernoulli}(p_i) \\ p_i &=& \Phi(\mathbf{x}_i^T \beta) \\ \eta_i &=& \mathbf{x}_i^T \beta = \Phi^{-1}(p_i). \end{eqnarray*} $$ where $\Phi(\cdot)$ is the cdf of a standard normal.
Julia, R and Matlab implement the Fisher scoring method, aka IRWLS, for GLMs.
Now we finally get to the problem Gauss faced in 1800!
Relocate Ceres by fitting 41 observations to a 6-parameter (nonlinear) orbit.
Nonlinear least squares (curve fitting): $$ \text{minimize} \,\, f(\beta) = \frac{1}{2} \sum_{i=1}^n [y_i - \mu_i(\mathbf{x}_i, \beta)]^2 $$ For example, $y_i =$ dry weight of onion and $x_i=$ growth time, and we want to fit a 3-parameter growth curve $$ \mu(x, \beta_1,\beta_2,\beta_3) = \frac{\beta_3}{1 + e^{-\beta_1 - \beta_2 x}}. $$
"Score" and "information matrices" $$ \begin{eqnarray*} \nabla f(\beta) &=& - \sum_{i=1}^n [y_i - \mu_i(\beta)] \nabla \mu_i(\beta) \\ d^2 f(\beta) &=& \sum_{i=1}^n \nabla \mu_i(\beta) d \mu_i(\beta) - \sum_{i=1}^n [y_i - \mu_i(\beta)] d^2 \mu_i(\beta) \\ \mathbf{I}(\beta) &=& \sum_{i=1}^n \nabla \mu_i(\beta) d \mu_i(\beta) = \mathbf{J}(\beta)^T \mathbf{J}(\beta), \end{eqnarray*} $$ where $\mathbf{J}(\beta)^T = [\nabla \mu_1(\beta), \ldots, \nabla \mu_n(\beta)] \in \mathbb{R}^{p \times n}$.
Gauss-Newton (= "Fisher's scoring algorithm") uses $\mathbf{I}(\beta)$, which is always psd. Since we are minimizing $f$, the iteration is $$ \boxed{ \beta^{(t+1)} = \beta^{(t)} - s [\mathbf{I} (\beta^{(t)})]^{-1} \nabla f(\beta^{(t)}) } $$
Levenberg-Marquardt method, aka damped least squares algorithm (DLS), adds a ridge term to the approximate Hessian $$ \boxed{ \beta^{(t+1)} = \beta^{(t)} - s [\mathbf{I} (\beta^{(t)}) + \tau \mathbf{I}_p]^{-1} \nabla f(\beta^{(t)}) } $$ bridging between Gauss-Newton ($\tau \to 0$) and steepest descent ($\tau \to \infty$).
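A Levenberg-Marquardt sketch for the 3-parameter growth curve above, with simulated data, fixed damping $\tau$, and step length $s=1$ (all names and the simulation setup are illustrative assumptions; since $\nabla f(\beta) = -\mathbf{J}^T \mathbf{r}$ with residuals $r_i = y_i - \mu_i$, the update solves $(\mathbf{J}^T\mathbf{J} + \tau \mathbf{I}_p)\,\Delta\beta = \mathbf{J}^T \mathbf{r}$):

```python
import numpy as np

# mu(x; b) = b3 / (1 + exp(-b1 - b2*x)), the 3-parameter growth curve.
def mu(x, b):
    return b[2] / (1 + np.exp(-b[0] - b[1] * x))

def jacobian(x, b):
    g = 1 / (1 + np.exp(-b[0] - b[1] * x))       # logistic factor
    return np.column_stack([b[2] * g * (1 - g),   # d mu / d b1
                            b[2] * g * (1 - g) * x,  # d mu / d b2
                            g])                   # d mu / d b3

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
b_true = np.array([-2.0, 0.8, 5.0])
y = mu(x, b_true) + 0.05 * rng.standard_normal(x.size)

b, tau = np.array([-1.0, 0.5, 4.0]), 1e-3
for _ in range(50):
    r = y - mu(x, b)                              # residuals
    J = jacobian(x, b)
    A = J.T @ J + tau * np.eye(3)                 # damped GN matrix
    step = np.linalg.solve(A, J.T @ r)            # descent step on f
    b = b + step
    if np.linalg.norm(step) < 1e-10:
        break
print(b)   # near b_true
```

With $\tau$ small the step is essentially Gauss-Newton; increasing $\tau$ shrinks the step toward the (scaled) steepest descent direction, which is useful when $\mathbf{J}^T\mathbf{J}$ is ill-conditioned.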
Other approximations to the Hessian arise in, e.g., nonlinear GLMs.
See KL 14.4 for examples.