# 7 Numerical optimization

The main application of numerical optimization in statistics is computation of parameter estimates. Typically by maximizing the likelihood function or by maximizing or minimizing another estimation criterion. The focus of this chapter is on optimization algorithms for (penalized) maximum likelihood estimation. Out of tradition we formulate all results and algorithms in terms of minimization and not maximization.

The generic optimization problem considered is the minimization of
a function \(H : \Theta \to \mathbb{R}\) for \(\Theta \subseteq \mathbb{R}^p\) an
open set and \(H\) twice differentiable. In applications we consider \(H = -\ell\),
the negative log-likelihood function, or \(H = -\ell + J\), where \(J : \Theta \to \mathbb{R}\)
is a *penalty function*, likewise twice differentiable, that does not
depend upon data.

Most statistical optimization problems share a common structure that we we should pay attention to. Given data points \(x_1, \ldots, x_N\) the negative log-likelihood

\[ -\ell(\theta) = - \sum_{i=1}^N \log(f_{\theta}(x_i)) \]

is a sum over the data, and the more data we have the more computationally demanding it is to evaluate \(-\ell\) and its derivatives. There are exceptions, though, when a sufficient statistic can be computed upfront, but generally we must expect runtime to grow with data size.

We should also note that blindly pursuing the global maximum of the likelihood can lead us astray if our model is not well behaved. Kiefer (1978) demonstrated that for the Gaussian mixture model introduced in Section 6.4.1, the likelihood (with unknown variance parameters) is unbounded. Yet Kiefer showed that for \(H = - \ell\) there exists a consistent sequence of solutions to the equation \(\nabla H(\theta) = 0\) as \(N \to \infty\). That is, the unknown parameters can be estimated consistently by computing stationary points of \(H\)—even if these do not globally minimize \(H\). In situations where \(H\) is not convex and where \(H\) may be unbounded below, it might still be possible to compute a good parameter estimate as a local minimizer. So even if we phrase the objective as a global optimization problem, we may be equally interested in local minima or even just stationary points; solutions to \(\nabla H(\theta) = 0\).

Optimization is a computational tool used in statistics for adapting models to data – but the minimizer of \(H\) for a particular data set is not intrinsically interesting. It is typically unnecessary to compute a (local) minimizer to high numerical precision. If the numerical error is already orders of magnitudes smaller than the statistical uncertainty of the parameter estimate being computed, further optimization makes no relevant difference. Rather than computing a single numerically accurate minimizer we should use our computational resources to explore many models and many starting values for the numerical optimizations, and we should assess the various model fits and quantify the model and parameter uncertainties.

For the generic minimization problem considered in this chapter, the practical
challenge when implementing algorithms in R is typically
to implement efficient evaluation of \(H\) and its derivatives. In particular,
efficient evaluations of \(-\ell\). Several choices of standard optimization
algorithms are possible and some are already implemented and available in R.
For many practical purposes the BFGS-algorithms as implemented via `optim()`

work well and require only the computation of gradients. It is, of course,
paramount that \(H\) and \(\nabla H\) are correctly implemented, and
efficiency of the algorithms is largely determined by the efficiency of
the implementation of \(H\) and \(\nabla H\) as well as by the choice of parametrization.

## 7.1 Peppered Moths

We revisit in this section the peppered moth example from Section 6.2.1,
where we implemented the log-likelihood for general multinomial cell collapsing
and computed the maximum-likelihood estimate via
`optim()`

. We used the default Nelder-Mead algorithm, which does not require
derivatives. In this section we implement the gradient as well and
illustrate how to optimize a function via `optim()`

using the conjugate
gradient algorithm and the BFGS-algorithm.

From the expression for the log-likelihood in (6.1) it follows that the gradient equals

\[ \nabla \ell(\theta) = \sum_{j = 1}^{K_0} \frac{ x_j }{ M(p(\theta))_j}\nabla M(p(\theta))_j = \sum_{j = 1}^{K_0} \sum_{k \in A_j} \frac{ x_j}{ M(p(\theta))_j} \nabla p_k(\theta). \]

Letting \(j(k)\) be defined by \(k \in A_{j(k)}\) we see that the gradient can also be written as

\[ \nabla \ell(\theta) = \sum_{k=1}^K \frac{x_{j(k)}}{ M(p(\theta))_{j(k)}} \nabla p_k(\theta) = \mathbf{\tilde{x}}(\theta) \mathrm{D}p(\theta), \]

where \(\mathrm{D}p(\theta)\) is the Jacobian of the parametrization \(\theta \mapsto p(\theta)\), and \(\mathbf{\tilde{x}}(\theta)\) is the vector with

\[ \mathbf{\tilde{x}}(\theta)_k = \frac{ x_{j(k)}}{M(p(\theta))_{j(k)}}. \]

```
grad_loglik <- function(par, x, prob, Dprob, group) {
p <- prob(par)
if(is.null(p)) return(rep(NA, length(par)))
- (x[group] / M(p, group)[group]) %*% Dprob(par)
}
```

The Jacobian needs to be implemented for the specific example of peppered moths.

```
Dprob_pep <- function(p) {
p[3] <- 1 - p[1] - p[2]
matrix(
c(2 * p[1], 0,
2 * p[2], 2 * p[1],
2* p[3] - 2 * p[1], -2 * p[1],
0, 2 * p[2],
-2 * p[2], 2 * p[3] - 2 * p[2],
-2 * p[3], -2 * p[3]),
ncol = 2, nrow = 6, byrow = TRUE)
}
```

We can then use the conjugate gradient algorithm to compute the maximum-likelihood estimate.

```
optim(c(0.3, 0.3), loglik, grad_loglik, x = c(85, 196, 341),
prob = prob_pep, Dprob = Dprob_pep, group = c(1, 1, 1, 2, 2, 3),
method = "CG")
```

```
## $par
## [1] 0.07084 0.18874
##
## $value
## [1] 600.5
##
## $counts
## function gradient
## 131 25
##
## $convergence
## [1] 0
##
## $message
## NULL
```

The peppered Moth example is very simple.
The log-likelihood can easily be computed, and we used this
problem to illustrate ways of implementing a
likelihood in R and how to use `optim()`

to maximize it.

One of the likelihood implementations was very problem specific
while the other more abstract and general, and we used the same general and abstract
approach to implement the gradient above. The gradient could then
be used for other optimization algorithms, still using `optim`

, such as
conjugate gradient. In fact, you can use conjugate gradient without
computing and implementing the gradient.

```
optim(c(0.3, 0.3), loglik, x = c(85, 196, 341),
prob = prob_pep, group = c(1, 1, 1, 2, 2, 3),
method = "CG")
```

```
## $par
## [1] 0.07084 0.18874
##
## $value
## [1] 600.5
##
## $counts
## function gradient
## 110 15
##
## $convergence
## [1] 0
##
## $message
## NULL
```

If we do not implement a gradient, a numerical gradient is
used by `optim()`

. This can result in a slower algorithm than if the
gradient is implemented, but more seriously, in can result in convergence
problems. This is because there is a subtle tradeoff between numerical
accuracy and accuracy of the finite difference approximation used to
approximate the gradient. We did not
experience convergence problems in the example above, but one way to remedy such problems
is to set the `parscale`

or `fnscale`

entries in the `control`

list argument to `optim()`

.

In the following chapter the peppered moth example is used to illustrate the EM algorithm. It is important to understand that the EM algorithm does not rely on the ability to compute the likelihood or the gradient of the likelihood for that matter. In many real applications of the EM algorithm the computation of the likelihood is challenging or even impossible, thus most standard optimization algorithms will not be directly applicable.

## 7.2 Algorithms and convergence

In the subsequent sections of this chapter we will investigate the implementation of several standard optimization algorithms—including the conjugate gradient algorithm used for the peppered moth example in the previous section. In this section we develop a minimal theoretical framework for presenting and analyzing optimization algorithms and their convergence. This establishes a rudimentary theoretical understanding of sufficient conditions for convergence, but equally importantly it provides us with a precise framework for quantifying and measuring convergence of algorithms in practice.

A numerical optimization algorithm computes from an initial value \(\theta_0 \in \Theta\) a sequence \(\theta_1, \theta_2, \ldots \in \Theta\). One could hope for

\[ \theta_n \rightarrow \text{arg min}_{\theta} H(\theta) \]

for \(n \to \infty\), but less can typically be guaranteed. First,
the global minimizer may not exist or it may not be unique, in which
case the convergence itself is undefined or ambiguous. Second, \(\theta_n\)
can in general only be shown to converge to a *local* minimizer if anything.
Third, \(\theta_n\) may not converge, even if \(H(\theta_n)\)
converges to a local minimum.

To discuss properties of optimization algorithms in greater detail we
need ways to quantify, analyze and monitor their convergence. This can be
done in terms of \(\theta_n\), \(H(\theta_n)\) or \(\nabla H(\theta_n)\).
Focusing on \(H(\theta_n)\) we say that an algorithm is a *descent algorithm* if

\[H(\theta_0) \geq H(\theta_1) \geq H(\theta_2) \geq \ldots.\]

If all inequalities are sharp (except if some \(\theta_i\) is a local minimizer),
the algorithm is called a *strict* descent algorithm. The
sequence \(H(\theta_n)\) is convergent for any descent algorithm if \(H\)
is bounded from below, e.g., if the *level set*

\[\mathrm{lev}(\theta_0) = \{\theta \in \Theta \mid H(\theta) \leq H(\theta_0)\}\]

is compact. However, even for a strict descent algorithm,
\(H(\theta_n)\) may descent in smaller and smaller steps toward a limit that is
not a (local) minimum. A good optimization algorithm needs to
guarantee more than descent – it needs to guarantee *sufficient* descent
in each step.

Many algorithms can be phrased as

\[\theta_n = \Phi(\theta_{n-1})\]

for a map \(\Phi : \Theta \to \Theta\). When \(\Phi\) is continuous and \(\theta_n \rightarrow \theta_{\infty}\) it follows that

\[\Phi(\theta_n) \rightarrow \Phi(\theta_{\infty}).\]

Since \(\Phi(\theta_n) = \theta_{n+1} \rightarrow \theta_{\infty}\) we see that

\[\Phi(\theta_{\infty}) = \theta_{\infty}.\]

That is, \(\theta_{\infty}\) is a *fixed point* of \(\Phi\). If \(\theta_n\) is
not convergent, any accumulation point of \(\theta_n\) will still be a
fixed point of \(\Phi\). Thus we can search for potential accumulation
points by searching for fixed points of the map \(\Phi: \Theta \to \Theta\).

Mathematics is full of *fixed point theorems* that: i) give conditions under which a map
has a fixed point; and ii) in some cases guarantee that the iterates \(\Phi^{\circ n}(\theta_0)\)
converge to a fixed point. The most prominent result is Banach’s fixed point
theorem.
It states that if \(\Phi\) is a *contraction*, that is,

\[\| \Phi(\theta) - \Phi(\theta')\| \leq c \|\theta - \theta'\|\]

for a constant \(c \in [0,1)\) (using any norm), then \(\Phi\) has a unique fixed point and \(\theta_n = \Phi^{\circ n}(\theta_0)\) converges to that fixed point for any starting point \(\theta_0 \in \Theta\).

We will show how a simple gradient descent algorithm can be analyzed – both as a descent algorithm and through Banach’s fixed point theorem. This will demonstrate basic proof techniques as well as typical conditions on \(H\) that can give convergence results for optimization algorithms. We will only scratch the surface here with the intention that it can motivate the algorithms introduced in subsequent sections and chapters as well as empirical techniques for practical assessment of convergence.

Indeed, theory rarely provides us with sharp quantitative results on the rate of convergence and we will need computational techniques to monitor and measure convergence of algorithms in practice. Otherwise we cannot compare the efficiency of different algorithms.

### 7.2.1 Gradient descent

We will assume in this section that \(\Theta = \mathbb{R}^p\). This will simplify the analysis a bit, but with minor modifications it can be generalized to the case where \(\Theta\) is open and convex.

Suppose that \(D^2H(\theta)\) has *numerical radius* uniformly bounded by \(L\), that is,

\[|\gamma^T D^2H(\theta) \gamma| \leq L \|\gamma\|_2^2\]

for all \(\theta \in \Theta\) and \(\gamma \in \mathbb{R}^p\). The
*gradient descent* algorithm with a fixed step length \(1/(L + 1)\) is given by
\[\begin{align}
\theta_{n} & = \theta_{n-1} - \frac{1}{L + 1} \nabla H(\theta_{n-1}). \tag{7.1}
\end{align}\]
Fixing \(n\) there is by Taylor’s theorem a
\(\tilde{\theta} = \alpha \theta_{n} + (1- \alpha)\theta_{n-1}\) (where \(\alpha \in [0,1]\))
on the line between \(\theta_n\) and \(\theta_{n-1}\) such that

\[\begin{align*} H(\theta_n) & = H(\theta_{n-1}) - \frac{1}{L+1} \|\nabla H(\theta_{n-1})\|_2^2 + \\ & \qquad \frac{1}{(L+1)^2} \nabla H(\theta_{n-1})^T D^2H(\tilde{\theta}) \nabla H(\theta_{n-1}) \\ & \leq H(\theta_{n-1}) - \frac{1}{L+1} \|\nabla H(\theta_{n-1})\|_2^2 + \frac{L}{(L+1)^2} \|\nabla H(\theta_{n-1})\|_2^2 \\ & = H(\theta_{n-1}) - \frac{1}{(L+1)^2} \|\nabla H(\theta_{n-1})\|_2^2. \end{align*}\]

This shows that \(H(\theta_n) \leq H(\theta_{n-1})\), and if \(\theta_{n-1}\) is not a stationary point, \(H(\theta_n) < H(\theta_{n-1})\). That is, the algorithm is a descent algorithm, and unless it hits a stationary point it is a strict descent algorithm.

It furthermore follows that

\[\begin{align} H(\theta_n) & = (H(\theta_n) - H(\theta_{n-1})) + (H(\theta_{n-1}) - H(\theta_{n-2})) + ... \\ & \qquad + (H(\theta_1) - H(\theta_0)) + H(\theta_0) \\ & \leq H(\theta_0) - \frac{1}{(L + 1)^2} \sum_{k=1}^n \|\nabla H(\theta_{k-1})\|_2^2. \end{align}\]

If \(H\) is bounded below, \(\sum_{k=1}^{\infty} \|\nabla H(\theta_{k-1})\|_2^2 < \infty\) and hence

\[\|\nabla H(\theta_{n})\|_2 \rightarrow 0\]

for \(n \to \infty\). For any accumulation point, \(\theta_{\infty}\), of the sequence \(\theta_n\), it follows by continuity of \(\nabla H\) that

\[\nabla H(\theta_{\infty}) = 0,\]

and \(\theta_{\infty}\) is a stationary point. If \(H\) has a *unique*
stationary point in \(\mathrm{lev}(\theta_0)\), \(\theta_{\infty}\) is a (local)
minimizer, and

\[\theta_n \rightarrow \theta_{\infty}\]

for \(n \to \infty\).

The gradient descent algorithm (7.1) is given by the map

\[\Phi_{\nabla}(\theta) = \theta - \frac{1}{L + 1} \nabla H(\theta).\]

The gradient descent map, \(\Phi_{\nabla}\), has \(\theta\) as fixed point if and only if

\[\nabla H(\theta) = 0,\]

that is, if and only if \(\theta\) is a stationary point.

We will show that \(\Phi_{\nabla}\) is a contraction on \(\Theta\) under the assumption that the eigenvalues of the (symmetric) matrix \(D^2H(\theta)\) for all \(\theta \in \Theta\) are contained in an interval \([l, L]\) with \(0 < l \leq L\). If \(\theta, \theta' \in \Theta\) we find by Taylor’s theorem that \[\nabla H(\theta) = \nabla H(\theta') + D^2H(\tilde{\theta})(\theta - \theta')\] for some \(\tilde{\theta}\) on the line between \(\theta\) and \(\theta'\). For the gradient descent map this gives that

\[\begin{align*} \|\Phi_{\nabla}(\theta) - \Phi_{\nabla}(\theta')\|_2 & = \left\|\theta - \theta' - \frac{1}{L+1}\left(\nabla H(\theta) - \nabla H(\theta')\right)\right\|_2 \\ & = \left\|\theta - \theta' - \frac{1}{L+1}\left( D^2H(\tilde{\theta})(\theta - \theta')\right)\right\|_2 \\ & = \left\|\left(I - \frac{1}{L+1} D^2H(\tilde{\theta}) \right) (\theta - \theta')\right\|_2 \\ & \leq \left(1 - \frac{l}{L + 1}\right) \|\theta - \theta'\|_2, \end{align*}\]

since the eigenvalues of \(I - \frac{1}{L+1} D^2H(\tilde{\theta})\) are all between \(1 - L/(L + 1)\) and \(1 - l/(L+1)\). This shows that \(\Phi_{\nabla}\) is a contraction for the \(2\)-norm with \(c = 1 - l/(L + 1) < 1\). From Banach’s fixed point theorem it follows that there is a unique stationary point, \(\theta_{\infty}\), and \(\theta_n \rightarrow \theta_{\infty}\). Since \(D^2H(\theta_{\infty})\) is positive definite, \(\theta_{\infty}\) is a (global) minimizer.

To summarize, if \(D^2H(\theta)\) has uniformly bounded numerical radius, and if \(H\) has a unique stationary point \(\theta_{\infty}\) in the compact set \(\mathrm{lev}(\theta_0)\), then the gradient descent algorithm is a strict descent algorithm that converges toward that (local) minimizer \(\theta_{\infty}.\) The fixed step length was key to this analysis. The upper bound on the numerical radius of \(D^2H(\theta)\) guarantees descent with a fixed step length, and the fixed step length then guarantees sufficient descent.

When the eigenvalues of \(D^2H(\theta)\) are in \([l, L]\) for \(0 < l \leq L\),
\(H\) is a *strongly convex function* with a unique global minimizer
\(\theta_{\infty}\) and all level sets \(\mathrm{lev}(\theta_0)\) compact.
Convergence can then also be established via Banach’s fixed point theorem.
The constant \(c = 1 - l/(L + 1) = 1 - \kappa^{-1}\) with \(\kappa = (L + 1) / l\)
then quantifies the rate of the convergence, as will be discussed in greater
detail in the next section. The constant \(\kappa\) is an upper bound on the
conditioning number
of the matrix \(D^2H(\theta)\) uniformly in \(\theta\), and a large value of
\(\kappa\) means that \(c\) is close to 1 and the convergence can be slow.
A large conditioning number for the second derivative indicates that the graph
of \(H\) looks like a narrow ravine, in which case we can expect the
gradient descent algorithm to be slow.

### 7.2.2 Convergence rate

When \(\theta_n = \Phi^{\circ n}(\theta_0)\) for a contraction \(\Phi\), Banach’s fixed point theorem tells us that there is a unique fixed point \(\theta_{\infty}.\) That \(\Phi\) is a contraction further implies that

\[\|\theta_n - \theta_{\infty}\| = \|\Phi(\theta_{n-1}) - \theta_{\infty}\| \leq c \|\theta_{n-1} - \theta_{\infty}\| \leq c^n \|\theta_0 - \theta_{\infty}\|.\]

That is, \(\|\theta_n - \theta_{\infty}\| \to 0\) with at least geometric rate \(c < 1\).

In the numerical optimization literature, convergence at a geometric rate is known as linear convergence (the number of correct digits increases linearly with the number of iterations), and a linearly convergent algorithm is often quantified in terms of its asymptotic convergence rate.

**Definition 7.1 **Let \((\theta_n)_{n \geq 1}\) be a convergent sequence in \(\mathbb{R}^p\) with
limit \(\theta_{\infty}\). Let \(\| \cdot \|\) be a norm on \(\mathbb{R}^p\). We say
that the convergence is linear if

\[\limsup_{n \to \infty} \frac{\|\theta_{n} - \theta_{\infty}\|}{\|\theta_{n-1} - \theta_{\infty}\|} = r\]

for some \(r \in (0, 1)\), in which case \(r\) is called the asymptotic rate.

Convergence of an algorithm can be faster or slower than linear. If

\[\limsup_{n \to \infty} \frac{\|\theta_{n} - \theta_{\infty}\|}{\|\theta_{n-1} - \theta_{\infty}\|} = 1\]

we say that the convergence is sublinear, and if

\[\limsup_{n \to \infty} \frac{\|\theta_{n} - \theta_{\infty}\|}{\|\theta_{n-1} - \theta_{\infty}\|} = 0\]

we say that the convergence is superlinear. There is also a refined notion of convergence order for superlinearly convergent algorithms that more precisely describes the convergence.

For a contraction, \(\theta_n = \Phi^{\circ n}(\theta_0)\) converges linearly or superlinearly. If it converges linearly, the asymptotic rate is bounded by \(c\), but this is a global constant and may be pessimistic. The following lemma provides a local upper bound on the asymptotic rate in terms of the derivative of \(\Phi\) in the limit \(\theta_{\infty}\).

**Lemma 7.1 **Let \(\Phi : \mathbb{R}^p \to \mathbb{R}^p\) be twice continuously differentiable. If
\(\theta_n = \Phi^{\circ n}(\theta_0)\) converges linearly toward a fixed point \(\theta_{\infty}\)
of \(\Phi\) then

\[r_{\max} = \sup_{\gamma: \|\gamma \| = 1} \|D \Phi(\theta_{\infty}) \gamma\|\]

is an upper bound on the asymptotic rate.

*Proof*. By Taylor’s theorem,

\[\begin{align*} \theta_{n} = \theta_{\infty} + D \Phi(\theta_{\infty})(\theta_{n-1} - \theta_{\infty}) + o(\|\theta_{n-1} - \theta_{\infty}\|_2). \end{align*}\]

Rearranging yields

\[\begin{align*} \limsup_{n \to \infty} \frac{\| \theta_{n} - \theta_{\infty}\|}{\| \theta_{n-1} - \theta_{\infty}\|} & = \limsup_{n \to \infty} \frac{\| D \Phi(\theta_{\infty})(\theta_{n-1} - \theta_{\infty})\|}{\| \theta_{n-1} - \theta_{\infty}\|} \\ & = \limsup_{n \to \infty} \| D \Phi(\theta_{\infty})\left(\frac{\theta_{n-1} - \theta_{\infty}}{\| \theta_{n-1} - \theta_{\infty}\|} \right)\| \\ & \leq r_{\max}. \end{align*}\]

Note that it is possible that \(r_{\max} \geq 1\) even if the convergence is linear, in which case \(r_{\max}\) is a useless upper bound. Moreover, the actual rate can depend on the starting point \(\theta_0\), and even when \(r_{\max} < 1\) it only quantifies the worst case convergence rate.

To estimate the actual convergence rate, note that linear convergence with rate \(r\) implies that for any \(\delta > 0\)

\[ \| \theta_n - \theta_{\infty} \| \leq (r + \delta) \| \theta_{n-1} - \theta_{\infty} \| \leq \ldots \leq (r + \delta)^{n - n_0} \| \theta_{n_0} - \theta_{\infty} \| \]

\(n \geq n_0\) with \(n_0\) sufficiently large (depending on \(\delta\)). That is,

\[\log \|\theta_{n} - \theta_{\infty}\| \leq n \log(r + \delta) + d,\]

for \(n \geq n_0\). In practice, we run the algorithm for \(M\) iterations, so that \(\theta_M = \theta_{\infty}\) up to computer precision, and we plot \(\log \|\theta_{n} - \theta_{M}\|\) as a function of \(n\). The decay should be approximately linear for \(n \geq n_0\) for some \(n_0\), in which case the slope is about \(\log(r) < 0\), which can then be estimated by least squares. A slower-than-linear or faster-than-linear decay indicate that the algorithm converges sublinearly or superlinearly, respectively.

It is, of course, possible to quantify convergence in terms of the sequences \(H(\theta_n)\) and \(\nabla H(\theta_n)\) instead. For a descent algorithm with \(H(\theta_n) \rightarrow H(\theta_\infty)\) linearly for \(n \to \infty\), we can plot

\[\log(H(\theta_n) - H(\theta_M))\]

as a function of \(n\), and use least squares to estimate the asymptotic rate of convergence of \(H(\theta_n)\).

Using \(\nabla H(\theta_n)\) to quantify convergence is particularly appealing as we know that the limit is \(0\). This also means that we can monitor its convergence while the algorithm is running and not only after it has converged. If \(\nabla H(\theta_n)\) converges linearly in the norm \(\|\cdot\|\), that is,

\[\limsup_{n \to \infty} \frac{\|\nabla H(\theta_n)\|}{\|\nabla H(\theta_{n-1})\|} = r,\]

we have for any \(\delta > 0\) that

\[\log \|\nabla H(\theta_n)\| \leq n \log(r + \delta) + d\]

for \(n \geq n_0\) and \(n_0\) sufficiently large. Again, least squares can be used to estimate the asymptotic rate of convergence of \(\|\nabla H(\theta_n)\|\).

The concepts of linear convergence and asymptotic rate are, unfortunately, not independent of the norm used, nor does linear convergence of \(\theta_n\) in \(\|\cdot\|\) imply linear convergence of \(H(\theta_n)\) or \(\|\nabla H(\theta_n)\|\). We can show, though, that if \(\theta_n\) converges linearly in the norm \(\| \cdot \|\), the other two sequences converge linearly along an arithmetic subsequence. Similarly, it can be shown that in any other norm than \(\| \cdot \|\), \(\theta_n\) is also linearly convergent along an arithmetic subsequence.

**Theorem 7.1 **Suppose that \(H\) is three times continuously
differentiable with \(\nabla H (\theta_{\infty}) = 0\) and \(D^2 H(\theta_{\infty})\)
positive definite. If \(\theta_n\) converges linearly toward \(\theta_{\infty}\)
in any norm then for some \(k \geq 1\), the subsequences \(H(\theta_{nk})\) and
\(\|\nabla H(\theta_{nk})\|\) converge linearly toward \(H(\theta_{\infty})\)
and \(0\), respectively.

*Proof*. Let \(G = D^2 H(\theta_{\infty})\) and \(g_n = \theta_n - \theta_{\infty}\).
Since \(G\) is positive definite it defines a norm, and since all norms
on \(\mathbb{R}^p\) are equivalent there are constants \(a, b > 0\) such that

\[a \| x \|^2 \leq x^T G x \leq b \| x \|^2.\]

By Taylor’s theorem,

\[H(\theta_{(n + 1)k}) - H(\theta_{\infty}) = g_{(n + 1)k}^T G g_{(n + 1)k} + o(\|g_{(n + 1)k}\|^2).\]

From this,

\[\begin{align*} \limsup_{n \to \infty} \frac{H(\theta_{(n + 1)k}) - H(\theta_{\infty})}{H(\theta_{nk}) - H(\theta_{\infty})} & = \limsup_{n \to \infty} \frac{g_{(n + 1)k}^T G g_{(n + 1)k}}{g_{nk}^T G g_{nk}} \\ & \leq \limsup_{n \to \infty} \frac{b \| g_{(n + 1)k}\|^2}{a \| g_{nk}\|^2} = \frac{b}{a} r^{2k} \end{align*}\]

where \(r \in (0, 1)\) is the asymptotic rate of \(\theta_n \to \theta_{\infty}\). By choosing \(k\) sufficiently large, the right hand side above is strictly smaller than 1, which gives an upper bound on the rate along the subsequence \(H(\theta_{nk})\). A similar argument gives a lower bound strictly larger than 0, which shows that the convergence is not superlinear.

Using Taylor’s theorem on \(\nabla H(\theta_{nk})\) yields a similar proof for the subsequence \(\|\nabla H(\theta_{nk})\|\) – with \(G\) replaced by \(G^2\) in the bounds.

The bounds in the proof are often pessimistic since \(g_n\) and \(g_{n+1}\) can point in completely different directions. Exercise ?? shows, however, that if \(\theta_n\) approaches \(\theta_{\infty}\) nicely along a fixed direction (making \(g_n\) and \(g_{n+1}\) almost collinear), \(H(\theta_n)\) as well as \(\|\nabla H(\theta_n)\|\) inherit linear convergence from \(\theta_n\) and even with the same rate.

All rates discussed hithereto are *per iteration*,
which is natural when investigating a single algorithm. However, different
algorithms may spend different amounts of time per iteration, and it does not
make sense to make a comparison of per iteration rates across different
algorithms. We therefore need to be able to convert between per iteration
and per time unit rates. If one iteration takes \(\delta\) time units (seconds, say)
the *per time unit* rate is

\[r^{1/\delta}.\]

If \(t_n\) denotes the runtime of \(n\) iterations, we could estimate \(\delta\) as \(t_M / M\) for \(M\) sufficiently large.

We will throughout systematically investigate convergence as a function of time \(t_n\) instead of iterations \(n\), and we will estimate rates per time unit directly by least squares regression on \(t_n\) instead of \(n\).

### 7.2.3 Stopping criteria

One important practical question remains unanswered even if we understand the theoretical convergence properties of an algorithm well and have good methods for measuring its convergence rates. No algorithm can run for an infinite number of iterations, thus all algorithms need a criterion for when to stop, but the choice of an appropriate stopping criterion is notoriously difficult. We present four of the most commonly used criteria and discuss benefits and deficits for each of them.

**Maximum number of iterations:** Stop when

\[n = M\]

for a fixed maximum number of iterations \(M\). This is arguably the simplest criterion, but, obviously, reaching a maximum number of iterations provides no evidence in itself that \(H(\theta_M)\) is sufficiently close to a (local) minimum. For a specific problem we could from experience know of a sufficiently large \(M\) so that the algorithm has typically converged after \(M\) iterations, but the most important use of this criterion is in combination with another criterion so that it works as a safeguard against an infinite loop.

The other three stopping criteria all depend on choosing a
*tolerance parameter* \(\varepsilon > 0\), which will play different roles in the three
criteria. The three criteria can be used individually and in combinations,
but unfortunately neither of them nor their combination provide convergence
guarantees. It is nevertheless common say that an algorithm “has converged”
when the stopping criterion is satisfied, but since none of the criteria
are sufficient for convergence this can be a bit misleading.

**Small relative change:** Stop when

\[\|\theta_n - \theta_{n-1}\| \leq \varepsilon(\|\theta_n\| + \varepsilon).\]

The idea is that when \(\theta_n \simeq \theta_{n-1}\) the sequence has approximately reached the limit \(\theta_{\infty}\) and we can stop. It is possible to use an absolute criterion such as \(\|\theta_n - \theta_{n-1}\| < \varepsilon\), but then the criterion would be sensitive to a rescaling of the parameters. Thus fixing a reasonable tolerance parameter across many problems makes more sense for the relative then the absolute criterion. The main reason for adding \(\varepsilon\) on the right hand size is to make the criterion well behaved even if \(\|\theta_n\|\) is close to zero.

The main benefit of this criterion is that it does not require evaluations of the objective function. Some algorithms, such as the EM-algorithm, do not need to evaluate the objective function, and it may even be difficult to do so. In those cases this criterion can be used. The main deficit is that a single iteration with little change in the parameter can happen for many reasons besides convergence, and it does not imply that neither \(\|\theta_n - \theta_{\infty}\|\) nor \(H(\theta_{n}) - H(\theta_{\infty})\) are small.

**Small relative descent:** Stop when

\[H(\theta_{n-1}) - H(\theta_n) \leq \varepsilon (|H(\theta_n)| + \varepsilon).\]

This criterion only makes sense if the algorithm is a descent algorithm. As discussed above, an absolute criterion would be sensitive to rescaling of the objective function, and the added \(\varepsilon\) on the right hand side is to ensure a reasonable behavior if \(H(\theta_n)\) is close to zero.

This criterion is natural for descent algorithms; we stop when the algorithm does not decrease the value of the objective function sufficiently. The use of a relative criterion makes is possible to choose a tolerance parameter that works reasonably well for many problems. A conventional choice is \(\varepsilon \simeq 10^{-8}\) (and often chosen as the square root of the machine epsilon), though the theoretical support for this choice is weak. The deficit of the algorithm is as for the criterion above: a small descent does not imply that \(H(\theta_{n}) - H(\theta_{\infty})\) is small, and it could happen if the algorithm enters a part of \(\Theta\) where \(H\) is very flat, say.

**Small gradient:** Stop when

\[\|\nabla H(\theta_n)\| \leq \varepsilon.\]

This criterion directly measures if \(\theta_n\) is close to being a stationary point (and not if \(\theta_n\) is close to a stationary point). A small value of \(\|\nabla H(\theta_n)\|\) is still no guarantee that \(\|\theta_n - \theta_{\infty}\|\) or \(H(\theta_{n}) - H(\theta_{\infty})\) are small. The criterion also requires the computation of the gradient. In addition, different norms can be used, and if different coordinates of the gradient are of different orders of magnitude this should be reflected by the norm. Alternatively, the parameters should be rescaled.

Neither of the four criteria above gives a theoretical convergence guarantee in terms of how close \(H(\theta_{n})\) is to \(H(\theta_{\infty})\) or how close \(\theta_n\) is to \(\theta_{\infty}\). In some special cases it is, however, possible to make stronger statements. If \(H\) is , there is a constant \(C\) such that

\[\|\theta_n - \theta_{\infty}\| \leq C \|\nabla H(\theta_n)\|,\]

and the small gradient stopping criterion will, actually, imply that \(\theta_n\) is close to \(\theta_{\infty}\)—if \(\epsilon\) is small relative to \(C\). Another approach is based on a lower envelope. If there is a continuous function \(\tilde{H}\) satisfying \(H(\theta_{\infty}) = \tilde{H}(\theta_{\infty})\) and

\[H(\theta_{\infty}) \geq \tilde{H}(\theta)\]

for all \(\theta \in \Theta\) (or just in \(\mathrm{lev}(\theta_0)\) for a descent algorithm) then

\[0 \leq H(\theta_{n}) - H(\theta_{\infty}) \leq H(\theta_{n}) - \tilde{H}(\theta_{n}),\]

and the right hand side directly quantifies convergence. Convex duality theory gives for many convex optimization problems such a function, \(\tilde{H}\), and in those cases a convergence criterion based on \(H(\theta_{n}) - \tilde{H}(\theta_{n})\) actually comes with a theoretical guarantee on how close we are to the minimum. We will not pursue the necessary convex duality theory here, but it is useful to know that in some cases we can do better than the ad hoc criteria above.

## 7.3 Descent direction algorithms

The negative gradient of \(H\) in \(\theta\) is the direction of steepest descent. Since the goal is to minimize \(H\), it is natural to move away from \(\theta\) in the direction of \(-\nabla H(\theta)\). However, other directions than the negative gradient can also be suitable descent directions.

**Definition 7.2 **A *descent direction* in \(\theta\) is a vector \(\rho \in \mathbb{R}^p\) such that

\[ \nabla H(\theta)^T \rho < 0. \]

When \(\theta\) is not a stationary point,

\[ \nabla H(\theta)^T(-\nabla H(\theta)) = - \| \nabla H(\theta)^T \|_2^2 < 0 \]

and \(-\nabla H(\theta)^T\) is a descent direction in \(\theta\) according to the definition.

Given \(\theta_n\) and a descent direction \(\rho_n\) in \(\theta_n\) we can define

\[ \theta_{n+1} = \theta_n + \gamma \rho_n \]

for a suitably chosen \(\gamma > 0\). By Taylor’s theorem

\[ H(\theta_{n+1}) = H(\theta_n) + \gamma \nabla H(\theta_n) \rho_n + o(\gamma), \]

which means that

\[ H(\theta_{n+1}) < H(\theta_n) \]

if \(\gamma\) is small enough.

One strategy for choosing \(\gamma\) is to minimize the univariate function

\[ \gamma \mapsto H(\theta_n + \gamma \rho_n), \]

which is an example of a *line search* method. Such a minimization
would give the maximal possible descent in the direction \(\rho_n\),
and as we have argued, if \(\rho_n\) is a descent direction, a minimizer \(\gamma > 0\)
guarantees descent of \(H\). However, unless the minimization can be
done analytically it is often computationally too expensive.
Less will also do, and as shown in Section 7.2.1,
if the Hessian has uniformly bounded numerical radius it is possible to
fix one (sufficiently small) step length that will guarantee descent.

### 7.3.1 Line search

We consider algorithms of the form

\[ \theta_{n+1} = \theta_n + \gamma_{n} \rho_n \]

starting in \(\theta_n\) and with \(\rho_n\) a descent direction in \(\theta_n\). The step lengths, \(\gamma_n\), should be chosen so as to give sufficient descent in each iteration.

The function \(h(\gamma) = H(\theta_{n} + \gamma \rho_{n})\) is a univariate and differentiable function,

\[ h : [0,\infty) \to \mathbb{R}, \]

that gives the value of \(H\) in the descent direction \(\rho_n\). We find that

\[ h'(\gamma) = \nabla H(\theta_{n} + \gamma \rho_{n})^T \rho_{n}, \]

and the maximal descent in direction \(\rho_n\) can be found by solving \(h'(\gamma) = 0\) for \(\gamma\). As mentioned above, less will do. First note that

\[ h'(0) = \nabla H(\theta_{n})^T \rho_{n} < 0, \]

so \(h\) has a negative slope in \(0\). It descents in a sufficiently small interval \([0, \varepsilon)\), and it is even true that for any \(c \in (0, 1)\) there is an \(\varepsilon > 0\) such that

\[ h(\gamma) \leq h(0) + c \gamma h'(0) \]

for \(\gamma \in [0, \varepsilon)\). We note that this inequality can
be checked easily for any given \(\gamma > 0\), and is known as the
*sufficient descent* condition. Sufficient descent is not enough
in itself as the step length could be arbitrarily small, and the algorithm
could effectively get stuck.

To prevent too small steps we can enforce another condition. Very close
to \(0\), \(h\) will have almost the same slope, \(h'(0)\), as it has in \(0\). If we
therefore require that the slope in \(\gamma\) should be larger than \(\tilde{c} h'(0)\)
for some \(\tilde{c} \in (0, 1)\), \(\gamma\) is forced away from \(0\). This is
known as the *curvature condition*.

The combined conditions on \(\gamma\),

\[ h(\gamma) \leq h(0) + c \gamma h'(0) \]

for a \(c \in (0, 1)\) and

\[ h'(\gamma) \geq \tilde{c} h'(0) \]

for a \(\tilde{c} \in (c, 1)\) are known collectively as
the *Wolfe conditions*. It can be shown that if \(h\) is bounded below there
exists a step length satisfying the Wolfe conditions (Lemma 3.1 in Nocedal and Wright (2006)).

Even when choosing \(\gamma_{n}\) to fulfill the Wolfe conditions there is no guarantee that \(\theta_n\) will converge let alone converge toward a global minimizer. The best we can hope for in general is that

\[ \|\nabla H(\theta_n)\|_2 \rightarrow 0 \]

for \(n \to \infty\), and this will happen under some relatively weak conditions on \(H\) (Theorem 3.2 Nocedal and Wright (2006)) under the assumption that

\[ \frac{\nabla H(\theta_n)^T \rho_n}{\|\nabla H(\theta_n)\|_2 \| \rho_n\|_2} \leq - \delta < 0. \]

That is, the angle between the descent direction and the gradient should be uniformly bounded away from \(90^{\circ}\).

A practical way of searching for a step length is via *backtracking*.
Choosing a \(\gamma_0\) and a constant \(d \in (0, 1)\) we
can search through the sequence of step lengths

\[ \gamma_0, d \gamma_0, d^2 \gamma_0, d^3 \gamma_0, \ldots \]

and stop the first time we find a step length satisfying the Wolfe conditions.

Using backtracking, we can actually dispense of the curvature condition and simply check the sufficient descent condition

\[ H(\theta_{n} + d^k \gamma_0 \rho_{n}) \leq H(\theta_n) + cd^k \gamma_0 \nabla H(\theta_{n})^T \rho_{n} \]

for \(c \in (0, 1)\). The implementation of backtracking requires the choice of the three parameters: \(\gamma_0 > 0\), \(d \in (0, 1)\) and \(c \in (0, 1)\). A good choice depends quite a lot on the algorithm used for choosing the descent direction, but choosing \(c\) too close to 1 can make the algorithm take too small steps, and taking \(d\) too small can likewise generate small step lengths. Thus \(d = 0.8\) or \(d = 0.9\) and \(c = 0.1\) or even smaller are sensible choices. For some algorithms, like the Newton algorithm to be dealt with below, there is a natural choice of \(\gamma_0 = 1\). But for other algorithms a good choice depends crucially on the scale of the parameters, and there is then no general advice on choosing \(\gamma_0\).

### 7.3.2 Gradient descent

We implement gradient descent with backtracking below as the function `GD()`

.
For gradient descent, the sufficient descent condition amounts to
choosing the smallest \(k \geq 0\) such that

\[H(\theta_{n} - d^k \gamma_0 \nabla H(\theta_{n})) \leq H(\theta_n) - cd^k \gamma_0 \|\nabla H(\theta_{n})\|_2^2.\]

The `GD()`

function takes the starting point, \(\theta_0\), the objective function,
\(H\), and its gradient, \(\nabla H\), as arguments. The four parameters \(d\), \(c\), \(\gamma_0\) and
\(\varepsilon\) that control the algorithm can also be specified as additional arguments,
but are given some reasonable default values. The implementation uses the *squared* norm of
the gradient as a stopping criterion, but it also has a maximum number of
iterations as a safeguard. Note that if the maximum number is reached, a
warning is printed.

Finally, we include a callback argument (the `cb`

argument).
If a function is passed to this argument, it will be evaluated in each iteration
of the algorithm. This gives us the possibility of logging or
printing values of variables during evaluation, which can be highly useful
for understanding the inner workings of the algorithm. Monitoring or logging
intermediate values during the evaluation of code is referred to as
*tracing*.

```
GD <- function(
par,
H,
gr,
d = 0.8,
c = 0.1,
gamma0 = 0.01,
epsilon = 1e-4,
maxit = 1000,
cb = NULL
) {
for(i in 1:maxit) {
value <- H(par)
grad <- gr(par)
h_prime <- sum(grad^2)
if(!is.null(cb)) cb()
# Convergence criterion based on gradient norm
if(h_prime <= epsilon) break
gamma <- gamma0
# Proposed descent step
par1 <- par - gamma * grad
# Backtracking while descent is insufficient
while(H(par1) > value - c * gamma * h_prime) {
gamma <- d * gamma
par1 <- par - gamma * grad
}
par <- par1
}
if(i == maxit)
warning("Maximum number, ", maxit, ", of iterations reached")
par
}
```

We will use the Poisson regression example to illustrate the use of gradient descent and other optimization algorithms, and we need to implement functions in R for computing the negative log-likelihood and its gradient.

The implementation below uses a function factory to produce a list containing a parameter vector, the negative log-likelihood function and its gradient. We anticipate that for the Newton algorithm in Section 7.4 we also need an implementation of the Hessian, which is thus included here as well.

We will exploit the `model.matrix()`

function to construct the model matrix
from the data via a formula, and the sufficient statistic is also
computed. The implementations use linear algebra and vectorized computations
relying on access to the model matrix and the sufficient statistics in
their enclosing environment.

```
poisson_model <- function(form, data, response) {
X <- model.matrix(form, data)
y <- data[[response]]
# The function drop() drops the dim attribute and turns, for instance,
# a matrix with one column into a vector
t_map <- drop(crossprod(X, y)) # More efficient than drop(t(X) %*% y)
n <- nrow(X)
p <- ncol(X)
H <- function(beta)
drop(sum(exp(X %*% beta)) - beta %*% t_map) /n
grad_H <- function(beta)
(drop(crossprod(X, exp(X %*% beta))) - t_map) / n
Hessian_H <- function(beta)
crossprod(X, drop(exp(X %*% beta)) * X) / n
list(par = rep(0, p), H = H, grad_H = grad_H, Hessian_H = Hessian_H)
}
```

We choose to normalize the log-likelihood by the number of observations \(n\) (the number of rows in the model matrix). This does have a small computational cost, but the resulting numerical values become less dependent upon \(n\), which makes it easier to choose sensible default values of various parameters for the numerical optimization algorithms. Gradient descent is very slow for the large Poisson model with individual store effects, so we consider only the simple model with two parameters.

```
veg_pois <- poisson_model(
~ log(normalSale),
vegetables,
response = "sale"
)
```

`pois_GD <- GD(veg_pois$par, veg_pois$H, veg_pois$grad_H)`

The gradient descent implementation is tested by comparing the minimizer to
the estimated parameters as computed by `glm()`

.

`rbind(pois_glm = coefficients(pois_model_null), pois_GD)`

```
## (Intercept) log(normalSale)
## pois_glm 1.461 0.9216
## pois_GD 1.460 0.9219
```

We get the same result up to the first two decimals. The convergence criterion on our gradient descent algorithm was quite loose (\(\varepsilon = 10^{-4}\), which means that the norm of the gradient is smaller than \(10^{-2}\) when the algorithm stops). This choice of \(\varepsilon\) in combination with \(\gamma_0 = 0.01\) implies that the algorithm stops when the gradient is so small that the changes are at most of norm \(10^{-4}\).

Comparing the resulting values of the negative log-likelihood shows agreement
up to the first five decimals, but we notice that the negative log-likelihood
for the parameters fitted using `glm()`

is just slightly smaller.

```
veg_pois$H(coefficients(pois_model_null))
veg_pois$H(pois_GD)
```

```
## [1] -124.406827879897
## [1] -124.406825325047
```

To investigate what actually went on inside the gradient descent
algorithm we will use the callback argument to trace the internals
of a call to `GD()`

. The `tracer()`

function from the CSwR package
can be used to construct a tracer object with a `tracer()`

function that we can pass as the callback
argument. The tracer object and its `tracer()`

function work by storing
information in the enclosing environment of `tracer()`

. When used as
the callback argument to, e.g., `GD()`

the `tracer()`

function will look up
variables in the evaluation environment of `GD()`

, store them and
print them if requested, and store runtime information as well.

When `GD()`

has returned, the trace information can be accessed
via the `summary()`

method for the tracer object. The tracer objects
and their `tracer()`

function should not be confused with the `trace()`

function from the R base package, but the tracer object’s `tracer()`

function
can be passed as the `tracer`

argument to `trace()`

to interactively
inject tracing code into any R function. Here, the tracer objects will
only be used together with a callback argument.

We use the tracer object with our gradient descent implementation and print trace information every 50th iteration.

```
GD_tracer <- tracer(c("value", "h_prime", "gamma"), N = 50)
pois_GD <- GD(
veg_pois$par,
veg_pois$H,
veg_pois$grad_H,
cb = GD_tracer$tracer
)
```

```
## n = 1: value = 1; h_prime = 14269; gamma = NA;
## n = 50: value = -123.9; h_prime = 15.46; gamma = 0.004096;
## n = 100: value = -124.3; h_prime = 3.134; gamma = 0.00512;
## n = 150: value = -124.4; h_prime = 0.6014; gamma = 0.00512;
## n = 200: value = -124.4; h_prime = 0.09806; gamma = 0.00512;
## n = 250: value = -124.4; h_prime = 0.01097; gamma = 0.00512;
## n = 300: value = -124.4; h_prime = 0.002376; gamma = 0.00512;
## n = 350: value = -124.4; h_prime = 0.0002165; gamma = 0.004096;
```

We see that the gradient descent algorithm runs for a little more than 350 iterations, and we can observe how the value of the negative log-likelihood is descending. We can also see that the step length \(\gamma\) bounces between \(0.004096 = 0.8^4 \times 0.01\) and \(0.00512 = 0.8^3 \times 0.01\), thus the backtracking takes 3 to 4 iterations to find a step length with sufficient descent.

The printed trace does not reveal the runtime information. The runtime is measured for each iteration of the algorithm and the cumulative runtime is of greater interest. This information can be computed and inspected after the algorithm has converged, and it is returned by the summary method for tracer objects.

```
## value h_prime gamma .time
## 372 -124.4 1.126e-04 0.005120 0.02222
## 373 -124.4 1.219e-04 0.005120 0.02227
## 374 -124.4 1.324e-04 0.005120 0.02233
## 375 -124.4 1.442e-04 0.005120 0.02238
## 376 -124.4 1.575e-04 0.005120 0.02244
## 377 -124.4 7.601e-05 0.004096 0.02250
```

The trace information is stored in a list. The summary method transforms the trace information into a data frame with one row per iteration. We can also access individual entries of the list of trace information via subsetting.

`GD_tracer[377] `

```
## $value
## [1] -124.4
##
## $h_prime
## [1] 7.601e-05
##
## $gamma
## [1] 0.004096
##
## $.time
## [1] 6.056e-05
```

### 7.3.3 Conjugate gradients

The gradient direction is not the best descent direction. It is too
local, and convergence can be quite slow. One of the better algorithms that
is still a *first order algorithm* (using only gradient information) is the nonlinear conjugate
gradient algorithm.
In the Fletcher–Reeves version of the algorithm
the descent direction is initialized as the negative gradient
\(\rho_0 = - \nabla H(\theta_{0})\) and then updated as

\[ \rho_{n} = - \nabla H(\theta_{n}) + \frac{\|\nabla H(\theta_n)\|_2^2}{\|\nabla H(\theta_{n-1})\|_2^2} \rho_{n-1}. \]

That is, the descent direction, \(\rho_{n}\), is the negative gradient but modified according to the previous descent direction. There is plenty of opportunity to vary the prefactor of \(\rho_{n-1}\), and the one presented here is what makes it the Fletcher–Reeves version. Other versions go by the names of their inventors such as Polak–Ribière or Hestenes–Stiefel.

In fact, \(\rho_{n}\) need not be a descent direction unless we put some
restrictions on the step lengths. One possibility is to require that
the step length \(\gamma_{n}\) satisfies the *strong* curvature condition

\[ |h'(\gamma)| = |\nabla H(\theta_n + \gamma \rho_n)^T \rho_n | \leq \tilde{c} |\nabla H(\theta_n)^T \rho_n| = \tilde{c} |h'(0)| \]

for a \(\tilde{c} < \frac{1}{2}\). Then \(\rho_{n + 1}\) can be shown to be a descent direction if \(\rho_{n}\) is.

We implement the conjugate gradient method in a slightly different way. Instead of introducing the more advanced curvature condition, we simply reset the algorithm to use the gradient direction in any case where a non-descent direction has been chosen. Resets of descent direction every \(p\)-th iteration is recommended anyway for the nonlinear conjugate gradient algorithm.

```
CG <- function(
par,
H,
gr,
d = 0.8,
c = 0.1,
gamma0 = 0.05,
epsilon = 1e-10,
maxit = 10000,
cb = NULL
) {
p <- length(par)
m <- 1
rho0 <- numeric(p)
for(i in 1:maxit) {
value <- H(par)
grad <- gr(par)
grad_norm_sq <- sum(grad^2)
if(!is.null(cb)) cb()
if(grad_norm_sq <= epsilon) break
gamma <- gamma0
# Descent direction
rho <- - grad + grad_norm_sq * rho0
h_prime <- drop(t(grad) %*% rho)
# Reset to gradient descent if m > p or rho is not a descent direction
if(m > p || h_prime >= 0) {
rho <- - grad
h_prime <- - grad_norm_sq
m <- 1
}
par1 <- par + gamma * rho
# Backtracking
while(H(par1) > value + c * gamma * h_prime) {
gamma <- d * gamma
par1 <- par + gamma * rho
}
rho0 <- rho / grad_norm_sq
par <- par1
m <- m + 1
}
if(i == maxit)
warning("Maximum number, ", maxit, ", of iterations reached")
par
}
```

```
CG_tracer <- tracer(c("value", "gamma", "grad_norm_sq"), N = 25)
pois_CG <- CG(
veg_pois$par,
veg_pois$H,
veg_pois$grad_H,
cb = CG_tracer$tracer
)
```

```
## n = 1: value = 1; gamma = NA; grad_norm_sq = 14269;
## n = 25: value = -124.29; gamma = 0.004295; grad_norm_sq = 16.782;
## n = 50: value = -124.4; gamma = 0.02048; grad_norm_sq = 0.27642;
## n = 75: value = -124.41; gamma = 0.004295; grad_norm_sq = 0.0061342;
## n = 100: value = -124.41; gamma = 0.02048; grad_norm_sq = 0.0001485;
## n = 125: value = -124.41; gamma = 0.004295; grad_norm_sq = 2.7515e-06;
## n = 150: value = -124.41; gamma = 0.0256; grad_norm_sq = 1.4018e-07;
## n = 175: value = -124.41; gamma = 0.004295; grad_norm_sq = 1.472e-09;
```

This algorithm is fast enough to fit the large Poisson regression model.

```
veg_pois_large <- poisson_model(
~ store + log(normalSale) - 1,
vegetables,
response = "sale"
)
```

```
CG_tracer <- tracer(c("value", "gamma", "grad_norm_sq"), N = 400)
pois_CG <- CG(
veg_pois_large$par,
veg_pois_large$H,
veg_pois_large$grad_H,
cb = CG_tracer$tracer
)
```

```
## n = 1: value = 1; gamma = NA; grad_norm_sq = 12737;
## n = 400: value = -128.4; gamma = 0.004295; grad_norm_sq = 1.4219;
## n = 800: value = -128.57; gamma = 0.0083886; grad_norm_sq = 0.088079;
## n = 1200: value = -128.59; gamma = 0.0014074; grad_norm_sq = 0.025125;
## n = 1600: value = -128.59; gamma = 0.0014074; grad_norm_sq = 2.4173e-06;
## n = 2000: value = -128.59; gamma = 0.004295; grad_norm_sq = 1.3028e-05;
## n = 2400: value = -128.59; gamma = 0.0011259; grad_norm_sq = 3.9599e-07;
```

```
## value gamma grad_norm_sq .time
## 2745 -128.6 0.050000 7.939e-10 9.792
## 2746 -128.6 0.020480 7.174e-09 9.794
## 2747 -128.6 0.001126 1.096e-08 9.800
## 2748 -128.6 0.016384 6.082e-09 9.802
## 2749 -128.6 0.002199 8.983e-10 9.806
## 2750 -128.6 0.002199 9.958e-11 9.811
```

Using `optim()`

with the conjugate gradient method.

```
bench::bench_time(
pois_optim_CG <- optim(
veg_pois_large$par,
veg_pois_large$H,
veg_pois_large$grad_H,
method = "CG",
control = list(maxit = 10000)
)
)
```

```
## process real
## 6.41s 6.48s
```

`pois_optim_CG[c("value", "counts")]`

```
## $value
## [1] -128.6
##
## $counts
## function gradient
## 11586 4952
```

## 7.4 Newton-type algorithms

The Newton algorithm is similar to gradient descent except that the gradient is replaced by

\[ \rho_n = - D^2 H(\theta_n)^{-1} \nabla H(\theta_n) \]

as a descent direction.

The Newton algorithm is typically much more efficient than gradient descent per iteration and will converge in few iterations. However, the storage of the \(p \times p\) Hessian, its computation, and the solution of the equation to compute \(\rho_n\) all scale like \(p^2\) and this can make the algorithm useless for large \(p\).

The standard Newton algorithm for a general objective function is implemented
in R via the function `nlm()`

. Note that its interface differs from `optim()`

in terms of the order and type of the arguments, and the gradient
(and Hessian) are passed to `nlm()`

as attributes to the objective function
rather than as arguments to `nlm()`

itself. For non-linear least squares
problems the `nls()`

function (with yet another interface) should be
considered.

We implement a standard Newton algorithm with backtracking below, which includes a
callback option allowing us to trace the internals of the algorithm. It is a
straightforward modification of `GD()`

where we add the Hessian to the argument
list and change the descent direction.

```
Newton <- function(
par,
H,
gr,
hess,
d = 0.8,
c = 0.1,
gamma0 = 1,
epsilon = 1e-10,
maxit = 50,
cb = NULL
) {
for(i in 1:maxit) {
value <- H(par)
grad <- gr(par)
grad_norm_sq <- sum(grad^2)
if(!is.null(cb)) cb()
if(grad_norm_sq <= epsilon) break
Hessian <- hess(par)
rho <- - drop(solve(Hessian, grad))
gamma <- gamma0
par1 <- par + gamma * rho
h_prime <- t(grad) %*% rho
while(H(par1) > value + c * gamma * h_prime) {
gamma <- d * gamma
par1 <- par + gamma * rho
}
par <- par1
}
if(i == maxit)
warning("Maximum number, ", maxit, ", of iterations reached")
par
}
```

There exists a variety of alternatives to the Newton algorithm that replace the Hessian by another matrix that can be easier to compute and update. That is, the descent directions are of the form

\[ \rho_n = - B_n \nabla H(\theta_n) \]

for a \(p \times p\) matrix \(B_n\). With \(B_n = D^2 H(\theta_n)^{-1}\) we get the standard Newton algorithm. Whether \(\rho_n\) is actually a descent direction depends on \(B_n\). A sufficient condition is that \(B_n\) is a positive definite matrix.

### 7.4.1 Poisson regression

We reconsider the Poisson regression model, `pois_model`

, from Section
6.3, which was also considered in Section 7.3.3.
There we used conjugate gradient descent to estimate the 353 parameters.
Recall that the `veg_pois_large`

list contains all the functions needed to
run the Newton algorithm—including the Hessian matrix.

```
Newton_tracer <- tracer(c("value", "gamma", "grad_norm_sq"), N = 0)
pois_Newton <- Newton(
veg_pois_large$par,
veg_pois_large$H,
veg_pois_large$grad_H,
veg_pois_large$Hessian_H,
cb = Newton_tracer$tracer
)
```

We compare the resulting parameter estimates to those in `pois_model`

,
computed using `glm.fit()`

.

`range(pois_Newton - coefficients(pois_model))`

`## [1] -4.979e-10 1.299e-06`

The differences between all parameters are very small. We can also compare the corresponding values of the objective function, and we include in this comparison the value found using conjugate gradient.

```
rbind(
pois_CG = veg_pois_large$H(pois_CG),
pois_Newton = veg_pois_large$H(pois_Newton),
pois_glm = veg_pois_large$H(coefficients(pois_model))
)
```

```
## [,1]
## pois_CG -128.58945047360265335
## pois_Newton -128.58945047446982812
## pois_glm -128.58945047447090815
```

We see that these values are also extremely close with the first 12 digits identical.

We finally study the trace information from running our Newton algorithm.

`summary(Newton_tracer)`

```
## value gamma grad_norm_sq .time
## 1 1.00 NA 1.274e+04 0.0000
## 2 -14.83 0.02252 4.475e+04 0.2013
## 3 -64.82 0.26214 8.476e+04 0.3465
## 4 -111.34 1.00000 1.144e+04 0.4904
## 5 -124.25 1.00000 9.723e+02 0.6324
## 6 -127.71 1.00000 8.048e+01 0.7748
## 7 -128.50 1.00000 5.133e+00 1.1225
## 8 -128.59 1.00000 9.585e-02 1.2767
## 9 -128.59 1.00000 7.168e-05 1.4207
## 10 -128.59 1.00000 4.645e-11 1.5631
```

We note the relatively few iterations used (10 in this case) to reach the stringent stopping criterion that the squared gradient norm is less than \(10^{-10}\). This took less than \(2\) seconds. Compare this to the \(12\)–\(15\) seconds it took the conjugate gradient algorithm to reach a similarly accurate value of the minimizer. By inspecting the trace information (not shown) for conjugate gradient it turns out that after the first \(1\)–\(1.5\) second (about 250 iterations) it reached roughly the same accuracy as Newton did in its first \(7\) iterations. Thereafter, conjugate gradient spend the rest of the time improving the accuracy slowly while Newton in just \(3\) additional iterations converged to a very accurate minimizer. This is a characteristic behavior of Newton, which has superlinear (in fact, quadratic) convergence.

It is natural to ask if the R function `glm.fit()`

, which also uses a Newton
algorithm (without backtracking), is faster than our implementation.

```
env_pois <- environment(veg_pois_large$H)
bench::bench_time(
glm.fit(env_pois$X, env_pois$y, family = poisson())
)
```

```
## process real
## 318ms 317ms
```

Using `glm.fit()`

is about a factor five faster on this example than
our Newton algorithm—but we should always be very careful when comparing
the runtimes of two different optimization algorithms. In this case they
have achieved about the same numerical precision, and by the comparison
above of the optimized value of the objective function,
the faster `glm.fit()`

even obtained the smallest negative
log-likelihood of the two.

### 7.4.2 Quasi-Newton algorithms

We turn to other descent direction algorithms that are more efficient than gradient descent by choosing the descent direction in a more clever way but less computationally demanding than the Newton algorithm that requires the computation of the full Hessian in each iteration.

We will only consider the application of the BFGS algorithm via the
implementation in the R function `optim()`

.

```
system.time(
pois_BFGS <- optim(
veg_pois_large$par,
veg_pois_large$H,
veg_pois_large$grad_H,
method = "BFGS",
control = list(maxit = 10000)
)
)
```

```
## user system elapsed
## 0.291 0.155 0.799
```

`range(pois_BFGS$par - coefficients(pois_model))`

`## [1] -0.00955 0.08481`

`pois_BFGS[c("value", "counts")]`

```
## $value
## [1] -128.6
##
## $counts
## function gradient
## 158 148
```

### 7.4.3 Sparsity

One of the benefits of the implementations of \(H\) and its derivatives as well as of the descent algorithms is that they can exploit sparsity of \(\mathbf{X}\) almost for free. The implementations have not done that in previous computations, because \(\mathbf{X}\) has been stored as a dense matrix. In reality, \(\mathbf{X}\) is a very sparse matrix (the vast majority of the matrix entries are zero), and if we convert it into a sparse matrix, all the matrix-vector products will be more runtime efficient. Sparse matrices are implemented in the R package Matrix.

```
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
```

Without changing any other code, we get an immediate
runtime improvement using, e.g., `optim()`

and the BFGS algorithm.

```
system.time(
pois_BFGS_sparse <- optim(
veg_pois_large$par,
veg_pois_large$H,
veg_pois_large$grad_H,
method = "BFGS",
control = list(maxit = 10000)
)
)
```

```
## user system elapsed
## 0.046 0.001 0.050
```

We should in real applications avoid constructing a dense intermediate
model matrix as a step toward constructing a sparse model matrix. This
is possible by constructing the sparse model matrix directly using
a function from the R package MatrixModels. Ideally, we should reimplement
`pois_model()`

to support an option for using sparse matrices, but
to focus on the runtime benefits of sparse matrices, we simply
change the matrix in the appropriate environment directly.

```
env_pois$X <- MatrixModels::model.Matrix(
~ store + log(normalSale) - 1,
data = vegetables,
sparse = TRUE
)
```

```
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "ndiMatrix" of class "replValueSp"; definition not updated
```

`class(env_pois$X)`

```
## [1] "dsparseModelMatrix"
## attr(,"package")
## [1] "MatrixModels"
```

The Newton implementation benefits enormously from using sparse matrices because the bottleneck is the computation of the Hessian.

```
Newton_tracer <- tracer(c("value", "h_prime", "gamma"), N = 0)
pois_Newton <- Newton(
veg_pois_large$par,
veg_pois_large$H,
veg_pois_large$grad_H,
veg_pois_large$Hessian_H,
cb = Newton_tracer$tracer
)
summary(Newton_tracer)
```

```
## value h_prime gamma .time
## 1 1.00 NA NA 0.000000
## 2 -14.83 -4.141e+03 0.02252 0.002340
## 3 -64.82 -4.030e+02 0.26214 0.003572
## 4 -111.34 -7.636e+01 1.00000 0.004310
## 5 -124.25 -2.104e+01 1.00000 0.004989
## 6 -127.71 -5.652e+00 1.00000 0.005718
## 7 -128.50 -1.333e+00 1.00000 0.006404
## 8 -128.59 -1.647e-01 1.00000 0.007076
## 9 -128.59 -4.160e-03 1.00000 0.007742
## 10 -128.59 -3.289e-06 1.00000 0.008419
```

To summarize the runtimes we have measured for the Poisson regression example, we found that the conjugate gradient algorithms took of the order of 10 seconds to converge. The Newton-type algorithms in this section were faster and took between 0.3 and 1.7 seconds to converge. The use of sparse matrices reduced the runtime of the quasi-Newton algorithm BFGS by a factor 3, but it reduced the runtime of the Newton algorithm by a factor 50 to about 0.03 seconds. One could be concerned that the construction of the sparse model matrix takes more time (which we did not measure), but if measured it turns out that for this example it takes about the same time to construct the dense model matrix as it takes to construct the sparse one.

runtime efficiency is not the only argument for using sparse matrices as they are also more memory efficient. It is memory (and time) inefficient to use dense intermediates, and for truly large scale problems impossible. Using sparse model matrices for regression models allows us to work with larger models that have more variables, more factor levels and more observations than if we use dense model matrices. For the Poisson regression model the memory used by either representation can be found.

```
object.size(env_pois$X)
object.size(as.matrix(env_pois$X))
```

```
## Sparse matrix memory usage:
## 123440 bytes
## Dense matrix memory usage:
## 3103728 bytes
```

We see that the dense matrix uses around a factor 30 more memory than the
sparse representation. In this case it means using around 3 MB for storing the
dense matrix instead of around 100 kB, which won’t be a problem
on a contemporary computer. However, going from using 3 GB for
storing a matrix to using 100 Mb could be the difference between
not being able to work with the matrix on a standard laptop to
running the computations with no problems. Using `model.Matrix`

makes
it possible to construct sparse model matrices directly and avoid
all dense intermediates.

The function `glm4()`

from the MatrixModels package for fitting regression models,
can exploit sparse model matrices direction, and can
thus be useful in cases where your model matrix becomes very large but sparse.
There are two main applications where the model matrix becomes sparse.
When you model the response using one or more factors, and possibly their
interactions, the model matrix will become particularly sparse if
the factors have many levels. Another case is when you model the response
via basis expansions of quantitative predictors and use basis functions
with local support. The B-splines form an important example of such a basis
with local support that results in a sparse model matrix.

## 7.5 Misc.

If \(\Phi\) is just *nonexpansive* (the
constant \(c\) above is one), this is no longer true, but replacing \(\Phi\)
by \(\alpha \Phi + (1 - \alpha) I\) for \(\alpha \in (0,1)\) we get
Krasnoselskii-Mann iterates of the form
\[\theta_n = \alpha \Phi(\theta_{n-1}) + (1 - \alpha) \theta_{n-1}\]
that will converge to a fixed point of \(\Phi\) provided it has one.

Banach’s fixed point theorem implies a convergence that is at least as fast as linear convergence with asymptotic rate \(c\). Moreover, if \(\Phi\) is just a contraction for \(n \geq n_0\) for some \(n_0\), then for \(n \geq n_0\) \[\|\theta_{n + 1} - \theta_{n}\| \leq c \| \theta_{n} - \theta_{n-1}\|.\] The convergence may be superlinear, but if it is linear, the rate is bounded by \(c\). To indicate if \(\Phi\) is asymptotically a contraction, we can introduce \[R_n = \frac{\|\theta_{n + 1} - \theta_{n}\|}{\|\theta_{n} - \theta_{n- 1}\|}\] and monitor its behavior as \(n \to \infty\). The constant \[r = \limsup_{n \to \infty} R_n\] is then asymptotically the smallest possible contraction constant. If convergence is sublinear and \(R_n \to 1\) the sequence is called logarithmically convergent (by definition), while \(r \in (0,1)\) is an indication of linear convergence with rate \(r\), and \(r = 0\) is an indication of superlinear convergence.

In practice, we can plot the ratio \(R_n\) as the algorithm is running, and use \(R_n\) as an estimate of the rate \(r\) for large \(n\). However, this can be a quite unstable method for estimating the rate.

Finally, it is also possible to estimate the order \(q\) as well as the rate \(r\) by using that \[\log \|\theta_{n} - \theta_{N}\| \simeq q \log \|\theta_{n-1} - \theta_{N}\| + \log(r)\] for \(n \geq n_0\). We can estimate \(q\) and \(\log(r)\) by fitting a linear function by least squares to these log-log transformed norms of errors.

Iteration, fixed points, convergence criteria. Ref to Nonlinear Parameter Optimization Using R Tools.