## 7.2 Descent direction algorithms

The negative gradient of \(H\) in \(\theta\) is the direction of steepest descent.
Starting from \(\theta_0\) and with the goal of minimizing \(H\), it is natural
to move away from \(\theta_0\) in the direction of \(-\nabla H(\theta_0)\).
Thus we could define

\[\theta_1 = \theta_0 - \gamma \nabla H(\theta_0)\]
for a suitably chosen \(\gamma > 0\). By Taylor’s theorem
\[H(\theta_1) = H(\theta_0) - \gamma \|\nabla H(\theta_0)\|^2_2 + o(\gamma),\]
which means that if \(\theta_0\) is not a stationary point (\(\nabla H(\theta_0) \neq 0\))
then
\[H(\theta_1) < H(\theta_0)\]
for \(\gamma\) small enough.

More generally, we define a *descent direction* at \(\theta_0\)
as a vector \(\rho_0 \in \mathbb{R}^p\) such that
\[\nabla H(\theta_0)^T \rho_0 < 0.\]
By the same kind of Taylor argument as above, \(H\) will decrease for a sufficiently
small step in any descent direction. And if \(\theta_0\)
is not a stationary point, \(-\nabla H(\theta_0)\) is a descent direction.
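As a small numerical illustration (a hypothetical example, not tied to the models in this chapter), take the quadratic \(H(\theta) = \theta_1^2 + 4\theta_2^2\). A short step along the negative gradient decreases \(H\):

```r
# Hypothetical quadratic objective and its gradient
H <- function(theta) theta[1]^2 + 4 * theta[2]^2
grad_H <- function(theta) c(2 * theta[1], 8 * theta[2])

theta0 <- c(1, 1)
gamma <- 0.1
# One steepest descent step
theta1 <- theta0 - gamma * grad_H(theta0)
H(theta1) < H(theta0)  # TRUE
```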

One strategy for choosing \(\gamma\) is to minimize the univariate
function
\[\gamma \mapsto H(\theta_0 + \gamma \rho_0),\]
which is an example of a *line search* method. Such a minimization
would give the maximal possible descent in the direction \(\rho_0\),
and as we have argued, if \(\rho_0\) is a descent direction, a minimizer \(\gamma > 0\)
guarantees descent of \(H\). However, unless the minimization can be
done analytically it is often computationally too expensive.
Less will also do, and as shown in Example 7.1,
if the Hessian has uniformly bounded numerical radius it is possible to
fix one (sufficiently small) step length that will guarantee descent.
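For illustration, an exact line search can be carried out numerically with `optimize()` on a bracketing interval, reusing a simple quadratic objective (in practice, this per-iteration minimization is exactly the cost one usually wants to avoid):

```r
# Hypothetical quadratic objective
H <- function(theta) theta[1]^2 + 4 * theta[2]^2
grad_H <- function(theta) c(2 * theta[1], 8 * theta[2])

theta0 <- c(1, 1)
rho0 <- - grad_H(theta0)           # steepest descent direction
h <- function(gamma) H(theta0 + gamma * rho0)
# Exact line search: minimize the univariate function h over [0, 1]
gamma_star <- optimize(h, c(0, 1))$minimum
```

For this quadratic, \(h(\gamma) = 5 - 68\gamma + 260\gamma^2\), so the exact minimizer is \(\gamma^* = 17/130 \approx 0.131\), which `optimize()` reproduces to numerical tolerance.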

### 7.2.1 Line search

We consider algorithms of the form \[\theta_{n+1} = \theta_n + \gamma_{n} \rho_n\] for descent directions \(\rho_n\) and starting in \(\theta_0\). The step lengths, \(\gamma_n\), are chosen so as to give sufficient descent in each iteration.

We let \(h(\gamma) = H(\theta_{n} + \gamma \rho_{n})\)
denote the univariate and differentiable function of \(\gamma\),
\[h : [0,\infty) \to \mathbb{R},\]
that gives the value of \(H\) along the descent direction
\(\rho_n\). We can observe that
\[h'(\gamma) = \nabla H(\theta_{n} + \gamma \rho_{n})^T \rho_{n},\]
and maximal descent in direction \(\rho_n\) can be found by solving
\(h'(\gamma) = 0\) for \(\gamma\). As mentioned above, less will do. First note
that
\[h'(0) = \nabla H(\theta_{n})^T \rho_{n} < 0,\]
so \(h\) has a negative slope at \(0\). It decreases on a sufficiently
small interval \([0, \varepsilon)\), and it is even true that for any \(c \in (0, 1)\)
there is an \(\varepsilon > 0\) such that
\[h(\gamma) \leq h(0) + c \gamma h'(0)\]
for \(\gamma \in [0, \varepsilon)\). We note that this inequality can
be checked easily for any given \(\gamma > 0\), and is known as the
*sufficient descent* condition. Sufficient descent is not enough
in itself as the step length could be arbitrarily small, and the algorithm
could effectively get stuck.

To prevent too small steps we can enforce another condition. Very close
to \(0\), \(h\) will have almost the same slope, \(h'(0)\), as it has at \(0\). If we
therefore require that the slope at \(\gamma\) should be larger than \(\tilde{c} h'(0)\)
for some \(\tilde{c} \in (0, 1)\), \(\gamma\) is forced away from \(0\). This is
known as the *curvature condition*.

The combined conditions on \(\gamma\),
\[h(\gamma) \leq h(0) + c \gamma h'(0)\]
for a \(c \in (0, 1)\) and
\[h'(\gamma) \geq \tilde{c} h'(0)\]
for a \(\tilde{c} \in (c, 1)\) are known collectively as
the *Wolfe conditions*. It can be shown that if \(h\) is bounded below there
exists a step length satisfying the Wolfe conditions (Lemma 3.1 in Nocedal and Wright (2006)).
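Both conditions are straightforward to check for a candidate step length. The following sketch (a hypothetical helper, where `h` and `h_prime` are the univariate restriction and its derivative) returns whether a given \(\gamma\) satisfies the Wolfe conditions:

```r
# Sketch: check the Wolfe conditions for a candidate step length gamma.
# h is the univariate function h(gamma) = H(theta + gamma * rho) and
# h_prime its derivative; c and c_tilde are the constants from the text.
wolfe <- function(gamma, h, h_prime, c = 0.1, c_tilde = 0.9) {
  sufficient_descent <- h(gamma) <= h(0) + c * gamma * h_prime(0)
  curvature <- h_prime(gamma) >= c_tilde * h_prime(0)
  sufficient_descent && curvature
}
```

With the quadratic \(h(\gamma) = 5 - 68\gamma + 260\gamma^2\), for instance, `wolfe(0.13, h, h_prime)` is `TRUE`, while the overly long step \(\gamma = 1\) fails sufficient descent and gives `FALSE`.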

Even when choosing \(\gamma_{n}\) to fulfill the Wolfe conditions there is no guarantee that \(\theta_n\) will converge let alone converge toward a global minimizer. The best we can hope for in general is that \[\|\nabla H(\theta_n)\|_2 \rightarrow 0\] for \(n \to \infty\), and this will happen under some relatively weak conditions on \(H\) (Theorem 3.2 Nocedal and Wright (2006)) under the assumption that \[\frac{\nabla H(\theta_n)^T \rho_n}{\|\nabla H(\theta_n)\|_2 \| \rho_n\|_2} \leq - \delta < 0.\] That is, the angle between the descent direction and the gradient should be uniformly bounded away from \(90^{\circ}\).

A practical way of searching for a step length is via *backtracking*.
Choosing a \(\gamma_0\) and a constant \(d \in (0, 1)\) we
can search through the sequence of step lengths
\[\gamma_0, d \gamma_0, d^2 \gamma_0, d^3 \gamma_0, \ldots\]
and stop the first time we find a step length satisfying the Wolfe
conditions.

Using backtracking, we can actually dispense with the curvature condition and simply check the sufficient descent condition

\[H(\theta_{n} + d^k \gamma_0 \rho_{n}) \leq H(\theta_n) + cd^k \gamma_0 \nabla H(\theta_{n})^T \rho_{n}\]

for \(c \in (0, 1)\). The implementation of backtracking requires the choice of the three parameters: \(\gamma_0 > 0\), \(d \in (0, 1)\) and \(c \in (0, 1)\). A good choice depends quite a lot on the algorithm used for choosing the descent direction, but choosing \(c\) too close to 1 can make the algorithm take too small steps, and taking \(d\) too small can likewise generate small step lengths. Thus \(d = 0.8\) or \(d = 0.9\) and \(c = 0.1\) or even smaller are sensible choices. For some algorithms, like the Newton algorithm to be dealt with below, there is a natural choice of \(\gamma_0 = 1\). But for other algorithms a good choice depends crucially on the scale of the parameters, and there is then no general advice on choosing \(\gamma_0\).
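As a standalone sketch (a hypothetical helper; the gradient descent implementation below inlines the same loop), backtracking can be written as:

```r
# Sketch of backtracking: try gamma0, d * gamma0, d^2 * gamma0, ...
# until the sufficient descent condition holds. h is the univariate
# restriction of H and h_prime0 = h'(0) < 0 is its slope at zero.
backtrack <- function(h, h_prime0, gamma0 = 1, d = 0.8, c = 0.1) {
  gamma <- gamma0
  while(h(gamma) > h(0) + c * gamma * h_prime0)
    gamma <- d * gamma
  gamma
}
```

The loop terminates because \(h'(0) < 0\) guarantees that all sufficiently small step lengths satisfy the condition.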

### 7.2.2 Gradient descent

We implement gradient descent with backtracking below as the function `GD()`.
For gradient descent, the sufficient descent condition amounts to
choosing the smallest \(k \geq 0\) such that

\[H(\theta_{n} - d^k \gamma_0 \nabla H(\theta_{n})) \leq H(\theta_n) - cd^k \gamma_0 \|\nabla H(\theta_{n})\|_2^2.\]

The `GD()` function takes the starting point, \(\theta_0\), the objective function,
\(H\), and its gradient, \(\nabla H\), as arguments. The four parameters \(d\), \(c\), \(\gamma_0\) and
\(\varepsilon\) that control the algorithm can also be specified as additional arguments,
but are given some reasonable default values. The implementation uses the *squared* norm of
the gradient as a stopping criterion, but it also has a maximal number of
iterations as a safeguard. Note that if the maximal number is reached, a
warning is printed.

Finally, we include a callback argument (the `cb` argument).
If a function is passed to this argument, it will be evaluated in each iteration
of the algorithm. This gives us the possibility of logging or
printing values of variables during evaluation, which can be highly useful
for understanding the inner workings of the algorithm. Monitoring or logging
intermediate values during the evaluation of code is referred to as
*tracing*.
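To make the callback mechanism concrete, here is a minimal hypothetical callback that logs the objective value by looking it up in the evaluation environment of its caller (the CSwR tracer objects work along similar but more elaborate lines):

```r
# Minimal hypothetical callback: record `value` from the evaluation
# environment of the function that calls cb(), e.g. GD() below
trace_env <- new.env()
trace_env$values <- numeric(0)
cb <- function() {
  # parent.frame() is the environment where cb() was called from
  value <- get("value", envir = parent.frame())
  trace_env$values <- c(trace_env$values, value)
}
```

Passing `cb = cb` to `GD()` would then leave one objective value per iteration in `trace_env$values`.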

```
GD <- function(
  par,
  H,
  gr,
  d = 0.8,
  c = 0.1,
  gamma0 = 0.01,
  epsilon = 1e-4,
  maxit = 1000,
  cb = NULL
) {
  for(i in 1:maxit) {
    value <- H(par)
    grad <- gr(par)
    h_prime <- sum(grad^2)
    if(!is.null(cb)) cb()
    # Convergence criterion based on the squared gradient norm
    if(h_prime <= epsilon) break
    gamma <- gamma0
    # Proposed descent step
    par1 <- par - gamma * grad
    # Backtracking while descent is insufficient
    while(H(par1) > value - c * gamma * h_prime) {
      gamma <- d * gamma
      par1 <- par - gamma * grad
    }
    par <- par1
  }
  if(i == maxit)
    warning("Maximal number, ", maxit, ", of iterations reached")
  par
}
```

We will use the Poisson regression example to illustrate the use of gradient descent and other optimization algorithms, and we need to implement functions in R for computing the negative log-likelihood and its gradient.

The implementation below uses a function factory to produce a list containing a parameter vector, the negative log-likelihood function and its gradient. We anticipate that for the Newton algorithm in Section 7.3 we also need an implementation of the Hessian, which is thus included here as well.

We will exploit the `model.matrix()` function to construct the model matrix
from the data via a formula, and the sufficient statistic is also
computed. The implementations use linear algebra and vectorized computations
relying on access to the model matrix and the sufficient statistics in
their enclosing environment.

```
poisson_model <- function(form, data, response) {
  X <- model.matrix(form, data)
  y <- data[[response]]
  # The function drop() drops the dim attribute and turns, for instance,
  # a matrix with one column into a vector
  t_map <- drop(crossprod(X, y))  # More efficient than drop(t(X) %*% y)
  n <- nrow(X)
  p <- ncol(X)
  H <- function(beta)
    drop(sum(exp(X %*% beta)) - beta %*% t_map) / n
  grad_H <- function(beta)
    (drop(crossprod(X, exp(X %*% beta))) - t_map) / n
  Hessian_H <- function(beta)
    crossprod(X, drop(exp(X %*% beta)) * X) / n
  list(par = rep(0, p), H = H, grad_H = grad_H, Hessian_H = Hessian_H)
}
```

We choose to normalize the log-likelihood by the number of observations \(n\) (the number of rows in the model matrix). This does have a small computational cost, but the resulting numerical values become less dependent upon \(n\), which makes it easier to choose sensible default values of various parameters for the numerical optimization algorithms. Gradient descent is very slow for the large Poisson model with individual store effects, so we consider only the simple model with two parameters.

The gradient descent implementation is tested by comparing the minimizer to
the estimated parameters as computed by `glm()`.

```
## (Intercept) log(normalSale)
## pois_glm 1.461440 0.9215699
## pois_GD 1.460352 0.9219358
```

We get the same result up to the first two decimals. The convergence criterion on our gradient descent algorithm was quite loose (\(\varepsilon = 10^{-4}\), which means that the norm of the gradient is smaller than \(10^{-2}\) when the algorithm stops). This choice of \(\varepsilon\) in combination with \(\gamma_0 = 0.01\) implies that the algorithm stops when the gradient is so small that the changes are at most of norm \(10^{-4}\).

Comparing the resulting values of the negative log-likelihood shows agreement
up to the first five decimals, but we notice that the negative log-likelihood
for the parameters fitted using `glm()` is just slightly smaller.

```
## [1] -124.406827879897
## [1] -124.406825325047
```

To investigate what actually went on inside the gradient descent
algorithm we will use the callback argument to trace the internals
of a call to `GD()`. The `tracer()` function from the CSwR package
can be used to construct a tracer object with a `tracer()` function
that we can pass as the callback argument. The tracer object and its
`tracer()` function work by storing information in the enclosing
environment of `tracer()`. When used as the callback argument to
e.g. `GD()`, the `tracer()` function will look up variables in the
evaluation environment of `GD()`, store them and print them if
requested, and store run time information as well.

When `GD()` has returned, the trace information can be accessed
via the `summary()` method for the tracer object. The tracer objects
and their `tracer()` function should not be confused with the `trace()`
function from the R base package, but the tracer object’s `tracer()` function
can be passed as the `tracer` argument to `trace()` to interactively
inject tracing code into any R function. Here, the tracer objects will
only be used together with a callback argument.

We use the tracer object with our gradient descent implementation and print trace information every 50th iteration.

```
GD_tracer <- tracer(c("value", "h_prime", "gamma"), N = 50)
pois_GD <- GD(veg_pois$par, veg_pois$H, veg_pois$grad_H, cb = GD_tracer$tracer)
```

```
## n = 1: value = 1; h_prime = 14268.59; gamma = NA;
## n = 50: value = -123.9395; h_prime = 15.45722; gamma = 0.004096;
## n = 100: value = -124.3243; h_prime = 3.133931; gamma = 0.00512;
## n = 150: value = -124.3935; h_prime = 0.601431; gamma = 0.00512;
## n = 200: value = -124.4048; h_prime = 0.09805907; gamma = 0.00512;
## n = 250: value = -124.4065; h_prime = 0.0109742; gamma = 0.00512;
## n = 300: value = -124.4068; h_prime = 0.002375827; gamma = 0.00512;
## n = 350: value = -124.4068; h_prime = 0.000216502; gamma = 0.004096;
```

We see that the gradient descent algorithm runs for a little more than 350 iterations, and we can observe how the value of the negative log-likelihood is descending. We can also see that the step length \(\gamma\) bounces between \(0.004096 = 0.8^4 \times 0.01\) and \(0.00512 = 0.8^3 \times 0.01\), thus the backtracking takes 3 to 4 iterations to find a step length with sufficient descent.

The printed trace does not reveal the run time information. The run time is measured for each iteration of the algorithm and the cumulative run time is of greater interest. This information can be computed and inspected after the algorithm has converged, and it is returned by the summary method for tracer objects.

```
## value h_prime gamma .time
## 372 -124.4068 1.125779e-04 0.005120 0.06352470
## 373 -124.4068 1.218925e-04 0.005120 0.06364027
## 374 -124.4068 1.323878e-04 0.005120 0.06375466
## 375 -124.4068 1.441965e-04 0.005120 0.06387087
## 376 -124.4068 1.574572e-04 0.005120 0.06398542
## 377 -124.4068 7.600796e-05 0.004096 0.06411624
```

The trace information is stored in a list. The summary method transforms the trace information into a data frame with one row per iteration. We can also access individual entries of the list of trace information via subsetting.

```
## $value
## [1] -124.4068
##
## $h_prime
## [1] 7.600796e-05
##
## $gamma
## [1] 0.004096
##
## $.time
## [1] 0.000130819
```

### 7.2.3 Conjugate gradients

The gradient direction is not the best descent direction. It is too
local, and convergence can be quite slow. One of the better algorithms that
is still a *first order algorithm* (using only gradient information) is the nonlinear conjugate
gradient algorithm.
In the Fletcher–Reeves version of the algorithm
the descent direction is initialized as the negative gradient
\(\rho_0 = - \nabla H(\theta_{0})\) and then updated as
\[\rho_{n} = - \nabla H(\theta_{n}) + \frac{\|\nabla H(\theta_n)\|_2^2}{\|\nabla H(\theta_{n-1})\|_2^2} \rho_{n-1}.\]
That is, the descent direction, \(\rho_{n}\), is the negative gradient but modified according to
the previous descent direction. There is plenty of opportunity to vary the prefactor
of \(\rho_{n-1}\), and the one presented here is what makes it the Fletcher–Reeves
version. Other versions go by the names of their inventors such as Polak–Ribière
or Hestenes–Stiefel.
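For comparison (a standard formula, stated here for reference), the Polak–Ribière version uses the prefactor
\[\frac{\nabla H(\theta_n)^T\left(\nabla H(\theta_n) - \nabla H(\theta_{n-1})\right)}{\|\nabla H(\theta_{n-1})\|_2^2}\]
in place of \(\|\nabla H(\theta_n)\|_2^2 / \|\nabla H(\theta_{n-1})\|_2^2\); the two coincide whenever consecutive gradients are orthogonal.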

In fact, \(\rho_{n}\) need not be a descent direction unless we put some
restrictions on the step lengths. One possibility is to require that
the step length \(\gamma_{n}\) satisfies the *strong* curvature condition
\[|h'(\gamma)| = |\nabla H(\theta_n + \gamma \rho_n)^T \rho_n | \leq \tilde{c} |\nabla H(\theta_n)^T \rho_n| = \tilde{c} |h'(0)|\]
for a \(\tilde{c} < \frac{1}{2}\). Then \(\rho_{n + 1}\) can be shown to be a descent
direction if \(\rho_{n}\) is.

We implement the conjugate gradient method in a slightly different way. Instead of introducing the more advanced curvature condition, we simply reset the algorithm to use the gradient direction whenever a non-descent direction has been chosen. Resetting the descent direction every \(p\)-th iteration is recommended anyway for the nonlinear conjugate gradient algorithm.

```
CG <- function(
  par,
  H,
  gr,
  d = 0.8,
  c = 0.1,
  gamma0 = 1,
  epsilon = 1e-6,
  maxit = 1000,
  cb = NULL
) {
  p <- length(par)
  m <- 1
  rho0 <- numeric(p)
  for(i in 1:maxit) {
    value <- H(par)
    grad <- gr(par)
    grad_norm_sq <- sum(grad^2)
    if(!is.null(cb)) cb()
    if(grad_norm_sq <= epsilon) break
    gamma <- gamma0
    # Descent direction; rho0 holds the previous direction divided by the
    # previous squared gradient norm, giving the Fletcher-Reeves update
    rho <- - grad + grad_norm_sq * rho0
    h_prime <- drop(t(grad) %*% rho)
    # Reset to gradient descent if m > p or rho is not a descent direction
    if(m > p || h_prime >= 0) {
      rho <- - grad
      h_prime <- - grad_norm_sq
      m <- 1
    }
    par1 <- par + gamma * rho
    # Backtracking
    while(H(par1) > value + c * gamma * h_prime) {
      gamma <- d * gamma
      par1 <- par + gamma * rho
    }
    rho0 <- rho / grad_norm_sq
    par <- par1
    m <- m + 1
  }
  if(i == maxit)
    warning("Maximal number, ", maxit, ", of iterations reached")
  par
}
```

```
CG_tracer <- tracer(c("value", "gamma", "grad_norm_sq"), N = 10)
pois_CG <- CG(veg_pois$par, veg_pois$H, veg_pois$grad_H, cb = CG_tracer$tracer)
```

```
## n = 1: value = 1; gamma = NA; grad_norm_sq = 14269;
## n = 10: value = -123.15; gamma = 0.018014; grad_norm_sq = 129.78;
## n = 20: value = -123.92; gamma = 0.022518; grad_norm_sq = 77.339;
## n = 30: value = -124.23; gamma = 0.018014; grad_norm_sq = 22.227;
## n = 40: value = -124.41; gamma = 0.0092234; grad_norm_sq = 0.179;
## n = 50: value = -124.41; gamma = 0.10737; grad_norm_sq = 0.028232;
## n = 60: value = -124.41; gamma = 0.0092234; grad_norm_sq = 0.00021747;
## n = 70: value = -124.41; gamma = 0.0092234; grad_norm_sq = 1.7488e-06;
```

This algorithm is fast enough to fit the large Poisson regression model.

```
CG_tracer <- tracer(c("value", "gamma", "grad_norm_sq"), N = 100)
pois_CG <- CG(veg_pois$par, veg_pois$H, veg_pois$grad_H, cb = CG_tracer$tracer)
```

```
## n = 1: value = 1; gamma = NA; grad_norm_sq = 12737;
## n = 100: value = -127.9; gamma = 0.018014; grad_norm_sq = 1.676;
## n = 200: value = -128.28; gamma = 0.011529; grad_norm_sq = 2.5128;
## n = 300: value = -128.55; gamma = 0.0037779; grad_norm_sq = 0.068176;
## n = 400: value = -128.59; gamma = 0.022518; grad_norm_sq = 0.0028747;
## n = 500: value = -128.59; gamma = 0.0092234; grad_norm_sq = 0.0652;
## n = 600: value = -128.59; gamma = 0.018014; grad_norm_sq = 0.00020555;
## n = 700: value = -128.59; gamma = 0.0019343; grad_norm_sq = 5.3952e-06;
## n = 800: value = -128.59; gamma = 0.005903; grad_norm_sq = 3.7118e-06;
## n = 900: value = -128.59; gamma = 0.0037779; grad_norm_sq = 1.0621e-06;
```

```
## value gamma grad_norm_sq .time
## 899 -128.5894 0.005902958 5.214667e-06 11.65685
## 900 -128.5894 0.003777893 1.062092e-06 11.67087
## 901 -128.5894 0.068719477 2.119915e-05 11.67842
## 902 -128.5894 0.107374182 2.047029e-04 11.68518
## 903 -128.5894 0.004722366 7.056181e-06 11.70461
## 904 -128.5894 0.003777893 8.881615e-07 11.71881
```

For comparison, we use `optim()` with the conjugate gradient method.

```
system.time(
pois_optim_CG <- optim(
veg_pois$par,
veg_pois$H,
veg_pois$grad_H,
method = "CG",
control = list(maxit = 10000)
)
)
```

```
## user system elapsed
## 10.593 0.032 10.650
```

```
## $value
## [1] -128.5895
##
## $counts
## function gradient
## 10674 4549
```

### 7.2.4 Peppered Moths

Returning to the peppered moths from Section 6.2.1, we implemented in that section the log-likelihood for general multinomial cell collapsing and applied the implementation to compute the maximum-likelihood estimate. In this section we implement the gradient as well. From the expression for the log-likelihood in (6.2) it follows that the gradient equals

\[\nabla \ell(\theta) = \sum_{j = 1}^{K_0} \frac{ x_j }{ M(p(\theta))_j}\nabla M(p(\theta))_j = \sum_{j = 1}^{K_0} \sum_{k \in A_j} \frac{ x_j}{ M(p(\theta))_j} \nabla p_k(\theta).\]

Letting \(j(k)\) be defined by \(k \in A_{j(k)}\) we see that the gradient can also be written as \[\nabla \ell(\theta) = \sum_{k=1}^K \frac{x_{j(k)}}{ M(p(\theta))_{j(k)}} \nabla p_k(\theta) = \mathbf{\tilde{x}}(\theta) \mathrm{D}p(\theta),\] where \(\mathrm{D}p(\theta)\) is the Jacobian of the parametrization \(\theta \mapsto p(\theta)\), and \(\mathbf{\tilde{x}}(\theta)\) is the vector with \[\mathbf{\tilde{x}}(\theta)_k = \frac{ x_{j(k)}}{M(p(\theta))_{j(k)}}.\]

```
grad_loglik <- function(par, x, prob, Dprob, group) {
  p <- prob(par)
  if(is.null(p)) return(rep(NA, length(par)))
  - (x[group] / M(p, group)[group]) %*% Dprob(par)
}
```
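The helper `M()` used above is the multinomial cell collapsing map from Section 6.2.1 and is assumed to be available. A minimal sketch consistent with its use here (an assumption about its exact form) would sum the cell probabilities within each group:

```r
# Hypothetical sketch of the cell collapsing map from Section 6.2.1:
# entry j of M(p, group) is the sum of p[k] over the cells k in group j
M <- function(p, group)
  as.vector(tapply(p, group, sum))
```

With `group = c(1, 1, 1, 2, 2, 3)` this collapses the six cell probabilities into the three observable group probabilities.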

The Jacobian needs to be implemented for the specific example of peppered moths.

```
Dprob <- function(p) {
  p[3] <- 1 - p[1] - p[2]
  matrix(
    c(2 * p[1], 0,
      2 * p[2], 2 * p[1],
      2 * p[3] - 2 * p[1], -2 * p[1],
      0, 2 * p[2],
      -2 * p[2], 2 * p[3] - 2 * p[2],
      -2 * p[3], -2 * p[3]),
    ncol = 2, nrow = 6, byrow = TRUE)
}
```
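A quick sanity check compares the analytic Jacobian to a central finite difference approximation. The sketch below assumes that `prob()` is the Hardy–Weinberg parametrization from Section 6.2.1 and repeats both functions so the snippet is self-contained:

```r
# Sanity check of the Jacobian via central finite differences.
# prob() is assumed to be the Hardy-Weinberg parametrization from
# Section 6.2.1; both functions are repeated here for self-containment.
prob <- function(p) {
  p[3] <- 1 - p[1] - p[2]
  c(p[1]^2, 2 * p[1] * p[2], 2 * p[1] * p[3],
    p[2]^2, 2 * p[2] * p[3], p[3]^2)
}
Dprob <- function(p) {  # as implemented above
  p[3] <- 1 - p[1] - p[2]
  matrix(
    c(2 * p[1], 0,
      2 * p[2], 2 * p[1],
      2 * p[3] - 2 * p[1], -2 * p[1],
      0, 2 * p[2],
      -2 * p[2], 2 * p[3] - 2 * p[2],
      -2 * p[3], -2 * p[3]),
    ncol = 2, nrow = 6, byrow = TRUE)
}
par <- c(0.3, 0.3)
eps <- 1e-6
# Column i of the numerical Jacobian: central difference in coordinate i
D_num <- sapply(1:2, function(i) {
  e <- replace(numeric(2), i, eps)
  (prob(par + e) - prob(par - e)) / (2 * eps)
})
max(abs(D_num - Dprob(par)))  # close to zero if Dprob is correct
```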

We can then use the conjugate gradient algorithm to compute the maximum-likelihood estimate.

```
optim(c(0.3, 0.3), loglik, grad_loglik, x = c(85, 196, 341),
prob = prob, Dprob = Dprob, group = c(1, 1, 1, 2, 2, 3),
method = "CG")
```

```
## $par
## [1] 0.07083691 0.18873652
##
## $value
## [1] 600.481
##
## $counts
## function gradient
## 92 19
##
## $convergence
## [1] 0
##
## $message
## NULL
```

The peppered moth example is very simple.
The log-likelihood can easily be computed, and we used this
problem to illustrate ways of implementing a
likelihood in R and how to use `optim()` to maximize it.

One of the likelihood implementations was very problem specific
while the other was more abstract and general, and we used the same general and abstract
approach to implement the gradient above. The gradient could then
be used for other optimization algorithms, still using `optim()`, such as
conjugate gradient. In fact, you can use conjugate gradient without
computing and implementing the gradient.

```
optim(c(0.3, 0.3), loglik, x = c(85, 196, 341),
prob = prob, group = c(1, 1, 1, 2, 2, 3),
method = "CG")
```

```
## $par
## [1] 0.07084109 0.18873718
##
## $value
## [1] 600.481
##
## $counts
## function gradient
## 107 15
##
## $convergence
## [1] 0
##
## $message
## NULL
```

If we do not implement a gradient, a numerical gradient is
used by `optim()`. This can result in a slower algorithm than if the
gradient is implemented, but more seriously, it can result in convergence
problems. This is because there is a subtle tradeoff between numerical
accuracy and accuracy of the finite difference approximation used to
approximate the gradient. We did not
experience convergence problems in the example above, but one way to remedy such problems
is to set the `parscale` or `fnscale` entries in the `control`
list argument to `optim()`.

In the following chapter the peppered moth example is used to illustrate the EM algorithm. It is important to understand that the EM algorithm does not rely on the ability to compute the likelihood or the gradient of the likelihood for that matter. In many real applications of the EM algorithm the computation of the likelihood is challenging or even impossible, thus most standard optimization algorithms will not be directly applicable.

### References

Nocedal, Jorge, and Stephen J. Wright. 2006. *Numerical Optimization*. Second. Springer Series in Operations Research and Financial Engineering. New York: Springer.