7.2 Descent direction algorithms

The negative gradient of \(H\) in \(\theta\) is the direction of steepest descent. Starting from \(\theta_0\) and with the goal of minimizing \(H\), it is natural to move away from \(\theta_0\) in the direction of \(-\nabla H(\theta_0)\). Thus we could define
\[\theta_1 = \theta_0 - \gamma \nabla H(\theta_0)\] for a suitably chosen \(\gamma > 0\). By Taylor’s theorem \[H(\theta_1) = H(\theta_0) - \gamma \|\nabla H(\theta_0)\|^2_2 + o(\gamma),\] which means that if \(\theta_0\) is not a stationary point (\(\nabla H(\theta_0) \neq 0\)) then \[H(\theta_1) < H(\theta_0)\] for \(\gamma\) small enough.

More generally, we define a descent direction at \(\theta_0\) as a vector \(\rho_0 \in \mathbb{R}^p\) such that \[\nabla H(\theta_0)^T \rho_0 < 0.\] By the same kind of Taylor argument as above, \(H\) will decrease for a sufficiently small step size in the direction of any descent direction. And if \(\theta_0\) is not a stationary point, \(-\nabla H(\theta_0)\) is a descent direction.

One strategy for choosing \(\gamma\) is to minimize the univariate function \[\gamma \mapsto H(\theta_0 + \gamma \rho_0),\] which is an example of a line search method. Such a minimization would give the maximal possible descent in the direction \(\rho_0\), and as we have argued, if \(\rho_0\) is a descent direction, a minimizer \(\gamma > 0\) guarantees descent of \(H\). However, unless the minimization can be done analytically, it is often computationally too expensive. Less ambitious choices of \(\gamma\) will also do, and as shown in Example 7.1, if the Hessian has uniformly bounded numerical radius, it is possible to fix one (sufficiently small) step length that will guarantee descent.
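
To make the line search idea concrete, here is a minimal sketch using optimize() on a simple quadratic; the test function, gradient and starting point are hypothetical, chosen for illustration only.

H_test <- function(theta) theta[1]^2 + 4 * theta[2]^2
grad_H_test <- function(theta) c(2 * theta[1], 8 * theta[2])
theta0 <- c(2, 1)
rho0 <- - grad_H_test(theta0)  # the steepest descent direction
# Exact line search: minimize gamma -> H(theta0 + gamma * rho0)
line_search <- optimize(function(gamma) H_test(theta0 + gamma * rho0), c(0, 1))
line_search$minimum  # approximately 0.147
H_test(theta0 + line_search$minimum * rho0) < H_test(theta0)  # TRUE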

7.2.2 Gradient descent

We implement gradient descent with backtracking below as the function GD(). For gradient descent, the sufficient descent condition amounts to choosing the smallest \(k \geq 0\) such that

\[H(\theta_{n} - d^k \gamma_0 \nabla H(\theta_{n})) \leq H(\theta_n) - cd^k \gamma_0 \|\nabla H(\theta_{n})\|_2^2.\]

The GD() function takes the starting point, \(\theta_0\), the objective function, \(H\), and its gradient, \(\nabla H\), as arguments. The four parameters \(d\), \(c\), \(\gamma_0\) and \(\varepsilon\) that control the algorithm can also be specified as additional arguments, but they are given reasonable default values. The implementation uses the squared norm of the gradient as a stopping criterion, but it also has a maximal number of iterations as a safeguard. Note that if the maximal number of iterations is reached, a warning is issued.

Finally, we include a callback argument (the cb argument). If a function is passed to this argument, it will be evaluated in each iteration of the algorithm. This gives us the possibility of logging or printing values of variables during evaluation, which can be highly useful for understanding the inner workings of the algorithm. Monitoring or logging intermediate values during the evaluation of code is referred to as tracing.

GD <- function(
  par, 
  H,
  gr,
  d = 0.8, 
  c = 0.1, 
  gamma0 = 0.01, 
  epsilon = 1e-4, 
  maxit = 1000,
  cb = NULL
) {
  for(i in 1:maxit) {
    value <- H(par)
    grad <- gr(par)
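    # h_prime is the squared norm of the gradient; the derivative of
    # gamma |-> H(par - gamma * grad) at gamma = 0 equals -h_prime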
    h_prime <- sum(grad^2)
    if(!is.null(cb)) cb()
    # Convergence criterion based on gradient norm
    if(h_prime <= epsilon) break
    gamma <- gamma0
    # Proposed descent step
    par1 <- par - gamma * grad
    # Backtracking while descent is insufficient
    while(H(par1) > value - c * gamma * h_prime) {
      gamma <- d * gamma
      par1 <- par - gamma * grad
    }
    par <- par1
  }
  if(i == maxit)
    warning("Maximal number, ", maxit, ", of iterations reached")
  par
}
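
As a quick sanity check, we can run GD() on the hypothetical quadratic test function from the line search sketch above, whose unique minimizer is the origin.

GD(c(2, 1), H_test, grad_H_test)  # converges to a point close to c(0, 0)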

We will use the Poisson regression example to illustrate the use of gradient descent and other optimization algorithms, and we need to implement functions in R for computing the negative log-likelihood and its gradient.

The implementation below uses a function factory to produce a list containing a parameter vector, the negative log-likelihood function and its gradient. We anticipate that for the Newton algorithm in Section 7.3 we also need an implementation of the Hessian, which is thus included here as well.

We will exploit the model.matrix() function to construct the model matrix from the data via a formula, and the sufficient statistic is also computed. The implementations use linear algebra and vectorized computations relying on access to the model matrix and the sufficient statistics in their enclosing environment.
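
As a small illustration of what model.matrix() produces, consider a hypothetical three-row data frame; the intercept column of ones is added automatically.

d <- data.frame(normalSale = c(1, 2, 4))
model.matrix(~ log(normalSale), d)  # a 3 x 2 matrix: intercept and log(normalSale)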

poisson_model <- function(form, data, response) {
  X <- model.matrix(form, data)
  y <- data[[response]]
  # The function drop() drops the dim attribute and turns, for instance,
  # a matrix with one column into a vector
  t_map <- drop(crossprod(X, y))  # More efficient than drop(t(X) %*% y)
  n <- nrow(X)
  p <- ncol(X)
  
  H <- function(beta) 
    drop(sum(exp(X %*% beta)) - beta %*% t_map) / n
  
  grad_H <- function(beta) 
    (drop(crossprod(X, exp(X %*% beta))) - t_map) / n
  
  Hessian_H <- function(beta)
    crossprod(X, drop(exp(X %*% beta)) * X) / n
  
  list(par = rep(0, p), H = H, grad_H = grad_H, Hessian_H = Hessian_H)
}

We choose to normalize the log-likelihood by the number of observations \(n\) (the number of rows in the model matrix). This does have a small computational cost, but the resulting numerical values become less dependent upon \(n\), which makes it easier to choose sensible default values of various parameters for the numerical optimization algorithms. Gradient descent is very slow for the large Poisson model with individual store effects, so we consider only the simple model with two parameters.

veg_pois <- poisson_model(~ log(normalSale), vegetables, response = "sale")
pois_GD <- GD(veg_pois$par, veg_pois$H, veg_pois$grad_H)

The gradient descent implementation is tested by comparing the minimizer to the estimated parameters as computed by glm().

rbind(pois_glm = coefficients(pois_model_null), pois_GD)
##          (Intercept) log(normalSale)
## pois_glm    1.461440       0.9215699
## pois_GD     1.460352       0.9219358

We get the same result up to the first two decimals. The convergence criterion in our gradient descent algorithm was quite loose (\(\varepsilon = 10^{-4}\) for the squared norm of the gradient, which means that the norm of the gradient is smaller than \(10^{-2}\) when the algorithm stops). This choice of \(\varepsilon\) in combination with \(\gamma_0 = 0.01\) implies that the algorithm stops when the gradient is so small that the parameter changes are at most of norm \(\gamma_0 \times 10^{-2} = 10^{-4}\).

Comparing the resulting values of the negative log-likelihood shows agreement up to the first five decimals, but we notice that the negative log-likelihood for the parameters fitted using glm() is just slightly smaller.

veg_pois$H(coefficients(pois_model_null))
veg_pois$H(pois_GD)
## [1] -124.406827879897
## [1] -124.406825325047

To investigate what actually went on inside the gradient descent algorithm, we will use the callback argument to trace the internals of a call to GD(). The tracer() function from the CSwR package constructs a tracer object, whose tracer() function we can pass as the callback argument. The tracer object and its tracer() function work by storing information in the enclosing environment of tracer(). When used as the callback argument to, e.g., GD(), the tracer() function will look up variables in the evaluation environment of GD(), store them, print them if requested, and store run time information as well.

When GD() has returned, the trace information can be accessed via the summary() method for the tracer object. The tracer objects and their tracer() function should not be confused with the trace() function from the R base package, but the tracer object’s tracer() function can be passed as the tracer argument to trace() to interactively inject tracing code into any R function. Here, the tracer objects will only be used together with a callback argument.
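
The following is a minimal sketch of the closure mechanism that such a tracer object relies on; it is not the actual CSwR implementation, and my_tracer() is a hypothetical name.

my_tracer <- function(what) {
  trace_list <- list()
  tracer <- function() {
    # Look up the traced variables in the caller's evaluation
    # environment and accumulate them in the enclosing environment
    trace_list[[length(trace_list) + 1]] <<- mget(what, envir = parent.frame())
  }
  list(tracer = tracer, get_trace = function() trace_list)
}

A call like tr <- my_tracer(c("value", "gamma")) followed by GD(..., cb = tr$tracer) would then record the requested variables in each iteration, and tr$get_trace() would return them afterwards.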

We use the tracer object with our gradient descent implementation and print trace information every 50th iteration.

GD_tracer <- tracer(c("value", "h_prime", "gamma"), N = 50)
pois_GD <- GD(veg_pois$par, veg_pois$H, veg_pois$grad_H, cb = GD_tracer$tracer)
## n = 1: value = 1; h_prime = 14268.59; gamma = NA; 
## n = 50: value = -123.9395; h_prime = 15.45722; gamma = 0.004096; 
## n = 100: value = -124.3243; h_prime = 3.133931; gamma = 0.00512; 
## n = 150: value = -124.3935; h_prime = 0.601431; gamma = 0.00512; 
## n = 200: value = -124.4048; h_prime = 0.09805907; gamma = 0.00512; 
## n = 250: value = -124.4065; h_prime = 0.0109742; gamma = 0.00512; 
## n = 300: value = -124.4068; h_prime = 0.002375827; gamma = 0.00512; 
## n = 350: value = -124.4068; h_prime = 0.000216502; gamma = 0.004096;

We see that the gradient descent algorithm runs for a little more than 350 iterations, and we can observe how the value of the negative log-likelihood is descending. We can also see that the step length \(\gamma\) bounces between \(0.004096 = 0.8^4 \times 0.01\) and \(0.00512 = 0.8^3 \times 0.01\), thus the backtracking takes 3 to 4 iterations to find a step length with sufficient descent.
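
The two step lengths correspond to \(k = 3\) and \(k = 4\) backtracking steps, as a quick computation confirms.

d <- 0.8; gamma0 <- 0.01
d^(3:4) * gamma0  # 0.00512 0.004096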

The printed trace does not reveal the run time information. The run time is measured for each iteration of the algorithm, but it is the cumulative run time that is of greater interest. This information can be computed and inspected after the algorithm has converged, and it is returned by the summary method for tracer objects.

tail(summary(GD_tracer))
##         value      h_prime    gamma      .time
## 372 -124.4068 1.125779e-04 0.005120 0.06352470
## 373 -124.4068 1.218925e-04 0.005120 0.06364027
## 374 -124.4068 1.323878e-04 0.005120 0.06375466
## 375 -124.4068 1.441965e-04 0.005120 0.06387087
## 376 -124.4068 1.574572e-04 0.005120 0.06398542
## 377 -124.4068 7.600796e-05 0.004096 0.06411624

The trace information is stored in a list. The summary method transforms the trace information into a data frame with one row per iteration. We can also access individual entries of the list of trace information via subsetting.

GD_tracer[377] 
## $value
## [1] -124.4068
## 
## $h_prime
## [1] 7.600796e-05
## 
## $gamma
## [1] 0.004096
## 
## $.time
## [1] 0.000130819

Figure 7.1: Gradient norm (left) and value of the negative log-likelihood (right) above the limit value \(H(\theta_{\infty})\). The straight line is fitted to the data points except the first ten using least squares, and the rate is computed from this estimate and reported per ms.

7.2.3 Conjugate gradients

The gradient direction is not the best descent direction. It is too local, and convergence can be quite slow. One of the better algorithms that is still a first order algorithm (using only gradient information) is the nonlinear conjugate gradient algorithm. In the Fletcher–Reeves version of the algorithm the descent direction is initialized as the negative gradient \(\rho_0 = - \nabla H(\theta_{0})\) and then updated as \[\rho_{n} = - \nabla H(\theta_{n}) + \frac{\|\nabla H(\theta_n)\|_2^2}{\|\nabla H(\theta_{n-1})\|_2^2} \rho_{n-1}.\] That is, the descent direction, \(\rho_{n}\), is the negative gradient but modified according to the previous descent direction. There is plenty of opportunity to vary the prefactor of \(\rho_{n-1}\), and the one presented here is what makes it the Fletcher–Reeves version. Other versions go by the names of their inventors such as Polak–Ribière or Hestenes–Stiefel.
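
For comparison, the Polak–Ribière version replaces the prefactor by \[\frac{\nabla H(\theta_n)^T\left(\nabla H(\theta_n) - \nabla H(\theta_{n-1})\right)}{\|\nabla H(\theta_{n-1})\|_2^2},\] which coincides with the Fletcher–Reeves prefactor whenever consecutive gradients are orthogonal.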

In fact, \(\rho_{n}\) need not be a descent direction unless we put some restrictions on the step lengths. One possibility is to require that the step length \(\gamma_{n}\) satisfies the strong curvature condition \[|h'(\gamma)| = |\nabla H(\theta_n + \gamma \rho_n)^T \rho_n | \leq \tilde{c} |\nabla H(\theta_n)^T \rho_n| = \tilde{c} |h'(0)|\] for a \(\tilde{c} < \frac{1}{2}\). Then \(\rho_{n + 1}\) can be shown to be a descent direction if \(\rho_{n}\) is.

We implement the conjugate gradient method in a slightly different way. Instead of introducing the more advanced curvature condition, we simply reset the algorithm to use the gradient direction whenever a non-descent direction has been chosen. Resetting the descent direction every \(p\)-th iteration is recommended anyway for the nonlinear conjugate gradient algorithm.

CG <- function(
  par, 
  H,
  gr,
  d = 0.8, 
  c = 0.1, 
  gamma0 = 1, 
  epsilon = 1e-6,
  maxit = 1000,
  cb = NULL
) {
  p <- length(par)
  m <- 1
  rho0 <- numeric(p)
  for(i in 1:maxit) {
    value <- H(par)
    grad <- gr(par)
    grad_norm_sq <- sum(grad^2)
    if(!is.null(cb)) cb()
    if(grad_norm_sq <= epsilon) break
    gamma <- gamma0
    # Descent direction
    rho <- - grad + grad_norm_sq * rho0
    h_prime <- drop(t(grad) %*% rho)
    # Reset to gradient descent if m > p or rho is not a descent direction
    if(m > p || h_prime >= 0) {
      rho <- - grad
      h_prime <- - grad_norm_sq 
      m <- 1
    }
    par1 <- par + gamma * rho
    # Backtracking
    while(H(par1) > value + c * gamma * h_prime) {
      gamma <- d * gamma
      par1 <- par + gamma * rho
    }
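    # Store rho scaled by the current squared gradient norm; multiplying by
    # the next squared gradient norm then gives the Fletcher-Reeves prefactor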
    rho0 <- rho / grad_norm_sq
    par <- par1
    m <- m + 1
  }
  if(i == maxit)
    warning("Maximal number, ", maxit, ", of iterations reached")
  par
}
CG_tracer <- tracer(c("value", "gamma", "grad_norm_sq"), N = 10)
pois_CG <- CG(veg_pois$par, veg_pois$H, veg_pois$grad_H, cb = CG_tracer$tracer)
## n = 1: value = 1; gamma = NA; grad_norm_sq = 14269; 
## n = 10: value = -123.15; gamma = 0.018014; grad_norm_sq = 129.78; 
## n = 20: value = -123.92; gamma = 0.022518; grad_norm_sq = 77.339; 
## n = 30: value = -124.23; gamma = 0.018014; grad_norm_sq = 22.227; 
## n = 40: value = -124.41; gamma = 0.0092234; grad_norm_sq = 0.179; 
## n = 50: value = -124.41; gamma = 0.10737; grad_norm_sq = 0.028232; 
## n = 60: value = -124.41; gamma = 0.0092234; grad_norm_sq = 0.00021747; 
## n = 70: value = -124.41; gamma = 0.0092234; grad_norm_sq = 1.7488e-06;

Figure 7.2: Gradient norms (top) and negative log-likelihoods (bottom) for gradient descent (left) and conjugate gradient (right).

This algorithm is fast enough to fit the large Poisson regression model.

veg_pois <- poisson_model(~ store + log(normalSale) - 1, vegetables, response = "sale")
CG_tracer <- tracer(c("value", "gamma", "grad_norm_sq"), N = 100)
pois_CG <- CG(veg_pois$par, veg_pois$H, veg_pois$grad_H, cb = CG_tracer$tracer)
## n = 1: value = 1; gamma = NA; grad_norm_sq = 12737; 
## n = 100: value = -127.9; gamma = 0.018014; grad_norm_sq = 1.676; 
## n = 200: value = -128.28; gamma = 0.011529; grad_norm_sq = 2.5128; 
## n = 300: value = -128.55; gamma = 0.0037779; grad_norm_sq = 0.068176; 
## n = 400: value = -128.59; gamma = 0.022518; grad_norm_sq = 0.0028747; 
## n = 500: value = -128.59; gamma = 0.0092234; grad_norm_sq = 0.0652; 
## n = 600: value = -128.59; gamma = 0.018014; grad_norm_sq = 0.00020555; 
## n = 700: value = -128.59; gamma = 0.0019343; grad_norm_sq = 5.3952e-06; 
## n = 800: value = -128.59; gamma = 0.005903; grad_norm_sq = 3.7118e-06; 
## n = 900: value = -128.59; gamma = 0.0037779; grad_norm_sq = 1.0621e-06;
tail(summary(CG_tracer))
##         value       gamma grad_norm_sq    .time
## 899 -128.5894 0.005902958 5.214667e-06 11.65685
## 900 -128.5894 0.003777893 1.062092e-06 11.67087
## 901 -128.5894 0.068719477 2.119915e-05 11.67842
## 902 -128.5894 0.107374182 2.047029e-04 11.68518
## 903 -128.5894 0.004722366 7.056181e-06 11.70461
## 904 -128.5894 0.003777893 8.881615e-07 11.71881

For comparison, we use optim() with the conjugate gradient method.

system.time(
  pois_optim_CG <- optim(
    veg_pois$par, 
    veg_pois$H, 
    veg_pois$grad_H, 
    method = "CG", 
    control = list(maxit = 10000)
  )
)
##    user  system elapsed 
##  10.593   0.032  10.650
pois_optim_CG[c("value", "counts")]
## $value
## [1] -128.5895
## 
## $counts
## function gradient 
##    10674     4549

7.2.4 Peppered moths

Returning to the peppered moths from Section 6.2.1, recall that we implemented the log-likelihood for general multinomial cell collapsing in that section and applied the implementation to compute the maximum-likelihood estimate. In this section we implement the gradient as well. From the expression for the log-likelihood in (6.2) it follows that the gradient equals

\[\nabla \ell(\theta) = \sum_{j = 1}^{K_0} \frac{ x_j }{ M(p(\theta))_j}\nabla M(p(\theta))_j = \sum_{j = 1}^{K_0} \sum_{k \in A_j} \frac{ x_j}{ M(p(\theta))_j} \nabla p_k(\theta).\]

Letting \(j(k)\) be defined by \(k \in A_{j(k)}\) we see that the gradient can also be written as \[\nabla \ell(\theta) = \sum_{k=1}^K \frac{x_{j(k)}}{ M(p(\theta))_{j(k)}} \nabla p_k(\theta) = \mathbf{\tilde{x}}(\theta) \mathrm{D}p(\theta),\] where \(\mathrm{D}p(\theta)\) is the Jacobian of the parametrization \(\theta \mapsto p(\theta)\), and \(\mathbf{\tilde{x}}(\theta)\) is the vector with \[\mathbf{\tilde{x}}(\theta)_k = \frac{ x_{j(k)}}{M(p(\theta))_{j(k)}}.\]

grad_loglik <- function(par, x, prob, Dprob, group) {
  p <- prob(par)
  if(is.null(p)) return(rep(NA, length(par)))
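  # x[group] and M(p, group)[group] expand the observed counts and collapsed
  # probabilities to one entry per cell: x_{j(k)} and M(p(theta))_{j(k)}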
  - (x[group] / M(p, group)[group]) %*% Dprob(par)
}

The Jacobian needs to be implemented for the specific example of peppered moths.

Dprob <- function(p) {
  p[3] <- 1 - p[1] - p[2]
  matrix(
    c(2 * p[1],             0, 
      2 * p[2],             2 * p[1], 
      2 * p[3] - 2 * p[1], -2 * p[1],
      0,                    2 * p[2],         
      -2 * p[2],            2 * p[3] - 2 * p[2], 
      -2 * p[3],           -2 * p[3]),
    ncol = 2, nrow = 6, byrow = TRUE)
}
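
A finite-difference check of the Jacobian is a cheap safeguard against implementation errors. The sketch below assumes the prob() function from Section 6.2.1, which maps the two free parameters to the vector of six genotype probabilities.

par_test <- c(0.3, 0.3)
eps <- 1e-6
# Central difference approximation of each column of the Jacobian
D_num <- sapply(1:2, function(i) {
  e <- numeric(2)
  e[i] <- eps
  (prob(par_test + e) - prob(par_test - e)) / (2 * eps)
})
range(D_num - Dprob(par_test))  # should be numerically close to zero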

We can then use the conjugate gradient algorithm to compute the maximum-likelihood estimate.

optim(c(0.3, 0.3), loglik, grad_loglik, x = c(85, 196, 341), 
      prob = prob, Dprob = Dprob, group = c(1, 1, 1, 2, 2, 3), 
      method = "CG")
## $par
## [1] 0.07083691 0.18873652
## 
## $value
## [1] 600.481
## 
## $counts
## function gradient 
##       92       19 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

The peppered moth example is very simple. The log-likelihood can easily be computed, and we used this problem to illustrate ways of implementing a likelihood in R and how to use optim() to maximize it.

One of the likelihood implementations was very problem specific, while the other was more abstract and general, and we used the same general and abstract approach to implement the gradient above. The gradient could then be used by other optimization algorithms, still using optim(), such as conjugate gradient. In fact, you can use conjugate gradient without computing and implementing the gradient.

optim(c(0.3, 0.3), loglik, x = c(85, 196, 341), 
      prob = prob, group = c(1, 1, 1, 2, 2, 3), 
      method = "CG")
## $par
## [1] 0.07084109 0.18873718
## 
## $value
## [1] 600.481
## 
## $counts
## function gradient 
##      107       15 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

If we do not implement a gradient, a numerical gradient is used by optim(). This can result in a slower algorithm than if the gradient is implemented, but more seriously, it can result in convergence problems. This is because of a subtle tradeoff between the accuracy of the finite difference approximation of the gradient and the numerical rounding errors that the approximation amplifies. We did not experience convergence problems in the example above, but one way to remedy such problems is to set the parscale or fnscale entries in the control list argument to optim().
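
As a sketch of the latter remedy, one could rescale the problem as below; the scaling values are illustrative, not tuned.

optim(c(0.3, 0.3), loglik, x = c(85, 196, 341), 
      prob = prob, group = c(1, 1, 1, 2, 2, 3), 
      method = "CG",
      control = list(parscale = c(0.1, 0.1), fnscale = 100))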

In the following chapter the peppered moth example is used to illustrate the EM algorithm. It is important to understand that the EM algorithm does not rely on the ability to compute the likelihood or the gradient of the likelihood for that matter. In many real applications of the EM algorithm the computation of the likelihood is challenging or even impossible, thus most standard optimization algorithms will not be directly applicable.
