7.3 Newton-type algorithms

The Newton algorithm is very similar to gradient descent except that the gradient descent direction is replaced by \[\rho_n = - D^2 H(\theta_n)^{-1} \nabla H(\theta_n).\]

The Newton algorithm is typically much more efficient than gradient descent and converges in a few iterations. However, the storage of the \(p \times p\) Hessian, its computation, and the solution of the linear system defining \(\rho_n\) all scale at least like \(p^2\), and this can make the algorithm useless for very large \(p\).
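To see why so few iterations can suffice, consider a quadratic objective \(H(\theta) = \frac{1}{2} \theta^T A \theta + b^T \theta\) with \(A\) positive definite. Then \(\nabla H(\theta) = A \theta + b\) and \(D^2 H(\theta) = A\), so the Newton update \(\theta_{n+1} = \theta_n + \rho_n\) gives \[\theta_n + \rho_n = \theta_n - A^{-1}(A \theta_n + b) = - A^{-1} b,\] which is the unique minimizer of \(H\). Under standard smoothness assumptions, the same mechanism yields rapid (quadratic) convergence for a non-quadratic \(H\) once \(\theta_n\) is sufficiently close to a local minimizer with a positive definite Hessian.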

A variety of alternatives to the Newton algorithm exist that replace the Hessian by another matrix that can be easier to compute and update. It should be noted that if we choose a matrix \(B_n\) in the \(n\)-th iteration, then \(- B_n \nabla H(\theta_n)\) is a descent direction whenever \(B_n\) is a positive definite matrix.
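To see this, note that the directional derivative of \(H\) at \(\theta_n\) in the direction \(- B_n \nabla H(\theta_n)\) is \[\nabla H(\theta_n)^T \big(- B_n \nabla H(\theta_n)\big) = - \nabla H(\theta_n)^T B_n \nabla H(\theta_n) < 0\] whenever \(\nabla H(\theta_n) \neq 0\), by the definition of positive definiteness. The Newton direction corresponds to \(B_n = D^2 H(\theta_n)^{-1}\), which is positive definite whenever the Hessian itself is.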

Newton implementation (with trace).

Newton <- function(
  par,              # initial parameter value
  H,                # objective function
  gr,               # gradient of H
  hess,             # Hessian of H
  d = 0.8,          # backtracking contraction factor
  c = 0.1,          # sufficient decrease parameter
  gamma0 = 1,       # initial step length
  epsilon = 1e-10,  # tolerance for the squared gradient norm
  maxit = 50,       # maximal number of iterations
  cb = NULL         # optional callback, e.g. a tracer
) {
  for(i in 1:maxit) {
    value <- H(par)
    grad <- gr(par)
    if(!is.null(cb)) cb()
    # Convergence criterion based on the squared norm of the gradient
    if(sum(grad^2) <= epsilon) break
    Hessian <- hess(par)
    # Newton direction: solve a linear system instead of inverting the Hessian
    rho <- - drop(solve(Hessian, grad))
    # Backtracking line search using the sufficient decrease condition
    gamma <- gamma0
    par1 <- par + gamma * rho
    h_prime <- drop(t(grad) %*% rho)  # directional derivative of H along rho
    while(H(par1) > value + c * gamma * h_prime) {
      gamma <- d * gamma
      par1 <- par + gamma * rho
    }
    par <- par1
  }
  if(i == maxit)
    warning("Maximal number, ", maxit, ", of iterations reached")
  par
}
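
As a quick sanity check of the implementation we can minimize a simple quadratic function, for which a single Newton step hits the minimizer exactly. The functions quad_H, quad_gr and quad_hess below are made up for this illustration and are not part of the Poisson regression example.

quad_H <- function(beta) sum((beta - 1:4)^2)  # minimized at beta = 1:4
quad_gr <- function(beta) 2 * (beta - 1:4)    # gradient of quad_H
quad_hess <- function(beta) diag(2, 4)        # constant Hessian, 2 times the identity
Newton(rep(0, 4), quad_H, quad_gr, quad_hess)
## [1] 1 2 3 4

The first iteration takes the full Newton step with \(\gamma = 1\), and the second iteration detects that the gradient is zero and stops.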

7.3.1 Poisson regression

We test the Newton implementation on the Poisson regression example, where in addition to \(H\) and its gradient we now also use the implementation of the Hessian, veg_pois$Hessian_H.

Newton_tracer <- tracer(c("value", "h_prime", "gamma"), N = 0)
pois_Newton <- Newton(
  veg_pois$par, 
  veg_pois$H, 
  veg_pois$grad_H, 
  veg_pois$Hessian_H, 
  cb = Newton_tracer$tracer
)
range(pois_Newton - coefficients(pois_model))
## [1] -4.979266e-10  1.298776e-06
rbind(
  pois_Newton = veg_pois$H(pois_Newton),
  pois_glm = veg_pois$H(coefficients(pois_model))
)
##                               [,1]
## pois_Newton -128.58945047446991339
## pois_glm    -128.58945047447093657
summary(Newton_tracer)
##         value       h_prime    gamma     .time
## 1     1.00000            NA       NA 0.0000000
## 2   -14.83270 -4.140563e+03 0.022518 0.2219832
## 3   -64.81635 -4.029847e+02 0.262144 0.4208414
## 4  -111.33647 -7.636275e+01 1.000000 0.6083847
## 5  -124.24937 -2.104160e+01 1.000000 0.7919283
## 6  -127.71116 -5.652483e+00 1.000000 0.9836466
## 7  -128.49729 -1.332600e+00 1.000000 1.1579588
## 8  -128.58733 -1.647034e-01 1.000000 1.3379428
## 9  -128.58945 -4.159696e-03 1.000000 1.5120448
## 10 -128.58945 -3.288913e-06 1.000000 1.6771430

The R function glm.fit() uses iteratively reweighted least squares, which for this model amounts to a Newton algorithm without backtracking, and it is about a factor of five faster on this example.

env_pois <- environment(veg_pois$H)
system.time(glm.fit(env_pois$X, env_pois$y, family = poisson()))
##    user  system elapsed 
##   0.392   0.008   0.400

One should be careful when comparing run times of different optimization algorithms, but in this case the two solutions have about the same precision, and the faster glm.fit() even attained the smaller of the two negative log-likelihood values.

7.3.2 Quasi-Newton algorithms

We now turn to other descent direction algorithms that are more efficient than gradient descent because they choose the descent direction in a more clever way, yet are less computationally demanding than the Newton algorithm, which requires computing the full Hessian in every iteration. Quasi-Newton algorithms achieve this by maintaining an approximation of the Hessian, or of its inverse, that is updated in each iteration using only differences of parameters and gradients.

We will only consider the application of the BFGS algorithm via the implementation in the R function optim().
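
To convey the idea, the following sketch shows the BFGS rank-two update of an approximate inverse Hessian \(B\) from a parameter step \(s = \theta_{n+1} - \theta_n\) and the corresponding gradient change \(y = \nabla H(\theta_{n+1}) - \nabla H(\theta_n)\). It is only an illustration, not the code used internally by optim(), and the function name BFGS_update is made up for the occasion.

BFGS_update <- function(B, s, y) {
  # Rank-two BFGS update of an approximate inverse Hessian B,
  # with s the parameter step and y the corresponding gradient change
  rho <- 1 / sum(y * s)
  I <- diag(length(s))
  (I - rho * s %*% t(y)) %*% B %*% (I - rho * y %*% t(s)) + rho * s %*% t(s)
}

The quasi-Newton descent direction is then \(- B \nabla H(\theta_n)\), so neither the Hessian nor the solution of a linear system is needed. The implementation in optim() handles this bookkeeping internally, and we simply supply the objective and its gradient.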

system.time(
  pois_BFGS <- optim(
    veg_pois$par, 
    veg_pois$H, 
    veg_pois$grad_H, 
    method = "BFGS", 
    control = list(maxit = 10000)
  )
)
##    user  system elapsed 
##   0.322   0.001   0.324
range(pois_BFGS$par - coefficients(pois_model))
## [1] -0.00954968  0.08481093
pois_BFGS[c("value", "counts")]
## $value
## [1] -128.5894
## 
## $counts
## function gradient 
##      158      148

7.3.3 Sparsity

One of the benefits of the implementations of \(H\) and its derivatives, as well as of the descent algorithms, is that they can exploit sparsity of \(\mathbf{X}\) almost for free. The previous computations did not do so because \(\mathbf{X}\) was stored as a dense matrix. In reality, \(\mathbf{X}\) is a very sparse matrix (the vast majority of its entries are zero), and if we convert it to a sparse matrix representation, all the matrix-vector products become faster. Sparse matrices are implemented in the R package Matrix.

library(Matrix)
env_pois$X <- Matrix(env_pois$X)
class(env_pois$X)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"

Without changing any other code, we get an immediate run time improvement using e.g. optim() and the BFGS algorithm.

system.time(
  pois_BFGS_sparse <- optim(
    veg_pois$par, 
    veg_pois$H, 
    veg_pois$grad_H, 
    method = "BFGS", 
    control = list(maxit = 10000)
  )
)
##    user  system elapsed 
##   0.108   0.002   0.110

In real applications we should avoid constructing a dense intermediate model matrix as a step toward constructing a sparse one. This is possible because the R package MatrixModels provides a function for constructing the sparse model matrix directly. Ideally, we would reimplement pois_model() with an option for using sparse matrices, but to focus on the run time benefits of sparse matrices we simply replace the matrix in the appropriate environment directly.

env_pois$X <- MatrixModels::model.Matrix(
  ~ store + log(normalSale) - 1, 
  data = vegetables, 
  sparse = TRUE
)
class(env_pois$X)
## [1] "dsparseModelMatrix"
## attr(,"package")
## [1] "MatrixModels"

The Newton implementation benefits enormously from using sparse matrices because the bottleneck is the computation of the Hessian.

Newton_tracer <- tracer(c("value", "h_prime", "gamma"), N = 0)
pois_Newton <- Newton(
  veg_pois$par, 
  veg_pois$H, 
  veg_pois$grad_H, 
  veg_pois$Hessian_H, 
  cb = Newton_tracer$tracer
)
summary(Newton_tracer)
##         value       h_prime    gamma     .time
## 1     1.00000            NA       NA 0.0000000
## 2   -14.83270 -4.140563e+03 0.022518 0.2096308
## 3   -64.81635 -4.029847e+02 0.262144 0.4291613
## 4  -111.33647 -7.636275e+01 1.000000 0.6285711
## 5  -124.24937 -2.104160e+01 1.000000 0.8204904
## 6  -127.71116 -5.652483e+00 1.000000 1.0318102
## 7  -128.49729 -1.332600e+00 1.000000 1.2248038
## 8  -128.58733 -1.647034e-01 1.000000 1.4163189
## 9  -128.58945 -4.159696e-03 1.000000 1.6071219
## 10 -128.58945 -3.288913e-06 1.000000 1.8359039

To summarize the run times measured for the Poisson regression example: the conjugate gradient algorithms took on the order of 10 seconds to converge, while the Newton-type algorithms in this section were faster and took between 0.3 and 1.7 seconds. The use of sparse matrices reduced the run time of the quasi-Newton BFGS algorithm by a factor of about three, but it reduced the run time of the Newton algorithm by a factor of about 50, to roughly 0.03 seconds. One could be concerned that constructing the sparse model matrix takes longer than constructing the dense one, but measurements (not shown here) indicate that for this example the two constructions take about the same time.
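
This can be checked directly. A minimal sketch of such a comparison is shown below, assuming that the dense model matrix is constructed with model.matrix() and the same formula; the output is omitted since the timings depend on the machine.

system.time(
  model.matrix(~ store + log(normalSale) - 1, data = vegetables)
)
system.time(
  MatrixModels::model.Matrix(
    ~ store + log(normalSale) - 1, 
    data = vegetables, 
    sparse = TRUE
  )
)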

Run time efficiency is not the only argument for using sparse matrices as they are also more memory efficient. It is memory (and time) inefficient to use dense intermediates, and for truly large scale problems impossible. Using sparse model matrices for regression models allows us to work with larger models that have more variables, more factor levels and more observations than if we use dense model matrices. For the Poisson regression model the memory used by either representation can be found.

object.size(env_pois$X)             # memory usage of the sparse matrix
## 123440 bytes
object.size(as.matrix(env_pois$X))  # memory usage of the dense matrix
## 3103728 bytes

We see that the dense matrix uses about 25 times more memory than the sparse representation. In this case that means storing around 3 MB instead of around 120 kB, which is not a problem on a contemporary computer. However, going from 3 GB to just over 100 MB for storing a matrix could be the difference between not being able to work with the matrix on a standard laptop and running the computations without problems. Using model.Matrix() makes it possible to construct sparse model matrices directly and to avoid all dense intermediates.

The function glm4() from the MatrixModels package for fitting regression models can exploit sparse model matrices directly and can thus be useful when the model matrix becomes very large but sparse. There are two main situations in which the model matrix becomes sparse. One is when the response is modeled using one or more factors, possibly with their interactions; the model matrix then becomes particularly sparse if the factors have many levels. The other is when the response is modeled via basis expansions of quantitative predictors using basis functions with local support. B-splines are an important example of such a basis with local support that results in a sparse model matrix.
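
As an illustration, glm4() could be called on the Poisson regression example roughly as follows. This is only a sketch: the response column in vegetables is assumed here to be called sale, and this name should be replaced by the actual response variable.

pois_glm4 <- MatrixModels::glm4(
  sale ~ store + log(normalSale) - 1,  # 'sale' as response is an assumption
  family = poisson(), 
  data = vegetables, 
  sparse = TRUE
)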