## 7.3 Newton-type algorithms

The Newton algorithm is very similar to gradient descent except that the gradient descent direction is replaced by \[\rho_n = - D^2 H(\theta_n)^{-1} \nabla H(\theta_n).\]

The Newton algorithm is typically much more efficient than gradient descent and converges in a few iterations. However, the storage and computation of the \(p \times p\) Hessian scale like \(p^2\), and solving the linear system defining \(\rho_n\) scales like \(p^3\), which can make the algorithm useless for very large \(p\).

A variety of alternatives to the Newton algorithm exist that replace the Hessian by another matrix that can be easier to compute and update. It should be noted that if we choose a matrix \(B_n\) in the \(n\)-th iteration, then \(- B_n \nabla H(\theta_n)\) is a descent direction whenever \(B_n\) is a positive definite matrix.
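The claim is easy to verify numerically: the directional derivative of \(H\) at \(\theta\) in the direction \(-B\nabla H(\theta)\) is \(-\nabla H(\theta)^T B \nabla H(\theta)\), which is negative whenever \(B\) is positive definite. A small sketch with an arbitrary quadratic test function (all names here are ours, chosen for illustration):

```r
# Quadratic test function H(theta) = 0.5 * theta' A theta with A positive definite
A <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
H <- function(theta) 0.5 * sum(theta * (A %*% theta))
grad_H <- function(theta) drop(A %*% theta)

theta <- c(1, -2)
B <- diag(c(0.1, 3))               # any positive definite matrix will do
rho <- - drop(B %*% grad_H(theta)) # proposed descent direction

# The directional derivative is negative, so small steps decrease H
h_prime <- sum(grad_H(theta) * rho)
h_prime < 0                        # TRUE
H(theta + 1e-4 * rho) < H(theta)   # TRUE
```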

Below is a Newton implementation with backtracking line search; the `cb` argument accepts a callback, which is used for tracing.

```
Newton <- function(
  par,
  H,
  gr,
  hess,
  d = 0.8,         # backtracking shrinkage factor
  c = 0.1,         # sufficient-decrease constant
  gamma0 = 1,      # initial step length
  epsilon = 1e-10,
  maxit = 50,
  cb = NULL
) {
  for(i in 1:maxit) {
    value <- H(par)
    grad <- gr(par)
    if(!is.null(cb)) cb()
    if(sum(grad^2) <= epsilon) break
    Hessian <- hess(par)
    # Newton descent direction: solve D^2 H(par) rho = - grad H(par)
    rho <- - drop(solve(Hessian, grad))
    # Backtracking line search using the sufficient-decrease condition
    gamma <- gamma0
    par1 <- par + gamma * rho
    h_prime <- drop(t(grad) %*% rho)
    while(H(par1) > value + c * gamma * h_prime) {
      gamma <- d * gamma
      par1 <- par + gamma * rho
    }
    par <- par1
  }
  if(i == maxit)
    warning("Maximal number, ", maxit, ", of iterations reached")
  par
}
```

### 7.3.1 Poisson regression

We apply the Newton implementation to the Poisson regression example, using the implementation of the Hessian matrix in `veg_pois$Hessian_H`.
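For Poisson regression with the log-link, the Hessian of the negative log-likelihood has the closed form \(X^T W(\beta) X\) with \(W(\beta) = \mathrm{diag}(e^{x_i^T \beta})\). A sketch of how such an implementation could look (the names below are ours, not necessarily those used inside `veg_pois`):

```r
# Hessian of the Poisson negative log-likelihood with log-link:
# D^2 H(beta) = X^T diag(exp(X beta)) X
pois_hessian <- function(X) {
  force(X)
  function(beta) {
    w <- drop(exp(X %*% beta))  # Poisson means, one weight per observation
    crossprod(X, w * X)         # X^T W X without forming diag(w) explicitly
  }
}
```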

```
Newton_tracer <- tracer(c("value", "h_prime", "gamma"), N = 0)
pois_Newton <- Newton(
  veg_pois$par,
  veg_pois$H,
  veg_pois$grad_H,
  veg_pois$Hessian_H,
  cb = Newton_tracer$tracer
)
```

`## [1] -4.979266e-10 1.298776e-06`

```
## [,1]
## pois_Newton -128.58945047446991339
## pois_glm -128.58945047447093657
```

```
## value h_prime gamma .time
## 1 1.00000 NA NA 0.0000000
## 2 -14.83270 -4.140563e+03 0.022518 0.2219832
## 3 -64.81635 -4.029847e+02 0.262144 0.4208414
## 4 -111.33647 -7.636275e+01 1.000000 0.6083847
## 5 -124.24937 -2.104160e+01 1.000000 0.7919283
## 6 -127.71116 -5.652483e+00 1.000000 0.9836466
## 7 -128.49729 -1.332600e+00 1.000000 1.1579588
## 8 -128.58733 -1.647034e-01 1.000000 1.3379428
## 9 -128.58945 -4.159696e-03 1.000000 1.5120448
## 10 -128.58945 -3.288913e-06 1.000000 1.6771430
```

The R function `glm.fit()` uses a Newton algorithm (without backtracking) and is about a factor of five faster on this example.

```
env_pois <- environment(veg_pois$H)
system.time(glm.fit(env_pois$X, env_pois$y, family = poisson()))
```

```
## user system elapsed
## 0.392 0.008 0.400
```

One should be careful when comparing run times for different optimization algorithms, but in this case they have achieved about the same precision, with the faster `glm.fit()` even obtaining the smallest negative log-likelihood value of the two.

### 7.3.2 Quasi-Newton algorithms

We now turn to other descent direction algorithms. They are more efficient than gradient descent because they choose the descent direction in a more clever way, yet they are less computationally demanding than the Newton algorithm, which requires the computation of the full Hessian in each iteration.
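The BFGS algorithm maintains an approximation \(B_n\) of the inverse Hessian that is updated from gradient differences alone. A sketch of the rank-two update at the heart of the algorithm (our notation; the internals of `optim()` differ in details):

```r
# One BFGS update of an inverse Hessian approximation B, given the step
# s = theta_{n+1} - theta_n and the gradient difference
# y = grad H(theta_{n+1}) - grad H(theta_n).
bfgs_update <- function(B, s, y) {
  rho <- 1 / sum(y * s)                # requires the curvature condition s'y > 0
  I <- diag(length(s))
  V <- I - rho * s %*% t(y)
  V %*% B %*% t(V) + rho * s %*% t(s)  # stays positive definite when B is
}
```

The updated matrix satisfies the secant equation: multiplying it by `y` returns `s` exactly, so the approximation is consistent with the most recent curvature information.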

We will only consider the application of the BFGS algorithm via the implementation in the R function `optim()`.

```
system.time(
  pois_BFGS <- optim(
    veg_pois$par,
    veg_pois$H,
    veg_pois$grad_H,
    method = "BFGS",
    control = list(maxit = 10000)
  )
)
```

```
## user system elapsed
## 0.322 0.001 0.324
```

`## [1] -0.00954968 0.08481093`

```
## $value
## [1] -128.5894
##
## $counts
## function gradient
## 158 148
```

### 7.3.3 Sparsity

One of the benefits of the implementations of \(H\) and its derivatives as well as of the descent algorithms is that they can exploit sparsity of \(\mathbf{X}\) almost for free. The implementations have not done that in previous computations, because \(\mathbf{X}\) has been stored as a dense matrix. In reality, \(\mathbf{X}\) is a very sparse matrix (the vast majority of the matrix entries are zero), and if we convert it into a sparse matrix, all the matrix-vector products will be more run time efficient. Sparse matrices are implemented in the R package Matrix.

```
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
```
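As a sketch of such a conversion (with a small synthetic matrix; the names are ours), a dense matrix can be coerced to the compressed sparse column class `dgCMatrix`, after which matrix-vector products only touch the nonzero entries:

```r
library(Matrix)

# A mostly-zero dense matrix: one nonzero entry per row
X_dense <- matrix(0, 1000, 100)
X_dense[cbind(1:1000, sample(100, 1000, replace = TRUE))] <- 1

# Conversion to the compressed sparse column format
X_sparse <- as(X_dense, "CsparseMatrix")
class(X_sparse)  # "dgCMatrix"

# Matrix-vector products agree up to rounding
beta <- rnorm(100)
range(X_sparse %*% beta - X_dense %*% beta)
```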

Without changing any other code, we get an immediate run time improvement using e.g. `optim()` and the BFGS algorithm.

```
system.time(
  pois_BFGS_sparse <- optim(
    veg_pois$par,
    veg_pois$H,
    veg_pois$grad_H,
    method = "BFGS",
    control = list(maxit = 10000)
  )
)
```

```
## user system elapsed
## 0.108 0.002 0.110
```

In real applications we should avoid constructing a dense intermediate model matrix as a step toward constructing a sparse model matrix. This is possible by constructing the sparse model matrix directly using a function from the R package MatrixModels. Ideally, we should reimplement `pois_model()` to support an option for using sparse matrices, but to focus on the run time benefits of sparse matrices, we simply change the matrix in the appropriate environment directly.

```
env_pois$X <- MatrixModels::model.Matrix(
  ~ store + log(normalSale) - 1,
  data = vegetables,
  sparse = TRUE
)
class(env_pois$X)
```

```
## [1] "dsparseModelMatrix"
## attr(,"package")
## [1] "MatrixModels"
```

The Newton implementation benefits enormously from using sparse matrices because the bottleneck is the computation of the Hessian.
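The gain comes from the Hessian computation \(X^T W(\beta) X\): both the rescaling of the rows of \(X\) and the cross product only touch the nonzero entries when \(X\) is sparse. A sketch of the comparison with synthetic data (our names; the actual timings depend on the example and hardware):

```r
library(Matrix)

# Sparse design matrix: each row has a single nonzero factor dummy
n <- 2e4; p <- 100
j <- sample(p, n, replace = TRUE)
X_sparse <- sparseMatrix(i = 1:n, j = j, x = 1, dims = c(n, p))
X_dense <- as.matrix(X_sparse)

w <- runif(n)  # stand-in for the Poisson weights exp(X beta)

# The Hessian X^T W X, computed densely and sparsely
system.time(H_dense  <- crossprod(X_dense,  w * X_dense))
system.time(H_sparse <- crossprod(X_sparse, w * X_sparse))
max(abs(H_sparse - H_dense))  # the results agree up to rounding
```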

```
Newton_tracer <- tracer(c("value", "h_prime", "gamma"), N = 0)
pois_Newton <- Newton(
veg_pois$par,
veg_pois$H,
veg_pois$grad_H,
veg_pois$Hessian_H,
cb = Newton_tracer$tracer
)
summary(Newton_tracer)
```

```
## value h_prime gamma .time
## 1 1.00000 NA NA 0.0000000
## 2 -14.83270 -4.140563e+03 0.022518 0.2096308
## 3 -64.81635 -4.029847e+02 0.262144 0.4291613
## 4 -111.33647 -7.636275e+01 1.000000 0.6285711
## 5 -124.24937 -2.104160e+01 1.000000 0.8204904
## 6 -127.71116 -5.652483e+00 1.000000 1.0318102
## 7 -128.49729 -1.332600e+00 1.000000 1.2248038
## 8 -128.58733 -1.647034e-01 1.000000 1.4163189
## 9 -128.58945 -4.159696e-03 1.000000 1.6071219
## 10 -128.58945 -3.288913e-06 1.000000 1.8359039
```

To summarize the run times measured for the Poisson regression example: the conjugate gradient algorithms took on the order of 10 seconds to converge, while the Newton-type algorithms in this section were faster, taking between 0.3 and 1.7 seconds. The use of sparse matrices reduced the run time of the quasi-Newton BFGS algorithm by a factor of 3, but it reduced the run time of the Newton algorithm by a factor of 50, to about 0.03 seconds. One could be concerned that the construction of the sparse model matrix takes more time (which we did not measure above), but for this example it turns out to take about the same time to construct the sparse model matrix as the dense one.

Run time efficiency is not the only argument for using sparse matrices; they are also more memory efficient. Using dense intermediates is memory (and time) inefficient, and for truly large scale problems it is impossible. Using sparse model matrices for regression models allows us to work with larger models that have more variables, more factor levels and more observations than dense model matrices allow. For the Poisson regression model the memory used by either representation can be found.

```
## Sparse matrix memory usage:
## 123440 bytes
## Dense matrix memory usage:
## 3103728 bytes
```
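Numbers of this kind can be obtained with `object.size()`. A sketch with a synthetic matrix (the actual values depend on the model matrix):

```r
library(Matrix)

# Dense matrix with one nonzero entry per row, and its sparse counterpart
X_dense <- matrix(0, 2000, 200)
X_dense[cbind(1:2000, sample(200, 2000, replace = TRUE))] <- 1
X_sparse <- as(X_dense, "CsparseMatrix")

print(object.size(X_sparse), units = "Kb")
print(object.size(X_dense), units = "Kb")
```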

We see that the dense matrix uses around a factor of 30 more memory than the sparse representation. In this case it means using around 3 MB for storing the dense matrix instead of around 100 kB, which won't be a problem on a contemporary computer. However, going from using 3 GB for storing a matrix to using 100 MB could be the difference between not being able to work with the matrix on a standard laptop and running the computations with no problems. Using `model.Matrix()` makes it possible to construct sparse model matrices directly and avoid all dense intermediates.

The function `glm4()` from the MatrixModels package for fitting regression models can exploit sparse model matrices directly, and can thus be useful in cases where your model matrix becomes very large but sparse. There are two main applications where the model matrix becomes sparse. When you model the response using one or more factors, and possibly their interactions, the model matrix becomes particularly sparse if the factors have many levels. Another case is when you model the response via basis expansions of quantitative predictors and use basis functions with local support. The B-splines form an important example of such a basis with local support that results in a sparse model matrix.
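The factor case is easy to illustrate: `Matrix::sparse.model.matrix()` (like `model.Matrix(..., sparse = TRUE)` above) constructs the sparse model matrix directly, and a factor contributes one dummy column per level but only a single nonzero entry per row. A sketch with synthetic data (names are ours):

```r
library(Matrix)

# A factor with many levels gives a very sparse model matrix
d <- data.frame(f = factor(sample(letters, 5000, replace = TRUE)))
X <- sparse.model.matrix(~ f - 1, data = d)
dim(X)                    # 5000 rows, one column per level

nnz <- nnzero(X)          # one nonzero entry per row
nnz / prod(dim(X))        # density: 1 / number of levels
```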