# Chapter 7 Numerical optimization

The main application of numerical optimization in statistics is the computation of parameter estimates, typically by maximizing the likelihood function or by optimizing another estimation criterion. The focus of this chapter is on optimization algorithms for (penalized) maximum likelihood estimation. Out of tradition we formulate all results and algorithms in terms of minimization rather than maximization.

The generic optimization problem considered is the minimization of
\(H : \Theta \to \mathbb{R}\) for \(\Theta \subseteq \mathbb{R}^p\) an
open set and \(H\) twice differentiable. In applications, \(H = -\ell\),
the negative log-likelihood function, or \(H = -\ell + J\), where \(J : \Theta \to \mathbb{R}\)
is a *penalty function*, likewise twice differentiable, that does not
depend upon data.
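As a concrete instance of this setup, here is a minimal sketch in R of \(H = -\ell + J\) for a Poisson model with rate \(e^\theta\), so that \(\Theta = \mathbb{R}\) is open. The data and the ridge penalty are hypothetical choices made purely for illustration.

```r
## Minimal sketch (assumed example): H = -loglik + J for a Poisson model
## with rate exp(theta), and ridge penalty J(theta) = pen * theta^2 / 2.
x <- c(2, 4, 3, 5, 1)    # hypothetical observations
pen <- 0.1               # penalty weight (illustrative choice)

H <- function(theta) {
  -sum(dpois(x, lambda = exp(theta), log = TRUE)) + pen * theta^2 / 2
}

## The gradient, derived analytically: -l'(theta) = n * exp(theta) - sum(x)
grad_H <- function(theta) {
  length(x) * exp(theta) - sum(x) + pen * theta
}
```

Implementing \(\nabla H\) analytically alongside \(H\) is worthwhile, since most of the algorithms considered below rely on gradient evaluations.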

Statistical optimization problems share some properties that we should pay attention to. The \(-\ell\) term in the objective function is a sum over the data, so the more data we have, the more computationally demanding it is to evaluate \(H\) and its derivatives. There are exceptions, when a sufficient statistic can be computed upfront, but in general we must expect run time to grow with data size. In addition, high precision in the computed (local) minimizer is typically unnecessary: once the numerical error is orders of magnitude smaller than the statistical uncertainty of the parameter estimate being computed, further optimization makes no relevant difference.

In fact, blindly pursuing the global maximum of the likelihood can lead you astray if your model is not well behaved. When \(H\) is not convex, as is the case for finite mixtures, there are common examples where the likelihood is unbounded, yet a local minimizer exists that is a good estimate. Though we typically phrase our algorithms as optimization algorithms, what we are really pursuing when \(H\) is not convex are stationary points: solutions to \(\nabla H(\theta) = 0\). Rather than being fixated on computing the global minimizer of \(H\) to the greatest numerical precision, we should instead explore the likelihood surface and find as many approximately stationary points as possible. In this respect, optimization in statistics differs from certain other applications of optimization. Finding the maximizer of the likelihood is not “the thing”. It is a surrogate for exploring model fits, and we should also remember to understand the statistical precision of our fitted model. As a minimum, we should quantify the uncertainty of (relevant aspects of) the fitted model that is due to the finite sample size.

For the generic minimization problem considered in this chapter, the practical
challenge when implementing algorithms in R is typically to implement efficient
evaluation of \(H\) and its derivatives, in particular of \(-\ell\). Several
standard optimization algorithms are possible choices, and some are already
implemented and available in R. For many practical purposes the BFGS algorithm
as implemented in the `optim()` function works well and requires only the
computation of gradients. It is, of course, paramount that \(H\) and
\(\nabla H\) are correctly implemented, and the efficiency of the algorithms is
largely determined by the efficiency of the implementation of \(H\) and
\(\nabla H\), but also by the choice of parametrization.
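To illustrate, here is a sketch of calling `optim()` with `method = "BFGS"` and a hand-coded gradient. The Gaussian model and the simulated data are chosen purely for illustration; the parametrization \(\theta = (\mu, \log\sigma)\) keeps the parameter set open.

```r
set.seed(1)
x <- rnorm(100, mean = 1, sd = 2)    # simulated example data

## H = -loglik for N(mu, sigma^2), parametrized as theta = (mu, log(sigma))
H <- function(theta) {
  -sum(dnorm(x, mean = theta[1], sd = exp(theta[2]), log = TRUE))
}
grad_H <- function(theta) {
  mu <- theta[1]; sigma <- exp(theta[2])
  c(-sum(x - mu) / sigma^2,                   # d/d(mu)
    length(x) - sum((x - mu)^2) / sigma^2)    # d/d(log(sigma))
}

fit <- optim(c(0, 0), H, gr = grad_H, method = "BFGS")
fit$par   # close to the MLE: (mean(x), log of the ML standard deviation)
```

Omitting the `gr` argument makes `optim()` fall back on finite-difference gradients, which is both slower and less accurate, so supplying an analytic gradient is usually worth the effort.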

For some optimization problems, Newton-type algorithms are preferable, and standard
implementations are available in R through `nlm()` and `nls()`, but with
different interfaces than `optim()`.
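For instance, `nlm()` accepts an objective function that carries its analytic gradient as an attribute of the returned value. A toy quadratic objective, chosen here purely for illustration, shows the interface:

```r
## Sketch of the nlm() interface: the gradient is supplied as an
## attribute of the objective value (toy quadratic for illustration)
H_nlm <- function(theta) {
  value <- (theta[1] - 1)^2 + 2 * (theta[2] + 0.5)^2
  attr(value, "gradient") <- c(2 * (theta[1] - 1), 4 * (theta[2] + 0.5))
  value
}

fit <- nlm(H_nlm, p = c(0, 0))
fit$estimate   # approximately c(1, -0.5)
```

Note that `nlm()` returns the minimizer in `fit$estimate` and the gradient at the minimizer in `fit$gradient`, whereas `optim()` uses `fit$par`; such interface differences are easy to trip over when switching between the two.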