1.3 Optimization

The third part of the book is on optimization. This is a huge research field in itself and the focus of the book is on its relatively narrow application to parameter estimation in statistics. That is, when a statistical model is given by a parametrized family of probability distributions, optimization of some criterion – often the likelihood function – is used to find a model that fits the data.

Classical and generic optimization algorithms based on first and second order derivatives can often be used, but it is important to understand how convergence can be measured and monitored, and how the different algorithms scale with the size of the data set and the dimension of the parameter space. The choice of the right data structure, such as sparse matrices, can be pivotal for an efficient implementation of the criterion function that is to be optimized.
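
As a small illustration of the point about data structures, the sketch below (assuming the Matrix package is available; the object names are just for illustration) compares the memory footprint of a diagonal matrix stored in dense and in sparse form. For criterion functions involving large, mostly zero matrices, such savings translate directly into faster evaluations.

library(Matrix)

n <- 2000
X_dense  <- diag(n)      # Dense storage: all n^2 entries, most of them zero
X_sparse <- Diagonal(n)  # Sparse storage: only the diagonal structure is kept

object.size(X_dense)     # Roughly n^2 * 8 bytes
object.size(X_sparse)    # A few kilobytes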

For a mixture of von Mises distributions it is possible to compute the likelihood and optimize it using standard algorithms. However, it can also be optimized using the EM-algorithm, which is a clever algorithm for optimizing likelihood functions for models that can be formulated in terms of latent variables.

This section demonstrates how to fit a mixture of von Mises distributions to the angle data using the EM-algorithm as implemented in the R package movMF. This will illustrate some of the many practical choices we have to make when using numerical optimization such as: the choice of starting value; the stopping criterion; the precise specification of the steps that the algorithm should use; and even whether it should use randomized steps.

1.3.1 The EM-algorithm

The movMF() function implements the EM-algorithm for mixtures of von Mises distributions. The code below shows one example call of movMF() with one of the algorithmic control arguments specified. It is common in R that such arguments are all bundled together in a single list argument called control. This can make the function call appear a bit more complicated than it needs to be, but it allows for easier reuse of the – sometimes fairly long – list of control arguments. Note that as the movMF package works with data as elements on the unit circle, the angle data must be transformed using cos and sin.

library(movMF)

psi_circle <- cbind(cos(phipsi$psi), sin(phipsi$psi))  # Angles as points on the unit circle
vM_fit <- movMF(
  x = psi_circle,
  k = 2,              # The number of mixture components
  control = list(
    start = "S"       # Determines how starting values are chosen
  )
)

The function movMF() returns an object of class movMF, as the documentation says (see ?movMF). In this case it means that vM_fit is a list with a class label movMF, which controls how generic functions work for this particular list. When the object is printed, for instance, some of the content of the list is formatted and printed as follows.

vM_fit
## theta:
##        [,1]      [,2]
## 1  3.472846 -1.935807
## 2 -4.508098  3.872847
## alpha:
## [1] 0.5486586 0.4513414
## L:
## [1] 193.3014

What we see above are the estimated \(\theta\)-parameters for each of the two mixture components printed as a matrix, the mixture proportions (alpha), and the value of the log-likelihood function (L) at the estimated parameters.
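
These printed quantities are ordinary components of the returned list and can be extracted directly for further computations, as in the short example below.

class(vM_fit)    # The class label that controls how generics dispatch
vM_fit$theta     # Estimated theta-parameters, one row per mixture component
vM_fit$alpha     # Estimated mixture proportions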

We can compare the fitted model to the data using the density function as implemented above and the parameters estimated by the EM-algorithm.

hist(phipsi$psi, breaks = seq(-pi, pi, length.out = 15), prob = TRUE)
rug(phipsi$psi)
# Kernel density estimate of the angle data
density(phipsi$psi, bw = "SJ", cut = 0) %>% lines(col = "red", lwd = 2)
# Fitted two-component von Mises mixture density
curve(dvM(x, vM_fit$theta, vM_fit$alpha[1]), add = TRUE, col = "blue", lwd = 2)

As we can see from the figure, this looks like a fairly good fit to the data, and this is reassuring for two reasons. First, it shows that the two-component mixture of von Mises distributions is a good model. This is a statistical reassurance. Second, it shows that the optimization over the parameter space found a good fit. This is a numerical reassurance. If we optimize over a parameter space and subsequently find that the model does not fit the data well, it is either because the numerical optimization failed or because the model is wrong.

Numerical optimization can fail for a number of reasons. For one, the algorithm can get stuck in a local optimum, but it can also stop prematurely, either because the maximal number of iterations is reached or because the stopping criterion is fulfilled even though the algorithm has not reached an optimum. Some information related to the last two problems can be gathered from the algorithm itself, and for movMF() such information is found in a list entry of vM_fit.

vM_fit$details
## $reltol
## [1] 1.490116e-08
## 
## $iter
##    iter maxiter 
##      16     100 
## 
## $logLiks
## [1] 193.3014
## 
## $E
## [1] "softmax"
## 
## $kappa
## NULL
## 
## $minalpha
## [1] 0
## 
## $converge
## [1] TRUE

We see that the number of iterations used was 16 (less than the maximal number of 100), and that the stopping criterion based on small relative improvement was active (converge is TRUE; if it were FALSE, the algorithm would run for a fixed number of iterations). The tolerance parameter used for the stopping criterion was \(1.49 \times 10^{-8}\).

The small relative improvement criterion for stopping in iteration \(n\) is \[ |L(\theta_{n-1}) - L(\theta_n)| < \varepsilon (|L(\theta_{n-1})| + \varepsilon)\] where \(L\) is the log-likelihood and \(\varepsilon\) is the tolerance parameter above. The default tolerance parameter is the square root of the machine epsilon, see also ?.Machine. This is a commonly encountered default for a “small-but-not-too-small-number”, but outside of numerical differentiation this default may not be supported by much theory.
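
To make the criterion concrete, here is a small sketch that spells it out as an R function (the name rel_stop and its interface are just illustrative, not part of movMF) and confirms that the default tolerance equals the square root of the machine epsilon.

rel_stop <- function(L_old, L_new, eps = sqrt(.Machine$double.eps)) {
  # TRUE when the improvement from L_old to L_new is small relative to |L_old|
  abs(L_old - L_new) < eps * (abs(L_old) + eps)
}

sqrt(.Machine$double.eps)  # The default tolerance
## [1] 1.490116e-08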

By choosing different control arguments we can change how the numerical optimization proceeds. Below, a different method for choosing the starting values is used, one that contains a random component, and we consider the results of four different runs of the entire algorithm with different random starting values. We also decrease the maximal number of iterations to 10 and make the algorithm print out information about its progress along the way.

vM_control <- list(
  verbose = TRUE,   # Print output showing algorithmic progress
  maxiter = 10,     # The maximal number of iterations
  nruns = 4         # Effectively 4 runs with randomized starting values
)
vM_fit <- movMF(psi_circle, 2, control = vM_control)
## Run: 1
## Iteration: 0 *** L: 158.572
## Iteration: 1 *** L: 190.999
## Iteration: 2 *** L: 193.118
## Iteration: 3 *** L: 193.238
## Iteration: 4 *** L: 193.279
## Iteration: 5 *** L: 193.293
## Iteration: 6 *** L: 193.299
## Iteration: 7 *** L: 193.3
## Iteration: 8 *** L: 193.301
## Iteration: 9 *** L: 193.301
## Run: 2
## Iteration: 0 *** L: 148.59
## Iteration: 1 *** L: 188.737
## Iteration: 2 *** L: 192.989
## Iteration: 3 *** L: 193.197
## Iteration: 4 *** L: 193.264
## Iteration: 5 *** L: 193.288
## Iteration: 6 *** L: 193.297
## Iteration: 7 *** L: 193.3
## Iteration: 8 *** L: 193.301
## Iteration: 9 *** L: 193.301
## Run: 3
## Iteration: 0 *** L: 168.643
## Iteration: 1 *** L: 189.946
## Iteration: 2 *** L: 192.272
## Iteration: 3 *** L: 192.914
## Iteration: 4 *** L: 193.157
## Iteration: 5 *** L: 193.249
## Iteration: 6 *** L: 193.282
## Iteration: 7 *** L: 193.295
## Iteration: 8 *** L: 193.299
## Iteration: 9 *** L: 193.301
## Run: 4
## Iteration: 0 *** L: 4.43876
## Iteration: 1 *** L: 5.33851
## Iteration: 2 *** L: 6.27046
## Iteration: 3 *** L: 8.54826
## Iteration: 4 *** L: 14.4429
## Iteration: 5 *** L: 29.0476
## Iteration: 6 *** L: 61.3225
## Iteration: 7 *** L: 116.819
## Iteration: 8 *** L: 172.812
## Iteration: 9 *** L: 190.518

In all four cases it appears that the algorithm is approaching the same value of the log-likelihood as the one we found above, though the last run starts out at a much lower value and takes more iterations to reach a large log-likelihood value. Note also that in all runs the log-likelihood is increasing. It is a feature of the EM-algorithm that every iteration increases the likelihood (or at least never decreases it).
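
As a small check of this property we can, for instance, type in the log-likelihood trace from Run 1 above and confirm that it never decreases.

logL_run1 <- c(158.572, 190.999, 193.118, 193.238, 193.279,
               193.293, 193.299, 193.3, 193.301, 193.301)
all(diff(logL_run1) >= 0)  # The trace is non-decreasing
## [1] TRUE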

Variations of the EM-algorithm are possible, as in movMF, where the control argument E determines the so-called E-step of the algorithm. Here the default (softmax) gives the actual EM-algorithm, whereas hardmax and stochmax give alternatives. While the alternatives do not guarantee that the log-likelihood increases in every iteration, they can be numerically beneficial. We rerun the algorithm from the same four starting values as above but using hardmax as the E-step. Note how we can reuse the control list by just adding a single element to it.

vM_control$E <- "hardmax"
vM_fit <- movMF(psi_circle, 2, control = vM_control)
## Run: 1
## Iteration: 0 *** L: 158.572
## Iteration: 1 *** L: 193.052
## Iteration: 2 *** L: 193.052
## Run: 2
## Iteration: 0 *** L: 148.59
## Iteration: 1 *** L: 192.443
## Iteration: 2 *** L: 192.812
## Iteration: 3 *** L: 193.052
## Iteration: 4 *** L: 193.052
## Run: 3
## Iteration: 0 *** L: 168.643
## Iteration: 1 *** L: 191.953
## Iteration: 2 *** L: 192.812
## Iteration: 3 *** L: 193.052
## Iteration: 4 *** L: 193.052
## Run: 4
## Iteration: 0 *** L: 4.43876
## Iteration: 1 *** L: 115.34
## Iteration: 2 *** L: 191.839
## Iteration: 3 *** L: 192.912
## Iteration: 4 *** L: 192.912

It is striking that all runs now stopped before the maximal number of iterations was reached, and run four is particularly noteworthy as it jumps from its low starting value to above 190 in just two iterations. However, hardmax is a heuristic algorithm, whose fixed points are not necessarily stationary points of the log-likelihood. We can also see above that all four runs stopped at values that are clearly smaller than the log-likelihood of about 193.30 that was reached using the real EM-algorithm. Whether this is a problem from a statistical viewpoint is a different matter; using hardmax could give an estimator that is just as efficient as the maximum likelihood estimator.

Which optimization algorithm should we then use? This is in general a very difficult question to answer, and it is non-trivial to correctly assess which algorithm is “the best”. As the application of movMF() above illustrates, optimization algorithms may have a number of different parameters that can be tweaked, and when it comes to actually implementing an optimization algorithm we need to: make it easy to tweak parameters; investigate and quantify the effect of the parameters on the algorithm; and choose sensible defaults. Without this work it is impossible to have a meaningful discussion about the benefits and drawbacks of various algorithms.

1.3.2 Large scale optimization

Numerical optimization is the tool that has made statistical models useful for real data analysis – most importantly by making it possible to compute maximum likelihood estimators in practice. This works very well when the model can be given a parametrization with a concave log-likelihood, while optimization of more complicated log-likelihood surfaces can be numerically challenging and lead to statistically dubious results.

In contemporary statistics and machine learning, numerical optimization has come to play an even more important role as structural model constraints are now often part of the optimization themselves, for instance via penalty terms in the criterion function. Instead of working with simple models selected for each specific problem and with few parameters, we use complex models with thousands or millions of parameters. They are flexible and adaptable but need careful, regularized optimization to avoid overfitting the data. With large amounts of data the regularization can be turned down, and we can discover aspects of the data that the simple models would never show. However, we need to pay a computational price.
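
A canonical example of such a penalty term (not tied to movMF) is \(\ell_1\)-penalized maximum likelihood estimation, where instead of just maximizing the log-likelihood \(\ell\) we minimize \[ -\ell(\theta) + \lambda \|\theta\|_1, \] with a tuning parameter \(\lambda \geq 0\) that controls the amount of regularization; turning the regularization down amounts to letting \(\lambda\) tend toward zero.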

When the dimension of the parameter space, \(p\), and the number of data points, \(n\), are both large, evaluation of the log-likelihood or its gradient can become prohibitively costly. Optimization algorithms based on higher order derivatives are then completely out of the question, and even if the (penalized) log-likelihood and its gradient can be computed in \(O(pn)\)-operations this may still be too much for an optimization algorithm that does this in each iteration. Such large scale optimization problems have spurred a substantial development of stochastic gradient methods and related stochastic optimization algorithms that only use parts of data in each iteration, and which are currently the only way to fit sufficiently complicated models to data.
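
A minimal sketch of the core idea is given below: plain mini-batch stochastic gradient descent for a generic differentiable criterion, where grad_loss, the batch size and the fixed step size gamma are all illustrative assumptions rather than features of any particular package.

sgd <- function(grad_loss, par0, data, batch_size = 32,
                n_epochs = 10, gamma = 0.01) {
  # grad_loss(par, data_subset) must return the gradient of the (penalized)
  # loss evaluated on a subset of the data; par0 is the initial parameter.
  par <- par0
  n <- nrow(data)
  for (epoch in seq_len(n_epochs)) {
    # Randomly partition the observations into mini-batches
    batches <- split(sample(n), ceiling(seq_len(n) / batch_size))
    for (idx in batches) {
      # Each update uses the gradient on a small subset of the data only
      par <- par - gamma * grad_loss(par, data[idx, , drop = FALSE])
    }
  }
  par
}

In practice the step size would typically decay over the iterations, and many refinements exist (momentum, adaptive step sizes, variance reduction), but the defining feature remains that each update touches only a fraction of the data.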