7.1 Algorithms and convergence

A numerical optimization algorithm computes from an initial value \(\theta_0 \in \Theta\) a sequence \(\theta_1, \theta_2, \ldots \in \Theta\). One could hope that \[\theta_n \rightarrow \text{arg min}_{\theta} H(\theta)\] for \(n \to \infty\), but much less can typically be guaranteed. First, the global minimizer may not exist or may not be unique, in which case the convergence itself is undefined or ambiguous. Second, \(\theta_n\) can, in general, at best be shown to converge to a local minimizer. Third, \(\theta_n\) may not even converge, while \(H(\theta_n)\) may still converge to a local minimum.

This section will give a brief introduction to convergence analysis of optimization algorithms. We will see what kinds of conditions on \(H\) can be used to show convergence results, and some of the basic proof techniques. We only scratch the surface here, with the intention of motivating the algorithms introduced in subsequent sections and chapters as well as the empirical techniques introduced below for practical assessment of convergence.

Indeed, theory rarely provides us with sharp quantitative results on the rate of convergence and we will need computational techniques to monitor and measure convergence of algorithms in practice. Otherwise we cannot compare the efficiency of different algorithms.

7.1.1 Descent algorithms

Example 7.1 Suppose that \(D^2H(\theta)\) has numerical radius uniformly bounded by \(L\), that is, \[|\gamma^T D^2H(\theta) \gamma| \leq L \|\gamma\|_2^2\] for all \(\theta \in \Theta\) and \(\gamma \in \mathbb{R}^p\). Define an algorithm by \[\theta_{n} = \theta_{n-1} - \frac{1}{L + 1} \nabla H(\theta_{n-1}).\] Fixing \(n\), there is, by Taylor’s theorem, a \(\tilde{\theta} = \alpha \theta_{n} + (1- \alpha)\theta_{n-1}\) (where \(\alpha \in [0,1]\)) on the line segment between \(\theta_n\) and \(\theta_{n-1}\) such that

\[\begin{align*} H(\theta_n) & = H(\theta_{n-1}) - \frac{1}{L+1} \|\nabla H(\theta_{n-1})\|_2^2 + \frac{1}{2(L+1)^2} \nabla H(\theta_{n-1})^T D^2H(\tilde{\theta}) \nabla H(\theta_{n-1}) \\ & \leq H(\theta_{n-1}) - \frac{1}{L+1} \|\nabla H(\theta_{n-1})\|_2^2 + \frac{L}{2(L+1)^2} \|\nabla H(\theta_{n-1})\|_2^2 \\ & = H(\theta_{n-1}) - \frac{L+2}{2(L+1)^2} \|\nabla H(\theta_{n-1})\|_2^2 \\ & \leq H(\theta_{n-1}) - \frac{1}{(L+1)^2} \|\nabla H(\theta_{n-1})\|_2^2, \end{align*}\] where the last inequality uses that \(L + 2 \geq 2\) since \(L \geq 0\).

This shows that \(H(\theta_n) \leq H(\theta_{n-1})\), and if \(\theta_{n-1}\) is not a stationary point, \(H(\theta_n) < H(\theta_{n-1})\). That is, the algorithm will produce a sequence with non-increasing \(H\)-values, and unless it hits a stationary point the \(H\)-values will be strictly decreasing. The algorithm is an example of a gradient descent algorithm.
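
To make the scheme concrete, here is a minimal sketch of the gradient descent algorithm from Example 7.1 in Python. The function name `grad_descent` and its interface are our illustrative choices and not part of the example; the step size \(1/(L+1)\) is the one analyzed above.

```python
import numpy as np

def grad_descent(grad_H, theta0, L, n_iter=100):
    """Gradient descent with the fixed step size 1 / (L + 1) from Example 7.1.

    grad_H  : function computing the gradient of H
    theta0  : initial parameter value
    L       : uniform bound on the numerical radius of D^2 H
    n_iter  : number of iterations
    """
    theta = np.asarray(theta0, dtype=float)
    iterates = [theta]
    for _ in range(n_iter):
        theta = theta - grad_H(theta) / (L + 1.0)
        iterates.append(theta)
    return np.array(iterates)  # one iterate per row, including theta0
```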

In general, we define a descent algorithm to be an algorithm for which \[H(\theta_0) \geq H(\theta_1) \geq H(\theta_2) \geq \ldots.\] If the inequalities are strict unless some \(\theta_i\) is a local minimizer, the algorithm is called a strict descent algorithm. The gradient descent algorithm in Example 7.1 is a strict descent algorithm. However, even for a strict descent algorithm, \(H\) may just descend in smaller and smaller steps without converging toward a local minimum, even if \(H\) is bounded below.

Suppose now that \(H\) is level bounded, meaning that the closed set \[\mathrm{lev}(\theta_0) = \{\theta \in \Theta \mid H(\theta) \leq H(\theta_0)\}\] is bounded (and thereby compact). Then \(H\) is bounded from below and \(H(\theta_n)\) is convergent for any descent algorithm. Restricting attention to the gradient descent algorithm, we see that

\[\begin{align} H(\theta_n) & = (H(\theta_n) - H(\theta_{n-1})) + (H(\theta_{n-1}) - H(\theta_{n-2})) + \cdots + (H(\theta_1) - H(\theta_0)) + H(\theta_0) \\ & \leq H(\theta_0) - \frac{1}{(L + 1)^2} \sum_{k=1}^n \|\nabla H(\theta_{k-1})\|_2^2. \end{align}\]

Because \(H\) is bounded below, this implies that \(\sum_{k=1}^{\infty} \|\nabla H(\theta_{k-1})\|_2^2 < \infty\) and hence \[\|\nabla H(\theta_{n})\|_2 \rightarrow 0\] for \(n \to \infty\). By compactness of \(\mathrm{lev}(\theta_0)\), \(\theta_n\) has a convergent subsequence with limit \(\theta_{\infty}\), and we conclude by continuity of \(\nabla H\) that \[\nabla H(\theta_{\infty}) = 0,\] so \(\theta_{\infty}\) is a stationary point. In fact, this holds for any limit point of the sequence. Since \(H\) attains its minimum over the compact set \(\mathrm{lev}(\theta_0)\), and a minimizer is necessarily a stationary point, it follows that if \(H\) has a unique stationary point in \(\mathrm{lev}(\theta_0)\), then \(\theta_{\infty}\) is a minimizer; and since all limit points of the sequence coincide with \(\theta_{\infty}\), \[\theta_n \rightarrow \theta_{\infty}\] for \(n \to \infty\).

To summarize, if \(D^2H(\theta)\) has uniformly bounded numerical radius, and if \(H\) is level bounded with a unique stationary point in \(\mathrm{lev}(\theta_0)\), then the gradient descent algorithm of Example 7.1 is a strict descent algorithm that converges toward that stationary point, which is the minimizer of \(H\) over \(\mathrm{lev}(\theta_0)\). A sufficient condition on \(H\) for this to hold is that the eigenvalues of the (symmetric) matrix \(D^2H(\theta)\) are contained in an interval \([l, L]\) with \(0 < l \leq L\) for all \(\theta\). In this case, \(H\) is a strongly convex function with a unique global minimizer.
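
As a small check of this summary, consider the strongly convex quadratic \(H(\theta) = \frac{1}{2}\theta^T A \theta\) with \(A\) symmetric positive definite, so that \(D^2H(\theta) = A\) and the bounds \(l\) and \(L\) can be computed explicitly. The matrix and starting point below are purely illustrative; the sketch reuses the hypothetical `grad_descent` function from above and verifies strict descent and convergence toward the unique minimizer \(\theta = 0\).

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])              # symmetric positive definite

def H(theta):
    return 0.5 * theta @ A @ theta

def grad_H(theta):
    return A @ theta

eigvals = np.linalg.eigvalsh(A)
l, L = eigvals[0], eigvals[-1]          # eigenvalue bounds 0 < l <= L

iterates = grad_descent(grad_H, [5.0, -3.0], L, n_iter=100)
H_values = np.array([H(theta) for theta in iterates])

print(np.all(np.diff(H_values) < 0))    # strict descent: True
print(np.linalg.norm(iterates[-1]))     # close to 0, the unique minimizer
```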

7.1.2 Maps and fixed points

Most algorithms take the form of an update scheme, which from a mathematical viewpoint is a map \(\Phi : \Theta \to \Theta\) such that \[\theta_n = \Phi(\theta_{n-1}) = \Phi \circ \Phi (\theta_{n-2}) = \Phi^{\circ n}(\theta_0).\] The gradient descent algorithm from Example 7.1 is given by the map \[\Phi_{\nabla}(\theta) = \theta - \frac{1}{L + 1} \nabla H(\theta).\] When the map \(\Phi\) is continuous and \(\theta_n \rightarrow \theta_{\infty}\) it follows that \[\Phi(\theta_n) \rightarrow \Phi(\theta_{\infty}).\] Since \(\Phi(\theta_n) = \theta_{n+1} \rightarrow \theta_{\infty}\) we see that \[\Phi(\theta_{\infty}) = \theta_{\infty}.\] That is, \(\theta_{\infty}\) is a fixed point of \(\Phi\). The gradient descent map, \(\Phi_{\nabla}\), has \(\theta\) as fixed point if and only if
\[\nabla H(\theta) = 0,\] that is, if and only if \(\theta\) is a stationary point.

We can use the observation above to flip the perspective. Instead of asking whether \(\theta_n\) converges to a local minimizer for a given algorithm, we can ask whether we can find a map \(\Phi: \Theta \to \Theta\) whose fixed points are local minimizers. If so, we can ask whether the iterates \(\Phi^{\circ n}(\theta_0)\) converge. Mathematics is full of fixed point theorems that: i) give conditions under which a map has a fixed point; and ii) in some cases guarantee that the iterates \(\Phi^{\circ n}(\theta_0)\) converge. The most prominent such fixed point theorem is Banach’s fixed point theorem. It states that if \(\Theta\) is complete (a closed subset of \(\mathbb{R}^p\), say) and \(\Phi\) is a contraction, that is, \[\| \Phi(\theta) - \Phi(\theta')\| \leq c \|\theta - \theta'\|\] for a constant \(c \in [0,1)\) (using any norm), then \(\Phi\) has a unique fixed point and \(\Phi^{\circ n}(\theta_0)\) converges to that fixed point for any starting point \(\theta_0 \in \Theta\).
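
Viewed this way, running an iterative algorithm is just repeated application of the map \(\Phi\). A generic sketch of such a fixed point iteration is given below; the function name `fixed_point_iterate` is ours, and the commented lines show how the gradient descent map from Example 7.1 fits into this framework.

```python
import numpy as np

def fixed_point_iterate(Phi, theta0, n_iter=100):
    """Compute the iterates theta_n = Phi(theta_{n-1}), starting from theta0."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        theta = Phi(theta)
    return theta

# The gradient descent map from Example 7.1 as a special case:
# Phi_grad = lambda theta: theta - grad_H(theta) / (L + 1.0)
# theta_hat = fixed_point_iterate(Phi_grad, theta0)
```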

We will show that \(\Phi_{\nabla}\) is a contraction on \(\mathrm{lev}(\theta_0)\) under the assumption that the eigenvalues of \(D^2H(\theta)\) for all \(\theta \in \mathrm{lev}(\theta_0)\) are contained in an interval \([l, L]\) with \(0 < l \leq L\). If \(\theta, \theta' \in \mathrm{lev}(\theta_0)\) we find by Taylor’s theorem that \[\nabla H(\theta) = \nabla H(\theta') + D^2H(\tilde{\theta})(\theta - \theta')\] for some \(\tilde{\theta}\) on the line between \(\theta\) and \(\theta'\). For the gradient descent map this gives that

\[\begin{align*} \|\Phi_{\nabla}(\theta) - \Phi_{\nabla}(\theta')\|_2 & = \left\|\theta - \theta' - \frac{1}{L+1}\left(\nabla H(\theta) - \nabla H(\theta')\right)\right\|_2 \\ & = \left\|\theta - \theta' - \frac{1}{L+1}\left( D^2H(\tilde{\theta})(\theta - \theta')\right)\right\|_2 \\ & = \left\|\left(I - \frac{1}{L+1} D^2H(\tilde{\theta}) \right) (\theta - \theta')\right\|_2 \\ & \leq \left(1 - \frac{l}{L + 1}\right) \|\theta - \theta'\|_2, \end{align*}\]

since the eigenvalues of \(I - \frac{1}{L+1} D^2H(\tilde{\theta})\) are all between \(1 - L/(L + 1)\) and \(1 - l/(L+1)\). This shows that \(\Phi_{\nabla}\) is a contraction for the \(2\)-norm on \(\mathrm{lev}(\theta_0)\) with \(c = 1 - l/(L + 1) < 1\). This provides an alternative proof, via Banach’s fixed point theorem, of convergence of the gradient descent algorithm in Example 7.1 for a strongly convex \(H\) with uniformly bounded Hessian.
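
For the quadratic example above, \(\Phi_{\nabla}\) is the linear map \(\theta \mapsto (I - A/(L+1))\theta\), and its smallest Lipschitz constant in the \(2\)-norm is the spectral norm of \(I - A/(L+1)\). The following check, reusing the illustrative matrix \(A\) from before, confirms numerically that this norm equals \(1 - l/(L+1)\).

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
eigvals = np.linalg.eigvalsh(A)
l, L = eigvals[0], eigvals[-1]

# Spectral norm of I - A / (L + 1), i.e. the contraction constant of Phi_grad.
c = np.linalg.norm(np.eye(2) - A / (L + 1.0), 2)
print(c, 1 - l / (L + 1.0))             # the two numbers agree (up to rounding)
```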

The number \(c\) quantifies how fast \(\theta_n\) converges toward the fixed point: the smaller \(c\) is, the faster the convergence. Details are given in the next section, but it is worth noticing here that \(c = 1 - l/(L + 1)\) is independent of the dimension of \(\Theta\) for given bounds \(l\) and \(L\). It may also be worth noticing that \(\kappa = (L + 1) / l\) is an upper bound on the condition number of the matrix \(D^2H(\theta)\), uniformly in \(\theta \in \mathrm{lev}(\theta_0)\). A large condition number of the second derivative indicates that the graph of \(H\) looks like a narrow ravine; it forces \(\kappa^{-1}\) to be small, hence \(c\) to be close to one and the convergence to be slow.

7.1.3 Convergence rate

Banach’s fixed point theorem tells us more than just convergence. Since \(\theta_{\infty} = \Phi(\theta_{\infty})\), it actually tells us that \[\|\theta_n - \theta_{\infty}\| = \|\Phi(\theta_{n-1}) - \Phi(\theta_{\infty})\| \leq c \|\theta_{n-1} - \theta_{\infty}\| \leq c^n \|\theta_0 - \theta_{\infty}\|.\] That is, \(\|\theta_n - \theta_{\infty}\| \to 0\) with at least geometric rate \(c < 1\).

To discuss how fast numerical optimization algorithms converge in general, there is a refined notion of asymptotic convergence order as well as rate.

Definition 7.1 We say that a convergent sequence \((\theta_n)_{n \geq 1}\) in \(\mathbb{R}^p\) has asymptotic convergence order \(q\) with asymptotic rate \(r \in (0, 1)\) if \[\lim_{n \to \infty} \frac{\|\theta_{n} - \theta_{\infty}\|}{\|\theta_{n-1} - \theta_{\infty}\|^q} = r.\]

If the order is \(q = 1\) we say that the convergence is linear, if \(q = 2\) we say that the convergence is quadratic and so on. If \[\limsup_{n \to \infty} \frac{\|\theta_{n} - \theta_{\infty}\|}{\|\theta_{n-1} - \theta_{\infty}\|} = 1\] we say that the convergence is sublinear. Banach’s fixed point theorem implies a convergence that is at least as fast as linear convergence with asymptotic rate \(c\). Moreover, if the contraction inequality only holds along the iterates from some iteration \(N_0\) on, then for \(n \geq N_0\) \[\|\theta_{n + 1} - \theta_{n}\| \leq c \| \theta_{n} - \theta_{n-1}\|.\] The convergence may be superlinear, but if it is linear, the rate is bounded by \(c\). To indicate whether \(\Phi\) is asymptotically a contraction, we can introduce \[R_n = \frac{\|\theta_{n + 1} - \theta_{n}\|}{\|\theta_{n} - \theta_{n- 1}\|}\] and monitor its behavior as \(n \to \infty\). The constant \[r = \limsup_{n \to \infty} R_n\] is then asymptotically the smallest possible contraction constant. If the convergence is sublinear and \(R_n \to 1\), the sequence is called logarithmically convergent (by definition), while \(r \in (0,1)\) is an indication of linear convergence with rate \(r\), and \(r = 0\) is an indication of superlinear convergence.
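
The ratios \(R_n\) are easy to compute from stored iterates. A minimal sketch, assuming `iterates` is an array with one iterate per row, for instance as returned by the hypothetical `grad_descent` function above:

```python
import numpy as np

def convergence_ratios(iterates):
    """R_n = ||theta_{n+1} - theta_n|| / ||theta_n - theta_{n-1}|| for successive iterates.

    Assumes that no two consecutive iterates are identical.
    """
    steps = np.linalg.norm(np.diff(iterates, axis=0), axis=1)
    return steps[1:] / steps[:-1]

# R = convergence_ratios(iterates)
# For large n, R[n] estimates r; values near 1 indicate sublinear (possibly
# logarithmic) convergence, values near 0 indicate superlinear convergence.
```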

In practice, we can plot the ratio \(R_n\) as the algorithm is running, and use \(R_n\) as an estimate of the rate \(r\) for large \(n\). However, this can be a quite unstable way of estimating the rate. Alternatively, we can run the algorithm for a large number of iterations, \(N\), say, ideally so large that \(\theta_N = \theta_{\infty}\) up to computer precision. Then if we have linear convergence with rate \(r\), \[ \| \theta_n - \theta_N \| \simeq r \| \theta_{n-1} - \theta_N \| \simeq \ldots \simeq r^{n - N_0} \| \theta_{N_0} - \theta_N \| \] for \(n \geq N_0\) and \(N_0\) sufficiently large. That is, \[\log \|\theta_{n} - \theta_{N}\| \simeq n \log(r) + d\] for \(n \geq N_0\), and we can plot and monitor \(\log \|\theta_{n} - \theta_{N}\|\). It should decay approximately linearly as a function of \(n\) with slope \(\log(r) < 0\), which can be estimated by least squares. If the algorithm has sublinear convergence, we will see this as a slower-than-linear decay. Finally, it is also possible to estimate the order \(q\) as well as the rate \(r\) by using that \[\log \|\theta_{n} - \theta_{N}\| \simeq q \log \|\theta_{n-1} - \theta_{N}\| + \log(r)\] for \(n \geq N_0\). We can estimate \(q\) and \(\log(r)\) by fitting a linear function by least squares to these log-log transformed norms of errors.
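
Both least squares fits can be carried out directly on the logarithms of the error norms. Below is a minimal sketch, assuming `iterates` contains the iterates as rows, with the last row playing the role of \(\theta_N\), and with the burn-in index `N0` chosen by inspection of the plots.

```python
import numpy as np

def estimate_rate_and_order(iterates, N0=10):
    """Least squares estimates of the rate (and order) from log error norms."""
    theta_N = iterates[-1]
    log_err = np.log(np.linalg.norm(iterates[:-1] - theta_N, axis=1))[N0:]
    n = np.arange(N0, N0 + len(log_err))

    # Linear convergence: log ||theta_n - theta_N|| ~ n log(r) + d.
    log_r_linear = np.polyfit(n, log_err, 1)[0]

    # General order: log ||theta_n - theta_N|| ~ q log ||theta_{n-1} - theta_N|| + log(r).
    q, log_r = np.polyfit(log_err[:-1], log_err[1:], 1)

    return np.exp(log_r_linear), q, np.exp(log_r)
```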

It is, of course, possible to investigate convergence in terms of the sequences \(H(\theta_n)\) and \(\nabla H(\theta_n)\) instead. Here \(\nabla H(\theta_n)\) is particularly appealing as we know that the limit should be \(0\). We use the same terminology of order and rate for these sequences. That is, using the gradient to quantify convergence, the order is \(q\) and the rate is \(r \in (0, 1)\) if \[\lim_{n \to \infty} \frac{\|\nabla H(\theta_n)\|}{\|\nabla H(\theta_{n-1})\|^q} = r.\] If the convergence is linear, the rate can then be estimated from \[\log \|\nabla H(\theta_n)\| \simeq n \log(r) + d\] for \(n \geq N_0\) by least squares.

It is important to notice that all rates discussed above are per iteration, which is natural when investigating a single algorithm. However, different algorithms may spend different amounts of time per iteration, and it does not make sense to make a comparison of per iteration rates across different algorithms. If one iteration takes \(\delta\) time units (seconds, say) the per time unit rate is \[r^{1/\delta}.\] If \(t_n\) denotes the run time of \(n\) iterations, we could estimate \(\delta\) as \(t_N / N\) for \(N\) sufficiently large.
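
As a small worked example of the conversion, suppose the per-iteration rate has been estimated to \(r = 0.7\) and that \(N = 1000\) iterations took \(t_N = 2.5\) seconds; all numbers are purely illustrative.

```python
r = 0.7                  # estimated per-iteration rate
t_N, N = 2.5, 1000       # total run time in seconds and number of iterations

delta = t_N / N          # estimated time per iteration (0.0025 seconds)
r_per_second = r ** (1 / delta)
print(r_per_second)      # roughly 1e-62: the error shrinks by this factor per second
```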

We will throughout systematically investigate convergence as a function of \(t_n\) instead of \(n\), and we will estimate rates per time unit directly by least squares regression on \(t_n\) instead of \(n\).

7.1.4 Stopping criteria

The discussion above on theoretical convergence properties and practical methods for measuring convergence rates has left one important practical question unanswered. No algorithm can run for an infinite number of iterations, so all algorithms need a criterion for when to stop.

We say that an algorithm has converged by iteration \(n\) if \[H(\theta_n) - H(\theta_{\infty}) < \varepsilon_0\] for an error tolerance \(\varepsilon_0 > 0\). Unfortunately, we cannot compute \(H(\theta_n) - H(\theta_{\infty})\), and the choice of an appropriate stopping criterion is notoriously difficult. Below we give four of the most commonly used criteria and discuss the benefits and drawbacks of each of them. They can be used individually and in combinations, but unfortunately none of them provides convergence guarantees. It is nevertheless common to abuse the terminology and say that an algorithm has converged when a stopping criterion is reached, but this can be misleading.

Maximal number of iterations: Stop when \[n = N\] for a fixed maximal number of iterations \(N\). This is arguably the simplest criterion, but, obviously, reaching a maximal number of iterations provides no evidence in itself that \(H(\theta_N)\) is sufficiently close to a (local) minimum. For a specific problem we could from experience know of a sufficiently large \(N\) so that the algorithm has typically converged after \(N\) iterations, but the most important use of this criterion is in combination with another criterion so that it works as a safeguard against an infinite loop.

The other three stopping criteria all depend on choosing a tolerance parameter \(\varepsilon > 0\), which will play different roles in the three criteria.

Small relative change: Stop when \[\|\theta_n - \theta_{n-1}\| \leq \varepsilon(\|\theta_n\| + \varepsilon).\] The idea is that when \(\theta_n \simeq \theta_{n-1}\) the sequence has approximately reached the limit \(\theta_{\infty}\) and we can stop. It is possible to use an absolute criterion such as \(\|\theta_n - \theta_{n-1}\| < \varepsilon\), but then the criterion would be sensitive to a rescaling of the parameters. Thus fixing a reasonable tolerance parameter across many problems makes more sense for the relative than for the absolute criterion. The main reason for adding \(\varepsilon\) on the right hand side is to make the criterion well behaved even if \(\|\theta_n\|\) is close to zero.

The main benefit of this criterion is that it does not require evaluations of the objective function. Some algorithms, such as the EM-algorithm, do not need to evaluate the objective function, and it may even be difficult to do so. In those cases this criterion can be used. The main drawback is that a single iteration with little change in the parameter can happen for many reasons besides convergence, and it does not imply that \(H(\theta_{n}) - H(\theta_{\infty})\) is small.

Small relative descent: Stop when \[H(\theta_{n-1}) - H(\theta_n) \leq \varepsilon (|H(\theta_n)| + \varepsilon).\] This criterion only makes sense if the algorithm is a descent algorithm. As discussed above, an absolute criterion would be sensitive to rescaling of the objective function, and the added \(\varepsilon\) on the right hand side is to ensure a reasonable behavior if \(H(\theta_n)\) is close to zero.

This criterion is natural for descent algorithms; we stop when the algorithm does not decrease the value of the objective function sufficiently. The use of a relative criterion makes it possible to choose a tolerance parameter that works reasonably well for many problems. A conventional choice is \(\varepsilon \simeq 10^{-8}\) (often chosen as the square root of the machine epsilon), though the theoretical support for this choice is weak. The drawback of this criterion is as for the criterion above: a small descent does not imply that \(H(\theta_{n}) - H(\theta_{\infty})\) is small, and it could happen if the algorithm enters a part of \(\Theta\) where \(H\) is very flat, say.

Small gradient: Stop when \[\|\nabla H(\theta_n)\| \leq \varepsilon.\] This criterion directly measures whether \(\theta_n\) is close to being a stationary point, but a small value of \(\|\nabla H(\theta_n)\|\) is still no guarantee that \(H(\theta_{n}) - H(\theta_{\infty})\) is small. The criterion also requires the computation of the gradient. In addition, different norms, \(\|\cdot\|\), can be used, and if the coordinates of the gradient are generally of different orders of magnitude, a norm that rescales the coordinates can be chosen.
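
The three tolerance-based criteria are easily combined with a cap on the number of iterations inside a generic iteration loop. The sketch below shows one way to do this; the function name, its interface and the default tolerance are our illustrative choices.

```python
import numpy as np

def run_with_stopping(Phi, H, grad_H, theta0, epsilon=1e-8, max_iter=10_000):
    """Iterate theta_n = Phi(theta_{n-1}) until a stopping criterion is met."""
    theta = np.asarray(theta0, dtype=float)
    h = H(theta)
    for n in range(1, max_iter + 1):                 # maximal number of iterations
        theta_new = Phi(theta)
        h_new = H(theta_new)
        small_change = (np.linalg.norm(theta_new - theta)
                        <= epsilon * (np.linalg.norm(theta_new) + epsilon))
        small_descent = h - h_new <= epsilon * (abs(h_new) + epsilon)
        small_gradient = np.linalg.norm(grad_H(theta_new)) <= epsilon
        theta, h = theta_new, h_new
        if small_change or small_descent or small_gradient:
            break
    return theta, n
```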

None of the four criteria above gives a theoretical convergence guarantee, but in some special cases it is possible to develop criteria with stronger theoretical support. If we can find a continuous function \(\tilde{H}\) satisfying \(H(\theta_{\infty}) = \tilde{H}(\theta_{\infty})\) and \[H(\theta_{\infty}) \geq \tilde{H}(\theta)\] for all \(\theta \in \Theta\) (or just in \(\mathrm{lev}(\theta_0)\) for a descent algorithm), then \[0 \leq H(\theta_{n}) - H(\theta_{\infty}) \leq H(\theta_{n}) - \tilde{H}(\theta_{n}),\] and the right hand side directly quantifies convergence. Convex duality theory provides such a function \(\tilde{H}\) for many convex optimization problems, and in those cases a convergence criterion based on \(H(\theta_{n}) - \tilde{H}(\theta_{n})\) actually comes with a theoretical guarantee on how close we are to the theoretical minimum. We will not pursue the necessary convex duality theory here, but it is useful to know that in some cases we can do better than the ad hoc criteria above.