# 9 Stochastic Optimization

Numerical optimization involves different tradeoffs such as
an *exploration-exploitation* tradeoff.
On the one hand, the objective function must be thoroughly explored
to build an adequate model of it. On the other hand, the model should be
exploited so as to find the minimum quickly. Another tradeoff is between
the accuracy of the model and the time it takes to compute with it.

The optimization algorithms considered in Chapters 7 and 8 work on all available data and take deterministic steps in each iteration. The models are based on accurate local computations of derivatives that can be greedily exploit the local model obtained from derivatives, but they do little exploration.

By including randomness into optimization algorithms it is possible to lower the computational costs and make the algorithms more exploratory. This can be done in various ways. Examples of stochastic optimization algorithms include simulated annealing and evolutionary algorithms that incorporate randomness into the iterative steps with the purpose of exploring the objective function better than a deterministic algorithm is able to. In particular, to avoid getting stuck in saddle points and to escape local minima. Stochastic gradient algorithms form another example, where descent directions are approximated by gradients from random subsets of the data.

The literature on stochastic optimization is huge, and this chapter will only cover some examples of particular relevance to statistics and machine learning. The most prominent applications are to large scale optimization, where stochastic gradient algorithms have become the standard solution. When the dimension of the optimization problem becomes very large, second order methods become prohibitively slow, and if the number of observations is also large, even one computation of the gradient for the entire data batch becomes time consuming. In those cases, stochastic gradient algorithms, that originate from online learning, can make progress more quickly while still using the entire batch of data.

## 9.1 Stochastic gradient algorithms

Stochastic gradient algorithms have their origin in an online learning
framework, where data arrives sequentially as a stream of data points and where
the objective function is an expected loss. Robbins and Monro (1951) introduced in
their seminal paper a variant of the online stochastic gradient algorithm that
they called the *stochastic approximation method*, and they established the
first convergence result for such algorithms. To understand what stochastic
gradient algorithms are supposed to optimize, we introduce the
general framework of a population model and give conditions that ensure
that the basic online algorithm converges. Subsequently, the basic online
algorithm is turned into an algorithm for batch data, which is the algorithm of
primary interest. In the following sections, various beneficial extensions of
the basic batch algorithm are explored.

### 9.1.1 Population models and loss functions

We will consider observations from the sample
space \(\mathcal{X}\), and we will be interested in estimating parameters
from the parameter space \(\Theta\). A loss function
\[L : \mathcal{X} \times \Theta \to \mathbb{R}\]
is fixed throughout, and we want to minimize the expected
loss also known as the *risk*. That is, we want to minimize
\[H(\theta) = E(L(X, \theta)) = \int L(x, \theta) \mathrm{\mu}(dx)\]
with \(X \in \mathcal{X}\) having distribution \(\mu\).
Of course, it is implicitly understood that the expectation has to be well
defined for all \(\theta\).

**Example 9.1 **Suppose that \(X = (Y, Z)\) with \(Y\) a real valued random variable, and
that \(\mu(z, \theta)\) denotes a parametrized mean value conditionally on \(Z = z\).
With the squared error loss,
\[L((y,z), \theta) = \frac{1}{2} (y - \mu(z, \theta))^2,\]
the risk is the mean squared error
\[\mathrm{MSE}(\theta) = \frac{1}{2} E (Y - \mu(Z, \theta))^2.\]
From the definition of \(\mathrm{MSE}(\theta)\) we see that
\[2 \mathrm{MSE}(\theta) = E(Y - E(Y \mid Z))^2 + E (E(Y\mid Z) - \mu(Z, \theta))^2,\]
where the first term does not depend upon \(\theta\). Thus minimizing the
mean squared error is the same as finding \(\theta_0\) such that \(\mu(z, \theta_0)\)
is the optimal approximation of \(E(Y \mid Z = z)\) in a squared error sense.

Note how the link between the distribution of \(X\) and the parameter is defined in the example above by the choice of loss function. There is no upfront assumption that \(E(Y\mid Z = z) = \mu(z, \theta_0)\) for any \(\theta_0 \in \Theta\), but if there is such a \(\theta_0\), it will clearly be a minimizer. In general, the optimal \(\theta_0\) is simply the \(\theta\) that minimizes the risk.

An alternative to the squared error loss is the log-likelihood loss, which can be used when we have a parametrized family of distributions.

**Example 9.2 **Suppose that \(f_{\theta}\) denotes a density on \(\mathcal{X}\) parametrized by
\(\theta\). Then the log-likelihood loss is
\[L(x, \theta) = - \log f_{\theta}(x).\]
The corresponding risk,
\[H(\theta) = - E \log(f_{\theta}(X)),\]
is known as the cross-entropy. If the distribution of \(X\) has density \(f^0\) then
\[\begin{align*}
H(\theta) & = - E \log(f^0(X)) - E \log(f_{\theta}(X)/f^0(X)) \\
& = H(f^0) + D(f^0 \ || \ f_{\theta})
\end{align*}\]
where the first term is the entropy of \(f^0\), and the second is the
Kullback-Leibler divergence of \(f_{\theta}\) from \(f^0\).
The entropy does not depend upon \(\theta\) and minimizing
the cross-entropy is thus the same as finding a \(\theta\) with \(f_{\theta}\)
being the optimal approximation of \(f^0\) in a Kullback-Leibler sense.

We consider now the same regression setup as in Example 9.1 with \(X = (Y, Z)\), but we let \[f_{\theta}(y | z) = e^{- \mu(z, \theta)} \frac{y^{\mu(z, \theta)}}{y!}\] denote the Poisson point probabilities for the Poisson distribution with mean \(\mu(z, \theta)\) conditionally on \(Z = z\). Then the log-likelihood loss is \[- \log f_{\theta}(y | z) = \mu(z, \theta) - y \log(\mu(z, \theta))\] up to an additive constant not depending on \(\theta\), and the cross-entropy is \[H(\theta) = E\big(\mu(Z, \theta) - E(Y \mid Z) \log(\mu(Z, \theta))\big),\] again up to an additive constant. This risk function quantifies how \(\mu(Z, \theta)\) deviates from \(E(Y \mid Z)\) in a different way than the risk based on the squared error loss. However, if \(E(Y \mid Z = z) = \mu(z, \theta_0)\) for some \(\theta_0\), it is still true that \(\theta_0\) is a minimizer, cf. Exercise 9.1.

The log-likelihood loss is an appropriate loss function if the parametrized family of distributions fits data well. For a Gaussian (conditional) mean value model the log-likelihood loss gives the same risk as the squared error loss – but the squared error loss can also be suitable even if data is not Gaussian. It can be a suitable loss whenever we just want to fit a model of the conditional mean of \(Y\). If we want to fit the (conditional) median instead, we can use the absolute deviation \[L((y,z), \theta) = |y - \mu(z, \theta)|.\] This is a special case of the check loss functions used for quantile regression. Thus by choosing the loss function we decide which aspects of the distribution we model, that is, whether we want a good global fit as for the log-likelihood loss, a fit of the mean value as for the squared error loss, or a fit of the the median or another quantile as for the check loss.

### 9.1.2 Online stochastic gradient algorithm

The classical stochastic gradient algorithm is an example of an online learning algorithm. It is based on the simple observation that if we can interchange differentiation and expectation then \[\nabla H(\theta) = E \left( \nabla_{\theta} L(X, \theta) \right),\] thus if \(X_1, X_2, \ldots\) form an i.i.d. sequence then \(\nabla_{\theta} L(X_i, \theta)\) is unbiased as an estimate of the gradient of \(H\) for any \(\theta\) and any \(i\). With inspiration from gradient descent algorithms it is natural to suggest stochastic parameter updates of the form \[\theta_{n + 1} = \theta_n - \gamma_n \nabla_{\theta} L(X_{n+1}, \theta_n)\] starting from some initial value \(\theta_0\). The direction, \(\nabla_{\theta} L(X_{n+1}, \theta_n)\), is, however, not guaranteed to be a descent direction for \(H\), and even if it is, \(\gamma_n\) is typically not chosen to guarantee descent.

The sequence of step size parameters \(\gamma_n \geq 0\) are known collectively
as the *learning rate*. It can be a deterministic sequence, but \(\gamma_n\) may
also depend on \(X_1, \ldots, X_{n}\) and \(\theta_0, \ldots, \theta_{n}\).
For stochastic gradient algorithms, convergence can be shown under global
conditions on the decay of the learning rate rather than local conditions
on the individual step lengths.

**Theorem 9.1 **Suppose \(H\) is strongly convex and
\[E(\|\nabla_{\theta} L(X, \theta))\|^2) \leq A + B \|\theta\|^2.\]
If \(\theta^*\) is the global minimizer of \(H\) then \(\theta_n\) converges almost surely
toward \(\theta^*\) if
\[\begin{equation}
\sum_{n=1}^{\infty} \gamma_n^2 < \infty \quad \text{and} \quad
\sum_{n=1}^{\infty} \gamma_n = \infty. \tag{9.1}
\end{equation}\]

From the above result, convergence of the algorithm
is guaranteed if the learning rate, \(\gamma_n\), converges to 0 but
does so sufficiently slowly. Though formulated in a slightly different way,
Robbins and Monro (1951) were the first to demonstrate convergence of an online
learning algorithm under conditions as above on the learning rate.
Following their terminology, much has since been written on online
learning and adaptive control theory under the name *stochastic approximation*,
Lai (2003).

The precise way that the learning
rate decays is known as the *decay schedule*, and a flexible
three-parameter power law family of decay schedules is given by
\[\gamma_n = \frac{\gamma_0 K}{K + n^a} = \frac{\gamma_0 }{1 + K^{-1} n^{a}}\]
for some initial learning rate \(\gamma_0 > 0\) and constants \(K, a > 0\). If
\(a \in (0.5, 1]\) the resulting learning rate satisfies the convergence conditions
(9.1).

The parameter \(\gamma_0\) determines the initial baseline rate, and Figure 9.1 illustrates the effect of the parameters \(K\) and \(a\) on the decay. The parameter \(a\) is the asymptotic exponent of \(\gamma_n \sim \gamma_0 K n^{-a}\), and \(K\) determines how quickly the rate will turn into a pure power law decay. Moreover, if we have a target rate, \(\gamma_{1}\), that we want to hit after \(n_{1}\) iterations, and we fix the exponent \(a\), we can also solve for \(K\) to find \[K = \frac{n_1^a \gamma_1}{\gamma_0 - \gamma_1}.\] This gives us a decay schedule that interpolates between \(\gamma_0\) and \(\gamma_1\) over the range \(0, \ldots, n_1\) of iterations.

We implement `decay_scheduler()`

as a function that returns a particular
decay schedule, with the possibility to determine \(K\) automatically from a
target rate.

```
decay_scheduler <- function(gamma0 = 1, a = 1, K = 1, gamma1, n1) {
force(a)
if (!missing(gamma1) && !missing(n1))
K <- n1^a * gamma1 / (gamma0 - gamma1)
b <- gamma0 * K
function(n) b / (K + n^a)
}
```

The following example of online Poisson regression illustrates the general ideas.

**Example 9.3 **In this example \(Y_i \mid Z_i = z_i \sim \mathrm{Pois}(\varphi(\beta_0 + \beta_1 z_i))\)
for \(\beta = (\beta_0, \beta_1)^T\) the parameter vector and
\(\varphi: \mathbb{R} \to (0,\infty)\) a continuously differentiable function.
We let the \(Z_i\)-s be uniformly distributed in \((-1, 1)\), but this choice is not
particularly important. The conditional mean of \(Y_i\) given \(Z_i = z_i\) is
\[\mu(z_i, \beta) = \varphi(\beta_0 + \beta_1 z_i)\]
and we will first consider the squared error loss. To this end, observe that
\[\nabla_{\beta} \mu(z_i, \beta) = \varphi'(\beta_0 + \beta_1 z_i)
\left( \begin{array}{c} 1 \\ z_i \end{array} \right),\]
which for the squared error loss results in the gradient
\[\nabla_{\beta} \frac{1}{2} (y_i - \mu(z_i, \beta) )^2 =
\varphi'(\beta_0 + \beta_1 z_i) (\mu(z_i, \beta) - y_i) \left( \begin{array}{c} 1 \\ z_i \end{array} \right).\]

We simulate data and explore the learning algorithm in the special case with \(\varphi = \exp\). To clearly emulate the online nature of the algorithm, the implementation below generates the observations sequentially in the loop.

```
N <- 5000
beta_true = c(2, 3)
mu <- function(z, beta) exp(beta[1] + beta[2] * z)
beta <- vector("list", N)
rate <- decay_scheduler(gamma0 = 0.0004, K = 100)
beta[[1]] <- c(beta0 = 1, beta1 = 1)
for(i in 2:N) {
# Simulating a new data point
z <- runif(1, -1, 1)
y <- rpois(1, mu(z, beta_true))
# Update via squared error gradient
mu_old <- mu(z, beta[[i - 1]])
beta[[i]] <- beta[[i - 1]] - rate(i) * mu_old * (mu_old - y) * c(1, z)
}
beta[[N]] # This is close to beta_true; the algorithm works!
```

```
## beta0 beta1
## 2.037990 2.987941
```

For the log-likelihood loss we instead find the gradient \[\nabla_{\beta} \big( \mu(z_i, \beta) - y_i \log(\mu(z_i, \beta)) \big) = \frac{\varphi'(\beta_0 + \beta_1 z_i)}{\mu(z_i, \beta)} (\mu(z_i, \beta) - y_i) \left( \begin{array}{c} 1 \\ z_i \end{array} \right),\] which leads to a slightly different but equally valid algorithm. In the special case with \(\varphi = \exp\), the derivative is \(\varphi'(\beta_0 + \beta_1 z_i) = \mu(z_i, \beta)\), \[\nabla_{\beta} \big( \mu(z_i, \beta) - y_i \log(\mu(z_i, \beta)) \big) = (\mu(z_i, \beta) - y_i) \left( \begin{array}{c} 1 \\ z_i \end{array} \right),\] and the log-likelihood gradient differs from the squared error gradient by lacking the factor \(\mu(z_i, \beta)\). With \(Z\) uniformly distributed on \((-1, 1\)), the distribution of \(\mu(Z, (2, 3))\) has range between \(e^{-1} \simeq 0.3679\) and \(e^5 \simeq 148.4\) and is right skewed, that is, it is concentrated toward the smaller values but with a long right tail. Its median is \(e^2 \simeq 7.389\), while its mean is \((e^5 - e) / 6 \simeq 24.67\).

The squared error gradient is typically longer than the log-likelihood gradient due to the factor \(\mu(z_i, \beta)\) — and sometimes by a large factor. In the implementation below with the gradient of the log-likelihood we therefore choose \(\gamma_0\) a factor 25 larger than the \(\gamma_0 = 0.0004\) that was used with the squared error gradient.

```
rate <- decay_scheduler(gamma0 = 0.01, K = 100)
beta[[1]] <- c(beta0 = 1, beta1 = 1)
for(i in 2:N) {
# Simulating a new data point
z <- runif(1, -1, 1)
y <- rpois(1, mu(z, beta_true))
# Update via log-likelihood gradient
mu_old <- mu(z, beta[[i - 1]])
beta[[i]] <- beta[[i - 1]] - rate(i) * (mu_old - y) * c(1, z)
}
beta[[N]] # This is close to beta_true; this algorithm also works!
```

```
## beta0 beta1
## 2.008452 2.987035
```

Figure 9.2 shows how the estimates of \(\beta_0\) and \(\beta_1\)
converge toward the true values for the two different choices of loss functions.
With \(\varphi = \exp\) and either loss function, the risk, \(H(\beta)\), is
strongly convex and attains its unique minimum in \(\beta^* = (2, 3)^T\).
Theorem 9.1 suggests convergence for appropriate learning
rates – except that the growth condition on the gradient is not fulfilled
with \(\varphi = \exp\). The growth condition *is* fulfilled if we replace
the exponential function by softplus, \(\varphi(w) = \log(1 + e^w)\), which for
small values of \(w\) behaves like the exponential function, but grows linearly
with \(w\) for \(w \to \infty\).

The gradient from the squared error loss resulted in slower convergence and a more jagged sample path than the gradient from the log-likelihood. This is explained by the random factor \(\mu(z_i, \beta)\) for the squared error gradient, which makes the step sizes somewhat irregular when data is from a Poisson distribution. It also makes choosing \(\gamma_0\) a delicate balance. A small increase of \(\gamma_0\) will make the algorithm unstable, while decreasing \(\gamma_0\) will make the convergence even slower, though the fluctuations will also be damped. Making the right choice – or even a suitable choice – of decay schedule depends heavily on the problem considered and the gradient used. It is a problem specific challenge to find a good schedule – and even just to choose the three parameters when we use the power law schedule.

The example illustrates that the loss function not only determines what we model, but also how well the learning algorithm works. Both loss functions are appropriate, but since the data is from the Poisson distribution, the log-likelihood loss leads to faster convergence. Exercise 9.2 explores the opposite situation where the data is from a Gaussian distribution.

### 9.1.3 Batch stochastic gradient algorithms

The online algorithm does not store data, and once a data point is used it is forgotten. The online algorithm is working in a context where data arrives as a stream of data points and the model is updated continually. In statistics, we more frequently encounter batch algorithms, where an entire batch of data points is stored and processed by the algorithm, and where each data point in the batch can be accessed as many times as we like. However, when the batch is sufficiently large, many standard batch algorithms are slow, and some ideas from online algorithms can beneficially be transferred to batch processing of data.

Within the population model framework in Section 9.1.1 the objective is to minimize the population risk, \(H(\theta)\), defined as the expected loss w.r.t. the probability distribution \(\mu\). For the online algorithms we imagine an endless stream of data points from \(\mu\), which can be used to ultimately minimize \(H(\theta)\). Batch algorithms replace the population quantity by an empirical surrogate — the average loss on the batch \[H_N(\theta) = \frac{1}{N} \sum_{i=1}^N L(x_i, \theta).\] Minimizing \(H_N\) as a surrogate of minimizing \(H\) is known as empirical risk minimization.

If data is i.i.d. the standard deviation of \(H_N(\theta)\) decays as \(N^{-1/2}\) with \(N\), while the run time of its computation increases as \(N\). Thus as we invest more computation time by using more data we get diminishing returns; doubling the run time will only lower the precision of \(H_N\) as an approximation of \(H\) by a factor \(1 / \sqrt{2} \simeq 0.7\). The same is, of course, true if we look at the gradient, \(\nabla H_N(\theta)\), or higher derivatives of \(H_N\).

It is not obvious that the computational costs of using the entire data set to compute the gradient, say, is worth the effort compared to using only a fraction of the data. Thus we ask if there is a better tradeoff between run time and precision by using a fraction of data points each time we compute the gradient? The stochastic gradient algorithm is one such algorithm that “cycles through the data” and uses only a random fraction of data points for each computation. In its basic version presented first, it uses only a single data point in each iteration, and it is really the same algorithm as presented in Section 9.1.2 except that the population risk is replaced by the empirical risk defined in terms of the batch data.

With \(\hat{\mu}_N = \frac{1}{N}\sum_{i=1}^N \delta_{x_i}\) denoting the empirical distribution of the batch data set, we see that \[H_N(\theta) = \int L(x, \theta) \mathrm{\mu}_N(dx),\] that is, the empirical risk is simply the expected loss with respect to the empirical distribution. Thus to minimize \(H_N\) we can use the online approach by sampling observations from \(\hat{\mu}_N\). This leads to the following basic version of the stochastic gradient algorithm applied to a data batch.

From an initial parameter value, \(\theta_0\), we iteratively compute the parameters as follows: given \(\theta_{n}\)

- sample an index \(i\) uniformly from \(\{1, \ldots, N\}\)
- compute \(\rho_n = \nabla_{\theta} L(x_i, \theta_{n})\)
- update the parameter \(\theta_{n+1} = \theta_{n} - \gamma_n \rho_n.\)

Note that sampling the index \(i\) is equivalent to sampling an observation from \(\hat{\mu}_N\), which in turn is the same as nonparametric bootstrapping. Just as in the online setting, the sequence of learning rates, \(\gamma_n\), is a tuning parameter of the algorithm.

We implement the basic stochastic gradient algorithm below,
allowing for a user defined decay schedule of the learning rate. However,
instead of implementing one long loop, we divide the iterations into
*epochs* with each epoch consisting of \(N\) iterations. In the implementation,
the maximal number of iterations is also given in terms of epochs, and
the decay schedule is applied on a per epoch basis.

We also introduce a small twist on the sampling from the empirical distribution; instead of sampling with replacement (bootstrapping) we sample without replacement. Sampling \(N\) indices from \(\{1, \ldots, N\}\) without replacement is the same as sampling a permutation of the indices.

```
SG <- function(
par,
grad, # Function of parameter and observation index
N, # Sample size
gamma, # Decay schedule or a fixed learning rate
maxiter = 100, # Max epoch iterations
sampler = sample, # How data is resampled. Default is a random permutation
cb = NULL,
...
) {
gamma <- if (is.function(gamma)) gamma(1:maxiter) else rep(gamma, maxiter)
for(k in 1:maxiter) {
if(!is.null(cb)) cb()
samp <- sampler(N)
for(j in 1:N) {
i <- samp[j]
par <- par - gamma[k] * grad(par, i, ...)
}
}
par
}
```

One epoch in the algorithm above is exactly one pass through the entire batch of
data points, but in a random order. The default value of `sampler = sample`

means that resampling is done without replacement. If we call `SG()`

with
`sampler = function(N) sample(N, replace = TRUE)`

we would get sampling
with replacement, in which case an epoch would be a pass through
\(N\) data points sampled independently from the batch.
Sampling with replacement will feed the stochastic gradient
algorithm with i.i.d. samples from the empirical distribution. Sampling without
replacement introduces some dependence. Curiously, sampling without
replacement has turned out to be empirically superior to sampling
with replacement, and recent theoretical
results, Gürbüzbalaban, Ozdaglar, and Parrilo (2019), support that it leads to a faster rate of convergence.

We may ask if the sampling actually matters, and whether we could just leave out that
part of the algorithm? In practice, data sets may come in a “bad order,”
for instance in an unfortunate ordering according to
one or more of its variables, and cycling through the data points in such an
ordering can easily lead the algorithm astray. *It is therefore important to
always randomize the order of the data points somehow*. A minimal amount of randomization
in common use is to just do one initial random permutation, corresponding
to moving `samp <- sampler(N)`

outside of the outer for-loop above. This may
be enough randomization for the algorithm to work in some cases, but the link
to the convergence result for the online algorithm is lost.

**Example 9.4 **As a continuation of Example 9.3 we consider the batch version
of Poisson regression. We will only use the log-likelihood gradient, and we
first simulate a small data set with \(N = 50\) data points.

```
N <- 50
z <- runif(N, -1, 1)
y <- rpois(N, mu(z, beta_true))
grad_pois <- function(par, i) (mu(z[i], par) - y[i]) * c(1, z[i])
```

Using the `grad_pois()`

function above, we run the stochastic gradient
algorithm for 1000 epochs with a decay schedule that
interpolates between \(\gamma_0 = 0.02\) and \(\gamma_1 = 0.001\).

```
pois_SG_tracer <- tracer("par", N = 0)
SG(
c(0, 0),
grad_pois,
N = N,
gamma = decay_scheduler(gamma0 = 0.02, gamma1 = 0.001, n1 = 1000),
maxiter = 1000,
cb = pois_SG_tracer$tracer
)
```

`## [1] 1.905394 3.162559`

The resulting parameter estimate should be compared to the maximum-likelihood estimate from an ordinary Poisson regression.

`beta_hat <- coefficients(glm(y ~ z, family = poisson))`

`## beta_hat: 1.898305 3.15986`

The batch version of stochastic gradient descent converges toward the minimizer, \((\hat{\beta}_0, \hat{\beta}_1)\), of the empirical risk. This is contrary to the online version that converges toward the minimizer of the theoretical risk, which in this case is \((\beta_0, \beta_1) = (2, 3)\). With a larger batch size, \((\hat{\beta}_0, \hat{\beta}_1)\) will come closer to \((2, 3)\). Figure 9.3 shows clearly how the algorithms converge toward a limit that depends on the batch size, and for \(N = 500\), this limit is much closer to the theoretical minimizer.

```
N <- 500
z <- runif(N, -1, 1)
y <- rpois(N, mu(z, beta_true))
pois_SG_tracer_2 <- tracer("par", N = 0)
SG(
par = c(0, 0),
grad = grad_pois,
N = N,
gamma = decay_scheduler(gamma0 = 0.02, gamma1 = 0.001, n1 = 100),
cb = pois_SG_tracer_2$tracer
)
```

`## [1] 1.994688 3.022137`

`beta_hat_2 <- coefficients(glm(y ~ z, family = poisson))`

`## beta_hat_2: 1.98893 3.027715`

Example 9.4 and Figure 9.3, in particular, illustrate that if the data set is relatively small, the algorithm quickly attains a precision smaller than the statistical uncertainty, and further optimization is therefore futile. However, for larger data sets, optimization to a greater precision can be beneficial.

### 9.1.4 Predicting news article sharing on social media

In this section we will illustrate the use of the basic stochastic gradient algorithm for learning a model that predicts how many times a news article is shared on social media. The data will be subjected to transformations and normalizations to make the use of a linear model and the squared error loss reasonable.

The basic stochastic gradient algorithm for the linear model was introduced to the early machine learning community in 1960 via ADALINE (Adaptive Linear Neuron) by Bernard Widrow and Ted Hoff. ADALINE was implemented as a physical device capable of learning patterns via stochastic gradient updates. The math is the same today, but the implementation has fortunately become somewhat easier.

With a linear model and the squared error loss, \(L((y, x), \beta) = \frac{1}{2} (y - \beta^T x)^2\), the gradient becomes \[\nabla_{\beta} L((y, x), \beta) = - x (y - \beta^T x) = x (\beta^T x - y),\] which results in updates of the form \[\beta_{n+1} = \beta_n - \gamma_n x_i (\beta_n^T x_i - y_i).\] That is, the parameter moves in the direction of \(x_i\) if \(\beta_n^T x_i < y_i\) and in the direction of \(-x_i\) if \(\beta_n^T x_i > y_i\). The amount by which it moves is controlled partly by the learning rate, \(\gamma_n\), and partly by the size of the residual, \(y_i - \beta_n^T x_i\). A larger residual gives a larger move.

The following function factory for linear models takes a model matrix and a response vector for the complete data batch as arguments and implements the squared error loss function on the entire batch as well as the gradient in a single observation. The returned list also contains a parameter vector of the correct dimension, which can be used for initialization of optimization algorithms.

```
ls_model <- function(X, y) {
N <- length(y)
X <- unname(X) # Strips X of names
list(
# Initial parameter value
par0 = rep(0, ncol(X)),
# Objective function
H = function(beta)
drop(crossprod(y - X %*% beta)) / (2 * N),
# Gradient in a single observation
grad = function(beta, i) {
xi <- X[i, ]
xi * drop(xi %*% beta - y[i])
}
)
}
```

The data, originally collected by Fernandes, Vinagre, and Cortez (2015), was obtained from the
UCI Machine Learning Repository
and contains 39,644 observations on 61 variables. One variable is the integer
valued `shares`

, which will be the target variable of our predictions.

`News <- readr::read_csv("data/OnlineNewsPopularity.csv")`

Two of the variables, `timedelta`

and `url`

, are not relevant predictors, and
`is_weekend`

is redundant given the other weekday variables, so we exclude
those three variables. Some of the predictors are also highly correlated,
and we exclude four additional predictors before the model matrix is constructed.

```
News <- dplyr::select(
News,
- url,
- timedelta,
- is_weekend,
- n_non_stop_words,
- n_non_stop_unique_tokens,
- self_reference_max_shares,
- kw_min_max
)
# The model matrix without an explicit intercept is constructed from all
# variables remaining in the data set but the target variable 'shares'
X <- model.matrix(shares ~ . - 1, data = News)
```

This data set is by current standards not a large data set – the dense model
matrix takes only about 20 MB of memory – and the
linear model (with the target variable `shares`

log-transformed)
can easily be fitted by simply solving the least squares problem.
It takes about 0.2 seconds on a standard laptop to compute the solution.
The residual plot in Figure 9.4 shows that the
model is actually not a poor fit to data, though there is a considerable
unexplained residual variance.

Optimization of the squared error loss using this data set will be used to illustrate a number of stochastic gradient algorithms even though we can compute the optimizer easily by other means. The real practical benefit of stochastic gradient algorithms comes when applied to large scale problems that are difficult to treat as textbook examples. But using a toy problem makes it easier to understand in detail how the different algorithms behave.

We will standardize all the columns of \(X\) to have the same norm. Specifically
to have non-central second moment 1.
This does not change the optimization problem but corresponds to a
reparametrization where all parameters are rescaled. The rescaling brings
the parameters on a comparable scale, which is typically a good idea for
optimization algorithms based on gradients only, see also Exercise
9.3. After rescaling we
initialize the linear model with a call to `ls_model()`

and refit the
model using the new parametrization.

```
X_raw <- X
# Standardization and log-transforming the target variable
X <- scale(X, center = FALSE)
y <- log(News$shares)
# The '%<-%' destructure assignment operator is from the zeallot package
c(par0, H, ls_grad) %<-% ls_model(X, y)
# Fitting the model using standard linear model computations
lm_News <- lm.fit(X, y)
par_hat <- lm_News$coefficients # Will be used below for comparisons
```

We first run the stochastic gradient algorithm with a fixed learning rate of \(\gamma = 10^{-5}\) for 50 epochs with a tracer that computes and stores the value of the objective function at each epoch.

```
SG_tracer <- tracer("value", expr = quote(value <- H(par)))
SG(
par = par0,
grad = ls_grad,
N = nrow(X),
gamma = 1e-5,
maxiter = 50,
cb = SG_tracer$tracer
)
```

Using the trace from the last epochs, we can compare the objective function
values to the minimum found using `lm.fit()`

above. The minimum was
not reached completely after the 50 epochs that took some time to compute.

```
## value .time
## 45 0.4268483 17.26148
## 46 0.4262983 17.82874
## 47 0.4259705 18.21194
## 48 0.4251088 18.61303
## 49 0.4245424 18.99180
## 50 0.4240064 19.37994
```

`H(par_hat)`

`## [1] 0.3781595`

We will use profiling to investigate what most of the time was spent on in the stochastic gradient algorithm.

The profiling result shows, unsurprisingly, that the
computation of the gradient and the update of the parameter vector takes up most
of the run time. But if we look closer at the implementation of the gradient, we see
that the innocuously looking subsetting `X[i, ]`

to the \(i\)-th row is actually
responsible for about half of the run time. We also see a substantial
allocation and deallocation of memory associated with this line. It is a
bottleneck of the R implementation that slicing out a row from a bigger
matrix cannot be done without creating a copy of that row, and this is why
this particular line takes up so much time.

To further investigate the convergence of the basic stochastic gradient algorithm we run it with a larger learning rate of \(\gamma = 5 \times 10^{-5}\) and then with a power law decay schedule, which interpolates from an initial learning rate of \(10^{-3}\) to a learning rate of \(10^{-5}\) after 50 epochs.

```
SG_tracer$clear()
SG(
par = par0,
grad = ls_grad,
N = nrow(X),
gamma = 5e-5,
maxiter= 50,
cb = SG_tracer$tracer
)
SG_trace_high <- summary(SG_tracer)
SG_tracer$clear()
SG(
par = par0,
grad = ls_grad,
N = nrow(X),
gamma = decay_scheduler(gamma0 = 7e-5, gamma1 = 4e-5, a = 0.5, n1 = 50),
maxiter= 50,
cb = SG_tracer$tracer
)
SG_trace_decay <- summary(SG_tracer)
```

We will compare the convergence of the three stochastic gradient algorithms with the convergence of gradient descent with backtracking. For gradient descent we choose \(\gamma = 8 \times 10^{-2}\), which results in only a few initial backtracking steps and then all subsequent steps will use the step length \(\gamma = 8 \times 10^{-2}\). Choosing a larger \(\gamma\) for this particular optimization resulted in backtracking until a step length around \(8 \times 10^{-2}\) was reached, thus this choice of \(\gamma\) will use a minimal amount of time on the backtracking step of the algorithm.

```
grad_H <- function(beta) crossprod(X, X %*% beta - y) / nrow(X)
GD_tracer <- tracer("value", N = 10)
GD(
par = par0,
H = H,
gr = grad_H,
gamma = 8e-2,
maxiter = 800,
cb = GD_tracer$tracer
)
GD_trace <- summary(GD_tracer)
```

Figure 9.5 shows how the four algorithms converge.
The gradient descent algorithm converges faster than the stochastic gradient
algorithm with the low learning rate \(\gamma = 10^{-5}\), but with the
high learning rate \(\gamma = 5 \times 10^{-5}\) and the power law decay schedule
the stochastic gradient algorithm converges about as fast as gradient descent.
For this particular run, the power law decay schedule shows marginally
faster convergence than the fixed high learning rate, and this was ensured after
some tuning of the decay schedule parameters. Despite

theoretical guarantees of convergence with the power law decay schedule,
the practical choice of a suitable learning rate or decay schedule is really
an empirical art. With very large data, the tuning
of learning rate parameters can be done on a small subset
of data before the algorithms are run using the full data set.

For comparison purposes it is typically better to monitor convergence of optimization algorithms as a function of real time than iterations, but when comparing algorithms in terms of real time we are admittedly comparing their specific implementations. The R implementation of the stochastic gradient algorithm has some shortcomings that are not shared by the gradient descent algorithm. One epoch of the stochastic gradient algorithm should be about as computationally demanding as one iteration of gradient descent as both will compute the gradient in each data point exactly once and add them up. The vectorized batch gradient computation is fairly efficient in R, but the iterative looping over data in the stochastic gradient algorithm is not, and this inefficiency is compounded by the inefficiency of matrix slicing that results in data copying as the profiling revealed. Thus the comparisons in Figure 9.5 are arguably not entirely fair to the stochastic gradient algorithm – only the specific R implementation.

In the subsequent section we will see alternatives to the basic stochastic gradient algorithm, which will diminish the shortcomings of a pure R implementation somewhat. Another way to circumvent some of the shortcomings of the R implementation is to rewrite the algorithm using Rcpp. We will pursue Rcpp implementations in Section 9.3, but we will first consider some beneficial modifications of the basic algorithm.

## 9.2 Beyond basic stochastic gradient algorithms

The gradient \(\nabla_{\theta} L(x_i, \theta)\) for a single random data point is quickly computed, and though unbiased as an estimate of \(\nabla_{\theta} H_N(\theta)\) it has a large variance. This affects the basic stochastic gradient algorithm negatively as the directions of each update can oscillate quite wildly from iteration to iteration. This section covers some techniques that yield a better tradeoff between run time and variability.

The most obvious technique is to use more than one observation per computation
of the gradient, which gives us *mini-batch* stochastic gradient algorithms.
A second technique is to incorporate some memory about previous directions into
the movements of the algorithms — in the same spirit as how the conjugate gradient
algorithm uses the previous gradient to modify the descent direction.

The literature on deep learning has recently exploded with variations on the stochastic gradient algorithm. Performance is mostly studied empirically and applied in practice to the highly non-convex optimization problem of learning a neural network. A comprehensive coverage of all the different ideas will not be attempted, and only three of the most solidified variations will be treated. The use of mini-batches is ubiquitous, and momentum will be introduced to illustrate a variation with memory. Finally, the Adam algorithm uses memory in combination with adaptive learning rates to achieve both speed and robustness.

### 9.2.1 Mini-batches

The three steps of the mini-batch stochastic gradient algorithm with mini-batch size \(m\) are: given \(\theta_{n}\)

- sample \(m\) indices, \(I_n = \{i_1, \ldots, i_m\}\), from \(\{1, \ldots, N\}\)
- compute \(\rho_n = \frac{1}{m} \sum_{i\in I_n} \nabla_{\theta} L(x_i, \theta_{n})\)
- update the parameter \(\theta_{n+1} = \theta_{n} - \gamma_n \rho_n.\)

Of course, the mini-batch algorithm with \(m = 1\) is the basic stochastic gradient
algorithm. As for the basic algorithm, we implement the variation
where we sample a *partition*
\[I_1 \cup \ldots \cup I_M \subseteq \{1, \ldots, N\}\]
for \(M = \lfloor N / m \rfloor\) and in one epoch loop through the mini-batches
\(I_1, \ldots, I_M\).

In the following sections we will develop a couple of modifications to the basic stochastic gradient algorithm, and we will therefore implement a more generic version of the algorithm. What is common to all the modifications is that they differ in the details of the epoch loop, thus we take out that loop as a separate function.

```
SG <- function(
par,
N, # Sample size
gamma, # Decay schedule or a fixed learning rate
epoch = batch, # Epoch update function
..., # Other arguments passed to epoch updates
maxiter = 100, # Max epoch iterations
sampler = sample, # How data is resampled. Default is a random permutation
cb = NULL
) {
gamma <- if (is.function(gamma)) gamma(1:maxiter) else rep(gamma, maxiter)
for(k in 1:maxiter) {
if(!is.null(cb)) cb()
samp <- sampler(N)
par <- epoch(par, samp, gamma[k], ...)
}
par
}
```

The implementation uses `batch()`

as the default update function,
and we implement this function below. It uses the
random permutation to generate the \(M\) mini-batches, and it contains a
loop through the mini-batches containing a call of `grad()`

for the computation of the average gradient for each mini-batch in the loop.
Note that the `grad`

argument to `batch()`

will be captured by and passed on from
a call of `SG()`

via the `...`

argument.

```
batch <- function(
par,
samp,
gamma,
grad, # Function of parameter and observation index
m = 50, # Mini-batch size
...
) {
M <- floor(length(samp) / m)
for(j in 0:(M - 1)) {
i <- samp[(j * m + 1):(j * m + m)]
par <- par - gamma * grad(par, i, ...)
}
par
}
```

The `grad()`

function implemented in `ls_model()`

in Section 9.1.4 assumes
that the index argument \(i\) is a single number and not a vector. It computes
for a vector \(i\) a matrix containing the
different gradients as columns. We therefore reimplement
`ls_model()`

so that `grad()`

computes the average of the gradients
as it is supposed to for a mini-batch.

```
ls_model <- function(X, y) {
N <- length(y)
X <- unname(X)
list(
par0 = rep(0, ncol(X)),
H = function(beta)
drop(crossprod(y - X %*% beta)) / (2 * N),
# Gradient that works for a mini-batch indexed by i
grad = function(beta, i) {
xi <- X[i, , drop = FALSE]
drop(crossprod(xi, xi %*% beta - y[i])) / length(i)
}
)
}
```

We initialize the linear model using the new implementation of `ls_model()`

.

`c(par0, H, ls_grad) %<-% ls_model(X, y)`

With increased flexibility of the algorithms comes more tuning parameters, and making a good choice of all of them becomes increasingly difficult. When introducing mini-batches we need to choose the mini-batch size in addition to the learning rate, and a good choice of learning rate or decay schedule will depend on the size of the mini-batch. To simplify matters, a mini-batch size of 1000 and a fixed learning rate are used in the subsequent applications. The learning rate will generally be chosen as large as possible without making the algorithms diverge, and with a mini-batch size of 1000 it is possible to run the algorithm with a learning rate of \(\gamma = 0.05\).

```
SG_tracer$clear()
SG(
par = par0,
N = nrow(X),
gamma = 5e-2,
grad = ls_grad,
m = 1000,
maxiter = 200,
cb = SG_tracer$tracer
)
```

Figure 9.6 shows that this implementation of the mini-batch stochastic gradient algorithm converges faster than the basic stochastic gradient algorithm with a power law decay schedule as well as the gradient descent algorithm. Eventually, it begins to fluctuate due to the fixed learning rate, but it quickly gets close to the minimum. We should not jump to conclusions from this assessment. It only really shows the improvement to the specific R implementation of using a mini-batch algorithm, and this implementation clearly benefits from the more vectorized computations with mini-batches of size 1000. We will return to this discussion in Section 9.3.1.

### 9.2.2 Momentum

Mini-batches stabilize the gradients, and so does momentum. Both techniques can be used in combination, and the momentum update of a mini-batch stochastic gradient algorithm is as follows: Given \(\theta_{n}\) and a batch \(I_n \subseteq \{1, \ldots, N\}\) with \(|I_n| = m\)

- compute \(g_n = \frac{1}{m} \sum_{i\in I_n} \nabla_{\theta} L(x_i, \theta_{n})\)
- compute \(\rho_n = \beta \rho_{n-1} + (1-\beta) g_n\)
- update the parameter \(\theta_{n+1} = \theta_{n} - \gamma_n \rho_n.\)

The memory of the algorithm is in the second step, where the direction, \(\rho_{n}\), is updated using a convex combination of the previous direction, \(\rho_{n-1}\), and the mini-batch gradient, \(g_n\). Usually, the initial direction is chosen as \(\rho_0 = 0\). The parameter \(\beta \in [0,1)\) is a tuning parameter determining how long the memory is. A value like \(\beta = 0.9\) or \(\beta = 0.95\) is often recommended – otherwise the memory in the algorithm will be rather short, and the effect of using momentum will be small. A choice of \(\beta = 0\) corresponds to the mini-batch algorithm without memory.

Contrary to the batch epoch function, the momentum epoch function
needs to store the previous direction between updates. It is not immediately
clear how to achieve this between two epochs using the generic `SG()`

implementation, but by implementing momentum epochs using a function factory,
we can easily use an enclosing environment of the epoch function
for storage.

```
momentum <- function() {
rho <- 0
function(
par,
samp,
gamma,
grad,
m = 50, # Mini-batch size
beta = 0.95, # Momentum memory
...
) {
M <- floor(length(samp) / m)
for(j in 0:(M - 1)) {
i <- samp[(j * m + 1):(j * m + m)]
# Using '<<-' assigns the value to rho in the enclosing environment
rho <<- beta * rho + (1 - beta) * grad(par, i, ...)
par <- par - gamma * rho
}
par
}
}
```

When calling `SG()`

below with `epoch = momentum()`

, the evaluation
of the function factory `momentum()`

returns the momentum epoch function
with is own local environment used to store \(\rho\). With momentum, we can
increase the learning rate to \(\gamma = 7 \times 10^{-2}\).

```
SG_tracer$clear()
SG(
par = par0,
N = nrow(X),
gamma = 7e-2,
epoch = momentum(),
grad = ls_grad,
m = 1000,
maxiter = 150,
cb = SG_tracer$tracer
)
```

Figure 9.7 shows that the momentum algorithm converges a little faster than the mini-batch algorithm without momentum. This can be explained by the slightly larger learning rate made possible by the use of momentum. With enough memory, momentum dampens rapid fluctuations, which allows for a larger choice of learning rate and a speedier convergence. However, too much memory results in low frequency oscillations and slow convergence. The excellent article Why Momentum Really Works contains many more details about momentum algorithms and beautiful illustrations.

### 9.2.3 Adaptive learning rates

One difficulty with optimization algorithms based only on gradients is that gradients are not invariant to reparametrizations. In fact, using gradients implicitly assumes that all parameters are on comparable scales. For our news article example, we standardized the model matrix to achieve this, but for many other optimization problems it is not so easy to choose a parametrization with all parameters on comparable scales. And even when we can reparametrize so that all parameters are on a common scale, this common scale can change from problem to problem making it impossible to recommend a good default choice of learning rate. The practical implication is that a considerable amount of tuning is necessary, when the algorithms are applied.

Algorithms that implement adaptive learning rates are alternatives to extensive tuning. They include schemes for adjusting the learning rate to the specific optimization problem. Adapting the learning rate is equivalent to scaling the gradient adaptively, and to achieve a form of automatic standardization of parameter scales, we will consider algorithms that adaptively scale each coordinate of the gradient separately.

To gain some intuition on how to sensibly adapt the scales of the gradient, we will analyze the typical scale of the mini-batch gradient for the linear model. Introduce first the normalized squared column norms \[\zeta_j = \frac{1}{N} \sum_{i=1}^N x_{ij}^2 = \frac{1}{N} \| x_{\cdot j}\|^2_2,\] and note that with a standardized \(X\), all the \(\zeta_j\)-s are equal. With \(\hat{\beta}\) the least squares estimate we also have that \[\frac{1}{N} \sum_{i=1}^N x_{ij} (y_i - \hat{\beta}{}^Tx_i) = 0\] for \(j = 1, \ldots, p\). Thus if we sample a random index, \(\iota\), from \(\{1, \ldots, N\}\) it holds that \(E(x_{\iota j} (y_{\iota} - \hat{\beta}{}^Tx_{\iota})) = 0\), where the expectation is w.r.t. \(\iota\). With \[\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{\beta}{}^Tx_i)^2\] denoting the residual variance, we also have that

\[ V(x_{\iota j} (y_{\iota} - \hat{\beta}{}^Tx_{\iota})) = E(x_{\iota j}^2 (y_{\iota} - \hat{\beta}{}^Tx_{\iota})^2) \simeq \zeta_j \hat{\sigma}^2 \]

For the above approximation to hold, the squared residual, \((y_{\iota} - \hat{\beta}{}^Tx_{\iota})^2\), must be roughly independent of \(x_{\iota}\), which is not guaranteed by the least squares fit alone, but holds approximately if data is from a population with homogeneous residual variance.

If \(I \subseteq \{1,\ldots,N\}\) is a random subset of size \(m\) sampled *with*
replacement, the averaged gradient
\[g = - \frac{1}{m} \sum_{i \in I} x_{i} (y_i - \hat{\beta}{}^T x_i)\]
is an average of \(m\) i.i.d. random variables with mean 0, thus
\[E(g_j^2) = V(g_j) \simeq \zeta_j \frac{\hat{\sigma}^2}{m}.\]
If \(\odot\) denotes the coordinate wise product of vectors (aka the Hadamard product),
this can also be written as
\[E(g \odot g) \simeq \zeta \frac{\hat{\sigma}^2}{m}.\]

The computations suggest that by estimating \(v = E(g \odot g)\) for a mini-batch gradient evaluated in \(\beta = \hat{\beta},\) we are in fact estimating \(\zeta\) up to a scale factor. We extrapolate this insight to the general case and standardize the \(j\)-th coordinate of the descent direction by an estimate of \(1 / \sqrt{v_j}\) to bring the coordinates on a (more) common scale. We will implement adaptive estimation of \(v\) using a similar update scheme as for momentum, where the estimate in iteration \(n\) is updated as a convex combination of the current value of \(g_n \odot g_n\) and the previous estimate of \(v\).

Given \(\theta_{n}\) and a batch \(I_n \subseteq \{1, \ldots, N\}\) with \(|I_n| = m\) the update consists of the following steps

- compute \(g_n = \frac{1}{m} \sum_{i\in I_n} \nabla_{\theta} L(x_i, \theta_{n})\)
- compute \(\rho_n = \beta_1 \rho_{n-1} + (1-\beta_1) g_n\)
- compute \(v_n = \beta_2 v_{n-1} + (1-\beta_2) g_n \odot g_n\)
- update the parameter \(\theta_{n+1} = \theta_{n} - \gamma_n \rho_n / (\sqrt{v_n} + 10^{-8}).\)

The vectors \(\rho_0\) and \(v_0\) are usually initialized to be \(0\). The tuning parameters \(\beta_1, \beta_2 \in [0, 1)\) control the memory of the first and second moments, respectively. The \(\sqrt{v_n}\) and the division in the last step are coordinate wise. The constant \(10^{-8}\) could, of course, be chosen differently, but is just a safeguard against division-by-zero.

The algorithm above is known as *Adam* (adaptive moment estimation),
and was introduced and analyzed by Kingma and Ba (2014). They include so-called
bias-correction steps that upscale \(\rho_n\) and \(v_n\) by the factors \(1 / (1 - \beta_1^n)\)
and \(1 / (1 - \beta_2^n)\), respectively. These steps are not difficult
to implement but are left out in the implementation below for simplicity.
It is also possible to replace the \(\sqrt{v_n}\) by other powers \(v_n^q\).
The choice of \(q = 1\) instead of \(q = 1/2\) makes the algorithm (more)
invariant to scale changes. Again, for simplicity we will only implement
the algorithm with \(q = 1/2\).

The `adam()`

function below is a function factory just as `momentum()`

, which
returns a function doing the Adam epoch update loop with an enclosing
environment used for storage of `rho`

and `v`

.

```
adam <- function() {
rho <- v <- 0
function(
par,
samp,
gamma,
grad,
m = 50, # Mini-batch size
beta1 = 0.9, # Momentum memory
beta2 = 0.9, # Second moment memory
...
) {
M <- floor(length(samp) / m)
for(j in 0:(M - 1)) {
i <- samp[(j * m + 1):(j * m + m)]
gr <- grad(par, i, ...)
rho <<- beta1 * rho + (1 - beta1) * gr
v <<- beta2 * v + (1 - beta2) * gr^2
par <- par - gamma * (rho / (sqrt(v) + 1e-8))
}
par
}
}
```

We run the stochastic gradient algorithm with Adam epochs and mini-batches of size 1000 with a fixed learning rate of \(0.01\) and a power law decay schedule interpolating between a learning rate of \(0.5\) and \(0.002\). The theoretical results by Kingma and Ba (2014) support a decay schedule proportional to \(1/\sqrt{n}\), thus we take \(a = 0.5\) below.

```
SG_tracer$clear()
SG(
par = par0,
N = nrow(X),
gamma = 1e-2,
epoch = adam(),
grad = ls_grad,
m = 1000,
maxiter = 150,
cb = SG_tracer$tracer
)
SG_trace_adam <- summary(SG_tracer)
SG_tracer$clear()
SG(
par = par0,
N = nrow(X),
gamma = decay_scheduler(gamma0 = 0.5, gamma1 = 2e-3, a = 0.5, n1 = 150),
epoch = adam(),
grad = ls_grad,
m = 1000,
maxiter = 150,
cb = SG_tracer$tracer
)
SG_trace_adam_decay <- summary(SG_tracer)
```

Figure 9.8 shows that both runs of the Adam implementation converge faster initially than any of the other algorithms. Eventually they both level off when the error is around \(10^{-3}\) and from this point on they fluctuate randomly. What is not shown is that Adam is also somewhat more robust to changes of the learning rate and rescaling of the parameters. Though Adam has more tuning parameters, it is easier to find good values of those parameters that will lead to fast convergence.

## 9.3 Stochastic gradient algorithms with Rcpp

As pointed out toward the end of Section 9.1.4, the implementations of stochastic gradient algorithms in R suffer from some shortcomings. In this section we will explore how either parts of the algorithms or entire algorithms can be moved to C++ via Rcpp.

The modularity of the `SG()`

implementation makes it easy to replace the
implementation of either the gradient computation or the entire epoch
loop by a C++ implementation, while retaining the overall control of the algorithm and the
resampling in R. This is explored first and consists mostly of translating
the numerical linear algebra of the gradient computations into C++ code.
We can then easily test, compare and benchmark the implementations using the R
implementation as a reference.

In the second part of this section the entire mini-batch stochastic gradient
algorithm is translated into C++. This has a couple of notable consequences. First,
we need access to a sampler in C++ that can do the randomization. While
there are various C++ interfaces to an equivalent of `sample()`

, some
considerations need to go into an appropriate choice. Second, we have to
give up on tracing as otherwise implemented. Though it is possible to implement
callback of an R function from a C++ function, a tracer will not have
the same access to the calling environment as in the R implementation.
Thus for performance assessment we will rely on benchmarking
of the entire algorithm.

For the C++ implementations we need to give up on some of the abstractions
that R provides, though we will benefit from Rcpp data types like
`NumericVector`

and `NumericMatrix`

. In a final implementation
we will use RcppArmadillo
to regain an abstract approach to numerical linear
algebra via the C++ library Armadillo.

### 9.3.1 Gradients and epochs in Rcpp

We first implement the computation of the mini-batch gradient in Rcpp for
the linear model, thus replacing the R implementation of the gradient in
`ls_model()`

. The implemented function takes the model matrix \(X\) and the
target variable \(y\) as arguments in addition to
the parameter and the subset of indices representing the mini-batch.
The matrix-vector multiplications for computing predicted values,
residuals and ultimately the gradient are replaced by two double for-loops.

```
// [[Rcpp::export]]
NumericVector lin_grad(
NumericVector beta,
IntegerVector ii,
NumericMatrix X,
NumericVector y
) {int m = ii.length(), p = beta.length();
NumericVector grad(p), yhat(m);// Shift indices one down due to zero-indexing in C++
1;
IntegerVector iii = clone(ii) -
for(int i = 0; i < m; ++i) {
for(int j = 0; j < p; ++j) {
yhat[i] += X(iii[i], j) * beta[j];
}
}for(int i = 0; i < m; ++i) {
for(int j = 0; j < p; ++j) {
grad[j] += X(iii[i], j) * (yhat[i]- y[iii[i]]);
}
}return grad / m;
}
```

Note how `clone(ii)`

is used to create a copy of the index vector `ii`

before
it is shifted by one. When R objects like vectors and matrices are passed
as arguments to a C++ function, only a pointer to the data in the object
is effectively passed. A modification of the resulting Rcpp object within
the body of the C++ function will be reflected in R – most likely
as an unwanted side effect. To prevent this we can create a
copy using `clone()`

of all data in the object – a so-called deep copy.

Since only a pointer to a matrix is passed when calling
a C++ function, no notable cost is associated with passing the entire
model matrix as argument. On the contrary, if we were to pass a subset of the model
matrix to the C++ function from R, data would inevitably be copied.
And since slicing of the matrix in C++ does not create a copy of data,
the implementation of `lin_grad()`

avoids the data copying problem that
matrix slicing in R incurred on the gradient computations.

We could test `lin_grad()`

by comparing it to `ls_grad()`

for different
choices of parameter vectors and indices. A simple way to achieve this is
to run the mini-batch stochastic gradient algorithm with either implementation
of the gradient for one epoch and compare the results. Note
that we make sure the algorithms use the same mini-batches by setting the seed.

```
set.seed(10)
par1 <- SG(
par = par0,
grad = ls_grad,
N = nrow(X),
gamma = 0.01,
maxiter = 1
)
set.seed(10)
par2 <- SG(
par = par0,
grad = lin_grad,
N = nrow(X),
gamma = 0.01,
maxiter = 1,
X = X,
y = y
)
range(par1 - par2)
```

`## [1] 0 0`

The differences are all 0, thus the test indicates that `lin_grad()`

implements
the same computation as `ls_grad()`

.

We can also move the entire epoch loop to C++, thus replacing the `batch()`

R function by Rcpp function `lin_batch()`

below. This is achieved simply by
adding one outer loop to `lin_grad()`

above, which loops over mini-batches and
updates the parameter vector.

```
// [[Rcpp::export]]
NumericVector lin_batch(
NumericVector par,
IntegerVector ii, double gamma,
NumericMatrix X,
NumericVector y, int m = 50
) {int p = par.length(), N = ii.length();
int M = floor(N / m);
NumericVector grad(p), yhat(N), beta = clone(par);1;
IntegerVector iii = clone(ii) -
for(int j = 0; j < M; ++j) {
for(int i = j * m; i < (j + 1) * m; ++i) {
for(int k = 0; k < p; ++k) {
yhat[i] += X(iii[i], k) * beta[k];
}
}for(int k = 0; k < p; ++k) {
0;
grad[k] = for(int i = j * m; i < (j + 1) * m; ++i) {
grad[k] += X(iii[i], k) * (yhat[i] - y[iii[i]]);
}
}
beta = beta - gamma * (grad / m);
}return beta;
}
```

We can test the implementation of `lin_batch()`

just as `lin_grad()`

by
running the stochastic gradient algorithm for one epoch. We note that
the `lin_batch()`

function should be passed to `SG()`

as the `epoch`

argument,
and that no `grad`

argument is required. The gradient computations are hard
coded into the implementation of this epoch function.

```
set.seed(10)
par1 <- SG(
par = par0,
grad = ls_grad,
N = nrow(X),
gamma = 0.01,
maxiter = 1
)
set.seed(10)
par2 <- SG(
par = par0,
epoch = lin_batch,
N = nrow(X),
gamma = 0.01,
maxiter = 1,
X = X,
y = y
)
range(par1 - par2)
```

`## [1] 0 0`

We see that these differences are also zero, and the test indicates that
`lin_batch()`

implements the same computation as `batch()`

with either
of the two implementations of the gradient.

To investigate the effect of the C++ implementation, the mini-batch stochastic gradient algorithm is run with the Rcpp implementations of either the gradient or the entire epoch using a mini-batch size of 1000 and a constant learning rate \(\gamma = 0.05\), which can be compared to the pure R implementation. In addition, we run the algorithm with mini-batch size one and \(\gamma = 5 \times 10^{-5}\) using the Rcpp implementation of the epoch.

Figure 9.9 shows, unsurprisingly, that implementing the
gradient as a C++ function makes the convergence faster compared to the pure
R implementation, but turning the entire epoch into a C++ function results in
little if any further improvement. However, the figure also shows that
the basic stochastic gradient algorithm (mini-batch size equal to one) converges
at a rate comparable to that of the mini-batch algorithm with mini-batch
size 1000 when the entire epoch is in C++. Thus any benefit of the mini-batch
algorithm with mini-batch size 1000 seems to be largely due to the
inefficiency of the basic algorithm in pure R. Experiments using the
`lin_batch()`

epoch implementation (not shown) suggest that the mini-batch
algorithm does have a real edge over the basic algorithm, but with much smaller
mini-batch sizes, e.g. \(m = 20\), cf Exercise 9.4.

### 9.3.2 Full Rcpp implementations

In this section we convert the entire stochastic gradient algorithm for the linear model into a C++ function. Two considerations come up when doing so:

- How do we sample with or without replacement in C++?
- Is it possible in C++ to implement the gradient computations using numerical linear algebra instead of two double for-loops?

We deal with the first question first, that is, how to sample from a finite
set of size \(d\) in C++ such as `sample()`

does in R. Though it may seem
like a rather trivial problem, it is not entirely trivial to implement
sampling from a general distribution on a finite set of size \(d\) that
is both correct and efficient as a function of \(d\). A non-exhaustive list of
ways to sample from discrete sets using Rcpp is:

- The R function
`sample()`

can be retrieved via`Function sample("sample");`

in Rcpp, which makes the function callable from C++ code. There is, however, an overhead associated with calling R functions from C++, which we prefer to avoid. - There is an Rcpp sugar implementation of
`sample()`

. It generates different samples than the R function`sample()`

. - There is an RcppArmadilloExtension implementation of
`sample()`

. See also the blog post Sampling Integers. It requires the R package RcppArmadillo, and it also generates different samples than the R function`sample()`

. - There is the function
`dqsample.int()`

from the dqrng package dealt with in Section 4.1.2. It requires the R package dqrng and runs on an independent stream of pseudo random numbers then R’s RNG.

The Rcpp sugar and the Rcpp Armadillo versions of `sample()`

are touched on
in the Rcpp introduction vignette, but they are not documented in detail. Both use R’s stream of pseudo random numbers, but they
nevertheless generate sequences that differ from each other and from `sample()`

.
To reproduce results in R, we would have to write an interface to those C++
functions if we were to use them. Since the dqrng package provides an interface
to `dqsample.int()`

from R as well as from C++, we will use that sampler
in our implementations.

If we only want to generate random permutations, there are a couple of additional
alternatives. There is `randperm()`

from the Armadillo library, and there is also
`std::random_shuffle()`

from the C++ Standard Library.
The implementation of `std::random_shuffle()`

is, however, not specified
by the standard and can be compiler dependent. Both of these alternatives
will also run on an independent stream of pseudo random numbers then

R’s RNG, and will not be considered any further.

The full Rcpp implementation of the stochastic gradient algorithm is now
a simple extension of `lin_batch()`

with one additional outer loop over
the epochs and a single line calling `dqsample_int()`

to sample the
random permutation.

```
// [[Rcpp::depends(dqrng)]]
// [[Rcpp::export]]
NumericVector SG_Rcpp(
NumericVector par, int N,
NumericVector gamma,
NumericMatrix X,
NumericVector y,int m = 50,
int maxiter = 100
) {int p = par.length(), M = floor(N / m);
NumericVector grad(p), yhat(N), beta = clone(par);
IntegerVector ii;
for(int l = 0; l < maxiter; ++l) {
// Note that dqsample_int samples from {0, 1, ..., N - 1}
ii = dqrng::dqsample_int(N, N); for(int j = 0; j < M; ++j) {
for(int i = j * m; i < (j + 1) * m; ++i) {
0;
yhat[i] = for(int k = 0; k < p; ++k) {
yhat[i] += X(ii[i], k) * beta[k];
}
}for(int k = 0; k < p; ++k) {
0;
grad[k] = for(int i = j * m; i < (j + 1) * m; ++i) {
grad[k] += X(ii[i], k) * (yhat[i] - y[ii[i]]);
}
}
beta = beta - gamma[l] * (grad / m);
}
}return beta;
}
```

We test the implementation by comparing it to the pure R implementation, but
using `dqsample_int() / dqsample.int()`

to sample the same permutations in both cases.

```
dqset.seed(10)
par1 <- SG(
par = par0,
grad = ls_grad,
N = nrow(X),
gamma = 1e-3,
maxiter = 10,
sampler = dqrng::dqsample.int
)
dqset.seed(10)
par2 <- SG_Rcpp(
par = par0,
N = nrow(X),
gamma = rep(1e-3, 10),
X = X,
y = y,
maxiter = 10
)
range(par1 - par2)
```

`## [1] 0 0`

The two most notable overall differences between `SG()`

and `SG_Rcpp()`

is that
`SG_Rcpp()`

is low level and very problem specific, while `SG()`

is high level
and generic. It is
surely possible to refactor the C++ code to write a more generic stochastic
gradient algorithm, that would rely on other functions for computing
e.g. the gradient. However, to retain performance those functions would need to
be implemented in C++ as well.

To achieve a more high level implementation, we can use the RcppArmadillo
package, which provides an interface to numerical linear
algebra from the C++ library Armadillo.
In this way we can use a matrix-vector style implementation similar to
the R implementation but without excessive data copying. It is, however,
necessary to use various data types from the Armadillo library, e.g.
matrices and column vectors, and it is also necessary to type
cast the sequence of integers from `dqsample_int()`

as a `uvec`

data type.

```
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::colvec SG_arma(
NumericVector par, int N,
NumericVector gamma,const arma::mat& X,
const arma::colvec& y,
int m = 50,
int maxiter = 100
) {int p = par.length(), M = floor(N / m);
arma::colvec grad(p), yhat(N), beta = clone(par);
uvec ii, iii;
for(int l = 0; l < maxiter; ++l) {
ii = as<arma::uvec>(dqrng::dqsample_int(N, N));for(int j = 0; j < M; ++j) {
1) * m - 1);
iii = ii.subvec(j * m, (j +
beta = beta - gamma[l] * (X.rows(iii).t() * (X.rows(iii) * beta - y(iii)) / m);
}
}return beta;
}
```

The function returns a column vector, which R represents as a \(p \times 1\)
matrix, whereas all the other implementations return a vector. But besides this
difference, a test will show that `SG_Rcpp()`

computes the same updates as
all other implementations when `dqsample_int()`

is used.

We benchmark and compare all five implementations of the mini-batch stochastic gradient algorithm by running them for 10 epoch iterations with a fixed learning rate of \(\gamma = 10^{-4}\) and the default mini-batch size \(m = 50\).

```
bench::mark(
SG = SG(par0, nrow(X), 1e-4, maxiter = 10, grad = ls_grad),
SG_lin_grad = SG(par0, nrow(X), 1e-4, maxiter = 10, grad = lin_grad, X = X, y = y),
SG_lin_batch = SG(par0, nrow(X), 1e-4, lin_batch, maxiter = 10, X = X, y = y),
SG_Rcpp = SG_Rcpp(par0, nrow(X), rep(1e-4, 10), X = X, y = y, maxiter = 10),
SG_arma = SG_arma(par0, nrow(X), rep(1e-4, 10), X = X, y = y, maxiter = 10),
check = FALSE, iterations = 5
)
```

```
## # A tibble: 5 × 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 SG 897ms 991ms 1.01 183.18MB 5.07
## 2 SG_lin_grad 461ms 493ms 2.03 43.9MB 2.44
## 3 SG_lin_batch 345ms 364ms 2.74 9.14MB 0.549
## 4 SG_Rcpp 311ms 326ms 3.08 1.82MB 0
## 5 SG_arma 567ms 587ms 1.64 4.54MB 0.328
```

The pure R implementation (SG) is, unsurprisingly, the slowest, but the full Rcpp implementation (SG_Rcpp) is less then a factor 3 faster. The Armadillo based implementation is somewhat slower, though still faster than the pure R implementation.

We rerun all algorithms, but this time with the mini-batch size \(m = 1\). Because
`SG()`

allocates a lot of memory, the
default memory tracking of `mark()`

will become prohibitively slow,
thus memory tracking is disabled.

```
bench::mark(
SG = SG(par0, nrow(X), 1e-4, maxiter = 10, m = 1, grad = ls_grad),
SG_lin_grad = SG(par0, nrow(X), 1e-4, maxiter = 10, m = 1, grad = lin_grad, X = X, y = y),
SG_lin_batch = SG(par0, nrow(X), 1e-4, lin_batch, maxiter = 10, m = 1, X = X, y = y),
SG_Rcpp = SG_Rcpp(par0, nrow(X), rep(1e-4, 10), X = X, y = y, maxiter = 10, m = 1),
SG_arma = SG_arma(par0, nrow(X), rep(1e-4, 10), X = X, y = y, maxiter = 10, m = 1),
check = FALSE, memory = FALSE, iterations = 5
)
```

```
## # A tibble: 5 × 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 SG 5.49s 5.83s 0.169 NA 5.11
## 2 SG_lin_grad 2.3s 2.46s 0.401 NA 6.89
## 3 SG_lin_batch 418.5ms 421.59ms 2.35 NA 0.469
## 4 SG_Rcpp 396.32ms 424.52ms 2.38 NA 0
## 5 SG_arma 748.43ms 802.69ms 1.22 NA 0
```

This reveals some bigger differences. With a fixed number of epoch iterations,
the total amount of computations does not change substantially when
changing the mini-batch size from 50 to 1, but the run times of `SG()`

and `SG_lin_grad()`

increase notably. The C++
implementations `SG_lin_batch()`

and `SG_Rcpp`

mostly retain the run time
they had for \(m = 50\). The runtime of the Armadillo implementation is
not affected that much either, but it is still slower than the two other
C++ based implementations and now by about a factor 2.

In summary, when the mini-batches are not too small, the improvements on runtime
by using Rcpp were relatively small – about a factor 2 to 3. However, when
the mini-batch size is small or even one, the runtime was about a factor 10
smaller when using either the full Rcpp implementation or the Rcpp implementation
of the entire epoch update. Using `lin_batch()`

for the epoch updates strikes
a good compromise in this particular example, by being runtime competitive
but with the main control structures of the algorithm in R.

## 9.4 Exercises

**Exercise 9.1 **Consider the loss
\[L((y, z), \theta) = \mu(z, \theta) - y \log (\mu(z, \theta))\]
for a model, \(\mu(z, \theta)\), of the conditional mean, \(E(Y \mid Z = z)\), of
\(Y\) given \(Z = z\). See also Example 9.2.

Show that the function \[a \mapsto a - b \log(a)\] for fixed \(b\) attains it unique minimum in \((0, \infty)\) for \(a = b\).

Then show the lower bound on the risk \[H(\theta) = E(L((Y, Z), \theta)) \geq E\Big( E(Y \mid Z) - E(Y \mid Z) \log (E(Y \mid Z))\Big),\] and conclude that if \(E(Y \mid Z = z) = \mu(z, \theta_0)\) for some \(\theta_0\), then \(\theta_0\) is a global minimizer of the risk.

**Exercise 9.2 **Consider the same online setup as in Example 9.3 but with
\[Y_i \mid Z_i = z_i \sim \mathcal{N}(e^{\beta_0 + \beta_1 z_i}, 5)\]
instead. That is, the conditional mean value model is still log-linear, but
the distribution is Gaussian instead of Poisson.

Replicate the online learning simulation as in Example 9.3 using the Poisson log-likelihood gradient and the squared error gradient. You may want to experiment with different learning rates. How does the change of the distribution affect the convergence of the algorithms?

**Exercise 9.3 **Show that for the linear model and the squared error loss, the hessian of the
objective function is
\[D^2 H_N(\beta) = X^T X\]
where \(X\) is the model matrix. Compute the eigenvalues of this matrix using
the news data from Section 9.1.4 before and after standardization. What are
the conditioning numbers of the matrices, and how do their values
relate to convergence of gradient based methods? See also Section 7.1.1.

**Exercise 9.4 **Make sure that you have a working implementation of the mini-batch stochastic
gradient algorithm for the news data from Section 9.1.4 with
the `lin_batch()`

Rcpp implementation of the epoch updates. Set up an experiment
where you run the algorithm over a grid of mini-batch sizes and learning rates.
Run the algorithm until the absolute deviation from the minimum of the
empirical squared error loss is less than 0.01 and determine the optimal
choice of the tuning parameters.