## 3.5 Splines

In the previous section orthogonality of basis functions played an important role for computing basis function expansions efficiently as well as for the statistical assessment of estimated coefficients. This section will deal with bivariate smoothing via basis functions that are not necessarily orthogonal.

Though some of the material of this section will apply to any choice of basis, we restrict attention to splines and consider almost exclusively the widely used B-splines (the “B” is for basis).

### 3.5.1 Smoothing splines

To motivate splines we briefly consider the following penalized least squares criterion for finding a smooth approximation to bivariate data: minimize \[\begin{equation} L(f) = \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f''\|_2^2 \tag{3.1} \end{equation}\] over all twice differentiable functions \(f\). The first term is the standard squared error, and we can easily find a smooth function interpolating the \(y\)-values (if all the \(x\)-values are different), which will thus drive the squared error to 0. The squared 2-norm regularizes the minimization problem so that the minimizer finds a balance between interpolation and having a small second derivative (note that \(\|f''\|_2 = 0\) if and only if \(f\) is an affine function). The tuning parameter \(\lambda\) controls this balance.

It is possible to show that the minimizer of (3.1) is a
natural cubic spline with
*knots* at the data points \(x_i\). That is, the spline is a \(C^2\)-function
that equals a third degree polynomial between the knots. At each knot,
the two polynomials that meet agree up to the second derivative, but they
may differ in the third derivative. That the solution is *natural* means
that it has zero second and third derivative at and beyond the two boundary knots.

It is not particularly difficult to show that the space of natural cubic splines is a vector space of dimension \(n\) if all the \(x\)-values are different. It is therefore possible to find a basis of splines, \(\varphi_1, \ldots, \varphi_n\), such that the \(f\) that minimizes (3.1) is of the form \[f = \sum_{i=1}^n \beta_i \varphi_i.\] What is remarkable about this is that the basis (and the finite-dimensional vector space it spans) does not depend upon the \(y\)-values. Though the optimization is over an infinite-dimensional space, the penalization ensures that the minimizer always lies in the same finite-dimensional space no matter what \(y_1, \ldots, y_n\) are. Moreover, since (3.1) is a quite natural criterion to minimize to find a smooth function fitting the bivariate data, splines appear as good candidates for producing such smooth fits. On top of that, splines have several computational advantages and are widely used.

If we let \(\hat{f}_i = \hat{f}(x_i)\) with \(\hat{f}\) the minimizer of (3.1), we have in vector notation that \[\hat{\mathbf{f}} = \boldsymbol{\Phi}\hat{\beta}\] with \(\boldsymbol{\Phi}_{ij} = \varphi_j(x_i)\). The minimizer can be found by observing that

\[\begin{align} L(\mathbf{f}) & = (\mathbf{y} - \mathbf{f})^T (\mathbf{y} - \mathbf{f}) + \lambda \| f'' \|_2^2 \\ & = ( \mathbf{y} - \boldsymbol{\Phi}\beta)^T (\mathbf{y} - \boldsymbol{\Phi}\beta) + \lambda \beta^T \mathbf{\Omega} \beta \end{align}\]

where
\[\mathbf{\Omega}_{ij} = \langle \varphi_i'', \varphi_j'' \rangle =
\int \varphi_i''(z) \varphi_j''(z) \mathrm{d}z.\]
The matrix \(\mathbf{\Omega}\) is positive semidefinite by construction, and
we refer to it as the *penalty matrix*. It induces a seminorm on \(\mathbb{R}^n\),
which lets us express the seminorm \(\|f''\|_2\) of \(f\) in terms of the
coefficients in the basis expansion using \(\varphi_i\).
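In more detail, writing \(f = \sum_i \beta_i \varphi_i\) so that \(f'' = \sum_i \beta_i \varphi_i''\), bilinearity of the inner product gives
\[\|f''\|_2^2 = \Big\langle \sum_{i} \beta_i \varphi_i'', \sum_{j} \beta_j \varphi_j'' \Big\rangle = \sum_{i,j} \beta_i \beta_j \langle \varphi_i'', \varphi_j'' \rangle = \beta^T \mathbf{\Omega} \beta.\]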

This is a standard penalized least squares problem, whose solution is \[\hat{\beta} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{\Omega})^{-1}\boldsymbol{\Phi}^T \mathbf{y}\] with resulting smoother \[\hat{\mathbf{f}} = \underbrace{\boldsymbol{\Phi} (\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{\Omega})^{-1}\boldsymbol{\Phi}^T}_{\mathbf{S}_{\lambda}} \mathbf{y}.\] This linear smoother with smoothing matrix \(\mathbf{S}_{\lambda}\) based on natural cubic splines gives what is known as a smoothing spline that minimizes (3.1). We will pursue spline based smoothing by minimizing (3.1) but using various B-spline bases that may have fewer or more than \(n\) elements. For the linear algebra, it actually doesn't matter if we use a spline basis or any other basis – as long as \(\boldsymbol{\Phi}_{ij} = \varphi_j(x_i)\) and \(\mathbf{\Omega}\) is given in terms of \(\varphi''_i\) as above.
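The solution can be derived by setting the gradient of the quadratic criterion to zero,
\[\nabla_\beta L = -2\boldsymbol{\Phi}^T(\mathbf{y} - \boldsymbol{\Phi}\beta) + 2\lambda \mathbf{\Omega}\beta = 0 \iff (\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{\Omega})\beta = \boldsymbol{\Phi}^T \mathbf{y},\]
and \(\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{\Omega}\) is positive definite, hence invertible, whenever \(\boldsymbol{\Phi}\) has full column rank.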

### 3.5.2 Splines in R

The splines package in R implements some of the basic functions needed to work
with splines. In particular, the `splineDesign()` function computes evaluations
of B-splines and their derivatives.

```
library(splines)  # For splineDesign()

# Note the specification of repeated boundary knots
knots <- c(0, 0, 0, seq(0, 1, 0.1), 1, 1, 1)
xx <- seq(0, 1, 0.005)
B_splines <- splineDesign(knots, xx)
matplot(xx, B_splines, type = "l", lty = 1)
```

The basis shown in Figure 3.13 is an example of a cubic B-spline
basis with the 11 inner knots \(0, 0.1, \ldots, 0.9, 1\). The repeated
boundary knots control how the spline basis behaves close to the boundaries
of the interval. This basis has 13 basis functions, not 11, and spans a
larger space than the space of *natural* cubic splines. It is
possible to compute a basis based on B-splines for the natural cubic splines
using the function `ns()`, but for all practical purposes this is not
important, and we will work exclusively with the B-spline basis itself.
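As a quick sanity check, suggested here as a supplement to the text, we can confirm the dimension of this basis and the fact that the B-splines sum to one on the interval (a partition of unity, which holds because of the repeated boundary knots):

```r
library(splines)

knots <- c(0, 0, 0, seq(0, 1, 0.1), 1, 1, 1)  # 17 knots in total
B <- splineDesign(knots, seq(0, 1, 0.01))
# For cubic (order 4) B-splines the number of basis functions equals
# the number of knots minus 4; here 17 - 4 = 13
ncol(B)
# With full boundary knot multiplicity the basis functions sum to 1
range(rowSums(B))
```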

The computation of the penalty matrix \(\mathbf{\Omega}\) constitutes a
practical problem, but observing that \(\varphi''_i\) is an affine
function between knots leads to a simple way of
computing \(\mathbf{\Omega}_{ij}\). Letting \(g_{ij} = \varphi''_i \varphi''_j\),
it holds that \(g_{ij}\) is quadratic between two consecutive knots \(a\) and \(b\),
in which case
\[\int_a^b g_{ij}(z) \mathrm{d}z = \frac{b - a}{6}\left(g_{ij}(a) + 4 g_{ij}\left(\frac{a + b}{2}\right) + g_{ij}(b)\right).\]
This identity is behind Simpson's rule
for numerical integration, and the fact that it is an identity for quadratic
polynomials, and not an approximation, means that Simpson's rule
applied appropriately leads to exact computation of
\(\mathbf{\Omega}_{ij}\). All we need is the ability to evaluate
\(\varphi''_i\) at certain points, and `splineDesign()` can be used for that.
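The exactness on quadratics is easy to check directly. The following sketch, with an arbitrary quadratic not taken from the text, compares Simpson's rule on a single interval to the analytic integral:

```r
g <- function(z) 3 * z^2 - 2 * z + 1  # an arbitrary quadratic
G <- function(z) z^3 - z^2 + z        # its antiderivative
a <- 0.2
b <- 1.3
simpson <- (b - a) / 6 * (g(a) + 4 * g((a + b) / 2) + g(b))
exact <- G(b) - G(a)
simpson - exact  # zero up to floating point rounding
```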

```
pen_mat <- function(inner_knots) {
  knots <- sort(c(rep(range(inner_knots), 3), inner_knots))
  d <- diff(inner_knots)  # The vector of knot differences; b - a
  g_ab <- splineDesign(knots, inner_knots, derivs = 2)
  knots_mid <- inner_knots[-length(inner_knots)] + d / 2
  g_ab_mid <- splineDesign(knots, knots_mid, derivs = 2)
  g_a <- g_ab[-nrow(g_ab), ]
  g_b <- g_ab[-1, ]
  (crossprod(d * g_a, g_a) +
    4 * crossprod(d * g_ab_mid, g_ab_mid) +
    crossprod(d * g_b, g_b)) / 6
}
```

It is laborious to write good tests of `pen_mat()`. We would have to work out a
set of example matrices by other means, e.g. by hand. Alternatively, we can
compare to a simpler numerical integration technique using Riemann sums.

```
spline_deriv <- splineDesign(
  c(0, 0, 0, 0, 0.5, 1, 1, 1, 1),
  seq(0, 1, 1e-5),
  derivs = 2
)
Omega_numeric <- crossprod(spline_deriv[-1, ]) * 1e-5 # Right Riemann sums
Omega <- pen_mat(c(0, 0.5, 1))
Omega_numeric / Omega
```

```
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9999700 0.9999673 0.999940 1.000000 NaN
## [2,] 0.9999673 0.9999663 0.999955 1.000000 1.000000
## [3,] 0.9999400 0.9999550 1.000000 1.000045 1.000060
## [4,] 1.0000000 1.0000000 1.000045 1.000034 1.000033
## [5,] NaN 1.0000000 1.000060 1.000033 1.000030
```


And we should also test an example with non-equidistant knots.

```
spline_deriv <- splineDesign(
  c(0, 0, 0, 0, 0.2, 0.3, 0.5, 0.6, 0.65, 0.7, 1, 1, 1, 1),
  seq(0, 1, 1e-5),
  derivs = 2
)
Omega_numeric <- crossprod(spline_deriv[-1, ]) * 1e-5 # Right Riemann sums
Omega <- pen_mat(c(0, 0.2, 0.3, 0.5, 0.6, 0.65, 0.7, 1))
range((Omega_numeric - Omega) / (Omega + 0.001)) # Relative error
```

`## [1] -0.0001607084 0.0002545494`

These examples indicate that `pen_mat()` computes \(\mathbf{\Omega}\) correctly,
in particular since increasing the precision of the Riemann sums by
lowering the step size \(10^{-5}\) decreases the relative error (not shown).
Of course, correctness ultimately depends on `splineDesign()` computing
the correct second derivatives, which hasn't been tested here.

We can also test how our implementation of smoothing splines works on data. We do this here by implementing the matrix algebra directly for computing \(\mathbf{S}_{\lambda} \mathbf{y}\).

```
inner_knots <- Nuuk_year$Year
Phi <- splineDesign(c(rep(range(inner_knots), 3), inner_knots), inner_knots)
Omega <- pen_mat(inner_knots)
smoother <- function(lambda)
  Phi %*% solve(
    crossprod(Phi) + lambda * Omega,
    t(Phi) %*% Nuuk_year$Temperature
  )
p_Nuuk +
  geom_line(aes(y = smoother(10)), color = "blue") +      # Undersmooth
  geom_line(aes(y = smoother(1000)), color = "red") +     # Smooth
  geom_line(aes(y = smoother(100000)), color = "purple")  # Oversmooth
```

Smoothing splines can be computed using the R function `smooth.spline()`
from the stats package. It is possible to manually specify the amount of
smoothing using one of the arguments `lambda`, `spar` or `df`
(the latter being the trace of the smoother matrix). However,
due to internal differences from the `splineDesign()` basis
above, the `lambda` argument to `smooth.spline()` does not match the
\(\lambda\) parameter above.

If the amount of smoothing is not manually set, `smooth.spline()` chooses
\(\lambda\) by *generalized cross validation* (GCV), which minimizes
\[\mathrm{GCV} = \sum_{i=1}^n \left(\frac{y_i - \hat{f}_i}{1 - \mathrm{df} / n}\right)^2,\]
where
\[\mathrm{df} = \mathrm{trace}(\mathbf{S}) = \sum_{i=1}^n S_{ii}.\]
GCV corresponds to LOOCV with the diagonal entries, \(S_{ii}\),
replaced by their average \(\mathrm{df} / n\). The main reason for
using GCV over LOOCV is that for some smoothers, such as the
spline smoother, it is possible to compute the trace \(\mathrm{df}\) easily
without computing \(\mathbf{S}\) or even its diagonal elements.
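For comparison, recall that for a linear smoother the LOOCV criterion can be written, using the standard leave-one-out identity, as
\[\mathrm{LOOCV} = \sum_{i=1}^n \left(\frac{y_i - \hat{f}_i}{1 - S_{ii}}\right)^2,\]
from which GCV is obtained by replacing each diagonal entry \(S_{ii}\) by the average \(\mathrm{df}/n\).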

To compare our results to `smooth.spline()` we optimize the GCV criterion.
First we implement a function that computes GCV for a fixed value of \(\lambda\).
Here the implementation relies on computing the smoother matrix, but
this is not the most efficient implementation. Section 3.5.3
provides a diagonalization of the smoother matrix jointly in the tuning
parameter \(\lambda\). This representation allows for efficient computation
with splines, and it will become clear why it is not necessary to compute
\(\mathbf{S}\) or even its diagonal elements; the trace of \(\mathbf{S}\) is
nevertheless easily computable.

```
gcv <- function(lambda, y) {
  S <- Phi %*% solve(crossprod(Phi) + lambda * Omega, t(Phi))
  df <- sum(diag(S))  # The trace of the smoother matrix
  sum(((y - S %*% y) / (1 - df / length(y)))^2, na.rm = TRUE)
}
```

Then we apply this function to a grid of \(\lambda\)-values and choose the value of \(\lambda\) that minimizes GCV.

```
lambda <- seq(50, 250, 2)
GCV <- sapply(lambda, gcv, y = Nuuk_year$Temperature)
lambda_opt <- lambda[which.min(GCV)]
qplot(lambda, GCV) + geom_vline(xintercept = lambda_opt, color = "red")
```

Finally, we can visualize the resulting smoothing spline.

```
smooth_opt <- Phi %*% solve(
  crossprod(Phi) + lambda_opt * Omega,
  t(Phi) %*% Nuuk_year$Temperature
)
p_Nuuk + geom_line(aes(y = smooth_opt), color = "blue")
```

The smoothing spline that we found by minimizing GCV can be compared to
the smoothing spline that `smooth.spline()` computes by minimizing GCV
as well.

```
smooth_splines <- smooth.spline(
  Nuuk_year$Year,
  Nuuk_year$Temperature,
  all.knots = TRUE  # Don't use heuristic
)
range(smooth_splines$y - smooth_opt)
```

`## [1] -0.000775662 0.001072587`

The differences between the smoothing spline computed by our implementation
and the one computed by `smooth.spline()` are hardly detectable visually, and
they are at most of the order \(10^{-3}\) as computed above. It is possible
to further decrease the differences by finding the optimal value of \(\lambda\)
with higher precision, but we will not pursue this here.

### 3.5.3 Efficient computation with splines

Using the full B-spline basis with knots in every observation is computationally
heavy and from a practical viewpoint unnecessary. Smoothing using B-splines
is therefore often done using a knot-selection heuristic that selects far fewer
knots than \(n\), in particular if \(n\) is large. This is also what
`smooth.spline()` does unless `all.knots = TRUE`. The heuristic for selecting
the number of knots is a bit complicated, but it is implemented in the function
`.nknots.smspl()`, which can be inspected for details. Once the number of knots
gets above 200 it grows extremely slowly with \(n\). With the number of knots
selected, a common heuristic for selecting their positions is to use the
quantiles of the distribution of the \(x\)-values. That is, with 9 knots, say,
the knots are positioned at the deciles (0.1-quantile, 0.2-quantile etc.). This
is effectively also what `smooth.spline()` does, and this heuristic places most
of the knots where we have most of the data points.

Having implemented a knot-selection heuristic that results in \(p\) B-spline basis functions, the matrix \(\Phi\) will be \(n \times p\), typically with \(p < n\) and with \(\Phi\) of full rank \(p\). In this case we derive a way of computing the smoothing spline that is computationally more efficient and numerically more stable than relying on the matrix-algebraic solution above. This is particularly so when we need to compute the smoother for many different \(\lambda\)-s to optimize the smoother. As we will show, we are effectively computing a simultaneous diagonalization of the (symmetric) smoother matrix \(\mathbf{S}_{\lambda}\) for all values of \(\lambda\).

The matrix \(\Phi\) has a singular value decomposition \[\Phi = \mathbf{U} D \mathbf{V}^T\] where \(D\) is diagonal with entries \(d_1 \geq d_2 \geq \ldots \geq d_p > 0\), \(\mathbf{U}\) is \(n \times p\) with orthonormal columns, and \(\mathbf{V}\) is \(p \times p\) and orthogonal. This means that \[\mathbf{U}^T \mathbf{U} = \mathbf{V}^T \mathbf{V} = \mathbf{V} \mathbf{V}^T = I\] is the \(p \times p\) dimensional identity matrix. We find that

\[\begin{align} \mathbf{S}_{\lambda} & = \mathbf{U}D\mathbf{V}^T(\mathbf{V}D^2\mathbf{V}^T + \lambda \mathbf{\Omega})^{-1} \mathbf{V}D\mathbf{U}^T \\ & = \mathbf{U}D (D^2 + \lambda \mathbf{V}^T \mathbf{\Omega} \mathbf{V})^{-1} D \mathbf{U}^T \\ & = \mathbf{U} (I + \lambda D^{-1} \mathbf{V}^T \mathbf{\Omega} \mathbf{V} D^{-1})^{-1} \mathbf{U}^T \\ & = \mathbf{U}(I + \lambda \widetilde{\mathbf{\Omega}})^{-1} \mathbf{U}^T, \end{align}\]

where \(\widetilde{\mathbf{\Omega}} = D^{-1} \mathbf{V}^T \mathbf{\Omega} \mathbf{V} D^{-1}\) is a positive semidefinite \(p \times p\) matrix. By diagonalization, \[\widetilde{\mathbf{\Omega}} = \mathbf{W} \Gamma \mathbf{W}^T,\] where \(\mathbf{W}\) is orthogonal and \(\Gamma\) is a diagonal matrix with nonnegative values in the diagonal, and we find that

\[\begin{align} \mathbf{S}_{\lambda} & = \mathbf{U} \mathbf{W} (I + \lambda \Gamma)^{-1} \mathbf{W}^T \mathbf{U}^T \\ & = \widetilde{\mathbf{U}} (I + \lambda \Gamma)^{-1} \widetilde{\mathbf{U}}^T \end{align}\]

where \(\widetilde{\mathbf{U}} = \mathbf{U} \mathbf{W}\) is an \(n \times p\) matrix with orthonormal columns.

The interpretation of this representation is as follows.

- First, the coefficients, \(\hat{\beta} = \widetilde{\mathbf{U}}^Ty\), are computed for expanding \(y\) in the basis given by the columns of \(\widetilde{\mathbf{U}}\).
- Second, the \(i\)-th coefficient is shrunk towards 0, \[\hat{\beta}_i(\lambda) = \frac{\hat{\beta}_i}{1 + \lambda \gamma_i}.\]
- Third, the smoothed values, \(\widetilde{\mathbf{U}} \hat{\beta}(\lambda)\), are computed as an expansion using the shrunken coefficients.

Thus the smoother works by shrinking the coefficients toward zero in the orthonormal basis given by the columns of \(\widetilde{\mathbf{U}}\). The coefficients corresponding to the largest eigenvalues \(\gamma_i\) are shrunk relatively more toward zero than those corresponding to the small eigenvalues. The basis formed by the columns of \(\widetilde{\mathbf{U}}\) is known as the Demmler-Reinsch basis with reference to Demmler and Reinsch (1975).
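To illustrate that this representation reproduces the direct computation of \(\mathbf{S}_{\lambda}\mathbf{y}\), here is a self-contained sketch on synthetic data. The data, knots and \(\lambda\) below are illustrative choices, not from the text, and the penalty matrix is a crude Riemann-sum approximation used in place of the exact `pen_mat()` computation:

```r
library(splines)

set.seed(1)
x <- seq(0, 1, length.out = 50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.2)  # synthetic data
inner_knots <- seq(0, 1, length.out = 8)
knots <- c(rep(range(inner_knots), 3), inner_knots)
Phi <- splineDesign(knots, x)
# Riemann-sum approximation of the penalty matrix (stand-in for pen_mat())
z <- seq(0, 1, 1e-4)
d2 <- splineDesign(knots, z, derivs = 2)
Omega <- crossprod(d2[-1, ]) * 1e-4
lambda <- 1
# Direct computation of the smoothed values
f_direct <- Phi %*% solve(crossprod(Phi) + lambda * Omega, crossprod(Phi, y))
# Computation via the simultaneous diagonalization
Phi_svd <- svd(Phi)
Omega_tilde <- t(t(crossprod(Phi_svd$v, Omega %*% Phi_svd$v)) / Phi_svd$d) / Phi_svd$d
Omega_tilde_svd <- svd(Omega_tilde)
U_tilde <- Phi_svd$u %*% Omega_tilde_svd$u
gamma <- Omega_tilde_svd$d                  # the eigenvalues
f_diag <- U_tilde %*% (crossprod(U_tilde, y) / (1 + lambda * gamma))
max(abs(f_direct - f_diag))  # agreement up to numerical rounding
```

Note that computing `f_diag` for a new value of \(\lambda\) only requires rescaling the coefficients, while the direct computation requires a new matrix factorization.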

We implement the computation of the diagonalization for the Nuuk temperature data using \(p = 20\) basis functions (18 inner knots) equidistantly distributed over the range of the years for which we have data.

```
inner_knots <- seq(1867, 2013, length.out = 18)
Phi <- splineDesign(c(rep(range(inner_knots), 3), inner_knots), Nuuk_year$Year)
Omega <- pen_mat(inner_knots)
Phi_svd <- svd(Phi)
Omega_tilde <- t(crossprod(Phi_svd$v, Omega %*% Phi_svd$v)) / Phi_svd$d
Omega_tilde <- t(Omega_tilde) / Phi_svd$d
# It is safer to use the numerical singular value decomposition ('svd()')
# for diagonalizing a positive semidefinite matrix than to use a
# more general numerical diagonalization implementation such as 'eigen()'.
Omega_tilde_svd <- svd(Omega_tilde)
U_tilde <- Phi_svd$u %*% Omega_tilde_svd$u
```

We observe from Figures 3.17 and 3.18 that there are two relatively large eigenvalues corresponding to the two basis functions with erratic behavior close to the boundaries, and there are two eigenvalues that are effectively zero corresponding to the two affine basis functions. In addition, the more oscillating the basis function is, the larger is the corresponding eigenvalue, and the more is the corresponding coefficient shrunk toward zero by the spline smoother.

Observe also that \[\mathrm{df}(\lambda) = \mathrm{trace}(\mathbf{S}_\lambda) = \sum_{i=1}^p \frac{1}{1 + \lambda \gamma_i},\] which makes it possible to implement GCV without even computing the diagonal entries of \(\mathbf{S}_{\lambda}.\)
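As a small illustration of this formula, consider computing \(\mathrm{df}(\lambda)\) from a set of eigenvalues (the values below are hypothetical): for \(\lambda = 0\) we get \(\mathrm{df} = p\), and as \(\lambda\) grows, \(\mathrm{df}\) decreases toward the number of zero eigenvalues, i.e. the dimension of the space of affine functions.

```r
df_lambda <- function(lambda, gamma) sum(1 / (1 + lambda * gamma))

gamma <- c(0, 0, 0.01, 0.1, 1, 10)  # hypothetical eigenvalues; two are zero
df_lambda(0, gamma)    # no penalization: df equals p = 6
df_lambda(1e8, gamma)  # heavy penalization: df approaches 2
```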

### References

Demmler, A., and C. Reinsch. 1975. “Oscillation Matrices with Spline Smoothing.” *Numerische Mathematik* 24 (5): 375–82.