3 Bivariate smoothing

This chapter is on the estimation of a smooth relation between two variables \((X, Y)\). We will focus on direct estimation of (aspects of) the conditional distribution of \(Y\) given \(X\). Thus the variables are treated asymmetrically. When both variables are real valued, we get a good idea about their relation from a scatter plot of \(Y\) against \(X\), and the aim is basically to smooth the relation that the scatter plot shows.

For most of the methods considered, it does not matter whether \(X\) represents a deterministic or a random variable. Thus \(X\) could be fixed by an experimental design, it could represent time, or it could be an observed variable. In Figure 3.1 (left), \(Y\) is the annual average temperature in Nuuk, Greenland, and \(X\) is time in calendar years. In Figure 3.1 (right), \(Y\) is the monthly average temperature in Nuuk, while \(X\) is the monthly average temperature in Qaqortoq – a town in Greenland south of Nuuk. In the former example, time is deterministic and we are interested in how the temperature in Nuuk changes over time. Is there, for instance, a trend? In the second example, we are interested in how the temperature in Qaqortoq is predictive of the temperature in Nuuk. We will use these two temperature examples throughout; see Vinther et al. (2006). The raw data is available from the site SW Greenland temperature data and also from the CSwR package.

p_Nuuk <- ggplot(Nuuk_year, aes(x = Year, y = Temperature)) + 
  geom_point()

p_Nuuk + 
  geom_smooth(se = FALSE) +         # A loess smoother
  geom_smooth(
    method = "lm", 
    formula = y ~ poly(x, 10),      # A degree-10 polynomial expansion
    color = "red", 
    se = FALSE
  ) + 
  geom_smooth(
    method = "gam", 
    formula = y ~ s(x, bs = "cr"),   # A spline smoother via 'mgcv::gam()'
    color = "purple", 
    se = FALSE
  )

p_Nuuk_Qaqortoq <- ggplot(Nuuk_Qaqortoq, aes(Temp_Qaqortoq, Temp_Nuuk)) + 
  geom_point()

p_Nuuk_Qaqortoq +
  geom_smooth(
    method = "lm", 
    formula = y ~ x,                 # A straight line smoother
    color = "red", 
    se = FALSE
  ) + 
  geom_smooth(
    method = "gam", 
    formula = y ~ s(x, bs = "cr"),   # A spline smoother via 'mgcv::gam()'
    color = "purple", 
    se = FALSE
  )

Figure 3.1: Nuuk average yearly temperature in degrees Celsius (left) smoothed using loess (black), a degree 10 polynomial (red) and a smooth spline (purple). Nuuk average monthly temperature against Qaqortoq average monthly temperature (right) smoothed using a straight line (red) and a smooth spline (purple).

3.1 Nearest neighbor and kernel smoothers

One of the most basic ideas for smoothing bivariate data is to use a running mean or moving average. This is particularly sensible when the \(x\)-values are equidistant, e.g. when the observations constitute a time series such as the Nuuk temperature data. The running mean is an example of the more general class of nearest neighbor smoothers.

Mathematically, the \(k\) nearest neighbor smoother in \(x_i\) is defined as \[\hat{f}_i = \frac{1}{k} \sum_{j \in N_i} y_j\] where \(N_i\) is the set of indices for the \(k\) nearest neighbors of \(x_i\). This simple idea is actually very general and powerful. It works as long as the \(x\)-values lie in a metric space, and by letting \(k\) grow with \(n\) it is possible to construct consistent nonparametric estimators of regression functions, \(f(x) = E(Y \mid X = x)\), under minimal assumptions. The practical problem is that \(k\) must grow slowly in high dimensions, and the estimator is not a panacea.

In this chapter we focus mostly on \(x\) being real valued with the ordinary metric used to define the nearest neighbors. The total ordering of the real line adds a couple of extra possibilities to the definition of \(N_i\). When \(k\) is odd, the symmetric nearest neighbor smoother takes \(N_i\) to consist of \(x_i\) together with the \((k-1)/2\) smaller \(x_j\)-s closest to \(x_i\) and the \((k-1)/2\) larger \(x_j\)-s closest to \(x_i\). It is also possible to choose a one-sided smoother with \(N_i\) corresponding to the \(k\) smaller \(x_j\)-s closest to \(x_i\), in which case the smoother would be known as a causal filter.

The symmetric definition of neighbors makes it very easy to handle the neighbors computationally; we don’t need to compute and keep track of the \(n^2\) pairwise distances between the \(x_i\)-s, we only need to sort data according to the \(x\)-values. Once data is sorted, \[N_i = \{i - (k - 1) / 2, i - (k - 1) / 2 + 1, \ldots, i - 1 , i, i + 1, \ldots, i + (k - 1) / 2\}\] for \((k - 1) / 2 \leq i \leq n - (k - 1) / 2\). The symmetric \(k\) nearest neighbor smoother is thus a running mean of the \(y\)-values when sorted according to the \(x\)-values. There are a couple of possibilities for handling the boundaries, one being simply to not define a value of \(\hat{f}_i\) outside of the interval above.

With \(\hat{\mathbf{f}}\) denoting the vector of smoothed values by a nearest neighbor smoother we can observe that it is always possible to write \(\hat{\mathbf{f}} = \mathbf{S}\mathbf{y}\) for a matrix \(\mathbf{S}\). For the symmetric nearest neighbor smoother and with data sorted according to the \(x\)-values, the matrix has the following band diagonal form

\[ \mathbf{S} = \left( \begin{array}{cccccccccc} \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & 0 & 0 & \ldots & 0 & 0 \\ 0 & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & 0 & \ldots & 0 & 0\\ 0 & 0 & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ldots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & \ldots & \frac{1}{5} & \frac{1}{5} \\ \end{array} \right) \]

here given for \(k = 5\) and with dimensions \((n - 4) \times n\) due to the undefined boundary values.

3.1.1 Linear smoothers

A smoother of the form \(\hat{\mathbf{f}} = \mathbf{S}\mathbf{y}\) for a smoother matrix \(\mathbf{S}\), such as the nearest neighbor smoother, is known as a linear smoother. The linear form is often beneficial for theoretical arguments, and many smoothers considered in this chapter will be linear smoothers. For computing \(\hat{\mathbf{f}}\) there may, however, be many alternatives to forming the matrix \(\mathbf{S}\) and computing the matrix-vector product, and doing so is often not the best way to compute the smoothed values.

It is, on the other hand, useful to see how \(\mathbf{S}\) can be constructed for the symmetric nearest neighbor smoother.

w <- c(rep(1/11, 11), rep(0, 147 - 10))
S <- matrix(w, 147 - 10, 147, byrow = TRUE)
## Warning in matrix(w, 147 - 10, 147, byrow = TRUE): data length [148] is not a
## sub-multiple or multiple of the number of rows [137]

The construction above relies on vector recycling of w in the construction of S and the fact that w has length \(147 + 1\), which will effectively cause w to be translated by one to the right every time it is recycled for a new row. As seen, the code triggers a warning by R, but in this case we get what we want.

S

We can use the matrix to smooth the annual average temperature in Nuuk using a running mean with a window of \(k = 11\) years. That is, the smoothed temperature at a given year is the average of the temperatures in the period from five years before to five years after. Note that to add the smoothed values to the previous plot we need to pad the values at the boundaries with NAs to get a vector of length 147.

# Check first if data is sorted correctly.
# The test is backwards, but confirms that data isn't unsorted :-)
is.unsorted(Nuuk_year$Year)
## [1] FALSE
f_hat <- c(rep(NA, 5), S %*% Nuuk_year$Temperature, rep(NA, 5))
p_Nuuk + geom_line(aes(y = f_hat), color = "blue")

Figure 3.2: Annual average temperature in Nuuk smoothed using the running mean with \(k = 11\) neighbors.

3.1.2 Implementing the running mean

The running mean smoother fulfills the following identity \[\hat{f}_{i+1} = \hat{f}_{i} - y_{i - (k-1)/2} / k + y_{i + (k + 1)/2} / k,\] which can be used for a much more efficient implementation than the matrix-vector multiplication. It should be emphasized again that the identity above and the implementation below assume that data is sorted according to \(x\)-values.

# The vector 'y' must be sorted according to the x-values
run_mean <- function(y, k) {
  n <- length(y)
  m <- floor((k - 1) / 2)
  k <- 2 * m + 1           # Ensures k to be odd and m = (k - 1) / 2
  y <- y / k
  s <- rep(NA, n)
  s[m + 1] <- sum(y[1:k])
  for(i in (m + 1):(n - m - 1)) 
    s[i + 1] <- s[i] - y[i - m] + y[i + 1 + m]
  s
}
p_Nuuk + geom_line(aes(y = run_mean(Nuuk_year$Temperature, 11)), color = "blue")

Figure 3.3: Annual average temperature in Nuuk smoothed using the running mean with \(k = 11\) neighbors. This time using a different implementation than in Figure 3.2.

The R function filter() (from the stats package) can be used to compute running means and general moving averages using any weight vector. We compare our two implementations to filter().

f_hat_filter <- stats::filter(Nuuk_year$Temperature, rep(1/11, 11))
range(f_hat_filter - f_hat, na.rm = TRUE)
## [1] -8.882e-16  4.441e-16
range(f_hat_filter - run_mean(Nuuk_year$Temperature, 11), na.rm = TRUE)
## [1] -4.441e-16  1.332e-15

Note that filter() uses the same boundary convention as used in run_mean().

A benchmark comparison between matrix-vector multiplication, run_mean() and filter() gives the following table with median run time in microseconds.

##                                       expr  median
## 1                          S1 %*% y[1:512]  157.52
## 2                         S2 %*% y[1:1024]  625.87
## 3                         S3 %*% y[1:2048] 2439.54
## 4                         S4 %*% y[1:4096] 9789.94
## 5               run_mean(y[1:512], k = 11)   43.60
## 6              run_mean(y[1:1024], k = 11)   84.17
## 7              run_mean(y[1:2048], k = 11)  166.21
## 8              run_mean(y[1:4096], k = 11)  333.58
## 9   stats::filter(y[1:512], rep(1/11, 11))   30.09
## 10 stats::filter(y[1:1024], rep(1/11, 11))   40.96
## 11 stats::filter(y[1:2048], rep(1/11, 11))   61.36
## 12 stats::filter(y[1:4096], rep(1/11, 11))  115.23

The matrix-vector computation is clearly much slower than the two alternatives, and the time to construct the \(\mathbf{S}\)-matrix has not even been included in the benchmark above. There is also a difference in how the matrix-vector multiplication scales with the size of data compared to the alternatives. Whenever the data size doubles the run time approximately doubles for both filter() and run_mean(), while it quadruples for the matrix-vector multiplication. This shows the difference between an algorithm that scales like \(O(n)\) and an algorithm that scales like \(O(n^2)\) as the matrix-vector product does.

Even though filter() implements a more general algorithm than run_mean(), it is still faster, which reflects that it is implemented in compiled C code.
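For reference, a comparison along these lines could be set up with the microbenchmark package as sketched below. This is a sketch only: the simulated data, the restriction to two of the four sizes, and the helper make_S() that constructs the band matrices with the recycling trick from above are our own choices, and the original benchmark setup may have differed.

library(microbenchmark)
# Band matrix for the running mean with k = 11, as constructed earlier
make_S <- function(n, k = 11) {
  w <- c(rep(1 / k, k), rep(0, n - k + 1))  # Length n + 1 for the recycling trick
  suppressWarnings(matrix(w, n - k + 1, n, byrow = TRUE))
}
y <- rnorm(4096)
S1 <- make_S(512)
S4 <- make_S(4096)
microbenchmark(
  S1 %*% y[1:512],
  S4 %*% y[1:4096],
  run_mean(y[1:512], k = 11),
  run_mean(y[1:4096], k = 11),
  stats::filter(y[1:512], rep(1/11, 11)),
  stats::filter(y[1:4096], rep(1/11, 11))
)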

3.1.3 Choosing \(k\) by cross-validation

Cross-validation relies on predictions of \(y_i\) from \(x_i\) for data points \((x_i, y_i)\) left out of the data set when the predictor is fitted to data. Many (linear) smoothers have a natural definition of an “out-of-sample” prediction, that is, of how \(\hat{f}(x)\) is computed for \(x\) not in the data. If so, it becomes possible to define \[\hat{f}^{-i}_i = \hat{f}^{-i}(x_i)\] as the prediction at \(x_i\) using the smoother computed from data with \((x_i, y_i)\) excluded. However, here we directly define \[\hat{f}^{-i}_i = \sum_{j \neq i} \frac{S_{ij}y_j}{1 - S_{ii}}\] for any linear smoother. This definition agrees with the “out-of-sample” predictor in \(x_i\) for most smoothers, but this has to be verified case by case.

The running mean is a little special in this respect. In the previous section, the running mean was only considered for odd \(k\) and using a symmetric neighbor definition. This is convenient when considering the running mean in the observations \(x_i\). When considering the running mean in any other point, a symmetric neighbor definition works better with an even \(k\). This is exactly what the definition of \(\hat{f}^{-i}_i\) above amounts to. If \(\mathbf{S}\) is the running mean smoother matrix for an odd \(k\), then \(\hat{f}^{-i}_i\) corresponds to symmetric \((k-1)\)-nearest neighbor smoothing excluding \((x_i, y_i)\) from the data.
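A small numerical check of this correspondence, on toy data of our own construction, compares the leave-one-out formula with the symmetric \((k-1)\)-nearest neighbor average that excludes the \(i\)-th observation.

y_toy <- rnorm(11)
k <- 5
i <- 6                                          # An interior point
window <- (i - (k - 1) / 2):(i + (k - 1) / 2)   # The symmetric k nearest neighbors
f_loo_formula <- (mean(y_toy[window]) - y_toy[i] / k) / (1 - 1 / k)
f_loo_direct <- mean(y_toy[window[window != i]])  # (k-1)-NN average excluding i
all.equal(f_loo_formula, f_loo_direct)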

Using the definition above, we get that the leave-one-out cross-validation squared error criterion becomes \[\mathrm{LOOCV} = \sum_{i=1}^n (y_i - \hat{f}^{-i}_i)^2 = \sum_{i=1}^n \left(\frac{y_i - \hat{f}_i}{1 - S_{ii}}\right)^2.\] The important observation from the identity above is that LOOCV can be computed without actually computing all the \(\hat{f}^{-i}_i\).

For the running mean, all diagonal elements of the smoother matrix are identical. We disregard boundary values (with the NA value), so to get a comparable quantity across different choices of \(k\) we use mean() instead of sum() in the implementation.

loocv <- function(k, y) {
  f_hat <- run_mean(y, k)
  mean(((y - f_hat) / (1 - 1/k))^2, na.rm = TRUE) 
}
k <- seq(3, 40, 2)
CV <- sapply(k, loocv, y = Nuuk_year$Temperature)
k_opt <- k[which.min(CV)]
ggplot(mapping = aes(k, CV)) + geom_line() + 
  geom_vline(xintercept = k_opt, color = "red")

Figure 3.4: The leave-one-out cross-validation criterion for the running mean as a function of the number of neighbors \(k\).

The optimal choice of \(k\) is 15, but the LOOCV criterion jumps quite a lot up and down with changing neighbor size, and both \(k = 9\) and \(k = 25\) give rather low values as well.

p_Nuuk + 
  geom_line(aes(y = run_mean(Nuuk_year$Temperature, 9)), color = "red") +
  geom_line(aes(y = run_mean(Nuuk_year$Temperature, k_opt)), color = "blue") +
  geom_line(aes(y = run_mean(Nuuk_year$Temperature, 25)), color = "purple")

Figure 3.5: The \(k\)-nearest neighbor smoother with the optimal choice of \(k\) based on LOOCV (blue) and with \(k = 9\) (red) and \(k = 25\) (purple).

3.1.4 Nadaraya–Watson kernel smoothing

The section before introduced the basic idea of nearest neighbor smoothing using a fixed number of neighbors. A very similar idea is to use a fixed neighborhood, \(B_i\), say, around \(x_i\). This leads to the estimator \[\hat{f}_i = \frac{\sum_{j=1}^n y_j 1_{B_i}(x_j)}{\sum_{j=1}^n 1_{B_i}(x_j)},\] which is simply the average of the \(y\)-s for which the corresponding \(x\)-s fall in the neighborhood \(B_i\) of \(x_i\). Note that contrary to the nearest neighbor estimator, the denominator now also depends on the \(x\)-s.

In a metric space the natural choice of \(B_i\) is the ball, \(B(x_i, h)\), around \(x_i\) with some radius \(h\). In \(\mathbb{R}\) with the usual metric (or even \(\mathbb{R}^p\) equipped with any norm-induced metric) we have that
\[1_{B(x_i, h)}(x) = 1_{B(0, 1)}\left(\frac{x - x_i}{h}\right),\] thus since \(B(0, 1) = [-1, 1]\) in \(\mathbb{R}\) \[\hat{f}_i = \frac{\sum_{j=1}^n y_j 1_{[-1, 1]}\left(\frac{x_j - x_i}{h}\right)}{\sum_{l=1}^n 1_{[-1, 1]}\left(\frac{x_l - x_i}{h}\right)}.\] This nonparametric estimator of the conditional expectation \(E(Y \mid X = x_i)\) is closely related to the kernel density estimator with the rectangular kernel, see Section 2.2, and just as for this estimator there is a natural generalization allowing for arbitrary kernels instead of the indicator function \(1_{[-1, 1]}\).

With \(K : \mathbb{R} \mapsto \mathbb{R}\) a fixed kernel the corresponding kernel smoother with bandwidth \(h\) becomes \[\hat{f}_i = \frac{\sum_{j=1}^n y_j K\left(\frac{x_j - x_i}{h}\right)}{\sum_{l=1}^n K\left(\frac{x_l - x_i}{h}\right)}.\] This smoother is known as the Nadaraya–Watson kernel smoother or Nadaraya–Watson estimator. See Exercise 3.1 for an additional perspective based on bivariate kernel density estimation.

The Nadaraya–Watson kernel smoother can be implemented in much the same way as the kernel density estimator, and the run time will inevitably scale like \(O(n^2)\) unless we exploit special properties of the kernel or use approximation techniques such as binning. In this section we will focus on implementing bandwidth selection rather than the kernel smoother itself. The basic kernel smoothing computation is also implemented in the ksmooth function from the stats package.

f_hat <- ksmooth(
  Nuuk_year$Year, 
  Nuuk_year$Temperature, 
  kernel = "normal",
  bandwidth = 10, 
  x.points = Nuuk_year$Year
)
p_Nuuk + geom_line(aes(y = f_hat$y), color = "blue")

Figure 3.6: Nadaraya–Watson kernel smoother of the annual average temperature in Nuuk computed using ksmooth with the Gaussian kernel and bandwidth 10.

The kernel smoother is clearly a linear smoother, and we will implement its computation and the bandwidth selection using LOOCV by direct computation of the smoother matrix \(\mathbf{S}\), which is given by \[S_{ij} = \frac{K\left(\frac{x_j - x_i}{h}\right)}{\sum_{l=1}^n K\left(\frac{x_l - x_i}{h}\right)}.\] The rows sum to 1 by construction, and the diagonal elements are \[S_{ii} = \frac{K(0)}{\sum_{l=1}^n K\left(\frac{x_l - x_i}{h}\right)}.\] The computation of \(\mathbf{S}\) for the Gaussian kernel is implemented using outer() and rowSums().

kern <- function(x) exp(- x^2 / 2)  # The Gaussian kernel
Kij <- outer(Nuuk_year$Year, Nuuk_year$Year, function(x, y) kern((x - y) / 10))
S <- Kij / rowSums(Kij)
f_hat <- S %*% Nuuk_year$Temperature
p_Nuuk + geom_line(aes(y = f_hat), color = "blue")

Figure 3.7: Nadaraya–Watson kernel smoother of the annual average temperature in Nuuk computed using the smoother matrix and the Gaussian kernel with bandwidth 10.

The implementation above will work with any kernel and any sequence of \(x\)-s. In this example, the kernel is symmetric and the \(x\)-s are equidistant. Exercise 3.2 explores how to exploit this in the computation of the smoother matrix as well as the diagonal elements of the smoother matrix.

The smoother computed using ksmooth with bandwidth 10, as shown in Figure 3.6, is different from the smoother computed directly from the smoother matrix, see Figure 3.7. Though both computations use the Gaussian kernel and allegedly a bandwidth of 10, the resulting smoothers differ because ksmooth internally rescales the bandwidth. The rescaling amounts to multiplying \(h\) by the factor 0.3706506, which makes the kernel have quartiles at \(\pm 0.25 h\) (see also ?ksmooth).
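The factor can be verified directly: it is the standard deviation that places the quartiles of a Gaussian kernel at \(\pm 0.25\).

0.25 / qnorm(0.75)  # The standard deviation with quartiles at +/- 0.25
## [1] 0.3706506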

The LOOCV computation is implemented as a function that computes the smoother matrix and the corresponding LOOCV value as a function of the bandwidth. For comparison with previous implementations the mean is computed instead of the sum.

loocv <- function(h) {
  Kij <- outer(Nuuk_year$Year, Nuuk_year$Year, function(x, y) kern((x - y) / h))
  S <- Kij / rowSums(Kij)
  mean(((Nuuk_year$Temperature - S %*% Nuuk_year$Temperature) / (1 - diag(S)))^2) 
}
h <- seq(1, 5, 0.05)
CV <- sapply(h, loocv)
h_opt <- h[which.min(CV)]
ggplot(mapping = aes(h, CV)) + geom_line() + 
  geom_vline(xintercept = h_opt, color = "red")

Figure 3.8: The leave-one-out cross-validation criterion for the kernel smoother using the Gaussian kernel as a function of the bandwidth \(h\).

The optimal bandwidth is 1.55. We compute the resulting optimal smoother and compare it to the smoother computed using ksmooth.

Kij <- outer(Nuuk_year$Year, Nuuk_year$Year, function(x, y) kern((x - y) / h_opt))
S <- Kij / rowSums(Kij)
f_hat <- S %*% Nuuk_year$Temperature
f_hat_ksmooth <- ksmooth(
  Nuuk_year$Year, Nuuk_year$Temperature, 
  kernel = "normal",
  bandwidth = h_opt /  0.3706506,  # Rescaling!
  x.points = Nuuk_year$Year
)
range(f_hat - f_hat_ksmooth$y)
## [1] -4.535e-05  4.599e-05

The differences are of the order \(10^{-5}\), which is small but larger than can be explained by rounding errors alone. In fact, ksmooth truncates the tails of the Gaussian kernel to 0 beyond \(4h\). This is where the kernel becomes less than \(0.000335 \times K(0) = 0.000134\), which for practical purposes effectively equals zero. The truncation explains the relatively large differences between the two results that should otherwise be equivalent. It is an example where the approximate solution computed by ksmooth is acceptable because it substantially reduces the run time.

p_Nuuk + 
  geom_line(aes(y = run_mean(Nuuk_year$Temperature, 9)), color = "red") +
  geom_line(aes(y = f_hat), color = "blue")

Figure 3.9: Nadaraya–Watson kernel smoother of the annual average temperature in Nuuk for the optimal bandwidth using LOOCV (blue) compared to the \(k\)-nearest neighbor smoother with \(k = 9\) (red).

Figure 3.9 shows the optimal kernel smoother, which is actually somewhat wiggly. It is locally smoother than the \(k\)-nearest neighbor smoother but overall comparable to that smoother with \(k = 9\).

3.2 Orthogonal basis expansions

3.2.1 Polynomial expansions

Degree 19 polynomial fitted to the temperature data.

We fit a degree 19 polynomial using lm(), with an explicit intercept column scaled to have norm one, and extract the model matrix from the lm-object.

n <- nrow(Nuuk_year)            # Number of observations
intercept <- rep(1/sqrt(n), n)  # To make intercept column have norm one
polylm <- lm(Temperature ~ intercept + poly(Year, 19) - 1, data = Nuuk_year)
Phi <- model.matrix(polylm)

Figure 3.10: The model matrix columns as functions

The model matrix is (almost) orthogonal, and estimation becomes quite simple. With an orthogonal model matrix the normal equation reduces to the estimate \[\hat{\beta} = \Phi^T Y\] since \(\Phi^T \Phi = I\). The predicted (or fitted) values are \(\Phi \Phi^T Y\) with smoother matrix \(\mathbf{S} = \Phi \Phi^T\) being a projection.

(t(Phi) %*% Nuuk_year$Temperature)[1:10, 1]
##       intercept poly(Year, 19)1 poly(Year, 19)2 poly(Year, 19)3 poly(Year, 19)4 
##        -17.2470          4.9002         -1.7969          0.8175          5.9669 
## poly(Year, 19)5 poly(Year, 19)6 poly(Year, 19)7 poly(Year, 19)8 poly(Year, 19)9 
##          1.4265         -1.9259         -0.2524         -2.1355         -0.8046
coef(polylm)[1:10]
##       intercept poly(Year, 19)1 poly(Year, 19)2 poly(Year, 19)3 poly(Year, 19)4 
##        -17.2470          4.9002         -1.7969          0.8175          5.9669 
## poly(Year, 19)5 poly(Year, 19)6 poly(Year, 19)7 poly(Year, 19)8 poly(Year, 19)9 
##          1.4265         -1.9259         -0.2524         -2.1355         -0.8046

With homogeneous variance \(\sigma^2\) of the observations, \[\hat{\beta}_i \overset{\text{approx}}{\sim} \mathcal{N}(\beta_i, \sigma^2),\] and for \(\beta_i = 0\) we therefore have \(P(|\hat{\beta}_i| \geq 1.96\sigma) \simeq 0.05.\)

Thresholding: coefficients whose absolute value falls below a threshold such as \(1.96\hat{\sigma}\) can be set to zero before reconstructing the fitted values.
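A minimal sketch of such a thresholding for the polynomial fit above; the use of summary(polylm)$sigma as the estimate \(\hat{\sigma}\) and the object names are our own choices.

beta_hat <- coef(polylm)
sigma_hat <- summary(polylm)$sigma                           # Residual standard error
beta_thres <- beta_hat * (abs(beta_hat) > 1.96 * sigma_hat)  # Keep only large coefficients
f_hat_thres <- Phi %*% beta_thres                            # Reconstructed fitted values
p_Nuuk + geom_line(aes(y = f_hat_thres), color = "blue")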


Figure 3.11: Polynomial fit using all 19 basis functions (blue) and using a degree 5 polynomial (red).

3.2.2 Fourier expansions

Introducing \[x_{k,m} = \frac{1}{\sqrt{n}} e^{2 \pi i k m / n},\] then

\[\sum_{k=0}^{n-1} |x_{k,m}|^2 = 1\]

and for \(m_1 \neq m_2\) \[\sum_{k=0}^{n-1} x_{k,m_1}\overline{x_{k,m_2}} = 0\]

Thus \(\Phi = (x_{k,m})_{k,m}\) is an \(n \times n\) unitary matrix; \[\Phi^*\Phi = I\] where \(\Phi^*\) is the conjugate transpose of \(\Phi\).

\(\hat{\beta} = \Phi^* y\) is the discrete Fourier transform of \(y\). It contains the basis coefficients in the orthonormal basis given by \(\Phi\); \[y_k = \frac{1}{\sqrt{n}} \sum_{m=0}^{n-1} \hat{\beta}_m e^{2 \pi i k m / n}\]

or \(y = \Phi \hat{\beta}.\)

Phi <- outer(
  0:(n - 1), 
  0:(n - 1), 
  function(k, m) exp(2 * pi * 1i * (k * m) / n) / sqrt(n)
)

The matrix \(\Phi\) generates an interesting pattern.

Columns in the matrix \(\Phi\):

We can compute the estimate by matrix multiplication.

betahat <- Conj(t(Phi)) %*% Nuuk_year$Temperature # t(Phi) = Phi for Fourier bases
betahat[c(1, 2:4, 73, n:(n - 2))]
## [1] -17.2470+0.000i  -2.4643+2.387i   3.5481+0.910i   1.6721+0.741i
## [5]   0.0321+0.709i  -2.4643-2.387i   3.5481-0.910i   1.6721-0.741i

For real \(y\) it holds that \(\hat{\beta}_0\) is real, and the symmetry \[\hat{\beta}_{n-m} = \hat{\beta}_m^*\] holds for \(m = 1, \ldots, n - 1\). (For \(n\) even, \(\hat{\beta}_{n/2}\) is real too).
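A quick numerical check of this symmetry for the Nuuk data (where \(n = 147\)):

max(Mod(betahat[2:74] - Conj(betahat[147:75])))  # Numerically zero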

Modulus distribution:

Note that if \(y \sim \mathcal{N}(\Phi\beta, \sigma^2 I_n)\) and \(\beta_m = 0\) for some \(m \neq 0, n/2\), then
\[(\mathrm{Re}(\hat{\beta}_m), \mathrm{Im}(\hat{\beta}_m))^T \sim \mathcal{N}\left(0, \frac{\sigma^2}{2} I_2\right),\]

hence \[|\hat{\beta}_m|^2 = \mathrm{Re}(\hat{\beta}_m)^2 + \mathrm{Im}(\hat{\beta}_m)^2 \sim \frac{\sigma^2}{2} \chi^2_2,\] that is, \(P(|\hat{\beta}_m| \geq 1.73 \sigma) = 0.05.\) There is a clear case of multiple testing if we use this threshold at face value, and we would expect around \(0.05 \times n/2\) false positives if there is no signal at all. Lowering the probability using the Bonferroni correction yields a threshold of around \(2.7 \sigma\) instead.
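These two constants can be checked using that \(P(|\hat{\beta}_m| \geq c \sigma) = P(\chi^2_2 \geq 2c^2) = e^{-c^2}\); the Bonferroni correction below uses \(n = 147\) as for the Nuuk data.

round(sqrt(-log(0.05)), 2)               # Threshold constant for a single coefficient
## [1] 1.73
round(sqrt(-log(0.05 / (147 / 2))), 2)   # Bonferroni corrected threshold constant
## [1] 2.7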

Thresholding Fourier:

The coefficients are not independent (remember the symmetry), and one can alternatively consider \[\hat{\gamma}_m = \sqrt{2} \mathrm{Re}(\hat{\beta}_m) \quad \text{and} \quad \hat{\gamma}_{n' + m} = - \sqrt{2} \mathrm{Im}(\hat{\beta}_m)\] for \(1 \leq m < n / 2\), where \(n' = \lfloor n / 2 \rfloor\). In addition, \(\hat{\gamma}_0 = \hat{\beta}_0\), and \(\hat{\gamma}_{n/2} = \hat{\beta}_{n/2}\) for \(n\) even.

These coefficients are the coefficients in a real cosine, \(\sqrt{2} \cos(2\pi k m / n)\), and sine, \(\sqrt{2} \sin(2\pi k m / n)\), basis expansion, and they are i.i.d. \(\mathcal{N}(0, \sigma^2)\) distributed.
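As a sketch for the Nuuk data, where \(n = 147\) is odd so that \(n' = 73\), the real coefficients can be computed from betahat as below; the name gammahat is ours, and Parseval's identity provides a simple check.

gammahat <- c(
  Re(betahat[1]),                  # gamma_0 = beta_0
  sqrt(2) * Re(betahat[2:74]),     # gamma_m for m = 1, ..., 73
  -sqrt(2) * Im(betahat[2:74])     # gamma_{73 + m} for m = 1, ..., 73
)
all.equal(sum(gammahat^2), sum(Mod(betahat)^2))  # Same squared norm (Parseval)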

Thresholding Fourier:


Figure 3.12: Fourier based smoother by thresholding (blue) and polynomial fit of degree 5 (red).

What is the point of using the discrete Fourier transform? The point is that it can be computed via the fast Fourier transform (FFT), which has an \(O(n\log(n))\) time complexity. The FFT works optimally for \(n = 2^p\).

fft(Nuuk_year$Temperature)[1:4] / sqrt(n)
## [1] -17.247+0.000i  -2.464+2.387i   3.548+0.910i   1.672+0.741i
betahat[1:4]
## [1] -17.247+0.000i  -2.464+2.387i   3.548+0.910i   1.672+0.741i

3.3 Splines

In the previous section orthogonality of basis functions played an important role for computing basis function expansions efficiently as well as for the statistical assessment of estimated coefficients. This section will deal with bivariate smoothing via basis functions that are not necessarily orthogonal.

Though some of the material of this section will apply to any choice of basis, we restrict attention to splines and consider almost exclusively the widely used B-splines (the “B” is for basis).

3.3.1 Smoothing splines

To motivate splines we briefly consider the following penalized least squares criterion for finding a smooth approximation to bivariate data: minimize \[\begin{equation} L(f) = \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f''\|_2^2 \tag{3.1} \end{equation}\] over all twice differentiable functions \(f\). The first term is the standard squared error, and we can easily find a smooth function interpolating the \(y\)-values (if all the \(x\)-values are different), which will thus drive the squared error to 0. The squared 2-norm regularizes the minimization problem so that the minimizer finds a balance between interpolation and having a small second derivative (note that \(\|f''\|_2 = 0\) if and only if \(f\) is an affine function). The tuning parameter \(\lambda\) controls this balance.

It is possible to show that the minimizer of (3.1) is a natural cubic spline with knots in the data points \(x_i\). That is, the spline is a \(C^2\)-function that equals a third degree polynomial in between the knots. At the knots, the two polynomials that meet fit together up to the second derivative, but they may differ on the third derivative. That the solution is natural means that it has zero second and third derivative at and beyond the two boundary knots.

It is not particularly difficult to show that the space of natural cubic splines is a vector space of dimension \(n\) if all the \(x\)-values are different. It is therefore possible to find a basis of splines, \(\varphi_1, \ldots, \varphi_n\), such that the \(f\) that minimizes (3.1) is of the form \[f = \sum_{i=1}^n \beta_i \varphi_i.\] What is remarkable about this is that the basis (and the finite dimensional vector space it spans) doesn’t depend upon the \(y\)-values. Though the optimization is over an infinite dimensional space, the penalization ensures that the minimizer is always in the same finite dimensional space no matter what \(y_1, \ldots, y_n\) are. Moreover, since (3.1) is a quite natural criterion to minimize to find a smooth function fitting the bivariate data, splines appear as good candidates for producing such smooth fits. On top of that, splines have several computational advantages and are widely used.

If we let \(\hat{f}_i = \hat{f}(x_i)\) with \(\hat{f}\) the minimizer of (3.1), we have in vector notation that \[\hat{\mathbf{f}} = \boldsymbol{\Phi}\hat{\beta}\] with \(\boldsymbol{\Phi}_{ij} = \varphi_j(x_i)\). The minimizer can be found by observing that

\[\begin{align} L(\mathbf{f}) & = (\mathbf{y} - \mathbf{f})^T (\mathbf{y} - \mathbf{f}) + \lambda \| f'' \|_2^2 \\ & = ( \mathbf{y} - \boldsymbol{\Phi}\beta)^T (\mathbf{y} - \boldsymbol{\Phi}\beta) + \lambda \beta^T \mathbf{\Omega} \beta \end{align}\]

where \[\mathbf{\Omega}_{ij} = \langle \varphi_i'', \varphi_j'' \rangle = \int \varphi_i''(z) \varphi_j''(z) \mathrm{d}z.\] The matrix \(\mathbf{\Omega}\) is positive semidefinite by construction, and we refer to it as the penalty matrix. It induces a seminorm on \(\mathbb{R}^n\) so that we can express the seminorm, \(\|f''\|_2\), of \(f\) in terms of the parameters in the basis expansion using \(\varphi_i\).

This is a standard penalized least squares problem, whose solution is \[\hat{\beta} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{\Omega})^{-1}\boldsymbol{\Phi}^T \mathbf{y}\] and with resulting smoother \[\hat{\mathbf{f}} = \underbrace{\boldsymbol{\Phi} (\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{\Omega})^{-1}\boldsymbol{\Phi}^T}_{\mathbf{S}_{\lambda}} \mathbf{y}.\] This linear smoother with smoothing matrix \(\mathbf{S}_{\lambda}\) based on natural cubic splines gives what is known as a smoothing spline that minimizes (3.1). We will pursue spline based smoothing by minimizing (3.1) but using various B-spline bases that may have more or less than \(n\) elements. For the linear algebra, it actually doesn’t matter if we use a spline basis or any other basis – as long as \(\boldsymbol{\Phi}_{ij} = \varphi_j(x_i)\) and \(\mathbf{\Omega}\) is given in terms of \(\varphi''_i\) as above.

3.3.2 Splines in R

The splines package in R implements some of the basic functions needed to work with splines. In particular, the splineDesign() function computes evaluations of B-splines and their derivatives.

library(splines)
# Note the specification of repeated boundary knots
knots <- c(0, 0, 0, seq(0, 1, 0.1), 1, 1, 1)
xx <- seq(0, 1, 0.005)
B_splines <- splineDesign(knots, xx)
matplot(xx, B_splines, type = "l", lty = 1)

Figure 3.13: B-spline basis as computed by splineDesign().

The basis shown in Figure 3.13 is an example of a cubic B-spline basis with the 11 inner knots \(0, 0.1, \ldots, 0.9, 1\). The repeated boundary knots control how the spline basis behaves close to the boundaries of the interval. This basis has 13 basis functions, not 11, and spans a larger space than the space of natural cubic splines. It is possible to compute a basis based on B-splines for the natural cubic splines using the function ns, but for all practical purposes this is not important, and we will work exclusively with the B-spline basis itself.

The computation of the penalty matrix \(\mathbf{\Omega}\) constitutes a practical problem, but observing that \(\varphi''_i\) is an affine function in between knots leads to a simple way of computing \(\mathbf{\Omega}_{ij}\). Letting \(g_{ij} = \varphi''_i \varphi''_j\) it holds that \(g_{ij}\) is quadratic between two consecutive knots \(a\) and \(b\), in which case \[\int_a^b g_{ij}(z) \mathrm{d}z = \frac{b - a}{6}\left(g_{ij}(a) + 4 g_{ij}\left(\frac{a + b}{2}\right) + g_{ij}(b)\right).\] This identity is behind Simpson’s rule for numerical integration, and the fact that this is an identity for quadratic polynomials, and not an approximation, means that Simpson’s rule applied appropriately leads to exact computation of \(\mathbf{\Omega}_{ij}\). All we need is the ability to evaluate \(\varphi''_i\) at certain points, and splineDesign() can be used for that.

pen_mat <- function(inner_knots) {
  knots <- sort(c(rep(range(inner_knots), 3), inner_knots))
  d <- diff(inner_knots)  # The vector of knot differences; b - a 
  g_ab <- splineDesign(knots, inner_knots, derivs = 2) 
  knots_mid <- inner_knots[-length(inner_knots)] + d / 2
  g_ab_mid <- splineDesign(knots, knots_mid, derivs = 2)
  g_a <- g_ab[-nrow(g_ab), ]
  g_b <- g_ab[-1, ]
  (crossprod(d * g_a,  g_a) + 
      4 * crossprod(d * g_ab_mid, g_ab_mid) + 
      crossprod(d * g_b, g_b)) / 6 
}

It is laborious to write good tests of pen_mat(). We would have to work out a set of example matrices by other means, e.g. by hand. Alternatively, we can compare to a simpler numerical integration technique using Riemann sums.

spline_deriv <- splineDesign(
  c(0, 0, 0, 0, 0.5, 1, 1, 1, 1), 
  seq(0, 1, 1e-5), 
  derivs = 2
)
Omega_numeric <- crossprod(spline_deriv[-1, ]) * 1e-5  # Right Riemann sums
Omega <- pen_mat(c(0, 0.5, 1))
Omega_numeric / Omega
##        [,1] [,2]   [,3] [,4] [,5]
## [1,] 1.0000    1 0.9999    1  NaN
## [2,] 1.0000    1 1.0000    1    1
## [3,] 0.9999    1 1.0000    1    1
## [4,] 1.0000    1 1.0000    1    1
## [5,]    NaN    1 1.0001    1    1
range((Omega_numeric - Omega) / (Omega + 0.001))  #  Relative error
## [1] -6e-05  6e-05

And we should also test an example with non-equidistant knots.

spline_deriv <- splineDesign(
  c(0, 0, 0, 0, 0.2, 0.3, 0.5, 0.6, 0.65, 0.7, 1, 1, 1, 1), 
  seq(0, 1, 1e-5), 
  derivs = 2
)
Omega_numeric <- crossprod(spline_deriv[-1, ]) * 1e-5  # Right Riemann sums
Omega <- pen_mat(c(0, 0.2, 0.3, 0.5, 0.6, 0.65, 0.7, 1))
range((Omega_numeric - Omega) / (Omega + 0.001)) # Relative error
## [1] -0.0001607  0.0002545

These examples indicate that pen_mat() computes \(\mathbf{\Omega}\) correctly, in particular since increasing the precision of the Riemann sums by lowering the step size \(10^{-5}\) decreases the relative error (not shown). Of course, correctness ultimately depends on splineDesign() computing the correct second derivatives, which hasn’t been tested here.

We can also test how our implementation of smoothing splines works on data. We do this here by implementing the matrix algebra for computing \(\mathbf{S}_{\lambda} \mathbf{y}\) directly.

inner_knots <- Nuuk_year$Year
Phi <- splineDesign(c(rep(range(inner_knots), 3), inner_knots), inner_knots)
Omega <- pen_mat(inner_knots)
smoother <- function(lambda) 
  Phi %*% solve(
    crossprod(Phi) + lambda * Omega, 
    t(Phi) %*% Nuuk_year$Temperature
  )
p_Nuuk + 
  geom_line(aes(y = smoother(10)), color = "blue") +      # Undersmooth
  geom_line(aes(y = smoother(1000)), color = "red") +     # Smooth
  geom_line(aes(y = smoother(100000)), color = "purple")  # Oversmooth

Smoothing splines can be computed using the R function smooth.spline() from the stats package. It is possible to manually specify the amount of smoothing using one of the arguments lambda, spar or df (the latter being the trace of the smoother matrix). However, due to internal differences from the splineDesign() basis above, the lambda argument to smooth.spline() does not match the \(\lambda\) parameter above.

If the amount of smoothing is not manually set, smooth.spline() chooses \(\lambda\) by generalized cross validation (GCV), which minimizes \[\mathrm{GCV} = \sum_{i=1}^n \left(\frac{y_i - \hat{f}_i}{1 - \mathrm{df} / n}\right)^2,\] where \[\mathrm{df} = \mathrm{trace}(\mathbf{S}) = \sum_{i=1}^n S_{ii}.\] GCV corresponds to LOOCV with the diagonal entries, \(S_{ii}\), replaced by their average \(\mathrm{df} / n\). The main reason for using GCV over LOOCV is that for some smoothers, such as the spline smoother, it is possible to compute the trace \(\mathrm{df}\) easily without computing \(\mathbf{S}\) or even its diagonal elements.

To compare our results to smooth.spline() we optimize the GCV criterion. First we implement a function that computes GCV for a fixed value of \(\lambda\). The implementation here relies on computing the smoother matrix, but this is not the most efficient approach. Section 3.3.3 provides a diagonalization of the smoother matrix jointly in the tuning parameter \(\lambda\). This representation allows for efficient computation with splines, and it will become clear why it is not necessary to compute \(\mathbf{S}\) or even its diagonal elements; the trace of \(\mathbf{S}\) is nevertheless easily computable.

gcv <- function(lambda, y) {
  S <- Phi %*% solve(crossprod(Phi) + lambda * Omega, t(Phi))
  df <- sum(diag(S))  # The trace of the smoother matrix
  sum(((y - S %*% y) / (1 - df / length(y)))^2, na.rm = TRUE) 
}

Then we apply this function to a grid of \(\lambda\)-values and choose the value of \(\lambda\) that minimizes GCV.

lambda <- seq(50, 250, 2)
GCV <- sapply(lambda, gcv, y = Nuuk_year$Temperature)
lambda_opt <- lambda[which.min(GCV)]
ggplot(mapping = aes(lambda, GCV)) + geom_point() +
  geom_vline(xintercept = lambda_opt, color = "red")

Figure 3.14: The generalized cross-validation criterion for smoothing splines as a function of the tuning parameter \(\lambda\).

Finally, we can visualize the resulting smoothing spline.

smooth_opt <- Phi %*% solve(
  crossprod(Phi) + lambda_opt * Omega, 
  t(Phi) %*% Nuuk_year$Temperature
)
p_Nuuk + geom_line(aes(y = smooth_opt), color = "blue")

Figure 3.15: The smoothing spline that minimizes GCV over the tuning parameter \(\lambda\).

The smoothing spline that we found by minimizing GCV can be compared to the smoothing spline that smooth.spline() computes by minimizing GCV as well.

smooth_splines <- smooth.spline(
  Nuuk_year$Year, 
  Nuuk_year$Temperature, 
  all.knots = TRUE        # Don't use heuristic
)  
range(smooth_splines$y - smooth_opt)
## [1] -0.0007757  0.0010726
p_Nuuk + geom_line(aes(y = smooth_splines$y), color = "blue")

Figure 3.16: The smoothing spline that minimizes GCV as computed by smooth.spline().

The difference between the smoothing spline computed by our implementation and the one computed by smooth.spline() is hardly detectable visually, and it is at most of the order \(10^{-3}\) as computed above. It is possible to decrease the difference further by finding the optimal value of \(\lambda\) with higher precision, but we will not pursue this here.

3.3.3 Efficient computation with splines

Using the full B-spline basis with knots in every observation is computationally heavy and from a practical viewpoint unnecessary. Smoothing using B-splines is therefore often done using a knot-selection heuristic that selects much fewer knots than \(n\), in particular if \(n\) is large. This is also what smooth.spline() does unless all.knots = TRUE. The heuristic for selecting the number of knots is a bit complicated, but it is implemented in the function .nknots.smspl(), which can be inspected for details. Once the number of knots gets above 200 it grows extremely slowly with \(n\). With the number of knots selected, a common heuristic for selecting their position is to use the quantiles of the distribution of the \(x\)-values. That is, with 9 knots, say, the knots are positioned in the deciles (0.1-quantile, 0.2-quantile etc.). This is effectively also what smooth.spline() does, and this heuristic places most of the knots where we have most of the data points.
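A sketch of such a heuristic, using .nknots.smspl() for the number of inner knots and quantiles of the \(x\)-values for their positions; the object names are ours, and the details do not exactly replicate what smooth.spline() does internally.

p_heuristic <- .nknots.smspl(nrow(Nuuk_year))    # Heuristic number of inner knots
inner_knots_heuristic <- quantile(
  Nuuk_year$Year, 
  probs = seq(0, 1, length.out = p_heuristic)    # Knots in the quantiles of the x-values
)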

Having implemented a knot-selection heuristic that results in \(p\) B-spline basis functions, the matrix \(\Phi\) will be \(n \times p\), typically with \(p < n\) and with \(\Phi\) of full rank \(p\). In this case we derive a way of computing the smoothing spline that is computationally more efficient and numerically more stable than relying on the matrix-algebraic solution above. This is particularly so when we need to compute the smoother for many different \(\lambda\)-s to optimize the smoother. As we will show, we are effectively computing a simultaneous diagonalization of the (symmetric) smoother matrix \(\mathbf{S}_{\lambda}\) for all values of \(\lambda\).

The matrix \(\Phi\) has a singular value decomposition \[\Phi = \mathbf{U} D \mathbf{V}^T\] where \(D\) is diagonal with entries \(d_1 \geq d_2 \geq \ldots \geq d_p > 0\), \(\mathbf{U}\) is \(n \times p\) with orthonormal columns, and \(\mathbf{V}\) is a \(p \times p\) orthogonal matrix. This means that \[\mathbf{U}^T \mathbf{U} = \mathbf{V}^T \mathbf{V} = \mathbf{V} \mathbf{V}^T = I\] is the \(p \times p\) dimensional identity matrix. We find that

\[\begin{align} \mathbf{S}_{\lambda} & = \mathbf{U}D\mathbf{V}^T(\mathbf{V}D^2\mathbf{V}^T + \lambda \mathbf{\Omega})^{-1} \mathbf{V}D\mathbf{U}^T \\ & = \mathbf{U}D (D^2 + \lambda \mathbf{V}^T \mathbf{\Omega} \mathbf{V})^{-1} D \mathbf{U}^T \\ & = \mathbf{U} (I + \lambda D^{-1} \mathbf{V}^T \mathbf{\Omega} \mathbf{V} D^{-1})^{-1} \mathbf{U}^T \\ & = \mathbf{U}(I + \lambda \widetilde{\mathbf{\Omega}})^{-1} \mathbf{U}^T, \end{align}\]

where \(\widetilde{\mathbf{\Omega}} = D^{-1} \mathbf{V}^T \mathbf{\Omega} \mathbf{V} D^{-1}\) is a positive semidefinite \(p \times p\) matrix. By diagonalization, \[\widetilde{\mathbf{\Omega}} = \mathbf{W} \Gamma \mathbf{W}^T,\] where \(\mathbf{W}\) is orthogonal and \(\Gamma\) is a diagonal matrix with nonnegative values in the diagonal, and we find that

\[\begin{align} \mathbf{S}_{\lambda} & = \mathbf{U} \mathbf{W} (I + \lambda \Gamma)^{-1} \mathbf{W}^T \mathbf{U}^T \\ & = \widetilde{\mathbf{U}} (I + \lambda \Gamma)^{-1} \widetilde{\mathbf{U}}^T \end{align}\]

where \(\widetilde{\mathbf{U}} = \mathbf{U} \mathbf{W}\) is an orthogonal \(n \times p\) matrix.

The interpretation of this representation is as follows.

  • First, the coefficients, \(\hat{\beta} = \widetilde{\mathbf{U}}^Ty\), are computed for expanding \(y\) in the basis given by the columns of \(\widetilde{\mathbf{U}}\).
  • Second, the \(i\)-th coefficient is shrunk towards 0, \[\hat{\beta}_i(\lambda) = \frac{\hat{\beta}_i}{1 + \lambda \gamma_i}.\]
  • Third, the smoothed values, \(\widetilde{\mathbf{U}} \hat{\beta}(\lambda)\), are computed as an expansion using the shrunken coefficients.

Thus the smoother works by shrinking the coefficients toward zero in the orthonormal basis given by the columns of \(\widetilde{\mathbf{U}}\). The coefficients corresponding to the largest eigenvalues \(\gamma_i\) are shrunk relatively more toward zero than those corresponding to the small eigenvalues. The basis formed by the columns of \(\widetilde{\mathbf{U}}\) is known as the Demmler-Reinsch basis with reference to Demmler and Reinsch (1975).

We implement the computation of the diagonalization for the Nuuk temperature data using \(p = 20\) basis functions (18 inner knots) equidistantly distributed over the range of the years for which we have data.

inner_knots <- seq(1867, 2013, length.out = 18)
Phi <- splineDesign(c(rep(range(inner_knots), 3), inner_knots), Nuuk_year$Year)
Omega <- pen_mat(inner_knots)
Phi_svd <- svd(Phi)
Omega_tilde <- t(crossprod(Phi_svd$v, Omega %*% Phi_svd$v)) / Phi_svd$d
Omega_tilde <- t(Omega_tilde) / Phi_svd$d
# It is safer to use the numerical singular value decomposition ('svd()')
# for diagonalizing a positive semidefinite matrix than to use a 
# more general numerical diagonalization implementation such as 'eigen()'. 
Omega_tilde_svd <- svd(Omega_tilde)  
U_tilde <- Phi_svd$u %*% Omega_tilde_svd$u

Figure 3.17: The eigenvalues \(\gamma_i\) that determine how much the different basis coefficients in the orthonormal spline expansion are shrunk toward zero. Left plot shows the eigenvalues, and the right plot shows the eigenvalues on a log-scale.


Figure 3.18: The columns of \(\widetilde{\mathbf{U}}\) that constitute an orthonormal basis known as the Demmler-Reinsch basis for computing the spline based smoother.

We observe from Figures 3.17 and 3.18 that there are two relatively large eigenvalues corresponding to the two basis functions with erratic behavior close to the boundaries, and there are two eigenvalues that are effectively zero corresponding to the two affine basis functions. In addition, the more oscillating the basis function is, the larger is the corresponding eigenvalue, and the more is the corresponding coefficient shrunk toward zero by the spline smoother.
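The three steps above can be carried out directly from this decomposition. A minimal sketch, where the function name smoother_dr and the choice \(\lambda = 100\) are our own:

gamma <- Omega_tilde_svd$d                   # The eigenvalues gamma_i
smoother_dr <- function(lambda, y) {
  beta <- drop(crossprod(U_tilde, y))        # Coefficients in the Demmler-Reinsch basis
  U_tilde %*% (beta / (1 + lambda * gamma))  # Shrink the coefficients and expand
}
f_hat_dr <- smoother_dr(100, Nuuk_year$Temperature)  # Smoothed values for an arbitrary lambda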

Observe also that \[\mathrm{df}(\lambda) = \mathrm{trace}(\mathbf{S}_\lambda) = \sum_{i=1}^p \frac{1}{1 + \lambda \gamma_i},\] which makes it possible to implement GCV without even computing the diagonal entries of \(\mathbf{S}_{\lambda}.\)
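A sketch of a GCV implementation exploiting this, reusing smoother_dr() and gamma from the sketch above and never forming \(\mathbf{S}_\lambda\); the name gcv_dr is ours.

gcv_dr <- function(lambda, y) {
  f_hat <- smoother_dr(lambda, y)
  df <- sum(1 / (1 + lambda * gamma))   # df(lambda) = trace of the smoother matrix
  sum(((y - f_hat) / (1 - df / length(y)))^2)
}
GCV_dr <- sapply(seq(50, 250, 2), gcv_dr, y = Nuuk_year$Temperature)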

3.4 The Kalman smoother

The Kalman smoother is a classical algorithm used for smoothing time-indexed observations. It can be seen as an example of a Gaussian prediction algorithm, where we imagine that \((X_1, Y_1), \ldots, (X_n,Y_n)\) follow a joint Gaussian distribution, we observe only \(Y_1, \ldots, Y_n\), and we want to predict \(X_1, \ldots, X_n\). If the \(X\)-s form a relatively smooth but unobserved Gaussian process and the \(Y\)-s are noisy measurements of the \(X\)-s, the predictions of the \(X\)-s from the \(Y\)-s act as a smoother. In contrast to the bivariate smoothers developed in the sections above, the approach taken in this section is fully probabilistic. That is, the algorithms and the parameters are justified and entirely determined by a probabilistic model of the data. That said, we will see some clear similarities with other smoothers.

3.4.1 Gaussian processes

We suppose that we have a time grid \(t_1 < \ldots < t_n\) and observe \(Y_1, \ldots, Y_n\) where \(Y_i\) is observed at time \(t_i\). The distribution of the \(Y\)-s is given in terms of \(X = (X_1, \ldots, X_n)\), where we suppose \(X \sim \mathcal{N}(\xi_x, \Sigma_{x})\) with \[(\Sigma_{x})_{i,j} = \mathrm{cov}(X_i, X_j) = K(t_i - t_j)\] for a kernel function \(K : \mathbb{R} \to \mathbb{R}\).

With the observation equation \(Y_i = X_i + \delta_i\) for \(\delta = (\delta_1, \ldots, \delta_n) \sim \mathcal{N}(0, \Omega)\) and \(\delta \perp \! \! \perp X\) we get

\[(X, Y) \sim \mathcal{N}\left(\left(\begin{array}{c} \xi_x \\ \xi_x \end{array}\right), \left(\begin{array}{cc} \Sigma_x & \Sigma_x \\ \Sigma_x & \Sigma_x + \Omega \end{array} \right) \right).\]

Hence \[E(X \mid Y) = \xi_x + \Sigma_x (\Sigma_x + \Omega)^{-1} (Y - \xi_x).\]

Assuming that \(\xi_x = 0\) the conditional expectation is a linear smoother with smoother matrix \[S = \Sigma_x (\Sigma_x + \Omega)^{-1}.\]

This is also true if \(\Sigma_x (\Sigma_x + \Omega)^{-1} \xi_x = \xi_x\). If this identity holds approximately, we can argue that for computing \(E(X \mid Y)\) we don’t need to know \(\xi_x\).

If the observation variance is \(\Omega = \sigma^2 I\) then the smoother matrix simplifies to \[S = \Sigma_x (\Sigma_x + \sigma^2 I)^{-1} = (I + \sigma^2 \Sigma_x^{-1})^{-1}.\] Thus knowing the observation variance \(\sigma^2\) and the covariance matrix \(\Sigma_x\), which is determined entirely from the time-grid and the covariance kernel \(K\), we can compute the smoother matrix \(S\). Observing \(Y = y\), we can then compute the smoother as the solution to the equation \[(I + \sigma^2 \Sigma_x^{-1})x = y.\] Standard algorithms for matrix inversion, needed to compute \(S\), have time complexity \(O(n^3)\), and standard algorithms for solving the linear equation have time complexity \(O(n^2)\).

The main point of the Kalman smoother is that it has time complexity \(O(n)\), so its computation is very fast and scales well with \(n\). The tradeoff is that the Kalman smoother puts some restrictions on \(K\) and thus on \(\Sigma_x\). We consider the AR(1)-example in the section below, where we show that \(I + \sigma^2 \Sigma_x^{-1}\) is a tridiagonal matrix, so that the smoother can be computed as the solution of a tridiagonal system of equations in time that is linear in \(n\).

3.4.2 AR(1) Kalman smoothing

The simplest example with an efficient Kalman smoother is the AR(1)-model, where we assume an equidistant time grid (that is, \(t_i = i\)). Suppose that \(|\alpha| < 1\), \(X_1 = \epsilon_1 / \sqrt{1 - \alpha^2}\) and \[X_i = \alpha X_{i-1} + \epsilon_i\] for \(i = 2, \ldots, n\) with \(\epsilon = (\epsilon_1, \ldots, \epsilon_n) \sim \mathcal{N}(0, I)\); the parameter \(\sigma^2\) is still the variance of the observation noise \(\delta\).

We have \(\mathrm{cov}(X_i, X_j) = \alpha^{|i-j|} / (1 - \alpha^2)\), thus we can find \(\Sigma_x\) and compute \[E(X_i \mid Y) = ((I + \sigma^2 \Sigma_x^{-1})^{-1} Y)_i\]


Figure 3.19: Gaussian smoother matrix \(S_{25,j}\) with \(\alpha = 0.3, 0.9\), \(\sigma^2 = 2, 20\)

From \(\epsilon_1 = \sqrt{1 - \alpha^2}\, X_1\) and the identity \(\epsilon_i = X_i - \alpha X_{i-1}\) for \(i \geq 2\) it follows that \(\epsilon = A X\) where

\[A = \left( \begin{array}{cccccc} \sqrt{1 - \alpha^2} & 0 & 0 & \ldots & 0 & 0 \\ -\alpha & 1 & 0 & \ldots & 0 & 0 \\ 0 & -\alpha & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & 1 & 0 \\ 0 & 0 & 0 & \ldots & -\alpha & 1 \\ \end{array}\right),\]

This gives \(I = V(\epsilon) = A \Sigma_x A^T\), hence \[\Sigma_x^{-1} = (A^{-1}(A^T)^{-1})^{-1} = A^T A.\]

We have shown that \[\Sigma_x^{-1} = \left( \begin{array}{cccccc} 1 & -\alpha & 0 & \ldots & 0 & 0 \\ -\alpha & 1 + \alpha^2 & -\alpha & \ldots & 0 & 0 \\ 0 & -\alpha & 1 + \alpha^2 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & 1 + \alpha^2 & -\alpha \\ 0 & 0 & 0 & \ldots & -\alpha & 1 \\ \end{array}\right).\]

Hence \[I + \sigma^2 \Sigma_x^{-1} = \left( \begin{array}{cccccc} \gamma_0 & \rho & 0 & \ldots & 0 & 0 \\ \rho & \gamma & \rho & \ldots & 0 & 0 \\ 0 & \rho & \gamma & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & \gamma & \rho \\ 0 & 0 & 0 & \ldots & \rho & \gamma_0 \\ \end{array}\right)\]

with \(\gamma_0 = 1 + \sigma^2\), \(\gamma = 1 + \sigma^2 (1 + \alpha^2)\) and \(\rho = -\sigma^2 \alpha\), which is a tridiagonal matrix.

The equation

\[\left( \begin{array}{cccccc} \gamma_0 & \rho & 0 & \ldots & 0 & 0 \\ \rho & \gamma & \rho & \ldots & 0 & 0 \\ 0 & \rho & \gamma & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & \gamma & \rho \\ 0 & 0 & 0 & \ldots & \rho & \gamma_0 \\ \end{array}\right) \left( \begin{array}{c} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_{n-1} \\ x_n \end{array}\right) = \left(\begin{array}{c} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{n-1} \\ y_n \end{array}\right)\]

can be solved by a forward and backward sweep.

Forward sweep:

  • Set \(\rho_1' = \rho / \gamma_0\) and \(y_1' = y_1 / \gamma_0\),
  • then recursively \[\rho_i' = \frac{\rho}{\gamma - \rho \rho_{i-1}'} \quad \text{and} \quad y_i' = \frac{y_i - \rho y_{i-1}'}{\gamma - \rho \rho_{i-1}'}\] for \(i = 2, \ldots, n-1\)
  • and finally \[y_n' = \frac{y_n - \rho y_{n-1}'}{\gamma_0 - \rho \rho_{n-1}'}.\]

By the forward sweep the equation is transformed to

\[\left( \begin{array}{cccccc} 1 & \rho_1' & 0 & \ldots & 0 & 0 \\ 0 & 1 & \rho_2' & \ldots & 0 & 0 \\ 0 & 0 & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & 1 & \rho_{n-1}' \\ 0 & 0 & 0 & \ldots & 0 & 1 \\ \end{array}\right) \left( \begin{array}{c} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_{n-1} \\ x_n \end{array}\right) = \left(\begin{array}{c} y_1' \\ y_2' \\ y_3' \\ \vdots \\ y_{n-1}' \\ y_n' \end{array}\right),\]

which is then solved by backsubstitution from below; \(x_n = y_n'\) and \[x_{i} = y_i' - \rho_{i}' x_{i+1}, \quad i = n-1, \ldots, 1.\]

3.4.3 Implementation

#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector KalmanSmooth(NumericVector y, double alpha, double sigmasq) {
  double tmp, gamma0 = 1 + sigmasq, rho = - sigmasq * alpha;
  double gamma = 1 + sigmasq * (1 + alpha * alpha);
  int n = y.size();
  NumericVector x(n), rhop(n - 1);
  rhop[0] = rho / gamma0;
  x[0] = y[0] / gamma0;
  for(int i = 1; i < n - 1; ++i) { /* Forward sweep */
    tmp = (gamma - rho * rhop[i - 1]);
    rhop[i] = rho / tmp;
    x[i] = (y[i] - rho * x[i - 1]) / tmp;
  }
  x[n - 1] = (y[n - 1] - rho * x[n - 2]) / (gamma0 - rho * rhop[n - 2]);
  for(int i = n - 2; i >= 0; --i) { /* Backsubstitution */
    x[i] = x[i] - rhop[i] * x[i + 1];
  }
  return x;
}

Result, \(\alpha = 0.95\), \(\sigma^2 = 10\)

We can compare the result of KalmanSmooth() with the smoother computed by the dense matrix algebra.

alpha <- 0.95                    # Parameters as in the result above
sigmasq <- 10
n <- nrow(Nuuk_year)
ySmooth <- KalmanSmooth(Nuuk_year$Temperature, alpha, sigmasq)
Sigma <- outer(1:n, 1:n, 
               function(i, j) alpha^(abs(i - j))) / (1 - alpha^2)  
Smooth <- Sigma %*% solve(Sigma + sigmasq * diag(n))
ggplot(mapping = aes(1:n, Smooth %*% Nuuk_year$Temperature - ySmooth)) + 
  geom_point() + 
  ylab("Difference")

Note that the forward sweep computes \(x_n = E(X_n \mid Y)\), and from this, the backsubstitution solves the smoothing problem of computing \(E(X \mid Y)\).

The Gaussian process used here (the AR(1)-process) is not very smooth, and neither is the resulting smoothing of the data. This is related to the kernel function \(K(s) = \alpha^{|s|}\) being non-differentiable at 0.

Many smoothers are equivalent to a Gaussian process smoother with an appropriate choice of kernel, but not all kernels lead to a simple inverse covariance matrix and a Kalman-type algorithm.

3.4.4 The Kalman filter

#include <Rcpp.h>
using namespace Rcpp;
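// In contrast to KalmanSmooth(), this function computes the filter estimates
// E(X_i | Y_1, ..., Y_i): the forward sweep is applied at each step i as if
// observation i were the last one, and no backsubstitution is performed.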
// [[Rcpp::export]]
NumericVector KalmanFilt(NumericVector y, double alpha, double sigmasq) {
  double tmp, gamma0 = 1 + sigmasq, rho = - sigmasq * alpha, yp;
  double gamma = 1 + sigmasq * (1 + alpha * alpha);
  int n = y.size();
  NumericVector x(n), rhop(n);
  rhop[0] = rho / gamma0;
  yp = y[0] / gamma0;
  x[0] = y[0] / (1 + sigmasq * (1 - alpha * alpha));
  for(int i = 1; i < n; ++i) { 
    tmp = (gamma - rho * rhop[i - 1]);
    rhop[i] = rho / tmp;
    /* Note differences when compared to smoother */
    x[i] = (y[i] - rho * yp) / (gamma0 - rho * rhop[i - 1]); 
    yp = (y[i] - rho * yp) / tmp;         
  }
  return x;
}

Result, \(\alpha = 0.95\), \(\sigma^2 = 10\)

3.5 Exercises

Nearest neighbors

Kernel estimators

Exercise 3.1 Consider a bivariate data set \((x_1, y_1), \ldots, (x_n, y_n)\) and let \(K\) be a probability density with mean 0. Then \[\hat{f}(x, y) = \frac{1}{n h^2} \sum_{i=1}^n K\left(\frac{x - x_i}{h}\right) K\left(\frac{y - y_i}{h}\right)\] is a bivariate kernel density estimator of the joint density of \(x\) and \(y\). Show that the kernel density estimator \[\hat{f}_1(x) = \frac{1}{n h} \sum_{i=1}^n K\left(\frac{x - x_i}{h}\right)\] is also the marginal distribution of \(x\) under \(\hat{f}\), and that the Nadaraya-Watson kernel smoother is the conditional expectation of \(y\) given \(x\) under \(\hat{f}\).

Exercise 3.2 Suppose that \(K\) is a symmetric kernel and the \(x\)-s are equidistant. Implement a function that computes the smoother matrix using the toeplitz function and \(O(n)\) kernel evaluations where \(n\) is the number of data points. Implement also a function that computes the diagonal elements of the smoother matrix directly with run time \(O(n)\). Hint: find inspiration in the implementation of the running mean.