# 3 Bivariate smoothing

This chapter is on the estimation of a smooth relation between two variables \((X, Y)\). We will focus on direct estimation of (aspects of) the conditional distribution of \(Y\) given \(X\). Thus the variables are treated asymmetrically. When both variables are real valued, we get a good idea about their relation from a scatter plot of \(Y\) against \(X\), and the aim is basically to smooth the relation that the scatter plot shows.

For most of the methods considered, it does not matter if \(X\) represents a deterministic or a random variable. Thus \(X\) could be fixed by an experimental design, it could represent time or it could be an observed variable. In Figure 3.1 (left), \(Y\) is the annual average temperature in Nuuk, Greenland, and \(X\) is time in calendar years. In Figure 3.1 (right), \(Y\) is the monthly average temperature in Nuuk, while \(X\) is the monthly average temperature in Qaqortoq – a town further south of Nuuk in Greenland. In the former example, time is deterministic and we are interested in how the temperature in Nuuk changes over time. Is there, for instance, a trend? In the second example, we are interested in how the temperature in Qaqortoq is predictive of the temperature in Nuuk. We will use those two temperature examples throughout, see Vinther et al. (2006). The raw data is available from the site SW Greenland temperature data, but is also available from the CSwR package.

```
p_Nuuk <- ggplot(Nuuk_year, aes(x = Year, y = Temperature)) +
geom_point()
p_Nuuk +
geom_smooth(se = FALSE) + # A loess smoother
geom_smooth(
method = "lm",
formula = y ~ poly(x, 10), # A degree-10 polynomial expansion
color = "red",
se = FALSE
) +
geom_smooth(
method = "gam",
formula = y ~ s(x, bs = "cr"), # A spline smoother via 'mgcv::gam()'
color = "purple",
se = FALSE
)
p_Nuuk_Qaqortoq <- ggplot(Nuuk_Qaqortoq, aes(Temp_Qaqortoq, Temp_Nuuk)) +
geom_point()
p_Nuuk_Qaqortoq +
geom_smooth(
method = "lm",
formula = y ~ x, # A straight line smoother
color = "red",
se = FALSE
) +
geom_smooth(
method = "gam",
formula = y ~ s(x, bs = "cr"), # A spline smoother via 'mgcv::gam()'
color = "purple",
se = FALSE
)
```

## 3.1 Nearest neighbor and kernel smoothers

One of the most basic ideas on smoothing bivariate data is to use a running mean or moving average. This is particularly sensible when the \(x\)-values are equidistant, e.g. when the observations constitute a time series such as the Nuuk temperature data. The running mean is an example of the more general nearest neighbor smoothers.

Mathematically, the \(k\) nearest neighbor smoother in \(x_i\) is defined as \[\hat{f}_i = \frac{1}{k} \sum_{j \in N_i} y_j\] where \(N_i\) is the set of indices for the \(k\) nearest neighbors of \(x_i\). This simple idea is actually very general and powerful. It works as long as the \(x\)-values lie in a metric space, and by letting \(k\) grow with \(n\) it is possible to construct consistent nonparametric estimators of regression functions, \(f(x) = E(Y \mid X = x)\), under minimal assumptions. The practical problem is that \(k\) must grow slowly in high dimensions, and the estimator is not a panacea.

In this chapter we focus mostly on \(x\) being real valued with the
ordinary metric used to define the nearest neighbors. The total ordering of the
real line adds a couple of extra possibilities to the definition of \(N_i\).
When \(k\) is odd, the *symmetric* nearest neighbor smoother takes \(N_i\) to consist
of \(x_i\) together with the \((k-1)/2\) smaller \(x_j\)-s closest to
\(x_i\) and the \((k-1)/2\) larger \(x_j\)-s closest to \(x_i\). It is
also possible to choose a one-sided smoother with \(N_i\) corresponding
to the \(k\) smaller \(x_j\)-s closest to \(x_i\), in which case the smoother would
be known as a causal filter.

The symmetric definition of neighbors makes it very easy to handle the neighbors computationally; we don’t need to compute and keep track of the \(n^2\) pairwise distances between the \(x_i\)-s, we only need to sort data according to the \(x\)-values. Once data is sorted, \[N_i = \{i - (k - 1) / 2, i - (k - 1) / 2 + 1, \ldots, i - 1 , i, i + 1, \ldots, i + (k - 1) / 2\}\] for \((k - 1) / 2 \leq i \leq n - (k - 1) / 2\). The symmetric \(k\) nearest neighbor smoother is thus a running mean of the \(y\)-values when sorted according to the \(x\)-values. There are a couple of possibilities for handling the boundaries, one being simply to not define a value of \(\hat{f}_i\) outside of the interval above.

With \(\hat{\mathbf{f}}\) denoting the vector of smoothed values by a nearest neighbor smoother we can observe that it is always possible to write \(\hat{\mathbf{f}} = \mathbf{S}\mathbf{y}\) for a matrix \(\mathbf{S}\). For the symmetric nearest neighbor smoother and with data sorted according to the \(x\)-values, the matrix has the following band diagonal form

\[ \mathbf{S} = \left( \begin{array}{cccccccccc} \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & 0 & 0 & \ldots & 0 & 0 \\ 0 & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & 0 & \ldots & 0 & 0\\ 0 & 0 & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ldots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & \ldots & \frac{1}{5} & \frac{1}{5} \\ \end{array} \right) \]

here given for \(k = 5\) and with dimensions \((n - 4) \times n\) due to the undefined boundary values.

### 3.1.1 Linear smoothers

A smoother of the form \(\hat{\mathbf{f}} = \mathbf{S}\mathbf{y}\) for a *smoother matrix* \(\mathbf{S}\),
such as the nearest neighbor smoother, is known as a *linear smoother*.
The linear form is often beneficial for theoretical arguments, and many smoothers
considered in this chapter will be linear smoothers. For computing \(\mathbf{f}\)
there may, however, be many alternatives to forming the matrix \(\mathbf{S}\)
and computing the matrix-vector product. Indeed, this is often not the best
way to compute the smoothed values.

It is, on the other hand, useful to see how \(\mathbf{S}\) can be constructed for the symmetric nearest neighbor smoother.

```
## Warning in matrix(w, 147 - 10, 147, byrow = TRUE): data length [148] is not a sub-multiple or
## multiple of the number of rows [137]
```

The construction above relies on vector recycling of `w`

in the construction of `S`

and the fact that `w`

has length \(147 + 1\), which will effectively cause `w`

to be
translated by one to the right every time it is recycled for a new row. As seen,
the code triggers a warning by R, but in this case we get what we want.

`S`

We can use the matrix to smooth the annual average temperature in Nuuk using a
running mean with a window of \(k = 11\) years. That is, the smoothed temperature
at a given year is the average of the temperatures in the period
from five years before to five years after. Note that to add the smoothed
values to the previous plot we need to pad the values at the boundaries
with `NA`

s to get a vector of length 147.

```
# Check first if data is sorted correctly.
# The test is backwards, but confirms that data isn't unsorted :-)
is.unsorted(Nuuk_year$Year)
```

`## [1] FALSE`

```
f_hat <- c(rep(NA, 5), S %*% Nuuk_year$Temperature, rep(NA, 5))
p_Nuuk + geom_line(aes(y = f_hat), color = "blue")
```

### 3.1.2 Implementing the running mean

The running mean smoother fulfills the following identity \[\hat{f}_{i+1} = \hat{f}_{i} - y_{i - (k-1)/2} / k + y_{i + (k + 1)/2} / k,\] which can be used for a much more efficient implementation than the matrix-vector multiplication. It should be emphasized again that the identity above and the implementation below assume that data is sorted according to \(x\)-values.

```
# The vector 'y' must be sorted according to the x-values
run_mean <- function(y, k) {
n <- length(y)
m <- floor((k - 1) / 2)
k <- 2 * m + 1 # Ensures k to be odd and m = (k - 1) / 2
y <- y / k
s <- rep(NA, n)
s[m + 1] <- sum(y[1:k])
for(i in (m + 1):(n - m - 1))
s[i + 1] <- s[i] - y[i - m] + y[i + 1 + m]
s
}
```

The R function `filter()`

(from the stats package) can be used to compute running
means and general moving averages using any weight vector. We compare our two
implementations to `filter()`

.

```
f_hat_filter <- stats::filter(Nuuk_year$Temperature, rep(1/11, 11))
range(f_hat_filter - f_hat, na.rm = TRUE)
```

`## [1] -4.440892e-16 4.440892e-16`

`range(f_hat_filter - run_mean(Nuuk_year$Temperature, 11), na.rm = TRUE)`

`## [1] -1.332268e-15 4.440892e-16`

Note that `filter()`

uses the same boundary convention as used in `run_mean()`

.

A benchmark comparison between matrix-vector multiplication, `run_mean()`

and `filter()`

gives the following table with median run time in microseconds.

```
## expr median
## 1 S1 %*% y[1:512] 484.8890
## 2 S2 %*% y[1:1024] 1950.6210
## 3 S3 %*% y[1:2048] 8258.3855
## 4 S4 %*% y[1:4096] 33075.7655
## 5 run_mean(y[1:512], k = 11) 112.4760
## 6 run_mean(y[1:1024], k = 11) 197.7215
## 7 run_mean(y[1:2048], k = 11) 376.1560
## 8 run_mean(y[1:4096], k = 11) 734.0545
## 9 stats::filter(y[1:512], rep(1/11, 11)) 135.1490
## 10 stats::filter(y[1:1024], rep(1/11, 11)) 180.7525
## 11 stats::filter(y[1:2048], rep(1/11, 11)) 229.8480
## 12 stats::filter(y[1:4096], rep(1/11, 11)) 389.4675
```

The matrix-vector computation is clearly much slower than the two alternatives,
and the time to construct the \(\mathbf{S}\)-matrix has not even been included
in the benchmark above. There is also a difference in how the matrix-vector
multiplication scales with the size of data compared to the alternatives.
Whenever the data size doubles the run time approximately doubles for
both `filter()`

and `run_mean()`

, while it quadruples for the matrix-vector
multiplication. This shows the difference between an algorithm
that scales like \(O(n)\) and an algorithm that scales like \(O(n^2)\)
as the matrix-vector product does.

Despite that `filter()`

is implementing a more general
algorithm than `run_mean()`

, it is still faster, which reflects that
it is implemented in C and compiled.

### 3.1.3 Choose \(k\) by cross-validation

Cross-validation relies predictions of \(y_i\) from \(x_i\)
for data points \((x_i, y_i)\) left out of the data set when the predictor
is fitted to data. Many (linear) smoothers have a natural definition of an “out-of-sample”
prediction, that is, how \(\hat{f}(x)\) is computed for \(x\) not
in the data. If so, it becomes possible to define
\[\hat{f}^{-i}_i = \hat{f}^{-i}(x_i)\]
as the prediction at \(x_i\) using the smoother computed from data
with \((x_i, y_i)\) excluded. However, here we derectly *define*
\[\hat{f}^{-i}_i = \sum_{j \neq i} \frac{S_{ij}y_j}{1 - S_{ii}}\]
for any linear smoother. This definition concurs with the
“out-of-sample” predictor in \(x_i\) for most smoothers,
but this has to be verified case-by-case.

The running mean is a little special in this respect. In the
previous section, the
running mean was only considered for odd \(k\) and using
a symmetric neighbor definition. This is convenient
when considering the running mean *in the observations* \(x_i\).
When considering the running mean in any other point, a symmetric
neighbor definition works better with an even \(k\). This
is exactly what the definition of \(\hat{f}^{-i}_i\) above amounts to.
If \(\mathbf{S}\) is the running mean smoother matrix for an odd \(k\),
then \(\hat{f}^{-i}_i\) corresponds to symmetric \((k-1)\)-nearest
neighbor smoothing excluding \((x_i, y_i)\) from the data.

Using the definition above, we get that the *leave-one-out cross-validation*
squared error criterion becomes
\[\mathrm{LOOCV} = \sum_{i=1}^n (y_i - \hat{f}^{-i}_i)^2 =
\sum_{i=1}^n \left(\frac{y_i - \hat{f}_i}{1 - S_{ii}}\right)^2.\]
The important observation from the identity above is that LOOCV
can be computed without actually computing all the \(\hat{f}^{-i}_i\).

For the running mean, all diagonal elements of the smoother matrix are identical.
We disregard boundary values (with the `NA`

value), so to get
a comparable quantity across different choices of \(k\) we use `mean()`

instead of
`sum()`

in the implementation.

```
loocv <- function(k, y) {
f_hat <- run_mean(y, k)
mean(((y - f_hat) / (1 - 1/k))^2, na.rm = TRUE)
}
```

```
k <- seq(3, 40, 2)
CV <- sapply(k, loocv, y = Nuuk_year$Temperature)
k_opt <- k[which.min(CV)]
qplot(k, CV) + geom_line() + geom_vline(xintercept = k_opt, color = "red")
```

The optimal choice of \(k\) is 15, but the LOOCV criterion jumps quite a lot up and down with changing neighbor size, and \(k = 9\) as well as \(k = 25\) give rather low values as well.

```
p_Nuuk +
geom_line(aes(y = run_mean(Nuuk_year$Temperature, 9)), color = "red") +
geom_line(aes(y = run_mean(Nuuk_year$Temperature, k_opt)), color = "blue") +
geom_line(aes(y = run_mean(Nuuk_year$Temperature, 25)), color = "purple")
```

### 3.1.4 Nadaraya–Watson kernel smoothing

The section before introduced the basic idea of nearest neighbor smoothing using a fixed number of neighbors. A very similar idea is to use a fixed neighborhood, \(B_i\), say, around \(x_i\). This leads to the estimator \[\hat{f}_i = \frac{\sum_{j=1}^n y_j 1_{B_i}(x_j)}{\sum_{j=1}^n 1_{B_i}(x_j)},\] which is simply the average of the \(y\)-s for which the corresponding \(x\)-s fall in the neighborhood \(B_i\) of \(x_i\). Note that contrary to the nearest neighbor estimator, the denominator now also depends on the \(x\)-s.

In a metric space the natural
choice of \(B_i\) is the ball, \(B(x_i, h)\), around \(x_i\) with some radius \(h\).
In \(\mathbb{R}\) with the usual metric (or even \(\mathbb{R}^p\) equipped with any
norm-induced metric) we have that

\[1_{B(x_i, h)}(x) = 1_{B(0, 1)}\left(\frac{x - x_i}{h}\right),\]
thus since \(B(0, 1) = [-1, 1]\) in \(\mathbb{R}\)
\[\hat{f}_i = \frac{\sum_{j=1}^n y_j 1_{[-1, 1]}\left(\frac{x_j - x_i}{h}\right)}{\sum_{l=1}^n 1_{[-1, 1]}\left(\frac{x_l - x_i}{h}\right)}.\]
This nonparametric estimator of the conditional expectation \(E(Y \mid X = x_i)\)
is closely related to the kernel density estimator with the
rectangular kernel, see Section 2.2, and just as for this
estimator there is a natural generalization allowing for arbitrary kernels
instead of the indicator function \(1_{[-1, 1]}\).

With \(K : \mathbb{R} \mapsto \mathbb{R}\) a fixed kernel the corresponding kernel smoother with bandwidth \(h\) becomes \[\hat{f}_i = \frac{\sum_{j=1}^n y_j K\left(\frac{x_j - x_i}{h}\right)}{\sum_{l=1}^n K\left(\frac{x_l - x_i}{h}\right)}.\] This smoother is known as the Nadaraya–Watson kernel smoother or Nadaraya–Watson estimator. See Exercise 3.1 for an additional perspective based on bivariate kernel density estimation.

The Nadaraya–Watson kernel smoother can be implemented in much the same way
as the kernel density estimator, and the run time will inevitably scale
like \(O(n^2)\) unless we exploit special properties of the kernel or use
approximation techniques such as binning. In this section we will focus on
implementing bandwidth selection rather than the kernel smoother itself.
The basic kernel smoothing computation is also implemented in the
`ksmooth`

function from the stats package.

```
f_hat <- ksmooth(
Nuuk_year$Year,
Nuuk_year$Temperature,
kernel = "normal",
bandwidth = 10,
x.points = Nuuk_year$Year
)
p_Nuuk + geom_line(aes(y = f_hat$y), color = "blue")
```

The kernel smoother is clearly a linear smoother, and we will implement its
computation and the bandwidth selection using LOOCV by direct computation
of the the smoother matrix \(\mathbf{S}\), which is given by
\[S_{ij} = \frac{K\left(\frac{x_j - x_i}{h}\right)}{\sum_{l=1}^n K\left(\frac{x_l - x_i}{h}\right)}.\]
The rows sum to 1 by construction, and the diagonal elements are
\[S_{ii} = \frac{K(0)}{\sum_{l=1}^n K\left(\frac{x_l - x_i}{h}\right)}.\]
The computation of \(\mathbf{S}\) for the Gaussian kernel is implemented
using `outer`

and `rowSums`

.

```
kern <- function(x) exp(- x^2 / 2) # The Gaussian kernel
Kij <- outer(Nuuk_year$Year, Nuuk_year$Year, function(x, y) kern((x - y) / 10))
S <- Kij / rowSums(Kij)
f_hat <- S %*% Nuuk_year$Temperature
p_Nuuk + geom_line(aes(y = f_hat), color = "blue")
```

The implementation above will work with any kernel and any sequence of \(x\)-s. In this example, the kernel is symmetric and the \(x\)-s are equidistant. Exercise 3.2 explores how to exploit this in the computation of the smoother matrix as well as the diagonal elements of the smoother matrix.

The smoother computed using `ksmooth`

with bandwidth 10,
as shown in Figure 3.6, is different from the smoother
computed directly from the smoother matrix, see Figure 3.7.
Though both computations use the Gaussian kernel and allegedly a bandwidth of 10,
the resulting smoothers differ because `ksmooth`

internally rescales the
bandwidth. The rescaling amounts to multiplying \(h\) by the factor 0.3706506,
which will make the kernel have quartiles in \(\pm 0.25 h\) (see also `?ksmooth`

).

The LOOCV computation is implemented as a function that computes the smoother matrix and the corresponding LOOCV value as a function of the bandwidth. For comparison with previous implementations the mean is computed instead of the sum.

```
loocv <- function(h) {
Kij <- outer(Nuuk_year$Year, Nuuk_year$Year, function(x, y) kern((x - y) / h))
S <- Kij / rowSums(Kij)
mean(((Nuuk_year$Temperature - S %*% Nuuk_year$Temperature) / (1 - diag(S)))^2)
}
```

```
h <- seq(1, 5, 0.05)
CV <- sapply(h, loocv)
h_opt <- h[which.min(CV)]
qplot(h, CV) + geom_line() + geom_vline(xintercept = h_opt, color = "red")
```

The optimal bandwith is 1.55. We compute the resulting optimal smoother
and compare it to the smoother computed using `ksmooth`

.

```
Kij <- outer(Nuuk_year$Year, Nuuk_year$Year, function(x, y) kern((x - y) / h_opt))
S <- Kij / rowSums(Kij)
f_hat <- S %*% Nuuk_year$Temperature
f_hat_ksmooth <- ksmooth(
Nuuk_year$Year, Nuuk_year$Temperature,
kernel = "normal",
bandwidth = h_opt / 0.3706506, # Rescaling!
x.points = Nuuk_year$Year
)
range(f_hat - f_hat_ksmooth$y)
```

`## [1] -4.535467e-05 4.598776e-05`

The differences are of the order \(10^{-5}\), which is small but larger than
can be explained by rounding errors alone. In fact, `ksmooth`

truncates the
tails of the Gaussian kernel to 0 beyond \(4h\). This is where the kernel becomes
less than \(0.000335 \times K(0) = 0.000134\), which for practical purposes
effectively equals zero. The truncation explains the relatively
large differences between the two results that should otherwise be equivalent. It is an
example where the approximate solution computed by `ksmooth`

is acceptable because
it substantially reduces the run time.

```
p_Nuuk +
geom_line(aes(y = run_mean(Nuuk_year$Temperature, 9)), color = "red") +
geom_line(aes(y = f_hat), color = "blue")
```

Figure 3.9 shows the optimal kernel smoother, which is actually somewhat wiggly. It is locally more smooth than the \(k\)-nearest neighbor smoother but overall comparable to this smoother with \(k = 9\).

## 3.2 Orthogonal basis expansions

### 3.2.1 Polynomial expansions

Degree 19 polynomial fitted to the temperature data.

We can extract the model matrix from the lm-object.

```
intercept <- rep(1/sqrt(n), n) # To make intercept column have norm one
polylm <- lm(Temperature ~ intercept + poly(Year, 19) - 1, data = Nuuk_year)
Phi <- model.matrix(polylm)
```

The model matrix is (almost) orthogonal, and estimation becomes quite simple. With an orthogonal model matrix the normal equation reduces to the estimate \[\hat{\beta} = \Phi^T Y\] since \(\Phi^T \Phi = I\). The predicted (or fitted) values are \(\Phi \Phi^T Y\) with smoother matrix \(\mathbf{S} = \Phi \Phi^T\) being a projection.

`(t(Phi) %*% Nuuk_year$Temperature)[1:10, 1]`

```
## intercept poly(Year, 19)1 poly(Year, 19)2 poly(Year, 19)3 poly(Year, 19)4
## -17.2469646 4.9002430 -1.7968913 0.8175400 5.9668689
## poly(Year, 19)5 poly(Year, 19)6 poly(Year, 19)7 poly(Year, 19)8 poly(Year, 19)9
## 1.4265091 -1.9258864 -0.2523581 -2.1355117 -0.8046267
```

`coef(polylm)[1:10]`

```
## intercept poly(Year, 19)1 poly(Year, 19)2 poly(Year, 19)3 poly(Year, 19)4
## -17.2469646 4.9002430 -1.7968913 0.8175400 5.9668689
## poly(Year, 19)5 poly(Year, 19)6 poly(Year, 19)7 poly(Year, 19)8 poly(Year, 19)9
## 1.4265091 -1.9258864 -0.2523581 -2.1355117 -0.8046267
```

With homogeneous variance \[\hat{\beta}_i \overset{\text{approx}}{\sim} \mathcal{N}(\beta_i, \sigma^2),\] and for \(\beta_i = 0\) we have \(P(|\hat{\beta}_i| \geq 1.96\sigma) \simeq 0.05.\)

Thresholding:

### 3.2.2 Fourier expansions

Introducing \[x_{k,m} = \frac{1}{\sqrt{n}} e^{2 \pi i k m / n},\] then

\[\sum_{k=0}^{n-1} |x_{k,m}|^2 = 1\]

and for \(m_1 \neq m_2\) \[\sum_{k=0}^{n-1} x_{k,m_1}\overline{x_{k,m_2}} = 0\]

Thus \(\Phi = (x_{k,m})_{k,m}\) is an \(n \times n\) unitary matrix; \[\Phi^*\Phi = I\] where \(\Phi^*\) is the conjugate transposed of \(\Phi\).

\(\hat{\beta} = \Phi^* y\) is the *discrete Fourier transform* of \(y\).
It is the basis coefficients in the orthonormal basis given by \(\Phi\);
\[y_k = \frac{1}{\sqrt{n}} \sum_{m=0}^{n-1} \hat{\beta}_m e^{2 \pi i k m / n}\]

or \(y = \Phi \hat{\beta}.\)

The matrix \(\Phi\) generates an interesting pattern.

Columns in the matrix \(\Phi\):

We can estimate by matrix multiplication

```
betahat <- Conj(t(Phi)) %*% Nuuk_year$Temperature # t(Phi) = Phi for Fourier bases
betahat[c(1, 2:4, 73, n:(n - 2))]
```

```
## [1] -17.2469646+0.0000000i -2.4642887+2.3871189i 3.5481329+0.9099226i
## [4] 1.6721444+0.7413580i 0.0321232+0.7089991i -2.4642887-2.3871189i
## [7] 3.5481329-0.9099226i 1.6721444-0.7413580i
```

For real \(y\) it holds that \(\hat{\beta}_0\) is real, and the symmetry \[\hat{\beta}_{n-m} = \hat{\beta}_m^*\] holds for \(m = 1, \ldots, n - 1\). (For \(n\) even, \(\hat{\beta}_{n/2}\) is real too).

Modulus distribution:

Note that for \(m \neq 0, n/2\), \(\beta_m = 0\) and \(y \sim \mathcal{N}(\Phi\beta, \sigma^2 I_n)\) then

\[(\mathrm{Re}(\hat{\beta}_m), \mathrm{Im}(\hat{\beta}_m))^T \sim \mathcal{N}\left(0, \frac{\sigma^2}{2} I_2\right),\]

hence \[|\hat{\beta}_m|^2 = \mathrm{Re}(\hat{\beta}_m)^2 + \mathrm{Im}(\hat{\beta}_m)^2 \sim \frac{\sigma^2}{2} \chi^2_2,\] that is, \(P(|\hat{\beta}_m| \geq 1.73 \sigma) = 0.05.\) There is a clear case of multiple testing if we use this threshold at face value, and we would expect around \(0.05 \times n/2\) false positive if there is no signal at all. Lowering the probability using the Bonferroni correction yields a threshold of around \(2.7 \sigma\) instead.

Thresholding Fourier:

The coefficients are not independent (remember the symmetry), and one can alternatively consider \[\hat{\gamma}_m = \sqrt{2} \mathrm{Re}(\hat{\beta}_m) \quad \text{and} \quad \hat{\gamma}_{n' + m} = - \sqrt{2} \mathrm{Im}(\hat{\beta}_m)\] for \(1 \leq m < n / 2\). Here \(n' = \lfloor n / 2 \rfloor\). Here, \(\hat{\gamma}_0 = \hat{\beta}_0\), and \(\hat{\gamma}_{n/2} = \hat{\beta}_{n/2}\) for \(n\) even.

These coefficients are the coefficients in a real cosine, \(\sqrt{2} \cos(2\pi k m / n)\), and sine, \(\sqrt{2} \sin(2\pi k m / n)\), basis expansion, and they are i.i.d. \(\mathcal{N}(0, \sigma^2)\) distributed.

Thresholding Fourier:

What is the point using the discrete Fourier transform?
The point is that the discrete Fourier transform can be computed via the
*fast Fourier transform* (FFT), which has an \(O(n\log(n))\) time complexity.
The FFT works optimally for \(n = 2^p\).

```
## [1] -17.246965+0.000000i -2.464289+2.387119i 3.548133+0.909923i
## [4] 1.672144+0.741358i
```

`betahat[1:4]`

```
## [1] -17.246965+0.000000i -2.464289+2.387119i 3.548133+0.909923i
## [4] 1.672144+0.741358i
```

## 3.3 Splines

In the previous section orthogonality of basis functions played an important role for computing basis function expansions efficiently as well as for the statistical assessment of estimated coefficients. This section will deal with bivariate smoothing via basis functions that are not necessarily orthogonal.

Though some of the material of this section will apply to any choice of basis, we restrict attention to splines and consider almost exclusively the widely used B-splines (the “B” is for basis).

### 3.3.1 Smoothing splines

To motivate splines we briefly consider the following penalized least squares criterion for finding a smooth approximation to bivariate data: minimize \[\begin{equation} L(f) = \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f''\|_2^2 \tag{3.1} \end{equation}\] over all twice differentiable functions \(f\). The first term is the standard squared error, and we can easily find a smooth function interpolating the \(y\)-values (if all the \(x\)-values are different), which will thus drive the squared error to 0. The squared 2-norm regularizes the minimization problem so that the minimizer finds a balance between interpolation and having a small second derivative (note that \(\|f''\|_2 = 0\) if and only if \(f\) is an affine function). The tuning parameter \(\lambda\) controls this balance.

It is possible to show that the minimizer of (3.1) is a
natural cubic spline with
*knots* in the data points \(x_i\). That is, the spline is a \(C^2\)-function
that equals a third degree polynomial in between the knots. At the knots,
the two polynomials that meet fit together up to the second derivative, but they
may differ on the third derivative. That the solution is *natural* means
that it has zero second and third derivative at and beyond the two boundary knots.

It is not particularly difficult to show that the space of natural cubic splines is a vector space of dimension \(n\) if all the \(x\)-values are different. It is therefore possible to find a basis of splines, \(\varphi_1, \ldots, \varphi_n\), such that the \(f\) that minimizes (3.1) is of the form \[f = \sum_{i=1}^n \beta_i \varphi_i.\] What is remarkable about this is that the basis (and the finite dimensional vector space it spans) doesn’t depend upon the \(y\)-values. Though the optimization is over an infinite dimensional space, the penalization ensures that the minimizer is always in the same finite dimensional space nomatter what \(y_1, \ldots, y_n\) are. Moreover, since (3.1) is a quite natural criterion to minimize to find a smooth function fitting the bivariate data, splines appear as good candidates for producing such smooth fits. On top of that, splines have several computational advantages and are widely used.

If we let \(\hat{f}_i = \hat{f}(x_i)\) with \(\hat{f}\) the minimizer of (3.1), we have in vector notation that \[\hat{\mathbf{f}} = \boldsymbol{\Phi}\hat{\beta}\] with \(\boldsymbol{\Phi}_{ij} = \varphi_j(x_i)\). The minimizer can be found by observing that

\[\begin{align} L(\mathbf{f}) & = (\mathbf{y} - \mathbf{f})^T (\mathbf{y} - \mathbf{f}) + \lambda \| f'' \|_2^2 \\ & = ( \mathbf{y} - \boldsymbol{\Phi}\beta)^T (\mathbf{y} - \boldsymbol{\Phi}\beta) + \lambda \beta^T \mathbf{\Omega} \beta \end{align}\]

where
\[\mathbf{\Omega}_{ij} = \langle \varphi_i'', \varphi_j'' \rangle =
\int \varphi_i''(z) \varphi_j''(z) \mathrm{d}z.\]
The matrix \(\mathbf{\Omega}\) is positive semidefinite by construction, and
we refer to it as the *penalty matrix*. It induces a seminorm on \(\mathbb{R}^n\)
so that we can express the seminorm, \(\|f''\|_2\), of \(f\) in terms of the
parameters in the basis expansion using \(\varphi_i\).

This is a standard penalized least squares problem, whose solution is \[\hat{\beta} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{\Omega})^{-1}\boldsymbol{\Phi}^T \mathbf{y}\] and with resulting smoother \[\hat{\mathbf{f}} = \underbrace{\boldsymbol{\Phi} ((\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{\Omega})^{-1}\boldsymbol{\Phi}^T}_{\mathbf{S}_{\lambda}} \mathbf{y}.\] This linear smoother with smoothing matrix \(\mathbf{S}_{\lambda}\) based on natural cubic splines gives what is known as a smoothing spline that minimizes (3.1). We will pursue spline based smoothing by minimizing (3.1) but using various B-spline bases that may have more or less than \(n\) elements. For the linear algebra, it actually doesn’t matter if we use a spline basis or any other basis – as long as \(\boldsymbol{\Phi}_{ij} = \varphi_j(x_i)\) and \(\mathbf{\Omega}\) is given in terms of \(\varphi''_i\) as above.

### 3.3.2 Splines in R

The splines package in R implements some of the basic functions needed to work
with splines. In particular, the `splineDesign()`

function that computes evaluations
of B-splines and their derivatives.

`library(splines)`

```
# Note the specification of repeated boundary knots
knots <- c(0, 0, 0, seq(0, 1, 0.2), 1, 1, 1)
xx <- seq(0, 1, 0.005)
B_splines <- splineDesign(knots, xx)
matplot(xx, B_splines, type = "l", lty = 1)
```

The basis shown in Figure 3.13 is an example of a cubic B-spline
basis with the 11 inner knots \(0, 0.1, \ldots, 0.9, 1\). The repeated
boundary knots control how the spline basis behaves close to the boundaries
of the interval. This basis has 13 basis functions, not 11, and spans a
larger space than the space of *natural* cubic splines. It is
possible to compute a basis based on B-splines for the natural cubic splines
using the function `ns`

, but for all practical purposes this is not
important, and we will work exclusively with the B-spline basis itself.

The computation of the penalty matrix \(\mathbf{\Omega}\) constitutes a
practical problem, but observing that \(\varphi''_i\) is an affine
function in between knots leads to a simple way of
computing \(\mathbf{\Omega}_{ij}\). Letting \(g_{ij} = \varphi''_i \varphi''_j\)
it holds that \(g_{ij}\) is quadratic between two consecutive knots \(a\) and \(b\),
in which case
\[\int_a^b g_{ij}(z) \mathrm{d}z = \frac{b - a}{6}\left(g_{ij}(a) + 4 g_{ij}\left(\frac{a + b}{2}\right) + g_{ij}(b)\right).\]
This identity is behind Simpson’s rule
for numerical integration, and the fact that this is an identity for quadratic
polynomials, and not an approximation, means that Simpson’s rule
applied appropriately leads to exact computation of
\(\mathbf{\Omega}_{ij}\). All we need is the ability to evaluate
\(\varphi''_i\) at certain points, and `splineDesign()`

can be used for that.

```
pen_mat <- function(inner_knots) {
knots <- sort(c(rep(range(inner_knots), 3), inner_knots))
d <- diff(inner_knots) # The vector of knot differences; b - a
g_ab <- splineDesign(knots, inner_knots, derivs = 2)
knots_mid <- inner_knots[-length(inner_knots)] + d / 2
g_ab_mid <- splineDesign(knots, knots_mid, derivs = 2)
g_a <- g_ab[-nrow(g_ab), ]
g_b <- g_ab[-1, ]
(crossprod(d * g_a, g_a) +
4 * crossprod(d * g_ab_mid, g_ab_mid) +
crossprod(d * g_b, g_b)) / 6
}
```

It is laborious to write good tests of `pen_mat()`

. We would have to work out a
set of example matrices by other means, e.g. by hand. Alternatively, we can
compare to a simpler numerical integration technique using Riemann sums.

```
spline_deriv <- splineDesign(
c(0, 0, 0, 0, 0.5, 1, 1, 1, 1),
seq(0, 1, 1e-5),
derivs = 2
)
Omega_numeric <- crossprod(spline_deriv[-1, ]) * 1e-5 # Right Riemann sums
Omega <- pen_mat(c(0, 0.5, 1))
Omega_numeric / Omega
```

```
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9999700 0.9999673 0.999940 1.000000 NaN
## [2,] 0.9999673 0.9999663 0.999955 1.000000 1.000000
## [3,] 0.9999400 0.9999550 1.000000 1.000045 1.000060
## [4,] 1.0000000 1.0000000 1.000045 1.000034 1.000033
## [5,] NaN 1.0000000 1.000060 1.000033 1.000030
```

`range((Omega_numeric - Omega) / (Omega + 0.001)) # Relative error`

`## [1] -5.99967e-05 5.99983e-05`

And we should also test an example with non-equidistant knots.

```
spline_deriv <- splineDesign(
c(0, 0, 0, 0, 0.2, 0.3, 0.5, 0.6, 0.65, 0.7, 1, 1, 1, 1),
seq(0, 1, 1e-5),
derivs = 2
)
Omega_numeric <- crossprod(spline_deriv[-1, ]) * 1e-5 # Right Riemann sums
Omega <- pen_mat(c(0, 0.2, 0.3, 0.5, 0.6, 0.65, 0.7, 1))
range((Omega_numeric - Omega) / (Omega + 0.001)) # Relative error
```

`## [1] -0.0001607084 0.0002545494`

These examples indicate that `pen_mat()`

computes \(\mathbf{\Omega}\) correctly,
in particular as increasing the Riemann sum precision by
lowering the number \(10^{-5}\) will decrease the relative error (not shown).
Of course, correctness ultimately depends on `splineDesign()`

computing
the correct second derivatives, which hasn’t been tested here.

We can also test how our implementation of smoothing splines works on data. We do this here by implementing the matrix-algebra directly for computing \(\mathbf{S}_{\lambda} \mathbf{y}\).

```
inner_knots <- Nuuk_year$Year
Phi <- splineDesign(c(rep(range(inner_knots), 3), inner_knots), inner_knots)
Omega <- pen_mat(inner_knots)
smoother <- function(lambda)
Phi %*% solve(
crossprod(Phi) + lambda * Omega,
t(Phi) %*% Nuuk_year$Temperature
)
p_Nuuk +
geom_line(aes(y = smoother(10)), color = "blue") + # Undersmooth
geom_line(aes(y = smoother(1000)), color = "red") + # Smooth
geom_line(aes(y = smoother(100000)), color = "purple") # Oversmooth
```

Smoothing splines can be computed using the R function `smooth.spline()`

from the stats package. It is possible to manually specify the amount of
smoothing using one of the arguments `lambda`

, `spar`

or `df`

(the latter being the trace of the smoother matrix). However,
due to internal differences from the `splineDesign()`

basis
above, the `lambda`

argument to `smooth.spline()`

does not match the
\(\lambda\) parameter above.

If the amount of smoothing is not manually set, `smooth.spline()`

chooses
\(\lambda\) by *generalized cross validation* (GCV), which minimizes
\[\mathrm{GCV} = \sum_{i=1}^n \left(\frac{y_i - \hat{f}_i}{1 - \mathrm{df} / n}\right)^2,\]
where
\[\mathrm{df} = \mathrm{trace}(\mathbf{S}) = \sum_{i=1}^n S_{ii}.\]
GCV corresponds to LOOCV with the diagonal entries, \(S_{ii}\),
replaced by their average \(\mathrm{df} / n\). The main reason for
using GCV over LOOCV is that for some smoothers, such as the
spline smoother, it is possible to compute the trace \(\mathrm{df}\) easily
without computing \(\mathbf{S}\) or even its diagonal elements.

To compare our results to `smooth.spline()`

we optimize the GCV criterion.
First we implement a function that computes GCV for a fixed value of \(\lambda\).
Here the implementation is relying on computing the smoother matrix, but
this is not the most efficient implementation. Section 3.3.3
provides a diagonalization of the smoother matrix jointly in the tuning
parameters. This representation allows for efficient computation
with splines, and it will become clear why it is not necessary to compute
\(\mathbf{S}\) or even its diagonal elements. The trace is nevertheless
easily computable \(\mathbf{S}\).

```
gcv <- function(lambda, y) {
S <- Phi %*% solve(crossprod(Phi) + lambda * Omega, t(Phi))
df <- sum(diag(S)) # The trace of the smoother matrix
sum(((y - S %*% y) / (1 - df / length(y)))^2, na.rm = TRUE)
}
```

Then we apply this function to a grid of \(\lambda\)-values and choose the value of \(\lambda\) that minimizes GCV.

```
lambda <- seq(50, 250, 2)
GCV <- sapply(lambda, gcv, y = Nuuk_year$Temperature)
lambda_opt <- lambda[which.min(GCV)]
qplot(lambda, GCV) + geom_vline(xintercept = lambda_opt, color = "red")
```

Finally, we can visualize the resulting smoothing spline.

```
smooth_opt <- Phi %*% solve(
crossprod(Phi) + lambda_opt * Omega,
t(Phi) %*% Nuuk_year$Temperature
)
p_Nuuk + geom_line(aes(y = smooth_opt), color = "blue")
```

The smoothing spline that we found by minimizing GCV can be compared to
the smoothing spline that `smooth.spline()`

computes by minimizing GCV
as well.

```
smooth_splines <- smooth.spline(
Nuuk_year$Year,
Nuuk_year$Temperature,
all.knots = TRUE # Don't use heuristic
)
range(smooth_splines$y - smooth_opt)
```

`## [1] -0.000775662 0.001072587`

The differences between the smoothing spline computed by our implementation
and by `smooth.spline()`

is hardly detectable visually, and they are at most
of the order \(10^{-3}\) as computed above. It is possible to further
decrease the differences by finding the optimal value of \(\lambda\) with a
higher precision, but we will not pursue this here.

### 3.3.3 Efficient computation with splines

Using the full B-spline basis with knots in every observation is computationally
heavy and from a practical viewpoint unnecessary. Smoothing using B-splines
is therefore often done using a knot-selection heuristic that selects much fewer
knots than \(n\), in particular if \(n\) is large. This is also what `smooth.spline()`

does unless `all.knots = TRUE`

. The heuristic for selecting
the number of knots is a bit complicated, but it is implemented in the function `.nknots.smspl()`

,
which can be inspected for details. Once the number of knots gets above 200 it
grows extremely slowly with \(n\). With the number of knots selected,
a common heuristic for selecting their position is to use the quantiles of
the distribution of the \(x\)-values. That is, with 9 knots, say, the knots
are positioned in the deciles (0.1-quantile, 0.2-quantile etc.). This is
effectively also what `smooth.spline()`

does, and this heuristic places most of
the knots where we have most of the data points.

Having implemented a knot-selection heuristic that results in \(p\) B-spline basis functions, the matrix \(\Phi\) will be \(n \times p\), typically with \(p < n\) and with \(\Phi\) of full rank \(p\). In this case we derive a way of computing the smoothing spline that is computationally more efficient and numerically more stable than relying on the matrix-algebraic solution above. This is particularly so when we need to compute the smoother for many different \(\lambda\)-s to optimize the smoother. As we will show, we are effectively computing a simultaneous diagonalization of the (symmetric) smoother matrix \(\mathbf{S}_{\lambda}\) for all values of \(\lambda\).

The matrix \(\Phi\) has a singular value decomposition \[\Phi = \mathbf{U} D \mathbf{V}^T\] where \(D\) is diagonal with entries \(d_1 \geq d_2 \geq \ldots \geq d_p > 0\), \(\mathbf{U}\) is \(n \times p\), \(\mathbf{V}\) is \(p \times p\) and both are orthogonal matrices. This means that \[\mathbf{U}^T \mathbf{U} = \mathbf{V}^T \mathbf{V} = \mathbf{V} \mathbf{V}^T = I\] is the \(p \times p\) dimensional identity matrix. We find that

\[\begin{align} \mathbf{S}_{\lambda} & = \mathbf{U}D\mathbf{V}^T(\mathbf{V}D^2\mathbf{V}^T + \lambda \mathbf{\Omega})^{-1} \mathbf{V}D\mathbf{U}^T \\ & = \mathbf{U}D (D^2 + \lambda \mathbf{V}^T \mathbf{\Omega} \mathbf{V})^{-1} D \mathbf{U}^T \\ & = \mathbf{U} (I + \lambda D^{-1} \mathbf{V}^T \mathbf{\Omega} \mathbf{V} D^{-1})^{-1} \mathbf{U}^T \\ & = \mathbf{U}(I + \lambda \widetilde{\mathbf{\Omega}})^{-1} \mathbf{U}^T, \end{align}\]

where \(\widetilde{\mathbf{\Omega}} = D^{-1} \mathbf{V}^T \mathbf{\Omega} \mathbf{V} D^{-1}\) is a positive semidefinite \(p \times p\) matrix. By diagonalization, \[\widetilde{\mathbf{\Omega}} = \mathbf{W} \Gamma \mathbf{W}^T,\] where \(\mathbf{W}\) is orthogonal and \(\Gamma\) is a diagonal matrix with nonnegative values in the diagonal, and we find that

\[\begin{align} \mathbf{S}_{\lambda} & = \mathbf{U} \mathbf{W} (I + \lambda \Gamma)^{-1} \mathbf{W}^T \mathbf{U}^T \\ & = \widetilde{\mathbf{U}} (I + \lambda \Gamma)^{-1} \widetilde{\mathbf{U}}^T \end{align}\]

where \(\widetilde{\mathbf{U}} = \mathbf{U} \mathbf{W}\) is an orthogonal \(n \times p\) matrix.

The interpretation of this representation is as follows.

- First, the coefficients, \(\hat{\beta} = \widetilde{\mathbf{U}}^Ty\), are computed for expanding \(y\) in the basis given by the columns of \(\widetilde{\mathbf{U}}\).
- Second, the \(i\)-th coefficient is shrunk towards 0, \[\hat{\beta}_i(\lambda) = \frac{\hat{\beta}_i}{1 + \lambda \gamma_i}.\]
- Third, the smoothed values, \(\widetilde{\mathbf{U}} \hat{\beta}(\lambda)\), are computed as an expansion using the shrunken coefficients.

Thus the smoother works by shrinking the coefficients toward zero in the orthonormal basis given by the columns of \(\widetilde{\mathbf{U}}\). The coefficients corresponding to the largest eigenvalues \(\gamma_i\) are shrunk relatively more toward zero than those corresponding to the small eigenvalues. The basis formed by the columns of \(\widetilde{\mathbf{U}}\) is known as the Demmler-Reinsch basis with reference to Demmler and Reinsch (1975).

We implement the computation of the diagonalization for the Nuuk temperature data using \(p = 20\) basis functions (18 inner knots) equidistantly distributed over the range of the years for which we have data.

```
inner_knots <- seq(1867, 2013, length.out = 18)
Phi <- splineDesign(c(rep(range(inner_knots), 3), inner_knots), Nuuk_year$Year)
Omega <- pen_mat(inner_knots)
Phi_svd <- svd(Phi)
Omega_tilde <- t(crossprod(Phi_svd$v, Omega %*% Phi_svd$v)) / Phi_svd$d
Omega_tilde <- t(Omega_tilde) / Phi_svd$d
# It is safer to use the numerical singular value decomposition ('svd()')
# for diagonalizing a positive semidefinite matrix than to use a
# more general numerical diagonalization implementation such as 'eigen()'.
Omega_tilde_svd <- svd(Omega_tilde)
U_tilde <- Phi_svd$u %*% Omega_tilde_svd$u
```

We observe from Figures 3.17 and 3.18 that there are two relatively large eigenvalues corresponding to the two basis functions with erratic behavior close to the boundaries, and there are two eigenvalues that are effectively zero corresponding to the two affine basis functions. In addition, the more oscillating the basis function is, the larger is the corresponding eigenvalue, and the more is the corresponding coefficient shrunk toward zero by the spline smoother.

Observe also that \[\mathrm{df}(\lambda) = \mathrm{trace}(\mathbf{S}_\lambda) = \sum_{i=1}^p \frac{1}{1 + \lambda \gamma_i},\] which makes it possible to implement GCV without even computing the diagonal entries of \(\mathbf{S}_{\lambda}.\)

## 3.4 The Kalman smoother

The Kalman smoother is a classical algorithm
used for smoothing time-indexed observations. It can be seen as an example
of Gaussian prediction algorithms, where we imagine that

\((X_1, Y_1), \ldots, (X_n,Y_n)\) follow a joint Gaussian distribution,
and where we observe only \(Y_1, \ldots, Y_n\) but want to predict
\(X_1, \ldots, X_n\). If the \(X\)-s form a relatively smooth
but unobserved Gaussian process and the \(Y\)-s are noisy measurements
of the \(X\)-s, we see how the predictions of the \(X\)-s from the
\(Y\)-s can be seen as a smoother. In contrast to the bivariate
smoothers developed in the sections above, the approach taken
in this section is fully probabilistic. That is, the algorithms and
the parameters are justified and entirely determined by a
probabilistic model of the data. That said, we will see some
clear similarities with other smoothers.

### 3.4.1 Gaussian processes

We suppose that we have a time grid \(t_1 < \ldots < t_n\) and observe \(Y_1, \ldots, Y_n\) where \(Y_i\) is observed at time \(t_i\). The distribution of the \(Y\)-s is given in terms of \(X = (X_1, \ldots, X_n)\), where we suppose \(X \sim \mathcal{N}(\xi_x, \Sigma_{x})\) with \[(\Sigma_{x})_{i,j} = \mathrm{cov}(X_i, X_j) = K(t_i - t_j)\] for a kernel function \(K : \mathbb{R} \to \mathbb{R}\).

With the *observation equation* \(Y_i = X_i + \delta_i\)
for \(\delta = (\delta_1, \ldots, \delta_n) \sim \mathcal{N}(0, \Omega)\) and \(\delta \perp \! \! \perp X\) we get

\[(X, Y) \sim \mathcal{N}\left(\left(\begin{array}{c} \xi_x \\ \xi_x \end{array}\right), \left(\begin{array}{cc} \Sigma_x & \Sigma_x \\ \Sigma_x & \Sigma_x + \Omega \end{array} \right) \right).\]

Hence \[E(X \mid Y) = \xi_x + \Sigma_x (\Sigma_x + \Omega)^{-1} (Y - \xi_x).\]

Assuming that \(\xi_x = 0\) the conditional expectation is a linear smoother with smoother matrix \[S = \Sigma_x (\Sigma_x + \Omega)^{-1}.\]

This is also true if \(\Sigma_x (\Sigma_x + \Omega)^{-1} \xi_x = \xi_x\). If this identity holds approximately, we can argue that for computing \(E(X \mid Y)\) we don’t need to know \(\xi_x\).

If the observation variance is \(\Omega = \sigma^2 I\) then the smoother matrix simplifies to \[S = \Sigma_x (\Sigma_x + \sigma^2 I)^{-1} = (I + \sigma^2 \Sigma_x^{-1})^{-1}.\] Thus knowing the observation variance \(\sigma^2\) and the covariance matrix \(\Sigma_x\), which is determined entirely from the time-grid and the covariance kernel \(K\), we can compute the smoother matrix \(S\). Observing \(Y = y\), we can then compute the smoother as the solution to the equation \[(I + \sigma^2 \Sigma_x^{-1})x = y.\] Standard algorithms for matrix inversion, needed to compute \(S\), have time complexity \(O(n^3)\), and standard algorithms for solving the linear equation have time complexity \(O(n^2)\).

The main point of the Kalman smoother is that its have time complexity is \(O(n)\), and its computation is thus very fast and scales well with \(n\). The tradeoff is that the Kalman smoother puts some restrictions on \(K\) and thus \(\Sigma_x\). We consider the AR(1)-example in the section below, where we show that \(I + \sigma^2 \Sigma_x^{-1}\) is a tridiagonal matrix, and the smoother can be computed as a solution to ?? in time that is linear in \(n\).

### 3.4.2 AR(1) Kalman smoother

The simplest example with an efficient Kalman smoother is the AR(1)-model, where we assume an equidistant time grid (that is, \(t_i = i\)). Suppose that \(|\alpha| < 1\), \(X_1 = \epsilon_1 / \sqrt{1 - \alpha^2}\) and \[X_i = \alpha X_{i-1} + \epsilon_i\] for \(i = 2, \ldots, n\) with \(\epsilon = (\epsilon_1, \ldots, \epsilon_n) \sim \mathcal{N}(0, \sigma^2 I)\).

We have \(\mathrm{cov}(X_i, X_j) = \alpha^{|i-j|} / (1 - \alpha^2)\), thus we can find \(\Sigma_x\) and compute \[E(X_i \mid Y) = ((I + \sigma^2 \Sigma_x^{-1})^{-1} Y)_i\]

From the identity \(\epsilon_i = X_i - \alpha X_{i-1}\) it follows that \(\epsilon = A X\) where

\[A = \left( \begin{array}{cccccc} \sqrt{1 - \alpha^2} & 0 & 0 & \ldots & 0 & 0 \\ -\alpha & 1 & 0 & \ldots & 0 & 0 \\ 0 & -\alpha & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & 1 & 0 \\ 0 & 0 & 0 & \ldots & -\alpha & 1 \\ \end{array}\right),\]

This gives \(I = V(\epsilon) = A \Sigma_x A^T\), hence \[\Sigma_x^{-1} = (A^{-1}(A^T)^{-1})^{-1} = A^T A.\]

We have shown that \[\Sigma_x^{-1} = \left( \begin{array}{cccccc} 1 & -\alpha & 0 & \ldots & 0 & 0 \\ -\alpha & 1 + \alpha^2 & -\alpha & \ldots & 0 & 0 \\ 0 & -\alpha & 1 + \alpha^2 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & 1 + \alpha^2 & -\alpha \\ 0 & 0 & 0 & \ldots & -\alpha & 1 \\ \end{array}\right).\]

Hence \[I + \sigma^2 \Sigma_x^{-1} = \left( \begin{array}{cccccc} \gamma_0 & \rho & 0 & \ldots & 0 & 0 \\ \rho & \gamma & \rho & \ldots & 0 & 0 \\ 0 & \rho & \gamma & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & \gamma & \rho \\ 0 & 0 & 0 & \ldots & \rho & \gamma_0 \\ \end{array}\right)\]

with \(\gamma_0 = 1 + \sigma^2\), \(\gamma = 1 + \sigma^2 (1 + \alpha^2)\) and \(\rho = -\sigma^2 \alpha\) is a *tridiagonal matrix.*

The equation

\[\left( \begin{array}{cccccc} \gamma_0 & \rho & 0 & \ldots & 0 & 0 \\ \rho & \gamma & \rho & \ldots & 0 & 0 \\ 0 & \rho & \gamma & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & \gamma & \rho \\ 0 & 0 & 0 & \ldots & \rho & \gamma_0 \\ \end{array}\right) \left( \begin{array}{c} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_{n-1} \\ x_n \end{array}\right) = \left(\begin{array}{c} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{n-1} \\ y_n \end{array}\right)\]

can be solved by a forward and backward sweep.

**Forward sweep:**

- Set \(\rho_1' = \rho / \gamma_0\) and \(y_1' = y_1 / \gamma_0\),
- then recursively \[\rho_i' = \frac{\rho}{\gamma - \rho \rho_{i-1}'} \quad \text{and} \quad y_i' = \frac{y_i - \rho y_{i-1}'}{\gamma - \rho \rho_{i-1}'}\] for \(i = 2, \ldots, n-1\)
- and finally \[y_n' = \frac{y_n - \rho y_{n-1}'}{\gamma_0 - \rho \rho_{n-1}'}.\]

By the forward sweep the equation is transformed to

\[\left( \begin{array}{cccccc} 1 & \rho_1' & 0 & \ldots & 0 & 0 \\ 0 & 1 & \rho_2' & \ldots & 0 & 0 \\ 0 & 0 & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & 1 & \rho_{n-1}' \\ 0 & 0 & 0 & \ldots & 0 & 1 \\ \end{array}\right) \left( \begin{array}{c} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_{n-1} \\ x_n \end{array}\right) = \left(\begin{array}{c} y_1' \\ y_2' \\ y_3' \\ \vdots \\ y_{n-1}' \\ y_n' \end{array}\right),\]

which is then solved by backsubstitution from below; \(x_n = y_n'\) and \[x_{i} = y_i' - \rho_{i}' x_{i+1}, \quad i = n-1, \ldots, 1.\]

### 3.4.3 Implementation

```
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double alpha, double sigmasq) {
NumericVector KalmanSmooth(NumericVector y, double tmp, gamma0 = 1 + sigmasq, rho = - sigmasq * alpha;
double gamma = 1 + sigmasq * (1 + alpha * alpha);
int n = y.size();
1);
NumericVector x(n), rhop(n - 0] = rho / gamma0;
rhop[0] = y[0] / gamma0;
x[for(int i = 1; i < n - 1; ++i) { /* Forward sweep */
1]);
tmp = (gamma - rho * rhop[i -
rhop[i] = rho / tmp;1]) / tmp;
x[i] = (y[i] - rho * x[i -
}1] = (y[n - 1] - rho * x[n - 2]) / (gamma0 - rho * rhop[n - 2]);
x[n - for(int i = n - 2; i >= 0; --i) { /* Backsubstitution */
1];
x[i] = x[i] - rhop[i] * x[i +
}return x;
}
```

Result, \(\alpha = 0.95\), \(\sigma^2 = 10\)

Comparing results

```
Sigma <- outer(1:n, 1:n,
function(i, j) alpha^(abs(i - j))) / (1 - alpha^2)
Smooth <- Sigma %*% solve(Sigma + sigmasq * diag(n))
qplot(1:n, Smooth %*% Nuuk_year$Temperature - ySmooth) +
ylab("Difference")
```

Note that the forward sweep computes \(x_n = E(X_n \mid Y)\), and from this, the backsubstitution solves the smoothing problem of computing \(E(X \mid Y)\).

The Gaussian process used here (the AR(1)-process) is not very smooth and nor is the smoothing of the data. This is related to the kernel function \(K(s) = \alpha^{|s|}\) being non-differentiable in 0.

Many smoothers are equivalent to a Gaussian process smoother with an appropriate choice of kernel. Not all have a simple inverse covariance matrix and a Kalman filter algorithm.

### 3.4.4 The Kalman filter

```
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double alpha, double sigmasq) {
NumericVector KalmanFilt(NumericVector y, double tmp, gamma0 = 1 + sigmasq, rho = - sigmasq * alpha, yp;
double gamma = 1 + sigmasq * (1 + alpha * alpha);
int n = y.size();
NumericVector x(n), rhop(n);0] = rho / gamma0;
rhop[0] / gamma0;
yp = y[0] = y[0] / (1 + sigmasq * (1 - alpha * alpha));
x[for(int i = 1; i < n; ++i) {
1]);
tmp = (gamma - rho * rhop[i -
rhop[i] = rho / tmp;/* Note differences when compared to smoother */
1]);
x[i] = (y[i] - rho * yp) / (gamma0 - rho * rhop[i -
yp = (y[i] - rho * yp) / tmp;
}return x;
}
```

Result, \(\alpha = 0.95\), \(\sigma^2 = 10\)

## 3.5 Exercises

### Kernel estimators

**Exercise 3.1 **Consider a bivariate data set \((x_1, y_1), \ldots, (x_n, y_n)\) and let \(K\) be a
probability density with mean 0. Then
\[\hat{f}(x, y) = \frac{1}{n h^2} \sum_{i=1}^n K\left(\frac{x - x_i}{h}\right) K\left(\frac{y - y_i}{h}\right)\]
is a bivariate kernel density estimator of the joint density of \(x\) and \(y\). Show
that the kernel density estimator
\[\hat{f}_1(x) = \frac{1}{n h} \sum_{i=1}^n K\left(\frac{x - x_i}{h}\right)\]
is also the marginal distribution of \(x\) under \(\hat{f}\), and that the
Nadaraya-Watson kernel smoother is the conditional expectation of \(y\)
given \(x\) under \(\hat{f}\).

**Exercise 3.2**Suppose that \(K\) is a symmetric kernel and the \(x\)-s are equidistant. Implement a function that computes the smoother matrix using the

`toeplitz`

function and
\(O(n)\) kernel evaluations where \(n\) is the number of data points. Implement also
a function that computes the diagonal elements of the smoother matrix directly
with run time \(O(n)\). *Hint: find inspiration in the implementation of the running mean*.