## 3.1 Nearest neighbor smoothers

One of the most basic ideas for smoothing bivariate data is to use a running mean or moving average. This is particularly sensible when the \(x\)-values are equidistant, e.g. when the observations constitute a time series such as the Nuuk temperature data. The running mean is a special case of the more general class of nearest neighbor smoothers.

Mathematically, the \(k\) nearest neighbor smoother at \(x_i\) is defined as \[\hat{f}_i = \frac{1}{k} \sum_{j \in N_i} y_j\] where \(N_i\) is the set of indices of the \(k\) nearest neighbors of \(x_i\). This simple idea is actually very general and powerful. It works as long as the \(x\)-values lie in a metric space, and by letting \(k\) grow with \(n\) it is possible to construct consistent nonparametric estimators of regression functions, \(f(x) = E(Y \mid X = x)\), under minimal assumptions. The practical problem is that \(k\) must grow slowly with \(n\) in high dimensions, so the estimator is no panacea.
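
To make the definition concrete, here is a direct, if inefficient, implementation for real-valued \(x\)-values; `knn_smooth` is an illustrative name and not part of the code developed in this chapter.

```
knn_smooth <- function(x, y, x0, k) {
  # Indices of the k observations closest to x0 in the ordinary metric
  N <- order(abs(x - x0))[1:k]
  mean(y[N])
}
```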

In this chapter we focus exclusively on \(x\) being real valued with the
ordinary metric used to define the nearest neighbors. The total ordering of the
real line adds a couple of extra possibilities to the definition of \(N_i\).
When \(k\) is odd, the *symmetric* nearest neighbor smoother takes \(N_i\) to consist
of \(x_i\) together with the \((k-1)/2\) smaller \(x_j\)-s closest to
\(x_i\) and the \((k-1)/2\) larger \(x_j\)-s closest to \(x_i\). It is
also possible to choose a one-sided smoother with \(N_i\) corresponding
to the \(k\) smaller \(x_j\)-s closest to \(x_i\), in which case the smoother would
be known as a causal filter.

The symmetric definition of neighbors makes it very easy to handle the neighbors computationally; we don’t need to compute and keep track of the \(n^2\) pairwise distances between the \(x_i\)-s, we only need to sort data according to the \(x\)-values. Once data is sorted, \[N_i = \{i - (k - 1) / 2, i - (k - 1) / 2 + 1, \ldots, i - 1 , i, i + 1, \ldots, i + (k - 1) / 2\}\] for \((k - 1) / 2 + 1 \leq i \leq n - (k - 1) / 2\). The symmetric \(k\) nearest neighbor smoother is thus a running mean of the \(y\)-values when sorted according to the \(x\)-values. There are a couple of possibilities for handling the boundaries, one being simply not to define a value of \(\hat{f}_i\) outside of the interval above.
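
Written out in code, the symmetric smoother for sorted data can be sketched as follows; `run_mean_naive` is an illustrative name, and a much faster implementation is developed in Section 3.1.2.

```
# Symmetric k nearest neighbor smoother for 'y' sorted by the x-values (k odd).
# Boundary positions where the full window does not fit are left as NA.
run_mean_naive <- function(y, k) {
  n <- length(y)
  m <- (k - 1) / 2
  s <- rep(NA, n)
  for (i in (m + 1):(n - m))
    s[i] <- mean(y[(i - m):(i + m)])
  s
}
```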

With \(\hat{\mathbf{f}}\) denoting the vector of smoothed values by a nearest neighbor smoother we can observe that it is always possible to write \(\hat{\mathbf{f}} = \mathbf{S}\mathbf{y}\) for a matrix \(\mathbf{S}\). For the symmetric nearest neighbor smoother and with data sorted according to the \(x\)-values, the matrix has the following band diagonal form

\[ \mathbf{S} = \left( \begin{array}{cccccccccc} \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & 0 & 0 & \ldots & 0 & 0 \\ 0 & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & 0 & \ldots & 0 & 0\\ 0 & 0 & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ldots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & \ldots & \frac{1}{5} & \frac{1}{5} \\ \end{array} \right) \]

here given for \(k = 5\) and with dimensions \((n - 4) \times n\) due to the undefined boundary values.

### 3.1.1 Linear smoothers

A smoother of the form \(\hat{\mathbf{f}} = \mathbf{S}\mathbf{y}\) for a *smoother matrix* \(\mathbf{S}\),
such as the nearest neighbor smoother, is known as a *linear smoother*.
The linear form is often beneficial for theoretical arguments, and many smoothers
considered in this chapter will be linear smoothers. For computing \(\hat{\mathbf{f}}\)
there may, however, be many alternatives to forming the matrix \(\mathbf{S}\)
and computing the matrix-vector product. Indeed, this is often not the best
way to compute the smoothed values.

It is, on the other hand, useful to see how \(\mathbf{S}\) can be constructed for the symmetric nearest neighbor smoother.

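The code chunk that produced the warning below is not shown here, but judging from the warning message and the explanation that follows, it is essentially the following (for \(n = 147\) and \(k = 11\); treat this as a reconstruction, not necessarily the author's exact code).

```
# Weight vector of length 148 = 11 + 137: eleven weights 1/11 followed by zeros
w <- c(rep(1 / 11, 11), rep(0, 147 - 10))
# Filling the (147 - 10) x 147 matrix row by row recycles 'w'; since length(w)
# exceeds the row length by one, the weights shift one column to the right in
# each new row. R warns about the incomplete recycling.
S <- matrix(w, 147 - 10, 147, byrow = TRUE)
```
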
```
## Warning in matrix(w, 147 - 10, 147, byrow = TRUE): data length [148] is not a
## sub-multiple or multiple of the number of rows [137]
```

The construction above relies on vector recycling of `w` in the construction of `S`, and on the fact that `w` has length \(147 + 1\), which effectively causes `w` to be translated by one to the right every time it is recycled for a new row. As seen, the code triggers a warning by R, but in this case we get what we want.

We can use the matrix to smooth the annual average temperature in Nuuk using a
running mean with a window of \(k = 11\) years. That is, the smoothed temperature
at a given year is the average of the temperatures in the period
from five years before to five years after. Note that to add the smoothed
values to the previous plot, we need to pad the values at the boundaries
with `NA`s to get a vector of length 147.

```
# Check first if data is sorted correctly.
# The test is backwards, but confirms that data isn't unsorted :-)
is.unsorted(Nuuk_year$Year)
```

`## [1] FALSE`

```
f_hat <- c(rep(NA, 5), S %*% Nuuk_year$Temperature, rep(NA, 5))
p_Nuuk + geom_line(aes(y = f_hat), color = "blue")
```

### 3.1.2 Implementing the running mean

The running mean smoother fulfills the following identity \[\hat{f}_{i+1} = \hat{f}_{i} - y_{i - (k-1)/2} / k + y_{i + (k + 1)/2} / k,\] which can be used for a much more efficient implementation than the matrix-vector multiplication. It should be emphasized again that the identity above and the implementation below assume that data is sorted according to \(x\)-values.

```
# The vector 'y' must be sorted according to the x-values
run_mean <- function(y, k) {
  n <- length(y)
  m <- floor((k - 1) / 2)
  k <- 2 * m + 1  # Ensures that k is odd and m = (k - 1) / 2
  y <- y / k
  s <- rep(NA, n)
  s[m + 1] <- sum(y[1:k])
  for (i in (m + 1):(n - m - 1))
    s[i + 1] <- s[i] - y[i - m] + y[i + 1 + m]
  s
}
```

The R function `filter()` (from the stats package) can be used to compute running means and general moving averages using any weight vector. We compare our two implementations to `filter()`.

```
f_hat_filter <- stats::filter(Nuuk_year$Temperature, rep(1/11, 11))
range(f_hat_filter - f_hat, na.rm = TRUE)
range(f_hat_filter - run_mean(Nuuk_year$Temperature, 11), na.rm = TRUE)
```

`## [1] -4.440892e-16 4.440892e-16`

`## [1] -1.332268e-15 4.440892e-16`

Note that `filter()` uses the same boundary convention as `run_mean()`.

A benchmark comparison between matrix-vector multiplication, `run_mean()` and `filter()` gives the following table with median run times in microseconds.

```
## expr median
## 1 S1 %*% y[1:512] 408.7950
## 2 S2 %*% y[1:1024] 1700.9515
## 3 S3 %*% y[1:2048] 6847.7495
## 4 S4 %*% y[1:4096] 27651.4370
## 5 run_mean(y[1:512], k = 11) 81.3415
## 6 run_mean(y[1:1024], k = 11) 149.2005
## 7 run_mean(y[1:2048], k = 11) 296.2385
## 8 run_mean(y[1:4096], k = 11) 621.4225
## 9 stats::filter(y[1:512], rep(1/11, 11)) 120.6280
## 10 stats::filter(y[1:1024], rep(1/11, 11)) 135.5900
## 11 stats::filter(y[1:2048], rep(1/11, 11)) 200.1870
## 12 stats::filter(y[1:4096], rep(1/11, 11)) 325.7340
```

The matrix-vector computation is clearly much slower than the two alternatives,
and the time to construct the \(\mathbf{S}\)-matrix has not even been included
in the benchmark above. There is also a difference in how the matrix-vector
multiplication scales with the size of data compared to the alternatives.
Whenever the data size doubles the run time approximately doubles for
both `filter()` and `run_mean()`, while it quadruples for the matrix-vector
multiplication. This shows the difference between an algorithm
that scales like \(O(n)\) and an algorithm that scales like \(O(n^2)\)
as the matrix-vector product does.

Although `filter()` implements a more general
algorithm than `run_mean()`, it is still faster, which reflects that
it is implemented in C and compiled.

### 3.1.3 Choose \(k\) by cross-validation

Cross-validation relies on predictions of \(y_i\) from \(x_i\)
for data points \((x_i, y_i)\) left out of the data set when the predictor
is fitted to data. Many (linear) smoothers have a natural definition of an “out-of-sample”
prediction, that is, how \(\hat{f}(x)\) is computed for \(x\) not
in the data. If so, it becomes possible to define
\[\hat{f}^{-i}_i = \hat{f}^{-i}(x_i)\]
as the prediction at \(x_i\) using the smoother computed from data
with \((x_i, y_i)\) excluded. However, here we directly *define*
\[\hat{f}^{-i}_i = \sum_{j \neq i} \frac{S_{ij}y_j}{1 - S_{ii}}\]
for any linear smoother. This definition concurs with the
“out-of-sample” predictor in \(x_i\) for most smoothers,
but this has to be verified case-by-case.

The running mean is a little special in this respect. In the
previous section, the
running mean was only considered for odd \(k\) and using
a symmetric neighbor definition. This is convenient
when considering the running mean *in the observations* \(x_i\).
When considering the running mean in any other point, a symmetric
neighbor definition works better with an even \(k\). This
is exactly what the definition of \(\hat{f}^{-i}_i\) above amounts to.
If \(\mathbf{S}\) is the running mean smoother matrix for an odd \(k\),
then \(\hat{f}^{-i}_i\) corresponds to symmetric \((k-1)\)-nearest
neighbor smoothing excluding \((x_i, y_i)\) from the data.

Using the definition above, we get that the *leave-one-out cross-validation*
squared error criterion becomes
\[\mathrm{LOOCV} = \sum_{i=1}^n (y_i - \hat{f}^{-i}_i)^2 =
\sum_{i=1}^n \left(\frac{y_i - \hat{f}_i}{1 - S_{ii}}\right)^2.\]
The important observation from the identity above is that LOOCV
can be computed without actually computing all the \(\hat{f}^{-i}_i\).
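
The second equality follows directly from the definitions: since \(\hat{f}_i = \sum_{j} S_{ij} y_j\),
\[y_i - \hat{f}^{-i}_i
= y_i - \frac{\hat{f}_i - S_{ii}y_i}{1 - S_{ii}}
= \frac{(1 - S_{ii})y_i - \hat{f}_i + S_{ii}y_i}{1 - S_{ii}}
= \frac{y_i - \hat{f}_i}{1 - S_{ii}}.\]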

For the running mean, all diagonal elements of the smoother matrix are identical. We disregard the boundary values (those with the `NA` value), so to get a quantity that is comparable across different choices of \(k\) we use `mean()` instead of `sum()` in the implementation.
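
The `loocv()` function used below is not shown in this excerpt. Based on the description above, a minimal implementation could look as follows; it uses `run_mean()` from Section 3.1.2 and the fact that \(S_{ii} = 1/k\) for the running mean.

```
# LOOCV criterion for the running mean with window size k (assumed odd).
# All diagonal entries of the smoother matrix equal 1/k, and mean() with
# na.rm = TRUE disregards the NA boundary values.
loocv <- function(k, y) {
  f_hat <- run_mean(y, k)  # run_mean() as implemented above
  mean(((y - f_hat) / (1 - 1 / k))^2, na.rm = TRUE)
}
```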

```
k <- seq(3, 40, 2)
CV <- sapply(k, loocv, y = Nuuk_year$Temperature)
k_opt <- k[which.min(CV)]
qplot(k, CV) + geom_line() + geom_vline(xintercept = k_opt, color = "red")
```

The optimal choice of \(k\) is 15, but the LOOCV criterion fluctuates considerably as the neighborhood size changes, and \(k = 9\) as well as \(k = 25\) give rather low values too.

```
p_Nuuk +
  geom_line(aes(y = run_mean(Nuuk_year$Temperature, 9)), color = "red") +
  geom_line(aes(y = run_mean(Nuuk_year$Temperature, k_opt)), color = "blue") +
  geom_line(aes(y = run_mean(Nuuk_year$Temperature, 25)), color = "purple")
```