## 2.2 Kernel methods

A simple approach to nonparametric density estimation relies on the approximation \[P(X \in (x-h, x+h)) = \int_{x-h}^{x+h} f_0(z) \ dz \simeq f_0(x) 2h,\] which is valid for any continuous density \(f_0\). Inverting this approximation and using the law of large numbers,

\[\begin{align*} f_0(x) & \simeq \frac{1}{2h}P(X \in (x-h, x+h)) \\ & \simeq \frac{1}{2hn} \sum_{j=1}^n 1_{(x-h, x+h)}(x_j) \\ & = \underbrace{\frac{1}{2hn} \sum_{j=1}^n 1_{(-h, h)}(x - x_j)}_{\hat{f}_h(x)} \end{align*}\]

for i.i.d. observations \(x_1, \ldots, x_n\) from the distribution \(f_0 \cdot m\). The function \(\hat{f}_h\) defined as above is an example of a kernel density estimator with a rectangular kernel. We immediately note that \(h\) has to be chosen appropriately. If \(h\) is large, \(\hat{f}_h\) will be flat and close to a constant. If \(h\) is small, \(\hat{f}_h\) will make large jumps close to the observations.

What do we then mean by an “appropriate” choice of \(h\) above? To answer this we must have some prior assumptions about what we expect \(f_0\) to look like. Typically, we expect \(f_0\) to have few oscillations and to be fairly smooth, and we want \(\hat{f}_h\) to reflect that. A too large \(h\) will oversmooth the data relative to \(f_0\) by effectively ignoring the data, while a too small \(h\) will undersmooth the data relative to \(f_0\) by allowing individual data points to have large local effects that make the estimate wiggly. More formally, we can look at the mean and variance of \(\hat{f}_h\). Letting \(p(x, h) = P(X \in (x-h, x+h))\), it follows that \(f_h(x) = E(\hat{f}_h(x)) = p(x, h) / (2h)\) while

\[\begin{equation} V(\hat{f}_h(x)) = \frac{p(x, h) (1 - p(x, h))}{4h^2 n} \simeq f_h(x) \frac{1}{2hn}. \tag{2.1} \end{equation}\]

We see from these computations that for \(\hat{f}_h(x)\) to be approximately unbiased for any \(x\) we need \(h\) to be small – ideally letting \(h \to 0\) since then \(f_h(x) \to f_0(x)\). However, this will make the variance blow up, and to minimize variance we should instead choose \(h\) as large as possible. One way to define “appropriate” is then to strike a balance between the bias and the variance as a function of \(h\) so as to minimize the mean squared error of \(\hat{f}_h(x)\).

We will find the optimal tradeoff for the rectangular kernel in Section 2.3 on bandwidth selection. It’s not difficult, and you are encouraged to try finding it yourself at this point. In this section we will focus on computational aspects of kernel density estimation, but first we will generalize the estimator by allowing for other kernels.

The estimate \(\hat{f}_h(x)\) will be unbiased if \(f_0\) is constantly equal to \(f_0(x)\)
in the entire interval \((x-h, x+h)\). This is atypical and can only happen for all \(x\)
if \(f_0\) is constant. We expect the typical situation to be that
\(f_0\) deviates the most from \(f_0(x)\) close to \(x \pm h\), and
that this causes a bias of \(\hat{f}_h(x).\)
Observations falling close to \(x + h\), say, should thus count less
than observations falling close to \(x\)? The rectangular kernel makes a sharp
cut; either a data point is in or it is out. If we use a smooth weighting
function instead of a sharp cut, we might be able to include more
data points and lower the variance while keeping the bias small. This is
precisely the idea of *kernel estimators*, defined generally as

\[\begin{equation} \hat{f}_h(x) = \frac{1}{hn} \sum_{j=1}^n K\left(\frac{x - x_j}{h}\right) \tag{2.2} \end{equation}\]

for a kernel \(K : \mathbb{R} \to \mathbb{R}\). The parameter \(h > 0\) is known
as the *bandwidth*. Examples of kernels include the *uniform* or *rectangular kernel*
\[K(x) = \frac{1}{2} 1_{(-1,1)}(x),\]
and the *Gaussian kernel*
\[K(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}.\]

One direct benefit of considering other kernels than the rectangular is that \(\hat{f}_h\) inherits all smoothness properties from \(K\). Whereas the rectangular kernel is not even continuous, the Gaussian kernel is \(C^{\infty}\) and so is the resulting kernel density estimate.

### 2.2.1 Implementation

What should be computed to compute a kernel density estimate? That is, in fact, a good question, because the definition actually just specifies how to evaluate \(\hat{f}_h\) in any given point \(x\), but there is really not anything to compute until we need to evaluate \(\hat{f}_h\). Thus when we implement kernel density estimation we really implement algorithms for evaluating a density estimate in a finite number of points.

Our first implementation is a fairly low-level implementation that returns
the evaluation of the density estimate in a given number of equidistant
points. The function mimics some of the defaults of `density()`

so that it
actually evaluates the estimate in the same points as `density()`

.

```
# This is an implementation of the function 'kern_dens' that computes
# evaluations of Gaussian kernel density estimates in a grid of points.
#
# The function has three formal arguments: 'x' is the numeric vector of data
# points, 'h' is the bandwidth and 'm' is the number of grid points.
# The default value of 512 is chosen to match the default of 'density()'.
kern_dens <- function(x, h, m = 512) {
rg <- range(x)
# xx is equivalent to grid points in 'density()'
xx <- seq(rg[1] - 3 * h, rg[2] + 3 * h, length.out = m)
y <- numeric(m) # The evaluations, initialized as a vector of zeros
# The actual computation is done using nested for-loops. The outer loop
# is over the grid points, and the inner loop is over the data points.
for (i in seq_along(xx))
for (j in seq_along(x))
y[i] <- y[i] + exp(- (xx[i] - x[j])^2 / (2 * h^2))
y <- y / (sqrt(2 * pi) * h * length(x))
list(x = xx, y = y)
}
```

Note that the function returns a list containing the grid points (`x`

) where
the density estimate is evaluated as well as the estimated density
evaluations (`y`

). Note also that the argument `m`

above sets the number of
grid points, whereas `density()`

uses the argument `n`

for that. The
latter can be a bit confusing as \(n\) is often used to denote the
number of data points.

We will immediately test if the implementation works as expected – in this
case by comparing it to our reference implementation `density()`

.

```
f_hat <- kern_dens(psi, 0.2)
f_hat_dens <- density(psi, 0.2)
plot(f_hat, type = "l", lwd = 4, xlab = "x", ylab = "Density")
lines(f_hat_dens, col = "red", lwd = 2)
plot(f_hat$x, f_hat$y - f_hat_dens$y,
type = "l",
lwd = 2,
xlab = "x",
ylab = "Difference"
)
```

Figure 2.3 suggests that the estimates computed by our
implementation and by `density()`

are the same when we just visually compare the
plotted densities. However, if we look at the differences instead,
we see that they are as large as \(4 \times 10^{-4}\) in absolute value. This is
way above what we should expect from rounding errors alone when using
double precision arithmetic. Thus the two implementations only compute
*approximately* the same, which is, in fact, because `density()`

relies
on certain approximations for run time efficiency.

In R we can often beneficially implement computations in a vectorized way
instead of using an explicit loop. It is fairly easy to change the implementation
to be more vectorized by computing each evaluation in one single line using
the `sum()`

function and the fact that `exp()`

and squaring are vectorized.

```
kern_dens_vec <- function(x, h, m = 512) {
rg <- range(x)
xx <- seq(rg[1] - 3 * h, rg[2] + 3 * h, length.out = m)
y <- numeric(m)
# The inner loop from 'kern_dens' has been vectorized, and only the
# outer loop over the grid points remains.
const <- (sqrt(2 * pi) * h * length(x))
for (i in seq_along(xx))
y[i] <- sum(exp(-(xx[i] - x)^2 / (2 * h^2))) / const
list(x = xx, y = y)
}
```

We test this new implementation by comparing it to our previous implementation.

`## [1] -5.551115e-16 3.885781e-16`

The magnitude of the differences are of order at most \(10^{-16}\), which is what
can be expected due to rounding errors. Thus we conclude that up to rounding
errors, `kern_dens()`

and `kern_dens_vec()`

return the same on this data set. This is,
of course, not a comprehensive test, but it is an example of one among a
number of tests that should be considered.

There are several ways to get completely rid of the explicit loops and write
an entirely vectorized implementation in R. One of the solutions will use the
`sapply()`

function, which belongs to the family of `*apply()`

functions
that apply a function to each element in a vector or a list. In the
parlance of functional programming the `*apply()`

functions are variations
of the functional, or higher-order-function, known as map.

```
kern_dens_apply <- function(x, h, m = 512) {
rg <- range(x)
xx <- seq(rg[1] - 3 * h, rg[2] + 3 * h, length.out = m)
const <- sqrt(2 * pi) * h * length(x)
y <- sapply(xx, function(z) sum(exp(-(z - x)^2 / (2 * h^2))) / const)
list(x = xx, y = y)
}
```

The `sapply()`

call above will apply the function `function(z) sum(dnorm(...`

to every element in the vector `xx`

and return the result as a vector. The
function is an example of an *anonymous function* that does not get a name and
exists only during the `sapply()`

evaluation. Instead of `sapply()`

it is possible
to use `lapply()`

that returns a list. In fact, `sapply()`

is a
simple wrapper around `lapply()`

that attempts to “simplify” the result from
a list to an array (and in this case to a vector).

An alternative, and also completely vectorized, solution can be based on
the functions `outer()`

and `rowMeans()`

.

```
kern_dens_outer <- function(x, h, m = 512) {
rg <- range(x)
xx <- seq(rg[1] - 3 * h, rg[2] + 3 * h, length.out = m)
y <- outer(xx, x, function(zz, z) exp(-(zz - z)^2 / (2 * h^2)))
y <- rowMeans(y) / (sqrt(2 * pi) * h)
list(x = xx, y = y)
}
```

The `outer()`

function evaluates the kernel in all combinations
of the grid and data points and returns a matrix of
dimensions \(m \times n\). The function `rowMeans()`

computes
the means of each row and returns a vector of length \(m\).

We should, of course, also remember to test these two last implementations.

`## [1] -5.551115e-16 3.885781e-16`

`## [1] -4.996004e-16 3.885781e-16`

The natural question is then how to choose between the different implementations?
Besides being correct it is important that the code is easy to read
and understand. Which of the four implementations above that is best in
this respect may depend a lot on the background of the reader. If you strip
the implementations for comments, all four are arguably quite readable, but
`kern_dens()`

with the double loop might appeal a bit more to people
with a background in imperative programming, while `kern_dens_apply()`

might
appeal more to people with a preference for functional programming.
This functional and vectorized solution is also
a bit closer to the mathematical notation with
e.g. the sum sign \(\Sigma\) being mapped directly to the `sum()`

function
instead of the incremental addition in the for-loop. For
these specific implementations these differences are mostly
aesthetic nuances and preferences may be more subjective than substantial.

To make a qualified choice between the implementations we should investigate if they differ in terms of run time and memory consumption.

### 2.2.2 Benchmarking

Benchmarking is about measuring and comparing performance. For software this often means measuring run time and memory usage, though there are clearly many other aspects of software that should be benchmarked in general. This includes user experience, energy consumption and implementation and maintenance time. In this section we focus on benchmarking run time.

The function `system.time`

in R provides a simple way of benchmarking run time
measured in seconds.

```
system.time(kern_dens(psi, 0.2))
system.time(kern_dens_vec(psi, 0.2))
system.time(kern_dens_apply(psi, 0.2))
system.time(kern_dens_outer(psi, 0.2))
```

```
## kern_dens:
## user system elapsed
## 0.031 0.000 0.031
## kern_dens_vec:
## user system elapsed
## 0.003 0.000 0.003
## kern_dens_apply:
## user system elapsed
## 0.003 0.000 0.004
## kern_dens_outer:
## user system elapsed
## 0.004 0.000 0.003
```

The “elapsed” time is the total run time as experienced, while the “user” and “system” times are how long the CPU spent on executing your code and operating system code on behalf of your code, respectively.

From this simple benchmark, `kern_dens()`

is clearly substantially slower
than the three other implementations. For more systematic benchmarking of run
time, the R package microbenchmark is useful.

```
kern_bench <- microbenchmark(
kern_dens(psi, 0.2),
kern_dens_vec(psi, 0.2),
kern_dens_apply(psi, 0.2),
kern_dens_outer(psi, 0.2)
)
```

The result stored in `kern_bench()`

is a data frame with two columns. The first
contains the R expressions evaluated, and the second is the evaluation
time measured in nanoseconds. Each of the four expressions were evaluated 100 times,
and the data frame thus has 400 rows. The `times`

argument to `microbenchmark()`

can be used to change the number of evaluations per expression if needed.

It may not be immediately obvious that `kern_bench()`

is a data frame, because printing will automatically summarize the
data, but the actual data structure is revealed by the R function `str()`

.

```
## Classes 'microbenchmark' and 'data.frame': 400 obs. of 2 variables:
## $ expr: Factor w/ 4 levels "kern_dens(psi, 0.2)",..: 3 3 3 4 3 2 1 1 2 2 ...
## $ time: num 3058121 3670484 3446023 4136851 3379003 ...
```

A total of 400 evaluations were done for the above benchmark, and `microbenchmark()`

does the evaluations in a random order by default. Measuring evaluation time on a
complex system like a modern computer is an empirical science, and
the order of evaluation can potentially affect the results as the
conditions for the evaluation change over time. The purpose of the
randomization is to avoid that the ordering causes systematically
misleading results.

The microbenchmark package implements some methods for summarizing and printing the results such as the following summary table with times in milliseconds.

```
## Unit: milliseconds
## expr min lq mean median uq max neval
## kern_dens(psi, 0.2) 28.62 29.66 30.85 30.20 31.28 41.2 100
## kern_dens_vec(psi, 0.2) 2.50 2.64 3.43 2.77 3.24 10.9 100
## kern_dens_apply(psi, 0.2) 2.80 2.96 3.42 3.08 3.41 10.3 100
## kern_dens_outer(psi, 0.2) 3.12 3.37 4.51 3.61 4.35 10.8 100
```

The summary table shows some key statistics like median and mean evaluation
times but also extremes and upper and lower quartiles. The distributions
of run times can be investigated further using the `autoplot()`

function, which is
based on ggplot2 and thus easy to modify.

```
autoplot(kern_bench) +
geom_jitter(position = position_jitter(0.2, 0),
aes(color = expr), alpha = 0.4) +
aes(fill = I("gray")) +
theme(legend.position = "none")
```

This more refined benchmark study does not change our initial impression from
using `system.time()`

substantially. The function `kern_dens()`

is notably
slower than the three vectorized implementations, but we are now able to
more clearly see the minor differences among them. For instance, `kern_dens_vec()`

and `kern_dens_apply()`

have very similar run time distributions, while
`kern_dens_outer()`

clearly has a larger median run time and also a
run time distribution that is more spread out to the right.

In many cases when we benchmark run time it is of interest
to investigate how run time depends on various parameters. This is so
for kernel density estimation, where we want to understand how changes in
the number of data points, \(n\), and the number of grid points, \(m\), affect
run time. We can still use `microbenchmark()`

for running the benchmark
experiment, but we will typically process and plot the benchmark data afterwards
in a customized way.

Figure 2.4 shows median run times for an experiment with
28 combinations of parameters for each of the four different implementations
yielding a total of 112 different R expressions being benchmarked. The number of replications
for each expression was set to 40. The results confirm that `kern_dens()`

is substantially slower than the vectorized implementations for all combinations
of \(n\) and \(m\). However, Figure 2.4 also reveals a new
pattern; `kern_dens_outer()`

appears to scale with \(n\) in a slightly different way
than the two other vectorized implementations for small \(n\). It is comparable to
or even a bit faster than
`kern_dens_vec()`

and `kern_dens_apply()`

for very small data sets, while it becomes
slower for the larger data sets.

Run time for many algorithms have to a good approximation a dominating power law behavior as a function of typical size parameters, that is, the run time will scale approximately like \(n \mapsto C n^a\) for constants \(C\) and \(a\) and with \(n\) denoting a generic size parameter. Therefore it is beneficial to plot run time using log-log scales and to design benchmark studies with size parameters being equidistant on a log-scale. With approximate power law scaling, the log run time behaves like \[\log(C) + a \log(n),\] that is, on a log-log scale we see approximate straight lines. The slope reveals the exponent \(a\), and two different algorithms for solving the same problem might have different exponents and thus different slopes on the log-log-scale. Two different implementations of the same algorithm should have approximately the same slope but may differ in the constant \(C\) depending upon how efficient the particular implementation is in the particular programming language used. Differences in \(C\) correspond to vertical translations on the log-log scale.

In practice, we will see some deviations from straight lines on the log-log plot for a number of reasons. Writing the run time as \(C n^a + R(n)\), the residual term \(R(n)\) will often be noticeable or even dominating and positive for small \(n\). It is only for large enough \(n\) that the power law term, \(C n^a\), will dominate. In addition, run time can be affected by hardware constraints such as cache and memory sizes, which can cause abrupt jumps in run time.

Using `microbenchmark()`

over `system.time()`

has two main benefits. First,
it handles the replication and randomization automatically, which is
convenient. Second, it attempts to provide more accurate timings. The latter
is mostly important when we benchmark very fast computations.

It can be debated if a median summary of randomly ordered evaluations
is the best way to summarize run time. This is due to the way R does memory
management. R allocates and deallocates memory automatically
and uses garbage collection
for the deallocation. This means that computations occasionally, and in a somewhat
unpredictable manner, trigger the garbage collector, and as a result a small
fraction of the evaluations may take substantially longer time than the
rest. The median will typically be almost unaffected, and memory deallocation
is thus effectively (and wrongly) disregarded from run time when the median
summary is used. This is an argument for using the mean instead of the median,
but due to the randomization the computation that triggered
the garbage collector might not be the one that caused the memory allocation
in the first place. Using the mean instead of the median will therefore smear
out the garbage collection run time on all benchmarked expressions. Setting
the argument `control = list(order = "block")`

for `microbenchmark()`

will
evaluate the expressions in blocks, which in combination with a mean
summary more correctly accounts for memory allocation and deallocation in the
run time. The downside is that without the randomization the results might
suffer from other artefacts. This book will use randomization
and median summaries throughout, but we keep in mind that this
could underestimate actual average run time depending upon how much
memory a given computation requires. Memory usage and how it affects
run time by triggering garbage collection will be dealt with via
code profiling tools instead.