2.2 Kernel methods

A simple approach to nonparametric density estimation relies on the approximation \[P(X \in (x-h, x+h)) = \int_{x-h}^{x+h} f_0(z) \ dz \simeq f_0(x) 2h,\] which is valid for any continuous density \(f_0\). Inverting this approximation and using the law of large numbers,

\[\begin{align*} f_0(x) & \simeq \frac{1}{2h}P(X \in (x-h, x+h)) \\ & \simeq \frac{1}{2hn} \sum_{j=1}^n 1_{(x-h, x+h)}(x_j) \\ & = \underbrace{\frac{1}{2hn} \sum_{j=1}^n 1_{(-h, h)}(x - x_j)}_{\hat{f}_h(x)} \end{align*}\]

for i.i.d. observations \(x_1, \ldots, x_n\) from the distribution \(f_0 \cdot m\). The function \(\hat{f}_h\) defined as above is an example of a kernel density estimator with a rectangular kernel. We immediately note that \(h\) has to be chosen appropriately. If \(h\) is large, \(\hat{f}_h\) will be flat and close to a constant. If \(h\) is small, \(\hat{f}_h\) will make large jumps close to the observations.

What do we then mean by an “appropriate” choice of \(h\) above? To answer this we must have some prior assumptions about what we expect \(f_0\) to look like. Typically, we expect \(f_0\) to have few oscillations and to be fairly smooth, and we want \(\hat{f}_h\) to reflect that. A too large \(h\) will oversmooth the data relative to \(f_0\) by effectively ignoring the data, while a too small \(h\) will undersmooth the data relative to \(f_0\) by allowing individual data points to have large local effects that make the estimate wiggly. More formally, we can look at the mean and variance of \(\hat{f}_h\). Letting \(p(x, h) = P(X \in (x-h, x+h))\), it follows that \(f_h(x) = E(\hat{f}_h(x)) = p(x, h) / (2h)\) while

\[\begin{equation} V(\hat{f}_h(x)) = \frac{p(x, h) (1 - p(x, h))}{4h^2 n} \simeq f_h(x) \frac{1}{2hn}. \tag{2.1} \end{equation}\]

We see from these computations that for \(\hat{f}_h(x)\) to be approximately unbiased for any \(x\) we need \(h\) to be small – ideally letting \(h \to 0\) since then \(f_h(x) \to f_0(x)\). However, this will make the variance blow up, and to minimize variance we should instead choose \(h\) as large as possible. One way to define “appropriate” is then to strike a balance between the bias and the variance as a function of \(h\) so as to minimize the mean squared error of \(\hat{f}_h(x)\).

We will find the optimal tradeoff for the rectangular kernel in Section 2.3 on bandwidth selection. It’s not difficult, and you are encouraged to try finding it yourself at this point. In this section we will focus on computational aspects of kernel density estimation, but first we will generalize the estimator by allowing for other kernels.

The estimate \(\hat{f}_h(x)\) will be unbiased if \(f_0\) is constantly equal to \(f_0(x)\) in the entire interval \((x-h, x+h)\). This is atypical and can only happen for all \(x\) if \(f_0\) is constant. We expect the typical situation to be that \(f_0\) deviates the most from \(f_0(x)\) close to \(x \pm h\), and that this causes a bias of \(\hat{f}_h(x).\) Observations falling close to \(x + h\), say, should thus count less than observations falling close to \(x\)? The rectangular kernel makes a sharp cut; either a data point is in or it is out. If we use a smooth weighting function instead of a sharp cut, we might be able to include more data points and lower the variance while keeping the bias small. This is precisely the idea of kernel estimators, defined generally as

\[\begin{equation} \hat{f}_h(x) = \frac{1}{hn} \sum_{j=1}^n K\left(\frac{x - x_j}{h}\right) \tag{2.2} \end{equation}\]

for a kernel \(K : \mathbb{R} \to \mathbb{R}\). The parameter \(h > 0\) is known as the bandwidth. Examples of kernels include the uniform or rectangular kernel \[K(x) = \frac{1}{2} 1_{(-1,1)}(x),\] and the Gaussian kernel \[K(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}.\]

One direct benefit of considering other kernels than the rectangular is that \(\hat{f}_h\) inherits all smoothness properties from \(K\). Whereas the rectangular kernel is not even continuous, the Gaussian kernel is \(C^{\infty}\) and so is the resulting kernel density estimate.

2.2.1 Implementation

What should be computed to compute a kernel density estimate? That is, in fact, a good question, because the definition actually just specifies how to evaluate \(\hat{f}_h\) in any given point \(x\), but there is really not anything to compute until we need to evaluate \(\hat{f}_h\). Thus when we implement kernel density estimation we really implement algorithms for evaluating a density estimate in a finite number of points.

Our first implementation is a fairly low-level implementation that returns the evaluation of the density estimate in a given number of equidistant points. The function mimics some of the defaults of density() so that it actually evaluates the estimate in the same points as density().

# This is an implementation of the function 'kern_dens' that computes 
# evaluations of Gaussian kernel density estimates in a grid of points.
# The function has three formal arguments: 'x' is the numeric vector of data 
# points, 'h' is the bandwidth and 'm' is the number of grid points. 
# The default value of 512 is chosen to match the default of 'density()'. 
kern_dens <- function(x, h, m = 512) {
  rg <- range(x)
  # xx is equivalent to grid points in 'density()'
  xx <- seq(rg[1] - 3 * h, rg[2] + 3 * h, length.out = m)
  y <- numeric(m) # The evaluations, initialized as a vector of zeros
  # The actual computation is done using nested for-loops. The outer loop
  # is over the grid points, and the inner loop is over the data points.
  for (i in seq_along(xx))
    for (j in seq_along(x))
      y[i] <- y[i] + exp(- (xx[i] - x[j])^2 / (2 * h^2))  
  y <- y / (sqrt(2 * pi) * h * length(x))
  list(x = xx, y = y)

Note that the function returns a list containing the grid points (x) where the density estimate is evaluated as well as the estimated density evaluations (y). Note also that the argument m above sets the number of grid points, whereas density() uses the argument n for that. The latter can be a bit confusing as \(n\) is often used to denote the number of data points.

We will immediately test if the implementation works as expected – in this case by comparing it to our reference implementation density().

f_hat <- kern_dens(psi, 0.2)
f_hat_dens <- density(psi, 0.2)
plot(f_hat, type = "l", lwd = 4, xlab = "x", ylab = "Density")
lines(f_hat_dens, col = "red", lwd = 2)
plot(f_hat$x, f_hat$y - f_hat_dens$y, 
  type = "l", 
  lwd = 2,
  xlab = "x", 
  ylab = "Difference"
Kernel density estimates with the Gaussian kernel (left) using R's implementation (black) and our implementation (red) together with differences of the estimates (right).Kernel density estimates with the Gaussian kernel (left) using R's implementation (black) and our implementation (red) together with differences of the estimates (right).

Figure 2.3: Kernel density estimates with the Gaussian kernel (left) using R’s implementation (black) and our implementation (red) together with differences of the estimates (right).

Figure 2.3 suggests that the estimates computed by our implementation and by density() are the same when we just visually compare the plotted densities. However, if we look at the differences instead, we see that they are as large as \(4 \times 10^{-4}\) in absolute value. This is way above what we should expect from rounding errors alone when using double precision arithmetic. Thus the two implementations only compute approximately the same, which is, in fact, because density() relies on certain approximations for run time efficiency.

In R we can often beneficially implement computations in a vectorized way instead of using an explicit loop. It is fairly easy to change the implementation to be more vectorized by computing each evaluation in one single line using the sum() function and the fact that exp() and squaring are vectorized.

kern_dens_vec <- function(x, h, m = 512) {
  rg <- range(x)
  xx <- seq(rg[1] - 3 * h, rg[2] + 3 * h, length.out = m)
  y <- numeric(m) 
  # The inner loop from 'kern_dens' has been vectorized, and only the 
  # outer loop over the grid points remains. 
  const <- (sqrt(2 * pi) * h * length(x))
  for (i in seq_along(xx))
      y[i] <- sum(exp(-(xx[i] - x)^2 / (2 * h^2))) / const
  list(x = xx, y = y)

We test this new implementation by comparing it to our previous implementation.

range(kern_dens(psi, 0.2)$y - kern_dens_vec(psi, 0.2)$y)
## [1] -5.551115e-16  3.885781e-16

The magnitude of the differences are of order at most \(10^{-16}\), which is what can be expected due to rounding errors. Thus we conclude that up to rounding errors, kern_dens() and kern_dens_vec() return the same on this data set. This is, of course, not a comprehensive test, but it is an example of one among a number of tests that should be considered.

There are several ways to get completely rid of the explicit loops and write an entirely vectorized implementation in R. One of the solutions will use the sapply() function, which belongs to the family of *apply() functions that apply a function to each element in a vector or a list. In the parlance of functional programming the *apply() functions are variations of the functional, or higher-order-function, known as map.

kern_dens_apply <- function(x, h, m = 512) {
  rg <- range(x)
  xx <- seq(rg[1] - 3 * h, rg[2] + 3 * h, length.out = m)
  const <- sqrt(2 * pi) * h * length(x)
  y <- sapply(xx, function(z) sum(exp(-(z - x)^2 / (2 * h^2))) / const)
  list(x = xx, y = y)

The sapply() call above will apply the function function(z) sum(dnorm(... to every element in the vector xx and return the result as a vector. The function is an example of an anonymous function that does not get a name and exists only during the sapply() evaluation. Instead of sapply() it is possible to use lapply() that returns a list. In fact, sapply() is a simple wrapper around lapply() that attempts to “simplify” the result from a list to an array (and in this case to a vector).

An alternative, and also completely vectorized, solution can be based on the functions outer() and rowMeans().

kern_dens_outer <- function(x, h, m = 512) {
  rg <- range(x)
  xx <- seq(rg[1] - 3 * h, rg[2] + 3 * h, length.out = m)
  y <- outer(xx, x, function(zz, z) exp(-(zz - z)^2 / (2 * h^2)))
  y <- rowMeans(y) / (sqrt(2 * pi) * h)
  list(x = xx, y = y)

The outer() function evaluates the kernel in all combinations of the grid and data points and returns a matrix of dimensions \(m \times n\). The function rowMeans() computes the means of each row and returns a vector of length \(m\).

We should, of course, also remember to test these two last implementations.

range(kern_dens(psi, 0.2)$y - kern_dens_apply(psi, 0.2)$y)
## [1] -5.551115e-16  3.885781e-16
range(kern_dens(psi, 0.2)$y - kern_dens_outer(psi, 0.2)$y)
## [1] -4.996004e-16  3.885781e-16

The natural question is then how to choose between the different implementations? Besides being correct it is important that the code is easy to read and understand. Which of the four implementations above that is best in this respect may depend a lot on the background of the reader. If you strip the implementations for comments, all four are arguably quite readable, but kern_dens() with the double loop might appeal a bit more to people with a background in imperative programming, while kern_dens_apply() might appeal more to people with a preference for functional programming. This functional and vectorized solution is also a bit closer to the mathematical notation with e.g. the sum sign \(\Sigma\) being mapped directly to the sum() function instead of the incremental addition in the for-loop. For these specific implementations these differences are mostly aesthetic nuances and preferences may be more subjective than substantial.

To make a qualified choice between the implementations we should investigate if they differ in terms of run time and memory consumption.

2.2.2 Benchmarking

Benchmarking is about measuring and comparing performance. For software this often means measuring run time and memory usage, though there are clearly many other aspects of software that should be benchmarked in general. This includes user experience, energy consumption and implementation and maintenance time. In this section we focus on benchmarking run time.

The function system.time in R provides a simple way of benchmarking run time measured in seconds.

system.time(kern_dens(psi, 0.2))
system.time(kern_dens_vec(psi, 0.2))
system.time(kern_dens_apply(psi, 0.2))
system.time(kern_dens_outer(psi, 0.2))
## kern_dens:
##    user  system elapsed 
##   0.031   0.000   0.031 
## kern_dens_vec:
##    user  system elapsed 
##   0.003   0.000   0.003 
## kern_dens_apply:
##    user  system elapsed 
##   0.003   0.000   0.004 
## kern_dens_outer:
##    user  system elapsed 
##   0.004   0.000   0.003

The “elapsed” time is the total run time as experienced, while the “user” and “system” times are how long the CPU spent on executing your code and operating system code on behalf of your code, respectively.

From this simple benchmark, kern_dens() is clearly substantially slower than the three other implementations. For more systematic benchmarking of run time, the R package microbenchmark is useful.

kern_bench <- microbenchmark(
  kern_dens(psi, 0.2),
  kern_dens_vec(psi, 0.2),
  kern_dens_apply(psi, 0.2),
  kern_dens_outer(psi, 0.2)

The result stored in kern_bench() is a data frame with two columns. The first contains the R expressions evaluated, and the second is the evaluation time measured in nanoseconds. Each of the four expressions were evaluated 100 times, and the data frame thus has 400 rows. The times argument to microbenchmark() can be used to change the number of evaluations per expression if needed.

It may not be immediately obvious that kern_bench() is a data frame, because printing will automatically summarize the data, but the actual data structure is revealed by the R function str().

## Classes 'microbenchmark' and 'data.frame':	400 obs. of  2 variables:
##  $ expr: Factor w/ 4 levels "kern_dens(psi, 0.2)",..: 3 3 3 4 3 2 1 1 2 2 ...
##  $ time: num  3058121 3670484 3446023 4136851 3379003 ...

A total of 400 evaluations were done for the above benchmark, and microbenchmark()
does the evaluations in a random order by default. Measuring evaluation time on a complex system like a modern computer is an empirical science, and the order of evaluation can potentially affect the results as the conditions for the evaluation change over time. The purpose of the randomization is to avoid that the ordering causes systematically misleading results.

The microbenchmark package implements some methods for summarizing and printing the results such as the following summary table with times in milliseconds.

## Unit: milliseconds
##                       expr   min    lq  mean median    uq  max neval
##        kern_dens(psi, 0.2) 28.62 29.66 30.85  30.20 31.28 41.2   100
##    kern_dens_vec(psi, 0.2)  2.50  2.64  3.43   2.77  3.24 10.9   100
##  kern_dens_apply(psi, 0.2)  2.80  2.96  3.42   3.08  3.41 10.3   100
##  kern_dens_outer(psi, 0.2)  3.12  3.37  4.51   3.61  4.35 10.8   100

The summary table shows some key statistics like median and mean evaluation times but also extremes and upper and lower quartiles. The distributions of run times can be investigated further using the autoplot() function, which is based on ggplot2 and thus easy to modify.

autoplot(kern_bench) + 
  geom_jitter(position = position_jitter(0.2, 0), 
              aes(color = expr), alpha = 0.4) + 
  aes(fill = I("gray")) + 
  theme(legend.position = "none")

This more refined benchmark study does not change our initial impression from using system.time() substantially. The function kern_dens() is notably slower than the three vectorized implementations, but we are now able to more clearly see the minor differences among them. For instance, kern_dens_vec() and kern_dens_apply() have very similar run time distributions, while kern_dens_outer() clearly has a larger median run time and also a run time distribution that is more spread out to the right.

In many cases when we benchmark run time it is of interest to investigate how run time depends on various parameters. This is so for kernel density estimation, where we want to understand how changes in the number of data points, \(n\), and the number of grid points, \(m\), affect run time. We can still use microbenchmark() for running the benchmark experiment, but we will typically process and plot the benchmark data afterwards in a customized way.

Median run times for the four different implementations of kernel density estimation. The dashed gray line is a reference line with slope 1.

Figure 2.4: Median run times for the four different implementations of kernel density estimation. The dashed gray line is a reference line with slope 1.

Figure 2.4 shows median run times for an experiment with 28 combinations of parameters for each of the four different implementations yielding a total of 112 different R expressions being benchmarked. The number of replications for each expression was set to 40. The results confirm that kern_dens() is substantially slower than the vectorized implementations for all combinations of \(n\) and \(m\). However, Figure 2.4 also reveals a new pattern; kern_dens_outer() appears to scale with \(n\) in a slightly different way than the two other vectorized implementations for small \(n\). It is comparable to or even a bit faster than kern_dens_vec() and kern_dens_apply() for very small data sets, while it becomes slower for the larger data sets.

Run time for many algorithms have to a good approximation a dominating power law behavior as a function of typical size parameters, that is, the run time will scale approximately like \(n \mapsto C n^a\) for constants \(C\) and \(a\) and with \(n\) denoting a generic size parameter. Therefore it is beneficial to plot run time using log-log scales and to design benchmark studies with size parameters being equidistant on a log-scale. With approximate power law scaling, the log run time behaves like \[\log(C) + a \log(n),\] that is, on a log-log scale we see approximate straight lines. The slope reveals the exponent \(a\), and two different algorithms for solving the same problem might have different exponents and thus different slopes on the log-log-scale. Two different implementations of the same algorithm should have approximately the same slope but may differ in the constant \(C\) depending upon how efficient the particular implementation is in the particular programming language used. Differences in \(C\) correspond to vertical translations on the log-log scale.

In practice, we will see some deviations from straight lines on the log-log plot for a number of reasons. Writing the run time as \(C n^a + R(n)\), the residual term \(R(n)\) will often be noticeable or even dominating and positive for small \(n\). It is only for large enough \(n\) that the power law term, \(C n^a\), will dominate. In addition, run time can be affected by hardware constraints such as cache and memory sizes, which can cause abrupt jumps in run time.

Using microbenchmark() over system.time() has two main benefits. First, it handles the replication and randomization automatically, which is convenient. Second, it attempts to provide more accurate timings. The latter is mostly important when we benchmark very fast computations.

It can be debated if a median summary of randomly ordered evaluations is the best way to summarize run time. This is due to the way R does memory management. R allocates and deallocates memory automatically and uses garbage collection for the deallocation. This means that computations occasionally, and in a somewhat unpredictable manner, trigger the garbage collector, and as a result a small fraction of the evaluations may take substantially longer time than the rest. The median will typically be almost unaffected, and memory deallocation is thus effectively (and wrongly) disregarded from run time when the median summary is used. This is an argument for using the mean instead of the median, but due to the randomization the computation that triggered the garbage collector might not be the one that caused the memory allocation in the first place. Using the mean instead of the median will therefore smear out the garbage collection run time on all benchmarked expressions. Setting the argument control = list(order = "block") for microbenchmark() will evaluate the expressions in blocks, which in combination with a mean summary more correctly accounts for memory allocation and deallocation in the run time. The downside is that without the randomization the results might suffer from other artefacts. This book will use randomization and median summaries throughout, but we keep in mind that this could underestimate actual average run time depending upon how much memory a given computation requires. Memory usage and how it affects run time by triggering garbage collection will be dealt with via code profiling tools instead.