2 Density estimation

This chapter is on nonparametric density estimation. A classical nonparametric estimator of a density is the histogram, which provides discontinuous and piecewise constant estimates. The focus in this chapter is on some of the alternatives that provide continuous or even smooth estimates instead.

Kernel methods form an important class of smooth density estimators as implemented by the R function density(). These estimators are essentially just locally weighted averages, and their computation is relatively straightforward in theory. In practice, different choices of how to implement the computations can, however, have a big effect on the actual computation time, and the implementation of kernel density estimators will illustrate three points:

  • if possible, choose vectorized implementations in R,
  • if a small loss in accuracy is acceptable, an approximate solution can be orders of magnitudes faster than a literal implementation,
  • the time it takes to numerically evaluate different elementary functions can depend a lot on the function and how you implement the computation.

The first point is emphasized because it results in implementations that are short, expressive and easier to understand just as much as it typically results in computationally more efficient implementations. Note also that not every computation can be vectorized in a beneficial way, and one should never go through hoops to vectorize a computation.

Kernel methods rely on one or more regularization parameters that must be selected to achieve the right balance of adapting to data without adapting too much to the random variation in the data. Choosing the right amount of regularization is just as important as choosing the method to use in the first place. It may, in fact, be more important. We actually do not have a complete implementation of a nonparametric estimator until we have implemented a data driven and automatic way of choosing the amount of regularization. Implementing only the computations for evaluating a kernel estimator, say, and leaving it completely to the user to choose the bandwidth is a job half done. Methods and implementations for choosing the bandwidth are therefore treated in some detail in this chapter.

In the final section a likelihood analysis is carried out. This is done to further clarify why regularized estimators are needed to avoid overfitting to the data, and why there is in general no nonparametric maximum likelihood estimator of a density. Regularization of the likelihood can be achieved by constraining the density estimates to belong to a family of increasingly flexible parametric densities that are fitted to data. This is known as the method of sieves. Another approach is based on basis expansions, but in either case, automatic selection of the amount of regularization is just as important as for kernel methods.

2.1 Univariate density estimation

Recall the data on \(\phi\)- and \(\psi\)-angles in polypeptide backbone structures, as considered in Section 1.1.1.

(ref:phipsiDenscap2) Histograms equipped with a rug plot of the distribution of \(\phi\)-angles (left) and \(\psi\)-angles (right) of the peptide planes in the protein human protein 1HMP.


Figure 2.1: (ref:phipsiDenscap2)

We will in this section treat methods for smooth density estimation for univariate data such as data on either the \(\phi\)- or the \(\psi\)-angle.

We let \(f_0\) denote the unknown density that we want to estimate. That is, we imagine that the data points \(x_1, \ldots, x_n\) are all observations drawn from the probability measure with density \(f_0\) w.r.t. Lebesgue measure on \(\mathbb{R}\).

Suppose first that \(f_0\) belongs to a parametrized statistical model \((f_{\theta})_{\theta}\), where \(f_{\theta}\) is a density w.r.t. Lebesgue measure on \(\mathbb{R}\). If \(\hat{\theta}\) is an estimate of the parameter, \(f_{\hat{\theta}}\) is an estimate of the unknown density \(f_0\). For a parametric family we can always try to use the MLE \[\hat{\theta} = \text{arg max}_{\theta} \sum_{j=1}^n \log f_{\theta}(x_j)\] as an estimate of \(\theta\). Likewise, we might compute the empirical mean and variance for the data and plug those numbers into the density for the Gaussian distribution, and in this way obtain a Gaussian density estimate of \(f_0\).

psi_mean <- mean(psi)
psi_sd <- sd(psi)
hist(psi, prob = TRUE)
curve(dnorm(x, psi_mean, psi_sd), add = TRUE, col = "red")
Gaussian density (red) fitted to the $\psi$-angles.

Figure 2.2: Gaussian density (red) fitted to the \(\psi\)-angles.

As Figure 2.2 shows, if we fit a Gaussian distribution to the \(\psi\)-angle data we get a density estimate that clearly does not match the histogram. The Gaussian density matches the data on the first and second moments, but the histogram shows a clear bimodality that the Gaussian distribution by definition cannot match. Thus we need a more flexible parametric model than the Gaussian if we want to fit a density to this data set.

In nonparametric density estimating we want to estimate the target density, \(f_0\), without assuming that it belongs to a particular parametrized family of densities.

2.2 Kernel methods

A simple approach to nonparametric density estimation relies on the approximation \[P(X \in (x-h, x+h)) = \int_{x-h}^{x+h} f_0(z) \ dz \simeq f_0(x) 2h,\] which is valid for any continuous density \(f_0\). Inverting this approximation and using the law of large numbers,

\[\begin{align*} f_0(x) & \simeq \frac{1}{2h}P(X \in (x-h, x+h)) \\ & \simeq \frac{1}{2hn} \sum_{j=1}^n 1_{(x-h, x+h)}(x_j) \\ & = \underbrace{\frac{1}{2hn} \sum_{j=1}^n 1_{(-h, h)}(x - x_j)}_{\hat{f}_h(x)} \end{align*}\]

for i.i.d. observations \(x_1, \ldots, x_n\) from the distribution \(f_0 \cdot m\). The function \(\hat{f}_h\) defined as above is an example of a kernel density estimator with a rectangular kernel. We immediately note that \(h\) has to be chosen appropriately. If \(h\) is large, \(\hat{f}_h\) will be flat and close to a constant. If \(h\) is small, \(\hat{f}_h\) will make large jumps close to the observations.

What do we then mean by an “appropriate” choice of \(h\) above? To answer this we must have some prior assumptions about what we expect \(f_0\) to look like. Typically, we expect \(f_0\) to have few oscillations and to be fairly smooth, and we want \(\hat{f}_h\) to reflect that. A too large \(h\) will oversmooth the data relative to \(f_0\) by effectively ignoring the data, while a too small \(h\) will undersmooth the data relative to \(f_0\) by allowing individual data points to have large local effects that make the estimate wiggly. More formally, we can look at the mean and variance of \(\hat{f}_h\). Letting \(p(x, h) = P(X \in (x-h, x+h))\), it follows that \(f_h(x) = E(\hat{f}_h(x)) = p(x, h) / (2h)\) while

\[\begin{equation} V(\hat{f}_h(x)) = \frac{p(x, h) (1 - p(x, h))}{4h^2 n} \simeq f_h(x) \frac{1}{2hn}. \tag{2.1} \end{equation}\]

We see from these computations that for \(\hat{f}_h(x)\) to be approximately unbiased for any \(x\) we need \(h\) to be small – ideally letting \(h \to 0\) since then \(f_h(x) \to f_0(x)\). However, this will make the variance blow up, and to minimize variance we should instead choose \(h\) as large as possible. One way to define “appropriate” is then to strike a balance between the bias and the variance as a function of \(h\) so as to minimize the mean squared error of \(\hat{f}_h(x)\).

We will find the optimal tradeoff for the rectangular kernel in Section 2.3 on bandwidth selection. It’s not difficult, and you are encouraged to try finding it yourself at this point. In this section we will focus on computational aspects of kernel density estimation, but first we will generalize the estimator by allowing for other kernels.

The estimate \(\hat{f}_h(x)\) will be unbiased if \(f_0\) is constantly equal to \(f_0(x)\) in the entire interval \((x-h, x+h)\). This is atypical and can only happen for all \(x\) if \(f_0\) is constant. We expect the typical situation to be that \(f_0\) deviates the most from \(f_0(x)\) close to \(x \pm h\), and that this causes a bias of \(\hat{f}_h(x).\) Observations falling close to \(x + h\), say, should thus count less than observations falling close to \(x\)? The rectangular kernel makes a sharp cut; either a data point is in or it is out. If we use a smooth weighting function instead of a sharp cut, we might be able to include more data points and lower the variance while keeping the bias small. This is precisely the idea of kernel estimators, defined generally as

\[\begin{equation} \hat{f}_h(x) = \frac{1}{hn} \sum_{j=1}^n K\left(\frac{x - x_j}{h}\right) \tag{2.2} \end{equation}\]

for a kernel \(K : \mathbb{R} \to \mathbb{R}\). The parameter \(h > 0\) is known as the bandwidth. Examples of kernels include the uniform or rectangular kernel \[K(x) = \frac{1}{2} 1_{(-1,1)}(x),\] and the Gaussian kernel \[K(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}.\]

One direct benefit of considering other kernels than the rectangular is that \(\hat{f}_h\) inherits all smoothness properties from \(K\). Whereas the rectangular kernel is not even continuous, the Gaussian kernel is \(C^{\infty}\) and so is the resulting kernel density estimate.

2.2.1 Implementation

What should be computed to compute a kernel density estimate? That is, in fact, a good question, because the definition actually just specifies how to evaluate \(\hat{f}_h\) in any given point \(x\), but there is really not anything to compute until we need to evaluate \(\hat{f}_h\). Thus when we implement kernel density estimation we really implement algorithms for evaluating a density estimate in a finite number of points.

Our first implementation is a fairly low-level implementation that returns the evaluation of the density estimate in a given number of equidistant points. The function mimics some of the defaults of density() so that it actually evaluates the estimate in the same points as density().

# This is an implementation of the function 'kern_dens' that computes 
# evaluations of Gaussian kernel density estimates in a grid of points.
# The function has three formal arguments: 'x' is the numeric vector of data 
# points, 'h' is the bandwidth and 'm' is the number of grid points. 
# The default value of 512 is chosen to match the default of 'density()'. 
kern_dens <- function(x, h, m = 512) {
  rg <- range(x)
  # xx is equivalent to grid points in 'density()'
  xx <- seq(rg[1] - 3 * h, rg[2] + 3 * h, length.out = m)
  y <- numeric(m) # The evaluations, initialized as a vector of zeros
  # The actual computation is done using nested for-loops. The outer loop
  # is over the grid points, and the inner loop is over the data points.
  for (i in seq_along(xx))
    for (j in seq_along(x))
      y[i] <- y[i] + exp(- (xx[i] - x[j])^2 / (2 * h^2))  
  y <- y / (sqrt(2 * pi) * h * length(x))
  list(x = xx, y = y)

Note that the function returns a list containing the grid points (x) where the density estimate is evaluated as well as the estimated density evaluations (y). Note also that the argument m above sets the number of grid points, whereas density() uses the argument n for that. The latter can be a bit confusing as \(n\) is often used to denote the number of data points.

We will immediately test if the implementation works as expected – in this case by comparing it to our reference implementation density().

f_hat <- kern_dens(psi, 0.2)
f_hat_dens <- density(psi, 0.2)
plot(f_hat, type = "l", lwd = 4, xlab = "x", ylab = "Density")
lines(f_hat_dens, col = "red", lwd = 2)
plot(f_hat$x, f_hat$y - f_hat_dens$y, 
  type = "l", 
  lwd = 2,
  xlab = "x", 
  ylab = "Difference"
Kernel density estimates with the Gaussian kernel (left) using R's implementation (black) and our implementation (red) together with differences of the estimates (right).Kernel density estimates with the Gaussian kernel (left) using R's implementation (black) and our implementation (red) together with differences of the estimates (right).

Figure 2.3: Kernel density estimates with the Gaussian kernel (left) using R’s implementation (black) and our implementation (red) together with differences of the estimates (right).

Figure 2.3 suggests that the estimates computed by our implementation and by density() are the same when we just visually compare the plotted densities. However, if we look at the differences instead, we see that they are as large as \(4 \times 10^{-4}\) in absolute value. This is way above what we should expect from rounding errors alone when using double precision arithmetic. Thus the two implementations only compute approximately the same, which is, in fact, because density() relies on certain approximations for run time efficiency.

In R we can often beneficially implement computations in a vectorized way instead of using an explicit loop. It is fairly easy to change the implementation to be more vectorized by computing each evaluation in one single line using the sum() function and the fact that exp() and squaring are vectorized.

kern_dens_vec <- function(x, h, m = 512) {
  rg <- range(x)
  xx <- seq(rg[1] - 3 * h, rg[2] + 3 * h, length.out = m)
  y <- numeric(m) 
  # The inner loop from 'kern_dens' has been vectorized, and only the 
  # outer loop over the grid points remains. 
  const <- (sqrt(2 * pi) * h * length(x))
  for (i in seq_along(xx))
      y[i] <- sum(exp(-(xx[i] - x)^2 / (2 * h^2))) / const
  list(x = xx, y = y)

We test this new implementation by comparing it to our previous implementation.

range(kern_dens(psi, 0.2)$y - kern_dens_vec(psi, 0.2)$y)
## [1] 0 0

The magnitude of the differences are of order at most \(10^{-16}\), which is what can be expected due to rounding errors. Thus we conclude that up to rounding errors, kern_dens() and kern_dens_vec() return the same on this data set. This is, of course, not a comprehensive test, but it is an example of one among a number of tests that should be considered.

There are several ways to get completely rid of the explicit loops and write an entirely vectorized implementation in R. One of the solutions will use the sapply() function, which belongs to the family of *apply() functions that apply a function to each element in a vector or a list. In the parlance of functional programming the *apply() functions are variations of the functional, or higher-order-function, known as map.

kern_dens_apply <- function(x, h, m = 512) {
  rg <- range(x)
  xx <- seq(rg[1] - 3 * h, rg[2] + 3 * h, length.out = m)
  const <- sqrt(2 * pi) * h * length(x)
  y <- sapply(xx, function(z) sum(exp(-(z - x)^2 / (2 * h^2))) / const)
  list(x = xx, y = y)

The sapply() call above will apply the function function(z) sum(dnorm(... to every element in the vector xx and return the result as a vector. The function is an example of an anonymous function that does not get a name and exists only during the sapply() evaluation. Instead of sapply() it is possible to use lapply() that returns a list. In fact, sapply() is a simple wrapper around lapply() that attempts to “simplify” the result from a list to an array (and in this case to a vector).

An alternative, and also completely vectorized, solution can be based on the functions outer() and rowMeans().

kern_dens_outer <- function(x, h, m = 512) {
  rg <- range(x)
  xx <- seq(rg[1] - 3 * h, rg[2] + 3 * h, length.out = m)
  y <- outer(xx, x, function(zz, z) exp(-(zz - z)^2 / (2 * h^2)))
  y <- rowMeans(y) / (sqrt(2 * pi) * h)
  list(x = xx, y = y)

The outer() function evaluates the kernel in all combinations of the grid and data points and returns a matrix of dimensions \(m \times n\). The function rowMeans() computes the means of each row and returns a vector of length \(m\).

We should, of course, also remember to test these two last implementations.

range(kern_dens(psi, 0.2)$y - kern_dens_apply(psi, 0.2)$y)
## [1] 0 0
range(kern_dens(psi, 0.2)$y - kern_dens_outer(psi, 0.2)$y)
## [1] -5.551e-17  5.551e-17

The natural question is then how to choose between the different implementations? Besides being correct it is important that the code is easy to read and understand. Which of the four implementations above that is best in this respect may depend a lot on the background of the reader. If you strip the implementations for comments, all four are arguably quite readable, but kern_dens() with the double loop might appeal a bit more to people with a background in imperative programming, while kern_dens_apply() might appeal more to people with a preference for functional programming. This functional and vectorized solution is also a bit closer to the mathematical notation with e.g. the sum sign \(\Sigma\) being mapped directly to the sum() function instead of the incremental addition in the for-loop. For these specific implementations these differences are mostly aesthetic nuances and preferences may be more subjective than substantial.

To make a qualified choice between the implementations we should investigate if they differ in terms of run time and memory consumption.

2.2.2 Benchmarking

Benchmarking is about measuring and comparing performance. For software this often means measuring run time and memory usage, though there are clearly many other aspects of software that should be benchmarked in general. This includes user experience, energy consumption and implementation and maintenance time. In this section we focus on benchmarking run time.

The function system.time in R provides a simple way of benchmarking run time measured in seconds.

system.time(kern_dens(psi, 0.2))
system.time(kern_dens_vec(psi, 0.2))
system.time(kern_dens_apply(psi, 0.2))
system.time(kern_dens_outer(psi, 0.2))
## kern_dens:
##    user  system elapsed 
##   0.016   0.000   0.016 
## kern_dens_vec:
##    user  system elapsed 
##   0.002   0.000   0.001 
## kern_dens_apply:
##    user  system elapsed 
##   0.029   0.014   0.042 
## kern_dens_outer:
##    user  system elapsed 
##   0.020   0.042   0.090

The “elapsed” time is the total run time as experienced, while the “user” and “system” times are how long the CPU spent on executing your code and operating system code on behalf of your code, respectively.

From this simple benchmark, kern_dens() is clearly substantially slower than the three other implementations. For more systematic benchmarking of run time, the R package microbenchmark is useful.

kern_bench <- microbenchmark(
  kern_dens(psi, 0.2),
  kern_dens_vec(psi, 0.2),
  kern_dens_apply(psi, 0.2),
  kern_dens_outer(psi, 0.2)
## Warning in microbenchmark(kern_dens(psi, 0.2), kern_dens_vec(psi, 0.2), : less
## accurate nanosecond times to avoid potential integer overflows

The result stored in kern_bench() is a data frame with two columns. The first contains the R expressions evaluated, and the second is the evaluation time measured in nanoseconds. Each of the four expressions were evaluated 100 times, and the data frame thus has 400 rows. The times argument to microbenchmark() can be used to change the number of evaluations per expression if needed.

It may not be immediately obvious that kern_bench() is a data frame, because printing will automatically summarize the data, but the actual data structure is revealed by the R function str().

## Classes 'microbenchmark' and 'data.frame':   400 obs. of  2 variables:
##  $ expr: Factor w/ 4 levels "kern_dens(psi, 0.2)",..: 3 3 3 4 3 2 1 1 2 2 ...
##  $ time: num  1416960 1365136 1377026 1470793 1432868 ...

A total of 400 evaluations were done for the above benchmark, and microbenchmark()
does the evaluations in a random order by default. Measuring evaluation time on a complex system like a modern computer is an empirical science, and the order of evaluation can potentially affect the results as the conditions for the evaluation change over time. The purpose of the randomization is to avoid that the ordering causes systematically misleading results.

The microbenchmark package implements some methods for summarizing and printing the results such as the following summary table with times in milliseconds.

## Unit: milliseconds
##                       expr   min    lq  mean median    uq    max neval
##        kern_dens(psi, 0.2) 14.80 15.32 16.70  15.47 15.76  96.66   100
##    kern_dens_vec(psi, 0.2)  1.10  1.19  1.57   1.25  1.35   7.58   100
##  kern_dens_apply(psi, 0.2)  1.23  1.33  1.46   1.39  1.48   6.59   100
##  kern_dens_outer(psi, 0.2)  1.10  1.34  3.39   1.49  1.65 118.89   100

The summary table shows some key statistics like median and mean evaluation times but also extremes and upper and lower quartiles. The distributions of run times can be investigated further using the autoplot() function, which is based on ggplot2 and thus easy to modify.

autoplot(kern_bench) + 
  geom_jitter(position = position_jitter(0.2, 0), 
              aes(color = expr), alpha = 0.4) + 
  aes(fill = I("gray")) + 
  theme(legend.position = "none")

This more refined benchmark study does not change our initial impression from using system.time() substantially. The function kern_dens() is notably slower than the three vectorized implementations, but we are now able to more clearly see the minor differences among them. For instance, kern_dens_vec() and kern_dens_apply() have very similar run time distributions, while kern_dens_outer() clearly has a larger median run time and also a run time distribution that is more spread out to the right.

In many cases when we benchmark run time it is of interest to investigate how run time depends on various parameters. This is so for kernel density estimation, where we want to understand how changes in the number of data points, \(n\), and the number of grid points, \(m\), affect run time. We can still use microbenchmark() for running the benchmark experiment, but we will typically process and plot the benchmark data afterwards in a customized way.

Median run times for the four different implementations of kernel density estimation. The dashed gray line is a reference line with slope 1.

Figure 2.4: Median run times for the four different implementations of kernel density estimation. The dashed gray line is a reference line with slope 1.

Figure 2.4 shows median run times for an experiment with 28 combinations of parameters for each of the four different implementations yielding a total of 112 different R expressions being benchmarked. The number of replications for each expression was set to 40. The results confirm that kern_dens() is substantially slower than the vectorized implementations for all combinations of \(n\) and \(m\). However, Figure 2.4 also reveals a new pattern; kern_dens_outer() appears to scale with \(n\) in a slightly different way than the two other vectorized implementations for small \(n\). It is comparable to or even a bit faster than kern_dens_vec() and kern_dens_apply() for very small data sets, while it becomes slower for the larger data sets.

Run time for many algorithms have to a good approximation a dominating power law behavior as a function of typical size parameters, that is, the run time will scale approximately like \(n \mapsto C n^a\) for constants \(C\) and \(a\) and with \(n\) denoting a generic size parameter. Therefore it is beneficial to plot run time using log-log scales and to design benchmark studies with size parameters being equidistant on a log-scale. With approximate power law scaling, the log run time behaves like \[\log(C) + a \log(n),\] that is, on a log-log scale we see approximate straight lines. The slope reveals the exponent \(a\), and two different algorithms for solving the same problem might have different exponents and thus different slopes on the log-log-scale. Two different implementations of the same algorithm should have approximately the same slope but may differ in the constant \(C\) depending upon how efficient the particular implementation is in the particular programming language used. Differences in \(C\) correspond to vertical translations on the log-log scale.

In practice, we will see some deviations from straight lines on the log-log plot for a number of reasons. Writing the run time as \(C n^a + R(n)\), the residual term \(R(n)\) will often be noticeable or even dominating and positive for small \(n\). It is only for large enough \(n\) that the power law term, \(C n^a\), will dominate. In addition, run time can be affected by hardware constraints such as cache and memory sizes, which can cause abrupt jumps in run time.

Using microbenchmark() over system.time() has two main benefits. First, it handles the replication and randomization automatically, which is convenient. Second, it attempts to provide more accurate timings. The latter is mostly important when we benchmark very fast computations.

It can be debated if a median summary of randomly ordered evaluations is the best way to summarize run time. This is due to the way R does memory management. R allocates and deallocates memory automatically and uses garbage collection for the deallocation. This means that computations occasionally, and in a somewhat unpredictable manner, trigger the garbage collector, and as a result a small fraction of the evaluations may take substantially longer time than the rest. The median will typically be almost unaffected, and memory deallocation is thus effectively (and wrongly) disregarded from run time when the median summary is used. This is an argument for using the mean instead of the median, but due to the randomization the computation that triggered the garbage collector might not be the one that caused the memory allocation in the first place. Using the mean instead of the median will therefore smear out the garbage collection run time on all benchmarked expressions. Setting the argument control = list(order = "block") for microbenchmark() will evaluate the expressions in blocks, which in combination with a mean summary more correctly accounts for memory allocation and deallocation in the run time. The downside is that without the randomization the results might suffer from other artefacts. This book will use randomization and median summaries throughout, but we keep in mind that this could underestimate actual average run time depending upon how much memory a given computation requires. Memory usage and how it affects run time by triggering garbage collection will be dealt with via code profiling tools instead.

2.3 Bandwidth selection

2.3.1 Revisiting the rectangular kernel

We return to the rectangular kernel and compute the mean squared error. In the analysis it may be helpful to think about \(n\) large and \(h\) small. Indeed, we will eventually choose \(h = h_n\) as a function of \(n\) such that as \(n \to \infty\) we have \(h_n \to 0\). We should also note the \(f_h(x) = E (\hat{f}_h(x))\) is a density, thus \(\int f_h(x) \mathrm{d}x = 1\).

We will assume that \(f_0\) is sufficiently differentiable and use a Taylor expansion of the distribution function \(F_0\) to get that

\[\begin{align*} f_h(x) & = \frac{1}{2h}\left(F_0(x + h) - F_0(x - h)\right) \\ & = \frac{1}{2h}\left(2h f_0(x) + \frac{h^3}{3} f_0''(x) + R_0(x,h) \right) \\ & = f_0(x) + \frac{h^2}{6} f_0''(x) + R_1(x,h) \end{align*}\]

where \(R_1(x, h) = o(h^2)\). One should note how the quadratic terms in \(h\) in the Taylor expansion canceled. This gives the following formula for the squared bias of \(\hat{f}_h\).

\[\begin{align*} \mathrm{bias}(\hat{f}_h(x))^2 & = (f_h(x) - f_0(x))^2 \\ & = \left(\frac{h^2}{6} f_0''(x) + R_1(x,h) \right)^2 \\ & = \frac{h^4}{36} f_0''(x)^2 + R(x,h) \end{align*}\]

where \(R(x,h) = o(h^4)\). For the variance we see from (2.1) that \[V(\hat{f}_h(x)) = f_h(x)\frac{1}{2hn} - f_h(x)^2 \frac{1}{n}.\] Integrating the sum of the bias and the variance over \(x\) gives the integrated mean squared error

\[\begin{align*} \mathrm{MISE}(h) & = \int \mathrm{bias}(\hat{f}_h(x))^2 + V(\hat{f}_h(x)) \mathrm{d}x \\ & = \frac{h^4}{36} \|f_0''\|^2_2 + \frac{1}{2hn} + \int R(x,h) \mathrm{d} x - \frac{1}{n} \int f_h(x)^2 \mathrm{d} x. \end{align*}\]

If \(f_h(x) \leq C\) (which happens if \(f_0\) is bounded), \[\int f_h(x)^2 \mathrm{d} x \leq C \int f_h(x) \mathrm{d} x = C,\] and the last term is \(o((nh)^{-1})\). The second last term is \(o(h^4)\) if we can interchange the limit and integration order. It is conceivable that we can do so under suitable assumptions on \(f_0\), but we will not pursue those at this place. The sum of the two remaining and asymptotically dominating terms in the formula for MISE is
\[\mathrm{AMISE}(h) = \frac{h^4}{36} \|f_0''\|^2_2 + \frac{1}{2hn},\] which is known as the asymptotic mean integrated squared error. Clearly, for this to be a useful formula, we must assume \(\|f_0''\|_2^2 < \infty\). In this case the formula for AMISE can be used to find the asymptotic optimal tradeoff between (integrated) bias and variance. Differentiating w.r.t. \(h\) we find that \[\mathrm{AMISE}'(h) = \frac{h^3}{9} \|f_0''\|^2_2 - \frac{1}{2h^2n},\] and solving for \(\mathrm{AMISE}'(h) = 0\) yields \[h_n = \left(\frac{9}{2 \|f_0''\|_2^2}\right)^{1/5} n^{-1/5}.\] When AMISE is regarded as a function of \(h\) we observe that it tends to \(\infty\) for \(h \to 0\) as well as for \(h \to \infty\), thus the unique stationary point \(h_n\) is a unique global minimizer. Choosing the bandwidth to be \(h_n\) will therefore minimize the asympototic mean integrated squared error, and it is in this sense an optimal choice of bandwidth.

We see how “wiggliness” of \(f_0\) enters into the formula for the optimal bandwidth \(h_n\) via \(\|f_0''\|_2\). This norm of the second derivative is precisely a quantification of how much \(f_0\) oscillates. A large value, indicating a wiggly \(f_0\), will drive the optimal bandwidth down whereas a small value will drive the optimal bandwidth up.

We should also observe that if we plug the optimal bandwidth into the formula for AMISE, we get \[\begin{align*} \mathrm{AMISE}(h_n) & = \frac{h_n^4}{36} \|f_0''\|^2_2 + \frac{1}{2h_n n} \\ & = C n^{-4/5}, \end{align*}\] which indicates that in terms of integrated mean squared error the rate at which we can nonparametrically estimate \(f_0\) is \(n^{-4/5}\). This should be contrasted to the common parametric rate of \(n^{-1}\) for mean squared error.

From a practical viewpoint there is one major problem with the optimal bandwidth \(h_n\); it depends via \(\|f_0''\|^2_2\) upon the unknown \(f_0\) that we are trying to estimate. We therefore refer to \(h_n\) as an oracle bandwidth – it is the bandwidth that an oracle that knows \(f_0\) would tell us to use. In practice, we will have to come up with an estimate of \(\|f_0''\|^2_2\) and plug that estimate into the formula for \(h_n\). We pursue a couple of different options for doing so for general kernel density estimators below together with methods that do not rely on the AMISE formula.

2.3.2 ISE, MISE and MSE for kernel estimators

Bandwidth selection for general kernel estimators can be studied asymptotically just as above. To this end it is useful to formalize how we quantify the quality of an estimate \(\hat{f}_h\). One natural quantification is the integrated squared error, \[\mathrm{ISE}(\hat{f}_h) = \int (\hat{f}_h(x) - f_0(x))^2 \ \mathrm{d}x = \|\hat{f}_h - f_0\|_2^2.\]

The quality of the estimation procedure producing \(\hat{f}_h\) from data can then be quantified by taking the mean ISE, \[\mathrm{MISE}(h) = E(\mathrm{ISE}(\hat{f}_h)),\] where the expectation integral is over the data. Using Tonelli’s theorem we may interchange the expectation and the integration over \(x\) to get \[\mathrm{MISE}(h) = \int \mathrm{MSE}_x(h) \ \mathrm{d}x\] where \[\mathrm{MSE}_h(x) = \mathrm{var}(\hat{f}_h(x)) + \mathrm{bias}(\hat{f}_h(x))^2.\] is the pointwise mean squared error.

Using the same kind of Taylor expansion argument as above we can show that if \(K\) is a square integrable probability density with mean 0 and \[\sigma_K^2 = \int z^2 K(z) \ \mathrm{d}z > 0,\] then \[\mathrm{MISE}(h) = \mathrm{AMISE}(h) + o((nh)^{-1} + h^4)\] where the asymptotic mean integrated squared error is \[\mathrm{AMISE}(h) = \frac{\|K\|_2^2}{nh} + \frac{h^4 \sigma^4_K \|f_0''\|_2^2}{4}\] with \[\|g\|_2^2 = \int g(z)^2 \ \mathrm{d}z \quad (\mathrm{squared } \ L_2\mathrm{-norm}).\] Some regularity assumptions on \(f_0\) are necessary, and from the result we clearly need to require that \(f_0''\) is meaningful and square integrable. However, that is also enough. See Proposition A.1 in Tsybakov (2009) for a rigorous proof.

By minimizing \(\mathrm{AMISE}(h)\) we derive the optimal oracle bandwidth

\[\begin{equation} \tag{2.3} h_n = \left( \frac{\|K\|_2^2}{ \|f_0''\|^2_2 \sigma_K^4} \right)^{1/5} n^{-1/5}. \end{equation}\]

If we plug this formula into the formula for AMISE we arrive at the asymptotic error rate \(\mathrm{AMISE}(h_n) = C n^{-4/5}\) with a constant \(C\) depending on \(f_0''\) and the kernel. It is noteworthy that the asymptotic analysis can be carried out even if \(K\) is allowed to take negative values, though the resulting estimate may not be a valid density as it is. Tsybakov (2009) demonstrates how to improve on the rate \(n^{-4/5}\) by allowing for kernels whose moments of order two or above vanish. Necessarily, such kernels must take negative values.

We observe that for the rectangular kernel, \[\sigma_K^4 = \left(\frac{1}{2} \int_{-1}^1 z^2 \ \mathrm{d} z\right)^2 = \frac{1}{9}\] and \[\|K\|_2^2 = \frac{1}{2^2} \int_{-1}^1 \ \mathrm{d} z = \frac{1}{2}.\] Plugging these numbers into (2.3) we find the oracle bandwidth for the rectangular kernel as derived in Section 2.3.1. For the Gaussian kernel we find that \(\sigma_K^4 = 1\), while \[\|K\|_2^2 = \frac{1}{2 \pi} \int e^{-x^2} \ \mathrm{d} x = \frac{1}{2 \sqrt{\pi}}.\]

2.3.3 Plug-in estimation of the oracle bandwidth

To compute \(\|f_0''\|^2_2\) that enters into the formula for the asymptotically optimal bandwidth we have to know \(f_0\) that we are trying to estimate in the first place. To resolve the circularity we will make a first guess of what \(f_0\) is and plug that guess into the formula for the oracle bandwidth.

Our first guess is that \(f_0\) is Gaussian with mean 0 and variance \(\sigma^2\). Then

\[\begin{align*} \|f_0''\|^2_2 & = \frac{1}{2 \pi \sigma^2} \int \left(\frac{x^2}{\sigma^4} - \frac{1}{\sigma^2}\right)^2 e^{-x^2/\sigma^2} \ \mathrm{d}x \\ & = \frac{1}{2 \sigma^9 \sqrt{\pi}} \frac{1}{\sqrt{\pi \sigma^2}} \int (x^4 - 2 \sigma^2 x^2 + \sigma^4) e^{-x^2/\sigma^2} \ \mathrm{d}x \\ & = \frac{1}{2 \sigma^9 \sqrt{\pi}} (\frac{3}{4} \sigma^4 - \sigma^4 + \sigma^4) \\ & = \frac{3}{8 \sigma^5 \sqrt{\pi}}. \end{align*}\]

Plugging this expression for the squared 2-norm of the second derivative of the density into the formula for the oracle bandwidth gives

\[\begin{equation} \tag{2.4} h_n = \left( \frac{8 \sqrt{\pi} \|K\|_2^2}{3 \sigma_K^4} \right)^{1/5} \sigma n^{-1/5}, \end{equation}\]

with \(\sigma\) the only quantity depending on the unknown density \(f_0\). We can now simply estimate \(\sigma\), using e.g. the empirical standard deviation \(\hat{\sigma}\), and plug this estimate into the formula above to get an estimate of the oracle bandwidth.

It is well known that the empirical standard deviation is sensitive to outliers, and if \(\sigma\) is overestimated for that reason, the bandwidth will be too large and the resulting estimate will be oversmoothed. To get more robust bandwidth selection, Silverman (1986) suggested using the interquartile range to estimate \(\sigma\). In fact, he suggested estimating \(\sigma\) by \[\tilde{\sigma} = \min\{\hat{\sigma}, \mathrm{IQR} / 1.34\}.\] In this estimator, IQR denotes the empirical interquartile range, and \(1.34\) is approximately the interquartile range, \(\Phi^{-1}(0.75) - \Phi^{-1}(0.25)\), of the standard Gaussian distribution. Curiously, the interquartile range for the standard Gaussian distribution is \(1.35\) to two decimals accuracy, but the use of \(1.34\) in the estimator \(\tilde{\sigma}\) has prevailed. Silverman, moreover, suggested to reduce the kernel-dependent constant in the formula (2.4) for \(h_n\) to further reduce oversmoothing.

If we specialize to the Gaussian kernel, formula (2.4) simplifies to \[\hat{h}_n = \left(\frac{4}{3}\right)^{1/5} \tilde{\sigma} n^{-1/5},\] with \(\tilde{\sigma}\) plugged in as a robust estimate of \(\sigma\). Silverman (1986) made the ad hoc suggestion to reduce the factor \((4/3)^{1/5} \simeq 1.06\) to \(0.9\). This results in the bandwidth estimate \[\hat{h}_n = 0.9 \tilde{\sigma} n^{-1/5},\] which has become known as Silverman’s rule of thumb.

Silverman’s rule of thumb is the default bandwidth estimator for density() as implemented by the function bw.nrd0(). The bw.nrd() function implements bandwidth selection using the factor \(1.06\) instead of \(0.9\). Though the theoretical derivations behind these implementations assume that \(f_0\) is Gaussian, either will give reasonable bandwidth selection for a range of unimodal distributions. If \(f_0\) is multimodal, Silverman’s rule of thumb is known to oversmooth the density estimate.

Instead of computing \(\|f_0''\|^2_2\) assuming that the distribution is Gaussian, we can compute the norm for a pilot estimate, \(\tilde{f}\), and plug the result into the formula for \(h_n\). If the pilot estimate is a kernel estimate with kernel \(H\) and bandwidth \(r\) we get \[\|\tilde{f}''\|^2_2 = \frac{1}{n^2r^6} \sum_{i = 1}^n \sum_{j=1}^n \int H''\left( \frac{x - x_i}{r} \right) H''\left( \frac{x - x_j}{r} \right) \mathrm{d} x.\] The problem is, of course, that now we have to choose the pilot bandwidth \(r\). But doing so using a simple method like Silverman’s rule of thumb at this stage is typically not too bad an idea. Thus we arrive at the following plug-in procedure using the Gaussian kernel for the pilot estimate:

  • Compute an estimate, \(\hat{r}\), of the pilot bandwidth using Silverman’s rule of thumb.
  • Compute \(\|\tilde{f}''\|^2_2\) using the Gaussian kernel as pilot kernel \(H\) and using the estimated pilot bandwidth \(\hat{r}\).
  • Plug \(\|\tilde{f}''\|^2_2\) into the oracle bandwidth formula (2.3) to compute \(\hat{h}_n\) for the kernel \(K\).

Note that to use a pilot kernel different from the Gaussian we have to adjust the constant 0.9 (or 1.06) in Silverman’s rule of thumb accordingly by computing \(\|H\|_2^2\) and \(\sigma_H^4\) and using (2.4).

Sheather and Jones (1991) took these plug-in ideas a step further and analyzed in detail how to choose the pilot bandwidth in a good and data adaptive way. The resulting method is somewhat complicated but implementable. We skip the details but simply observe that their method is implemented in R in the function bw.SJ(), and it can be selected when using density() by setting the argument bw = "SJ". This plug-in method is regarded as a solid default that performs well for many different data generating densities \(f_0\).

2.3.4 Cross-validation

An alternative to relying on the asymptotic optimality arguments for integrated mean squared error and the corresponding plug-in estimates of the bandwidth is known as cross-validation. The method mimics the idea of setting aside a subset of the data set, which is then not used for computing an estimate but only for validating the estimator’s performance. The benefit of cross-validation over simply setting aside a validation data set is that we do not “waste” any of the data points on validation only. All data points are used for the ultimate computation of the estimate. The deficit is that cross-validation is usually computationally more demanding.

Suppose that \(I_1, \ldots, I_k\) form a partition of the index set \(\{1, \ldots, n\}\) and define \[I^{-i} = \bigcup_{l: i \not \in I_l} I_l.\] That is, \(I^{-i}\) contains all indices but those that belong to the set \(I_l\) containing \(i\). In particular, \(i \not \in I^{-i}\). Define also \(n_i = |I^{-i}|\) and \[\hat{f}^{-i}_h = \frac{1}{h n_i} \sum_{j \in I^{-i}} K\left(\frac{x_i - x_j}{h}\right).\] That is, \(\hat{f}^{-i}_h\) is the kernel density estimate based on data with indices in \(I^{-i}\) and evaluated in \(x_i\). Since the density estimate evaluated in \(x_i\) is not based on \(x_i\), the quantity \(\hat{f}^{-i}_h\) can be used to assess how well the density estimate computed using a bandwidth \(h\) concur with the data point \(x_i\). This can be summarized using the log-likelihood \[\ell_{\mathrm{CV}}(h) = \sum_{i=1}^n \log (\hat{f}^{-i}_h),\] that we will refer to as the cross-validated log-likelihood, and we define the bandwidth estimate as \[\hat{h}_{\mathrm{CV}} = \textrm{arg max}_h \ \ \ell_{\mathrm{CV}}(h).\] This cross-validation based bandwidth can then be used for computing kernel density estimates using the entire data set.

If the partition of indices consists of \(k\) subsets we usually talk about \(k\)-fold cross-validation. If \(k = n\) so that all subsets consist of just a single index we talk about leave-one-out cross-validation. For leave-one-out cross-validation there is only one possible partition, while for \(k < n\) there are many possible partitions. Which should be chosen then? In practice, we choose the partition by sampling indices randomly without replacement into \(k\) sets of size roughly \(n / k\).

It is also possible to use cross-validation in combination with MISE. Rewriting we find that \[\mathrm{MISE}(h) = E (\| \hat{f}_h\|_2^2) - 2 E (\hat{f}_h(X)) + E(\|f_0^2\|_2^2)\] for \(X\) a random variable independent of the data and with distribution having density \(f_0\). The last term does not depend upon \(h\) and we can ignore it from the point of view of minimizing MISE. For the first term we have an unbiased estimate in \(\| \hat{f}_h\|_2^2\). The middle term can be estimated without bias by \[\frac{2}{n} \sum_{i=1}^n \hat{f}^{-i}_h,\] and this leads to the statistic \[\mathrm{UCV}(h) = \| \hat{f}_h\|_2^2 - \frac{2}{n} \sum_{i=1}^n \hat{f}^{-i}_h\] known as the unbiased cross-validation criterion. The corresponding bandwidth estimate is \[\hat{h}_{\mathrm{UCV}} = \textrm{arg min}_h \ \ \mathrm{UCV}(h).\] Contrary to the log-likelihood based criterion, this criterion requires the computation of \(\| \hat{f}_h\|_2^2\). Bandwidth selection using UCV is implemented in R in the function bw.ucv() and can be used with density() by setting the argument bw = "ucv".

2.4 Likelihood considerations

In Section 2.3.4 the cross-validated likelihood was used for bandwidth selection. It is natural to ask why we did not go all-in and simply maximized the likelihood over all possible densities to find a maximum likelihood estimator instead of using the ad hoc idea behind kernel density estimation.

The log-likelihood \[\ell(f) = \sum_{i=j}^n \log f(x_j)\] is well defined as a function of the density \(f\) – even when \(f\) is not restricted to belong to a finite-dimensional parametric model. To investigate if a nonparametric MLE is meaningful we consider how the likelihood behaves for the densities \[\overline{f}_h(x) = \frac{1}{nh \sqrt{2 \pi}} \sum_{j=1}^n e^{- \frac{(x - x_j)^2}{2 h^2} }\] for different choices of \(h\). Note that \(\overline{f}_h\) is simply the kernel density estimator with the Gaussian kernel and bandwidth \(h\).

## The densities can easily be implemented using the density implementation
## of the Gaussian density in R
f_h <- function(x, h) mean(dnorm(x, psi, h))
## This function does not work as a vectorized function as it is, but there is 
## a convenience function, 'Vectorize', in R that turns the function into a 
## function that can actually be applied correctly to a vector. 
f_h <- Vectorize(f_h)

Figure 2.5 shows what some of these densities look like compared to the histogram of the \(\psi\)-angle data. Clearly, for large \(h\) these densities are smooth and slowly oscillating, while as \(h\) gets smaller the densities become more and more wiggly. As \(h \to 0\) the densities become increasingly dominated by tall narrow peaks around the individual data points.

The densities \(\overline{f}_h\) for different choices of \(h\).The densities \(\overline{f}_h\) for different choices of \(h\).The densities \(\overline{f}_h\) for different choices of \(h\).The densities \(\overline{f}_h\) for different choices of \(h\).The densities \(\overline{f}_h\) for different choices of \(h\).The densities \(\overline{f}_h\) for different choices of \(h\).

Figure 2.5: The densities \(\overline{f}_h\) for different choices of \(h\).

The way that these densities adapt to the data points is reflected in the log-likelihood as well.

# To plot the log-likelihood we need to evaluate it in a grid of h-values.
hseq <- seq(1, 0.001, -0.001)
# For the following computation of the log-likelihood it is necessary 
# that f_h is vectorized. There are other ways to implement this computation
# in R, and some are more efficient, but the computation of the log-likelihood 
# for each h scales as O(n^2) with n the number of data points. 
ll <- sapply(hseq, function(h) sum(log(f_h(psi, h))))
ggplot(mapping = aes(hseq, ll)) + geom_line() + 
  xlab("h") + ylab("Log-likelihood")
ggplot(mapping = aes(hseq, ll)) + geom_line() + 
  scale_x_log10("h") + ylab("")
Log-likelihood, $\ell(\overline{f}_h)$, for the densities $\overline{f}_h$ as a function of $h$. Note the log-scale on the right.Log-likelihood, $\ell(\overline{f}_h)$, for the densities $\overline{f}_h$ as a function of $h$. Note the log-scale on the right.

Figure 2.6: Log-likelihood, \(\ell(\overline{f}_h)\), for the densities \(\overline{f}_h\) as a function of \(h\). Note the log-scale on the right.

From Figure 2.6 it is clear that the likelihood is decreasing in \(h\), and it appears that it is unbounded as \(h \to 0\). This is most clearly seen on the figure when \(h\) is plotted on the log-scale because then it appears that the log-likelihood approximately behaves as \(-\log(h)\) for \(h \to 0\).

We can show that that is, indeed, the case. If \(x_i \neq x_j\) when \(i \neq j\)

\[\begin{align*} \ell(\overline{f}_h) & = \sum_{i} \log\left(1 + \sum_{j \neq i} e^{-(x_i - x_j)^2 / (2 h^2)} \right) - n \log(nh\sqrt{2 \pi}) \\ & \sim - n \log(nh\sqrt{2 \pi}) \end{align*}\]

for \(h \to 0\). Hence, \(\ell(\overline{f}_h) \to \infty\) for \(h \to 0\). This demonstrates that the MLE of the density does not exist in the set of all distributions with densities.

In the sense of weak convergence it actually holds that \[\overline{f}_h \cdot m \overset{\mathrm{wk}}{\longrightarrow} \varepsilon_n = \frac{1}{n} \sum_{j=1}^n \delta_{x_j}\] for \(h \to 0\). The empirical measure \(\varepsilon_n\) can sensibly be regarded as the nonparametric MLE of the distribution, but the empirical measure does not have a density. We conclude that we cannot directly define a sensible density estimator as a maximum-likelihood estimator.

2.4.1 Method of sieves

A sieve is a family of models, \(\Theta_{h}\), indexed by a real valued parameter, \(h \in \mathbb{R}\), such that \(\Theta_{h_1} \subseteq \Theta_{h_2}\) for \(h_1 \leq h_2\). In this chapter \(\Theta_{h}\) will denote a set of probability densities. If the increasing family of models is chosen in a sensible way, we may be able to compute the MLE \[\hat{f}_h = \text{arg max}_{f \in \Theta_h} \ell(f),\] and we may even be able to choose \(h = h_n\) as a function of the sample size \(n\) such that \(\hat{f}_{h_n}\) becomes a consistent estimator of \(f_0\).

It is possible to take \[\Theta_h = \{ \overline{f}_{h'} \mid h' \leq h \}\] with \(\overline{f}_{h'}\) as defined above, in which case \(\hat{f}_h = \overline{f}_h\). We will see in the following section that this is simply a kernel estimator.

A more interesting example is obtained by letting \[\Theta_h = \left\{ x \mapsto \frac{1}{h \sqrt{2 \pi}} \int e^{- \frac{(x - z)^2}{2 h^2} } \mathrm{d}\mu(z) \Biggm| \mu \textrm{ a probability measure} \right\},\] which is known as the convolution sieve. We note that \(\overline{f}_h \in \Theta_h\) by taking \(\mu = \varepsilon_n\), but generally \(\hat{f}_h\) will be different from \(\overline{f}_h\).

We will not pursue the general theory of sieve estimators, but refer to the paper Nonparametric Maximum Likelihood Estimation by the Method of Sieves by Geman and Hwang. In the following section we will work out some more practical details for a particular sieve estimator based on a basis expansion of the log-density.

2.4.2 Basis expansions

We suppose in this section that the data points are all contained in the interval \([a,b]\) for \(a, b \in \mathbb{R}\). This is true for the angle data with \(a = -\pi\) and \(b = \pi\) no matter the size of the data set, but if \(f_0\) does not have a bounded support it may be necessary to let \(a\) and \(b\) change with the data. However, for any fixed data set we can choose some sufficiently large \(a\) and \(b\).

In this section the sieve will be indexed by integers, and for \(h \in \mathbb{N}\) we suppose that we have chosen continuous functions \(b_1, \ldots, b_h : [a,b] \to \mathbb{R}\). These will be called basis functions. We then define \[\Theta_h = \left\{ x \mapsto \varphi(\boldsymbol{\beta})^{-1} \exp\left(\sum_{k=1}^h \beta_k b_k(x)\right) \Biggm| \boldsymbol{\beta} = (\beta_1, \ldots, \beta_h)^T \in \mathbb{R}^h \right\},\] where \[\varphi(\boldsymbol{\beta}) = \int_a^b \exp\left(\sum_{k=1}^h \beta_k b_k(x)\right) \mathrm{d} x.\]

The MLE over \(\Theta_h\) is then given as \[\hat{f}_h(x) = \varphi(\hat{\boldsymbol{\beta}})^{-1} \exp\left(\sum_{k=1}^h \hat{\beta}_k b_k(x)\right),\] where

\[\begin{align*} \hat{\boldsymbol{\beta}} & = \text{arg max}_{\boldsymbol{\beta} \in \mathbb{R}^h} \ \sum_{j=1}^n \sum_{k=1}^h \beta_k b_k(x_j) - \log \varphi(\boldsymbol{\beta}) \\ & = \text{arg max}_{\boldsymbol{\beta} \in \mathbb{R}^h} \ \ \mathbf{1}^T \mathbf{B} \boldsymbol{\beta} - \log \varphi(\boldsymbol{\beta}). \end{align*}\]

Here \(\mathbf{B}\) is the \(n \times h\) matrix with \(B_{jk} = b_k(x_j)\). Thus for any fixed \(h\) the model is, in fact, just an ordinary parametric exponential family, though it may not be entirely straightforward how to compute \(\varphi(\boldsymbol{\beta})\).

Many basis functions are possible. Polynomials may be used, but splines are often preferred. An alternative is a selection of trigonometric functions, for instance
\[b_1(x) = \cos(x), b_2(x) = \sin(x), \ldots, b_{2h-1}(x) = \cos(hx), b_{2h}(x) = \sin(hx)\] on the interval \([-\pi, \pi]\). In Section 1.2.1 a simple special case was actually treated corresponding to \(h = 2\), where the normalization constant was identified in terms of a modified Bessel function.

It is worth remembering the following:

“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”

— John von Neumann

The Normal-inverse Gaussian distribution has four parameters and the generalized hyperbolic distribution is an extension with five, but von Neumann was probably thinking more in terms of a spline or a polynomial expansion as above with four or five suitably chosen basis functions.

The quote is not a mathematical statement but an empirical observation. With a handful of parameters you already have a quite flexible class of densities that will fit many real data sets well. But remember that a reasonably good fit does not mean that you have found the “true” data generating model. Though data is in some situations not as scarce a resource today as when von Neumann made elephants wiggle their trunks, the quote still suggests that \(h\) should grow rather slowly with \(n\) to avoid overfitting.

2.5 Exercises

Kernel density estimation

Exercise 2.1 The Epanechnikov kernel is given by \[ K(x) = \frac{3}{4}(1 - x^2) \] for \(x \in [-1, 1]\) and 0 elsewhere. Show that this is a probability density with mean zero and compute \(\sigma_K^2\) as well as \(\|K\|_2^2\).

For the following exercises use the log(F12) variable as considered in Exercise A.4.

Exercise 2.2 Use density() to compute the kernel density estimate with the Epanechnikov kernel to the log(F12) data. Try different bandwidths.

Exercise 2.3 Implement kernel density estimation yourself using the Epanechnikov kernel. Test your implementation by comparing it to density() using the log(F12) data.


Exercise 2.4 Construct the following vector

x <- rnorm(2^13)

Then use microbenchmark to benchmark the computation of

density(x[1:k], 0.2)

for k ranging from \(2^5\) to \(2^{13}\). Summarize the benchmarking results.

Exercise 2.5 Benchmark your own implementation of kernel density estimation using the Epanechnikov kernel. Compare the results to those obtained for density().

Exercise 2.6 Experiment with different implementations of kernel evaluation in R using the Gaussian kernel and the Epanechnikov kernel. Use microbenchmark() to compare the different implementations.