# 2 Density estimation

This chapter is on nonparametric density estimation. A classical nonparametric estimator of a density is the histogram, which provides discontinuous and piecewise constant estimates. The focus in this chapter is on some of the alternatives that provide continuous or even smooth estimates instead.

*Kernel methods* form an important class of smooth density estimators as
implemented by the R function `density()`

. These estimators are essentially
just locally weighted averages, and their computation is relatively
straightforward in theory. In practice, different choices of how to implement
the computations can, however, have a big effect on the actual computation time,
and the implementation of kernel density estimators will illustrate three points:

- if possible, choose vectorized implementations in R,
- if a small loss in accuracy is acceptable, an approximate solution can be orders of magnitudes faster than a literal implementation,
- the time it takes to numerically evaluate different elementary functions can depend a lot on the function and how you implement the computation.

The first point is emphasized because it results in implementations that are short, expressive and easier to understand just as much as it typically results in computationally more efficient implementations. Note also that not every computation can be vectorized in a beneficial way, and one should never go through hoops to vectorize a computation.

Kernel methods rely on one or more *regularization parameters* that must be
selected to achieve the right balance of adapting
to data without adapting too much to the random variation in the data.
Choosing the right amount of regularization is just as important as choosing
the method to use in the first place. It may, in fact, be more important.
We actually do not have a complete implementation of a nonparametric estimator
until we have implemented a data driven and automatic way of choosing the
amount of regularization. Implementing only the computations for
evaluating a kernel estimator, say, and leaving it completely
to the user to choose the bandwidth is a job half done. Methods and implementations
for choosing the bandwidth are therefore treated in some detail in this chapter.

In the final section a likelihood analysis is carried out.
This is done to further clarify why regularized estimators are needed
to avoid overfitting to the data, and why there is in general no nonparametric
maximum likelihood estimator of a density. Regularization of the likelihood
can be achieved by constraining the density estimates to belong to a family
of increasingly flexible parametric densities that are fitted to data. This is
known as the *method of sieves*. Another approach is based on basis expansions,
but in either case, automatic selection of the amount of regularization is just
as important as for kernel methods.

## 2.1 Univariate density estimation

Recall the data on \(\phi\)- and \(\psi\)-angles in polypeptide backbone structures, as considered in Section 1.1.1.

(ref:phipsiDenscap2) Histograms equipped with a rug plot of the distribution of \(\phi\)-angles (left) and \(\psi\)-angles (right) of the peptide planes in the protein human protein 1HMP.

We will in this section treat methods for smooth density estimation for univariate data such as data on either the \(\phi\)- or the \(\psi\)-angle.

We let \(f_0\) denote the unknown density that we want to estimate. That is, we imagine that the data points \(x_1, \ldots, x_n\) are all observations drawn from the probability measure with density \(f_0\) w.r.t. Lebesgue measure on \(\mathbb{R}\).

Suppose first that \(f_0\) belongs to a parametrized statistical model \((f_{\theta})_{\theta}\), where \(f_{\theta}\) is a density w.r.t. Lebesgue measure on \(\mathbb{R}\). If \(\hat{\theta}\) is an estimate of the parameter, \(f_{\hat{\theta}}\) is an estimate of the unknown density \(f_0\). For a parametric family we can always try to use the MLE \[\hat{\theta} = \text{arg max}_{\theta} \sum_{j=1}^n \log f_{\theta}(x_j)\] as an estimate of \(\theta\). Likewise, we might compute the empirical mean and variance for the data and plug those numbers into the density for the Gaussian distribution, and in this way obtain a Gaussian density estimate of \(f_0\).

```
psi_mean <- mean(psi)
psi_sd <- sd(psi)
hist(psi, prob = TRUE)
rug(psi)
curve(dnorm(x, psi_mean, psi_sd), add = TRUE, col = "red")
```

As Figure 2.2 shows, if we fit a Gaussian distribution to the \(\psi\)-angle data we get a density estimate that clearly does not match the histogram. The Gaussian density matches the data on the first and second moments, but the histogram shows a clear bimodality that the Gaussian distribution by definition cannot match. Thus we need a more flexible parametric model than the Gaussian if we want to fit a density to this data set.

In nonparametric density estimating we want to estimate the target density, \(f_0\), without assuming that it belongs to a particular parametrized family of densities.

## 2.2 Kernel methods

A simple approach to nonparametric density estimation relies on the approximation \[P(X \in (x-h, x+h)) = \int_{x-h}^{x+h} f_0(z) \ dz \simeq f_0(x) 2h,\] which is valid for any continuous density \(f_0\). Inverting this approximation and using the law of large numbers,

\[\begin{align*} f_0(x) & \simeq \frac{1}{2h}P(X \in (x-h, x+h)) \\ & \simeq \frac{1}{2hn} \sum_{j=1}^n 1_{(x-h, x+h)}(x_j) \\ & = \underbrace{\frac{1}{2hn} \sum_{j=1}^n 1_{(-h, h)}(x - x_j)}_{\hat{f}_h(x)} \end{align*}\]

for i.i.d. observations \(x_1, \ldots, x_n\) from the distribution \(f_0 \cdot m\). The function \(\hat{f}_h\) defined as above is an example of a kernel density estimator with a rectangular kernel. We immediately note that \(h\) has to be chosen appropriately. If \(h\) is large, \(\hat{f}_h\) will be flat and close to a constant. If \(h\) is small, \(\hat{f}_h\) will make large jumps close to the observations.

What do we then mean by an “appropriate” choice of \(h\) above? To answer this we must have some prior assumptions about what we expect \(f_0\) to look like. Typically, we expect \(f_0\) to have few oscillations and to be fairly smooth, and we want \(\hat{f}_h\) to reflect that. A too large \(h\) will oversmooth the data relative to \(f_0\) by effectively ignoring the data, while a too small \(h\) will undersmooth the data relative to \(f_0\) by allowing individual data points to have large local effects that make the estimate wiggly. More formally, we can look at the mean and variance of \(\hat{f}_h\). Letting \(p(x, h) = P(X \in (x-h, x+h))\), it follows that \(f_h(x) = E(\hat{f}_h(x)) = p(x, h) / (2h)\) while

\[\begin{equation} V(\hat{f}_h(x)) = \frac{p(x, h) (1 - p(x, h))}{4h^2 n} \simeq f_h(x) \frac{1}{2hn}. \tag{2.1} \end{equation}\]

We see from these computations that for \(\hat{f}_h(x)\) to be approximately unbiased for any \(x\) we need \(h\) to be small – ideally letting \(h \to 0\) since then \(f_h(x) \to f_0(x)\). However, this will make the variance blow up, and to minimize variance we should instead choose \(h\) as large as possible. One way to define “appropriate” is then to strike a balance between the bias and the variance as a function of \(h\) so as to minimize the mean squared error of \(\hat{f}_h(x)\).

We will find the optimal tradeoff for the rectangular kernel in Section 2.3 on bandwidth selection. It’s not difficult, and you are encouraged to try finding it yourself at this point. In this section we will focus on computational aspects of kernel density estimation, but first we will generalize the estimator by allowing for other kernels.

The estimate \(\hat{f}_h(x)\) will be unbiased if \(f_0\) is constantly equal to \(f_0(x)\)
in the entire interval \((x-h, x+h)\). This is atypical and can only happen for all \(x\)
if \(f_0\) is constant. We expect the typical situation to be that
\(f_0\) deviates the most from \(f_0(x)\) close to \(x \pm h\), and
that this causes a bias of \(\hat{f}_h(x).\)
Observations falling close to \(x + h\), say, should thus count less
than observations falling close to \(x\)? The rectangular kernel makes a sharp
cut; either a data point is in or it is out. If we use a smooth weighting
function instead of a sharp cut, we might be able to include more
data points and lower the variance while keeping the bias small. This is
precisely the idea of *kernel estimators*, defined generally as

\[\begin{equation} \hat{f}_h(x) = \frac{1}{hn} \sum_{j=1}^n K\left(\frac{x - x_j}{h}\right) \tag{2.2} \end{equation}\]

for a kernel \(K : \mathbb{R} \to \mathbb{R}\). The parameter \(h > 0\) is known
as the *bandwidth*. Examples of kernels include the *uniform* or *rectangular kernel*
\[K(x) = \frac{1}{2} 1_{(-1,1)}(x),\]
and the *Gaussian kernel*
\[K(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}.\]

One direct benefit of considering other kernels than the rectangular is that \(\hat{f}_h\) inherits all smoothness properties from \(K\). Whereas the rectangular kernel is not even continuous, the Gaussian kernel is \(C^{\infty}\) and so is the resulting kernel density estimate.

### 2.2.1 Implementation

What should be computed to compute a kernel density estimate? That is, in fact, a good question, because the definition actually just specifies how to evaluate \(\hat{f}_h\) in any given point \(x\), but there is really not anything to compute until we need to evaluate \(\hat{f}_h\). Thus when we implement kernel density estimation we really implement algorithms for evaluating a density estimate in a finite number of points.

Our first implementation is a fairly low-level implementation that returns
the evaluation of the density estimate in a given number of equidistant
points. The function mimics some of the defaults of `density()`

so that it
actually evaluates the estimate in the same points as `density()`

.

```
# This is an implementation of the function 'kern_dens' that computes
# evaluations of Gaussian kernel density estimates in a grid of points.
#
# The function has three formal arguments: 'x' is the numeric vector of data
# points, 'h' is the bandwidth and 'm' is the number of grid points.
# The default value of 512 is chosen to match the default of 'density()'.
kern_dens <- function(x, h, m = 512) {
rg <- range(x)
# xx is equivalent to grid points in 'density()'
xx <- seq(rg[1] - 3 * h, rg[2] + 3 * h, length.out = m)
y <- numeric(m) # The evaluations, initialized as a vector of zeros
# The actual computation is done using nested for-loops. The outer loop
# is over the grid points, and the inner loop is over the data points.
for (i in seq_along(xx))
for (j in seq_along(x))
y[i] <- y[i] + exp(- (xx[i] - x[j])^2 / (2 * h^2))
y <- y / (sqrt(2 * pi) * h * length(x))
list(x = xx, y = y)
}
```

Note that the function returns a list containing the grid points (`x`

) where
the density estimate is evaluated as well as the estimated density
evaluations (`y`

). Note also that the argument `m`

above sets the number of
grid points, whereas `density()`

uses the argument `n`

for that. The
latter can be a bit confusing as \(n\) is often used to denote the
number of data points.

We will immediately test if the implementation works as expected – in this
case by comparing it to our reference implementation `density()`

.

```
f_hat <- kern_dens(psi, 0.2)
f_hat_dens <- density(psi, 0.2)
plot(f_hat, type = "l", lwd = 4, xlab = "x", ylab = "Density")
lines(f_hat_dens, col = "red", lwd = 2)
plot(f_hat$x, f_hat$y - f_hat_dens$y,
type = "l",
lwd = 2,
xlab = "x",
ylab = "Difference"
)
```

Figure 2.3 suggests that the estimates computed by our
implementation and by `density()`

are the same when we just visually compare the
plotted densities. However, if we look at the differences instead,
we see that they are as large as \(4 \times 10^{-4}\) in absolute value. This is
way above what we should expect from rounding errors alone when using
double precision arithmetic. Thus the two implementations only compute
*approximately* the same, which is, in fact, because `density()`

relies
on certain approximations for run time efficiency.

In R we can often beneficially implement computations in a vectorized way
instead of using an explicit loop. It is fairly easy to change the implementation
to be more vectorized by computing each evaluation in one single line using
the `sum()`

function and the fact that `exp()`

and squaring are vectorized.

```
kern_dens_vec <- function(x, h, m = 512) {
rg <- range(x)
xx <- seq(rg[1] - 3 * h, rg[2] + 3 * h, length.out = m)
y <- numeric(m)
# The inner loop from 'kern_dens' has been vectorized, and only the
# outer loop over the grid points remains.
const <- (sqrt(2 * pi) * h * length(x))
for (i in seq_along(xx))
y[i] <- sum(exp(-(xx[i] - x)^2 / (2 * h^2))) / const
list(x = xx, y = y)
}
```

We test this new implementation by comparing it to our previous implementation.

`range(kern_dens(psi, 0.2)$y - kern_dens_vec(psi, 0.2)$y)`

`## [1] 0 0`

The magnitude of the differences are of order at most \(10^{-16}\), which is what
can be expected due to rounding errors. Thus we conclude that up to rounding
errors, `kern_dens()`

and `kern_dens_vec()`

return the same on this data set. This is,
of course, not a comprehensive test, but it is an example of one among a
number of tests that should be considered.

There are several ways to get completely rid of the explicit loops and write
an entirely vectorized implementation in R. One of the solutions will use the
`sapply()`

function, which belongs to the family of `*apply()`

functions
that apply a function to each element in a vector or a list. In the
parlance of functional programming the `*apply()`

functions are variations
of the functional, or higher-order-function, known as map.

```
kern_dens_apply <- function(x, h, m = 512) {
rg <- range(x)
xx <- seq(rg[1] - 3 * h, rg[2] + 3 * h, length.out = m)
const <- sqrt(2 * pi) * h * length(x)
y <- sapply(xx, function(z) sum(exp(-(z - x)^2 / (2 * h^2))) / const)
list(x = xx, y = y)
}
```

The `sapply()`

call above will apply the function `function(z) sum(dnorm(...`

to every element in the vector `xx`

and return the result as a vector. The
function is an example of an *anonymous function* that does not get a name and
exists only during the `sapply()`

evaluation. Instead of `sapply()`

it is possible
to use `lapply()`

that returns a list. In fact, `sapply()`

is a
simple wrapper around `lapply()`

that attempts to “simplify” the result from
a list to an array (and in this case to a vector).

An alternative, and also completely vectorized, solution can be based on
the functions `outer()`

and `rowMeans()`

.

```
kern_dens_outer <- function(x, h, m = 512) {
rg <- range(x)
xx <- seq(rg[1] - 3 * h, rg[2] + 3 * h, length.out = m)
y <- outer(xx, x, function(zz, z) exp(-(zz - z)^2 / (2 * h^2)))
y <- rowMeans(y) / (sqrt(2 * pi) * h)
list(x = xx, y = y)
}
```

The `outer()`

function evaluates the kernel in all combinations
of the grid and data points and returns a matrix of
dimensions \(m \times n\). The function `rowMeans()`

computes
the means of each row and returns a vector of length \(m\).

We should, of course, also remember to test these two last implementations.

`range(kern_dens(psi, 0.2)$y - kern_dens_apply(psi, 0.2)$y)`

`## [1] 0 0`

`range(kern_dens(psi, 0.2)$y - kern_dens_outer(psi, 0.2)$y)`

`## [1] -5.551e-17 5.551e-17`

The natural question is then how to choose between the different implementations?
Besides being correct it is important that the code is easy to read
and understand. Which of the four implementations above that is best in
this respect may depend a lot on the background of the reader. If you strip
the implementations for comments, all four are arguably quite readable, but
`kern_dens()`

with the double loop might appeal a bit more to people
with a background in imperative programming, while `kern_dens_apply()`

might
appeal more to people with a preference for functional programming.
This functional and vectorized solution is also
a bit closer to the mathematical notation with
e.g. the sum sign \(\Sigma\) being mapped directly to the `sum()`

function
instead of the incremental addition in the for-loop. For
these specific implementations these differences are mostly
aesthetic nuances and preferences may be more subjective than substantial.

To make a qualified choice between the implementations we should investigate if they differ in terms of run time and memory consumption.

### 2.2.2 Benchmarking

Benchmarking is about measuring and comparing performance. For software this often means measuring run time and memory usage, though there are clearly many other aspects of software that should be benchmarked in general. This includes user experience, energy consumption and implementation and maintenance time. In this section we focus on benchmarking run time.

The function `system.time`

in R provides a simple way of benchmarking run time
measured in seconds.

```
system.time(kern_dens(psi, 0.2))
system.time(kern_dens_vec(psi, 0.2))
system.time(kern_dens_apply(psi, 0.2))
system.time(kern_dens_outer(psi, 0.2))
```

```
## kern_dens:
## user system elapsed
## 0.016 0.000 0.016
## kern_dens_vec:
## user system elapsed
## 0.002 0.000 0.001
## kern_dens_apply:
## user system elapsed
## 0.029 0.014 0.042
## kern_dens_outer:
## user system elapsed
## 0.020 0.042 0.090
```

The “elapsed” time is the total run time as experienced, while the “user” and “system” times are how long the CPU spent on executing your code and operating system code on behalf of your code, respectively.

From this simple benchmark, `kern_dens()`

is clearly substantially slower
than the three other implementations. For more systematic benchmarking of run
time, the R package microbenchmark is useful.

```
kern_bench <- microbenchmark(
kern_dens(psi, 0.2),
kern_dens_vec(psi, 0.2),
kern_dens_apply(psi, 0.2),
kern_dens_outer(psi, 0.2)
)
```

```
## Warning in microbenchmark(kern_dens(psi, 0.2), kern_dens_vec(psi, 0.2), : less
## accurate nanosecond times to avoid potential integer overflows
```

The result stored in `kern_bench()`

is a data frame with two columns. The first
contains the R expressions evaluated, and the second is the evaluation
time measured in nanoseconds. Each of the four expressions were evaluated 100 times,
and the data frame thus has 400 rows. The `times`

argument to `microbenchmark()`

can be used to change the number of evaluations per expression if needed.

It may not be immediately obvious that `kern_bench()`

is a data frame, because printing will automatically summarize the
data, but the actual data structure is revealed by the R function `str()`

.

`str(kern_bench)`

```
## Classes 'microbenchmark' and 'data.frame': 400 obs. of 2 variables:
## $ expr: Factor w/ 4 levels "kern_dens(psi, 0.2)",..: 3 3 3 4 3 2 1 1 2 2 ...
## $ time: num 1416960 1365136 1377026 1470793 1432868 ...
```

A total of 400 evaluations were done for the above benchmark, and `microbenchmark()`

does the evaluations in a random order by default. Measuring evaluation time on a
complex system like a modern computer is an empirical science, and
the order of evaluation can potentially affect the results as the
conditions for the evaluation change over time. The purpose of the
randomization is to avoid that the ordering causes systematically
misleading results.

The microbenchmark package implements some methods for summarizing and printing the results such as the following summary table with times in milliseconds.

```
## Unit: milliseconds
## expr min lq mean median uq max neval
## kern_dens(psi, 0.2) 14.80 15.32 16.70 15.47 15.76 96.66 100
## kern_dens_vec(psi, 0.2) 1.10 1.19 1.57 1.25 1.35 7.58 100
## kern_dens_apply(psi, 0.2) 1.23 1.33 1.46 1.39 1.48 6.59 100
## kern_dens_outer(psi, 0.2) 1.10 1.34 3.39 1.49 1.65 118.89 100
```

The summary table shows some key statistics like median and mean evaluation
times but also extremes and upper and lower quartiles. The distributions
of run times can be investigated further using the `autoplot()`

function, which is
based on ggplot2 and thus easy to modify.

```
autoplot(kern_bench) +
geom_jitter(position = position_jitter(0.2, 0),
aes(color = expr), alpha = 0.4) +
aes(fill = I("gray")) +
theme(legend.position = "none")
```

This more refined benchmark study does not change our initial impression from
using `system.time()`

substantially. The function `kern_dens()`

is notably
slower than the three vectorized implementations, but we are now able to
more clearly see the minor differences among them. For instance, `kern_dens_vec()`

and `kern_dens_apply()`

have very similar run time distributions, while
`kern_dens_outer()`

clearly has a larger median run time and also a
run time distribution that is more spread out to the right.

In many cases when we benchmark run time it is of interest
to investigate how run time depends on various parameters. This is so
for kernel density estimation, where we want to understand how changes in
the number of data points, \(n\), and the number of grid points, \(m\), affect
run time. We can still use `microbenchmark()`

for running the benchmark
experiment, but we will typically process and plot the benchmark data afterwards
in a customized way.

Figure 2.4 shows median run times for an experiment with
28 combinations of parameters for each of the four different implementations
yielding a total of 112 different R expressions being benchmarked. The number of replications
for each expression was set to 40. The results confirm that `kern_dens()`

is substantially slower than the vectorized implementations for all combinations
of \(n\) and \(m\). However, Figure 2.4 also reveals a new
pattern; `kern_dens_outer()`

appears to scale with \(n\) in a slightly different way
than the two other vectorized implementations for small \(n\). It is comparable to
or even a bit faster than
`kern_dens_vec()`

and `kern_dens_apply()`

for very small data sets, while it becomes
slower for the larger data sets.

Run time for many algorithms have to a good approximation a dominating power law behavior as a function of typical size parameters, that is, the run time will scale approximately like \(n \mapsto C n^a\) for constants \(C\) and \(a\) and with \(n\) denoting a generic size parameter. Therefore it is beneficial to plot run time using log-log scales and to design benchmark studies with size parameters being equidistant on a log-scale. With approximate power law scaling, the log run time behaves like \[\log(C) + a \log(n),\] that is, on a log-log scale we see approximate straight lines. The slope reveals the exponent \(a\), and two different algorithms for solving the same problem might have different exponents and thus different slopes on the log-log-scale. Two different implementations of the same algorithm should have approximately the same slope but may differ in the constant \(C\) depending upon how efficient the particular implementation is in the particular programming language used. Differences in \(C\) correspond to vertical translations on the log-log scale.

In practice, we will see some deviations from straight lines on the log-log plot for a number of reasons. Writing the run time as \(C n^a + R(n)\), the residual term \(R(n)\) will often be noticeable or even dominating and positive for small \(n\). It is only for large enough \(n\) that the power law term, \(C n^a\), will dominate. In addition, run time can be affected by hardware constraints such as cache and memory sizes, which can cause abrupt jumps in run time.

Using `microbenchmark()`

over `system.time()`

has two main benefits. First,
it handles the replication and randomization automatically, which is
convenient. Second, it attempts to provide more accurate timings. The latter
is mostly important when we benchmark very fast computations.

It can be debated if a median summary of randomly ordered evaluations
is the best way to summarize run time. This is due to the way R does memory
management. R allocates and deallocates memory automatically
and uses garbage collection
for the deallocation. This means that computations occasionally, and in a somewhat
unpredictable manner, trigger the garbage collector, and as a result a small
fraction of the evaluations may take substantially longer time than the
rest. The median will typically be almost unaffected, and memory deallocation
is thus effectively (and wrongly) disregarded from run time when the median
summary is used. This is an argument for using the mean instead of the median,
but due to the randomization the computation that triggered
the garbage collector might not be the one that caused the memory allocation
in the first place. Using the mean instead of the median will therefore smear
out the garbage collection run time on all benchmarked expressions. Setting
the argument `control = list(order = "block")`

for `microbenchmark()`

will
evaluate the expressions in blocks, which in combination with a mean
summary more correctly accounts for memory allocation and deallocation in the
run time. The downside is that without the randomization the results might
suffer from other artefacts. This book will use randomization
and median summaries throughout, but we keep in mind that this
could underestimate actual average run time depending upon how much
memory a given computation requires. Memory usage and how it affects
run time by triggering garbage collection will be dealt with via
code profiling tools instead.

## 2.3 Bandwidth selection

### 2.3.1 Revisiting the rectangular kernel

We return to the rectangular kernel and compute the mean squared error. In the analysis it may be helpful to think about \(n\) large and \(h\) small. Indeed, we will eventually choose \(h = h_n\) as a function of \(n\) such that as \(n \to \infty\) we have \(h_n \to 0\). We should also note the \(f_h(x) = E (\hat{f}_h(x))\) is a density, thus \(\int f_h(x) \mathrm{d}x = 1\).

We will assume that \(f_0\) is sufficiently differentiable and use a Taylor expansion of the distribution function \(F_0\) to get that

\[\begin{align*} f_h(x) & = \frac{1}{2h}\left(F_0(x + h) - F_0(x - h)\right) \\ & = \frac{1}{2h}\left(2h f_0(x) + \frac{h^3}{3} f_0''(x) + R_0(x,h) \right) \\ & = f_0(x) + \frac{h^2}{6} f_0''(x) + R_1(x,h) \end{align*}\]

where \(R_1(x, h) = o(h^2)\). One should note how the quadratic terms in \(h\) in the Taylor expansion canceled. This gives the following formula for the squared bias of \(\hat{f}_h\).

\[\begin{align*} \mathrm{bias}(\hat{f}_h(x))^2 & = (f_h(x) - f_0(x))^2 \\ & = \left(\frac{h^2}{6} f_0''(x) + R_1(x,h) \right)^2 \\ & = \frac{h^4}{36} f_0''(x)^2 + R(x,h) \end{align*}\]

where \(R(x,h) = o(h^4)\). For the variance we see from (2.1) that \[V(\hat{f}_h(x)) = f_h(x)\frac{1}{2hn} - f_h(x)^2 \frac{1}{n}.\] Integrating the sum of the bias and the variance over \(x\) gives the integrated mean squared error

\[\begin{align*} \mathrm{MISE}(h) & = \int \mathrm{bias}(\hat{f}_h(x))^2 + V(\hat{f}_h(x)) \mathrm{d}x \\ & = \frac{h^4}{36} \|f_0''\|^2_2 + \frac{1}{2hn} + \int R(x,h) \mathrm{d} x - \frac{1}{n} \int f_h(x)^2 \mathrm{d} x. \end{align*}\]

If \(f_h(x) \leq C\) (which happens if \(f_0\) is bounded),
\[\int f_h(x)^2 \mathrm{d} x \leq C \int f_h(x) \mathrm{d} x = C,\]
and the last term is \(o((nh)^{-1})\). The second last term is \(o(h^4)\)
if we can interchange the limit and integration order. It is conceivable
that we can do so under suitable assumptions on \(f_0\), but we will not pursue
those at this place. The sum of the two remaining and asymptotically dominating terms in
the formula for MISE is

\[\mathrm{AMISE}(h) = \frac{h^4}{36} \|f_0''\|^2_2 + \frac{1}{2hn},\]
which is known as the asymptotic mean integrated squared error. Clearly,
for this to be a useful formula, we must assume \(\|f_0''\|_2^2 < \infty\).
In this case the formula for AMISE can be used to find the asymptotic
optimal tradeoff between (integrated) bias and variance. Differentiating w.r.t. \(h\) we
find that
\[\mathrm{AMISE}'(h) = \frac{h^3}{9} \|f_0''\|^2_2 - \frac{1}{2h^2n},\]
and solving for \(\mathrm{AMISE}'(h) = 0\) yields
\[h_n = \left(\frac{9}{2 \|f_0''\|_2^2}\right)^{1/5} n^{-1/5}.\]
When AMISE is regarded as a function of \(h\) we observe that
it tends to \(\infty\) for \(h \to 0\) as well as for \(h \to \infty\), thus the unique
stationary point \(h_n\) is a unique global minimizer. Choosing the bandwidth to
be \(h_n\) will therefore minimize the asympototic mean integrated squared error, and it
is in this sense an optimal choice of bandwidth.

We see how “wiggliness” of \(f_0\) enters into the formula for the optimal bandwidth \(h_n\) via \(\|f_0''\|_2\). This norm of the second derivative is precisely a quantification of how much \(f_0\) oscillates. A large value, indicating a wiggly \(f_0\), will drive the optimal bandwidth down whereas a small value will drive the optimal bandwidth up.

We should also observe that if we plug the optimal bandwidth into the formula for AMISE, we get \[\begin{align*} \mathrm{AMISE}(h_n) & = \frac{h_n^4}{36} \|f_0''\|^2_2 + \frac{1}{2h_n n} \\ & = C n^{-4/5}, \end{align*}\] which indicates that in terms of integrated mean squared error the rate at which we can nonparametrically estimate \(f_0\) is \(n^{-4/5}\). This should be contrasted to the common parametric rate of \(n^{-1}\) for mean squared error.

From a practical viewpoint there is one major problem with the optimal
bandwidth \(h_n\); it depends via \(\|f_0''\|^2_2\) upon the unknown \(f_0\)
that we are trying to estimate. We therefore refer to \(h_n\) as an *oracle*
bandwidth – it is the bandwidth that an oracle that knows \(f_0\) would
tell us to use. In practice, we will have to come up with an estimate
of \(\|f_0''\|^2_2\) and plug that estimate into the formula for \(h_n\).
We pursue a couple of different options for doing so for general kernel
density estimators below together with methods that do not rely on the
AMISE formula.

### 2.3.2 ISE, MISE and MSE for kernel estimators

Bandwidth selection for general kernel estimators can be studied
asymptotically just as above. To this end it is useful to formalize
how we quantify the *quality* of an estimate \(\hat{f}_h\). One natural
quantification is the *integrated squared error*,
\[\mathrm{ISE}(\hat{f}_h) = \int (\hat{f}_h(x) - f_0(x))^2 \ \mathrm{d}x = \|\hat{f}_h - f_0\|_2^2.\]

The quality of the estimation procedure producing \(\hat{f}_h\) from data can then be quantified by taking the mean ISE, \[\mathrm{MISE}(h) = E(\mathrm{ISE}(\hat{f}_h)),\] where the expectation integral is over the data. Using Tonelli’s theorem we may interchange the expectation and the integration over \(x\) to get \[\mathrm{MISE}(h) = \int \mathrm{MSE}_x(h) \ \mathrm{d}x\] where \[\mathrm{MSE}_h(x) = \mathrm{var}(\hat{f}_h(x)) + \mathrm{bias}(\hat{f}_h(x))^2.\] is the pointwise mean squared error.

Using the same kind of Taylor expansion argument as above we can show that if
\(K\) is a square integrable probability density with mean 0 and
\[\sigma_K^2 = \int z^2 K(z) \ \mathrm{d}z > 0,\]
then
\[\mathrm{MISE}(h) = \mathrm{AMISE}(h) + o((nh)^{-1} + h^4)\]
where the *asymptotic mean integrated squared error* is
\[\mathrm{AMISE}(h) = \frac{\|K\|_2^2}{nh} + \frac{h^4 \sigma^4_K \|f_0''\|_2^2}{4}\]
with
\[\|g\|_2^2 = \int g(z)^2 \ \mathrm{d}z \quad (\mathrm{squared } \ L_2\mathrm{-norm}).\]
Some regularity assumptions on \(f_0\) are necessary, and from the result
we clearly need to require that \(f_0''\) is meaningful and square integrable.
However, that is also enough. See Proposition A.1 in Tsybakov (2009) for a rigorous
proof.

By minimizing \(\mathrm{AMISE}(h)\) we derive the optimal oracle bandwidth

\[\begin{equation} \tag{2.3} h_n = \left( \frac{\|K\|_2^2}{ \|f_0''\|^2_2 \sigma_K^4} \right)^{1/5} n^{-1/5}. \end{equation}\]

If we plug this formula into the formula for AMISE we arrive at the asymptotic error rate \(\mathrm{AMISE}(h_n) = C n^{-4/5}\) with a constant \(C\) depending on \(f_0''\) and the kernel. It is noteworthy that the asymptotic analysis can be carried out even if \(K\) is allowed to take negative values, though the resulting estimate may not be a valid density as it is. Tsybakov (2009) demonstrates how to improve on the rate \(n^{-4/5}\) by allowing for kernels whose moments of order two or above vanish. Necessarily, such kernels must take negative values.

We observe that for the rectangular kernel, \[\sigma_K^4 = \left(\frac{1}{2} \int_{-1}^1 z^2 \ \mathrm{d} z\right)^2 = \frac{1}{9}\] and \[\|K\|_2^2 = \frac{1}{2^2} \int_{-1}^1 \ \mathrm{d} z = \frac{1}{2}.\] Plugging these numbers into (2.3) we find the oracle bandwidth for the rectangular kernel as derived in Section 2.3.1. For the Gaussian kernel we find that \(\sigma_K^4 = 1\), while \[\|K\|_2^2 = \frac{1}{2 \pi} \int e^{-x^2} \ \mathrm{d} x = \frac{1}{2 \sqrt{\pi}}.\]

### 2.3.3 Plug-in estimation of the oracle bandwidth

To compute \(\|f_0''\|^2_2\) that enters into the formula for the asymptotically optimal bandwidth we have to know \(f_0\) that we are trying to estimate in the first place. To resolve the circularity we will make a first guess of what \(f_0\) is and plug that guess into the formula for the oracle bandwidth.

Our first guess is that \(f_0\) is Gaussian with mean 0 and variance \(\sigma^2\). Then

\[\begin{align*} \|f_0''\|^2_2 & = \frac{1}{2 \pi \sigma^2} \int \left(\frac{x^2}{\sigma^4} - \frac{1}{\sigma^2}\right)^2 e^{-x^2/\sigma^2} \ \mathrm{d}x \\ & = \frac{1}{2 \sigma^9 \sqrt{\pi}} \frac{1}{\sqrt{\pi \sigma^2}} \int (x^4 - 2 \sigma^2 x^2 + \sigma^4) e^{-x^2/\sigma^2} \ \mathrm{d}x \\ & = \frac{1}{2 \sigma^9 \sqrt{\pi}} (\frac{3}{4} \sigma^4 - \sigma^4 + \sigma^4) \\ & = \frac{3}{8 \sigma^5 \sqrt{\pi}}. \end{align*}\]

Plugging this expression for the squared 2-norm of the second derivative of the density into the formula for the oracle bandwidth gives

\[\begin{equation} \tag{2.4} h_n = \left( \frac{8 \sqrt{\pi} \|K\|_2^2}{3 \sigma_K^4} \right)^{1/5} \sigma n^{-1/5}, \end{equation}\]

with \(\sigma\) the only quantity depending on the unknown density \(f_0\). We can now simply estimate \(\sigma\), using e.g. the empirical standard deviation \(\hat{\sigma}\), and plug this estimate into the formula above to get an estimate of the oracle bandwidth.

It is well known that the empirical standard deviation is sensitive to outliers, and if \(\sigma\) is overestimated for that reason, the bandwidth will be too large and the resulting estimate will be oversmoothed. To get more robust bandwidth selection, Silverman (1986) suggested using the interquartile range to estimate \(\sigma\). In fact, he suggested estimating \(\sigma\) by \[\tilde{\sigma} = \min\{\hat{\sigma}, \mathrm{IQR} / 1.34\}.\] In this estimator, IQR denotes the empirical interquartile range, and \(1.34\) is approximately the interquartile range, \(\Phi^{-1}(0.75) - \Phi^{-1}(0.25)\), of the standard Gaussian distribution. Curiously, the interquartile range for the standard Gaussian distribution is \(1.35\) to two decimals accuracy, but the use of \(1.34\) in the estimator \(\tilde{\sigma}\) has prevailed. Silverman, moreover, suggested to reduce the kernel-dependent constant in the formula (2.4) for \(h_n\) to further reduce oversmoothing.

If we specialize to the Gaussian kernel, formula (2.4)
simplifies to
\[\hat{h}_n = \left(\frac{4}{3}\right)^{1/5} \tilde{\sigma} n^{-1/5},\]
with \(\tilde{\sigma}\) plugged in as a robust estimate of \(\sigma\). Silverman (1986)
made the ad hoc suggestion to reduce the factor \((4/3)^{1/5} \simeq 1.06\) to \(0.9\).
This results in the bandwidth estimate
\[\hat{h}_n = 0.9 \tilde{\sigma} n^{-1/5},\]
which has become known as *Silverman’s rule of thumb*.

Silverman’s rule of thumb is the default bandwidth estimator for `density()`

as implemented by the function `bw.nrd0()`

. The `bw.nrd()`

function implements
bandwidth selection using the factor \(1.06\) instead of \(0.9\). Though the theoretical
derivations behind these implementations assume that \(f_0\)
is Gaussian, either will give reasonable bandwidth selection for a range of
unimodal distributions. If \(f_0\) is multimodal, Silverman’s
rule of thumb is known to oversmooth the density estimate.

Instead of computing \(\|f_0''\|^2_2\) assuming that the distribution is Gaussian, we can compute the norm for a pilot estimate, \(\tilde{f}\), and plug the result into the formula for \(h_n\). If the pilot estimate is a kernel estimate with kernel \(H\) and bandwidth \(r\) we get \[\|\tilde{f}''\|^2_2 = \frac{1}{n^2r^6} \sum_{i = 1}^n \sum_{j=1}^n \int H''\left( \frac{x - x_i}{r} \right) H''\left( \frac{x - x_j}{r} \right) \mathrm{d} x.\] The problem is, of course, that now we have to choose the pilot bandwidth \(r\). But doing so using a simple method like Silverman’s rule of thumb at this stage is typically not too bad an idea. Thus we arrive at the following plug-in procedure using the Gaussian kernel for the pilot estimate:

- Compute an estimate, \(\hat{r}\), of the pilot bandwidth using Silverman’s rule of thumb.
- Compute \(\|\tilde{f}''\|^2_2\) using the Gaussian kernel as pilot kernel \(H\) and using the estimated pilot bandwidth \(\hat{r}\).
- Plug \(\|\tilde{f}''\|^2_2\) into the oracle bandwidth formula (2.3) to compute \(\hat{h}_n\) for the kernel \(K\).

Note that to use a pilot kernel different from the Gaussian we have to adjust the constant 0.9 (or 1.06) in Silverman’s rule of thumb accordingly by computing \(\|H\|_2^2\) and \(\sigma_H^4\) and using (2.4).

Sheather and Jones (1991) took these plug-in ideas a step further and analyzed in detail
how to choose the pilot bandwidth in a good and data adaptive way. The resulting
method is somewhat complicated but implementable. We skip the details but simply
observe that their method is implemented in R in the function `bw.SJ()`

, and
it can be selected when using `density()`

by setting the argument `bw = "SJ"`

.
This plug-in method is regarded as a solid default that performs well for
many different data generating densities \(f_0\).

### 2.3.4 Cross-validation

An alternative to relying on the asymptotic optimality arguments for
integrated mean squared error and the corresponding plug-in estimates
of the bandwidth is known as *cross-validation*. The method mimics
the idea of setting aside a subset of the data set, which is then *not*
used for computing an estimate but only for validating the estimator’s
performance. The benefit of cross-validation over simply setting
aside a validation data set is that we do not “waste” any of the data points
on validation only. All data points are used for the ultimate computation
of the estimate. The deficit is that cross-validation is usually
computationally more demanding.

Suppose that \(I_1, \ldots, I_k\) form a partition of the index set \(\{1, \ldots, n\}\) and define \[I^{-i} = \bigcup_{l: i \not \in I_l} I_l.\] That is, \(I^{-i}\) contains all indices but those that belong to the set \(I_l\) containing \(i\). In particular, \(i \not \in I^{-i}\). Define also \(n_i = |I^{-i}|\) and \[\hat{f}^{-i}_h = \frac{1}{h n_i} \sum_{j \in I^{-i}} K\left(\frac{x_i - x_j}{h}\right).\] That is, \(\hat{f}^{-i}_h\) is the kernel density estimate based on data with indices in \(I^{-i}\) and evaluated in \(x_i\). Since the density estimate evaluated in \(x_i\) is not based on \(x_i\), the quantity \(\hat{f}^{-i}_h\) can be used to assess how well the density estimate computed using a bandwidth \(h\) concur with the data point \(x_i\). This can be summarized using the log-likelihood \[\ell_{\mathrm{CV}}(h) = \sum_{i=1}^n \log (\hat{f}^{-i}_h),\] that we will refer to as the cross-validated log-likelihood, and we define the bandwidth estimate as \[\hat{h}_{\mathrm{CV}} = \textrm{arg max}_h \ \ \ell_{\mathrm{CV}}(h).\] This cross-validation based bandwidth can then be used for computing kernel density estimates using the entire data set.

If the partition of indices consists of \(k\) subsets we usually talk about \(k\)-fold cross-validation. If \(k = n\) so that all subsets consist of just a single index we talk about leave-one-out cross-validation. For leave-one-out cross-validation there is only one possible partition, while for \(k < n\) there are many possible partitions. Which should be chosen then? In practice, we choose the partition by sampling indices randomly without replacement into \(k\) sets of size roughly \(n / k\).

It is also possible to use cross-validation in combination with MISE. Rewriting
we find that
\[\mathrm{MISE}(h) = E (\| \hat{f}_h\|_2^2) - 2 E (\hat{f}_h(X)) + E(\|f_0^2\|_2^2)\]
for \(X\) a random variable independent of the data and with distribution having
density \(f_0\). The last term does not depend upon \(h\) and we can ignore it from
the point of view of minimizing MISE. For the first term we have an unbiased
estimate in \(\| \hat{f}_h\|_2^2\). The middle term can be estimated
without bias by
\[\frac{2}{n} \sum_{i=1}^n \hat{f}^{-i}_h,\]
and this leads to the statistic
\[\mathrm{UCV}(h) = \| \hat{f}_h\|_2^2 - \frac{2}{n} \sum_{i=1}^n \hat{f}^{-i}_h\]
known as the unbiased cross-validation criterion. The corresponding
bandwidth estimate is
\[\hat{h}_{\mathrm{UCV}} = \textrm{arg min}_h \ \ \mathrm{UCV}(h).\]
Contrary to the log-likelihood based criterion, this criterion requires
the computation of \(\| \hat{f}_h\|_2^2\). Bandwidth selection using UCV
is implemented in R in the function `bw.ucv()`

and can be
used with `density()`

by setting the argument `bw = "ucv"`

.

## 2.4 Likelihood considerations

In Section 2.3.4 the cross-validated likelihood was used for bandwidth selection. It is natural to ask why we did not go all-in and simply maximized the likelihood over all possible densities to find a maximum likelihood estimator instead of using the ad hoc idea behind kernel density estimation.

The log-likelihood \[\ell(f) = \sum_{i=j}^n \log f(x_j)\] is well defined as a function of the density \(f\) – even when \(f\) is not restricted to belong to a finite-dimensional parametric model. To investigate if a nonparametric MLE is meaningful we consider how the likelihood behaves for the densities \[\overline{f}_h(x) = \frac{1}{nh \sqrt{2 \pi}} \sum_{j=1}^n e^{- \frac{(x - x_j)^2}{2 h^2} }\] for different choices of \(h\). Note that \(\overline{f}_h\) is simply the kernel density estimator with the Gaussian kernel and bandwidth \(h\).

```
## The densities can easily be implemented using the density implementation
## of the Gaussian density in R
f_h <- function(x, h) mean(dnorm(x, psi, h))
## This function does not work as a vectorized function as it is, but there is
## a convenience function, 'Vectorize', in R that turns the function into a
## function that can actually be applied correctly to a vector.
f_h <- Vectorize(f_h)
```

Figure 2.5 shows what some of these densities look like compared to the histogram of the \(\psi\)-angle data. Clearly, for large \(h\) these densities are smooth and slowly oscillating, while as \(h\) gets smaller the densities become more and more wiggly. As \(h \to 0\) the densities become increasingly dominated by tall narrow peaks around the individual data points.

The way that these densities adapt to the data points is reflected in the log-likelihood as well.

```
# To plot the log-likelihood we need to evaluate it in a grid of h-values.
hseq <- seq(1, 0.001, -0.001)
# For the following computation of the log-likelihood it is necessary
# that f_h is vectorized. There are other ways to implement this computation
# in R, and some are more efficient, but the computation of the log-likelihood
# for each h scales as O(n^2) with n the number of data points.
ll <- sapply(hseq, function(h) sum(log(f_h(psi, h))))
ggplot(mapping = aes(hseq, ll)) + geom_line() +
xlab("h") + ylab("Log-likelihood")
ggplot(mapping = aes(hseq, ll)) + geom_line() +
scale_x_log10("h") + ylab("")
```

From Figure 2.6 it is clear that the likelihood is decreasing in \(h\), and it appears that it is unbounded as \(h \to 0\). This is most clearly seen on the figure when \(h\) is plotted on the log-scale because then it appears that the log-likelihood approximately behaves as \(-\log(h)\) for \(h \to 0\).

We can show that that is, indeed, the case. If \(x_i \neq x_j\) when \(i \neq j\)

\[\begin{align*} \ell(\overline{f}_h) & = \sum_{i} \log\left(1 + \sum_{j \neq i} e^{-(x_i - x_j)^2 / (2 h^2)} \right) - n \log(nh\sqrt{2 \pi}) \\ & \sim - n \log(nh\sqrt{2 \pi}) \end{align*}\]

for \(h \to 0\). Hence, \(\ell(\overline{f}_h) \to \infty\) for \(h \to 0\). This demonstrates that the MLE of the density does not exist in the set of all distributions with densities.

In the sense of weak convergence it actually holds that
\[\overline{f}_h \cdot m \overset{\mathrm{wk}}{\longrightarrow}
\varepsilon_n = \frac{1}{n} \sum_{j=1}^n \delta_{x_j}\]
for \(h \to 0\). The *empirical measure* \(\varepsilon_n\) can sensibly be regarded as
the nonparametric MLE of the distribution, but the empirical measure does not have
a density. We conclude that we cannot directly define a sensible
density estimator as a maximum-likelihood estimator.

### 2.4.1 Method of sieves

A sieve is a family of models, \(\Theta_{h}\), indexed by a real valued parameter, \(h \in \mathbb{R}\), such that \(\Theta_{h_1} \subseteq \Theta_{h_2}\) for \(h_1 \leq h_2\). In this chapter \(\Theta_{h}\) will denote a set of probability densities. If the increasing family of models is chosen in a sensible way, we may be able to compute the MLE \[\hat{f}_h = \text{arg max}_{f \in \Theta_h} \ell(f),\] and we may even be able to choose \(h = h_n\) as a function of the sample size \(n\) such that \(\hat{f}_{h_n}\) becomes a consistent estimator of \(f_0\).

It is possible to take \[\Theta_h = \{ \overline{f}_{h'} \mid h' \leq h \}\] with \(\overline{f}_{h'}\) as defined above, in which case \(\hat{f}_h = \overline{f}_h\). We will see in the following section that this is simply a kernel estimator.

A more interesting example is obtained by letting \[\Theta_h = \left\{ x \mapsto \frac{1}{h \sqrt{2 \pi}} \int e^{- \frac{(x - z)^2}{2 h^2} } \mathrm{d}\mu(z) \Biggm| \mu \textrm{ a probability measure} \right\},\] which is known as the convolution sieve. We note that \(\overline{f}_h \in \Theta_h\) by taking \(\mu = \varepsilon_n\), but generally \(\hat{f}_h\) will be different from \(\overline{f}_h\).

We will not pursue the general theory of sieve estimators, but refer to the paper Nonparametric Maximum Likelihood Estimation by the Method of Sieves by Geman and Hwang. In the following section we will work out some more practical details for a particular sieve estimator based on a basis expansion of the log-density.

### 2.4.2 Basis expansions

We suppose in this section that the data points are all contained in the interval \([a,b]\) for \(a, b \in \mathbb{R}\). This is true for the angle data with \(a = -\pi\) and \(b = \pi\) no matter the size of the data set, but if \(f_0\) does not have a bounded support it may be necessary to let \(a\) and \(b\) change with the data. However, for any fixed data set we can choose some sufficiently large \(a\) and \(b\).

In this section the sieve will be indexed by integers,
and for \(h \in \mathbb{N}\) we suppose that we have chosen continuous
functions \(b_1, \ldots, b_h : [a,b] \to \mathbb{R}\). These will be
called *basis functions*. We then define
\[\Theta_h = \left\{ x \mapsto \varphi(\boldsymbol{\beta})^{-1}
\exp\left(\sum_{k=1}^h \beta_k b_k(x)\right) \Biggm| \boldsymbol{\beta} = (\beta_1, \ldots, \beta_h)^T \in \mathbb{R}^h \right\},\]
where
\[\varphi(\boldsymbol{\beta}) = \int_a^b \exp\left(\sum_{k=1}^h \beta_k b_k(x)\right) \mathrm{d} x.\]

The MLE over \(\Theta_h\) is then given as \[\hat{f}_h(x) = \varphi(\hat{\boldsymbol{\beta}})^{-1} \exp\left(\sum_{k=1}^h \hat{\beta}_k b_k(x)\right),\] where

\[\begin{align*} \hat{\boldsymbol{\beta}} & = \text{arg max}_{\boldsymbol{\beta} \in \mathbb{R}^h} \ \sum_{j=1}^n \sum_{k=1}^h \beta_k b_k(x_j) - \log \varphi(\boldsymbol{\beta}) \\ & = \text{arg max}_{\boldsymbol{\beta} \in \mathbb{R}^h} \ \ \mathbf{1}^T \mathbf{B} \boldsymbol{\beta} - \log \varphi(\boldsymbol{\beta}). \end{align*}\]

Here \(\mathbf{B}\) is the \(n \times h\) matrix with \(B_{jk} = b_k(x_j)\). Thus for any fixed \(h\) the model is, in fact, just an ordinary parametric exponential family, though it may not be entirely straightforward how to compute \(\varphi(\boldsymbol{\beta})\).

Many basis functions are possible. Polynomials may be used, but splines are
often preferred. An alternative is a selection of trigonometric functions,
for instance

\[b_1(x) = \cos(x), b_2(x) = \sin(x), \ldots, b_{2h-1}(x) = \cos(hx), b_{2h}(x) = \sin(hx)\]
on the interval \([-\pi, \pi]\). In Section 1.2.1 a simple special case
was actually treated corresponding to \(h = 2\), where the normalization constant
was identified in terms of a modified Bessel function.

It is worth remembering the following:

“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”

— John von Neumann

The Normal-inverse Gaussian distribution has four parameters and the generalized hyperbolic distribution is an extension with five, but von Neumann was probably thinking more in terms of a spline or a polynomial expansion as above with four or five suitably chosen basis functions.

The quote is not a mathematical statement but an empirical observation. With a handful of parameters you already have a quite flexible class of densities that will fit many real data sets well. But remember that a reasonably good fit does not mean that you have found the “true” data generating model. Though data is in some situations not as scarce a resource today as when von Neumann made elephants wiggle their trunks, the quote still suggests that \(h\) should grow rather slowly with \(n\) to avoid overfitting.

## 2.5 Exercises

### Kernel density estimation

**Exercise 2.1 **The Epanechnikov kernel is given by
\[
K(x) = \frac{3}{4}(1 - x^2)
\]
for \(x \in [-1, 1]\) and 0 elsewhere. Show that this is a probability density
with mean zero and compute \(\sigma_K^2\) as well as \(\|K\|_2^2\).

For the following exercises use the log(F12) variable as considered in Exercise A.4.

**Exercise 2.2 **Use `density()`

to compute the kernel density estimate with the Epanechnikov kernel
to the log(F12) data. Try different bandwidths.

**Exercise 2.3 **Implement kernel density estimation yourself using the Epanechnikov kernel.
Test your implementation by comparing it to `density()`

using the log(F12) data.

### Benchmarking

**Exercise 2.4 **Construct the following vector

`x <- rnorm(2^13)`

Then use `microbenchmark`

to benchmark the computation of

`density(x[1:k], 0.2)`

for `k`

ranging from \(2^5\) to \(2^{13}\). Summarize the benchmarking results.

**Exercise 2.5 **Benchmark your own implementation of kernel density estimation using the Epanechnikov
kernel. Compare the results to those obtained for `density()`

.

**Exercise 2.6 **Experiment with different implementations of kernel evaluation in R using
the Gaussian kernel and the Epanechnikov kernel. Use `microbenchmark()`

to compare
the different implementations.