## 2.3 Bandwidth selection

### 2.3.1 Revisiting the rectangular kernel

We return to the rectangular kernel and compute the mean squared error. In the analysis it may be helpful to think of \(n\) as large and \(h\) as small. Indeed, we will eventually choose \(h = h_n\) as a function of \(n\) such that as \(n \to \infty\) we have \(h_n \to 0\). We should also note that \(f_h(x) = E (\hat{f}_h(x))\) is itself a density, thus \(\int f_h(x) \mathrm{d}x = 1\).

We will assume that \(f_0\) is sufficiently differentiable and use a Taylor expansion of the distribution function \(F_0\) to get that

\[\begin{align*} f_h(x) & = \frac{1}{2h}\left(F_0(x + h) - F_0(x - h)\right) \\ & = \frac{1}{2h}\left(2h f_0(x) + \frac{h^3}{3} f_0''(x) + R_0(x,h) \right) \\ & = f_0(x) + \frac{h^2}{6} f_0''(x) + R_1(x,h) \end{align*}\]

where \(R_1(x, h) = o(h^2)\). One should note how the quadratic terms in \(h\) in the Taylor expansion canceled. This gives the following formula for the squared bias of \(\hat{f}_h\).

\[\begin{align*} \mathrm{bias}(\hat{f}_h(x))^2 & = (f_h(x) - f_0(x))^2 \\ & = \left(\frac{h^2}{6} f_0''(x) + R_1(x,h) \right)^2 \\ & = \frac{h^4}{36} f_0''(x)^2 + R(x,h) \end{align*}\]

where \(R(x,h) = o(h^4)\). For the variance we see from (2.1) that \[V(\hat{f}_h(x)) = f_h(x)\frac{1}{2hn} - f_h(x)^2 \frac{1}{n}.\] Integrating the sum of the bias and the variance over \(x\) gives the integrated mean squared error

\[\begin{align*} \mathrm{MISE}(h) & = \int \mathrm{bias}(\hat{f}_h(x))^2 + V(\hat{f}_h(x)) \mathrm{d}x \\ & = \frac{h^4}{36} \|f_0''\|^2_2 + \frac{1}{2hn} + \int R(x,h) \mathrm{d} x - \frac{1}{n} \int f_h(x)^2 \mathrm{d} x. \end{align*}\]

If \(f_h(x) \leq C\) (which happens if \(f_0\) is bounded),
\[\int f_h(x)^2 \mathrm{d} x \leq C \int f_h(x) \mathrm{d} x = C,\]
and the last term is \(o((nh)^{-1})\). The second-to-last term is \(o(h^4)\)
if we can interchange the order of the limit and the integration. This can be
justified under suitable assumptions on \(f_0\), but we will not pursue
them here. The sum of the two remaining and asymptotically dominating terms in
the formula for MISE is

\[\mathrm{AMISE}(h) = \frac{h^4}{36} \|f_0''\|^2_2 + \frac{1}{2hn},\]
which is known as the asymptotic mean integrated squared error. Clearly,
for this to be a useful formula, we must assume \(\|f_0''\|_2^2 < \infty\).
In this case the formula for AMISE can be used to find the asymptotic
optimal tradeoff between (integrated) bias and variance. Differentiating w.r.t. \(h\) we
find that
\[\mathrm{AMISE}'(h) = \frac{h^3}{9} \|f_0''\|^2_2 - \frac{1}{2h^2n},\]
and solving for \(\mathrm{AMISE}'(h) = 0\) yields
\[h_n = \left(\frac{9}{2 \|f_0''\|_2^2}\right)^{1/5} n^{-1/5}.\]
When AMISE is regarded as a function of \(h\) we observe that
it tends to \(\infty\) for \(h \to 0\) as well as for \(h \to \infty\), thus the unique
stationary point \(h_n\) is a unique global minimizer. Choosing the bandwidth to
be \(h_n\) will therefore minimize the asymptotic mean integrated squared error, and it
is in this sense an optimal choice of bandwidth.
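This optimality can be checked numerically. The sketch below (in Python, for illustration; the value of \(\|f_0''\|_2^2\) is an arbitrary illustrative choice) evaluates the AMISE formula for the rectangular kernel and confirms that the stationary point \(h_n\) beats nearby bandwidths.

```python
import math

def amise(h, n, R):
    # AMISE(h) = h^4 R / 36 + 1 / (2 h n) for the rectangular kernel,
    # where R stands for ||f_0''||_2^2
    return h**4 * R / 36 + 1 / (2 * h * n)

def h_opt(n, R):
    # Stationary point of AMISE: h_n = (9 / (2 R))^(1/5) * n^(-1/5)
    return (9 / (2 * R))**0.2 * n**-0.2

n, R = 1000, 0.25  # R = ||f_0''||_2^2 is an arbitrary illustrative value
h = h_opt(n, R)

# h is the global minimizer: AMISE is larger at perturbed bandwidths
assert amise(h, n, R) < amise(0.5 * h, n, R)
assert amise(h, n, R) < amise(2 * h, n, R)
```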

We see how “wiggliness” of \(f_0\) enters into the formula for the optimal bandwidth \(h_n\) via \(\|f_0''\|_2\). This norm of the second derivative is precisely a quantification of how much \(f_0\) oscillates. A large value, indicating a wiggly \(f_0\), will drive the optimal bandwidth down whereas a small value will drive the optimal bandwidth up.

We should also observe that if we plug the optimal bandwidth into the formula for AMISE, we get \[\begin{align*} \mathrm{AMISE}(h_n) & = \frac{h_n^4}{36} \|f_0''\|^2_2 + \frac{1}{2h_n n} \\ & = C n^{-4/5}, \end{align*}\] which indicates that in terms of integrated mean squared error the rate at which we can nonparametrically estimate \(f_0\) is \(n^{-4/5}\). This should be contrasted to the common parametric rate of \(n^{-1}\) for mean squared error.
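The constant \(C\) can in fact be made explicit. At the stationary point, the equation \(\mathrm{AMISE}'(h_n) = 0\) rearranges to \(\frac{h_n^4}{36} \|f_0''\|_2^2 = \frac{1}{4} \cdot \frac{1}{2 h_n n}\), that is, the integrated squared bias equals one quarter of the integrated variance, whence

\[\mathrm{AMISE}(h_n) = \frac{5}{4} \cdot \frac{1}{2 h_n n} = \frac{5}{8} \left(\frac{2 \|f_0''\|_2^2}{9}\right)^{1/5} n^{-4/5}.\]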

From a practical viewpoint there is one major problem with the optimal
bandwidth \(h_n\); it depends via \(\|f_0''\|^2_2\) upon the unknown \(f_0\)
that we are trying to estimate. We therefore refer to \(h_n\) as an *oracle*
bandwidth – it is the bandwidth that an oracle that knows \(f_0\) would
tell us to use. In practice, we will have to come up with an estimate
of \(\|f_0''\|^2_2\) and plug that estimate into the formula for \(h_n\).
We pursue a couple of different options for doing so for general kernel
density estimators below together with methods that do not rely on the
AMISE formula.

### 2.3.2 ISE, MISE and MSE for kernel estimators

Bandwidth selection for general kernel estimators can be studied
asymptotically just as above. To this end it is useful to formalize
how we quantify the *quality* of an estimate \(\hat{f}_h\). One natural
quantification is the *integrated squared error*,
\[\mathrm{ISE}(\hat{f}_h) = \int (\hat{f}_h(x) - f_0(x))^2 \ \mathrm{d}x = \|\hat{f}_h - f_0\|_2^2.\]

The quality of the estimation procedure producing \(\hat{f}_h\) from data can then be quantified by taking the mean ISE, \[\mathrm{MISE}(h) = E(\mathrm{ISE}(\hat{f}_h)),\] where the expectation integral is over the data. Using Tonelli’s theorem we may interchange the expectation and the integration over \(x\) to get \[\mathrm{MISE}(h) = \int \mathrm{MSE}_x(h) \ \mathrm{d}x\] where \[\mathrm{MSE}_x(h) = \mathrm{var}(\hat{f}_h(x)) + \mathrm{bias}(\hat{f}_h(x))^2\] is the pointwise mean squared error.

Using the same kind of Taylor expansion argument as above we can show that if
\(K\) is a square integrable probability density with mean 0 and
\[\sigma_K^2 = \int z^2 K(z) \ \mathrm{d}z > 0,\]
then
\[\mathrm{MISE}(h) = \mathrm{AMISE}(h) + o((nh)^{-1} + h^4)\]
where the *asymptotic mean integrated squared error* is
\[\mathrm{AMISE}(h) = \frac{\|K\|_2^2}{nh} + \frac{h^4 \sigma^4_K \|f_0''\|_2^2}{4}\]
with
\[\|g\|_2^2 = \int g(z)^2 \ \mathrm{d}z \quad (\text{squared } L_2\text{-norm}).\]
Some regularity assumptions on \(f_0\) are necessary, and from the result
we clearly need to require that \(f_0''\) is meaningful and square integrable.
However, that is also enough. See Proposition A.1 in Tsybakov (2009) for a rigorous
proof.

By minimizing \(\mathrm{AMISE}(h)\) we derive the optimal oracle bandwidth

\[\begin{equation} \tag{2.3} h_n = \left( \frac{\|K\|_2^2}{ \|f_0''\|^2_2 \sigma_K^4} \right)^{1/5} n^{-1/5}. \end{equation}\]
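The calculation behind (2.3) mirrors the rectangular case: differentiating AMISE and setting the derivative to zero gives

\[\mathrm{AMISE}'(h) = -\frac{\|K\|_2^2}{n h^2} + h^3 \sigma_K^4 \|f_0''\|_2^2 = 0 \quad \Longleftrightarrow \quad h^5 = \frac{\|K\|_2^2}{n \sigma_K^4 \|f_0''\|_2^2},\]

and since AMISE tends to \(\infty\) for \(h \to 0\) as well as for \(h \to \infty\), the stationary point is again the unique global minimizer.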

If we plug this formula into the formula for AMISE we arrive at the asymptotic error rate \(\mathrm{AMISE}(h_n) = C n^{-4/5}\) with a constant \(C\) depending on \(f_0''\) and the kernel. It is noteworthy that the asymptotic analysis can be carried out even if \(K\) is allowed to take negative values, though the resulting estimate need not then be a probability density itself. Tsybakov (2009) demonstrates how to improve on the rate \(n^{-4/5}\) by allowing for kernels whose moments of order two or above vanish. Necessarily, such kernels must take negative values.

We observe that for the rectangular kernel, \[\sigma_K^4 = \left(\frac{1}{2} \int_{-1}^1 z^2 \ \mathrm{d} z\right)^2 = \frac{1}{9}\] and \[\|K\|_2^2 = \frac{1}{2^2} \int_{-1}^1 \ \mathrm{d} z = \frac{1}{2}.\] Plugging these numbers into (2.3) we find the oracle bandwidth for the rectangular kernel as derived in Section 2.3.1. For the Gaussian kernel we find that \(\sigma_K^4 = 1\), while \[\|K\|_2^2 = \frac{1}{2 \pi} \int e^{-x^2} \ \mathrm{d} x = \frac{1}{2 \sqrt{\pi}}.\]
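These kernel constants are simple integrals, and it is a good sanity check to confirm them numerically. A short Python sketch, for illustration, using midpoint-rule integration:

```python
import math

def integrate(f, a, b, m=20000):
    # midpoint-rule numerical integration of f on [a, b]
    w = (b - a) / m
    return w * sum(f(a + (k + 0.5) * w) for k in range(m))

# Rectangular kernel K(z) = 1/2 on [-1, 1]
sigma2_rect = integrate(lambda z: z * z * 0.5, -1, 1)   # sigma_K^2 = 1/3
norm2_rect = integrate(lambda z: 0.25, -1, 1)           # ||K||_2^2 = 1/2
assert abs(sigma2_rect**2 - 1 / 9) < 1e-6
assert abs(norm2_rect - 1 / 2) < 1e-6

# Gaussian kernel K(z) = exp(-z^2 / 2) / sqrt(2 * pi)
gauss = lambda z: math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
sigma2_gauss = integrate(lambda z: z * z * gauss(z), -10, 10)  # sigma_K^2 = 1
norm2_gauss = integrate(lambda z: gauss(z)**2, -10, 10)        # 1 / (2 sqrt(pi))
assert abs(sigma2_gauss**2 - 1) < 1e-5
assert abs(norm2_gauss - 1 / (2 * math.sqrt(math.pi))) < 1e-6
```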

### 2.3.3 Plug-in estimation of the oracle bandwidth

To compute the quantity \(\|f_0''\|^2_2\) that enters into the formula for the asymptotically optimal bandwidth we would have to know the density \(f_0\) that we are trying to estimate in the first place. To resolve this circularity we make a first guess of what \(f_0\) is and plug that guess into the formula for the oracle bandwidth.

Our first guess is that \(f_0\) is Gaussian with mean 0 and variance \(\sigma^2\). Then

\[\begin{align*} \|f_0''\|^2_2 & = \frac{1}{2 \pi \sigma^2} \int \left(\frac{x^2}{\sigma^4} - \frac{1}{\sigma^2}\right)^2 e^{-x^2/\sigma^2} \ \mathrm{d}x \\ & = \frac{1}{2 \sigma^9 \sqrt{\pi}} \frac{1}{\sqrt{\pi \sigma^2}} \int (x^4 - 2 \sigma^2 x^2 + \sigma^4) e^{-x^2/\sigma^2} \ \mathrm{d}x \\ & = \frac{1}{2 \sigma^9 \sqrt{\pi}} (\frac{3}{4} \sigma^4 - \sigma^4 + \sigma^4) \\ & = \frac{3}{8 \sigma^5 \sqrt{\pi}}. \end{align*}\]

Plugging this expression for the squared 2-norm of the second derivative of the density into the formula for the oracle bandwidth gives

\[\begin{equation} \tag{2.4} h_n = \left( \frac{8 \sqrt{\pi} \|K\|_2^2}{3 \sigma_K^4} \right)^{1/5} \sigma n^{-1/5}, \end{equation}\]

with \(\sigma\) the only quantity depending on the unknown density \(f_0\). We can now simply estimate \(\sigma\), using e.g. the empirical standard deviation \(\hat{\sigma}\), and plug this estimate into the formula above to get an estimate of the oracle bandwidth.

It is well known that the empirical standard deviation is sensitive to outliers, and if \(\sigma\) is overestimated for that reason, the bandwidth will be too large and the resulting estimate will be oversmoothed. To get a more robust bandwidth selection, Silverman (1986) suggested using the interquartile range to estimate \(\sigma\). In fact, he suggested estimating \(\sigma\) by \[\tilde{\sigma} = \min\{\hat{\sigma}, \mathrm{IQR} / 1.34\}.\] In this estimator, IQR denotes the empirical interquartile range, and \(1.34\) is approximately the interquartile range, \(\Phi^{-1}(0.75) - \Phi^{-1}(0.25)\), of the standard Gaussian distribution. Curiously, the interquartile range of the standard Gaussian distribution is \(1.35\) to two-decimal accuracy, but the use of \(1.34\) in the estimator \(\tilde{\sigma}\) has prevailed. Silverman, moreover, suggested reducing the kernel-dependent constant in the formula (2.4) for \(h_n\) to further reduce oversmoothing.

If we specialize to the Gaussian kernel, formula (2.4)
simplifies to
\[\hat{h}_n = \left(\frac{4}{3}\right)^{1/5} \tilde{\sigma} n^{-1/5},\]
with \(\tilde{\sigma}\) plugged in as a robust estimate of \(\sigma\). Silverman (1986)
made the ad hoc suggestion to reduce the factor \((4/3)^{1/5} \simeq 1.06\) to \(0.9\).
This results in the bandwidth estimate
\[\hat{h}_n = 0.9 \tilde{\sigma} n^{-1/5},\]
which has become known as *Silverman’s rule of thumb*.
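As a sketch of the computation (in Python, for illustration, with `statistics.quantiles` standing in for R's quantile function), Silverman's rule of thumb amounts to:

```python
import math
import random
import statistics

def silverman_bw(x):
    # 0.9 * min(sd, IQR / 1.34) * n^(-1/5), the robust rule of thumb;
    # R's bw.nrd0() computes the same quantity, up to details of
    # quantile estimation and safeguards for degenerate data
    n = len(x)
    sd = statistics.stdev(x)
    q1, _, q3 = statistics.quantiles(x, n=4)
    sigma_tilde = min(sd, (q3 - q1) / 1.34)
    return 0.9 * sigma_tilde * n**-0.2

# The unreduced, theoretically derived factor is (4/3)^(1/5), about 1.06
assert abs((4 / 3)**0.2 - 1.0593) < 1e-3

random.seed(1)
x = [random.gauss(0, 1) for _ in range(100)]
h = silverman_bw(x)
```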

Silverman’s rule of thumb is the default bandwidth estimator for `density()`, as implemented by the function `bw.nrd0()`. The `bw.nrd()` function implements bandwidth selection using the factor \(1.06\) instead of \(0.9\). Though the theoretical derivations behind these implementations assume that \(f_0\) is Gaussian, either will give reasonable bandwidth selection for a range of unimodal distributions. If \(f_0\) is multimodal, Silverman’s rule of thumb is known to oversmooth the density estimate.

Instead of computing \(\|f_0''\|^2_2\) assuming that the distribution is Gaussian, we can compute the norm for a pilot estimate, \(\tilde{f}\), and plug the result into the formula for \(h_n\). If the pilot estimate is a kernel estimate with kernel \(H\) and bandwidth \(r\) we get \[\|\tilde{f}''\|^2_2 = \frac{1}{n^2r^6} \sum_{i = 1}^n \sum_{j=1}^n \int H''\left( \frac{x - x_i}{r} \right) H''\left( \frac{x - x_j}{r} \right) \mathrm{d} x.\] The problem is, of course, that now we have to choose the pilot bandwidth \(r\). But doing so using a simple method like Silverman’s rule of thumb at this stage is typically not too bad an idea. Thus we arrive at the following plug-in procedure using the Gaussian kernel for the pilot estimate:

- Compute an estimate, \(\hat{r}\), of the pilot bandwidth using Silverman’s rule of thumb.
- Compute \(\|\tilde{f}''\|^2_2\) using the Gaussian kernel as pilot kernel \(H\) and using the estimated pilot bandwidth \(\hat{r}\).
- Plug \(\|\tilde{f}''\|^2_2\) into the oracle bandwidth formula (2.3) to compute \(\hat{h}_n\) for the kernel \(K\).

Note that to use a pilot kernel different from the Gaussian we have to adjust the constant 0.9 (or 1.06) in Silverman’s rule of thumb accordingly by computing \(\|H\|_2^2\) and \(\sigma_H^4\) and using (2.4).
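A minimal sketch of this three-step procedure (in Python, for illustration; the grid-based numerical integration and the simplified pilot bandwidth without the IQR refinement are illustrative shortcuts, not part of any reference implementation):

```python
import math
import random

def gauss_d2(z):
    # Second derivative of the standard Gaussian density: (z^2 - 1) * phi(z)
    return (z * z - 1) * math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def plugin_bw(x, K_norm2, K_sigma4):
    # Plug-in bandwidth with a Gaussian pilot kernel H.
    # K_norm2 = ||K||_2^2 and K_sigma4 = sigma_K^4 for the target kernel K.
    n = len(x)
    # Step 1: pilot bandwidth by (simplified) Silverman's rule of thumb
    mean = sum(x) / n
    sd = math.sqrt(sum((xi - mean)**2 for xi in x) / (n - 1))
    r = 0.9 * sd * n**-0.2
    # Step 2: ||f~''||_2^2 for the pilot estimate, by grid integration
    lo, hi, m = min(x) - 5 * r, max(x) + 5 * r, 1000
    w = (hi - lo) / m
    norm2 = 0.0
    for k in range(m):
        t = lo + (k + 0.5) * w
        d2 = sum(gauss_d2((t - xi) / r) for xi in x) / (n * r**3)
        norm2 += w * d2 * d2
    # Step 3: plug into the oracle bandwidth formula (2.3)
    return (K_norm2 / (norm2 * K_sigma4))**0.2 * n**-0.2

random.seed(1)
x = [random.gauss(0, 1) for _ in range(200)]
# Gaussian target kernel: ||K||_2^2 = 1 / (2 sqrt(pi)), sigma_K^4 = 1
h = plugin_bw(x, 1 / (2 * math.sqrt(math.pi)), 1.0)
```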

Sheather and Jones (1991) took these plug-in ideas a step further and analyzed in detail how to choose the pilot bandwidth in a good and data-adaptive way. The resulting method is somewhat complicated but implementable. We skip the details but simply observe that their method is implemented in R in the function `bw.SJ()`, and it can be selected when using `density()` by setting the argument `bw = "SJ"`. This plug-in method is regarded as a solid default that performs well for many different data-generating densities \(f_0\).

### 2.3.4 Cross-validation

An alternative to relying on the asymptotic optimality arguments for
integrated mean squared error and the corresponding plug-in estimates
of the bandwidth is known as *cross-validation*. The method mimics
the idea of setting aside a subset of the data set, which is then *not*
used for computing an estimate but only for validating the estimator’s
performance. The benefit of cross-validation over simply setting
aside a validation data set is that we do not “waste” any of the data points
on validation only. All data points are used for the ultimate computation
of the estimate. The drawback is that cross-validation is usually
computationally more demanding.

Suppose that \(I_1, \ldots, I_k\) form a partition of the index set \(\{1, \ldots, n\}\) and define \[I^{-i} = \bigcup_{l: i \not \in I_l} I_l.\] That is, \(I^{-i}\) contains all indices except those belonging to the set \(I_l\) that contains \(i\). In particular, \(i \not \in I^{-i}\). Define also \(n_i = |I^{-i}|\) and \[\hat{f}^{-i}_h = \frac{1}{h n_i} \sum_{j \in I^{-i}} K\left(\frac{x_i - x_j}{h}\right).\] That is, \(\hat{f}^{-i}_h\) is the kernel density estimate based on the data with indices in \(I^{-i}\) and evaluated in \(x_i\). Since the density estimate evaluated in \(x_i\) is not based on \(x_i\), the quantity \(\hat{f}^{-i}_h\) can be used to assess how well the density estimate computed using a bandwidth \(h\) agrees with the data point \(x_i\). This can be summarized using the log-likelihood \[\ell_{\mathrm{CV}}(h) = \sum_{i=1}^n \log (\hat{f}^{-i}_h),\] which we will refer to as the cross-validated log-likelihood, and we define the bandwidth estimate as \[\hat{h}_{\mathrm{CV}} = \textrm{arg max}_h \ \ \ell_{\mathrm{CV}}(h).\] This cross-validation based bandwidth can then be used for computing kernel density estimates using the entire data set.

If the partition of indices consists of \(k\) subsets we usually talk about \(k\)-fold cross-validation. If \(k = n\), so that all subsets consist of just a single index, we talk about leave-one-out cross-validation. For leave-one-out cross-validation there is only one possible partition, while for \(k < n\) there are many possible partitions. Which one should we choose? In practice, we choose the partition by sampling indices randomly without replacement into \(k\) sets of size roughly \(n / k\).
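A leave-one-out sketch of maximizing \(\ell_{\mathrm{CV}}\) (in Python, for illustration; the Gaussian kernel and the crude grid search are illustrative choices standing in for a proper optimizer):

```python
import math
import random

def gauss(z):
    # standard Gaussian density
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def loo_loglik(x, h):
    # Leave-one-out cross-validated log-likelihood for a Gaussian-kernel KDE:
    # sum_i log f^{-i}_h, with f^{-i}_h based on all data points except x_i
    n = len(x)
    total = 0.0
    for i, xi in enumerate(x):
        fi = sum(gauss((xi - xj) / h) for j, xj in enumerate(x) if j != i)
        total += math.log(fi / ((n - 1) * h))
    return total

random.seed(1)
x = [random.gauss(0, 1) for _ in range(100)]
grid = [0.05 * k for k in range(1, 40)]
h_cv = max(grid, key=lambda h: loo_loglik(x, h))
```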

It is also possible to use cross-validation in combination with MISE. Rewriting
we find that
\[\mathrm{MISE}(h) = E (\| \hat{f}_h\|_2^2) - 2 E (\hat{f}_h(X)) + \|f_0\|_2^2\]
for \(X\) a random variable independent of the data and with distribution having
density \(f_0\). The last term does not depend upon \(h\) and we can ignore it from
the point of view of minimizing MISE. For the first term we have an unbiased
estimate in \(\| \hat{f}_h\|_2^2\). The middle term can be estimated
without bias by
\[\frac{2}{n} \sum_{i=1}^n \hat{f}^{-i}_h,\]
and this leads to the statistic
\[\mathrm{UCV}(h) = \| \hat{f}_h\|_2^2 - \frac{2}{n} \sum_{i=1}^n \hat{f}^{-i}_h\]
known as the unbiased cross-validation criterion. The corresponding
bandwidth estimate is
\[\hat{h}_{\mathrm{UCV}} = \textrm{arg min}_h \ \ \mathrm{UCV}(h).\]
Contrary to the log-likelihood based criterion, this criterion requires
the computation of \(\| \hat{f}_h\|_2^2\). Bandwidth selection using UCV
is implemented in R in the function `bw.ucv()` and can be
used with `density()` by setting the argument `bw = "ucv"`.
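For the Gaussian kernel, \(\|\hat{f}_h\|_2^2\) has a closed form because the convolution of two \(N(0, h^2)\) densities is an \(N(0, 2h^2)\) density, which makes UCV easy to sketch (in Python, for illustration, with a grid search as before):

```python
import math
import random

def gauss(z, s=1.0):
    # N(0, s^2) density evaluated at z
    return math.exp(-z * z / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def ucv(x, h):
    # Unbiased cross-validation criterion for a Gaussian-kernel KDE:
    # ||f_h||_2^2 - (2/n) sum_i f^{-i}_h, with the closed form
    # ||f_h||_2^2 = (1/n^2) sum_ij N(0, 2h^2)-density at x_i - x_j
    n = len(x)
    norm2 = sum(gauss(xi - xj, h * math.sqrt(2)) for xi in x for xj in x) / n**2
    loo = 0.0
    for i, xi in enumerate(x):
        loo += sum(gauss((xi - xj) / h) for j, xj in enumerate(x) if j != i) / ((n - 1) * h)
    return norm2 - 2 * loo / n

random.seed(1)
x = [random.gauss(0, 1) for _ in range(100)]
grid = [0.05 * k for k in range(2, 40)]
h_ucv = min(grid, key=lambda h: ucv(x, h))
```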

### References

Sheather, S. J., and M. C. Jones. 1991. “A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation.” *Journal of the Royal Statistical Society. Series B (Methodological)* 53 (3): 683–90.

Silverman, B. W. 1986. *Density Estimation for Statistics and Data Analysis*. Chapman; Hall/CRC.

Tsybakov, A. B. 2009. *Introduction to Nonparametric Estimation*. Springer Series in Statistics. Springer, New York.