2.3 Bandwidth selection

2.3.1 Revisiting the rectangular kernel

We return to the rectangular kernel and compute the mean squared error. In the analysis it may be helpful to think about $$n$$ large and $$h$$ small. Indeed, we will eventually choose $$h = h_n$$ as a function of $$n$$ such that as $$n \to \infty$$ we have $$h_n \to 0$$. We should also note that $$f_h(x) = E (\hat{f}_h(x))$$ is a density, thus $$\int f_h(x) \mathrm{d}x = 1$$.

We will assume that $$f_0$$ is sufficiently differentiable and use a Taylor expansion of the distribution function $$F_0$$ to get that

\begin{align*} f_h(x) & = \frac{1}{2h}\left(F_0(x + h) - F_0(x - h)\right) \\ & = \frac{1}{2h}\left(2h f_0(x) + \frac{h^3}{3} f_0''(x) + R_0(x,h) \right) \\ & = f_0(x) + \frac{h^2}{6} f_0''(x) + R_1(x,h) \end{align*}

where $$R_1(x, h) = o(h^2)$$. One should note how the quadratic terms in $$h$$ in the Taylor expansion canceled. This gives the following formula for the squared bias of $$\hat{f}_h$$.

\begin{align*} \mathrm{bias}(\hat{f}_h(x))^2 & = (f_h(x) - f_0(x))^2 \\ & = \left(\frac{h^2}{6} f_0''(x) + R_1(x,h) \right)^2 \\ & = \frac{h^4}{36} f_0''(x)^2 + R(x,h) \end{align*}

where $$R(x,h) = o(h^4)$$. For the variance we see from (2.1) that $V(\hat{f}_h(x)) = f_h(x)\frac{1}{2hn} - f_h(x)^2 \frac{1}{n}.$ Integrating the sum of the squared bias and the variance over $$x$$ gives the integrated mean squared error

\begin{align*} \mathrm{MISE}(h) & = \int \mathrm{bias}(\hat{f}_h(x))^2 + V(\hat{f}_h(x)) \mathrm{d}x \\ & = \frac{h^4}{36} \|f_0''\|^2_2 + \frac{1}{2hn} + \int R(x,h) \mathrm{d} x - \frac{1}{n} \int f_h(x)^2 \mathrm{d} x. \end{align*}

If $$f_h(x) \leq C$$ (which happens if $$f_0$$ is bounded), $\int f_h(x)^2 \mathrm{d} x \leq C \int f_h(x) \mathrm{d} x = C,$ and the last term is $$o((nh)^{-1})$$. The second-to-last term is $$o(h^4)$$ if we can interchange the limit and integration order. It is conceivable that we can do so under suitable assumptions on $$f_0$$, but we will not pursue them here. The sum of the two remaining and asymptotically dominating terms in the formula for MISE is
$\mathrm{AMISE}(h) = \frac{h^4}{36} \|f_0''\|^2_2 + \frac{1}{2hn},$ which is known as the asymptotic mean integrated squared error. Clearly, for this to be a useful formula, we must assume $$\|f_0''\|_2^2 < \infty$$. In this case the formula for AMISE can be used to find the asymptotically optimal tradeoff between (integrated) bias and variance. Differentiating w.r.t. $$h$$ we find that $\mathrm{AMISE}'(h) = \frac{h^3}{9} \|f_0''\|^2_2 - \frac{1}{2h^2n},$ and solving for $$\mathrm{AMISE}'(h) = 0$$ yields $h_n = \left(\frac{9}{2 \|f_0''\|_2^2}\right)^{1/5} n^{-1/5}.$ When AMISE is regarded as a function of $$h$$ we observe that it tends to $$\infty$$ for $$h \to 0$$ as well as for $$h \to \infty$$, thus the unique stationary point $$h_n$$ is the unique global minimizer. Choosing the bandwidth to be $$h_n$$ will therefore minimize the asymptotic mean integrated squared error, and it is in this sense an optimal choice of bandwidth.
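The closed-form minimizer is easy to check numerically. The following Python sketch (an illustration only; the value used for $$\|f_0''\|_2^2$$ is arbitrary, since in practice it is unknown) minimizes AMISE over a bounded interval and compares the result with the formula for $$h_n$$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

n = 1000   # sample size
R = 0.5    # illustrative value of ||f0''||_2^2 (unknown in practice)

def amise(h):
    # AMISE(h) for the rectangular kernel
    return h**4 / 36 * R + 1 / (2 * h * n)

# closed-form minimizer derived above
h_opt = (9 / (2 * R))**(1 / 5) * n**(-1 / 5)

# numerical minimizer for comparison
res = minimize_scalar(amise, bounds=(1e-4, 10.0), method="bounded")
print(h_opt, res.x)  # the two minimizers agree
```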

We see how “wiggliness” of $$f_0$$ enters into the formula for the optimal bandwidth $$h_n$$ via $$\|f_0''\|_2$$. This norm of the second derivative is precisely a quantification of how much $$f_0$$ oscillates. A large value, indicating a wiggly $$f_0$$, will drive the optimal bandwidth down whereas a small value will drive the optimal bandwidth up.

We should also observe that if we plug the optimal bandwidth into the formula for AMISE, we get \begin{align*} \mathrm{AMISE}(h_n) & = \frac{h_n^4}{36} \|f_0''\|^2_2 + \frac{1}{2h_n n} \\ & = C n^{-4/5}, \end{align*} which indicates that in terms of integrated mean squared error the rate at which we can nonparametrically estimate $$f_0$$ is $$n^{-4/5}$$. This should be contrasted to the common parametric rate of $$n^{-1}$$ for mean squared error.

From a practical viewpoint there is one major problem with the optimal bandwidth $$h_n$$; it depends via $$\|f_0''\|^2_2$$ upon the unknown $$f_0$$ that we are trying to estimate. We therefore refer to $$h_n$$ as an oracle bandwidth – it is the bandwidth that an oracle that knows $$f_0$$ would tell us to use. In practice, we will have to come up with an estimate of $$\|f_0''\|^2_2$$ and plug that estimate into the formula for $$h_n$$. We pursue a couple of different options for doing so for general kernel density estimators below together with methods that do not rely on the AMISE formula.

2.3.2 ISE, MISE and MSE for kernel estimators

Bandwidth selection for general kernel estimators can be studied asymptotically just as above. To this end it is useful to formalize how we quantify the quality of an estimate $$\hat{f}_h$$. One natural quantification is the integrated squared error, $\mathrm{ISE}(\hat{f}_h) = \int (\hat{f}_h(x) - f_0(x))^2 \ \mathrm{d}x = \|\hat{f}_h - f_0\|_2^2.$

The quality of the estimation procedure producing $$\hat{f}_h$$ from data can then be quantified by taking the mean ISE, $\mathrm{MISE}(h) = E(\mathrm{ISE}(\hat{f}_h)),$ where the expectation integral is over the data. Using Tonelli’s theorem we may interchange the expectation and the integration over $$x$$ to get $\mathrm{MISE}(h) = \int \mathrm{MSE}_h(x) \ \mathrm{d}x$ where $\mathrm{MSE}_h(x) = \mathrm{var}(\hat{f}_h(x)) + \mathrm{bias}(\hat{f}_h(x))^2$ is the pointwise mean squared error.

Using the same kind of Taylor expansion argument as above we can show that if $$K$$ is a square integrable probability density with mean 0 and $\sigma_K^2 = \int z^2 K(z) \ \mathrm{d}z > 0,$ then $\mathrm{MISE}(h) = \mathrm{AMISE}(h) + o((nh)^{-1} + h^4)$ where the asymptotic mean integrated squared error is $\mathrm{AMISE}(h) = \frac{\|K\|_2^2}{nh} + \frac{h^4 \sigma^4_K \|f_0''\|_2^2}{4}$ with $\|g\|_2^2 = \int g(z)^2 \ \mathrm{d}z$ denoting the squared $$L_2$$-norm. Some regularity assumptions on $$f_0$$ are necessary, and from the result we clearly need to require that $$f_0''$$ is meaningful and square integrable. However, that is also enough. See Proposition A.1 in Tsybakov (2009) for a rigorous proof.

By minimizing $$\mathrm{AMISE}(h)$$ we derive the optimal oracle bandwidth

$$\tag{2.3} h_n = \left( \frac{\|K\|_2^2}{ \|f_0''\|^2_2 \sigma_K^4} \right)^{1/5} n^{-1/5}.$$

If we plug this formula into the formula for AMISE we arrive at the asymptotic error rate $$\mathrm{AMISE}(h_n) = C n^{-4/5}$$ with a constant $$C$$ depending on $$f_0''$$ and the kernel. It is noteworthy that the asymptotic analysis can be carried out even if $$K$$ is allowed to take negative values, though the resulting estimate may not itself be a valid density. Tsybakov (2009) demonstrates how to improve on the rate $$n^{-4/5}$$ by allowing for kernels whose moments of order two or above vanish. Necessarily, such kernels must take negative values.

We observe that for the rectangular kernel, $\sigma_K^4 = \left(\frac{1}{2} \int_{-1}^1 z^2 \ \mathrm{d} z\right)^2 = \frac{1}{9}$ and $\|K\|_2^2 = \frac{1}{2^2} \int_{-1}^1 \ \mathrm{d} z = \frac{1}{2}.$ Plugging these numbers into (2.3) we find the oracle bandwidth for the rectangular kernel as derived in Section 2.3.1. For the Gaussian kernel we find that $$\sigma_K^4 = 1$$, while $\|K\|_2^2 = \frac{1}{2 \pi} \int e^{-x^2} \ \mathrm{d} x = \frac{1}{2 \sqrt{\pi}}.$
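These kernel constants can be verified numerically. A small Python sketch, with the kernels and integration limits taken directly from the definitions above:

```python
import numpy as np
from scipy.integrate import quad

def kernel_constants(K, lo, hi):
    # sigma_K^2 = int z^2 K(z) dz and ||K||_2^2 = int K(z)^2 dz
    sigma2 = quad(lambda z: z**2 * K(z), lo, hi)[0]
    norm2 = quad(lambda z: K(z)**2, lo, hi)[0]
    return sigma2**2, norm2  # (sigma_K^4, ||K||_2^2)

rect = lambda z: 0.5  # rectangular kernel on [-1, 1]
gauss = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

print(kernel_constants(rect, -1, 1))             # (1/9, 1/2)
print(kernel_constants(gauss, -np.inf, np.inf))  # (1, 1/(2*sqrt(pi)))
```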

2.3.3 Plug-in estimation of the oracle bandwidth

To compute the quantity $$\|f_0''\|^2_2$$ that enters into the formula for the asymptotically optimal bandwidth, we have to know the density $$f_0$$ that we are trying to estimate in the first place. To resolve the circularity we will make a first guess of what $$f_0$$ is and plug that guess into the formula for the oracle bandwidth.

Our first guess is that $$f_0$$ is Gaussian with mean 0 and variance $$\sigma^2$$. Then

\begin{align*} \|f_0''\|^2_2 & = \frac{1}{2 \pi \sigma^2} \int \left(\frac{x^2}{\sigma^4} - \frac{1}{\sigma^2}\right)^2 e^{-x^2/\sigma^2} \ \mathrm{d}x \\ & = \frac{1}{2 \sigma^9 \sqrt{\pi}} \frac{1}{\sqrt{\pi \sigma^2}} \int (x^4 - 2 \sigma^2 x^2 + \sigma^4) e^{-x^2/\sigma^2} \ \mathrm{d}x \\ & = \frac{1}{2 \sigma^9 \sqrt{\pi}} (\frac{3}{4} \sigma^4 - \sigma^4 + \sigma^4) \\ & = \frac{3}{8 \sigma^5 \sqrt{\pi}}. \end{align*}
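As a numerical sanity check of this derivation, the following Python sketch integrates the squared second derivative of a Gaussian density and compares with the closed-form result; the value of $$\sigma$$ is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.integrate import quad

sigma = 1.7  # arbitrary illustrative value

def f0_dd(x):
    # second derivative of the N(0, sigma^2) density
    f0 = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    return f0 * (x**2 / sigma**4 - 1 / sigma**2)

numeric = quad(lambda x: f0_dd(x)**2, -np.inf, np.inf)[0]
closed = 3 / (8 * sigma**5 * np.sqrt(np.pi))
print(numeric, closed)  # the two values agree
```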

Plugging this expression for the squared 2-norm of the second derivative of the density into the formula for the oracle bandwidth gives

$$\tag{2.4} h_n = \left( \frac{8 \sqrt{\pi} \|K\|_2^2}{3 \sigma_K^4} \right)^{1/5} \sigma n^{-1/5},$$

with $$\sigma$$ the only quantity depending on the unknown density $$f_0$$. We can now simply estimate $$\sigma$$, using e.g. the empirical standard deviation $$\hat{\sigma}$$, and plug this estimate into the formula above to get an estimate of the oracle bandwidth.

It is well known that the empirical standard deviation is sensitive to outliers, and if $$\sigma$$ is overestimated for that reason, the bandwidth will be too large and the resulting estimate will be oversmoothed. To get more robust bandwidth selection, Silverman (1986) suggested using the interquartile range to estimate $$\sigma$$. In fact, he suggested estimating $$\sigma$$ by $\tilde{\sigma} = \min\{\hat{\sigma}, \mathrm{IQR} / 1.34\}.$ In this estimator, IQR denotes the empirical interquartile range, and $$1.34$$ is approximately the interquartile range, $$\Phi^{-1}(0.75) - \Phi^{-1}(0.25)$$, of the standard Gaussian distribution. Curiously, the interquartile range for the standard Gaussian distribution is $$1.35$$ to two decimal places, but the use of $$1.34$$ in the estimator $$\tilde{\sigma}$$ has prevailed. Silverman moreover suggested reducing the kernel-dependent constant in the formula (2.4) for $$h_n$$ to further reduce oversmoothing.

If we specialize to the Gaussian kernel, formula (2.4) simplifies to $\hat{h}_n = \left(\frac{4}{3}\right)^{1/5} \tilde{\sigma} n^{-1/5},$ with $$\tilde{\sigma}$$ plugged in as a robust estimate of $$\sigma$$. Silverman (1986) made the ad hoc suggestion to reduce the factor $$(4/3)^{1/5} \simeq 1.06$$ to $$0.9$$. This results in the bandwidth estimate $\hat{h}_n = 0.9 \tilde{\sigma} n^{-1/5},$ which has become known as Silverman’s rule of thumb.

Silverman’s rule of thumb is the default bandwidth estimator for density() as implemented by the function bw.nrd0(). The bw.nrd() function implements bandwidth selection using the factor $$1.06$$ instead of $$0.9$$. Though the theoretical derivations behind these implementations assume that $$f_0$$ is Gaussian, either will give reasonable bandwidth selection for a range of unimodal distributions. If $$f_0$$ is multimodal, Silverman’s rule of thumb is known to oversmooth the density estimate.
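Although the implementations discussed in the text are in R, the rule of thumb is simple enough to sketch in a few lines of Python. The sketch below mirrors the computation performed by bw.nrd0() (up to details such as R's handling of degenerate samples and quantile conventions):

```python
import numpy as np

def silverman_bw(x):
    # Silverman's rule of thumb for a Gaussian kernel
    x = np.asarray(x, dtype=float)
    n = len(x)
    sd = x.std(ddof=1)                                 # empirical standard deviation
    iqr = np.quantile(x, 0.75) - np.quantile(x, 0.25)  # empirical interquartile range
    sigma_tilde = min(sd, iqr / 1.34)                  # robust scale estimate
    return 0.9 * sigma_tilde * n**(-1 / 5)

rng = np.random.default_rng(0)
x = rng.normal(size=500)
print(silverman_bw(x))
```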

Instead of computing $$\|f_0''\|^2_2$$ assuming that the distribution is Gaussian, we can compute the norm for a pilot estimate, $$\tilde{f}$$, and plug the result into the formula for $$h_n$$. If the pilot estimate is a kernel estimate with kernel $$H$$ and bandwidth $$r$$ we get $\|\tilde{f}''\|^2_2 = \frac{1}{n^2r^6} \sum_{i = 1}^n \sum_{j=1}^n \int H''\left( \frac{x - x_i}{r} \right) H''\left( \frac{x - x_j}{r} \right) \mathrm{d} x.$ The problem is, of course, that now we have to choose the pilot bandwidth $$r$$. But doing so using a simple method like Silverman’s rule of thumb at this stage is typically not too bad an idea. Thus we arrive at the following plug-in procedure using the Gaussian kernel for the pilot estimate:

• Compute an estimate, $$\hat{r}$$, of the pilot bandwidth using Silverman’s rule of thumb.
• Compute $$\|\tilde{f}''\|^2_2$$ using the Gaussian kernel as pilot kernel $$H$$ and using the estimated pilot bandwidth $$\hat{r}$$.
• Plug $$\|\tilde{f}''\|^2_2$$ into the oracle bandwidth formula (2.3) to compute $$\hat{h}_n$$ for the kernel $$K$$.
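The three steps above can be sketched in Python as follows, using a Gaussian kernel both as pilot kernel $$H$$ and as kernel $$K$$; for simplicity, $$\|\tilde{f}''\|^2_2$$ is computed by numerical integration on a grid rather than via the double-sum formula.

```python
import numpy as np

def plugin_bw(x, grid_size=2048):
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Step 1: pilot bandwidth r by Silverman's rule of thumb
    sd = x.std(ddof=1)
    iqr = np.quantile(x, 0.75) - np.quantile(x, 0.25)
    r = 0.9 * min(sd, iqr / 1.34) * n**(-1 / 5)
    # Step 2: ||f~''||_2^2 for the Gaussian-kernel pilot estimate, by
    # integrating the squared second derivative on a grid (Riemann sum)
    t = np.linspace(x.min() - 5 * r, x.max() + 5 * r, grid_size)
    z = (t[:, None] - x[None, :]) / r
    phi = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
    f_dd = ((z**2 - 1) * phi).sum(axis=1) / (n * r**3)  # \tilde f''(t)
    norm2 = (f_dd**2).sum() * (t[1] - t[0])
    # Step 3: plug into the oracle bandwidth formula (2.3); for the
    # Gaussian K we have ||K||_2^2 = 1/(2 sqrt(pi)) and sigma_K^4 = 1
    K_norm2 = 1 / (2 * np.sqrt(np.pi))
    return (K_norm2 / norm2)**(1 / 5) * n**(-1 / 5)

rng = np.random.default_rng(1)
x = rng.normal(size=500)
print(plugin_bw(x))
```

For Gaussian data the result is close to what (2.4) gives directly, as one would hope, since the pilot estimate is then close to a Gaussian density.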

Note that to use a pilot kernel different from the Gaussian we have to adjust the constant 0.9 (or 1.06) in Silverman’s rule of thumb accordingly by computing $$\|H\|_2^2$$ and $$\sigma_H^4$$ and using (2.4).

Sheather and Jones (1991) took these plug-in ideas a step further and analyzed in detail how to choose the pilot bandwidth in a good and data adaptive way. The resulting method is somewhat complicated but implementable. We skip the details but simply observe that their method is implemented in R in the function bw.SJ(), and it can be selected when using density() by setting the argument bw = "SJ". This plug-in method is regarded as a solid default that performs well for many different data generating densities $$f_0$$.

2.3.4 Cross-validation

An alternative to relying on the asymptotic optimality arguments for integrated mean squared error and the corresponding plug-in estimates of the bandwidth is known as cross-validation. The method mimics the idea of setting aside a subset of the data set, which is then not used for computing an estimate but only for validating the estimator’s performance. The benefit of cross-validation over simply setting aside a validation data set is that we do not “waste” any of the data points on validation only. All data points are used for the ultimate computation of the estimate. The drawback is that cross-validation is usually computationally more demanding.

Suppose that $$I_1, \ldots, I_k$$ form a partition of the index set $$\{1, \ldots, n\}$$ and define $I^{-i} = \bigcup_{l: i \not \in I_l} I_l.$ That is, $$I^{-i}$$ contains all indices but those that belong to the set $$I_l$$ containing $$i$$. In particular, $$i \not \in I^{-i}$$. Define also $$n_i = |I^{-i}|$$ and $\hat{f}^{-i}_h = \frac{1}{h n_i} \sum_{j \in I^{-i}} K\left(\frac{x_i - x_j}{h}\right).$ That is, $$\hat{f}^{-i}_h$$ is the kernel density estimate based on data with indices in $$I^{-i}$$ and evaluated at $$x_i$$. Since the density estimate evaluated at $$x_i$$ is not based on $$x_i$$, the quantity $$\hat{f}^{-i}_h$$ can be used to assess how well the density estimate computed using bandwidth $$h$$ agrees with the data point $$x_i$$. This can be summarized using the log-likelihood $\ell_{\mathrm{CV}}(h) = \sum_{i=1}^n \log (\hat{f}^{-i}_h),$ which we will refer to as the cross-validated log-likelihood, and we define the bandwidth estimate as $\hat{h}_{\mathrm{CV}} = \textrm{arg max}_h \ \ \ell_{\mathrm{CV}}(h).$ This cross-validation based bandwidth can then be used for computing kernel density estimates using the entire data set.
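A Python sketch of this criterion with a Gaussian kernel and leave-one-out partitioning; the grid of candidate bandwidths is an illustrative choice.

```python
import numpy as np

def loo_cv_loglik(x, h):
    # cross-validated log-likelihood with leave-one-out partitioning
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = (x[:, None] - x[None, :]) / h
    K = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)  # Gaussian kernel matrix
    np.fill_diagonal(K, 0.0)                    # exclude x_i from \hat f^{-i}_h
    f_loo = K.sum(axis=1) / ((n - 1) * h)       # \hat f^{-i}_h for each i
    return np.log(f_loo).sum()

rng = np.random.default_rng(2)
x = rng.normal(size=300)
grid = np.linspace(0.05, 1.5, 100)
h_cv = max(grid, key=lambda h: loo_cv_loglik(x, h))  # maximize over the grid
print(h_cv)
```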

If the partition of indices consists of $$k$$ subsets we usually talk about $$k$$-fold cross-validation. If $$k = n$$, so that all subsets consist of just a single index, we talk about leave-one-out cross-validation. For leave-one-out cross-validation there is only one possible partition, while for $$k < n$$ there are many possible partitions. Which one should we choose? In practice, we choose the partition by sampling indices randomly without replacement into $$k$$ sets of size roughly $$n / k$$.

It is also possible to use cross-validation in combination with MISE. Rewriting, we find that $\mathrm{MISE}(h) = E (\| \hat{f}_h\|_2^2) - 2 E (\hat{f}_h(X)) + \|f_0\|_2^2$ for $$X$$ a random variable independent of the data and with distribution having density $$f_0$$. The last term does not depend upon $$h$$ and we can ignore it from the point of view of minimizing MISE. For the first term we have an unbiased estimate in $$\| \hat{f}_h\|_2^2$$. The middle term can be estimated without bias by $\frac{2}{n} \sum_{i=1}^n \hat{f}^{-i}_h,$ and this leads to the statistic $\mathrm{UCV}(h) = \| \hat{f}_h\|_2^2 - \frac{2}{n} \sum_{i=1}^n \hat{f}^{-i}_h$ known as the unbiased cross-validation criterion. The corresponding bandwidth estimate is $\hat{h}_{\mathrm{UCV}} = \textrm{arg min}_h \ \ \mathrm{UCV}(h).$ Contrary to the log-likelihood based criterion, this criterion requires the computation of $$\| \hat{f}_h\|_2^2$$. Bandwidth selection using UCV is implemented in R in the function bw.ucv() and can be used with density() by setting the argument bw = "ucv".
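For the Gaussian kernel, $$\| \hat{f}_h\|_2^2$$ has a closed form because the convolution of two Gaussian densities is again Gaussian. The following Python sketch exploits this, using leave-one-out estimates for the middle term; the candidate grid is an illustrative choice.

```python
import numpy as np

def ucv(x, h):
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = (x[:, None] - x[None, :]) / h
    # ||\hat f_h||_2^2 = (1 / (n^2 h)) * sum_{i,j} g(d_ij), where g is the
    # N(0, 2) density, i.e. the Gaussian kernel convolved with itself
    g = np.exp(-d**2 / 4) / np.sqrt(4 * np.pi)
    norm2 = g.sum() / (n**2 * h)
    # leave-one-out estimates \hat f^{-i}_h
    K = np.exp(-d**2 / 2) / np.sqrt(2 * np.pi)
    np.fill_diagonal(K, 0.0)
    f_loo = K.sum(axis=1) / ((n - 1) * h)
    return norm2 - 2 * f_loo.mean()

rng = np.random.default_rng(3)
x = rng.normal(size=300)
grid = np.linspace(0.05, 1.5, 100)
h_ucv = min(grid, key=lambda h: ucv(x, h))  # minimize UCV over the grid
print(h_ucv)
```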

References

Sheather, S. J., and M. C. Jones. 1991. “A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation.” Journal of the Royal Statistical Society. Series B (Methodological) 53 (3): 683–90.

Silverman, B. W. 1986. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC.

Tsybakov, A. B. 2009. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York.