8.3 Fisher information

For statistics relying on classical asymptotic theory we need an estimate of the Fisher information, e.g. the observed Fisher information (the Hessian of the negative log-likelihood for the observed data). For numerical optimization of \(Q\), or for variants of the EM algorithm (such as EM gradient or acceleration methods), the gradient and Hessian of \(Q\) can be useful. However, these do not directly inform us about the Fisher information. In this section we show some useful relations between the derivatives of the log-likelihood for the observed data and the derivatives of \(Q\), with the primary purpose of estimating the Fisher information.

First we look at the peppered moth example, where we note that with \(p = p(\theta)\) being some parametrization of the cell probabilities, \[Q(\theta \mid \theta') = \sum_{k=1}^K \frac{x_{j(k)} p_k(\theta')}{M(p(\theta'))_{j(k)}} \log p_k(\theta),\] where \(j(k)\) is defined by \(k \in A_{j(k)}\). The gradient of \(Q\) w.r.t. \(\theta\) is therefore

\[\nabla_{\theta} Q(\theta \mid \theta') = \sum_{k = 1}^K \frac{x_{j(k)} p_k(\theta')}{M(p(\theta'))_{j(k)} p_k(\theta)} \nabla_{\theta} p_k(\theta).\]

We recognize from the computations in Section 7.2.4 that when we evaluate \(\nabla_{\theta} Q(\theta \mid \theta')\) at \(\theta = \theta'\) we get

\[\nabla_{\theta} Q(\theta' \mid \theta') = \sum_{k = 1}^K \frac{x_{j(k)}}{M(p(\theta'))_{j(k)}} \nabla_{\theta} p_k(\theta') = \nabla_{\theta} \ell(\theta'),\]

thus the gradient of \(\ell\) at \(\theta'\) is identical to the gradient of \(Q(\cdot \mid \theta')\) at \(\theta'\). This is not a coincidence, and it holds generally that \[\nabla_{\theta} Q(\theta' \mid \theta') = \nabla_{\theta} \ell(\theta').\] This follows from the fact, derived in the proof of Theorem 8.1, that \(\theta'\) minimizes

\[\theta \mapsto \ell(\theta) - Q(\theta \mid \theta').\]

Since this map is differentiable and minimized at the interior point \(\theta'\), its gradient, \(\nabla_{\theta} \ell(\theta') - \nabla_{\theta} Q(\theta' \mid \theta')\), vanishes there, which is exactly the identity above. Another way to phrase this is that the minorant of \(\ell\) touches \(\ell\) tangentially at \(\theta'\).

In the case where the observed data consist of \(n\) i.i.d. observations from the model with parameter \(\theta_0\), both \(\ell\) and \(Q(\cdot \mid \theta')\) are sums over the observations, and the gradient identity above holds for each term. In particular, \[\nabla_{\theta} \ell(\theta_0) = \sum_{i=1}^n \nabla_{\theta} \ell_i(\theta_0) = \sum_{i=1}^n \nabla_{\theta} Q_i(\theta_0 \mid \theta_0),\] and using the second Bartlett identity

\[\mathcal{I}(\theta_0) = V_{\theta_0}(\nabla_{\theta} \ell(\theta_0))\]

we see that

\[\hat{\mathcal{I}}(\theta_0) = \sum_{i=1}^n \big(\nabla_{\theta} Q_i(\theta_0 \mid \theta_0) - n^{-1} \nabla_{\theta} \ell(\theta_0)\big)\big(\nabla_{\theta} Q_i(\theta_0 \mid \theta_0) - n^{-1} \nabla_{\theta} \ell(\theta_0)\big)^T\]

is almost an unbiased estimator of the Fisher information: its mean is \(\frac{n-1}{n} \mathcal{I}(\theta_0)\) due to the centering (and exactly \(\mathcal{I}(\theta_0)\) without it), but it is not an estimator as \(\theta_0\) is not known. Using a plug-in estimator, \(\hat{\theta}\), of \(\theta_0\) we get a real estimator

\[\hat{\mathcal{I}} = \hat{\mathcal{I}}(\hat{\theta}) = \sum_{i=1}^n \big(\nabla_{\theta} Q_i(\hat{\theta} \mid \hat{\theta}) - n^{-1} \nabla_{\theta} \ell(\hat{\theta})\big)\big(\nabla_{\theta} Q_i(\hat{\theta} \mid \hat{\theta}) - n^{-1} \nabla_{\theta} \ell(\hat{\theta})\big)^T,\]

though \(\hat{\mathcal{I}}\) is not necessarily unbiased.

We refer to \(\hat{\mathcal{I}}\) as the empirical Fisher information given by the estimator \(\hat{\theta}\). In most cases, \(\hat{\theta}\) is the maximum-likelihood estimator, in which case \(\nabla_{\theta} \ell(\hat{\theta}) = 0\) and the empirical Fisher information simplifies to \[\hat{\mathcal{I}} = \sum_{i=1}^n \nabla_{\theta} Q_i(\hat{\theta} \mid \hat{\theta}) \nabla_{\theta} Q_i(\hat{\theta} \mid \hat{\theta})^T.\] However, \(\nabla_{\theta} \ell(\hat{\theta})\) is in practice only approximately equal to zero, and it is not obvious whether the centering term should be dropped.
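Before specializing to the peppered moths, here is a minimal generic sketch of the computation. It assumes that the per-observation gradients \(\nabla_{\theta} Q_i(\hat{\theta} \mid \hat{\theta})\) have already been collected as the rows of an \(n \times p\) matrix; the function name emp_fisher and the argument scores are ours and purely illustrative.

emp_fisher <- function(scores, center = FALSE) {
  ## 'scores' is an n x p matrix whose i-th row is the gradient of Q_i
  ## (equivalently of the i-th log-likelihood term) evaluated at the estimate
  if (center)
    scores <- sweep(scores, 2, colMeans(scores))  ## subtract the average gradient
  crossprod(scores)  ## equals the sum of outer products of the rows
}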

For the peppered moths, where data are collected on \(n\) individual specimens sampled i.i.d. and tabulated according to phenotype, we implement the empirical Fisher information with an option to center the gradients before computing the information estimate. Since only three different phenotypes can be observed, only three different terms can occur in the sum. The implementation therefore works directly on the tabulated data by computing the three possible terms and forming a weighted sum according to the number of times each term occurs.

empFisher <- function(par, x, grad, center = FALSE) {
  grad_MLE <- 0  ## the centering term; the score is (approximately) 0 at the MLE
  if (center)
    grad_MLE <- grad(par, x) / sum(x)
  ## The three possible per-observation gradients, one for each phenotype
  grad1 <- grad(par, c(1, 0, 0)) - grad_MLE
  grad2 <- grad(par, c(0, 1, 0)) - grad_MLE
  grad3 <- grad(par, c(0, 0, 1)) - grad_MLE
  ## Weighted sum of outer products according to the tabulated counts
  x[1] * t(grad1) %*% grad1 + 
    x[2] * t(grad2) %*% grad2 + 
    x[3] * t(grad3) %*% grad3 
}

We test the implementation with and without centering, and we compare the result to a numerically computed Hessian using optimHess (it is possible to make optim compute the Hessian numerically as a final step of the optimization, but optimHess does this computation separately).

## The gradient of the log-likelihood, which by the identity above equals
## the gradient of Q, was implemented earlier as 'grad_loglik'.
grad <- function(par, x) grad_loglik(par, x, prob, Dprob, c(1, 1, 1, 2, 2, 3))
empFisher(phat, c(85, 196, 341), grad)
##           [,1]     [,2]
## [1,] 18487.558 1384.626
## [2,]  1384.626 6816.612
empFisher(phat, c(85, 196, 341), grad, center = TRUE)
##           [,1]     [,2]
## [1,] 18487.558 1384.626
## [2,]  1384.626 6816.612
optimHess(phat, loglik, grad_loglik, x = c(85, 196, 341), 
          prob = prob, Dprob = Dprob, group = c(1, 1, 1, 2, 2, 3))
##           [,1]     [,2]
## [1,] 18490.938 1384.629
## [2,]  1384.629 6816.769

Note that the numerically computed Hessian (the observed Fisher information) and the empirical Fisher information are different estimates of the same quantity. They are therefore not supposed to be identical on a given data set, only to be similar.
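As a small follow-up, reusing the objects from above, the two information estimates can also be compared on the scale of standard errors, that is, the square roots of the diagonal entries of the inverted information matrices; the two sets of standard errors should be similar.

iE <- empFisher(phat, c(85, 196, 341), grad)
iO <- optimHess(phat, loglik, grad_loglik, x = c(85, 196, 341), 
          prob = prob, Dprob = Dprob, group = c(1, 1, 1, 2, 2, 3))
sqrt(diag(solve(iE)))  ## standard errors based on the empirical Fisher information
sqrt(diag(solve(iO)))  ## standard errors based on the numerically computed Hessian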

An alternative to the empirical Fisher information, or to a direct computation of the observed Fisher information, is supplemented EM (SEM). This is a general method for computing the observed Fisher information that relies only on EM steps and a numerical differentiation scheme. Define the EM map \(\Phi : \Theta \to \Theta\) by

\[\Phi(\theta') = \textrm{arg max}_{\theta} \ Q(\theta \mid \theta').\]

A global maximum of the likelihood is a fixed point of \(\Phi\), and the EM algorithm searches for a fixed point of \(\Phi\), that is, a solution to

\[\Phi(\theta) = \theta.\]
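To make the fixed-point perspective concrete, the EM algorithm can be written as a plain fixed-point iteration reusing the EStep and MStep implementations from earlier; the same composition is wrapped into the function Phi below. This is a minimal sketch; the starting value and the number of iterations are arbitrary.

p0 <- c(0.3, 0.3)  ## arbitrary admissible starting value
for (i in 1:20)
  p0 <- MStep(EStep(p0, x = c(85, 196, 341)))
p0  ## numerically close to a fixed point of the EM map, i.e., to the MLE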

Variations of the EM algorithm can often be seen as other ways to find a fixed point of \(\Phi\). From \[\ell(\theta) = Q(\theta \mid \theta') + H(\theta \mid \theta')\] it follows that the observed Fisher information equals

\[\hat{i}_X := - D^2_{\theta} \ell(\hat{\theta}) = \underbrace{-D^2_{\theta} Q(\hat{\theta} \mid \theta')}_{=\ \hat{i}_Y(\theta')} - \underbrace{D^2_{\theta} H(\hat{\theta} \mid \theta')}_{=\ \hat{i}_{Y \mid X}(\theta')}.\]

It is possible to compute \(\hat{i}_Y := \hat{i}_Y(\hat{\theta})\); for the peppered moths (and for exponential families in general) it is as difficult as computing the Fisher information for complete observations.

We want to compute \(\hat{i}_X\), but it is not directly computable, and neither is \(\hat{i}_{Y \mid X} := \hat{i}_{Y \mid X}(\hat{\theta})\). It can, however, be shown that

\[D_{\theta} \Phi(\hat{\theta})^T = \hat{i}_{Y\mid X} \left(\hat{i}_Y\right)^{-1}.\]

Hence \[\begin{align} \hat{i}_X & = \left(I - \hat{i}_{Y\mid X} \left(\hat{i}_Y\right)^{-1}\right) \hat{i}_Y \\ & = \left(I - D_{\theta} \Phi(\hat{\theta})^T\right) \hat{i}_Y. \end{align}\]

Though the EM map \(\Phi\) might not have a simple analytic expression, its Jacobian, \(D_{\theta} \Phi(\hat{\theta})\), can be computed via numerical differentiation once \(\Phi\) has been implemented. We also need the Hessian of \(Q\), so we implement \(Q\) as an R function as well.

## Implementation of Q(p | pp) for the peppered moths, up to a sign: the function
## returns -Q, so that the Hessian computed below is i_Y directly.
Q <- function(p, pp, x = c(85, 196, 341), group) {
  p[3] <- 1 - p[1] - p[2]    ## the third parameter is one minus the first two
  pp[3] <- 1 - pp[1] - pp[2]
  - (x[group] * prob(pp) / M(prob(pp), group)[group]) %*% log(prob(p))
}

The R package numDeriv contains functions that compute numerical derivatives.

library(numDeriv)

The Hessian of the R function Q above, which equals \(\hat{i}_Y\), can be computed using this package.

iY <- hessian(Q, phat, pp = phat, group = c(1, 1, 1, 2, 2, 3))
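As a sanity check of the identity \(\nabla_{\theta} Q(\theta' \mid \theta') = \nabla_{\theta} \ell(\theta')\) from the beginning of this section, we can compare a numerical gradient of the R function Q to grad_loglik at an arbitrary test point (ptest below is ours and purely illustrative). Assuming, as the positive definite Hessian computed earlier suggests, that loglik and grad_loglik implement the negative log-likelihood and its gradient, both R functions are negated versions, and the two printed gradients should be (nearly) equal.

ptest <- c(0.05, 0.15)  ## arbitrary test point, not the MLE
numDeriv::grad(Q, ptest, pp = ptest, group = c(1, 1, 1, 2, 2, 3))
grad_loglik(ptest, c(85, 196, 341), prob, Dprob, c(1, 1, 1, 2, 2, 3))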

Supplemented EM can then be implemented by computing the Jacobian of \(\Phi\) using numDeriv as well.

Phi <- function(pp) MStep(EStep(pp, x = c(85, 196, 341)))  ## EStep and MStep from earlier
DPhi <- jacobian(Phi, phat)  ## Using numDeriv function 'jacobian'
iX <- (diag(1, 2) - t(DPhi)) %*% iY
iX
##           [,1]     [,2]
## [1,] 18487.558 1384.626
## [2,]  1384.626 6816.612

For statistical inference we actually need the inverse Fisher information, which can be computed by inverting \(\hat{i}_X\), but we also have the following interesting identity

\[\begin{align} \hat{i}_X^{-1} & = \hat{i}_Y^{-1} \left(I - D_{\theta} \Phi(\hat{\theta})^T\right)^{-1} \\ & = \hat{i}_Y^{-1} \left(I + \sum_{n=1}^{\infty} \left(D_{\theta} \Phi(\hat{\theta})^T\right)^n \right) \\ & = \hat{i}_Y^{-1} + \hat{i}_Y^{-1} D_{\theta} \Phi(\hat{\theta})^T \left(I - D_{\theta} \Phi(\hat{\theta})^T\right)^{-1} \end{align}\]

where the second equality follows from the Neumann series; the series converges because the eigenvalues of \(D_{\theta} \Phi(\hat{\theta})\) lie in \([0, 1)\) at a nondegenerate local maximum of the likelihood.

The last formula above explicitly gives the asymptotic variance for the incomplete observation \(X\) as the asymptotic variance for the complete observation \(Y\) plus a correction term.

iYinv <- solve(iY)
iYinv + iYinv %*% t(solve(diag(1, 2) - DPhi, DPhi))
##               [,1]          [,2]
## [1,]  5.492602e-05 -1.115686e-05
## [2,] -1.115686e-05  1.489667e-04
solve(iX) ## SEM-based, but different use of inversion
##               [,1]          [,2]
## [1,]  5.492602e-05 -1.115686e-05
## [2,] -1.115686e-05  1.489667e-04
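The Neumann series used in the derivation can also be illustrated numerically: summing a modest number of powers of \(D_{\theta} \Phi(\hat{\theta})^T\) already approximates \(\left(I - D_{\theta} \Phi(\hat{\theta})^T\right)^{-1}\) well when the eigenvalues of \(D_{\theta} \Phi(\hat{\theta})\) are clearly below one. This is a small illustration only; the number of terms is arbitrary.

A <- t(DPhi)
S <- An <- diag(1, 2)
for (n in 1:10) {   ## truncated Neumann series I + A + ... + A^10
  An <- An %*% A
  S <- S + An
}
S
solve(diag(1, 2) - A)  ## exact inverse; should be close to the truncated series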

The SEM implementation above relies on the hessian and jacobian functions from the numDeriv package for numerical differentiation.

It is possible to compute the Hessian of \(Q\) analytically for the peppered moths, but to illustrate the functionality of the numDeriv package we computed it numerically above.

Variants of the strategy for computing \(D_{\theta} \Phi(\hat{\theta})\) via numerical differentiation have been suggested in the literature, specifically difference quotient approximations along the sequence of EM steps. This is not going to work as well as standard numerical differentiation: the method does not control the numerical errors, and when the algorithm gets sufficiently close to the MLE, the parameter increments become so small that rounding errors dominate the difference quotients.
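A small illustration of the problem, and not a recommended procedure: tracking the parameter increments along the EM iterations of Phi shows that they shrink geometrically, so difference quotients based on late iterates divide quantities that approach the rounding-error level. The starting value and the number of iterations are arbitrary.

p0 <- c(0.3, 0.3)        ## arbitrary admissible starting value
steps <- numeric(25)
for (i in 1:25) {
  p1 <- Phi(p0)
  steps[i] <- max(abs(p1 - p0))  ## size of the i-th EM increment
  p0 <- p1
}
steps  ## the increments decrease geometrically towards the rounding-error level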