6.2 Multinomial models

The multinomial model is the model of all probability distributions on a finite set \(\mathcal{Y} = \{1, \ldots, K\}\). The model is parametrized by the simplex \[\Delta_K = \left\{(p_1, \ldots, p_K)^T \in \mathbb{R}^K \mid p_k \geq 0, \sum_{k=1}^K p_k = 1\right\}.\] The distributions parametrized by the relative interior of \(\Delta_K\) form an exponential family via the parametrization \[p_k = \frac{e^{\theta_k}}{\sum_{l=1}^K e^{\theta_l}} \in (0,1)\] for \((\theta_1, \ldots, \theta_K)^T \in \mathbb{R}^K.\) That is, the sufficient statistic is \(k \mapsto (\delta_{1k}, \ldots, \delta_{Kk})^T \in \mathbb{R}^{K}\) (where \(\delta_{lk} \in \{0, 1\}\) is the Kronecker delta, being 1 if and only if \(l = k\)), and \[\varphi(\theta_1, \ldots, \theta_K) = \sum_{l=1}^K e^{\theta_l}.\] We call this exponential family parametrization the symmetric parametrization. The canonical parameter in the symmetric parametrization is not identifiable, and to resolve the lack of identifiability there is a tradition of fixing the last parameter as \(\theta_K = 0\). This results in a canonical parameter \(\theta = (\theta_1, \ldots, \theta_{K-1})^T \in \mathbb{R}^{K-1},\) a sufficient statistic \(t(k) = (\delta_{1k}, \ldots, \delta_{(K-1)k})^T \in \mathbb{R}^{K-1}\),

\[\varphi(\theta) = 1 + \sum_{l=1}^{K-1} e^{\theta_l}\]

and probabilities

\[p_k = \left\{\begin{array}{ll} \frac{e^{\theta_k}}{1 + \sum_{l=1}^{K-1} e^{\theta_l}} & \quad \text{if } k = 1, \ldots, K-1 \\ \frac{1}{1 + \sum_{l=1}^{K-1} e^{\theta_l}} & \quad \text{if } k = K. \end{array}\right.\]

We see that in this parametrization \(p_k = e^{\theta_k}p_K\) for \(k = 1, \ldots, K-1\), where the probability of the last element \(K\) acts as a baseline or reference probability, and the factors \(e^{\theta_k}\) act as multiplicative modifications of this baseline. Moreover, \[\frac{p_k}{p_l} = e^{\theta_k - \theta_l},\] which is independent of the chosen baseline.
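As a small R illustration of this parametrization (the function name `theta2p` is ours), the canonical parameter is mapped to the probability vector as follows.

```r
## Map theta in R^{K-1} to (p_1, ..., p_K) with theta_K = 0,
## so the last cell acts as the baseline
theta2p <- function(theta) {
  e <- exp(c(theta, 0))  # exp(theta_K) = 1 for the baseline cell
  e / sum(e)
}

theta <- c(1, -0.5, 0)   # K = 4, arbitrary parameter values
p <- theta2p(theta)
sum(p)                   # 1
p[1:3] / p[4]            # equals exp(theta), the multiplicative factors
```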

In the special case of \(K = 2\) the two elements of \(\mathcal{Y}\) are often given other labels than \(1\) and \(2\). The most common labels are \(\{0, 1\}\) and \(\{-1, 1\}\). If we use the \(0\)-\(1\)-labels the convention is to use \(p_0\) as the baseline, and thus \[p_1 = \frac{e^{\theta}}{1 + e^{\theta}} = e^{\theta} p_0 = e^{\theta} (1 - p_1),\] parametrized by \(\theta \in \mathbb{R}\). As this function of \(\theta\) is known as the logistic function, this parametrization of the probability distributions on \(\{0,1\}\) is often referred to as the logistic model. From the above we see directly that \[\theta = \log \frac{p_1}{1 - p_1}\] is the log-odds.
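The logistic function and its inverse, the log-odds, are available in R as `plogis` and `qlogis`, so the relations above can be verified numerically. A minimal sketch:

```r
theta <- 1.5                          # an arbitrary canonical parameter
p1 <- exp(theta) / (1 + exp(theta))   # probability of the label 1
all.equal(p1, plogis(theta))          # the logistic function in R
all.equal(theta, log(p1 / (1 - p1)))  # theta is the log-odds
all.equal(theta, qlogis(p1))          # qlogis is the inverse logistic
```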

If we use the \(\pm 1\)-labels, an alternative exponential family parametrization is \[p_k = \frac{e^{k\theta}}{e^\theta + e^{-\theta}}\] for \(\theta \in \mathbb{R}\) and \(k \in \{-1, 1\}\). This gives a symmetric treatment of the two labels while retaining the identifiability.

With i.i.d. observations \(y_1, \ldots, y_n\) from a distribution in the multinomial model we find that the log-likelihood is

\[\begin{align*} \ell(\theta) & = \sum_{i=1}^n \sum_{k=1}^K \delta_{k y_i} \log(p_k(\theta)) \\ & = \sum_{k=1}^K \underbrace{\sum_{i=1}^n \delta_{k y_i}}_{= n_k} \log(p_k(\theta)) \\ & = \theta^T \mathbf{n} - n \log \left(1 + \sum_{l=1}^{K-1} e^{\theta_l} \right). \end{align*}\]

where \(\mathbf{n} = (n_1, \ldots, n_{K-1})^T = \sum_{i=1}^n t(y_i)\) is the sufficient statistic, which is simply a tabulation of the number of times the elements \(1, \ldots, K-1\) were observed. The count \(n_K = n - n_1 - \ldots - n_{K-1}\) of the last element is determined by the others.
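As a small numerical check with hypothetical counts, the log-likelihood computed cell by cell agrees with the expression in terms of the sufficient statistic.

```r
## Hypothetical counts for K = 3 cells and a canonical parameter
## theta in R^{K-1} with theta_K = 0 by convention
counts <- c(13, 7, 20)   # (n_1, n_2, n_3), n = 40
theta <- c(0.2, -0.4)
p <- exp(c(theta, 0)) / sum(exp(c(theta, 0)))
## Cell by cell: sum_k n_k log(p_k)
ll_direct <- sum(counts * log(p))
## Via the sufficient statistic: theta^T n - n log(1 + sum_l exp(theta_l))
ll_suff <- sum(theta * counts[1:2]) - sum(counts) * log(1 + sum(exp(theta)))
all.equal(ll_direct, ll_suff)  # TRUE
```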

6.2.1 Peppered Moths

This example concerns the color of the peppered moth (Birkemåler in Danish). The color of the moth is primarily determined by one gene that occurs in three different alleles, denoted C, I and T. The alleles are ordered in terms of dominance as C > I > T. Moths with a genotype including C are dark. Moths with genotype TT are light colored. Moths with genotypes II and IT are mottled. Thus there are a total of six different genotypes (CC, CI, CT, II, IT and TT) and three different phenotypes (dark, mottled and light colored).

The peppered moth provided an early demonstration of evolution in 19th-century England, where the light colored moth came to be outnumbered by the dark colored variety. The dark color became dominant due to increased pollution: soot darkened the trees, which gave the dark moths a selective advantage. There is a nice collection of moths in different colors at the Danish Zoological Museum, together with further explanation of the role the moth played in understanding evolution.

We denote the allele frequencies by \(p_C\), \(p_I\) and \(p_T\) with \(p_C + p_I + p_T = 1.\) According to the Hardy-Weinberg principle, the genotype frequencies are then

\[p_C^2, 2p_Cp_I, 2p_Cp_T, p_I^2, 2p_Ip_T, p_T^2.\]

If we could observe the genotypes, the complete multinomial log-likelihood would be

\[\begin{align*} & 2n_{CC} \log(p_C) + n_{CI} \log (2 p_C p_I) + n_{CT} \log(2 p_C p_T) \\ & \ \ + 2 n_{II} \log(p_I) + n_{IT} \log(2p_I p_T) + 2 n_{TT} \log(p_T) \\ & = 2n_{CC} \log(p_C) + n_{CI} \log (2 p_C p_I) + n_{CT} \log(2 p_C (1 - p_C - p_I)) \\ & \ \ + 2 n_{II} \log(p_I) + n_{IT} \log(2p_I (1 - p_C - p_I)) + 2 n_{TT} \log(1 - p_C - p_I). \end{align*}\]

The log-likelihood is given in terms of the genotype counts and the two probability parameters \(p_C, p_I \geq 0\) with \(p_C + p_I \leq 1\), and on the interior of this parameter set the model is a curved exponential family.
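Although the genotype counts are unobserved, it can be instructive to see the complete-data negative log-likelihood in code. This is a sketch; the function name `loglik_complete` and the ordering of the count vector are our choices.

```r
## Complete-data negative log-likelihood in (pC, pI), given the
## (unobservable) genotype counts n = (nCC, nCI, nCT, nII, nIT, nTT)
loglik_complete <- function(par, n) {
  pC <- par[1]
  pI <- par[2]
  pT <- 1 - pC - pI
  if (pC <= 0 || pI <= 0 || pT <= 0)
    return(Inf)
  - (2 * n[1] * log(pC) + n[2] * log(2 * pC * pI) +
     n[3] * log(2 * pC * pT) + 2 * n[4] * log(pI) +
     n[5] * log(2 * pI * pT) + 2 * n[6] * log(pT))
}
```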

Collecting moths and determining their color will, however, only identify their phenotype, not their genotype. Thus we observe \((n_C, n_I, n_T)\), where \[n = \underbrace{n_{CC} + n_{CI} + n_{CT}}_{= n_C} + \underbrace{n_{II} + n_{IT}}_{=n_I} + \underbrace{n_{TT}}_{=n_T}.\]

For maximum-likelihood estimation of the parameters in this model from the observation \((n_C, n_I, n_T)\), we need the likelihood based on the marginal distribution of the observed phenotype counts.

The peppered moth example is an example of cell collapsing in a multinomial model. In general, let \(A_1 \cup \ldots \cup A_{K_0} = \{1, \ldots, K\}\) be a partition and let \[M : \mathbb{N}_0^K \to \mathbb{N}_0^{K_0}\] be the map given by \[M((n_1, \ldots, n_K))_j = \sum_{k \in A_j} n_k.\]

If \(Y \sim \textrm{Mult}(p, n)\) with \(p = (p_1, \ldots, p_K)\) then \[X = M(Y) \sim \textrm{Mult}(M(p), n).\] If \(\theta \mapsto p(\theta)\) is a parametrization of the cell probabilities the log-likelihood under the collapsed multinomial model is generally

\[\begin{equation} \ell(\theta) = \sum_{j = 1}^{K_0} x_j \log (M(p(\theta))_j) = \sum_{j = 1}^{K_0} x_j \log \left(\sum_{k \in A_j} p_k(\theta)\right). \tag{6.2} \end{equation}\]
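Before specializing to the peppered moths, a small simulation can illustrate the distributional claim that \(M(Y)\) is again multinomial. This is a sketch with arbitrarily chosen \(p\) and partition; the empirical frequencies of the collapsed cells should be close to \(M(p)\).

```r
set.seed(1)                              # arbitrary seed
p <- c(0.1, 0.2, 0.15, 0.25, 0.2, 0.1)   # arbitrary cell probabilities
group <- c(1, 1, 2, 2, 3, 3)             # partition A_1, A_2, A_3
## 1000 replications of Y ~ Mult(p, 100), one per column
Y <- rmultinom(1000, size = 100, prob = p)
## Collapse the cells in each replication
X <- apply(Y, 2, function(y) tapply(y, group, sum))
rowMeans(X) / 100       # empirical collapsed cell frequencies
tapply(p, group, sum)   # M(p) = (0.3, 0.4, 0.3); should be close
```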

For the peppered moths, \(K = 6\) corresponding to the six genotypes, \(K_0 = 3\), and the partition corresponding to the three phenotypes is \[\{1, 2, 3\} \cup \{4, 5\} \cup \{6\} = \{1, \ldots, 6\},\] so that \[M(n_1, \ldots, n_6) = (n_1 + n_2 + n_3, n_4 + n_5, n_6).\]

In terms of the \((p_C, p_I)\) parametrization, \(p_T = 1 - p_C - p_I\) and \[p = (p_C^2, 2p_Cp_I, 2p_Cp_T, p_I^2, 2p_Ip_T, p_T^2).\]

Hence \[M(p) = (p_C^2 + 2p_Cp_I + 2p_Cp_T, p_I^2 +2p_Ip_T, p_T^2).\]

The log-likelihood is

\[\begin{align} \ell(p_C, p_I) & = n_C \log(p_C^2 + 2p_Cp_I + 2p_Cp_T) + n_I \log(p_I^2 +2p_Ip_T) + n_T \log (p_T^2). \end{align}\]

We can implement the log-likelihood in a very problem-specific way. Note how the parameter constraints are encoded via the return value \(\infty\).

```r
## par = c(pC, pI), pT = 1 - pC - pI
## x is the data vector of length 3 of counts 
loglik <- function(par, x) {
  pT <- 1 - par[1] - par[2]
  
  if (par[1] > 1 || par[1] < 0 || par[2] > 1 
      || par[2] < 0 || pT < 0)
    return(Inf)
  
  PC <- par[1]^2 + 2 * par[1] * par[2] + 2 * par[1] * pT
  PI <- par[2]^2 + 2 * par[2] * pT
  PT <- pT^2
  ## The function returns the negative log-likelihood 
  - (x[1] * log(PC) + x[2] * log(PI) + x[3] * log(PT))
}
```

It is possible to use `optim` in R with just this implementation to compute the maximum-likelihood estimate of the allele parameters.

```r
optim(c(0.3, 0.3), loglik, x = c(85, 196, 341))
## $par
## [1] 0.07084643 0.18871900
## 
## $value
## [1] 600.481
## 
## $counts
## function gradient 
##       71       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
```

The `optim` function uses the Nelder-Mead algorithm by default, which relies on log-likelihood evaluations only. It is slow but fairly robust, though a bit of thought has to go into the choice of initial parameters.

```r
optim(c(0, 0), loglik, x = c(85, 196, 341))
## Error in optim(c(0, 0), loglik, x = c(85, 196, 341)): function 
## cannot be evaluated at initial parameters
```

Starting the algorithm at a boundary point where the negative log-likelihood attains the value \(\infty\) does not work.

The computations can beneficially be implemented in greater generality. The function `M` below sums the counts within each group of collapsed cells, with the grouping specified by the `group` argument.


```r
M <- function(x, group)
  as.vector(tapply(x, group, sum))
```

The log-likelihood is then implemented for multinomial cell collapsing via `M` and two problem-specific arguments to the `loglik` function. One of these is a vector specifying the grouping structure of the collapsing. The other is a function that determines the parametrization, mapping the parameter that we are optimizing over to the cell probabilities. In the implementation it is assumed that this `prob` function in addition encodes parameter constraints by returning `NULL` if the constraints are violated.

```r
loglik <- function(par, x, prob, group, ...) {
  p <- prob(par)
  if (is.null(p)) return(Inf)
  - sum(x * log(M(p, group)))
}
```

The function `prob` is implemented specifically for the peppered moths as follows.

```r
prob <- function(p) {
  p[3] <- 1 - p[1] - p[2]
  if (p[1] > 1 || p[1] < 0 || p[2] > 1 || p[2] < 0 || p[3] < 0)
    return(NULL)
  c(p[1]^2, 2 * p[1] * p[2], 2 * p[1] * p[3], 
    p[2]^2, 2 * p[2] * p[3], p[3]^2)
}
```

We test that the new implementation, when optimized, gives the same result as the more problem-specific implementation.

```r
optim(c(0.3, 0.3), loglik, x = c(85, 196, 341), 
      prob = prob, group = c(1, 1, 1, 2, 2, 3))
## $par
## [1] 0.07084643 0.18871900
## 
## $value
## [1] 600.481
## 
## $counts
## function gradient 
##       71       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
```

Once we have found an estimate of the parameters, we can compute a prediction of the unobserved genotype counts from the phenotype counts using the conditional distribution of the genotypes given the phenotypes. This is straightforward, as the conditional distribution of \(Y_{A_j} = (Y_k)_{k \in A_j}\) given \(X = x\) is also a multinomial distribution; \[Y_{A_j} \mid X = x \sim \textrm{Mult}\left( \frac{p_{A_j}}{M(p)_j}, x_j \right).\] The probability parameters are simply \(p\) restricted to \(A_j\) and renormalized to a probability vector. Observe that this gives \[E (Y_k \mid X = x) = \frac{x_j p_k}{M(p)_j}\] for \(k \in A_j\). Using the estimated parameters and the `M` function implemented above, we can compute a prediction of the genotype counts as follows.

```r
x <- c(85, 196, 341)
group <- c(1, 1, 1, 2, 2, 3)
p <- prob(c(0.07084643, 0.18871900))
x[group] * p / M(p, group)[group]
## [1]   3.121549  16.630211  65.248241  22.154520 173.845480 341.000000
```
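As a sanity check, collapsing the predicted genotype counts recovers the observed phenotype counts, since within each group the predictions sum to the observed count by construction.

```r
## Collapsing the predictions recovers the phenotype counts
## (85, 196, 341) up to numerical error
M(x[group] * p / M(p, group)[group], group)
```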

This is of interest in itself, but computing these conditional expectations will also be central for the EM algorithm in Chapter 8.