## 6.2 Multinomial models

The multinomial model is the model of all probability distributions on
a finite set \(\mathcal{Y} = \{1, \ldots, K\}\). The model is parametrized
by the simplex
\[\Delta_K = \left\{(p_1, \ldots, p_K)^T \in \mathbb{R}^K \mid p_k \geq 0, \sum_{k=1}^K p_k = 1\right\}.\]
The distributions parametrized by the relative interior of \(\Delta_K\) form
an exponential family by the parametrization
\[p_k = \frac{e^{\theta_k}}{\sum_{l=1}^K e^{\theta_l}} \in (0,1)\]
for \((\theta_1, \ldots, \theta_K)^T \in \mathbb{R}^K.\) That is,
the sufficient statistic is \(k \mapsto (\delta_{1k}, \ldots, \delta_{Kk})^T \in \mathbb{R}^{K}\)
(where \(\delta_{lk} \in \{0, 1\}\) is the Kronecker delta, being 1 if and only
if \(l = k\)), and
\[\varphi(\theta_1, \ldots, \theta_K) = \sum_{l=1}^K e^{\theta_l}.\]
We call this exponential family parametrization the *symmetric* parametrization.
The canonical parameter in the symmetric parametrization is not identifiable,
and to resolve the lack of identifiability
there is a tradition of fixing the last parameter as \(\theta_K = 0\). This
results in a canonical parameter \(\theta = (\theta_1, \ldots, \theta_{K-1})^T \in \mathbb{R}^{K-1},\)
a sufficient statistic \(t_1(k) = (\delta_{1k}, \ldots, \delta_{(K-1)k})^T \in \mathbb{R}^p\),

\[\varphi(\theta) = 1 + \sum_{l=1}^{K-1} e^{\theta_l}\]

and probabilities

\[p_k = \left\{\begin{array}{ll} \frac{e^{\theta_k}}{1 + \sum_{l=1}^{K-1} e^{\theta_l}} & \quad \text{if } k = 1, \ldots, K-1 \\ \frac{1}{1 + \sum_{l=1}^{K-1} e^{\theta_l}} & \quad \text{if } k = K. \end{array}\right.\]

We see that in this parametrization \(p_k = e^{\theta_k}p_K\) for \(k = 1, \ldots, K-1\), where the probability of the last element \(K\) acts as a baseline or reference probability, and the factors \(e^{\theta_k}\) act as multiplicative modifications of this baseline. Moreover, \[\frac{p_k}{p_l} = e^{\theta_k - \theta_l},\] which is independent of the chosen baseline.
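As a small numerical illustration (the function name `theta_to_p` is ours, not from the text), the reduced parametrization can be evaluated by appending \(\theta_K = 0\) and normalizing:

```r
## Map theta in R^(K-1) to the probability vector (p_1, ..., p_K)
## in the parametrization with theta_K fixed at 0
theta_to_p <- function(theta) {
  e <- exp(c(theta, 0))  # exp(theta_K) = exp(0) = 1 is the baseline term
  e / sum(e)
}

p <- theta_to_p(c(1, -0.5))  # K = 3
p[1] / p[2]                  # equals exp(theta_1 - theta_2) = exp(1.5)
```

The last line illustrates that the probability ratios only depend on differences of canonical parameters.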

In the special case of \(K = 2\) the two elements of \(\mathcal{Y}\) are often given
other labels than \(1\) and \(2\). The most common labels are \(\{0, 1\}\) and \(\{-1, 1\}\).
If we use the \(0\)-\(1\)-labels the convention is to use \(p_0\) as the baseline and
thus
\[p_1 = \frac{e^{\theta}}{1 + e^{\theta}} = e^{\theta} p_0 = e^{\theta} (1 - p_1)\]
parametrized by \(\theta \in \mathbb{R}\). This function of \(\theta\) is
known as the *logistic function*, and this parametrization of the probability
distributions on \(\{0,1\}\) is therefore often referred to as the logistic model. From the
above we see directly that
\[\theta = \log \frac{p_1}{1 - p_1}\]
is the log-odds.

If we use the \(\pm 1\)-labels, an alternative exponential family parametrization is \[p_k = \frac{e^{k\theta}}{e^\theta + e^{-\theta}}\] for \(\theta \in \mathbb{R}\) and \(k \in \{-1, 1\}\). This gives a symmetric treatment of the two labels while retaining the identifiability.
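Multiplying numerator and denominator by \(e^{-\theta}\) shows that \(p_1 = 1/(1 + e^{-2\theta})\), so the \(\pm 1\)-parametrization is the logistic model with the canonical parameter scaled by a factor of 2. A quick numerical check (the function name is ours):

```r
## p_k = exp(k * theta) / (exp(theta) + exp(-theta)) for k in {-1, 1}
p_pm <- function(theta, k) exp(k * theta) / (exp(theta) + exp(-theta))

theta <- 0.7
all.equal(p_pm(theta, 1), 1 / (1 + exp(-2 * theta)))  # TRUE
```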

With i.i.d. observations from a multinomial distribution we find that the log-likelihood is

\[\begin{align*} \ell(\theta) & = \sum_{i=1}^n \sum_{k=1}^K \delta_{k y_i} \log(p_k(\theta)) \\ & = \sum_{k=1}^K \underbrace{\sum_{i=1}^n \delta_{k y_i}}_{= n_k} \log(p_k(\theta)) \\ & = \theta^T \mathbf{n} - n \log \left(1 + \sum_{l=1}^{K-1} e^{\theta_l} \right). \end{align*}\]

where \(\mathbf{n} = (n_1, \ldots, n_{K-1})^T = \sum_{i=1}^n t_1(y_i)\) is the sufficient statistic; the counts \(n_1, \ldots, n_K\) simply tabulate how many times the different elements of \(\{1, \ldots, K\}\) were observed.
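The identity between the two expressions for the log-likelihood is easy to check numerically; the sketch below (simulated data, our own variable names) computes it both as a sum over observations and via the counts:

```r
## Check that the log-likelihood computed from individual observations
## agrees with the expression in terms of the counts n_1, ..., n_(K-1)
set.seed(1)
K <- 4
theta <- c(0.5, -0.2, 1)                 # canonical parameter, theta_K = 0
p <- exp(c(theta, 0)) / sum(exp(c(theta, 0)))
y <- sample(K, 100, replace = TRUE, prob = p)
n_k <- tabulate(y, nbins = K)            # tabulation of observed elements

loglik_obs <- sum(log(p[y]))             # sum over observations
loglik_suf <- sum(theta * n_k[1:(K - 1)]) -
  length(y) * log(1 + sum(exp(theta)))   # via the sufficient statistic
all.equal(loglik_obs, loglik_suf)        # TRUE
```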

### 6.2.1 Peppered Moths

This example concerns the color of the peppered moth (*Birkemåler* in Danish).
The color of the moth is
primarily determined by one gene that occurs in three different alleles denoted C,
I and T. The alleles are ordered in terms of dominance as C > I > T. Moths with a genotype
including C are dark. Moths
with genotype TT are light colored. Moths with genotypes II and IT are mottled.
Thus there are a total of six different genotypes (CC, CI, CT, II, IT and TT) and
three different phenotypes (dark, mottled and light colored).

The peppered moth provided an early demonstration of evolution in 19th-century England, where the light colored variety came to be outnumbered by the dark colored variety. The dark color became dominant due to the increased pollution, which darkened the trees with soot, and for that reason conferred a selective advantage. There is a nice collection of moths in different colors at the Danish Zoological Museum, together with further explanation of the role the moth played in understanding evolution.

We denote the allele frequencies \(p_C\), \(p_I\), \(p_T\) with \(p_C + p_I + p_T = 1.\) According to the Hardy-Weinberg principle, the genotype frequencies are then

\[p_C^2, 2p_Cp_I, 2p_Cp_T, p_I^2, 2p_Ip_T, p_T^2.\]
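In R the Hardy-Weinberg map from allele frequencies to genotype probabilities can be sketched as follows (the function name `hw_prob` is our own); the six probabilities sum to \((p_C + p_I + p_T)^2 = 1\):

```r
## Genotype probabilities (CC, CI, CT, II, IT, TT) under Hardy-Weinberg
## from the allele frequencies pC and pI, with pT = 1 - pC - pI
hw_prob <- function(pC, pI) {
  pT <- 1 - pC - pI
  c(pC^2, 2 * pC * pI, 2 * pC * pT, pI^2, 2 * pI * pT, pT^2)
}

sum(hw_prob(0.2, 0.3))  # 1
```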

If we could observe the genotypes, the complete multinomial log-likelihood would be

\[\begin{align*} & 2n_{CC} \log(p_C) + n_{CI} \log (2 p_C p_I) + n_{CT} \log(2 p_C p_T) \\ & \ \ + 2 n_{II} \log(p_I) + n_{IT} \log(2p_I p_T) + 2 n_{TT} \log(p_T) \\ & = 2n_{CC} \log(p_C) + n_{CI} \log (2 p_C p_I) + n_{CT} \log(2 p_C (1 - p_C - p_I)) \\ & \ \ + 2 n_{II} \log(p_I) + n_{IT} \log(2p_I (1 - p_C - p_I)) + 2 n_{TT} \log(1 - p_C - p_I). \end{align*}\]

The log-likelihood is given in terms of the genotype counts and the two probability parameters \(p_C, p_I \geq 0\) with \(p_C + p_I \leq 1\), and on the interior of this parameter set the model is a curved exponential family.

Collecting moths and determining their color will, however, only identify their phenotype, not their genotype. Thus we observe \((n_C, n_I, n_T)\), where \[n = \underbrace{n_{CC} + n_{CI} + n_{CT}}_{= n_C} + \underbrace{n_{IT} + n_{II}}_{=n_I} + \underbrace{n_{TT}}_{=n_T}.\]

For maximum-likelihood estimation of the parameters in this model from the observation \((n_C, n_I, n_T)\), we need the likelihood in the marginal distribution of the observed variables.

The peppered moth example is an instance of *cell collapsing* in a multinomial model.
In general, let \(A_1 \cup \ldots \cup A_{K_0} = \{1, \ldots, K\}\) be a partition and let
\[M : \mathbb{N}_0^K \to \mathbb{N}_0^{K_0}\]
be the map given by
\[M((n_1, \ldots, n_K))_j = \sum_{k \in A_j} n_k.\]

If \(Y \sim \textrm{Mult}(p, n)\) with \(p = (p_1, \ldots, p_K)\) then \[X = M(Y) \sim \textrm{Mult}(M(p), n).\] If \(\theta \mapsto p(\theta)\) is a parametrization of the cell probabilities the log-likelihood under the collapsed multinomial model is generally

\[\begin{equation} \ell(\theta) = \sum_{j = 1}^{K_0} x_j \log (M(p(\theta))_j) = \sum_{j = 1}^{K_0} x_j \log \left(\sum_{k \in A_j} p_k(\theta)\right). \tag{6.2} \end{equation}\]

For the peppered moths, \(K = 6\) corresponding to the six genotypes, \(K_0 = 3\) and the partition corresponding to the phenotypes is \[\{1, 2, 3\} \cup \{4, 5\} \cup \{6\} = \{1, \ldots, 6\},\] and \[M(n_1, \ldots, n_6) = (n_1 + n_2 + n_3, n_4 + n_5, n_6).\]

In terms of the \((p_C, p_I)\) parametrization, \(p_T = 1 - p_C - p_I\) and \[p = (p_C^2, 2p_Cp_I, 2p_Cp_T, p_I^2, 2p_Ip_T, p_T^2).\]

Hence \[M(p) = (p_C^2 + 2p_Cp_I + 2p_Cp_T, p_I^2 +2p_Ip_T, p_T^2).\]
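A quick numerical check at an arbitrary interior point confirms that collapsing the six genotype probabilities by the phenotype partition reproduces this expression (the `group` vector below encodes the partition):

```r
pC <- 0.2; pI <- 0.3; pT <- 1 - pC - pI
p <- c(pC^2, 2 * pC * pI, 2 * pC * pT, pI^2, 2 * pI * pT, pT^2)
group <- c(1, 1, 1, 2, 2, 3)   # phenotype class of each genotype
Mp <- as.vector(tapply(p, group, sum))
all.equal(Mp, c(pC^2 + 2 * pC * pI + 2 * pC * pT,
                pI^2 + 2 * pI * pT, pT^2))  # TRUE
```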

The log-likelihood is

\[\begin{align} \ell(p_C, p_I) & = n_C \log(p_C^2 + 2p_Cp_I + 2p_Cp_T) + n_I \log(p_I^2 +2p_Ip_T) + n_T \log (p_T^2). \end{align}\]

We can implement the log-likelihood in a very problem-specific way. Note how the parameter constraints are encoded via the return value \(\infty\).

```r
## par = c(pC, pI), pT = 1 - pC - pI
## x is the data vector of length 3 of counts
loglik <- function(par, x) {
  pT <- 1 - par[1] - par[2]
  if (par[1] > 1 || par[1] < 0 || par[2] > 1
      || par[2] < 0 || pT < 0)
    return(Inf)
  PC <- par[1]^2 + 2 * par[1] * par[2] + 2 * par[1] * pT
  PI <- par[2]^2 + 2 * par[2] * pT
  PT <- pT^2
  ## The function returns the negative log-likelihood
  - (x[1] * log(PC) + x[2] * log(PI) + x[3] * log(PT))
}
```

It is possible to use `optim` in R with just this implementation to
compute the maximum-likelihood estimate of the allele parameters.
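A call of the following form can be used (the starting point \((0.3, 0.3)\) is our choice; any interior point of the parameter set will do):

```r
## Nelder-Mead minimization of the negative log-likelihood;
## x holds the observed phenotype counts
optim(c(0.3, 0.3), loglik, x = c(85, 196, 341))
```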

```
## $par
## [1] 0.07084643 0.18871900
##
## $value
## [1] 600.481
##
## $counts
## function gradient
## 71 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
```

The `optim` function uses by default an algorithm called Nelder-Mead, which
relies on log-likelihood evaluations only. It
is slow but fairly robust, though a bit of thought has to go into the initial
parameter choice.

```
## Error in optim(c(0, 0), loglik, x = c(85, 196, 341)): function
## cannot be evaluated at initial parameters
```

Starting the algorithm in a boundary point where the negative log-likelihood attains
the value \(\infty\) does not work.
The computations can beneficially be implemented in greater
generality. The function `M` sums the cells that are collapsed,
which has to be specified by the `group` argument.
```r
M <- function(x, group)
  as.vector(tapply(x, group, sum))
```

The log-likelihood is then implemented for multinomial cell
collapsing via `M` and two problem-specific arguments to
the `loglik` function. One of these is a vector specifying
the grouping structure of the collapsing. The other is a function
that determines the
parametrization, mapping the parameter that we are optimizing over to the
cell probabilities. In the implementation it is assumed
that this `prob` function in addition encodes parameter constraints
by returning `NULL` if the constraints are violated.

```r
loglik <- function(par, x, prob, group, ...) {
  p <- prob(par)
  if (is.null(p)) return(Inf)
  - sum(x * log(M(p, group)))
}
```

The function `prob` is implemented specifically for the peppered moths
as follows.

```r
prob <- function(p) {
  p[3] <- 1 - p[1] - p[2]
  if (p[1] > 1 || p[1] < 0 || p[2] > 1 || p[2] < 0 || p[3] < 0)
    return(NULL)
  c(p[1]^2, 2 * p[1] * p[2], 2 * p[1] * p[3],
    p[2]^2, 2 * p[2] * p[3], p[3]^2)
}
```

We test that the new implementation, when optimized, gives the same result as the more problem-specific implementation.

```
## $par
## [1] 0.07084643 0.18871900
##
## $value
## [1] 600.481
##
## $counts
## function gradient
## 71 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
```

Once we have found an estimate of the parameters, we can compute a prediction
of the unobserved genotype counts from the phenotype counts using
the conditional distribution of the genotypes given the phenotypes.
This is straightforward as the conditional distribution of \(Y_{A_j} = (Y_k)_{k \in A_j}\),
conditionally on \(X\) is also a multinomial distribution;
\[Y_{A_j} \mid X = x \sim \textrm{Mult}\left( \frac{p_{A_j}}{M(p)_j}, x_j \right).\]
The probability parameters are simply \(p\) restricted to \(A_j\) and renormalized
to a probability vector. Observe that this gives
\[E (Y_k \mid X = x) = \frac{x_j p_k}{M(p)_j}\]
for \(k \in A_j\). Using the estimated parameters and the `M` function implemented
above, we can compute a prediction of the genotype counts as follows.

```r
x <- c(85, 196, 341)
group <- c(1, 1, 1, 2, 2, 3)
p <- prob(c(0.07084643, 0.18871900))
x[group] * p / M(p, group)[group]
```

`## [1] 3.121549 16.630211 65.248241 22.154520 173.845480 341.000000`
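As a sanity check, and continuing from the code above, the predicted genotype counts within each phenotype class sum to the observed phenotype counts, since the conditional expectations over \(A_j\) sum to \(x_j\):

```r
pred <- x[group] * p / M(p, group)[group]
as.vector(tapply(pred, group, sum))  # recovers x = c(85, 196, 341)
```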

This is of interest in itself, but computing these conditional expectations will also be central for the EM algorithm in Chapter 8.