6.4 Finite mixture models

A finite mixture model is a model of a pair of random variables \((Y, Z)\) with \(Z \in \{1, \ldots, K\}\), \(P(Z = k) = p_k\), and the conditional distribution of \(Y\) given \(Z = k\) having density \(f_k( \cdot \mid \theta_k)\). The joint density is then \[(y, k) \mapsto f_k(y \mid \theta_k) p_k,\] and the marginal density for the distribution of \(Y\) is \[f(y \mid \theta) = \sum_{k=1}^K f_k(y \mid \theta_k) p_k\] where \(\theta\) is the vector of all parameters. We say that the model has \(K\) mixture components and call \(f_k( \cdot \mid \theta_k)\) the mixture distributions and \(p_k\) the mixture weights.
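
As a concrete illustration of the weighted-sum formula, the small R sketch below evaluates a marginal mixture density from a vector of weights and a list of component density functions. The function name dmix and the example components are our own choices for illustration.

## Marginal mixture density as a weighted sum of component densities.
## 'p' is a vector of K mixture weights and 'densities' a list of K densities.
dmix <- function(y, p, densities) {
  f <- 0
  for (k in seq_along(p))
    f <- f + p[k] * densities[[k]](y)
  f
}
## Example: a two-component Gaussian mixture with equal weights
dmix(0, c(0.5, 0.5), list(function(y) dnorm(y, -0.5, 1),
                          function(y) dnorm(y, 4, 2)))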

The main use of mixture models is in situations where \(Z\) is not observed. In practice, only \(Y\) is observed, and parameter estimation then has to be based on the marginal distribution of \(Y\) with density \(f(\cdot \mid \theta)\), which is a weighted sum of the mixture distributions.

The set of all probability measures on \(\{1, \ldots, K\}\) is an exponential family with sufficient statistic \[\tilde{t}_0(k) = (\delta_{1k}, \delta_{2k}, \ldots, \delta_{(K-1)k}) \in \mathbb{R}^{K-1},\] canonical parameter \(\alpha = (\alpha_1, \ldots, \alpha_{K-1}) \in \mathbb{R}^{K-1}\) and \[p_k = \frac{e^{\alpha_k}}{1 + \sum_{l=1}^{K-1} e^{\alpha_l}}\] for \(k = 1, \ldots, K-1\), while \(p_K = \left(1 + \sum_{l=1}^{K-1} e^{\alpha_l}\right)^{-1}\).
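
To make this parametrization concrete, the R sketch below computes all \(K\) mixture weights from a canonical parameter \(\alpha \in \mathbb{R}^{K-1}\). The helper name alpha_to_p is our own choice and not part of the model specification.

## Mixture weights from the (K - 1)-dimensional canonical parameter alpha.
## The K-th category is the reference with p_K = 1 / (1 + sum(exp(alpha))).
alpha_to_p <- function(alpha) {
  e <- exp(alpha)
  c(e, 1) / (1 + sum(e))
}
alpha_to_p(c(0, 0, 0))  ## uniform weights 1/4 when K = 4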

When each \(f_k\) is an exponential family as well, with sufficient statistic \(\tilde{t}_k : \mathcal{Y} \to \mathbb{R}^{p_k}\) and canonical parameter \(\theta_k\), we bundle \(\alpha, \theta_1, \ldots, \theta_K\) into \(\theta\) and define \[t_1(k) = \left(\begin{array}{c} \tilde{t}_0(k) \\ 0 \\ 0 \\ \vdots \\ 0 \end{array} \right) \quad \text{and} \quad t_2(y \mid k) = \left(\begin{array}{c} 0 \\ \delta_{1k} \tilde{t}_1(y) \\ \delta_{2k} \tilde{t}_2(y) \\ \vdots \\ \delta_{Kk} \tilde{t}_K(y) \end{array} \right).\] With these definitions we have an exponential family for the joint distribution of \((Y, Z)\) with the \(p = K - 1 + p_1 + \ldots + p_K\)-dimensional canonical parameter \(\theta\), where the sufficient statistic \(t_1\) determines the marginal distribution of \(Z\) and the sufficient statistic \(t_2\) determines the conditional distribution of \(Y\) given \(Z\). We have here made the conditioning variable explicit.

The marginal density of \(Y\) in the exponential family parametrization then becomes \[f(y \mid \theta) = \sum_{k=1}^K e^{\theta^T (t_1(k) + t_2(y \mid k)) - \log \varphi_1(\theta) - \log \varphi_2(\theta \mid k)}.\] For small \(K\) it is usually unproblematic to implement the computation of the marginal density using the formula above, and the computation of derivatives can likewise be implemented based on the formulas derived in Section 6.1.3.

6.4.1 Gaussian mixtures

A Gaussian mixture model is a mixture model where all the mixture distributions are
Gaussian distributions but potentially with different means and variances. In this section we focus on the simplest Gaussian mixture model with \(K = 2\) mixture components.

When \(K = 2\), the Gaussian mixture model is parametrized by five parameters: the mixture weight \(p = P(Z = 1) \in (0, 1)\), the two means \(\mu_1, \mu_2 \in \mathbb{R}\), and the two standard deviations \(\sigma_1, \sigma_2 > 0\). This is not the parametrization using canonical exponential family parameters, which we return to below. First we will simply simulate random variables from this model.

sigma1 <- 1
sigma2 <- 2
mu1 <- -0.5
mu2 <- 4
p <- 0.5
n <- 5000
z <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(p, 1 - p))
## Conditional simulation from the mixture components
y <- numeric(n)
n1 <- sum(z)
y[z] <- rnorm(n1, mu1, sigma1)
y[!z] <- rnorm(n - n1, mu2, sigma2)

The simulation above generated 5000 samples from a two-component Gaussian mixture model with mixture distributions \(\mathcal{N}(-0.5, 1)\) and \(\mathcal{N}(4, 4)\), each with mixture weight \(0.5\). This gives a bimodal distribution, as illustrated by the histogram in Figure 6.1.

yy <- seq(-3, 11, 0.1)
dens <- p * dnorm(yy, mu1, sigma1) + (1 - p) * dnorm(yy, mu2, sigma2)
library(ggplot2)
ggplot(mapping = aes(y, after_stat(density))) + 
  geom_histogram(bins = 20) + 
  geom_line(aes(yy, dens), color = "blue", size = 1) + 
  stat_density(bw = "SJ", geom = "line", color = "red", size = 1)

Figure 6.1: Histogram and density estimate (red) of data simulated from a two-component Gaussian mixture distribution. The true mixture density is superimposed in blue.

It is possible to give a mathematically different representation of the marginal distribution of \(Y\) that is sometimes useful. It yields the same marginal distribution from the same components, but it provides a different interpretation of what a mixture model is a model of, as well as a different way of simulating from a mixture distribution.

If we let \(Y_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)\) and \(Y_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)\) be independent, and independent of \(Z\), we can define

\[\begin{equation} Y = 1(Z = 1) Y_1 + 1(Z = 2) Y_2 = Y_{Z}. \tag{6.3} \end{equation}\]

Clearly, the conditional distribution of \(Y\) given \(Z = k\) has density \(f_k(\cdot \mid \theta_k)\). From (6.3) it follows directly that \[E(Y^n) = P(Z = 1) E(Y_1^n) + P(Z = 2) E(Y_2^n) = p m_1(n) + (1 - p) m_2(n)\] where \(m_k(n)\) denotes the \(n\)th non-central moment of the \(k\)th mixture component. In particular,

\[E(Y) = p \mu_1 + (1 - p) \mu_2.\]

The variance can be found from the second moment as

\[\begin{align} V(Y) & = p(\mu_1^2 + \sigma_1^2) + (1-p)(\mu_2^2 + \sigma_2^2) - (p \mu_1 + (1 - p) \mu_2)^2 \\ & = p\sigma_1^2 + (1 - p) \sigma_2^2 + p(1-p)(\mu_1^2 + \mu_2^2 - 2 \mu_1 \mu_2). \end{align}\]

While it is certainly possible to derive this formula by other means, using (6.3) gives a simple argument based on elementary properties of expectations and the independence of \((Y_1, Y_2)\) and \(Z\). Note also that \(\mu_1^2 + \mu_2^2 - 2 \mu_1 \mu_2 = (\mu_1 - \mu_2)^2\), so the variance of the mixture exceeds the weighted average of the component variances whenever \(\mu_1 \neq \mu_2\).

The construction via (6.3) has an interpretation that differs from how a mixture model was defined in the first place. Though \((Y, Z)\) has the correct joint distribution, \(Y\) is by (6.3) the result of \(Z\) selecting one out of two possible observations. The difference is best illustrated by an example. Suppose that we have a large population consisting entirely of married couples. We can draw a sample of individuals from this population (ignoring the marriage relations completely) and let \(Z\) denote the sex and \(Y\) the height of the individual. Then \(Y\) follows a mixture distribution with two components corresponding to males and females according to the definition. At the risk of being heteronormative, suppose that all couples consist of one male and one female. We could then also draw a sample of married couples, and for each couple flip a coin to decide whether to report the male’s or the female’s height. This corresponds to the construction of \(Y\) by (6.3). We nevertheless get the same marginal mixture distribution of heights.

Arguably the heights of the two individuals in a marriage are not independent, but this is actually immaterial. Any dependence structure between \(Y_1\) and \(Y_2\) is lost in the transformation (6.3), and we can just as well assume they are independent for mathematical convenience. We won’t be able to tell the difference from observing only \(Y\) (and \(Z\)) anyway.

We illustrate below how (6.3) can be used to implement alternative ways of simulating from the mixture model. We compare empirical means and variances to the theoretical values to check that all the implementations actually simulate from the Gaussian mixture model.

## Mean
mu <- p * mu1 + (1 - p) * mu2

## Variance 
sigmasq <- p * sigma1^2 + (1 - p) * sigma2^2 + 
  p * (1-p)*(mu1^2 + mu2^2 - 2 * mu1 * mu2)

## Simulation using the selection formulation via 'ifelse'
y2 <- ifelse(z, rnorm(n, mu1, sigma1), rnorm(n, mu2, sigma2))

## Yet another way of simulating from a mixture model
## using arithmetic instead of 'ifelse' for the selection 
## and with Y_1 and Y_2 actually being dependent
y3 <- rnorm(n)
y3 <- z * (sigma1 * y3 + mu1) + (!z) * (sigma2 * y3 + mu2)

## Returning to the definition again, this last method simulates conditionally 
## from the mixture components via transformation of an underlying Gaussian
## variable with mean 0 and variance 1
y4 <- rnorm(n)
y4[z] <- sigma1 * y4[z] + mu1
y4[!z] <- sigma2 * y4[!z] + mu2

## Tests
data.frame(mean = c(mu, mean(y), mean(y2), mean(y3), mean(y4)),
           variance = c(sigmasq, var(y), var(y2), var(y3), var(y4)), 
           row.names = c("true", "conditional", "ifelse", "arithmetic", 
                         "conditional2"))
##                  mean variance
## true         1.750000 7.562500
## conditional  1.728195 7.381843
## ifelse       1.713098 7.513963
## arithmetic   1.708420 7.566312
## conditional2 1.700052 7.463941

In terms of run time there is not a big difference between three of the four ways of simulating from a mixture model. A benchmark study (not shown) reveals that the first and third methods are comparable in run time and slightly faster than the fourth, while the second, using ifelse, takes more than twice as long as the others. This is unsurprising, as the ifelse method takes (6.3) very literally and generates twice the number of Gaussian variables actually needed.
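
Such a benchmark could, for instance, be set up as sketched below using the bench package; the use of bench::mark and the labels are our own choices, and the exact numbers will vary between systems.

## Sketch of a run time comparison of the four simulation methods.
## check = FALSE because the methods return different random numbers.
bench::mark(
  conditional = {
    y1 <- numeric(n)
    n1 <- sum(z)
    y1[z] <- rnorm(n1, mu1, sigma1)
    y1[!z] <- rnorm(n - n1, mu2, sigma2)
    y1
  },
  ifelse = ifelse(z, rnorm(n, mu1, sigma1), rnorm(n, mu2, sigma2)),
  arithmetic = {
    y3 <- rnorm(n)
    z * (sigma1 * y3 + mu1) + (!z) * (sigma2 * y3 + mu2)
  },
  conditional2 = {
    y4 <- rnorm(n)
    y4[z] <- sigma1 * y4[z] + mu1
    y4[!z] <- sigma2 * y4[!z] + mu2
    y4
  },
  check = FALSE
)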

The marginal density of \(Y\) is \[f(y) = p \frac{1}{\sqrt{2 \pi \sigma_1^2}} e^{-\frac{(y - \mu_1)^2}{2 \sigma_1^2}} + (1 - p)\frac{1}{\sqrt{2 \pi \sigma_2^2}}e^{-\frac{(y - \mu_2)^2}{2 \sigma_2^2}}\] when expressed in terms of the parameters \(p\), \(\mu_1\), \(\mu_2\), \(\sigma_1^2\), and \(\sigma_2^2\).
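
This density is straightforward to implement. The sketch below (the function name dmix_gauss is our own) simply wraps the computation already used for the blue curve in Figure 6.1, and the last line checks that it reproduces those density values.

## Two-component Gaussian mixture density in the (p, mu, sigma) parametrization
dmix_gauss <- function(y, p, mu1, mu2, sigma1, sigma2)
  p * dnorm(y, mu1, sigma1) + (1 - p) * dnorm(y, mu2, sigma2)

## Agrees with the density values computed for Figure 6.1
range(dmix_gauss(yy, p, mu1, mu2, sigma1, sigma2) - dens)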

Returning to the canonical parameters we see that they are given as follows: \[\theta_1 = \log \frac{p}{1 - p}, \quad \theta_2 = \frac{\mu_1}{\sigma_1^2}, \quad \theta_3 = \frac{1}{2\sigma_1^2}, \quad \theta_4 = \frac{\mu_2}{\sigma_2^2}, \quad \theta_5 = \frac{1}{2\sigma_2^2}.\]
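
These formulas translate directly into R. The helper below (the name canonical_par is our own choice) maps \((p, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2)\) to the canonical parameter vector \(\theta\), which we compute here for the parameters used in the simulation.

## Canonical parameters from (p, mu1, mu2, sigma1^2, sigma2^2)
canonical_par <- function(p, mu1, mu2, sigma1sq, sigma2sq)
  c(log(p / (1 - p)), 
    mu1 / sigma1sq, 1 / (2 * sigma1sq),
    mu2 / sigma2sq, 1 / (2 * sigma2sq))

theta <- canonical_par(p, mu1, mu2, sigma1^2, sigma2^2)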

The joint density in this parametrization then becomes \[(y,k) \mapsto \left\{ \begin{array}{ll} \exp\left(\theta_1 + \theta_2 y - \theta_3 y^2 - \log (1 + e^{\theta_1}) - \frac{ \theta_2^2}{4\theta_3}+ \frac{1}{2} \log(\theta_3) \right) & \quad \text{if } k = 1 \\ \exp\left(\theta_4 y - \theta_5 y^2 - \log (1 + e^{\theta_1}) - \frac{\theta_4^2}{4\theta_5} + \frac{1}{2}\log(\theta_5) \right) & \quad \text{if } k = 2 \end{array} \right. \] and the marginal density for \(Y\) is

\[\begin{align} f(y \mid \theta) & = \exp\left(\theta_1 + \theta_2 y - \theta_3 y^2 - \log (1 + e^{\theta_1}) - \frac{ \theta_2^2}{4\theta_3}+ \frac{1}{2} \log(\theta_3) \right) \\ & + \exp\left(\theta_4 y - \theta_5 y^2 - \log (1 + e^{\theta_1}) - \frac{\theta_4^2}{4\theta_5} + \frac{1}{2}\log(\theta_5) \right). \end{align}\]
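
This formula can be implemented as sketched below (the function name dmix_canonical is our own), using the vector theta computed above. Note that, as written, the formula gives the density with respect to a base measure that absorbs the constant \(1/\sqrt{\pi}\) from the Gaussian normalization, so when comparing with the density in the \((p, \mu, \sigma)\) parametrization we divide by \(\sqrt{\pi}\).

## Marginal density of Y in the canonical parametrization
dmix_canonical <- function(y, theta)
  exp(theta[1] + theta[2] * y - theta[3] * y^2 - log(1 + exp(theta[1])) -
        theta[2]^2 / (4 * theta[3]) + 0.5 * log(theta[3])) +
  exp(theta[4] * y - theta[5] * y^2 - log(1 + exp(theta[1])) -
        theta[4]^2 / (4 * theta[5]) + 0.5 * log(theta[5]))

## Sanity check against the (p, mu, sigma) parametrization; the division
## by sqrt(pi) accounts for the base measure
range(dmix_canonical(yy, theta) / sqrt(pi) - dens)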

There is no apparent benefit to the canonical parametrization when considering the marginal density. It is, however, of value when we need the logarithm of the joint density as we will for the EM-algorithm.