8.4 Revisiting Gaussian mixtures

In a two-component Gaussian mixture model the marginal density of \(Y\) is \[ f(y) = p \frac{1}{\sqrt{2 \pi \sigma_1^2}} e^{-\frac{(y - \mu_1)^2}{2 \sigma_1^2}} + (1 - p)\frac{1}{\sqrt{2 \pi \sigma_2^2}}e^{-\frac{(y - \mu_2)^2}{2 \sigma_2^2}}.\] The following simulates data from such a mixture model.

sigma1 <- 1  ## Known standard deviations
sigma2 <- 2
mu1 <- -0.5  ## Component means
mu2 <- 4
p <- 0.5     ## Probability of the first component
n <- 1000
z <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(p, 1 - p))  ## Latent labels
y <- numeric(n)
n1 <- sum(z)
y[z] <- rnorm(n1, mu1, sigma1)       ## Observations from the first component
y[!z] <- rnorm(n - n1, mu2, sigma2)  ## Observations from the second component
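
To check that the simulation matches the target density, the histogram of y can be compared with \(f\) overlaid as a curve. The sketch below uses base R graphics; the helper dmix is introduced here for illustration only and relies on the parameters defined above.

dmix <- function(x)  ## The mixture density f with the parameters from above
  p * dnorm(x, mu1, sigma1) + (1 - p) * dnorm(x, mu2, sigma2)
hist(y, breaks = 40, freq = FALSE, main = "", xlab = "y")
curve(dmix, add = TRUE, col = "blue", lwd = 2)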

We implement the negative log-likelihood, up to the additive constant \(n \log \sqrt{2 \pi}\), assuming that the variances are known. Note that the implementation takes a single parameter argument, which is supposed to be a vector of all parameters in the model. Internally, the function has to map each entry of this parameter vector to the corresponding model parameter.

loglik <- function(par, y) {
  p <- par[1]
  if (p < 0 || p > 1)  ## Penalize inadmissible mixture proportions
    return(Inf)
  
  mu1 <- par[2]
  mu2 <- par[3]
  ## sigma1 and sigma2 are known and found in the global environment
  -sum(log(p * exp(-(y - mu1)^2 / (2 * sigma1^2)) / sigma1 + 
             (1 - p) * exp(-(y - mu2)^2 / (2 * sigma2^2)) / sigma2))
}
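
As a quick sanity check of the calling convention, the function can be evaluated at the true parameters from the simulation; the resulting value (output not shown) is comparable to the values reported by optim below.

loglik(c(p, mu1, mu2), y)  ## Negative log-likelihood at the true parameters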

With no further implementation, optim can find the maximum-likelihood estimate, provided that we have a sensible initial parameter guess. Here we use the true parameters as the initial value; this is convenient when testing an algorithm, but it is, of course, not an option in real applications.

optim(c(0.5, -0.5, 4), loglik, y = y)[c(1, 2)]
## $par
## [1]  0.4934452 -0.5495679  4.0979106
## 
## $value
## [1] 1384.334

However, if the optimization is initialized badly, it does not find the global maximum of the likelihood but only a local maximum.

optim(c(0.9, 3, 1), loglik, y = y)[c(1, 2)]
## $par
## [1] 0.2334382 5.6763596 0.6255516
## 
## $value
## [1] 1509.86
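
A common safeguard, not pursued further in this section, is to run optim from several random starting values and keep the fit with the smallest negative log-likelihood. A minimal sketch, assuming that starting means can reasonably be drawn from the range of the data; the objects starts, fits and best are introduced for illustration only.

starts <- replicate(10, c(runif(1), runif(2, min(y), max(y))), simplify = FALSE)
fits <- lapply(starts, optim, fn = loglik, y = y)  ## One optimization per starting value
best <- fits[[which.min(sapply(fits, function(fit) fit$value))]]
best$par  ## Parameter estimate from the best of the ten fits (output not shown)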

We will implement the EM algorithm for the Gaussian mixture model via an E-step and an M-step function. We know from Section 6.4.1 how the complete log-likelihood looks, and the E-step becomes a matter of computing \[p_i(\mathbf{y}) = E(1(Z_i = 1) \mid \mathbf{Y} = \mathbf{y}) = P(Z_i = 1 \mid \mathbf{Y} = \mathbf{y}).\] The M-step amounts to maximum-likelihood estimation in the complete-data model, which has an explicit solution, but with the indicators \(1(Z_i = 1)\) and \(1(Z_i = 2) = 1 - 1(Z_i = 1)\) replaced by the conditional probabilities \(p_i(\mathbf{y})\) and \(1 - p_i(\mathbf{y})\), respectively. Note that the E-step function below returns the conditional probabilities \(1 - p_i(\mathbf{y})\) of the second component, which is exactly what the M-step function expects as its second argument.

EStep <- function(par, y) {
  p <- par[1]
  mu1 <- par[2]
  mu2 <- par[3]
  a <- p * exp(- (y - mu1)^2 / (2 * sigma1^2)) / sigma1 
  b <- (1 - p) * exp(- (y - mu2)^2 / (2 * sigma2^2)) / sigma2
  b / (a + b)  ## Conditional probabilities of the second component, 1 - p_i(y)
}

MStep <- function(y, pz) {
  n <- length(y)
  N2 <- sum(pz)  ## Expected number of observations from the second component
  N1 <- n - N2
  c(N1 / n, sum((1 - pz) * y) / N1, sum(pz * y) / N2)
}
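
A single EM update is simply the composition of the two functions: the E-step maps the current parameters to conditional probabilities, and the M-step maps these back to updated parameters. As a small sketch, starting from the initial value used above (par1 is introduced for illustration only):

par1 <- MStep(y, EStep(c(0.5, -0.5, 4), y))  ## One EM update of the parameter vector

The EM algorithm iterates this update until the parameters stabilize, which the EM function below implements using a relative convergence criterion.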

EM <- function(par, y, epsilon = 1e-12) {
  repeat {
    par0 <- par
    par <- MStep(y, EStep(par, y))
    ## Stop when the relative change in the parameter vector is small
    if (sum((par - par0)^2) <= epsilon * (sum(par^2) + epsilon))
      break
  } 
  par  ## Remember to return the parameter estimate
}

EM(c(0.5, -0.5, 4), y)
## [1]  0.4934443 -0.5497060  4.0982383

The EM algorithm may, just like any other optimization algorithm, end up in a local maximum if it is initialized badly.

EM(c(0.9, 3, 1), y)
## [1] 0.2334722 5.6759456 0.6256279
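
The two EM solutions can be compared by evaluating the negative log-likelihood from above at the two parameter estimates. Since these estimates nearly coincide with the two optim solutions, the values (output not shown) should be close to those reported by optim.

loglik(EM(c(0.5, -0.5, 4), y), y)  ## Cf. the optim value 1384.334
loglik(EM(c(0.9, 3, 1), y), y)     ## Cf. the optim value 1509.86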