Karin Knudson

July 31, 2020

Expectation Maximization (EM)

Trying to maximize a likelihood or a posterior? Got some latent variables on hand? The EM algorithm might be useful for you.

Although I knew it was useful, it took me a long time to get around to understanding EM, perhaps because it's a framework that can be adapted to a variety of problems, and so the particular implementations of it for different problems can look quite quite different from each other. Those differences obscured for me the underlying structure of the algorithm, which is pretty elegant, and which is what we'll start by presenting here.

The goal of this post is to explain the broad framework of expectation maximization, give a sense when EM is applicable and helpful, and illustrate EM as an adaptable framework by showing it in action in a mixture model and in a situation with missing data. The calculations in the examples get detailed, so that you can see what's involved in translating the outlines of the EM algorithm to a particular context. Hopefully, by the end of this, when you see EM in context, you'll be able to see the underlying structure, and have an idea of where to start if you someday want to adapt EM to your own particular problem.

Let's suppose we have the following reasonably broad setup. Say we have:

observed data $X$
latent variables $Z$ (which could also represent missing data, auxiliary variables that we introduce for convenience, or additional parameters that we group separately from $\theta$)
parameters of interest $\theta$

Moreover, suppose we have the following hopes, dreams, circumstances:

We seek to find the value of $\theta$ that maximizes a likelihood $P(X|\theta)$ or a posterior distribution $P(\theta|X)$.
We find it hard to hard to do the desired maximization of $P(X|\theta)$ or a $P(\theta|X)$ directly (in fact it may be difficult to even calculate it — e.g. if we're seeking $P(X|\theta) = \int P(X,Z | \theta) \, dZ$, we may not be able to compute that integral exactly). We have a $P(X,Z | \theta) $ that is easier to deal with.
We're comfortable accepting a solution that may be just a local maximum (not necessarily a global maximum). Or, perhaps we're comfortable running the algorithm many times to find a number of local maxima.
Roughly speaking, if we knew $X$ and $Z$, we'd be okay estimating $\theta$, and if we know $X$ and $\theta$, we could say useful things about the distribution of $Z$.

If someone was asking me in an informal conversation about how you would proceed using EM to try to do the following, I might give the following intuitive-but-not-specific-enough outline of the EM plan of action:

Choose an initial $\theta$
Use $\theta$ update your knowledge about about $Z$
Use your updated knowledge about Z in a tractable optimization problem, and solve the optimization problem to update $\theta$
Repeat the previous two steps, alternately updating $\theta$ and some Z-related quantity until convergence

It's not that anything that follows the above rough intuition is EM — we'll outline EM carefully in a moment. But as the EM steps get technical, it helps to keep in mind this basic intuition of iteration, alternating between updates related to $Z$ and updates related to $\theta$.

Somewhat more specifically we can say we will do the following. Choose some starting value of $\theta$, (e.g. at random). Condition on X and the chosen value of $\theta$, and compute the distribution of Z. With respect to the distribution just computed, calculate the expectation of the log likelihood of $P(X, Z | \theta) $ — an expected log likelihood which of course depends on $\theta$, and then find $\theta$ to maximize that expected log likelihood. Repeat the process, using the latest value of $\theta$ to compute the distribution of $Z$ given $X$ and $\theta$.

Let's pin that down a little more, assuming we are trying find $\theta$ to maximize the likelihood $P(X|\theta)$, or, alternatively, that we are trying to find $\theta$ to maximize a posterior $P(\theta|X)$ in the presence of a prior $p(\theta)$. We would set $\theta$ to some initial value and then repeat the following until convergence (that's convergence in estimates of $\theta$, or in the log likelihood or posterior).

Likelihood
Posterior

Evaluate $P(Z | X, \theta^{old})$
Compute the expectation under $P(Z|X , \theta^{old})$ of the log likelihood $P(X, Z | \theta)$ denoted $Q(\theta, \theta^{old}) = \mathbb{E}_{Z|X, \theta^{old}} \ln (P(X, Z | \theta))$
Choose $\theta^{new}$ to be the value of theta that maximizes $Q(\theta, \theta^{old})$

Evaluate $P(Z | X, \theta^{old})$
Compute the expectation under $P(Z|X , \theta^{old})$ of the log posterior $P(\theta, Z | X)$ denoted $ Q(\theta, \theta^{old}) = \mathbb{E}_{Z | X, \theta^{old}} \ln (P(\theta, Z | X)) \propto \mathbb{E}_{Z|X, \theta^{old}} \ln (P(X, Z | \theta)) + P(\theta) $
Choose $\theta^{new}$ to be the value of theta that maximizes $Q(\theta, \theta^{old})$

For a likelihood versus a posterior, this outline differs only in the second step, where we swap $P(\theta, Z | X)$ for $P(X,Z | \theta)$.

Does the setup and the general proceedure still seem a little fuzzy? That's fine. We'll make a quick vocabulary note and then will get to examples. The examples will help illustrate the meaning of the above quantities in context, so that you can see how iterative process of EM works in two different realms where where EM is applicable: a mixture of Gaussians (see, e.g. [1 Ch. 9]) and a setup with missing data (as in an example from [2 Ch. 18]).

Keep in mind: The above step-by-step procedures can look pretty different from how we actually perform EM in practice for two main reasons. First, we probably won't be actually recalculating $Q(\theta, \theta^{old})$ at every step — instead, we'll hopefully just work out once and for all an expression for $\theta$ that maximizes it. Second, we may not calculate $ P(Z | X, \theta^{old})$ itself, but rather just some derived quantities that are relevant to the maximization of $Q$. Look out for how this works in the examples!

Oh, sure, some of them do, depending on what you mean by "fun". We can say:

Quantity	Fun name
{X, Z}	complete data set
X	incomplete data
ln p(X, Z \| θ)	complete data log-likelihood
p(Z \| X, θ)	posterior distribution of the latent variables

I somehow find the EM steps a little easier to read and remember in terms of symbols (especially if I'm trying to remember the connection between when I'm maximizing a likelihood vs. a posterior), but you might find the opposite to be true for you. So we can write the EM method for maximizing a likelihood as follows:

Evaluate the posterior distribution of the latent variables.
Compute the expectation under the posterior distribution of the latent variables of the complete data log-likelihood (as a function of the parameters).
Find the parameters that maximize the expectation function that was computed in the previous step.

The step of computing the expectation of the complete data log-likelihood (or at least the pieces of it that are necessary for the upcoming maximization) is called the expectation step or E-step, the computing of new parameters to maximize that expectation is the maximization or M-step.

It's pretty great that EM is applicable to a number of different problems. But, that range can also make it hard to get a handle on EM itself, and what is common to each application. Here, we present the EM algorithm in two rather different problems, within a common framework so that you can keep the outline of the algorithm itself in mind as you see how it can be applied.

Click equations to show or hide calculation details.

Gaussian Mixture
Missing data

Clear

Variables of interest and model description

X (data): $\{\mathbf{x}_n\}_{n=1}^N$, data assumed to come from a mixture of K Gaussians. $ \{y_{nj} | n \in 1...N, j \in 1...d, \text{ and } y_{nj} \text{ observed}\}$ the observed components of $\{\mathbf{y_n}\}_{n=1}^N$

Z (latent variables): We let $z_{nk}$ encode the class membership of data point $\mathbf{x}_n$, so $z_{nk}=1$ if $\mathbf{x}_n$ is in class k, $z_{nk}=0$ otherwise. Note that these are auxiliary variables; we can define the model without them, but introducing them will make inference easier; in particular, we use them in EM. $ \{y_{nj} | n \in 1...N, j \in 1...d, \text{ and } y_{nj} \text{ unobserved}\}$ the unobserved components of $\{\mathbf{y_n}\}_{n=1}^N$

$\theta$ (parameters): $ \{\mu_k, \Sigma_k, \pi_k\}_{k=1}^K$, where $\mu$, $\Sigma$ are the means and variances of the Gaussian distributions in the mixture, and $\mathbf{\pi}$ gives the mixing weights, which sum to 1. $\mathbf{\mu}, \mathbf{\Sigma} = \mathbf{\Lambda}^{-1}$, the mean and covariance of the normal distribution from which the $\mathbf{y_n}$ are assumed to be drawn

We have that the data points $\mathbf{x}_n$ are conditionally independent from each other given the parameters, that $P(\mathbf{z}_n | \pi_k) = \prod_{k=1}^K \pi_k^{z_{nk}}$ and $P(\mathbf{x}_n | \theta) = \prod_{k=1}^K N(\mathbf{x}_n | \mathbf{\mu}_k, \mathbf{\Sigma}_k)^{z_{nk}} $.

For this setup, we have $Y = \{\mathbf{y}_n \}_{n=1}^N =\{X, Z\} $, the complete data. We'll us $\mathbf{y}_n^{obs}$ to denote the vector observed values in each data vector, $\mu^{obs}$ to be the vector of elements of $\mu$ that correspond to observed (non-missing) values of $\mathbf{y_n}$, and $ \Lambda_{j, obs_n}$ to be elements of the $j^{th}$ row of $\Lambda$ in columns that correspond to observed (non-missing) values of $\mathbf{y_n}$. We assume the mechanism of the missing data is ignorable.

Goal

Maximize $P(X | \theta)$ — a likelihood!

Maximize $P(\theta | X)$ — a posterior distribution!*

*although, okay, we have an improper uniform prior here, which isn't so interesting - see [2 Ch. 18] for more interesting extensions

Steps to iterate until convergence

Evaluate P(Z | X, θ):

Recall that each $z_{nk}$ can be either 1 or 0, so probability distribution for $Z$ is specified by the following quantities:

Because we'll need to reference these probabilities for Z a lot, we'll follow [1] in defining $\gamma(z_{nk}):= p(z_{kn} = 1 | \mathbf{x}_n, \theta^{old})$

If $y_{ni}$ is missing, its distribution is Gaussian:

$y_{ni} | \mathbf{y}_n^{obs}, \mu^{old}, \mathbf{\Lambda}^{old} \sim N(\mu_i - \Lambda^{old -1}_{ii} \Lambda_{j,obs_n} (\mathbf{y}_{n,obs} - \mu^{old}_{obs}) , \Lambda_{ii}^{-1}$)

See, e.g. [1, Section 2.3.2, Partitioned Gaussians]

Now we'll actually peek ahead a little and note that all the quantities we'll need to compute will depend on some particular quantities derived from this posterior distribution of missing data. For ease of notation later on, we'll define the following as they apply to both missing and observed data. (Click for more detail on how to compute each.)

$\gamma(y_{ni}) = \mathbb{E} (y_{ni} | X = \{\mathbf{y}_n^{obs}\}, \mu^{old}, \Lambda^{old}) $

$$ \gamma(y_{ni}) = \begin{cases} y_{ni} \text{ if } y_{ni} \text{ is observed} \\ \mu_i - (\Lambda^{old}_{ii})^{-1} \Lambda_{j,obs_n}(\mathbf{y}_n^{obs} - \mu^{old, obs}) \text{ if } y_{ni} \text{ is missing } \\ \end{cases} $$

$ s_{nij} = \mathbb{E} \left(y_{ni}y_{nj} | X = \{\mathbf{y}_n^{obs}\}, \mu^{old}, \Lambda^{old}\right)$

$$ s_{nij} = \begin{cases} \gamma(y_{ni}) \gamma(y_{nj}) = y_{ni}y_{nj} \text{ if }y_{ni} \text{ and } y_{nj} \text{ are observed } \\ \gamma(y_{ni}) \gamma(y_{nj}) \text{ if exactly one of } y_{ni} \text{ and } y_{nj} \text{ are observed} \\ \Lambda_{ij}^{-1} + \gamma(y_{ni}) \gamma(y_{nj}) \text{ if }y_{ni} \text{ and } y_{nj} \text{ are missing } \end{cases} $$

Choose a new value of $\theta$ to maximize the expectation under $P(Z | X , \theta^{old})$ of the log likelihood $P(X, Z | \theta)$, that is to maximize $Q (\theta, \theta^{old}) = \mathbb{E}_{Z|X, \theta^{old}} \ln P(X, Z | \theta)$:

We have $Q(\theta, \theta^{old}) = \sum_{n=1}^N \sum_{k=1}^K \gamma(z_{nk}) (\ln \pi_k + \ln N( \mathbf{x}_n | \mathbf{\mu}_k, \mathbf{\Sigma_k})) $

$$\begin{aligned} Q(\theta, \theta^{(old)}) := & \mathbb{E}_{Z | X, \theta^{(old)}} \ln p(X,Z | \theta ) \\ = & \mathbb{E}_{Z | X, \theta^{(old)}} \ln \prod_{n=1}^N \prod_{k=1}^K \pi_k^{z_{nk}} N (\mathbf{x}_n | \mu_k, \Sigma_k)^{z_{nk}} \\ = & \mathbb{E}_{Z | X, \theta^{old}} \sum_{n=1}^N \sum_{k=1}^K z_{nk} (\ln \pi_k + \ln N( \mathbf{x}_n | \mathbf{\mu}_k, \mathbf{\Sigma_k})) \\ = & \sum_{n=1}^N \sum_{k=1}^K \mathbb{E}_{Z | X, \theta^{(old)}} (z_{nk}) (\ln \pi_k + \ln N( \mathbf{x}_n | \mathbf{\mu}_k, \mathbf{\Sigma}_k)) \\ = & \sum_{n=1}^N \sum_{k=1}^K \gamma(z_{nk}) (\ln \pi_k + \ln N( \mathbf{x}_n | \mathbf{\mu}_k, \mathbf{\Sigma_k})) \end{aligned}$$

Maximizing $Q(\theta, \theta^{old})$ with respect to $\theta$, we obtain the following updates for $\theta$:

$\mathbf{\mu}_k^{new} = \frac{ \sum_{n=1}^N \gamma(z_{nk}) \mathbf{x}_n }{\sum_{n=1}^N \gamma(z_{nk})}$

Setting the derivative of Q with respect to $\mathbf{\mu}_k$ to zero and then reducing using the assumption that $\mathbf{\Sigma}_k$ is nonsingular yields $$\sum_{n=1}^N \gamma(z_{n,k}) \mathbf{\Sigma}_k^{-1}(\mathbf{x}_n - \mathbf{\mu}_k) = 0 \implies \mathbf{\mu}_k^{new} = \frac{ \sum_{n=1}^N \gamma(z_{nk}) \mathbf{x}_n }{\sum_{n=1}^N \gamma(z_{nk})} $$

$\Sigma_k^{new} = \frac{\sum_{n=1}^N \gamma(z_{nk})(\mathbf{x}_n - \mathbf{\mu}_k^{new})(\mathbf{x}_n - \mathbf{\mu}_k^{new})^T} {\sum_{n=1}^N \gamma(z_{nk})} $

Set the derivative with respect to $\Sigma^{-1}$ to zero. Calculations parallel those for finding the maximum likelihood for a Gaussian, with the extra $\gamma(z_{nk})$ terms in tow. See e.g. Section 13. 5, here.

$\pi_k^{new} = \frac{\sum_{n=1}^N \gamma(z_{nk})}{N}$

We need to maximize Q with respect to $\pi$ subject to $\sum_{k=1}^K \pi_k =1$. Lagrange multipliers make quick work of this constrained optimization problem. Omitting the terms of Q that do not depend on $\pi$ we define $L(\pi, \lambda) = \sum_{k=1}^K \sum_{n=1}^N \gamma(z_{nk}) \ln \pi_k - \lambda \left(\sum_{k=1}^K - 1\right) $. Setting the derivatives with respect to each $\pi_k$ and $\lambda$ simultaneously to zero yields $\pi_k^{(new)} = \frac{\sum_{n=1}^N \gamma(z_{nk})}{N}$.

$Q(\theta, \theta^{old}) = \mathbb{E}_{Z|X, \theta^{old}} \ln P(X, Z | \theta) = \\ \mathbb{E}_{Z|X, \theta^{old}} -\frac{N}{2} \ln | \Sigma^{-1} | - \frac{1}{2} \sum_{n=1}^N (\mathbf{y}_n - \mu)^T \Sigma^{-1} (\mathbf{y}_n - \mu) + C $ where C does not depend on Z, X, or $\theta$.

Let's sketch how the $s_nij$ and $\gamma(y_{ni}$ can be used to replace all quantities in Q that depend on $Y$. Dropping the C for the sake of concision, we have \\ \mathbb{E}_{Z|X, \theta^{old}} = -\frac{N}{2} \ln | \Sigma^{-1} | - \frac{1}{2} \sum_{n=1}^N (\mathbf{y}_n - \mu)^T \Sigma^{-1} (\mathbf{y}_n - \mu) = -\frac{N}{2} \ln | \Sigma^{-1} | - \mathbb{E} \frac{1}{2} \sum_{n=1}^N \text{Tr}[(\mathbf{y}_n - \mu)^T \Sigma^{-1} (\mathbf{y}_n - \mu)] \\ = -\frac{N}{2} \ln | \Sigma^{-1} | - \mathbb{E}\frac{1}{2} \sum_{n=1}^N \text{Tr}[(\mathbf{y}_n - \mu)(\mathbf{y}_n - \mu)^T \Sigma^{-1}] \\ = -\frac{N}{2} \ln | \Sigma^{-1} | - \mathbb{E}\frac{1}{2} \sum_{n=1}^N \text{Tr}[(\mathbf{y}_n\mathbf{y}_n^T - \mu\mathbf{y}_n^T - \mathbf{y_n}^T \mu + \mu \mu^T ) \Sigma^{-1}] \\ = -\frac{N}{2} \ln | \Sigma^{-1} | - \frac{1}{2} \sum_{n=1}^N \text{Tr}[(\mathbb{E}\mathbf{y}_n\mathbf{y}_n^T - \mu\mathbb{E}\mathbf{y}_n^T - \mathbb{E}\mathbf{y_n}^T \mu + \mu \mu^T ) \Sigma^{-1}] \\ \) The expectations in this last line can be expressed in terms of $s_nij$ and $\gamma(y_{ni}$.

We can rewrite Q in terms of the expectations defined in the previous step: $\gamma(y_{ni})$ and $s_{nij}$.

Maximizing Q follows similar calculations to maximizing the likelihood for a Gaussian.

$\mu_i^{new} = \frac{1}{N} \sum_{n=1}^N \gamma(y_{ij})$

The derivation here parallels that of finding parameters that maximize the likelihood of a Gaussian. If we had no missing data, the mean that maximized the likelihood would be $\frac{1}{N} \sum_{n=1}^N y_{ij}$, so our result just has the $y_{ij}$ replaced with their expectation when they are missing.

$\Sigma_{ij}^{new} = \frac{1}{N} \sum_{n=1}^N s_{ijn} - \mu_i^{new} \mu_j^{new} $

Again, the derivation follows that of maximizing the likelihood of a Gaussian. (see e.g. 13.5 here. Note that if Y represented entirely non-missing data from a Gaussian distribution, then the likelihood-maximizing $\Sigma$ would be $$ \Sigma_{ij} = \frac{1}{N} \sum_{n=1}^N y_{ni}y_{nj} - \mu^{ML}_i y_{nj} - y_{ni} \mu^{ML}_j + \mu^{ML}_i \mu^{ML}_j \\ =\frac{1}{N}\sum_{n=1}^N y_{ni}y_{nj} - \frac{\mu^{ML}_i}{N}\sum_{n=1}^N y_{nj} - \frac{\mu_j^{ML}}{N}\sum_{n=1}^N y_{ni} + \mu^{ML}_i \mu^{ML}_j\\ = \frac{1}{N}\sum_{n=1}^N y_{ni}y_{nj} - \frac{\mu^{ML}_i}{N} N \mu^{ML}_j - \frac{\mu^{ML}_j}{N} N \mu^{ML}_i + \mu^{ML}_i \mu^{ML}_j\\ = \frac{1}{N}\sum_{n=1}^N y_{ni}y_{nj} - \mu^{ML}_i \mu^{ML}_j. \\ $$ Thus the result here is analogous to the maximum likelihood result, but with the expectations $s_{ijn}$ appearing instead of $y_{ni}y_{nj}$.

Great! There are fundamental conceptual similarities between EM and Gibbs sampling. In both cases, we divide parameters into groups and use appropriately chosen conditional distributions within an iterative process to obtain quantities of interest. The connections continue when we look at implementation details as well. Tricks like adding auxiliary variables to make certain Gibbs sampling processes more efficient can be helpful in EM as well. (I learned recently that some of these tools were actually developed for EM before being applied in the Gibbs sampling setting as well [2]).

However, we should keep in mind a fundamental difference between EM and Gibbs sampling: EM is a technique for optimization (in particular finding the maximum of a likelihood or posterior), while Gibbs sampling is for, well, sampling.

Optimization	Sampling
EM	Gibbs Sampling
(Stochastic) gradient descent	Metropolis Hastings
Newton's method	Other MCMC methods (HMC, etc.)

Iterative techniques in optimization may resemble iterative techniques of sampling, and in both cases, we may end up with some iteratively obtained sequence of parameter values. But the goals of optimization and sampling are different, and so are the interpretations of the quantities produced by iteration.

When we are optimizing, we are updating our parameter values in a way that we hope will get them closer to the optimum value at each iteration, or at least that will trend closer to the optimum after many iterations. In sampling, we are drawing values that will represent a distribution — we expect our sampling process to produce both samples that are highly probable (many of these), others that correspond to lower probability regions (fewer of these). We're not expecting or hoping that our samples will settle down on a single value once we have enough of them — on the contrary, we rely on our samples to explore the space and help us understand the distribution they come from. Lastly, when optimizing, we are interested in the optimizing parameter itself, not in all the points that might have been sequentially considered to get us closer to that optimum. With sampling, in contrast, we are likely to want to retain all of our samples, to use them to understand the underlying distribution or at the very least to feed them in to some summary statistics like a mean or standard deviation.

When you're learning about or working with a new iterative or sequential technique in machine learning, it can be surprisingly helpful to just take a quick moment to ask yourself whether you are iterating in order to move towards some optimum, or rather to generate representative samples of a distribution of interest.

To recap: EM is an iterative algorithm that may help you maximize likelihoods or posteriors if your model has latent variables — or if you can make it have latent variables, by creating auxiliary variables as latent variables, by treating missing data as latent variables, or by thoughtfully partitioning parameters to consider some of them as latent variables. EM is applicable in a range of settings, and the particular calculations involved will look different depending on the application. And don't forget that EM helps you optimize over a likelihood or posterior (not sample from it!).

Thanks for reading! I promise more pictures in my next post.

References:

Note that the given outline of the general steps of EM follows [1].

Bishop, C.M., 2006. Pattern recognition and machine learning. Springer.
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A. and Rubin, D.B., 2013. Bayesian data analysis. Chapman and Hall/CRC.