The Expectation-Maximization algorithm
March 14, 2022
In this post, I explain the popular Expectation-Maximization algorithm from a variational-methods perspective.$\newcommand{\set}[1]{\left\{ #1 \right\}}$
Introduction
Let’s consider an observation $x\in\mathcal{X}$. In a traditional statistical setting, we posit a model $\set{ p_{\theta} \vert \theta \in \Theta} $ describing the generation process of the variable $x$, and we would like to make inference on the parameter $\theta$.
However, observations are often best described by introducing an unobserved latent variable $z\in \mathcal{Z}$. This unobserved variable can either correspond to an existing unobserved real-world attribute of the data, or to an abstract hypothesis made to describe the data (for instance in clustering, we define clusters and cluster assignments as latent variables).
In that context, computing the log-likelihood of the data involves integrating over the latent variable, since:
$$ p_{\theta}(x) = \int p_{\theta}(x,z) dz $$
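In particular, when the latent variable is discrete, as in the clustering example developed below, this integral reduces to a sum over the possible values of $z$:
$$ p_{\theta}(x) = \sum_{z \in \mathcal{Z}} p_{\theta}(x,z) $$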
Maximum Likelihood Estimation
Let’s suppose we aim to provide a Maximum Likelihood Estimate $\hat{\theta}^{(MLE)}$ of the parameter $\theta$. Classically, to do this we maximize the log-likelihood of the data:
$$ \max\limits_{\theta} \log \left(\int p_{\theta}(x,z) dz \right) $$
Unfortunately, due to the presence of the integral, this function of $\theta$ is generally non-concave and has no closed-form maximizer, which makes it difficult to optimize directly.
However, in many cases, once the latent variable is known, the complete-data log-likelihood is concave in $\theta$ and easy to maximize for this particular value of the latent variable:
$$\max\limits_{\theta} \log \left(p_{\theta}(x,z) \right)$$
The Expectation-Maximization algorithm (Dempster et al., 1977) provides a solution in this setting, essentially by performing alternating optimization over the parameter and a variational distribution.
Variational Expression of the log-likelihood
$\newcommand{\PZ}{\mathcal{P(\mathcal{Z})}}$ Let $\PZ$ denote the set of all possible densities defined on the latent space $\mathcal{Z}$.
As is common in variational methods, we use Jensen’s inequality: for any density $q\in\PZ$, the quantity $F(x,q;\theta)$ defined below is a lower bound on the log-likelihood, and this bound is attained for a well-chosen $q$. The log-likelihood of the data can therefore be rewritten as:
$$ \log(p_{\theta}(x)) = \max\limits_{q \in \PZ} F(x, q;\theta) $$
where $F(x,q;\theta)$ is the variational free energy, defined as
$$ F(x, q;\theta) = \int \log( \frac{p_{\theta}(x,z)}{q(z)} ) q(z)dz $$
As a consequence, the maximization of the log likelihood becomes a double maximization problem:
$$ \max \limits_{\theta} \log \left(p_{\theta}(x) \right) = \max\limits_{\theta} \max\limits_{q\in \PZ} F(x,q;\theta) $$
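This equality can be checked directly: the gap between the log-likelihood and the free energy is a Kullback–Leibler divergence, which is non-negative and vanishes exactly when $q$ is the posterior distribution of the latent variable:
$$ \begin{aligned} \log(p_{\theta}(x)) - F(x,q;\theta) &= \int \log\left( \frac{p_{\theta}(x)\, q(z)}{p_{\theta}(x,z)} \right) q(z)\, dz \\ &= \int \log\left( \frac{q(z)}{p_{\theta}(z \vert x)} \right) q(z)\, dz \\ &= KL\left(q \,\Vert\, p_{\theta}(\cdot \vert x)\right) \geq 0 \end{aligned} $$
This also announces the result of the Expectation step below: the maximum over $q$ is reached at $q = p_{\theta}(\cdot \vert x)$.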
Alternate optimization of the Variational Free Energy
Since the optimization problem involves two maximizations, the fundamental idea of the EM algorithm is to alternate between them.
Supposing that at step $t$ we have a parameter value $\theta^{t}$ and a variational distribution $q^{t}$, we update these values one after the other in two steps.
The Expectation Step
In that step, we fix the value of the parameter $\theta^{t}$ and maximize $F$ with respect to the variational distribution:
$$q^{t+1}=\underset{q\in\PZ}{\operatorname{argmax}}\, F(x,q;\theta^{t}).$$
Using the method of Lagrange multipliers, one can show that the optimal distribution is the conditional distribution of the latent variable given the data $x$, under the current set of parameters $\theta^t$:
$$q^{t+1}(z) = p_{\theta^t}(z \vert x)$$
Plugging this optimal distribution back into the variational free energy yields a new function of $\theta$, concave in many common models (e.g., exponential families), given by:$\newcommand{\E}{\mathbb{E}}$
$$
\begin{aligned}F(x,q^{t+1}, \theta) &= F(x,p_{\theta^t}(. \vert x), \theta)\\ &= \int \log( \frac{p_{\theta}(x,z)}{p_{\theta^t}(z \vert x)} ) p_{\theta^t}(z \vert x) dz \\ &= \E_{Z\sim p_{\theta^t}(.\vert x)}[\log(p_{\theta}(x,Z))] + H(p_{\theta^t}(. \vert x)) \end{aligned} $$
where $H(p_{\theta^t}(. \vert x))$ is the entropy of the conditional latent distribution $p_{\theta^t}(. \vert x)$.
Since the entropy term doesn’t depend on the parameter $\theta$, the surrogate function to optimize is given by the expectation of the joint log-likelihood, under the conditional distribution of the latent variable given the observation and the current parameter:
$$ \theta \mapsto E(\theta)=\E_{Z\sim p_{\theta^t}(.\vert x)}[\log(p_{\theta}(x,Z))] $$
Thus this step is called the Expectation step.
The Maximization step
Once the surrogate function has been derived in the Expectation step, its tractability (and, in many models, its concavity) allows us to easily maximize it with respect to the model parameter:
$$ \theta^{t+1} = \underset{\theta}{\operatorname{argmax}}\, E(\theta) $$
This step is thus called the Maximization step.
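Putting the two steps together, the overall procedure can be summarized by a short Python sketch. The functions `e_step`, `m_step` and `log_likelihood` are hypothetical placeholders standing for the model-specific computations described above; this is an illustrative skeleton, not a reference implementation.

```python
# Generic EM skeleton: alternate between the E-step and the M-step until the
# log-likelihood stops improving. `e_step`, `m_step` and `log_likelihood` are
# placeholders to be supplied for a specific model (e.g., the Gaussian Mixture
# Model below).

def expectation_maximization(x, theta_init, e_step, m_step, log_likelihood,
                             max_iter=100, tol=1e-6):
    theta = theta_init
    previous_ll = -float("inf")
    for _ in range(max_iter):
        q = e_step(x, theta)          # E-step: q^{t+1}(z) = p_{theta^t}(z | x)
        theta = m_step(x, q)          # M-step: theta^{t+1} = argmax_theta E(theta)
        current_ll = log_likelihood(x, theta)
        if current_ll - previous_ll < tol:   # EM never decreases the log-likelihood
            break
        previous_ll = current_ll
    return theta
```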
Example: the Gaussian Mixture Model.
In the Gaussian Mixture Model, we are given a dataset $x=\{x_1,…,x_n\}$, and our goal is to assign a cluster label $c_i\in\{1,…,K\}$ to each datapoint.
To do so, an unobserved cluster assignment variable $z_i \in \{1,…,K\}$ is introduced.
Moreover, each cluster $k\in \{1,…,K\}$ is associated with a Gaussian Distribution $\mathcal{N}(\mu_k, \Sigma_k)$.
Given this, we assume the following generating process for the datapoints:
$$ \begin{aligned} z_i &\sim \mathcal{M}(1, (\lambda_1,…,\lambda_K)) \\ x_i\vert z_i=k &\sim \mathcal{N}(\mu_k, \Sigma_k) \end{aligned} $$
The goal is to find the set of parameters $\theta=(\lambda_1,…,\lambda_K, \mu_1,…,\mu_K, \Sigma_1,…,\Sigma_K )$ that maximizes the likelihood $ p(x| \theta) $.
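As an aside, sampling from this generative process only takes a few lines of NumPy; the parameter values below are arbitrary and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative parameters: K = 2 clusters in dimension 2.
lambdas = np.array([0.3, 0.7])                   # mixture weights
mus = np.array([[0.0, 0.0], [3.0, 3.0]])         # cluster means
sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])  # cluster covariances

n = 500
z = rng.choice(len(lambdas), size=n, p=lambdas)  # z_i ~ M(1, (lambda_1, ..., lambda_K))
x = np.stack([rng.multivariate_normal(mus[k], sigmas[k]) for k in z])  # x_i | z_i = k
```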
E-step
As mentioned before, in the E-step we compute a surrogate function: the expectation of the joint log-likelihood, with $z$ distributed according to its conditional distribution given the data and the current set of parameters $\theta^{t}$.
Using Bayes’ Rule, we get that this distribution is
$$ \alpha_{i,k}^{t} \overset{\Delta}{=} p(z_i=k|x_i, \theta^{t}) = \frac{\lambda_{k}^{t}\,\mathcal{N}(x_i;\mu_k^{t}, \Sigma_k^{t})}{\sum_{k'=1}^{K}\lambda_{k'}^{t}\,\mathcal{N}(x_i;\mu_{k'}^{t}, \Sigma_{k'}^{t})} $$
We deduce that the surrogate function to maximize is
$$E(\theta)= \sum_{i=1}^{n} \sum_{k=1}^{K} \alpha_{i,k}^{t}\left[ \log(\lambda_k)+\log(\mathcal{N}(x_i;\mu_k,\Sigma_k)) \right]$$
Note that this maximization has to be done on the set of admissible parameters. In particular, we should have $\sum_{k=1}^{K}\lambda_k = 1$.
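A minimal NumPy/SciPy sketch of this E-step could look as follows; the function name and array layout are implementation choices (the parameter $\theta$ is packed as a tuple `(lambdas, mus, sigmas)`), and `scipy.stats.multivariate_normal` is used for the Gaussian densities.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(x, theta):
    """Compute the responsibilities alpha[i, k] = p(z_i = k | x_i, theta^t)."""
    lambdas, mus, sigmas = theta
    n, K = x.shape[0], len(lambdas)
    alpha = np.zeros((n, K))
    for k in range(K):
        # Unnormalized posterior: lambda_k * N(x_i; mu_k, Sigma_k)
        alpha[:, k] = lambdas[k] * multivariate_normal.pdf(x, mean=mus[k], cov=sigmas[k])
    return alpha / alpha.sum(axis=1, keepdims=True)  # normalize over the K clusters
```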
M-step
In the M-step, we maximize the surrogate function derived in the E-step with respect to the model parameters. In the case of the Gaussian Mixture Model, the maximizer has a closed form, written below (a short code sketch follows the updates).
- Maximizing $E(\theta)$ with respect to $\lambda_1,…,\lambda_K$ under the constraint $\sum_{k=1}^{K}\lambda_k = 1$ yields:
$$ \lambda_{k}^{t+1} = \frac{1}{n}\sum_{i=1}^{n}\alpha_{i,k}^{t} $$
- Maximizing $E(\theta)$ with respect to $\mu_1,…,\mu_{K}$ yields the following re-weighted empirical mean of the data points in each cluster:
$$ \mu_{k}^{t+1} = \frac{\sum_{i=1}^n\alpha_{i,k}^{t}x_i}{\sum_{i=1}^n\alpha_{i,k}^{t}} $$
- Similarly, maximizing $E(\theta)$ with respect to $\Sigma_1,…,\Sigma_K$ yields:
$$ \Sigma_{k}^{t+1} = \frac{\sum_{i=1}^n\alpha_{i,k}^{t}(x_i-\mu_k^{t+1})(x_i-\mu_k^{t+1})^T}{\sum_{i=1}^n\alpha_{i,k}^{t}} $$
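These closed-form updates translate directly into NumPy; the sketch below reuses the responsibilities `alpha` computed by the E-step sketch above and follows the same (hypothetical) naming conventions.

```python
import numpy as np

def m_step(x, alpha):
    """Closed-form M-step updates for the Gaussian Mixture Model."""
    n, d = x.shape
    n_k = alpha.sum(axis=0)              # effective number of points per cluster
    lambdas = n_k / n                    # updated mixture weights
    mus = (alpha.T @ x) / n_k[:, None]   # re-weighted empirical means
    sigmas = np.zeros((len(n_k), d, d))
    for k in range(len(n_k)):
        centered = x - mus[k]            # (n, d) deviations from the new mean
        sigmas[k] = (alpha[:, k, None] * centered).T @ centered / n_k[k]
    return lambdas, mus, sigmas          # the new parameter theta^{t+1}
```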
Conclusion
In this article, we have presented the Expectation-Maximization algorithm and its close connection with variational methods. While its name may suggest otherwise, this algorithm is simply a form of alternating optimization of the variational free energy, with respect to the model parameters on the one hand, and to a variational distribution defined on the latent space on the other hand.
Although the algorithm is described here in the context of maximizing a log-likelihood, it applies more generally to any setup where the objective function involves integrating over a latent variable.
- Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1–22. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x