Mixture Distributions
Introduction to Mixture Distributions
Recommended Prerequisites
- Probability
- Probability 2
- Maximum Likelihood Estimation
Explanation
In probability and statistics, a mixture distribution arises when the population being studied is composed of several distinct subpopulations, each of which follows a different probability distribution. Instead of modeling the population as a whole with a single distribution, we assume that the data is generated from one of several component distributions, each with its own characteristics.
Formally, a mixture distribution is a probability distribution built from a weighted combination of other distributions. This should not be confused with the distribution of a sum of random variables, which is easily mistaken for it; that is instead given by convolution.
Definition
A mixture distribution is defined as a weighted sum of two or more component distributions. Suppose there are k component distributions
\(f_1(x),f_2(x),\dots,f_k(x)\), with weights \(w_1,w_2,\dots,w_k\), where \(w_i\geq 0\) and \(\sum_{i=1}^{k}w_i=1\).
Then the PDF of the mixture distribution is given by:
$$f(x)=\sum_{i=1}^{k}w_if_i(x)$$
Similarly, a discrete case is given by:
$$P(X=x)=\sum_{i=1}^k w_i P(X_i=x)$$
One possible interpretation is that \(w_i\) is the probability of sampling from \(f_i(x)\). Some literature will denote the weights by \(\pi_i\) instead of \(w_i\).
This makes mixture models and their extensions useful for Latent Variable Models:
$$X \mid Z = i \sim F_{X_i}$$
$$f_{X}(x)=\sum_{i=1}^kP(Z=i)f_{X_i}(x)$$
This approach can also be useful when combining multiple models.
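As a concrete illustration, here is a minimal sketch (assuming Python with NumPy and SciPy is available) that evaluates the density of a two-component Gaussian mixture by applying the weighted-sum formula directly; the weights and parameters are arbitrary choices for the example.

```python
import numpy as np
from scipy.stats import norm

# Arbitrary example: two normal components with weights 0.3 and 0.7
weights = np.array([0.3, 0.7])
means = np.array([-2.0, 3.0])
sds = np.array([1.0, 1.5])

def mixture_pdf(x):
    """Weighted sum of component densities: f(x) = sum_i w_i * f_i(x)."""
    x = np.atleast_1d(x)
    # Evaluate each component density at every x, then combine with the weights
    component_densities = np.array([norm.pdf(x, loc=m, scale=s)
                                    for m, s in zip(means, sds)])
    return weights @ component_densities

print(mixture_pdf(0.0))          # mixture density at a single point
print(mixture_pdf([-2.0, 3.0]))  # densities at the two component means
```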
Moments
Mean
Since the expectation is a linear operator, the mean is given by:
$$\mathbb{E}[X]=\sum_{i=1}^k \pi_{i}\mathbb{E}[X_i]$$
Variance
$$\text{Var}(X)=\sum_{i=1}^k\pi_i(\text{Var}(X_i)+(\mathbb{E}[X_i])^2)-(\mathbb{E}[X])^2$$
CDF
Since integration (or summation) is linear, the mixture CDF relates to the component CDFs in the same way the mixture PDF/PMF relates to the component densities.
$$F_X(x)=\sum_{i=1}^k \pi_i F_{X_i}(x)$$
MGF
$$M_X(t)=\sum_{i=1}^k\pi_iM_{X_i}(t)$$
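As a sanity check, the moment formulas above can be verified numerically. The sketch below (a rough check, assuming NumPy is available) computes the mixture mean and variance from the component parameters and compares them against a Monte Carlo sample drawn from the same mixture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Example mixture: weights pi_i, component means mu_i, component sds sigma_i
pis = np.array([0.3, 0.7])
mus = np.array([-2.0, 3.0])
sigmas = np.array([1.0, 1.5])

# Analytic moments from the formulas above
mean = np.sum(pis * mus)
var = np.sum(pis * (sigmas**2 + mus**2)) - mean**2

# Monte Carlo check: sample component labels, then sample from the chosen component
n = 200_000
labels = rng.choice(len(pis), size=n, p=pis)
samples = rng.normal(mus[labels], sigmas[labels])

print(mean, samples.mean())  # the two values should be close
print(var, samples.var())    # likewise for the variance
```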
Likelihood
For independent observations \(x_1,x_2,\dots,x_n\):
$$L(\theta)=\prod_{j=1}^{n}f_X(x_j;\theta)=\prod_{j=1}^n(\sum_{i=1}^k\pi_if_{X_i}(x_j;\theta))$$
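In practice the likelihood is evaluated on the log scale for numerical stability, turning the product into a sum of logs. A minimal sketch (assuming NumPy and SciPy, with a Gaussian mixture as the example):

```python
import numpy as np
from scipy.stats import norm

def mixture_log_likelihood(x, pis, mus, sigmas):
    """log L = sum_j log( sum_i pi_i * N(x_j | mu_i, sigma_i^2) )."""
    x = np.asarray(x, dtype=float)
    # densities[i, j] = pi_i * N(x_j | mu_i, sigma_i^2)
    densities = np.array([p * norm.pdf(x, loc=m, scale=s)
                          for p, m, s in zip(pis, mus, sigmas)])
    return np.sum(np.log(densities.sum(axis=0)))

# Example with arbitrary parameters and data
x = [-5, -4, -3, 0, 1, 2, 3, 8, 9, 10]
print(mixture_log_likelihood(x, pis=[0.5, 0.5], mus=[-4.0, 9.0], sigmas=[2.0, 1.0]))
```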
Expectation-Maximization Algorithm
Note: The reader may want to come back to this section after reading the
Marginal Likelihood section. There is a mild circular dependency: that chapter builds on material that, in turn, builds on this chapter. This section stays here because its contents are strongly tied to mixture distributions.
In the
MLE chapter, we went over the eponymous parameter estimation method. We will now build on that in the context of Mixture Distributions with something called the Expectation-Maximization Algorithm.
The Expectation-Maximization (EM) algorithm is a powerful iterative method for estimating parameters in statistical models, particularly in cases where the data is incomplete or has a latent variable structure. It is widely used in a variety of applications such as clustering (e.g., Gaussian Mixture Models), missing data problems, and latent variable models. The EM algorithm alternates between two steps: the Expectation step (E-step) and the Maximization step (M-step), with the goal of finding the parameters that maximize the likelihood function.
Setup
Consider a statistical model with observed data \(X=\left\{x_1,x_2,\dots,x_n\right\}\), and unobserved (latent) variables \(Z=\left\{z_1,z_2,\dots,z_n\right\}\). The model is governed by a set of parameters \(\theta\), and we wish to estimate these parameters given only the observed data \(X\).
The complete data likelihood is given by:
$$L(\theta;X,Z)=P(X,Z|\theta)$$
However, since we do not observe Z, we must work with the marginal likelihood of the observed data:
$$L(\theta;X)=\sum_{Z}P(X,Z|\theta)$$
Algorithm
The EM algorithm seeks to maximize the marginal likelihood by iteratively applying the following two steps:
E-step (Expectation Step)
In this step, we compute the expected value of the complete-data log likelihood, \(Q(\theta;\theta^{(t)})\), given the observed data and the current estimate of the parameters \(\theta^{(t)}\). This expectation is taken with respect to the conditional distribution of the latent variables Z given the observed data X and the parameter estimates.
$$Q(\theta;\theta^{(t)})=\mathbb{E}_{Z|X,\theta^{(t)}}[\log L(\theta;X,Z)]$$
M-Step (Maximization Step)
In this step, we maximize the expected complete-data log-likelihood \(Q(\theta;\theta^{(t)})\) with respect to the parameters \(\theta\), to obtain an updated estimate of the parameters \(\theta^{(t+1)}\).
$$\theta^{(t+1)}=\arg \max_{\theta}Q(\theta;\theta^{(t)})$$
The algorithm repeats these two steps until convergence, typically when the change in log-likelihood or parameter values between iterations falls below a predetermined threshold.
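Structurally, the algorithm is just a loop. The skeleton below is a sketch of that structure only; the e_step, m_step, and log_likelihood functions are placeholders that a concrete model (such as the Gaussian mixture example below) would supply.

```python
def em(x, theta0, e_step, m_step, log_likelihood, tol=1e-8, max_iter=500):
    """Generic EM loop: alternate E- and M-steps until the log-likelihood stabilizes.

    e_step(x, theta)         -> expected quantities (e.g. responsibilities) under theta
    m_step(x, expectations)  -> updated parameter estimate
    log_likelihood(x, theta) -> marginal log-likelihood, used to monitor convergence
    """
    theta = theta0
    prev_ll = float("-inf")
    for _ in range(max_iter):
        expectations = e_step(x, theta)   # E-step
        theta = m_step(x, expectations)   # M-step
        ll = log_likelihood(x, theta)
        if abs(ll - prev_ll) < tol:       # stop once the improvement is negligible
            break
        prev_ll = ll
    return theta
```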
Basic Example
Before we dive into the general Gaussian Mixture Model (GMM) case, let's consider a simplified version: a mixture of two 1-dimensional normal distributions. This will help clarify the steps of the Expectation-Maximization (EM) algorithm in a less complex context.
Setup
Suppose we observe data \(X=\left\{x_1,x_2,\dots,x_n\right\}\) and believe it comes from a mixture of two normal distributions with unknown parameters. Specifically, we assume each \(x_i\) is generated by one of the following two distributions:
- \(N(\mu_1,\sigma_1^2)\) with weight \(\pi_1\)
- \(N(\mu_2,\sigma_2^2)\) with weight \(1-\pi_1\)
Our goal is to estimate the parameters \(\mu_1,\sigma_1^2,\mu_2,\sigma_2^2,\pi_1\) from the data.
We will initialize the parameters \(\mu_1,\sigma_1^2,\mu_2,\sigma_2^2,\pi_1\) randomly or using some heuristic. For simplicity, we can start with:
- Initial means: \(\mu_1^{(0)}=\min(X); \mu_2^{(0)}=\max(X)\)
- Initial variances: \(\sigma_1^2\) and \(\sigma_2^2\) can be initialized to the variance of \(X\)
- Mixing coefficient: \(\pi_1=0.5\) (assuming both components contribute equally at the start)
E-Step
In the E-step, we compute the responsibilities, which tell us how much each data point \(x_i\) "belongs" to each component (i.e., how likely it is that each point was generated by component 1 or component 2).
The responsibility for the first component is calculated as:
$$\gamma_{i1}^{(t)} = \frac{\pi_1^{(t)} \mathcal{N}(x_i \mid \mu_1^{(t)}, \sigma_1^{(t)2})}{\pi_1^{(t)} \mathcal{N}(x_i \mid \mu_1^{(t)}, \sigma_1^{(t)2}) + (1 - \pi_1^{(t)}) \mathcal{N}(x_i \mid \mu_2^{(t)}, \sigma_2^{(t)2})}$$
Since the responsibilities add up to 1, we can simply calculate the responsibility for the second component as:
$$\gamma_{i2}^{(t)} = 1 - \gamma_{i1}^{(t)}$$
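A sketch of this E-step for the two-component case (assuming NumPy and SciPy; the parameter names are just illustrative):

```python
import numpy as np
from scipy.stats import norm

def e_step(x, pi1, mu1, sigma1, mu2, sigma2):
    """Return gamma_i1, the responsibility of component 1 for each data point."""
    x = np.asarray(x, dtype=float)
    weighted1 = pi1 * norm.pdf(x, loc=mu1, scale=sigma1)
    weighted2 = (1.0 - pi1) * norm.pdf(x, loc=mu2, scale=sigma2)
    gamma1 = weighted1 / (weighted1 + weighted2)
    return gamma1  # gamma_i2 = 1 - gamma_i1
```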
M-Step
In the M-step, we update the parameters based on the responsibilities computed in the E-step:
We perform the following calculations.
Update the means
$$\mu_1^{(t+1)} = \frac{\sum_{i=1}^n \gamma_{i1}^{(t)} x_i}{\sum_{i=1}^n \gamma_{i1}^{(t)}}$$
$$\mu_2^{(t+1)} = \frac{\sum_{i=1}^n \gamma_{i2}^{(t)} x_i}{\sum_{i=1}^n \gamma_{i2}^{(t)}}$$
Update the variances
$$\sigma_1^{(t+1)2} = \frac{\sum_{i=1}^n \gamma_{i1}^{(t)} (x_i - \mu_1^{(t+1)})^2}{\sum_{i=1}^n \gamma_{i1}^{(t)}}$$
$$\sigma_2^{(t+1)2} = \frac{\sum_{i=1}^n \gamma_{i2}^{(t)} (x_i - \mu_2^{(t+1)})^2}{\sum_{i=1}^n \gamma_{i2}^{(t)}}$$
Update the mixing coefficient
$$\pi_1^{(t+1)} = \frac{1}{n} \sum_{i=1}^n \gamma_{i1}^{(t)}$$
$$\pi_2^{(t+1)} = 1 - \pi_1^{(t+1)}$$
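Putting the pieces together, here is a sketch of the M-step updates and the full iteration for this two-component model. It reuses the e_step function from the sketch above, the initialization follows the heuristic described earlier, and all names are illustrative rather than a reference implementation.

```python
import numpy as np

def m_step(x, gamma1):
    """Update the parameters from the responsibilities of component 1."""
    x = np.asarray(x, dtype=float)
    gamma2 = 1.0 - gamma1
    n1, n2 = gamma1.sum(), gamma2.sum()
    mu1 = np.sum(gamma1 * x) / n1
    mu2 = np.sum(gamma2 * x) / n2
    sigma1 = np.sqrt(np.sum(gamma1 * (x - mu1) ** 2) / n1)
    sigma2 = np.sqrt(np.sum(gamma2 * (x - mu2) ** 2) / n2)
    pi1 = n1 / len(x)
    return pi1, mu1, sigma1, mu2, sigma2

def fit_two_component_mixture(x, n_iter=100):
    """Run EM for a fixed number of iterations (a convergence check could be added)."""
    x = np.asarray(x, dtype=float)
    # Initialization heuristic from the text: extremes for the means,
    # the sample standard deviation for both components, and equal weights.
    pi1, mu1, mu2 = 0.5, x.min(), x.max()
    sigma1 = sigma2 = x.std()
    for _ in range(n_iter):
        gamma1 = e_step(x, pi1, mu1, sigma1, mu2, sigma2)  # E-step (sketch above)
        pi1, mu1, sigma1, mu2, sigma2 = m_step(x, gamma1)  # M-step
    return pi1, mu1, sigma1, mu2, sigma2
```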
With Numbers
Let \(X=\left\{-5,-4,-3,0,1,2,3,8,9,10\right\}\).
Using the heuristic above for the means and mixing coefficient, and a basic guess of 1 for the variances, we take initial values of
$$\mu_1^{(0)}=-5;\mu_2^{(0)}=10$$
$$\sigma_1^{(0)2}=1,\ \sigma_2^{(0)2}=1$$
$$\pi_1^{(0)}=0.5$$
Step E-1
For the first data point, \(x_1=-5\):
$$\gamma_{11}^{(0)}=\frac{0.5\times \mathcal{N}(-5\mid -5,1)}{0.5\times \mathcal{N}(-5\mid -5,1)+0.5\times \mathcal{N}(-5\mid 10,1)}$$
Since \(x_1=-5\) coincides with \(\mu_1^{(0)}=-5\) and is far from \(\mu_2^{(0)}=10\), the responsibility \(\gamma_{11}^{(0)}\) will be very close to 1.
We then continue to do this for each data point.
Step M-1
After computing responsibilities for all data points, we update the parameters. The updated means might be, for example:
$$\mu_1^{(1)}=-4.9; \mu_2^{(1)}=9.2$$
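Using the sketches above, a single EM iteration on this data can be run as follows; the exact numbers you get will depend on the initial variances and on implementation details, so they are not reproduced here.

```python
import numpy as np

x = np.array([-5, -4, -3, 0, 1, 2, 3, 8, 9, 10], dtype=float)

# Initial guesses from the text
pi1, mu1, sigma1, mu2, sigma2 = 0.5, -5.0, 1.0, 10.0, 1.0

# One E-step followed by one M-step (functions from the sketches above)
gamma1 = e_step(x, pi1, mu1, sigma1, mu2, sigma2)
print(np.round(gamma1, 3))  # responsibilities for component 1
print(m_step(x, gamma1))    # updated (pi1, mu1, sigma1, mu2, sigma2)
```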
Other Notes
In the example, we were given the number of components and their distribution families. If these are not known, the problem becomes more difficult. When the number of components is unknown, candidate models can be compared with a penalized criterion such as AIC or BIC. Another quantitative way of assessing performance is cross-validation.
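One way to do this in practice (a sketch assuming scikit-learn is available; its GaussianMixture class fits a GMM by EM) is to fit the model for several candidate values of k and pick the one with the lowest BIC:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data from a two-component mixture, just for illustration
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(4, 1.5, 300)]).reshape(-1, 1)

bics = {}
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, random_state=0).fit(x)
    bics[k] = gm.bic(x)  # lower BIC is better

best_k = min(bics, key=bics.get)
print(bics, best_k)
```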
Label Switching
Because the likelihood is invariant under permutations of the component labels, there are multiple equivalent solutions: swapping the parameters and weights of two components leaves the likelihood unchanged. This can make individual component parameters hard to interpret or to compare across runs.
Bayes' Theorem for Mixtures
$$P(Z=i|X=x)=\frac{\pi_if_{X_i}(x)}{f_X(x)}$$
This is the posterior probability that an observation \(x\) came from component \(i\); these are exactly the responsibilities computed in the E-step of the EM algorithm.
Sampling from a Mixture Distribution
Sampling from a mixture distribution is relatively simple, provided the component distributions are easy to sample from.
Steps
Step 1: Select a Component Based on Weights
The first step is to randomly select one of the component distributions according to their weights. This step uses the weights as probabilities: if a component has a larger weight, it is more likely to be selected.
You can think of this step as performing a weighted coin flip or drawing from a categorical distribution where each category corresponds to one of the component distributions. This is essentially multinomial sampling with a single trial, which was covered in
Discrete Sampling.
Example:
- Assume you have two components: a normal distribution \(N(\mu_1,\sigma_1^2)\) with weight 0.7 and an exponential distribution \(\text{Exp}(\lambda_2)\) with weight 0.3.
- Generate a random number \(u\) from the uniform distribution \(U(0,1)\). If \(u\leq 0.7\), select the normal distribution; otherwise, select the exponential distribution.
Step 2: Sample from the Selected Component
After selecting a component distribution, the next step is to generate a sample from the chosen distribution.
Example:
- If you selected the normal distribution in the previous step, sample a value from \(N(\mu_1,\sigma_1^2)\)
- If you selected the exponential distribution, sample a value from \(\text{Exp}(\lambda_2)\).
Pseudocode
Here is pseudocode for sampling from a mixture distribution:
Input:
- List of component distributions: D1, D2, ..., Dk
(Each D represents a different distribution, like normal, exponential, or uniform. These are the components of the mixture.)
- List of corresponding weights: w1, w2, ..., wk
(The weights determine how likely each component is to be selected. They must sum to 1 so that they form a valid probability distribution.)
Output:
- Sampled value from the mixture distribution
(This is the final value sampled from one of the component distributions, determined by the weights.)
1. Generate a random number u from U(0, 1)
(Generate a uniform random number between 0 and 1. This number will be used to select a component distribution based on the weights. The value of `u` will help simulate the probability of selecting each component.)
2. Use the weights to determine which component distribution to sample from:
- Set cumulative_weight = 0
(This is an accumulator that will store the running sum of weights. It helps track when the random number `u` falls into a range corresponding to a specific component.)
- For each component i from 1 to k:
(Iterate through each of the k component distributions.)
a. cumulative_weight += wi
(Add the weight of the current component `wi` to the cumulative total. This step accumulates the probability mass for each component.)
b. If u <= cumulative_weight:
(Check if the random number `u` falls within the cumulative probability range for the current component.)
- Select component Di
(If the condition is met, the current component `Di` is selected for sampling, since the random number `u` fell within its weight range.)
- Break the loop
(Exit the loop once a component is selected, as there's no need to check the remaining components.)
3. Sample a value from the selected component distribution
(After a component is selected, sample a value from that specific distribution. Each distribution has its own sampling method, such as using the Box-Muller transform for a normal distribution, or inverse transform sampling for an exponential distribution.)
4. Return the sampled value
(The function ends by returning the value sampled from the selected component distribution, which represents the final sample from the mixture distribution.)
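As a concrete counterpart to the pseudocode, here is a minimal Python sketch (assuming NumPy is available) that samples from the normal/exponential mixture used in the example above; the component parameter values are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(42)

# Components of the example mixture: N(mu1, sigma1^2) with weight 0.7,
# Exp(lambda2) with weight 0.3. Parameter values are placeholders.
mu1, sigma1 = 0.0, 1.0
lam2 = 0.5
weights = [0.7, 0.3]
samplers = [
    lambda: rng.normal(mu1, sigma1),      # component 1: normal
    lambda: rng.exponential(1.0 / lam2),  # component 2: exponential (NumPy takes scale = 1/rate)
]

def sample_mixture():
    """Step 1: pick a component by its weight. Step 2: sample from that component."""
    u = rng.uniform(0.0, 1.0)
    cumulative = 0.0
    for w, draw in zip(weights, samplers):
        cumulative += w
        if u <= cumulative:
            return draw()
    return samplers[-1]()  # guard against floating-point round-off in the weights

samples = np.array([sample_mixture() for _ in range(10_000)])
print(samples.mean(), samples.std())
```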