Distribution Textbook (Work in Progress)

by John Della Rosa

Mixture Distributions

Introduction to Mixture Distributions

Recommended Prerequisites

  1. Probability
  2. Probability 2
  3. Maximum Likelihood Estimation

Explanation

In probability and statistics, a mixture distribution arises when the population being studied is composed of several distinct subpopulations, each of which follows a different probability distribution. Instead of modeling the population as a whole with a single distribution, we assume that the data is generated from one of several component distributions, each with its own characteristics. Formally, a mixture distribution represents a probability distribution made up of a weighted combination of other distributions. This should not be confused with the sum of realizations of random variables, which is easy to mistake for a mixture; the distribution of a sum is instead given by convolution.

Definition

A mixture distribution is defined as a weighted sum of two or more component distributions. Suppose there are k component distributions \(f_1(x),f_2(x),\dots,f_k(x)\), with weights \(w_1,w_2,\dots,w_k\), where \(w_i\geq 0\) and \(\sum_{i=1}^{k}w_i=1\). Then the PDF of the mixture distribution is given by: $$f(x)=\sum_{i=1}^{k}w_if_i(x)$$ Similarly, the discrete case is given by: $$P(X=x)=\sum_{i=1}^k w_i P(X_i=x)$$ One possible interpretation is that \(w_i\) is the probability of sampling from \(f_i(x)\). Some literature denotes the weights by \(\pi_i\) instead of \(w_i\). This makes mixture models and their extensions useful for latent variable models: $$X\mid Z=i\sim F_{X_i}$$ $$f_{X}(x)=\sum_{i=1}^kP(Z=i)f_{X_i}(x)$$ This approach can also be useful when combining multiple models.
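To make the weighted-sum form concrete, here is a minimal sketch of evaluating a two-component mixture PDF in Python (assuming NumPy and SciPy are available; the particular components and weights below are only illustrative):

    import numpy as np
    from scipy.stats import norm, expon

    # Illustrative mixture: 0.4 * N(0, 1) + 0.6 * Exponential(scale = 1)
    weights = np.array([0.4, 0.6])
    x = np.linspace(-3, 5, 9)

    # Weighted sum of component densities: f(x) = sum_i w_i f_i(x)
    mixture_pdf = (weights[0] * norm.pdf(x, loc=0, scale=1)
                   + weights[1] * expon.pdf(x, scale=1))
    print(mixture_pdf)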

Moments

Mean

Since the expectation is a linear operator, the mean is given by: $$\mathbb{E}[X]=\sum_{i=1}^k \pi_{i}\mathbb{E}[X_i]$$

Variance

$$\text{Var}(X)=\sum_{i=1}^k\pi_i(\text{Var}(X_i)+(\mathbb{E}[X_i])^2)-(\mathbb{E}[X])^2$$
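These formulas can be checked numerically with a quick Monte Carlo sketch (illustrative parameters; NumPy assumed to be available):

    import numpy as np

    rng = np.random.default_rng(42)
    pi = np.array([0.3, 0.7])      # mixing weights
    mu = np.array([0.0, 4.0])      # component means
    sigma = np.array([1.0, 2.0])   # component standard deviations

    # Analytic mixture moments from the formulas above
    mean = np.sum(pi * mu)
    var = np.sum(pi * (sigma**2 + mu**2)) - mean**2

    # Monte Carlo: pick a component for each draw, then sample from it
    z = rng.choice(2, size=200_000, p=pi)
    samples = rng.normal(mu[z], sigma[z])
    print(mean, samples.mean())   # these should roughly agree
    print(var, samples.var())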

CDF

Since integration is a linear operator, the CDF has the same weighted-sum relationship to the component CDFs as the PDF/PMF has to the component densities. $$F_X(x)=\sum_{i=1}^k \pi_i F_{X_i}(x)$$

MGF

$$M_X(t)=\sum_{i=1}^k\pi_iM_{X_i}(t)$$

Likelihood

For independent observations \(x_1,x_2,\dots,x_n\): $$L(\theta)=\prod_{j=1}^{n}f_X(x_j;\theta)=\prod_{j=1}^n(\sum_{i=1}^k\pi_if_{X_i}(x_j;\theta))$$
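In code, this log-likelihood might be computed as follows (a minimal sketch with illustrative data and parameters; NumPy and SciPy assumed, with a two-component normal mixture used for the components):

    import numpy as np
    from scipy.stats import norm

    def mixture_log_likelihood(x, pis, mus, sigmas):
        # Component densities weighted by their mixing proportions: shape (n, k)
        comp = np.array([pi * norm.pdf(x, mu, sd)
                         for pi, mu, sd in zip(pis, mus, sigmas)]).T
        # Sum over components inside the log, then sum the log-densities over observations
        return np.sum(np.log(comp.sum(axis=1)))

    x = np.array([-5.0, -4.0, 0.0, 3.0, 9.0])
    print(mixture_log_likelihood(x, [0.5, 0.5], [-5.0, 10.0], [1.0, 1.0]))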

Expectation-Maximization Algorithm

Note: The reader may want to come back to this section after reading the Marginal Likelihood section. There is somewhat of a circular dependency: that chapter builds on material that, in turn, builds on this chapter. The contents of this section remain here since they are strongly tied to mixture distributions.

In the MLE chapter, we went over the eponymous parameter estimation method. We will now build on that in the context of Mixture Distributions with something called the Expectation-Maximization Algorithm. The Expectation-Maximization (EM) algorithm is a powerful iterative method for estimating parameters in statistical models, particularly in cases where the data is incomplete or has a latent variable structure. It is widely used in a variety of applications such as clustering (e.g., Gaussian Mixture Models), missing data problems, and latent variable models. The EM algorithm alternates between two steps: the Expectation step (E-step) and the Maximization step (M-step), with the goal of finding the parameters that maximize the likelihood function.

Setup

Consider a statistical model with observed data \(X=\left\{x_1,x_2,\dots,x_n\right\}\), and unobserved (latent) variables \(Z=\left\{z_1,z_2,\dots,z_n\right\}\). The model is governed by a set of parameters \(\theta\), and we wish to estimate these parameters given only the observed data \(X\). The complete data likelihood is given by: $$L(\theta;X,Z)=P(X,Z|\theta)$$ However, since we do not observe Z, we must work with the marginal likelihood of the observed data: $$L(\theta;X)=\sum_{Z}P(X,Z|\theta)$$

Algorithm

The EM algorithm seeks to maximize the marginal likelihood by iteratively applying the following two steps:
E-step (Expectation Step)
In this step, we compute the expected value of the complete-data log likelihood, \(Q(\theta;\theta^{(t)})\), given the observed data and the current estimate of the parameters \(\theta^{(t)}\). This expectation is taken with respect to the conditional distribution of the latent variables Z given the observed data X and the parameter estimates. $$Q(\theta;\theta^{(t)})=\mathbb{E}_{Z|X,\theta^{(t)}}[\log L(\theta;X,Z)]$$
M-Step (Maximization Step)
In this step, we maximize the expected complete-data log-likelihood \(Q(\theta;\theta^{(t)})\) with respect to the parameters \(\theta\), to obtain an updated estimate of the parameters \(\theta^{(t+1)}\). $$\theta^{(t+1)}=\arg \max_{\theta}Q(\theta;\theta^{(t)})$$ The algorithm repeats these two steps until convergence, typically when the change in log-likelihood or parameter values between iterations falls below a predetermined threshold.

Basic Example

Before we dive into the general Gaussian Mixture Model (GMM) case, let's consider a simplified version: a mixture of two 1-dimensional normal distributions. This will help clarify the steps of the Expectation-Maximization (EM) algorithm in a less complex context.

Setup

Suppose we observe data \(X=\left\{x_1,x_2,\dots,x_n\right\}\), and we believe the data comes from a mixture of two normal distributions with unknown parameters. Specifically, we assume each \(x_i\) is generated by one of the following two distributions:
  1. \(N(\mu_1,\sigma_1^2)\) with weight \(\pi_1\)
  2. \(N(\mu_2,\sigma_2^2)\) with weight \(1-\pi_1\)
Our goal is to estimate the parameters \(\mu_1,\sigma_1^2,\mu_2,\sigma_2^2,\pi_1\) from the data. We will initialize these parameters randomly or using some heuristic. For simplicity, we can start by taking the smallest and largest observations as the initial means and an even 0.5 split for the mixing coefficient.

E-Step

In the E-step, we compute the responsibilities, which tell us how much each data point \(x_i\) "belongs" to each component (i.e., how likely it is that each point was generated by component 1 or component 2). The responsibility for the first component is calculated as: $$\gamma_{i1}^{(t)} = \frac{\pi_1^{(t)} \mathcal{N}(x_i \mid \mu_1^{(t)}, \sigma_1^{(t)2})}{\pi_1^{(t)} \mathcal{N}(x_i \mid \mu_1^{(t)}, \sigma_1^{(t)2}) + (1 - \pi_1^{(t)}) \mathcal{N}(x_i \mid \mu_2^{(t)}, \sigma_2^{(t)2})}$$ Since the responsibilities add up to 1, we can simply calculate the responsibility for the second component as: $$\gamma_{i2}^{(t)} = 1 - \gamma_{i1}^{(t)}$$

M-Step

In the M-step, we update the parameters based on the responsibilities computed in the E-step, performing the following calculations.
Update the means
$$\mu_1^{(t+1)} = \frac{\sum_{i=1}^n \gamma_{i1}^{(t)} x_i}{\sum_{i=1}^n \gamma_{i1}^{(t)}}$$ $$\mu_2^{(t+1)} = \frac{\sum_{i=1}^n \gamma_{i2}^{(t)} x_i}{\sum_{i=1}^n \gamma_{i2}^{(t)}}$$
Update the variances
$$\sigma_1^{(t+1)2} = \frac{\sum_{i=1}^n \gamma_{i1}^{(t)} (x_i - \mu_1^{(t+1)})^2}{\sum_{i=1}^n \gamma_{i1}^{(t)}}$$ $$\sigma_2^{(t+1)2} = \frac{\sum_{i=1}^n \gamma_{i2}^{(t)} (x_i - \mu_2^{(t+1)})^2}{\sum_{i=1}^n \gamma_{i2}^{(t)}}$$
Update the mixing coefficient
$$\pi_1^{(t+1)} = \frac{1}{n} \sum_{i=1}^n \gamma_{i1}^{(t)}$$ $$\pi_2^{(t+1)} = 1 - \pi_1^{(t+1)}$$

With Numbers

Let \(X=\left\{-5,-4,-3,0,1,2,3,8,9,10\right\}\). Using the heuristic mentioned above for the means and mixing coefficient and a basic guess of 1 for the variances, we'll give initial guesses of $$\mu_1^{(0)}=-5;\quad\mu_2^{(0)}=10$$ $$\sigma_1^{(0)2}=1,\quad \sigma_2^{(0)2}=1$$ $$\pi_1^{(0)}=0.5$$
Step E-1
$$\gamma_{11}^{(0)}=\frac{0.5\times \mathcal{N}(-5\mid-5,1)}{0.5\times \mathcal{N}(-5\mid-5,1)+0.5\times \mathcal{N}(-5\mid 10,1)}$$ Since \(x_1=-5\) is close to \(\mu_1^{(0)}=-5\) and far from \(\mu_2^{(0)}=10\), the responsibility \(\gamma_{11}^{(0)}\) will be essentially 1. We then repeat this calculation for each data point.
Step M-1
After computing responsibilities for all data points, we update the parameters. The updated means might be, for example: $$\mu_1^{(1)}=-4.9; \mu_2^{(1)}=9.2$$
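The full iteration is straightforward to automate. Below is a minimal sketch of the E- and M-steps above applied to this data (NumPy and SciPy assumed; the fixed iteration count and variable names are illustrative choices, not part of the algorithm itself):

    import numpy as np
    from scipy.stats import norm

    x = np.array([-5, -4, -3, 0, 1, 2, 3, 8, 9, 10], dtype=float)

    # Initial guesses from the heuristic above
    mu1, mu2 = -5.0, 10.0
    var1, var2 = 1.0, 1.0
    pi1 = 0.5

    # A fixed number of iterations keeps the sketch simple; in practice,
    # stop when the change in log-likelihood falls below a threshold.
    for _ in range(50):
        # E-step: responsibilities for component 1
        p1 = pi1 * norm.pdf(x, mu1, np.sqrt(var1))
        p2 = (1 - pi1) * norm.pdf(x, mu2, np.sqrt(var2))
        gamma1 = p1 / (p1 + p2)
        gamma2 = 1 - gamma1

        # M-step: update means, variances, and the mixing coefficient
        mu1 = np.sum(gamma1 * x) / np.sum(gamma1)
        mu2 = np.sum(gamma2 * x) / np.sum(gamma2)
        var1 = np.sum(gamma1 * (x - mu1) ** 2) / np.sum(gamma1)
        var2 = np.sum(gamma2 * (x - mu2) ** 2) / np.sum(gamma2)
        pi1 = gamma1.mean()

    print(mu1, mu2, var1, var2, pi1)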

Other Notes

In the example, we were given the number of components and their distribution families. If these are not known, the problem becomes more difficult. When the number of components is unknown, you can use a penalized model-selection criterion such as AIC or BIC, as sketched below. Another quantitative way of assessing performance is cross-validation.
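As one sketch of this kind of model selection (assuming scikit-learn is available; the GaussianMixture class, its bic/aic methods, and the toy data here are illustrative and not something used earlier in this chapter), you could compare the criterion across candidate component counts and keep the smallest:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Toy data drawn from two well-separated normal clusters (illustrative only)
    rng = np.random.default_rng(0)
    data = np.concatenate([rng.normal(-5, 1, 200),
                           rng.normal(9, 1, 100)]).reshape(-1, 1)

    # Fit mixtures with 1 to 5 components; lower BIC/AIC indicates a better trade-off
    for k in range(1, 6):
        gmm = GaussianMixture(n_components=k, random_state=0).fit(data)
        print(k, gmm.bic(data), gmm.aic(data))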

Label Switching

Because the likelihood is invariant under permutations of the component labels, there are multiple equivalent solutions, one for each relabeling of the components.

Bayes' Theorem for Mixtures

$$P(Z=i|X=x)=\frac{\pi_if_{X_i}(x)}{f_X(x)}$$
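As a quick numeric illustration of this posterior (a minimal sketch, reusing the two-component normal mixture from the example above as an assumed model):

    from scipy.stats import norm

    # P(Z = 1 | X = x) for a 0.5/0.5 mixture of N(-5, 1) and N(10, 1), evaluated at x = 0
    pi1, pi2 = 0.5, 0.5
    x = 0.0
    num = pi1 * norm.pdf(x, loc=-5, scale=1)
    den = num + pi2 * norm.pdf(x, loc=10, scale=1)
    print(num / den)  # essentially 1, since x = 0 is far closer to the first component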

Sampling from a Mixture Distribution

Sampling from a mixture distribution is relatively simple, provided the component distributions are easy to sample from.

Steps

Step 1: Select a Component Based on Weights
The first step is to randomly select one of the component distributions according to their weights. This step uses the weights as probabilities, so if a component has a larger weight, it is more likely to be selected. You can think of this step as performing a weighted coin flip or drawing from a categorical distribution where each category corresponds to one of the component distributions. This is essentially multinomial sampling with one trial, which was covered in Discrete Sampling.

Example: with weights 0.3 and 0.7, draw \(u\sim U(0,1)\); if \(u\leq 0.3\), select component 1, otherwise select component 2.
Step 2: Sample from the Selected Component
After selecting a component distribution, the next step is to generate a sample from the chosen distribution.
Example: if the \(N(0,1)\) component was selected, draw a standard normal variate using your preferred method (e.g., the Box-Muller transform).

Pseudocode

Here is pseudocode for sampling from a mixture distribution:

    Input:
    - List of component distributions: D1, D2, ..., Dk
      (Each D represents a different distribution, like normal, exponential, or uniform. These are the components of the mixture.)
    - List of corresponding weights: w1, w2, ..., wk
      (The weights determine how likely each component is to be selected. They must sum to 1 so that they form a valid probability distribution.)

    Output:
    - Sampled value from the mixture distribution
      (This is the final value sampled from one of the component distributions, determined by the weights.)

    1. Generate a random number u from U(0, 1)
       (Generate a uniform random number between 0 and 1. This number will be used to select a component distribution based on the weights.)

    2. Use the weights to determine which component distribution to sample from:
       - Set cumulative_weight = 0
         (This is an accumulator that will store the running sum of weights. It helps track when the random number u falls into a range corresponding to a specific component.)
       - For each component i from 1 to k:
         (Iterate through each of the k component distributions.)
         a. cumulative_weight += wi
            (Add the weight of the current component wi to the cumulative total. This step accumulates the probability mass for each component.)
         b. If u <= cumulative_weight:
            (Check if the random number u falls within the cumulative probability range for the current component.)
            - Select component Di
              (If the condition is met, the current component Di is selected for sampling, since the random number u fell within its weight range.)
            - Break the loop
              (Exit the loop once a component is selected, as there is no need to check the remaining components.)

    3. Sample a value from the selected component distribution
       (After a component is selected, sample a value from that specific distribution. Each distribution has its own sampling method, such as using the Box-Muller transform for a normal distribution, or inverse transform sampling for an exponential distribution.)

    4. Return the sampled value
       (The function ends by returning the value sampled from the selected component distribution, which represents the final sample from the mixture distribution.)
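A runnable Python version of this pseudocode might look like the following (a minimal sketch; the function name sample_mixture and the particular components and weights are illustrative):

    import random

    def sample_mixture(components, weights):
        # components: list of zero-argument callables, each returning one draw
        #             from a component distribution
        # weights:    list of non-negative weights summing to 1
        u = random.random()                  # Step 1: u ~ U(0, 1)
        cumulative_weight = 0.0
        for sampler, w in zip(components, weights):
            cumulative_weight += w           # Step 2: accumulate probability mass
            if u <= cumulative_weight:
                return sampler()             # Steps 3-4: sample from the selected component
        return components[-1]()              # guard against floating-point round-off

    # Example: the two-component normal mixture from the EM example above
    components = [lambda: random.gauss(-5, 1),
                  lambda: random.gauss(10, 1)]
    samples = [sample_mixture(components, [0.5, 0.5]) for _ in range(1000)]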

Mixture Distribution Practice Problems

For any parameter estimation problem that doesn't explicitly ask for it to be done by hand, feel free to use your programming language of choice and a reasonable stopping threshold. It is unreasonable to do more than one iteration of the EM algorithm by hand for more than a few data points.
  1. Write code in your preferred language which samples from a mixture distribution that is N(0,1) with weight 0.3 and N(4,2) with weight 0.7.
  2. Understanding Mixture Distributions:
    1. What is a mixture distribution, and how is it different from a simple distribution? Explain the concept of a mixture distribution using an example of combining two normal distributions.
  3. Plotting a Simple Mixture Distribution:
    1. Given two normal distributions \(N(2, 1)\) and \(N(6, 1)\) with equal weights in the mixture, plot the probability density function of the mixture. What do you observe about the shape of the mixture distribution?
  4. Comparing Mixtures:
    1. How does increasing the weight of one component in a mixture affect the overall mixture distribution? Consider the case of mixing two normal distributions with different means. Provide an example to illustrate your answer.
  5. Simulating a Mixture Distribution:
    1. Generate random samples from a mixture of two normal distributions with the following parameters: \(X_1 \sim N(1, 1)\) and \(X_2 \sim N(4, 1)\). The mixture proportion is 0.7 for \(X_1\) and 0.3 for \(X_2\). Simulate 1000 points and plot the histogram of the resulting mixture.
  6. Mixture of Exponential Distributions:
    1. Consider a mixture of two exponential distributions with rate parameters \(\lambda_1 = 1\) and \(\lambda_2 = 0.5\). The mixture proportions are 0.4 for \(X_1\) and 0.6 for \(X_2\). Write down the PDF of the mixture distribution.
    2. What is the expected value of the mixture distribution?
  7. Mean of a Mixture Distribution:
    1. Suppose you have two normal distributions \(X_1 \sim N(3, 1)\) and \(X_2 \sim N(7, 2)\), with mixture weights 0.3 and 0.7, respectively. Find the mean of the mixture distribution.
  8. Mixture of Categorical Distributions:
    1. Consider a mixture of two categorical distributions with the following probability mass functions: \[ P_1 = (0.1, 0.3, 0.6) \quad \text{and} \quad P_2 = (0.5, 0.3, 0.2) \] If the mixture weight for \(P_1\) is 0.4 and for \(P_2\) is 0.6, what is the probability mass function of the mixture distribution?
  9. Sampling from a Normal-Exponential Mixture Distribution

    Consider a mixture distribution with two components:

    • A normal distribution \( N(0, 1) \) with weight \( 0.6 \)
    • An exponential distribution with rate parameter \( \lambda = 2 \) and weight \( 0.4 \)

    1. Write the probability density function (PDF) of the mixture distribution.
    2. If you were to generate a sample from this mixture distribution, what is the probability that the sample is drawn from the normal distribution?
    3. Write a pseudocode algorithm to generate a sample from this mixture distribution.
  10. Mixture of Uniform Distributions

    You are given a mixture of two uniform distributions:

    • \( U(0, 1) \) with weight \( 0.3 \)
    • \( U(2, 4) \) with weight \( 0.7 \)
    1. Compute the probability that a randomly sampled value from the mixture is less than 1.
    2. Find the cumulative distribution function (CDF) of the mixture distribution.
    3. Write a Python function that samples from this mixture distribution.
  11. Basic Definition of Mixture Distributions:
    1. Suppose you have two component distributions: \(X_1 \sim N(\mu_1, \sigma_1^2)\) and \(X_2 \sim N(\mu_2, \sigma_2^2)\). The mixture distribution is given by: \[ f_X(x) = p f_{X_1}(x) + (1 - p) f_{X_2}(x) \] Derive the mean and variance of the mixture distribution in terms of the means and variances of \(X_1\) and \(X_2\), as well as the mixture weight \(p\).
    2. If \( \mu_1 = 1 \), \( \sigma_1 = 1 \), \( \mu_2 = 4 \), and \( \sigma_2 = 2 \), compute the mixture's mean and variance for \(p = 0.3\).
  12. Mixture of Three Normal Distributions

    Consider a mixture of three normal distributions with the following parameters:

    • \( N(-1, 0.5^2) \) with weight \( 0.2 \)
    • \( N(2, 1^2) \) with weight \( 0.5 \)
    • \( N(4, 0.25^2) \) with weight \( 0.3 \)
    1. Derive the expected value \( E[X] \) of the mixture distribution.
    2. Derive the variance \( \text{Var}(X) \) of the mixture distribution.
    3. Describe the steps to sample from this mixture distribution.
  13. Mixture of Exponential Distributions:
    1. Suppose a dataset is generated from a mixture of two exponential distributions with rate parameters \(\lambda_1\) and \(\lambda_2\). The mixture model is: \[ f_X(x) = p \lambda_1 e^{-\lambda_1 x} + (1 - p) \lambda_2 e^{-\lambda_2 x} \]
      1. Find the mean and variance of the mixture distribution in terms of \(\lambda_1\), \(\lambda_2\), and \(p\).
      2. Given the dataset: \[ \{0.2, 0.5, 0.8, 1.1, 2.5, 2.8, 3.2, 3.5\} \] Estimate the parameters \(\lambda_1\), \(\lambda_2\), and \(p\) using the method of moments.
  14. Mixture of Binomial Distributions:
    1. Consider a mixture of two binomial distributions, where: \[ X_1 \sim \text{Binomial}(n, p_1), \quad X_2 \sim \text{Binomial}(n, p_2) \] The mixture distribution is: \[ P(X = k) = p \binom{n}{k} p_1^k (1 - p_1)^{n-k} + (1 - p) \binom{n}{k} p_2^k (1 - p_2)^{n-k} \]
      1. Derive the mean and variance of the mixture distribution.
      2. If \(n = 10\), \(p_1 = 0.3\), \(p_2 = 0.7\), and \(p = 0.4\), calculate the probability mass function for \(X\).
  15. Identifying Mixture Distributions from Data:
    1. Given the following data: \[ \{0.5, 0.7, 1.0, 1.5, 4.2, 4.5, 5.0, 5.2\} \] Assume the data comes from a mixture of two normal distributions. Estimate the parameters (means, variances, and mixing proportion) for the component distributions using the method of moments.
  16. Mixture of Poisson Distributions:
    1. Suppose you have a mixture of two Poisson distributions: \[ X_1 \sim \text{Poisson}(\lambda_1), \quad X_2 \sim \text{Poisson}(\lambda_2) \] The mixture is defined as: \[ P(X = k) = p \frac{\lambda_1^k e^{-\lambda_1}}{k!} + (1 - p) \frac{\lambda_2^k e^{-\lambda_2}}{k!} \]
      1. Derive the mean and variance of the mixture distribution.
      2. If \(\lambda_1 = 3\), \(\lambda_2 = 7\), and \(p = 0.6\), compute the probability that \(X = 4\).