Distribution Textbook (Work in Progress)

by John Della Rosa

Introduction to Marginal Likelihood

Recommended Prerequisites

  1. Probability

Introduction

In the previous chapters, we explored compound distributions. In a compound distribution, one distribution controls the parameters of another, leading to a multi-layered structure. As we extend this understanding, marginal likelihood provides a framework for evaluating how well these models (and others) explain observed data, accounting for all possible values of the parameters. By integrating over the unknown parameters, marginal likelihood gives us a comprehensive measure of the model's fit to the data, offering insights that can guide model selection and decision-making.

Recap on Marginalization in the Context of Previous Chapters

Mixture Distributions

In a mixture distribution, the observed data is generated from one of several component distributions, with the component chosen according to a set of probabilities (weights). For example, in a Gaussian Mixture Model (GMM) with K components: $$p(x)=\sum_{k=1}^{K}\pi_{k}\,\mathcal{N}(x\mid\mu_k,\sigma_k^2)$$ Here, each component \(\mathcal{N}(x\mid\mu_k,\sigma_k^2)\) is a Gaussian density, and the \(\pi_k\) are the mixture weights (summing to 1). The marginal distribution of X is the weighted sum of the component densities, obtained by summing over the latent variable indicating which component generated the data.
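
As a quick illustration, here is a minimal Python sketch (the weights, means, and variances below are made up) that evaluates this marginal density by summing the weighted component densities and checks that it integrates to 1:

```python
# Sketch: marginal density of a 3-component Gaussian mixture, obtained by
# summing over the latent component assignment. Parameters are illustrative.
import numpy as np
from scipy.stats import norm

weights = np.array([0.5, 0.3, 0.2])   # pi_k, summing to 1
means = np.array([-2.0, 0.0, 3.0])    # mu_k
sds = np.array([1.0, 0.5, 1.5])       # sigma_k

def mixture_pdf(x):
    """Marginal density p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2)."""
    x = np.atleast_1d(x)[:, None]
    return np.sum(weights * norm.pdf(x, loc=means, scale=sds), axis=1)

# The marginal is itself a proper density: it integrates to ~1 on a wide grid.
grid = np.linspace(-10, 10, 2001)
print(np.trapz(mixture_pdf(grid), grid))  # ~1.0
```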

Compound Distributions

In a compound distribution, a parameter of one distribution is itself drawn from another distribution. For example, in a Gamma-Poisson compound distribution, the rate parameter \(\lambda\) of a Poisson distribution is drawn from a Gamma distribution: $$\lambda\sim \text{Gamma}(\alpha,\beta),\quad X|\lambda\sim\text{Poisson}(\lambda)$$ The marginal distribution of \(X\) is found by integrating out the intermediate parameter \(\lambda\): $$p(X=x)=\int_{0}^{\infty}p(X=x|\lambda)p(\lambda)d\lambda$$ The integration removes the dependence on the unknown/random parameter \(\lambda\), yielding the marginal distribution of X.
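
A short numerical sketch of this integration, using illustrative values of \(\alpha\) and \(\beta\) (with the Gamma parameterized by its rate), compared against the negative binomial form that the Gamma-Poisson marginal is known to reduce to:

```python
# Sketch: compute the Gamma-Poisson marginal pmf by numerically integrating
# out lambda, then compare to the negative binomial closed form.
# alpha, beta are illustrative; beta is the Gamma rate parameter.
import numpy as np
from scipy import integrate
from scipy.stats import poisson, gamma, nbinom

alpha, beta = 3.0, 2.0  # Gamma shape and rate (assumed values)

def marginal_pmf(x):
    """p(X = x) = integral_0^inf Poisson(x | lam) * Gamma(lam | alpha, beta) d lam."""
    integrand = lambda lam: poisson.pmf(x, lam) * gamma.pdf(lam, a=alpha, scale=1.0 / beta)
    val, _ = integrate.quad(integrand, 0, np.inf)
    return val

for x in range(5):
    # Closed form: negative binomial with n = alpha and p = beta / (1 + beta)
    print(x, marginal_pmf(x), nbinom.pmf(x, alpha, beta / (1.0 + beta)))
```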

Bridging to Marginal Likelihood

In both compound and mixture distributions, marginalization plays a crucial role: we integrate (or sum) over hidden or latent variables (such as the parameters in compound distributions or the component assignments in mixture models) to find the overall probability of the observed data. This exact idea underpins marginal likelihood. The marginal likelihood of a model M is the probability of the observed data D given the model, integrating out the unknown parameters \(\theta\): $$p(D|M)=\int_{\theta}p(D|\theta,M)\,p(\theta|M)\,d\theta$$ where \(p(D|\theta,M)\) is the likelihood of the data under a particular parameter value and \(p(\theta|M)\) is the prior over the parameters under model M. This integral is precisely what we encountered in compound distributions: marginalizing over \(\theta\), an unknown parameter which itself follows a prior distribution. In essence, marginal likelihood is an application of the same principle: we integrate out the uncertainty in the parameters to assess how well the model explains the data on average across all parameter configurations.
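
Since this integral rarely has a closed form, a simple way to see it in action is a Monte Carlo sketch: draw parameters from the prior and average the likelihood. Below is a minimal example using a Beta-Binomial model (the hyperparameters and data are made up), chosen because its marginal likelihood also has a closed form to check against:

```python
# Sketch: the marginal likelihood as the average of the likelihood over the prior.
# Model M: X ~ Binomial(n, theta), with a Beta(a, b) prior on theta.
import numpy as np
from scipy.stats import binom
from scipy.special import betaln, comb

a, b = 2.0, 2.0        # prior hyperparameters (assumed values)
n, k = 20, 14          # observed data D: k successes in n trials (made up)

# Monte Carlo estimate: p(D | M) ~= mean over prior draws of p(D | theta, M)
rng = np.random.default_rng(0)
theta_draws = rng.beta(a, b, size=200_000)
mc_estimate = binom.pmf(k, n, theta_draws).mean()

# Closed form for the Beta-Binomial: C(n, k) * B(a + k, b + n - k) / B(a, b)
exact = comb(n, k) * np.exp(betaln(a + k, b + n - k) - betaln(a, b))
print(mc_estimate, exact)  # the two values should agree closely
```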

Regularization

Basic

In frequentist statistics, regularization is a technique used to prevent models from overfitting the data by adding a penalty for complexity. This penalty takes various forms, such as the L2 (ridge) penalty on the squared magnitudes of the coefficients or the L1 (LASSO) penalty on their absolute values. The ridge objective function to be minimized for ordinary least squares is: $$\sum_{i=1}^{n}(y_i-x_i^T\beta)^2+\lambda\sum_{j=1}^{p}\beta_j^2$$ where \(\beta\) is the vector of model coefficients and \(\lambda\) is the regularization parameter controlling the strength of the penalty.
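
For a fixed \(\lambda\), this objective has a closed-form minimizer, \(\hat{\beta}=(X^TX+\lambda I)^{-1}X^Ty\). A minimal sketch on synthetic data (no intercept or standardization handled here) shows the shrinkage effect as \(\lambda\) grows:

```python
# Sketch: minimizing the ridge objective via its closed-form solution
# beta_hat = (X^T X + lambda * I)^{-1} X^T y, on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, -1.0, 0.0, 0.5, 3.0])
y = X @ true_beta + rng.normal(scale=0.5, size=n)

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in [0.0, 1.0, 100.0]:
    # Larger lambda shrinks the coefficient estimates toward zero.
    print(lam, np.round(ridge_fit(X, y, lam), 3))
```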

Bayesian

From a Bayesian point of view, regularization arises naturally through the prior distribution on the parameters. Regularization can be interpreted as placing a prior belief on the parameters, which encodes our preference for simpler models or smaller parameter values.

Ridge

In Ridge regression, the quadratic penalty discourages large coefficients by assigning an increasingly steep penalty as they grow. In Bayesian terms, this is equivalent to placing a zero-mean Gaussian prior on each regression coefficient, with variance \(\sigma^2\): $$\beta_j\mid\sigma^2 \sim \mathcal{N}(0,\sigma^2)$$ The regularization strength \(\lambda\) corresponds to this variance as follows: $$\lambda=\frac{1}{\sigma^2}$$
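
To see where this correspondence comes from, take the negative logarithm of the posterior under this prior; for simplicity, assume the observation noise has unit variance (otherwise \(\lambda\) picks up a factor of the noise variance): $$-\log p(\beta\mid y)=\frac{1}{2}\sum_{i=1}^{n}(y_i-x_i^T\beta)^2+\frac{1}{2\sigma^2}\sum_{j=1}^{p}\beta_j^2+\text{const}$$ Multiplying through by 2 recovers the ridge objective above with \(\lambda=1/\sigma^2\): a smaller prior variance (a stronger belief that the coefficients are near zero) corresponds to heavier regularization.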

LASSO

Similarly, LASSO adds a penalty proportional to the absolute values of the coefficients, which corresponds to a Laplace prior on the parameters: $$p(\beta_j|\lambda)=\frac{1}{2\lambda}\exp\left(-\frac{|\beta_j|}{\lambda}\right)$$
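
Taking the negative logarithm of this prior gives \(|\beta_j|/\lambda+\log(2\lambda)\), so (again assuming unit observation noise variance) the MAP objective adds a penalty proportional to \(\sum_{j}|\beta_j|\), which is exactly the L1 penalty of LASSO, with strength proportional to \(1/\lambda\) in this parameterization. The sharp peak of the Laplace density at zero is what allows the LASSO penalty to set some coefficients exactly to zero.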

Marginal Likelihood Practice Problems