Distribution Textbook (Work in Progress)

by John Della Rosa

Introduction to Marginal Likelihood

Recommended Prerequisites

  1. Probability

Introduction

In the previous chapters, we explored compound distributions. In a compound distribution, one distribution controls the parameters of another, leading to a multi-layered structure. As we extend this understanding, marginal likelihood provides a framework for evaluating how well these models (and others) explain observed data, accounting for all possible values of the parameters. By integrating over the unknown parameters, marginal likelihood gives us a comprehensive measure of the model's fit to the data, offering insights that can guide model selection and decision-making.

Recap on Marginalization in the Context of Previous Chapters

Mixture Distributions

In a mixture distribution, the observed data is generated from one of several component distributions, with the component chosen according to a set of probabilities (weights). For example, in a Gaussian Mixture Model (GMM) with K components: $$p(x)=\sum_{k=1}^{K}\pi_{k}\,\mathcal{N}(x\mid\mu_k,\sigma_k^2)$$ Here, each component \(\mathcal{N}(x\mid\mu_k,\sigma_k^2)\) is a Gaussian density, and the \(\pi_k\) are the mixture weights (summing to 1). The marginal distribution of X is the weighted sum of the component densities, obtained by summing over the latent variable indicating which component generated the data.
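
As a quick illustration, here is a minimal Python sketch (the weights, means, and variances below are made up) that evaluates this marginal density by summing the weighted component densities and checks that it integrates to 1:

```python
# Sketch: marginal density of a 3-component Gaussian mixture, obtained by
# summing over the latent component assignment. Parameters are illustrative.
import numpy as np
from scipy.stats import norm

weights = np.array([0.5, 0.3, 0.2])   # pi_k, summing to 1
means = np.array([-2.0, 0.0, 3.0])    # mu_k
sds = np.array([1.0, 0.5, 1.5])       # sigma_k

def mixture_pdf(x):
    """Marginal density p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2)."""
    x = np.atleast_1d(x)[:, None]
    return np.sum(weights * norm.pdf(x, loc=means, scale=sds), axis=1)

# The marginal is itself a proper density: it integrates to ~1 on a wide grid.
grid = np.linspace(-10, 10, 2001)
print(np.trapz(mixture_pdf(grid), grid))  # ~1.0
```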

Compound Distributions

In a compound distribution, a parameter of one distribution is itself drawn from another distribution. For example, in a Gamma-Poisson compound distribution, the rate parameter \(\lambda\) of a Poisson distribution is drawn from a Gamma distribution: $$\lambda\sim \text{Gamma}(\alpha,\beta),\quad X|\lambda\sim\text{Poisson}(\lambda)$$ The marginal distribution of \(X\) is found by integrating out the intermediate parameter \(\lambda\): $$p(X=x)=\int_{0}^{\infty}p(X=x|\lambda)p(\lambda)d\lambda$$ The integration removes the dependence on the unknown/random parameter \(\lambda\), yielding the marginal distribution of X.
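
A short numerical sketch of this integration, using illustrative values of \(\alpha\) and \(\beta\) (with the Gamma parameterized by its rate), compared against the negative binomial form that the Gamma-Poisson marginal is known to reduce to:

```python
# Sketch: compute the Gamma-Poisson marginal pmf by numerically integrating
# out lambda, then compare to the negative binomial closed form.
# alpha, beta are illustrative; beta is the Gamma rate parameter.
import numpy as np
from scipy import integrate
from scipy.stats import poisson, gamma, nbinom

alpha, beta = 3.0, 2.0  # Gamma shape and rate (assumed values)

def marginal_pmf(x):
    """p(X = x) = integral_0^inf Poisson(x | lam) * Gamma(lam | alpha, beta) d lam."""
    integrand = lambda lam: poisson.pmf(x, lam) * gamma.pdf(lam, a=alpha, scale=1.0 / beta)
    val, _ = integrate.quad(integrand, 0, np.inf)
    return val

for x in range(5):
    # Closed form: negative binomial with n = alpha and p = beta / (1 + beta)
    print(x, marginal_pmf(x), nbinom.pmf(x, alpha, beta / (1.0 + beta)))
```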

Bridging to Marginal Likelihood

In both compound and mixture distributions, marginalization plays a crucial role: we integrate (or sum) over hidden or latent variables (such as the parameters in compound distributions or the component assignments in mixture models) to find the overall probability of the observed data. This exact idea underpins marginal likelihood. The marginal likelihood of a model M is the probability of the observed data D given the model, integrating out the unknown parameters \(\theta\): $$p(D|M)=\int_{\theta}p(D|\theta,M)\,p(\theta|M)\,d\theta$$ where \(p(D|\theta,M)\) is the likelihood of the data under a particular parameter value and \(p(\theta|M)\) is the prior over the parameters under model M. This integral is precisely what we encountered in compound distributions: marginalizing over \(\theta\), an unknown parameter which itself follows a prior distribution. In essence, marginal likelihood is an application of the same principle: we integrate out the uncertainty in the parameters to assess how well the model explains the data on average across all parameter configurations.
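
Since this integral rarely has a closed form, a simple way to see it in action is a Monte Carlo sketch: draw parameters from the prior and average the likelihood. Below is a minimal example using a Beta-Binomial model (the hyperparameters and data are made up), chosen because its marginal likelihood also has a closed form to check against:

```python
# Sketch: the marginal likelihood as the average of the likelihood over the prior.
# Model M: X ~ Binomial(n, theta), with a Beta(a, b) prior on theta.
import numpy as np
from scipy.stats import binom
from scipy.special import betaln, comb

a, b = 2.0, 2.0        # prior hyperparameters (assumed values)
n, k = 20, 14          # observed data D: k successes in n trials (made up)

# Monte Carlo estimate: p(D | M) ~= mean over prior draws of p(D | theta, M)
rng = np.random.default_rng(0)
theta_draws = rng.beta(a, b, size=200_000)
mc_estimate = binom.pmf(k, n, theta_draws).mean()

# Closed form for the Beta-Binomial: C(n, k) * B(a + k, b + n - k) / B(a, b)
exact = comb(n, k) * np.exp(betaln(a + k, b + n - k) - betaln(a, b))
print(mc_estimate, exact)  # the two values should agree closely
```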

Regularization

Basic

In frequentist statistics, regularization is a technique used to prevent models from overfitting the data by adding a penalty for complexity. This penalty takes various forms, such as the L2 (ridge) penalty on the squared magnitudes of the coefficients or the L1 (LASSO) penalty on their absolute values. The ridge objective function to be minimized for ordinary least squares is: $$\sum_{i=1}^{n}(y_i-x_i^T\beta)^2+\lambda\sum_{j=1}^{p}\beta_j^2$$ where \(\beta\) is the vector of model coefficients and \(\lambda\) is the regularization parameter controlling the strength of the penalty.
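
For a fixed \(\lambda\), this objective has a closed-form minimizer, \(\hat{\beta}=(X^TX+\lambda I)^{-1}X^Ty\). A minimal sketch on synthetic data (no intercept or standardization handled here) shows the shrinkage effect as \(\lambda\) grows:

```python
# Sketch: minimizing the ridge objective via its closed-form solution
# beta_hat = (X^T X + lambda * I)^{-1} X^T y, on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, -1.0, 0.0, 0.5, 3.0])
y = X @ true_beta + rng.normal(scale=0.5, size=n)

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in [0.0, 1.0, 100.0]:
    # Larger lambda shrinks the coefficient estimates toward zero.
    print(lam, np.round(ridge_fit(X, y, lam), 3))
```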

Bayesian

From a Bayesian point of view, regularization arises naturally through the prior distribution on the parameters. Regularization can be interpreted as placing a prior belief on the parameters, which encodes our preference for simpler models or smaller parameter values.

Ridge

In Ridge regression, the quadratic penalty discourages large coefficients by assigning an increasingly steep penalty as they grow. In Bayesian terms, this is equivalent to placing a zero-mean Gaussian prior on each regression coefficient, with variance \(\sigma^2\): $$\beta_j\mid\sigma^2 \sim \mathcal{N}(0,\sigma^2)$$ The regularization strength \(\lambda\) corresponds to this variance as follows: $$\lambda=\frac{1}{\sigma^2}$$
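
To see where this correspondence comes from, take the negative logarithm of the posterior under this prior; for simplicity, assume the observation noise has unit variance (otherwise \(\lambda\) picks up a factor of the noise variance): $$-\log p(\beta\mid y)=\frac{1}{2}\sum_{i=1}^{n}(y_i-x_i^T\beta)^2+\frac{1}{2\sigma^2}\sum_{j=1}^{p}\beta_j^2+\text{const}$$ Multiplying through by 2 recovers the ridge objective above with \(\lambda=1/\sigma^2\): a smaller prior variance (a stronger belief that the coefficients are near zero) corresponds to heavier regularization.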

LASSO

Similarly, LASSO adds a penalty proportional to the absolute values of the coefficients, which corresponds to a Laplace prior on the parameters: $$p(\beta_j|\lambda)=\frac{1}{2\lambda}\exp\left(-\frac{|\beta_j|}{\lambda}\right)$$
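
Taking the negative logarithm of this prior gives \(|\beta_j|/\lambda+\log(2\lambda)\), so (again assuming unit observation noise variance) the MAP objective adds a penalty proportional to \(\sum_{j}|\beta_j|\), which is exactly the L1 penalty of LASSO, with strength proportional to \(1/\lambda\) in this parameterization. The sharp peak of the Laplace density at zero is what allows the LASSO penalty to set some coefficients exactly to zero.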

Marginal Likelihood Practice Problems