Distribution Textbook (Work in Progress)

by John Della Rosa

Bayesian Inference

Introduction to Bayesian Inference

Recommended Prerequisites

  1. Probability
  2. MLE
  3. Marginal Likelihood

Basics of Bayesian Inference

Bayes' theorem relates conditional probabilities when you switch what is being conditioned on. This ends up being useful in the context of trying to fit parameters.

Restating Bayes' Theorem

In the context of what we are studying, Bayes' theorem can be stated as: $$P(\theta|x)=\frac{P(x|\theta)P(\theta)}{P(x)}$$ Breaking down each component: \(P(\theta|x)\) is the posterior, our updated belief about the parameter after seeing the data; \(P(x|\theta)\) is the likelihood of the data given the parameter; \(P(\theta)\) is the prior, our belief about the parameter before seeing the data; and \(P(x)\) is the marginal likelihood, which acts as a normalizing constant.

Cromwell's Rule

Cromwell's Rule is a principle that advises against assigning a prior probability of exactly 0 or 1 unless the event is logically impossible or logically certain. A prior of 0 or 1 can never be revised by evidence: multiplying it by any likelihood leaves the posterior at 0 or 1, so updating beliefs becomes impossible.
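
To make this concrete, here is a minimal sketch of a discrete Bayesian update; the hypotheses and numbers are purely hypothetical. Once a hypothesis is assigned prior probability 0, no amount of supporting data can move its posterior away from 0.

```python
# Hypothetical demonstration of Cromwell's Rule: a prior of exactly 0
# can never be revised upward, no matter how strongly the data favor it.

def update(prior, likelihood):
    """Bayes update over a discrete set of hypotheses."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

# Hypotheses: the coin is fair (P(heads) = 0.5) or heavily biased (P(heads) = 0.9).
prior = {"fair": 0.0, "biased": 1.0}  # "fair" is ruled out a priori

# Observed data: 5 heads and 5 tails in 10 flips -- far more plausible under "fair".
likelihood = {"fair": 0.5**5 * 0.5**5, "biased": 0.9**5 * 0.1**5}

print(update(prior, likelihood))  # {'fair': 0.0, 'biased': 1.0}
```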

Example

The classic example is flipping a coin whose weighting is unknown. The setup is this: we have a coin with an unknown probability of heads, \(\theta\).

Prior Distribution

What are our prior beliefs regarding \(\theta\)? Well, we can be reasonably sure that it will be in the interval [0,1]. In this example, we will assume that all valid values are equally weighted (uniform). $$P(\theta)=\begin{cases} 1 & \theta\in[0,1]\\ 0 &\text{otherwise}\end{cases}$$ This is an example of a non-informative prior, reflecting that we have no strong prior belief about \(\theta\).

Likelihood

Now we incorporate the observed data. We flip the coin n times and observe x heads. The likelihood \(P(x|\theta)\) is the probability of x heads, given that the probability of heads is \(\theta\): $$P(x|\theta)={n \choose x}\theta^x(1-\theta)^{n-x}$$

Posterior Distribution

$$P(\theta|x)=\frac{P(x|\theta)P(\theta)}{P(x)}$$ Because \(P(x)\) is just a normalizing constant and we selected a uniform prior, the unnormalized posterior is proportional to $$P(\theta|x)\propto\theta^x(1-\theta)^{n-x},$$ which is the kernel of a beta distribution, $$f(\theta;\alpha,\beta)=\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}.$$ Matching exponents, we can see that $$\alpha=x+1$$ $$\beta=n-x+1$$ What is the "average" estimate of \(\theta\) now? $$\mathbb{E}[\theta|X]=\frac{x+1}{n+2}$$ Note that this is not the fraction of heads that we observed; the maximum likelihood estimate is what corresponds to that fraction, \(\hat{\theta}=\frac{x}{n}\).
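
As a quick numerical check of the result above, the sketch below (with made-up counts) builds the Beta posterior implied by a uniform prior and compares the posterior mean with the maximum likelihood estimate.

```python
# Sketch of the uniform-prior coin-flip example: with x heads in n flips the
# posterior is Beta(x + 1, n - x + 1), so the posterior mean (x + 1)/(n + 2)
# differs from the MLE x/n. The counts below are made up for illustration.
from scipy.stats import beta

n, x = 10, 7  # 7 heads in 10 flips

posterior = beta(x + 1, n - x + 1)  # Beta(8, 4)
posterior_mean = posterior.mean()   # (x + 1) / (n + 2) = 8/12 ≈ 0.667
mle = x / n                         # 7/10 = 0.7

print(f"posterior mean: {posterior_mean:.3f}, MLE: {mle:.3f}")
```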

Posterior Distribution

While the posterior distribution provides a complete description of our updated belief about \(\theta\), in some cases we need a single point estimate for the parameter, and there are a few reasonable choices for how to obtain one. In Bayesian inference we also often want to make predictions about future or unseen data; the posterior predictive distribution, covered later in this section, does this by averaging over the parameter uncertainty reflected in the posterior.

Maximum A Posteriori (MAP) Estimate

One is the Maximum A Posteriori (MAP) estimate, which selects the value of \(\theta\) that maximizes the posterior distribution.

Definition

The MAP estimate is defined as: $$\hat{\theta}_{\text{MAP}}=\arg \max_{\theta}P(\theta|x)$$ By Bayes' theorem, and dropping the normalizing constant \(P(x)\), we can rewrite the posterior as: $$P(\theta|x)\propto P(x|\theta)P(\theta)$$ Thus, the MAP estimate can also be seen as maximizing the product of the likelihood and the prior. The MAP estimate balances fitting the data, through the likelihood, against our prior beliefs, through \(P(\theta)\).

MLE Revisited

The MLE is actually a special case of MAP estimation in which the prior distribution \(P(\theta)\) is uniform. $$\hat{\theta}_{\text{MLE}}=\arg \max_\theta P(x|\theta)$$ Because multiplying a function by a positive constant does not change where its maximum occurs, plugging a constant prior into the MAP objective recovers the MLE above.
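
The sketch below, with illustrative counts and priors, finds the MAP estimate by numerically maximizing the log posterior. With a uniform Beta(1, 1) prior it recovers the MLE \(x/n\); an informative Beta(2, 2) prior pulls the estimate toward 0.5.

```python
# Illustrative MAP estimation for the coin-flip model: maximize
# log P(x | theta) + log P(theta) numerically under two different Beta priors.
from scipy.optimize import minimize_scalar
from scipy.stats import beta, binom

n, x = 10, 7  # observed: 7 heads in 10 flips

def map_estimate(a, b):
    """MAP estimate of theta under a Beta(a, b) prior, found numerically."""
    neg_log_post = lambda t: -(binom.logpmf(x, n, t) + beta.logpdf(t, a, b))
    return minimize_scalar(neg_log_post, bounds=(1e-6, 1 - 1e-6), method="bounded").x

print(map_estimate(1, 1))  # uniform prior: ≈ 0.700, identical to the MLE x/n
print(map_estimate(2, 2))  # Beta(2, 2) prior: ≈ 0.667, pulled toward 0.5
```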

Summary of Point Estimates

  1. MAP estimate: maximizes the posterior, \( \hat{\theta}_{\text{MAP}} = \arg \max_{\theta} P(\theta | x) \). Best for computational simplicity and regularization; limitations: depends on the choice of prior and does not take the full distribution into account.
  2. Posterior mean: expected value of the posterior, \( \hat{\theta}_{\text{mean}} = E[\theta | x] = \int \theta P(\theta | x) \, d\theta \). Best for minimizing squared loss and capturing average effects; limitations: sensitive to skewness.
  3. Posterior median: value that splits the posterior in half, \( P(\theta \leq \hat{\theta}_{\text{median}} | x) = 0.5 \). Best for skewed distributions and minimizing absolute loss; limitations: harder to compute.
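
For the coin-flip example, all three point estimates can be read directly off the Beta posterior. The sketch below uses the illustrative Beta(8, 4) posterior that results from 7 heads in 10 flips under a uniform prior.

```python
# Comparing the three point estimates for an illustrative Beta(8, 4) posterior.
from scipy.stats import beta

a, b = 8, 4                      # posterior Beta parameters
posterior = beta(a, b)

map_est = (a - 1) / (a + b - 2)  # posterior mode (valid for a, b > 1)
mean_est = posterior.mean()      # a / (a + b)
median_est = posterior.ppf(0.5)  # value with half the posterior mass below it

print(f"MAP: {map_est:.3f}, mean: {mean_est:.3f}, median: {median_est:.3f}")
```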

Posterior Predictive Distribution

Formula

The posterior predictive distribution is the distribution of a new, unobserved data point \(x_{\text{new}}\), given the observed data \(x\). It is computed by averaging the likelihood of \(x_{\text{new}}\), weighted by the posterior distribution of the parameter(s) \(\theta\): $$P(x_{\text{new}}|x)=\int_{-\infty}^{\infty}P(x_{\text{new}}|\theta)P(\theta|x)d\theta$$
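
For the coin-flip example, this integral can be approximated by Monte Carlo: draw \(\theta\) from the posterior and average the likelihood of the new observation over the draws. The sketch below, reusing the illustrative Beta(8, 4) posterior, estimates the predictive probability that the next flip is heads.

```python
# Monte Carlo sketch of the posterior predictive for one new coin flip,
# using the illustrative Beta(8, 4) posterior from earlier.
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
theta_draws = beta(8, 4).rvs(size=100_000, random_state=rng)

# P(x_new = heads | theta) is simply theta, so averaging theta over the
# posterior draws approximates the predictive probability of heads.
predictive_heads = theta_draws.mean()

print(f"Monte Carlo: {predictive_heads:.3f}, exact: {8 / (8 + 4):.3f}")  # ≈ 0.667
```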

Conjugate Priors

In Bayesian inference, a conjugate prior is a prior distribution that, when combined with a specific likelihood function, results in a posterior distribution that belongs to the same family of distributions as the prior. This conjugacy simplifies the calculation of the posterior distribution and makes Bayesian analysis more tractable, especially when working with common probability distributions.

Definition

Mathematically, a prior \(P(\theta)\) is said to be conjugate to the likelihood \(L(x|\theta)\) if the posterior distribution, \(P(\theta|x)\) is in the same family as the prior. Only the parameters of the distribution are updated based on the observed data.

Beta Prior for Binomial

See also Beta-Binomial simulation

Instead of the uniform prior used in the earlier coin-flip example, we will now use a beta prior:

\[ P(\theta) = \frac{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{B(\alpha, \beta)} \]

Binomial Likelihood

The likelihood for binomial data (observing \( x \) successes in \( n \) trials) is:

\[ P(x | \theta) = \binom{n}{x} \theta^x (1 - \theta)^{n - x} \]

Posterior Distribution

The posterior distribution, given a Beta prior and binomial likelihood, is:

\[ P(\theta | x) = \text{Beta}(x + \alpha, n - x + \beta) \]

Numerical Example

If we observe 7 heads in 10 flips and use a \( \text{Beta}(2, 2) \) prior, the posterior is:

\[ P(\theta | x = 7) = \text{Beta}(9, 5) \]
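
The conjugate update itself is just an addition of counts to the hyperparameters; the sketch below reproduces this numerical example.

```python
# Beta-binomial conjugate update: Beta(2, 2) prior plus 7 heads in 10 flips
# gives a Beta(9, 5) posterior.
from scipy.stats import beta

alpha_prior, beta_prior = 2, 2
n, x = 10, 7

alpha_post = alpha_prior + x      # 2 + 7 = 9
beta_post = beta_prior + (n - x)  # 2 + 3 = 5

posterior = beta(alpha_post, beta_post)
print(f"posterior: Beta({alpha_post}, {beta_post}), mean = {posterior.mean():.3f}")  # 9/14 ≈ 0.643
```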

Normal-Normal Conjugacy

Suppose the prior distribution for the mean \( \mu \) of a normal distribution with known variance is:

\[ P(\mu) = N(\mu_0, \tau_0^2) \]

Normal Likelihood

The likelihood for normally distributed data \( x \) with known variance \( \sigma^2 \) is:

\[ P(x | \mu) = N(\mu, \sigma^2) \]

Posterior Distribution

The posterior distribution, given \( n \) observations and a normal prior, is:

\[ P(\mu | x) = N\left( \frac{\tau_0^2 \sum_{i=1}^{n} x_i + \sigma^2 \mu_0}{n \tau_0^2 + \sigma^2}, \frac{\sigma^2 \tau_0^2}{n \tau_0^2 + \sigma^2} \right) \]

Numerical Example

Given a prior \( N(170, 5^2) \) for the average height of a population, a known observation variance \( \sigma^2 \), and 5 observations \( 168, 172, 169, 170, 171 \), the posterior distribution follows from the formula above, updating our estimate of the population mean.
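
As a sketch of that computation: the known observation variance \( \sigma^2 \) is not specified in the example, so the value \( \sigma = 3 \) below is assumed purely for illustration.

```python
# Normal-Normal conjugate update for the height example; sigma = 3 is an
# assumed observation standard deviation, chosen only for illustration.
import numpy as np

mu0, tau0_sq = 170.0, 5.0**2  # prior N(170, 5^2)
sigma_sq = 3.0**2             # assumed known observation variance
heights = np.array([168, 172, 169, 170, 171], dtype=float)
n = len(heights)

# Posterior parameters from the formula above (denominator n*tau0^2 + sigma^2).
post_var = sigma_sq * tau0_sq / (n * tau0_sq + sigma_sq)
post_mean = (tau0_sq * heights.sum() + sigma_sq * mu0) / (n * tau0_sq + sigma_sq)

print(f"posterior: N({post_mean:.2f}, {post_var:.2f})")  # N(170.00, 1.68)
```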

Bayesian Inference Practice Problems