Bayesian Inference
Introduction to Bayesian Inference
Recommended Prerequisites
- Probability
- MLE
- Marginal Likelihood
Basics of Bayesian Inference
Bayes' theorem relates conditional probabilities when you switch what is being conditioned on. This turns out to be useful when trying to fit parameters to observed data.
Restating Bayes' Theorem
In the context of what we are studying, Bayes' theorem can be stated as:
$$P(\theta|x)=\frac{P(x|\theta)P(\theta)}{P(x)}$$
Breaking down each component
- \(\theta\) is the parameter (or vector of parameters) that we are interested in, as we've seen before
- x is the observed data
- \(P(x|\theta)\) is the likelihood, which is the same as we've been seeing before. It gives a relative probability of seeing the data for a distribution with given parameter values
- \(P(\theta|x)\) is the posterior distribution, the updated belief about the parameter \(\theta\) after seeing the data x
- \(P(\theta)\) is the prior distribution, which is the initial belief about \(\theta\), before seeing the data.
- \(P(x)\) is the marginal likelihood, which we covered before. It acts as a normalization constant and is computed as
$$P(x)=\int_{-\infty}^{\infty}P(x|\theta)P(\theta)d\theta$$
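As a sanity check on this integral, here is a small numeric sketch. The binomial likelihood, the uniform prior, and the values n = 10, x = 7 are illustrative assumptions (the same setup as the coin example below):

```python
from scipy.integrate import quad
from scipy.stats import binom

# Marginal likelihood P(x) = integral of P(x|theta) * P(theta) dtheta.
# Assumed model: x heads in n coin flips, uniform prior P(theta) = 1 on [0, 1].
n, x = 10, 7

def integrand(theta):
    return binom.pmf(x, n, theta) * 1.0  # likelihood times uniform prior

marginal, _ = quad(integrand, 0.0, 1.0)

# Under a uniform prior the marginal likelihood works out to 1/(n+1)
# for every x, here approximately 1/11.
print(marginal)
```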
Cromwell's Rule
Cromwell's Rule is a principle that advises against assigning a prior probability of 0 or 1 to an event unless it is logically impossible or logically certain. A prior of exactly 0 or 1 can never be revised: no matter what data arrives, the posterior stays at 0 or 1.
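A tiny discrete sketch of this effect, with hypothetical hypotheses and numbers:

```python
# Two hypotheses about a coin; one is (unwisely) given prior probability zero.
priors = {"fair": 1.0, "two_headed": 0.0}

# Likelihood of observing 10 heads in a row under each hypothesis.
likelihoods = {"fair": 0.5 ** 10, "two_headed": 1.0}

# Bayes' theorem: posterior proportional to prior times likelihood.
unnorm = {h: priors[h] * likelihoods[h] for h in priors}
total = sum(unnorm.values())
posterior = {h: unnorm[h] / total for h in unnorm}

# Ten straight heads is strong evidence for "two_headed", yet its
# posterior is still exactly zero: a zero prior cannot be updated.
print(posterior)  # {'fair': 1.0, 'two_headed': 0.0}
```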
Example
The classic example is flipping a coin for which you don't know the weight.
The setup is this: we have a coin with an unknown probability of heads, \(\theta\).
Prior Distribution
What are our prior beliefs regarding \(\theta\)? Well, we can be reasonably sure that it will be in the interval [0,1]. In this example, we will assume that all valid values are equally weighted (uniform).
$$P(\theta)=\begin{cases} 1 & \theta\in[0,1]\\
0 &\text{otherwise}\end{cases}$$
This is an example of a non-informative prior, reflecting that we have no strong prior belief about \(\theta\).
Likelihood
Now we incorporate the data observed. We flip said coin n times and observe x heads. The likelihood \(P(x|\theta)\) is the probability of x heads, given that the probability of heads is \(\theta\).
$$P(x|\theta)={n \choose x}\theta^x(1-\theta)^{n-x}$$
Posterior Distribution
$$P(\theta|x)=\frac{P(x|\theta)P(\theta)}{P(x)}$$
Since \(P(x)\) is just a normalizing constant and we chose a uniform prior, the posterior is proportional to the likelihood:
$$P(\theta|x)\propto\theta^x(1-\theta)^{n-x}$$
This is the kernel of a beta distribution, whose density is
$$f(x;\alpha,\beta)=\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}$$
Through matching exponents, we can see that
$$\alpha=x+1$$
$$\beta=n-x+1$$
- If you observe a lot of heads, the posterior will shift towards higher values of \(\theta\).
- If you observe a lot of tails, the posterior will shift towards lower values of \(\theta\).
- The more data you collect, the sharper the posterior distribution becomes, reflecting increasing confidence in the estimated value of \(\theta\).
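The last point can be checked numerically with the Beta posterior derived above, holding the observed fraction of heads at an assumed 70% while n grows:

```python
from scipy.stats import beta

# With the uniform prior, the posterior after x heads in n flips is
# Beta(x + 1, n - x + 1).  Fix the heads fraction at 70% and grow n:
# the posterior standard deviation shrinks.
stds = []
for n, x in [(10, 7), (100, 70), (1000, 700)]:
    posterior = beta(x + 1, n - x + 1)
    stds.append(posterior.std())
    print(n, round(posterior.mean(), 3), round(posterior.std(), 4))
```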
What is the "average" estimate of \(\theta\) now?
$$\mathbb{E}[\theta|X]=\frac{x+1}{n+2}$$
Note that this is not the fraction of heads we observed. The maximum likelihood estimate would correspond to that fraction, \(\hat{\theta}=\frac{x}{n}\).
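A short sketch comparing the two estimates; the data (7 heads in 10 flips) is an arbitrary choice:

```python
from scipy.stats import beta

# Uniform prior + x heads in n flips  ->  posterior Beta(x+1, n-x+1).
n, x = 10, 7
posterior = beta(x + 1, n - x + 1)

post_mean = posterior.mean()  # (x + 1) / (n + 2)
mle = x / n                   # maximum likelihood estimate

print(post_mean)  # 8/12 ≈ 0.667
print(mle)        # 0.7
```

The posterior mean is pulled slightly toward 1/2 relative to the MLE, reflecting the uniform prior.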
Point Estimates
While the posterior distribution provides a complete description of our updated belief about \(\theta\), in some cases we need a single point estimate for the parameter. There are a few reasonable choices for how to obtain one.
Maximum A Posteriori (MAP) Estimate
One is the Maximum A Posteriori (MAP) estimate, which selects the value of \(\theta\) that maximizes the posterior distribution.
Definition
The MAP estimate is defined as:
$$\hat{\theta}_{\text{MAP}}=\arg \max_{\theta}P(\theta|x)$$
By Bayes' theorem, we can rewrite as:
$$P(\theta|x)\propto P(x|\theta)P(\theta)$$
Thus, the MAP estimate can also be seen as maximizing the product of the likelihood with the prior. The MAP estimate balances fitting the data through likelihood and our prior beliefs through \(P(\theta)\).
MLE Revisited
The MLE is actually a special case of MAP estimation when the prior distribution \(P(\theta)\) is uniform.
$$\hat{\theta}_{\text{MLE}}=\arg \max_\theta P(x|\theta)$$
Because multiplying by a positive constant does not change where the maximum occurs, substituting a constant prior into the MAP objective recovers the MLE above.
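To make this concrete, here is a grid-search sketch; the data (7 heads in 10 flips) and the informative Beta(2, 2) prior are illustrative assumptions:

```python
import numpy as np
from scipy.stats import beta

n, x = 10, 7
theta = np.linspace(0.001, 0.999, 9981)  # grid over (0, 1)

log_lik = x * np.log(theta) + (n - x) * np.log(1 - theta)

# Uniform prior: the log-posterior is the log-likelihood plus a
# constant, so the MAP estimate coincides with the MLE.
map_uniform = theta[np.argmax(log_lik)]

# An informative Beta(2, 2) prior pulls the estimate toward 0.5.
log_post = log_lik + beta(2, 2).logpdf(theta)
map_beta = theta[np.argmax(log_post)]

print(map_uniform)  # ≈ 0.7, the MLE x/n
print(map_beta)     # ≈ 0.667, the mode of the Beta(9, 5) posterior
```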
Summary of Point Estimates
| Method | Estimate | Formula | Best for | Limitations |
| --- | --- | --- | --- | --- |
| MAP Estimate | Maximizes the posterior | \(\hat{\theta}_{\text{MAP}} = \arg \max_{\theta} P(\theta \mid x)\) | Computational simplicity, regularization | Depends on the prior; ignores the rest of the distribution |
| Posterior Mean | Expected value of the posterior | \(\hat{\theta}_{\text{mean}} = E[\theta \mid x] = \int \theta \, P(\theta \mid x) \, d\theta\) | Minimizing squared loss, average effects | Sensitive to skewness |
| Posterior Median | Value that splits the posterior in half | \(P(\theta \leq \hat{\theta}_{\text{median}} \mid x) = 0.5\) | Skewed distributions, minimizing absolute loss | Harder to compute |
Posterior Predictive Distribution
Formula
The posterior predictive distribution is the distribution of a new, unobserved data point \(x_{\text{new}}\), given the observed data \(x\). It is computed by averaging the likelihood of \(x_{\text{new}}\), weighted by the posterior distribution of the parameter(s) \(\theta\):
$$P(x_{\text{new}}|x)=\int_{-\infty}^{\infty}P(x_{\text{new}}|\theta)P(\theta|x)d\theta$$
- \(P(x_{\text{new}}|\theta)\): The likelihood function that describes how new data \(x_{\text{new}}\) is generated, given a particular value of parameter \(\theta\).
- \(P(\theta|x)\): The posterior distribution of parameter \(\theta\), which reflects our updated belief about \(\theta\) after observing the data x.
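For the coin example, the posterior predictive probability that the next flip is heads reduces to the posterior mean of \(\theta\). A numeric sketch, again assuming a uniform prior and 7 heads in 10 flips:

```python
from scipy.integrate import quad
from scipy.stats import beta

# Posterior from the coin example: uniform prior, x heads in n flips.
n, x = 10, 7
posterior = beta(x + 1, n - x + 1)

# P(next flip = heads | x) = ∫ P(heads | theta) P(theta | x) dtheta
#                          = ∫ theta * P(theta | x) dtheta
pred_heads, _ = quad(lambda t: t * posterior.pdf(t), 0.0, 1.0)

print(pred_heads)  # (x + 1) / (n + 2) = 8/12 ≈ 0.667
```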
Conjugate Priors
In Bayesian inference, a conjugate prior is a prior distribution that, when combined with a specific likelihood function, results in a posterior distribution that belongs to the same family of distributions as the prior. This conjugacy simplifies the calculation of the posterior distribution and makes Bayesian analysis more tractable, especially when working with common probability distributions.
Definition
Mathematically, a prior \(P(\theta)\) is said to be conjugate to the likelihood \(P(x|\theta)\) if the posterior distribution \(P(\theta|x)\) is in the same family as the prior; only the parameters of the distribution are updated based on the observed data.
Beta Prior for Binomial
See also Beta-Binomial simulation
Instead of the uniform prior used in the earlier coin-flip example, we will use a beta prior:
\[
P(\theta) = \frac{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}
\]
Binomial Likelihood
The likelihood for binomial data (observing \( x \) successes in \( n \) trials) is:
\[
P(x | \theta) = \binom{n}{x} \theta^x (1 - \theta)^{n - x}
\]
Posterior Distribution
The posterior distribution, given a Beta prior and binomial likelihood, is:
\[
P(\theta | x) = \text{Beta}(x + \alpha, n - x + \beta)
\]
Numerical Example
If we observe 7 heads in 10 flips and use a \( \text{Beta}(2, 2) \) prior, the posterior is:
\[
P(\theta | x = 7) = \text{Beta}(9, 5)
\]
Normal-Normal Conjugacy
If the prior distribution for the mean \( \mu \) of a normal distribution is:
\[
P(\mu) = N(\mu_0, \tau_0^2)
\]
Normal Likelihood
The likelihood for normally distributed data \( x \) with known variance \( \sigma^2 \) is:
\[
P(x | \mu) = N(\mu, \sigma^2)
\]
Posterior Distribution
The posterior distribution, given \( n \) observations and a normal prior, is:
\[
P(\mu | x) = N\left( \frac{\tau_0^2 \sum_{i=1}^{n} x_i + \sigma^2 \mu_0}{n \tau_0^2 + \sigma^2}, \frac{\sigma^2 \tau_0^2}{n \tau_0^2 + \sigma^2} \right)
\]
Numerical Example
Given a prior \( N(170, 5^2) \) for the average height of a population and 5 observations \( 168, 172, 169, 170, 171 \), the posterior distribution can be computed to update our estimate of the population mean.
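The text does not fix the observation variance \(\sigma^2\), so the sketch below assumes \(\sigma = 3\) for illustration:

```python
# Normal-Normal update for the height example.
mu_0, tau_0_sq = 170.0, 5.0 ** 2   # prior N(170, 5^2)
sigma_sq = 3.0 ** 2                # assumed known data variance (sigma = 3)
data = [168, 172, 169, 170, 171]
n, s = len(data), sum(data)

post_mean = (tau_0_sq * s + sigma_sq * mu_0) / (n * tau_0_sq + sigma_sq)
post_var = (sigma_sq * tau_0_sq) / (n * tau_0_sq + sigma_sq)

# The sample mean happens to equal the prior mean (170), so the
# posterior mean stays at 170 while the variance shrinks from 25.
print(post_mean)  # 170.0
print(post_var)   # 225/134 ≈ 1.68
```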