Bayesian Inference
Introduction to Bayesian Inference
Recommended Prerequisites
- Probability
- MLE
- Marginal Likelihood
Basics of Bayesian Inference
Bayes' theorem relates conditional probabilities when you switch what is being conditioned on. This turns out to be useful when trying to fit parameters to observed data.
Restating Bayes' Theorem
In the context of what we are studying, Bayes' theorem can be stated as:
$$P(\theta|x)=\frac{P(x|\theta)P(\theta)}{P(x)}$$
Breaking down each component
- \(\theta\) is the parameter (or vector of parameters) that we are interested in, as we've seen before
- x is the observed data
- \(P(x|\theta)\) is the likelihood, which is the same as we've been seeing before. It gives a relative probability of seeing the data for a distribution with given parameter values
- \(P(\theta|x)\) is the posterior distribution, the updated belief about the parameter \(\theta\) after seeing the data x
- \(P(\theta)\) is the prior distribution, which is the initial belief about \(\theta\), before seeing the data.
- \(P(x)\) is the marginal likelihood, which we covered before. It acts as a normalization constant and is computed as
$$P(x)=\int_{-\infty}^{\infty}P(x|\theta)P(\theta)d\theta$$
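As a sanity check on this integral, here is a small numeric sketch. The binomial likelihood, the uniform prior, and the values n = 10, x = 7 are illustrative assumptions (the same setup as the coin example below):

```python
from scipy.integrate import quad
from scipy.stats import binom

# Marginal likelihood P(x) = integral of P(x|theta) * P(theta) dtheta.
# Assumed model: x heads in n coin flips, uniform prior P(theta) = 1 on [0, 1].
n, x = 10, 7

def integrand(theta):
    return binom.pmf(x, n, theta) * 1.0  # likelihood times uniform prior

marginal, _ = quad(integrand, 0.0, 1.0)

# Under a uniform prior the marginal likelihood works out to 1/(n+1)
# for every x, here approximately 1/11.
print(marginal)
```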
Cromwell's Rule
Cromwell's Rule is a principle that advises against assigning a prior probability of 0 or 1 to an event unless it is logically impossible or logically certain. A prior of exactly 0 or 1 can never be revised: no matter what data arrives, the posterior stays at 0 or 1.
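A tiny discrete sketch of this effect, with hypothetical hypotheses and numbers:

```python
# Two hypotheses about a coin; one is (unwisely) given prior probability zero.
priors = {"fair": 1.0, "two_headed": 0.0}

# Likelihood of observing 10 heads in a row under each hypothesis.
likelihoods = {"fair": 0.5 ** 10, "two_headed": 1.0}

# Bayes' theorem: posterior proportional to prior times likelihood.
unnorm = {h: priors[h] * likelihoods[h] for h in priors}
total = sum(unnorm.values())
posterior = {h: unnorm[h] / total for h in unnorm}

# Ten straight heads is strong evidence for "two_headed", yet its
# posterior is still exactly zero: a zero prior cannot be updated.
print(posterior)  # {'fair': 1.0, 'two_headed': 0.0}
```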
Example
The classic example is flipping a coin for which you don't know the weight.
The setup is this: we have a coin with an unknown probability of heads, \(\theta\).
Prior Distribution
What are our prior beliefs regarding \(\theta\)? Well, we can be reasonably sure that it will be in the interval [0,1]. In this example, we will assume that all valid values are equally weighted (uniform).
$$P(\theta)=\begin{cases} 1 & \theta\in[0,1]\\
0 &\text{otherwise}\end{cases}$$
This is an example of a non-informative prior, reflecting that we have no strong prior belief about \(\theta\).
Likelihood
Now we incorporate the data observed. We flip said coin n times and observe x heads. The likelihood \(P(x|\theta)\) is the probability of x heads, given that the probability of heads is \(\theta\).
$$P(x|\theta)={n \choose x}\theta^x(1-\theta)^{n-x}$$
Posterior Distribution
$$P(\theta|x)=\frac{P(x|\theta)P(\theta)}{P(x)}$$
Since \(P(x)\) is just a normalizing constant and we chose a uniform prior, the posterior is proportional to the likelihood:
$$P(\theta|x)\propto\theta^x(1-\theta)^{n-x}$$
This is the kernel of a beta distribution, whose density is
$$f(x;\alpha,\beta)=\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}$$
Through matching exponents, we can see that
$$\alpha=x+1$$
$$\beta=n-x+1$$
- If you observe a lot of heads, the posterior will shift towards higher values of \(\theta\).
- If you observe a lot of tails, the posterior will shift towards lower values of \(\theta\).
- The more data you collect, the sharper the posterior distribution becomes, reflecting increasing confidence in the estimated value of \(\theta\).
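The last point can be checked numerically with the Beta posterior derived above, holding the observed fraction of heads at an assumed 70% while n grows:

```python
from scipy.stats import beta

# With the uniform prior, the posterior after x heads in n flips is
# Beta(x + 1, n - x + 1).  Fix the heads fraction at 70% and grow n:
# the posterior standard deviation shrinks.
stds = []
for n, x in [(10, 7), (100, 70), (1000, 700)]:
    posterior = beta(x + 1, n - x + 1)
    stds.append(posterior.std())
    print(n, round(posterior.mean(), 3), round(posterior.std(), 4))
```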
What is the "average" estimate of \(\theta\) now?
$$\mathbb{E}[\theta|X]=\frac{x+1}{n+2}$$
Note that this is not the fraction of heads we observed. The maximum likelihood estimate would correspond to that fraction, \(\hat{\theta}=\frac{x}{n}\).
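A short sketch comparing the two estimates; the data (7 heads in 10 flips) is an arbitrary choice:

```python
from scipy.stats import beta

# Uniform prior + x heads in n flips  ->  posterior Beta(x+1, n-x+1).
n, x = 10, 7
posterior = beta(x + 1, n - x + 1)

post_mean = posterior.mean()  # (x + 1) / (n + 2)
mle = x / n                   # maximum likelihood estimate

print(post_mean)  # 8/12 ≈ 0.667
print(mle)        # 0.7
```

The posterior mean is pulled slightly toward 1/2 relative to the MLE, reflecting the uniform prior.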
Point Estimates
While the posterior distribution provides a complete description of our updated belief about \(\theta\), in some cases we need a single point estimate for the parameter. There are a few reasonable choices for how to obtain one.
Maximum A Posteriori (MAP) Estimate
One is the Maximum A Posteriori (MAP) estimate, which selects the value of \(\theta\) that maximizes the posterior distribution.
Definition
The MAP estimate is defined as:
$$\hat{\theta}_{\text{MAP}}=\arg \max_{\theta}P(\theta|x)$$
By Bayes' theorem, we can rewrite as:
$$P(\theta|x)\propto P(x|\theta)P(\theta)$$
Thus, the MAP estimate can also be seen as maximizing the product of the likelihood with the prior. The MAP estimate balances fitting the data through likelihood and our prior beliefs through \(P(\theta)\).
MLE Revisited
The MLE is actually a special case of MAP estimation when the prior distribution \(P(\theta)\) is uniform.
$$\hat{\theta}_{\text{MLE}}=\arg \max_\theta P(x|\theta)$$
Because multiplying by a positive constant does not change where the maximum occurs, substituting a constant prior into the MAP objective recovers the MLE above.
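To make this concrete, here is a grid-search sketch; the data (7 heads in 10 flips) and the informative Beta(2, 2) prior are illustrative assumptions:

```python
import numpy as np
from scipy.stats import beta

n, x = 10, 7
theta = np.linspace(0.001, 0.999, 9981)  # grid over (0, 1)

log_lik = x * np.log(theta) + (n - x) * np.log(1 - theta)

# Uniform prior: the log-posterior is the log-likelihood plus a
# constant, so the MAP estimate coincides with the MLE.
map_uniform = theta[np.argmax(log_lik)]

# An informative Beta(2, 2) prior pulls the estimate toward 0.5.
log_post = log_lik + beta(2, 2).logpdf(theta)
map_beta = theta[np.argmax(log_post)]

print(map_uniform)  # ≈ 0.7, the MLE x/n
print(map_beta)     # ≈ 0.667, the mode of the Beta(9, 5) posterior
```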
Summary of Point Estimates
| Method | Estimate | Formula | Best for | Limitations |
| --- | --- | --- | --- | --- |
| MAP Estimate | Maximizes the posterior | \(\hat{\theta}_{\text{MAP}} = \arg \max_{\theta} P(\theta \mid x)\) | Computational simplicity, regularization | Depends on the prior; ignores the rest of the distribution |
| Posterior Mean | Expected value of the posterior | \(\hat{\theta}_{\text{mean}} = E[\theta \mid x] = \int \theta \, P(\theta \mid x) \, d\theta\) | Minimizing squared loss, average effects | Sensitive to skewness |
| Posterior Median | Value that splits the posterior in half | \(P(\theta \leq \hat{\theta}_{\text{median}} \mid x) = 0.5\) | Skewed distributions, minimizing absolute loss | Harder to compute |
Posterior Predictive Distribution
Formula
The posterior predictive distribution is the distribution of a new, unobserved data point \(x_{\text{new}}\), given the observed data \(x\). It is computed by averaging the likelihood of \(x_{\text{new}}\), weighted by the posterior distribution of the parameter(s) \(\theta\):
$$P(x_{\text{new}}|x)=\int_{-\infty}^{\infty}P(x_{\text{new}}|\theta)P(\theta|x)d\theta$$
- \(P(x_{\text{new}}|\theta)\): The likelihood function that describes how new data \(x_{\text{new}}\) is generated, given a particular value of parameter \(\theta\).
- \(P(\theta|x)\): The posterior distribution of parameter \(\theta\), which reflects our updated belief about \(\theta\) after observing the data x.
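For the coin example, the posterior predictive probability that the next flip is heads reduces to the posterior mean of \(\theta\). A numeric sketch, again assuming a uniform prior and 7 heads in 10 flips:

```python
from scipy.integrate import quad
from scipy.stats import beta

# Posterior from the coin example: uniform prior, x heads in n flips.
n, x = 10, 7
posterior = beta(x + 1, n - x + 1)

# P(next flip = heads | x) = ∫ P(heads | theta) P(theta | x) dtheta
#                          = ∫ theta * P(theta | x) dtheta
pred_heads, _ = quad(lambda t: t * posterior.pdf(t), 0.0, 1.0)

print(pred_heads)  # (x + 1) / (n + 2) = 8/12 ≈ 0.667
```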
Conjugate Priors
In Bayesian inference, a conjugate prior is a prior distribution that, when combined with a specific likelihood function, results in a posterior distribution that belongs to the same family of distributions as the prior. This conjugacy simplifies the calculation of the posterior distribution and makes Bayesian analysis more tractable, especially when working with common probability distributions.
Definition
Mathematically, a prior \(P(\theta)\) is said to be conjugate to the likelihood \(P(x|\theta)\) if the posterior distribution \(P(\theta|x)\) is in the same family as the prior; only the parameters of the distribution are updated based on the observed data.
Beta Prior for Binomial
See also Beta-Binomial simulation
Instead of the uniform prior used in the earlier coin-flip example, we will use a beta prior:
\[
P(\theta) = \frac{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}
\]
Binomial Likelihood
The likelihood for binomial data (observing \( x \) successes in \( n \) trials) is:
\[
P(x | \theta) = \binom{n}{x} \theta^x (1 - \theta)^{n - x}
\]
Posterior Distribution
The posterior distribution, given a Beta prior and binomial likelihood, is:
\[
P(\theta | x) = \text{Beta}(x + \alpha, n - x + \beta)
\]
Numerical Example
If we observe 7 heads in 10 flips and use a \( \text{Beta}(2, 2) \) prior, the posterior is:
\[
P(\theta | x = 7) = \text{Beta}(9, 5)
\]
Normal-Normal Conjugacy
If the prior distribution for the mean \( \mu \) of a normal distribution is:
\[
P(\mu) = N(\mu_0, \tau_0^2)
\]
Normal Likelihood
The likelihood for normally distributed data \( x \) with known variance \( \sigma^2 \) is:
\[
P(x | \mu) = N(\mu, \sigma^2)
\]
Posterior Distribution
The posterior distribution, given \( n \) observations and a normal prior, is:
\[
P(\mu | x) = N\left( \frac{\tau_0^2 \sum_{i=1}^{n} x_i + \sigma^2 \mu_0}{n \tau_0^2 + \sigma^2}, \frac{\sigma^2 \tau_0^2}{n \tau_0^2 + \sigma^2} \right)
\]
Numerical Example
Given a prior \( N(170, 5^2) \) for the average height of a population and 5 observations \( 168, 172, 169, 170, 171 \), the posterior distribution can be computed to update our estimate of the population mean.
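The text does not fix the observation variance \(\sigma^2\), so the sketch below assumes \(\sigma = 3\) for illustration:

```python
# Normal-Normal update for the height example.
mu_0, tau_0_sq = 170.0, 5.0 ** 2   # prior N(170, 5^2)
sigma_sq = 3.0 ** 2                # assumed known data variance (sigma = 3)
data = [168, 172, 169, 170, 171]
n, s = len(data), sum(data)

post_mean = (tau_0_sq * s + sigma_sq * mu_0) / (n * tau_0_sq + sigma_sq)
post_var = (sigma_sq * tau_0_sq) / (n * tau_0_sq + sigma_sq)

# The sample mean happens to equal the prior mean (170), so the
# posterior mean stays at 170 while the variance shrinks from 25.
print(post_mean)  # 170.0
print(post_var)   # 225/134 ≈ 1.68
```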