Distribution Textbook (Work in Progress)

by John Della Rosa

Maximum Likelihood Estimation (MLE)

Introduction to Maximum Likelihood Estimation

Recommended Prerequisites

  1. Probability
  2. Statistics
  3. Probability II
  4. Estimators

What is MLE?

Maximum Likelihood Estimation (MLE) is a widely used method for estimating the parameters of a statistical model. It is based on the principle of finding the parameter values that maximize the likelihood function, which measures how plausible a given set of parameters is in light of the observed data. You can think of the likelihood function as the product of the PDF/PMF evaluated at each data point for a given set of parameters. Whereas we usually treat the data values as the variable of interest, here the data are held fixed and the parameters vary.

Likelihood Function

Consider a statistical model with a probability density function (pdf) or probability mass function (pmf) \(f(x|\theta)\), where \(x\) represents the observed data and \(\theta\) denotes the parameters of the model. The likelihood function is defined as $$L(\theta|x)=f(x|\theta)$$ For a sample \(x_1,x_2,\dots,x_n\) drawn independently from this distribution, the likelihood function for the entire sample is: $$L(\theta|x_1,x_2,\dots,x_n)=\prod_{i=1}^{n}f(x_i|\theta)$$ The likelihood function measures how probable the observed data are for different values of \(\theta\). This is in contrast to the PDF/PMF, which describes the probability (or density) of a given data point over its support. With MLE, the axis is the set of possible parameter values for a given set of observations.
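As a concrete illustration, the sketch below (assuming Python with NumPy, a made-up exponential sample, and a hypothetical helper named exp_likelihood) evaluates the likelihood on a grid of candidate \(\lambda\) values; the data are held fixed while the parameter varies.

```python
import numpy as np

# Hypothetical sample, assumed to come from an Exponential(lambda) distribution
x = np.array([1.0, 2.0, 1.5, 3.0, 2.5])

def exp_likelihood(lam, data):
    """Likelihood L(lambda | data) = prod_i lambda * exp(-lambda * x_i)."""
    return np.prod(lam * np.exp(-lam * data))

# Evaluate the likelihood on a grid of candidate parameter values:
# the "axis" here is lambda, not the data.
grid = np.linspace(0.05, 2.0, 40)
likelihoods = [exp_likelihood(lam, x) for lam in grid]
print(grid[np.argmax(likelihoods)])  # grid value with the highest likelihood, about 0.5
```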

Log-Likelihood Function

The likelihood function can be unwieldy due to its multiplicative nature, especially when dealing with large samples. To simplify calculations, we often use the log-likelihood function \(\ell(\theta|x)\), which is the natural logarithm of the likelihood function: $$\ell(\theta \mid x) = \log L(\theta \mid x) = \log \left( \prod_{i=1}^{n} f(x_i \mid \theta) \right) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$ Maximizing the log-likelihood is equivalent to maximizing the likelihood, since the logarithm is a strictly increasing function, but it simplifies the computations and is less prone to numerical issues.
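The numerical point can be seen directly: for a large simulated sample, the raw product of densities underflows to zero in double precision, while the sum of log-densities stays well behaved. A minimal sketch, assuming NumPy and a simulated exponential sample:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=2000)  # simulated sample, true lambda = 0.5
lam = 0.5

# Direct product of densities underflows for large n...
likelihood = np.prod(lam * np.exp(-lam * x))
# ...while the sum of log-densities remains finite.
log_likelihood = np.sum(np.log(lam) - lam * x)

print(likelihood)      # 0.0 due to floating-point underflow
print(log_likelihood)  # a finite (large negative) number
```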

Score

Let \(L(\theta|X)\) represent the likelihood function of the parameter vector \(\theta\) given the data \(X=(X_1,X_2,\dots,X_n)\). The score function is the gradient of the log-likelihood with respect to \(\theta\): $$S(\theta \mid X)=\nabla_{\theta} \log L(\theta \mid X)$$

Interpreting the Score

Expected Value

Under standard regularity conditions, the score evaluated at the true parameter value has mean zero: $$\mathbb{E}[S(\theta \mid X)]=0$$

Fisher Information

The Fisher information measures how much information the data carry about \(\theta\); it is the expected squared score (equivalently, the variance of the score, since the score has mean zero): $$\mathcal{I}(\theta)=\mathbb{E}\left[\left(\frac{\partial}{\partial \theta}\log L(\theta \mid X)\right)^2\right]$$
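A small Monte Carlo sketch illustrates both facts, assuming NumPy and an Exponential(\(\lambda\)) model, for which the score of a sample of size \(n\) is \(n/\lambda - \sum_i x_i\) and the Fisher information is \(n/\lambda^2\): the average score at the true parameter is near zero, and the average squared score is near \(n/\lambda^2\).

```python
import numpy as np

rng = np.random.default_rng(1)
true_lam, n, reps = 0.5, 50, 20000

scores = []
for _ in range(reps):
    x = rng.exponential(scale=1 / true_lam, size=n)
    scores.append(n / true_lam - x.sum())  # score of the sample at the true lambda

scores = np.array(scores)
print(scores.mean())         # near 0, illustrating E[S(theta | X)] = 0
print((scores ** 2).mean())  # near n / lambda^2 = 200, the Fisher information
```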

Akaike Information Criterion

$$\text{AIC}=-2\log(L)+2k$$ where \(L\) is the maximized likelihood of the model given the data and \(k\) is the number of estimated parameters in the model. The second term penalizes model complexity: adding a parameter is only worthwhile if the improvement in fit (through the likelihood term) outweighs the penalty of 2 per parameter.

Bayesian Information Criterion

$$\text{BIC}=-2\log(L)+k\log(n)$$ where \(L\) is the maximized likelihood of the model given the data, \(k\) is the number of estimated parameters, and \(n\) is the number of data points. Since \(\log(n) > 2\) whenever \(n \geq 8\), the per-parameter penalty exceeds AIC's for all but the smallest samples, so BIC-based selection favors simpler models more strongly than AIC-based selection.

Example

Let's consider a simple case of fitting regression models with different numbers of predictor variables to a dataset of \( n = 100 \) observations. You fit two models:

  • Model 1: 3 estimated parameters, with maximized likelihood \( L_1 = 0.01 \).
  • Model 2: 5 estimated parameters, with maximized likelihood \( L_2 = 0.02 \).

AIC Calculation

For Model 1:

\[ \text{AIC}_1 = -2 \log(0.01) + 2(3) = 9.21 + 6 = 15.21 \]

For Model 2:

\[ \text{AIC}_2 = -2 \log(0.02) + 2(5) = 7.82 + 10 = 17.82 \]

Since Model 1 has the lower AIC, it is the preferred model.

BIC Calculation

For Model 1:

\[ \text{BIC}_1 = -2 \log(0.01) + 3 \log(100) = 9.21 + 13.82 = 23.03 \]

For Model 2:

\[ \text{BIC}_2 = -2 \log(0.02) + 5 \log(100) = 7.82 + 23.03 = 30.85 \]

Since Model 1 has the lower BIC, it is the preferred model.
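These calculations are easy to script; the sketch below (assuming Python, with hypothetical helper functions aic and bic, and natural logarithms as in the formulas above) reproduces the values computed in this example.

```python
import math

def aic(likelihood, k):
    """AIC = -2 log(L) + 2k."""
    return -2 * math.log(likelihood) + 2 * k

def bic(likelihood, k, n):
    """BIC = -2 log(L) + k log(n)."""
    return -2 * math.log(likelihood) + k * math.log(n)

n = 100  # number of data points in the example above
print(aic(0.01, 3), aic(0.02, 5))        # approximately 15.21 and 17.82
print(bic(0.01, 3, n), bic(0.02, 5, n))  # approximately 23.03 and 30.85
```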

Comparison of AIC and BIC

| Feature | AIC | BIC |
|---|---|---|
| Penalty term | \( 2k \) | \( k \log(n) \) |
| Penalty strength | Lower penalty for more parameters | Stronger penalty, especially as \(n\) increases |
| Model selection | Tends to favor more complex models | Tends to favor simpler models |
| Asymptotic behavior | Minimizes prediction error | Consistent in selecting the true model |
| Philosophy | Predictive accuracy (minimize information loss) | Bayesian model comparison (posterior likelihood) |

Maximum Likelihood Estimation

To find the Maximum Likelihood Estimates (MLEs) of the parameters \(\theta\), we solve the optimization problem: $$\hat{\theta} = \arg\max_{\theta} \ell(\theta \mid x)$$ When the log-likelihood is differentiable and the maximum lies in the interior of the parameter space, this amounts to taking the derivative of the log-likelihood with respect to the parameters and setting it to zero: $$\frac{\partial \ell(\theta \mid x)}{\partial \theta} = 0$$ (Example 2 below shows a case where one parameter is instead determined by a support constraint.) In practice the resulting equations often have no closed-form solution and are solved numerically.
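When no closed form exists, the maximization is typically carried out by minimizing the negative log-likelihood with a numerical optimizer. Below is a minimal sketch, assuming Python with NumPy and SciPy and a simulated Normal sample; the data and function names are illustrative, not taken from the text.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.5, size=200)  # simulated sample

def neg_log_likelihood(params, x):
    """Negative log-likelihood of a Normal(mu, sigma) model."""
    mu, sigma = params
    if sigma <= 0:
        return np.inf  # keep the optimizer inside the valid parameter space
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                   - (x - mu) ** 2 / (2 * sigma ** 2))

result = minimize(neg_log_likelihood, x0=[0.0, 1.0], args=(data,),
                  method="Nelder-Mead")
print(result.x)  # numerical MLEs: close to the sample mean and the (ddof=0) sample SD
```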

Estimating Parameters

Example

Problem
We have a dataset with: $$x_1=1,x_2=2,x_3=1.5,x_4=3,x_5=2.5$$ Let's assume that we want to fit this data to an exponential distribution due to some prior knowledge.
Calculate the Likelihood Function
$$f(x|\lambda)=\lambda e^{-\lambda x}$$ The likelihood function is the product of the PDF evaluated at every datapoint for a given \(\lambda\) $$L(\lambda)=\prod_{i=1}^{n}f(x_i|\lambda)=\prod_{i=1}^n \lambda e^{-\lambda x_i}$$ By the properties of exponentials, we can rewrite this as the following: $$L(\lambda)=\lambda^n e^{-\lambda \sum_{i=1}^n x_i}$$ Plugging in our datapoints: $$L(\lambda)=\lambda^5 e^{-\lambda(1+2+1.5+3+2.5)}=\lambda^5e^{-10\lambda}$$
Take the Log-Likelihood
$$\log L(\lambda)=5\log(\lambda)-10\lambda$$
Differentiate and Set to 0
$$\frac{d}{d\lambda} \log L(\lambda)=\frac{5}{\lambda}-10=0$$ $$\lambda=0.5$$
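A quick numerical check of this result (a sketch assuming Python with NumPy; the closed form \(\hat{\lambda} = n / \sum_i x_i\) follows directly from the derivative above):

```python
import numpy as np

x = np.array([1.0, 2.0, 1.5, 3.0, 2.5])

# Closed-form MLE for the exponential rate: n / sum(x)
lam_hat = len(x) / x.sum()
print(lam_hat)  # 0.5

# Sanity check: the log-likelihood at 0.5 exceeds nearby values
def log_lik(lam):
    return len(x) * np.log(lam) - lam * x.sum()

print(log_lik(0.5) > log_lik(0.4), log_lik(0.5) > log_lik(0.6))  # True True
```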

Example 2

You are given the following data, drawn from a shifted Exponential distribution:

  • Data: \( x_1 = 2, \, x_2 = 3, \, x_3 = 2.5, \, x_4 = 4, \, x_5 = 3.5 \)

Assume that the data comes from a two-parameter Exponential distribution with unknown rate \( \lambda \) and shift \( \theta \). Use Maximum Likelihood Estimation (MLE) to estimate both \( \lambda \) and \( \theta \).

Step 1: Write Down the Likelihood Function

The PDF of the two-parameter Exponential distribution is:

\[ f(x \mid \lambda, \theta) = \lambda e^{-\lambda (x - \theta)}, \quad x \geq \theta \]

The likelihood function for \( n \) independent observations \( x_1, x_2, \dots, x_n \) is the product of the individual densities:

\[ L(\lambda, \theta) = \prod_{i=1}^{n} \lambda e^{-\lambda (x_i - \theta)} = \lambda^n e^{-\lambda \sum_{i=1}^{n} (x_i - \theta)} \]
Step 2: Take the Log-Likelihood

To simplify the maximization, take the natural logarithm of the likelihood function. The log-likelihood is:

\[ \log L(\lambda, \theta) = n \log(\lambda) - \lambda \sum_{i=1}^{n} (x_i - \theta) \]

where \( n = 5 \) in this case.

Step 3: Differentiate with Respect to \( \lambda \) and Set to Zero

First, differentiate the log-likelihood with respect to \( \lambda \) and set it equal to zero:

\[ \frac{\partial}{\partial \lambda} \log L(\lambda, \theta) = \frac{n}{\lambda} - \sum_{i=1}^{n} (x_i - \theta) = 0 \]

Solving for \( \lambda \) gives:

\[ \lambda = \frac{n}{\sum_{i=1}^{n} (x_i - \theta)} \]

This is the MLE for \( \lambda \), given the shift parameter \( \theta \).

Step 4: Maximize with Respect to \( \theta \)

Next, differentiate the log-likelihood with respect to \( \theta \):

\[ \frac{\partial}{\partial \theta} \log L(\lambda, \theta) = \lambda n > 0 \]

Since this derivative is strictly positive, the log-likelihood is increasing in \( \theta \), so setting it to zero has no solution. Instead, we take \( \theta \) as large as the support allows: the density requires \( x_i \geq \theta \) for every observation, so \( \theta \leq \min(x_1, x_2, \dots, x_n) \). The MLE for the shift parameter is therefore the sample minimum:

\[ \hat{\theta} = \min(x_1, x_2, \dots, x_n) = 2 \]

Step 5: Final Estimates

With \( \hat{\theta} = 2 \), we can substitute this back into the equation for \( \lambda \) to find its MLE:

\[ \hat{\lambda} = \frac{5}{(2 + 3 + 2.5 + 4 + 3.5) - 5 \times 2} = \frac{5}{15 - 10} = \frac{5}{5} = 1 \]

Therefore, the Maximum Likelihood Estimates for the parameters are \( \hat{\theta} = 2 \) and \( \hat{\lambda} = 1 \).
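These values can be checked numerically; here is a short sketch, assuming NumPy, that applies the two formulas derived above to the data:

```python
import numpy as np

x = np.array([2.0, 3.0, 2.5, 4.0, 3.5])

theta_hat = x.min()                       # MLE of the shift: sample minimum
lam_hat = len(x) / (x - theta_hat).sum()  # MLE of the rate, given theta_hat

print(theta_hat, lam_hat)  # 2.0 and 1.0
```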

Properties of MLE

  1. Consistency: Under suitable regularity conditions, as the sample size \(n\) approaches infinity, the MLE converges in probability to the true parameter value (see the simulation sketch below).
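The simulation below is a minimal sketch of this idea, assuming NumPy and an Exponential model with true rate \(\lambda = 0.5\): as \(n\) grows, the estimate \(\hat{\lambda} = n / \sum_i x_i\) settles near the true value.

```python
import numpy as np

rng = np.random.default_rng(3)
true_lam = 0.5

for n in [10, 100, 1000, 100000]:
    x = rng.exponential(scale=1 / true_lam, size=n)
    lam_hat = n / x.sum()  # MLE of the exponential rate
    print(n, lam_hat)      # estimates concentrate around 0.5 as n grows
```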

Likelihood Function Interactive Chart

(Interactive chart: plots the likelihood as a function of the parameter for user-supplied data points and reports the corresponding AIC and BIC.)

Table of Maximum Likelihood Estimates for Common Distributions

| Distribution | Parameters | MLE Formula | Reference |
|---|---|---|---|
| Normal (Gaussian) | \(\mu, \sigma\) | \(\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i\), \(\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{\mu})^2\) | Casella & Berger, *Statistical Inference*, 2002 |
| Exponential | \(\lambda\) | \(\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} X_i}\) | Lehmann & Casella, *Theory of Point Estimation*, 1998 |
| Poisson | \(\lambda\) | \(\hat{\lambda} = \frac{1}{n} \sum_{i=1}^{n} X_i\) | Myung, *Tutorial on Maximum Likelihood Estimation*, 2003 |
| Uniform \((a, b)\) | \(a, b\) | \(\hat{a} = \min(X_1, \dots, X_n)\), \(\hat{b} = \max(X_1, \dots, X_n)\) | Casella & Berger, *Statistical Inference*, 2002 |
| Binomial \((m, p)\), \(m\) known | \(p\) | \(\hat{p} = \frac{1}{nm} \sum_{i=1}^{n} X_i\) | Lehmann & Casella, *Theory of Point Estimation*, 1998 |
| Gamma \((\alpha, \beta)\) | \(\alpha, \beta\) | Numerical optimization required | Myung, *Tutorial on Maximum Likelihood Estimation*, 2003 |
| Beta \((\alpha, \beta)\) | \(\alpha, \beta\) | Numerical optimization required | Lehmann & Casella, *Theory of Point Estimation*, 1998 |
| Negative Binomial \((r, p)\) | \(r, p\) | Numerical optimization required | Casella & Berger, *Statistical Inference*, 2002 |
| Log-Normal \((\mu, \sigma)\) | \(\mu, \sigma\) | \(\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} \log(X_i)\), \(\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (\log(X_i) - \hat{\mu})^2\) | Casella & Berger, *Statistical Inference*, 2002 |
| Weibull \((\lambda, k)\) | \(\lambda, k\) | Numerical optimization required | Myung, *Tutorial on Maximum Likelihood Estimation*, 2003 |
| Chi-Squared \((k)\) | \(k\) | Numerical optimization required | Casella & Berger, *Statistical Inference*, 2002 |
| Cauchy \((x_0, \gamma)\) | \(x_0, \gamma\) | Numerical optimization required | Myung, *Tutorial on Maximum Likelihood Estimation*, 2003 |
| Pareto \((x_m, \alpha)\) | \(x_m, \alpha\) | \(\hat{x}_m = \min(X_1, \dots, X_n)\), \(\hat{\alpha} = \frac{n}{\sum_{i=1}^{n} \log(X_i/\hat{x}_m)}\) | Casella & Berger, *Statistical Inference*, 2002 |
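As an illustration of a few of the closed-form entries above, here is a brief sketch assuming NumPy and simulated data (the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

normal_x = rng.normal(loc=5.0, scale=2.0, size=1000)
expon_x = rng.exponential(scale=2.0, size=1000)  # rate lambda = 0.5
poisson_x = rng.poisson(lam=3.0, size=1000)

mu_hat = normal_x.mean()
sigma2_hat = np.mean((normal_x - mu_hat) ** 2)  # note: divides by n, not n - 1
lam_hat_exp = len(expon_x) / expon_x.sum()
lam_hat_pois = poisson_x.mean()

print(mu_hat, sigma2_hat)  # near 5.0 and 4.0
print(lam_hat_exp)         # near 0.5
print(lam_hat_pois)        # near 3.0
```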

Maximum Likelihood Estimate Practice Problems

  1. You fit two models to a dataset with 150 data points:

    • Model A: Includes 4 parameters and has a likelihood \(L_A = 0.03\).
    • Model B: Includes 6 parameters and has a likelihood \(L_B = 0.04\).

    Calculate the AIC and BIC for both models. Which model is preferred by AIC? Which model is preferred by BIC?

  2. Assume you have the following data drawn from a uniform distribution \( U(0, \theta) \), where \( \theta \) is unknown. The observed data is:

    • Data: \( x_1 = 2, \, x_2 = 3, \, x_3 = 1.5, \, x_4 = 4, \, x_5 = 2.5 \)

    1. Write down the likelihood function for the unknown parameter \( \theta \).

    2. Explain why setting the derivative of the log-likelihood to zero does not produce the maximizer.

    3. Determine the MLE \( \hat{\theta} \).

  3. Given data points \( (0.5, 1.5, 2, 2.5, 1) \) drawn from an exponential distribution:
    1. Write down the likelihood function.
    2. Solve for the MLE of \( \lambda \).
  4. Suppose you have the following data points drawn from a normal distribution with unknown mean \( \mu \) and known variance \( \sigma^2 = 1 \):

    • Data: \( x_1 = 2, \, x_2 = 3, \, x_3 = 2.5, \, x_4 = 4, \, x_5 = 3.5 \)

    1. Write down the likelihood function for the unknown parameter \( \mu \).

    2. Take the log-likelihood and differentiate it with respect to \( \mu \).

    3. Solve for \( \mu \) to find the MLE estimate.

  5. Assume you have the following data from a series of Bernoulli trials, where \( x_i \in \{0,1\} \) represents success (1) or failure (0) in a trial. The probability of success is \( p \), and you want to estimate \( p \) using MLE:

    • Data: \( x_1 = 1, \, x_2 = 0, \, x_3 = 1, \, x_4 = 1, \, x_5 = 0 \)
    1. Write down the likelihood function for the unknown parameter \( p \).
    2. Take the log-likelihood and differentiate it with respect to \( p \).
    3. Solve for \( p \) to find the MLE estimate.
  6. Consider the following data drawn from a geometric distribution where the random variable represents the number of trials until the first success. The probability of success is \( p \), and you want to estimate \( p \) using MLE:

    • Data: \( x_1 = 3, \, x_2 = 1, \, x_3 = 2, \, x_4 = 4, \, x_5 = 2 \)
    1. Write down the likelihood function for the unknown parameter \( p \).
    2. Take the log-likelihood and differentiate it with respect to \( p \).
    3. Solve for \( p \) to find the MLE estimate.