Distribution Textbook (Work in Progress)

by John Della Rosa

Maximum Likelihood Estimation (MLE)

Introduction to Maximum Likelihood Estimation

Recommended Prerequisites

  1. Probability
  2. Statistics
  3. Probability II

What is MLE?

Maximum Likelihood Estimation (MLE) is a widely used method for estimating the parameters of a statistical model. It is based on the principle of finding the parameter values that maximize the likelihood function, which measures how likely it is that a given set of parameters would produce the observed data. You can think of the likelihood function as the product of the PDF/PMF evaluated at each data point for a given set of parameters. Whereas we typically treat the data values as the quantity that varies, here it is the parameters that vary while the data are held fixed.

Likelihood Function

Consider a statistical model with a probability density function (pdf) or probability mass function (pmf) \(f(x|\theta)\), where \(x\) represents the observed data and \(\theta\) denotes the parameters of the model. The likelihood function is defined as $$L(\theta|x)=f(x|\theta)$$ For a sample \(x_1,x_2,\dots,x_n\) drawn from this distribution, the likelihood function for the entire sample is: $$L(\theta|x_1,x_2,\dots,x_n)=\prod_{i=1}^{n}f(x_i|\theta)$$ The likelihood function measures how probable the observed data are for different values of \(\theta\). This is in contrast to the PDF/PMF, which describes the probability (or density) of a given data point over its support. With MLE, the axis of interest is the set of possible parameter values for a given set of observations.
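As a concrete sketch of this idea, the snippet below evaluates the likelihood of a small hypothetical exponential sample over a grid of candidate rate parameters; the data values and the grid are arbitrary choices made purely for illustration.

```python
import numpy as np
from scipy.stats import expon

# Hypothetical observed sample (illustrative values only)
x = np.array([0.8, 1.4, 0.3, 2.1, 0.9])

# Grid of candidate rate parameters lambda > 0
lambdas = np.linspace(0.1, 3.0, 200)

# Likelihood L(lambda | x) = prod_i f(x_i | lambda) for the exponential pdf;
# scipy parameterizes the exponential by scale = 1 / lambda
likelihoods = np.array([np.prod(expon.pdf(x, scale=1.0 / lam)) for lam in lambdas])

# The grid value of lambda with the highest likelihood
lam_hat = lambdas[np.argmax(likelihoods)]
print(f"grid-based MLE of lambda: {lam_hat:.3f}")
```

Note that the data array is held fixed while \(\lambda\) varies across the grid: the "axis" here is the parameter, not the data.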

Likelihood Function Interactive Chart


Log-Likelihood Function

The likelihood function can be unwieldy due to its multiplicative nature, especially when dealing with large samples. To simplify calculations, we often use the log-likelihood function \(\ell(\theta|x)\), which is the natural logarithm of the likelihood function: $$\ell(\theta \mid x) = \log L(\theta \mid x) = \log \left( \prod_{i=1}^{n} f(x_i \mid \theta) \right) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$ Maximizing the log-likelihood function is equivalent to maximizing the likelihood function, since the log is a strictly increasing function, but it simplifies the computations and is less prone to numerical issues.
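A minimal sketch of the numerical issue, assuming a normal model and simulated data chosen only for illustration: with many observations the raw product of densities underflows to zero in double precision, while the sum of log-densities stays finite.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=5000)  # simulated data, for illustration

mu, sigma = 5.0, 2.0

# Raw likelihood: a product of 5000 densities underflows to 0.0
likelihood = np.prod(norm.pdf(x, loc=mu, scale=sigma))

# Log-likelihood: a sum of log-densities remains finite and easy to compare
log_likelihood = np.sum(norm.logpdf(x, loc=mu, scale=sigma))

print(likelihood)       # 0.0 (underflow)
print(log_likelihood)   # a finite negative number
```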

Score

Let \(L(\theta|X)\) represent the likelihood function of the parameter vector \(\theta\) given the data \(X=(X_1,X_2,\dots,X_n)\). The score function is then the gradient of the log-likelihood with respect to \(\theta\): $$S(\theta|X)=\nabla_{\theta} \log L(\theta|X)$$
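As a worked example (the exponential distribution is chosen here only for concreteness), suppose \(X_1,\dots,X_n\) are i.i.d. exponential with rate \(\lambda\), so \(f(x|\lambda)=\lambda e^{-\lambda x}\) for \(x\ge 0\). Then $$\ell(\lambda \mid X) = n\log\lambda - \lambda\sum_{i=1}^{n}X_i, \qquad S(\lambda|X)=\frac{\partial \ell}{\partial \lambda}=\frac{n}{\lambda}-\sum_{i=1}^{n}X_i$$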

Interpreting the Score

Expected Value

Under standard regularity conditions, the score has expected value zero when evaluated at the true parameter value: $$\mathbb{E}[S(\theta|X)]=0$$
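Continuing the exponential example above: since each \(X_i\) has mean \(1/\lambda\), $$\mathbb{E}[S(\lambda|X)] = \frac{n}{\lambda}-\sum_{i=1}^{n}\mathbb{E}[X_i]=\frac{n}{\lambda}-\frac{n}{\lambda}=0$$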

Fisher Information

The Fisher information quantifies how much information the data carry about \(\theta\); because the score has mean zero, it is also the variance of the score: $$\mathcal{I}(\theta)=\mathbb{E}\left[\left(\frac{\partial}{\partial \theta}\log L(\theta|X)\right)^2\right]$$
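For the exponential example, with \(\mathrm{Var}(X_i)=1/\lambda^2\), $$\mathcal{I}(\lambda)=\mathrm{Var}\left(\frac{n}{\lambda}-\sum_{i=1}^{n}X_i\right)=\sum_{i=1}^{n}\mathrm{Var}(X_i)=\frac{n}{\lambda^2}$$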

Estimating Parameters

To find the Maximum Likelihood Estimates (MLEs) of the parameters \(\theta\), we solve the optimization problem: $$\hat{\theta} = \arg\max_{\theta} \ell(\theta \mid x)$$ This involves taking the derivative of the log-likelihood function with respect to the parameters and setting it to zero: $$\frac{\partial \ell(\theta \mid x)}{\partial \theta} = 0$$ When these equations have no closed-form solution, the maximization is carried out numerically.
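A minimal sketch of the numerical route, reusing the hypothetical exponential sample from earlier: the negative log-likelihood is minimized with a generic optimizer, and the result is compared against the closed-form exponential MLE \(\hat{\lambda}=n/\sum_i x_i\). The starting value and data are arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import expon

x = np.array([0.8, 1.4, 0.3, 2.1, 0.9])  # hypothetical sample from earlier

def neg_log_likelihood(params):
    lam = params[0]
    if lam <= 0:                      # keep the optimizer inside the support
        return np.inf
    return -np.sum(expon.logpdf(x, scale=1.0 / lam))

# Maximize the log-likelihood by minimizing its negative
result = minimize(neg_log_likelihood, x0=[1.0], method="Nelder-Mead")
print("numerical MLE  :", result.x[0])

# Closed-form MLE for the exponential rate: lambda_hat = n / sum(x)
print("closed-form MLE:", len(x) / x.sum())
```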

Properties of MLE

  1. Consistency: Under regularity conditions, as the sample size n approaches infinity, the MLE converges in probability to the true parameter value, as the simulation sketch below illustrates.
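A small simulation sketch of consistency, assuming an exponential model with a true rate of 2.0; the sample sizes and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
true_lambda = 2.0  # assumed true rate, for illustration

# The closed-form MLE n / sum(x) should approach true_lambda as n grows
for n in [10, 100, 1_000, 10_000, 100_000]:
    x = rng.exponential(scale=1.0 / true_lambda, size=n)
    lam_hat = n / x.sum()
    print(f"n = {n:>6d}  lambda_hat = {lam_hat:.4f}")
```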

Maximum Likelihood Estimate Practice Problems