Let’s consider a simple case of fitting regression models with different numbers of predictor variables to a dataset of \( n = 100 \) observations. You fit two models: Model 1 has \( k = 3 \) estimated parameters and a maximized likelihood of \( \hat{L}_1 = 0.01 \), while Model 2 has \( k = 5 \) estimated parameters and a maximized likelihood of \( \hat{L}_2 = 0.02 \).
For Model 1:
\[ \text{AIC}_1 = -2 \log(0.01) + 2(3) = 9.21 + 6 = 15.21 \]For Model 2:
\[ \text{AIC}_2 = -2 \log(0.02) + 2(5) = 7.82 + 10 = 17.82 \]Since Model 1 has the lower AIC, it is the preferred model.
For Model 1:
\[ \text{BIC}_1 = -2 \log(0.01) + 3 \log(100) = 9.21 + 13.82 = 23.03 \]For Model 2:
\[ \text{BIC}_2 = -2 \log(0.02) + 5 \log(100) = 7.82 + 23.03 = 30.85 \]Since Model 1 has the lower BIC, it is the preferred model.
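To make the arithmetic above easy to reproduce, here is a minimal Python sketch (standard library only) that evaluates both criteria for the same setup: maximized likelihoods of 0.01 and 0.02, \( k = 3 \) and \( k = 5 \) parameters, and \( n = 100 \) observations. The helper names `aic` and `bic` are just illustrative, not from any particular library.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: -2 log L + 2k."""
    return -2 * log_lik + 2 * k

def bic(log_lik, k, n):
    """Bayesian information criterion: -2 log L + k log n."""
    return -2 * log_lik + k * math.log(n)

# Worked example: maximized likelihoods 0.01 and 0.02, k = 3 and 5, n = 100
models = {"Model 1": (math.log(0.01), 3), "Model 2": (math.log(0.02), 5)}
n = 100
for name, (ll, k) in models.items():
    print(f"{name}: AIC = {aic(ll, k):.2f}, BIC = {bic(ll, k, n):.2f}")
# Model 1: AIC = 15.21, BIC = 23.03
# Model 2: AIC = 17.82, BIC = 30.85
```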
| Feature | AIC | BIC |
|---|---|---|
| Penalty term | \( 2k \) | \( k \log(n) \) |
| Penalty strength | Lower penalty for additional parameters | Stronger penalty, especially as \( n \) increases |
| Model selection | Tends to favor more complex models | Tends to favor simpler models |
| Asymptotic behavior | Minimizes prediction error | Consistent in selecting the true model |
| Philosophy | Predictive accuracy (minimize information loss) | Bayesian model comparison (approximates the posterior model probability) |
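The penalty column is the only place the two criteria differ, so a quick sketch of how the penalties diverge with sample size may help (plain Python again; the specific values of \( k \) and \( n \) below are arbitrary illustrations):

```python
import math

k = 5  # number of estimated parameters (arbitrary illustration)
for n in (20, 100, 1_000, 100_000):
    print(f"n = {n:>7}: AIC penalty = {2 * k}, BIC penalty = {k * math.log(n):.1f}")
# BIC penalizes harder than AIC as soon as log(n) > 2, i.e. n > e^2 ≈ 7.4
```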
You are given the following data drawn from a shifted Exponential distribution: \( x = \{2,\ 3,\ 2.5,\ 4,\ 3.5\} \).
Assume that the data comes from a two-parameter Exponential distribution with unknown rate \( \lambda \) and shift \( \theta \). Use Maximum Likelihood Estimation (MLE) to estimate both \( \lambda \) and \( \theta \).
The PDF of the two-parameter Exponential distribution is:
\[ f(x \mid \lambda, \theta) = \lambda e^{-\lambda (x - \theta)}, \quad x \geq \theta \]The likelihood function for \( n \) independent observations \( x_1, x_2, \dots, x_n \) is the product of the individual densities:
\[ L(\lambda, \theta) = \prod_{i=1}^{n} \lambda e^{-\lambda (x_i - \theta)} = \lambda^n e^{-\lambda \sum_{i=1}^{n} (x_i - \theta)} \]To simplify the maximization, take the natural logarithm of the likelihood function. The log-likelihood is:
\[ \log L(\lambda, \theta) = n \log(\lambda) - \lambda \sum_{i=1}^{n} (x_i - \theta) \]where \( n = 5 \) in this case.
First, differentiate the log-likelihood with respect to \( \lambda \) and set it equal to zero:
\[ \frac{\partial}{\partial \lambda} \log L(\lambda, \theta) = \frac{n}{\lambda} - \sum_{i=1}^{n} (x_i - \theta) = 0 \]Solving for \( \lambda \) gives:
\[ \lambda = \frac{n}{\sum_{i=1}^{n} (x_i - \theta)} \]This is the MLE for \( \lambda \), given the shift parameter \( \theta \).
Next, differentiate the log-likelihood with respect to \( \theta \):
\[ \frac{\partial}{\partial \theta} \log L(\lambda, \theta) = \lambda n \]This derivative is strictly positive, so the log-likelihood increases with \( \theta \) and cannot be maximized by setting the derivative to zero. Instead, \( \theta \) should be taken as large as the support allows: the density requires \( x_i \geq \theta \) for every observation, so the MLE of the shift parameter is \( \theta = \min(x_1, x_2, \dots, x_n) \).
Thus, \( \hat{\theta} = \min(x_1, x_2, \dots, x_n) = 2 \).
With \( \hat{\theta} = 2 \), we can substitute this back into the equation for \( \lambda \) to find its MLE:
\[ \hat{\lambda} = \frac{5}{(2 + 3 + 2.5 + 4 + 3.5) - 5 \times 2} = \frac{5}{15 - 10} = \frac{5}{5} = 1 \]Therefore, the Maximum Likelihood Estimates for the parameters are \( \hat{\theta} = 2 \) and \( \hat{\lambda} = 1 \).
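The closed-form answer is easy to verify numerically. Here is a minimal sketch (plain Python, no external libraries) that recomputes \( \hat{\theta} \) and \( \hat{\lambda} \) from the five observations used above:

```python
# Observations from the shifted Exponential example
x = [2, 3, 2.5, 4, 3.5]
n = len(x)

# MLE of the shift parameter: the smallest observation
theta_hat = min(x)

# MLE of the rate given theta_hat: n / sum(x_i - theta_hat)
lambda_hat = n / sum(xi - theta_hat for xi in x)

print(theta_hat, lambda_hat)  # 2, 1.0
```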
| Distribution | Parameters | MLE Formula |
|---|---|---|
| Normal (Gaussian) | \(\mu, \sigma\) | \(\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i\), \(\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{\mu})^2\) (Reference: Casella & Berger, *Statistical Inference*, 2002) |
| Exponential | \(\lambda\) | \(\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} X_i}\) (Reference: Lehmann & Casella, *Theory of Point Estimation*, 1998) |
| Poisson | \(\lambda\) | \(\hat{\lambda} = \frac{1}{n} \sum_{i=1}^{n} X_i\) (Reference: Myung, *Tutorial on Maximum Likelihood Estimation*, 2003) |
| Uniform \((a, b)\) | \(a, b\) | \(\hat{a} = \min(X_1, X_2, \dots, X_n)\), \(\hat{b} = \max(X_1, X_2, \dots, X_n)\) (Reference: Casella & Berger, *Statistical Inference*, 2002) |
| Binomial \((m, p)\), \(m\) known | \(p\) | \(\hat{p} = \frac{1}{nm} \sum_{i=1}^{n} X_i\) (Reference: Lehmann & Casella, *Theory of Point Estimation*, 1998) |
| Gamma \((\alpha, \beta)\) | \(\alpha, \beta\) | Numerical optimization required (Reference: Myung, *Tutorial on Maximum Likelihood Estimation*, 2003) |
| Beta \((\alpha, \beta)\) | \(\alpha, \beta\) | Numerical optimization required (Reference: Lehmann & Casella, *Theory of Point Estimation*, 1998) |
| Negative Binomial \((r, p)\) | \(r, p\) | Numerical optimization required (Reference: Casella & Berger, *Statistical Inference*, 2002) |
| Log-Normal \((\mu, \sigma)\) | \(\mu, \sigma\) | \(\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} \log(X_i)\), \(\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (\log(X_i) - \hat{\mu})^2\) (Reference: Casella & Berger, *Statistical Inference*, 2002) |
| Weibull \((\lambda, k)\) | \(\lambda, k\) | Numerical optimization required (Reference: Myung, *Tutorial on Maximum Likelihood Estimation*, 2003) |
| Chi-Squared \((k)\) | \(k\) | Numerical optimization required; the moment estimator \(\hat{k} = \frac{1}{n} \sum_{i=1}^{n} X_i\) is a common starting value (Reference: Casella & Berger, *Statistical Inference*, 2002) |
| Cauchy \((x_0, \gamma)\) | \(x_0, \gamma\) | Numerical optimization required (Reference: Myung, *Tutorial on Maximum Likelihood Estimation*, 2003) |
| Pareto \((x_m, \alpha)\) | \(x_m, \alpha\) | \(\hat{x}_m = \min(X_1, \dots, X_n)\), \(\hat{\alpha} = \frac{n}{\sum_{i=1}^{n} \log(X_i/\hat{x}_m)}\) (Reference: Casella & Berger, *Statistical Inference*, 2002) |
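For the rows marked "Numerical optimization required", the MLE is found by maximizing the log-likelihood numerically. As a hedged sketch (assuming SciPy and NumPy are available), the Gamma case can be handled either with the built-in `scipy.stats.gamma.fit` or by minimizing the negative log-likelihood directly; the simulated data below is only for illustration.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=3.0, size=500)  # simulated sample

# Option 1: built-in MLE (fix the location at 0 for a two-parameter Gamma)
alpha_hat, _, scale_hat = stats.gamma.fit(data, floc=0)

# Option 2: minimize the negative log-likelihood directly
def neg_log_lik(params):
    a, scale = params
    if a <= 0 or scale <= 0:
        return np.inf
    return -np.sum(stats.gamma.logpdf(data, a, scale=scale))

res = optimize.minimize(neg_log_lik, x0=[1.0, 1.0], method="Nelder-Mead")

print(alpha_hat, scale_hat)   # close to the true shape 2.0 and scale 3.0
print(res.x)                  # should agree with Option 1
```

Fixing `floc=0` keeps the fit to the standard two-parameter Gamma; without it, SciPy would also estimate a location shift.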
You fit two models to a dataset with 150 data points:
Calculate the AIC and BIC for both models. Which model is preferred by AIC? Which model is preferred by BIC?
Assume you have the following data drawn from a uniform distribution \( U(0, \theta) \), where \( \theta \) is unknown. The observed data is:
1. Write down the likelihood function for the unknown parameter \( \theta \).
2. Take the log-likelihood and find the value of \( \theta \) that maximizes it.
3. Solve for \( \theta \) to find the MLE estimate.
Suppose you have the following data points drawn from a normal distribution with unknown mean \( \mu \) and known variance \( \sigma^2 = 1 \):
1. Write down the likelihood function for the unknown parameter \( \mu \).
2. Take the log-likelihood and differentiate it with respect to \( \mu \).
3. Solve for \( \mu \) to find the MLE estimate.
Assume you have the following data from a series of Bernoulli trials, where \( x_i \in \{0,1\} \) represents success (1) or failure (0) in a trial. The probability of success is \( p \), and you want to estimate \( p \) using MLE:
Consider the following data drawn from a geometric distribution where the random variable represents the number of trials until the first success. The probability of success is \( p \), and you want to estimate \( p \) using MLE: