Generalized Linear Models (GLM)
Introduction to GLMs
Recommended Prerequesites
- Probability
- Probability II
- Introduction to Multivariate Distributions
Ordinary Least Squares
Recall Linear regression (OLS) which is the basic statistical technique for predicting a variable as a function of others:
$$y=\beta_0+\beta_1\times x_1+\beta_2\times x_2+\dots+\beta_n\times x_n=X\beta$$
Linear regression is a fundamental tool in statistical modeling, widely used to describe the relationship between a continuous response variable and one or more predictors. However, the assumptions underlying linear regression—such as normally distributed residuals and a constant variance—are often too restrictive when applied to many types of data encountered in practice. For instance, when modeling binary outcomes, count data, or proportions, linear regression may provide inadequate or misleading results.
Basic linear regression is unable to constrain the range which may be desirable when predicting probabilities for example.
Extending to GLMs
Generalized Linear Models (GLMs) extend the linear regression framework to accommodate a broader range of data types. Specifically, GLMs allow the response variable to follow a distribution from the exponential family, such as the binomial, Poisson, or gamma distribution. In addition, they introduce a link function, which establishes a relationship between the linear predictor and the expected value of the response variable, ensuring that the model adheres to the constraints of the distribution (e.g., ensuring that probabilities remain between 0 and 1 for binary data).
OLS as a GLM
Linear regression, in its standard form, is a special case of a GLM where the response variable is assumed to follow a normal distribution, the identity link function is used, and the variance is constant.
In ordinary least squares regression, the response variable \(Y\) is assumed to be mean \(\mu\) and error terms with constant variance \(\sigma^2\):
$$Y=X\beta + \varepsilon,\quad \mathbb{E}[Y|X]=\mu=X\beta, \quad \mathbb{E}[\epsilon^2|X]= \sigma^2$$
To interpret linear regression in the context of a GLM, we can view it through the lens of the exponential family of distributions. As discussed earlier, the normal distribution is a member of the exponential family, with the probability density function:
$$f(y;\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)$$
In the GLM formulation, we aim to model the mean of the response variable, \(\mu=\mathbb{E}[Y|X]\), as a linear function of the predictors.
For normally distributed data, the
identity link function is used:
$$g(\mathbb{E}[Y|X])=\mathbb{E}[Y|X]=X\beta$$
This link function simply implies that the mean of the response variable is directly modeled as a linear function of the predictors. The normal GLM, therefore, takes the form:
$$Y|X\sim \mathcal{N}(X\beta, \sigma^2)$$
This is equivalent to the standard OLS model when the error terms are normally distributed, but now seen within the broader GLM framework. In this case, the dispersion parameter \(\phi\) corresponds to the variance \(\sigma^2\), and the canonical link function is the identity function.
Those terms will be explained shortly.
Exponential Distribution Family
The name "exponential" comes from be ability to write its members' PMF or PDF as an exponential:
$$f(y; \theta, \phi)=\exp\left(\frac{y\theta - b(\theta)}{\phi}+c(y,\phi)\right)$$
where y is the outcome, \(\theta\) is something called the natural parameter, \(\phi\) is the dispersion parameter, and \(b(\theta)\) and \(c(y,\phi)\) are functions that are specific to the member distribution. The natural parameter will tend to be what we interpret y as.
Link Functions
In Generalized Linear Models (GLMs), the relationship between the linear predictor and the expected value of the response variable is mediated through a link function.
Formally, the link function \(g(\cdot)\) establishes the following relationship:
$$g(\mathbb{E}[Y|X])=X\beta$$
Canonical Link Function
Each distribution in the exponential family has a canonical link function, which is derived from the natural parameter of the distribution. The canonical link function often leads to simpler and more efficient parameter estimation, though non-canonical links may be used if they better suit the context of the data.
Normal Distribution
For a normally distributed response variable, the canonical link function is the identity link. That is, the expected value of the response is modeled directly as:
$$g(\mathbb{E}[Y|X])=\mathbb{E}[Y|X]=X\beta$$
Binomial Distribution
For binary or binomially distributed data, the canonical link function is the logit function. This function transforms probabilities, which are constrained to lie between 0 and 1, to the entire real line:
$$g(\mathbb{E}[Y|X])=\log\left(\frac{\mathbb{E}[Y|X]}{n-\mathbb{E}[Y|X]}\right)=X\beta$$
When n=1, this is a Bernoulli regression, more commonly known as
Logistic Regression and the link function is the log odds.
To solve for \(\mathbb{E}[Y|X]\), the logistic function is used:
$$\mathbb{E}[Y|X]=\frac{n}{1+e^{-X\beta}}$$
One thing to note is that n must be a fixed value for this framework. Exponential distributions have a support that is independent of parameter choice, which would be violated if we allowed n to vary.
For count data, typically modeled by a Poisson distribution, the canonical link is the logarithmic function, which ensures that the expected value remains non-negative:
$$g(\mathbb{E}[Y|X])=\log(\mathbb{E}[Y|X])=X\beta$$
The inverse link (mean) function is the exponential, so the expected value of the response is given by:
$$\mathbb{E}[Y|X]=e^{X\beta}$$
Alternatives to Poisson
In practice, Poisson can have limitations when we want to model count data. Common issues are that the variance is forced to be equal to the mean in Poisson as well as not being able to accomodate zero-inflated processes.
Negative Binomial Regression can be used when the variance is greater than the mean. An additional parameter which corresponds to overdispersion is estimated. This is a common enough method that it warrants specific mention. Other Poisson altneratives are zero-inflated models, such as Zero-inflated Poisson.
Gamma Distribution
For positively skewed continuous data, the (negative) inverse link is commonly used:
$$g(\mathbb{E}[Y|X])=-\frac{1}{\mathbb{E}[Y|X]}=X\beta$$
Thus, the mean is given by:
$$\mathbb{E}[Y|X]=-\frac{1}{X\beta}$$
Note that a limitation of this link function is the possibility for calculating a negative mean, which is outside its support.
Another thing to note is that some books will give the canonical link as the positive inverse rather than it having a negative sign. Ultimately, this would just lead to your coefficients all having opposite signs and the mean function would also not have a negative sign, so everything would essentially be the same. The author's choice for the negative convention is based on properties of the expotential family. It is inconsequential in the result so long as you're being consistent, but it's good to be aware of since both conventions are common, and you'll want to make sure you're being consistent, or else you'll get nonsense.
Users of the negative convenetion will often call it the "negative inverse."
Non-Canonical Link Functions
While canonical link functions arise naturally from the form of the exponential family distribution and often lead to simpler estimation procedures, there are many situations where non-canonical link functions may be preferred. These alternative link functions provide additional flexibility in modeling the relationship between the predictors and the response, particularly when the canonical link may not yield a model that is intuitive or interpretable for a given application.
Why Use Non-Canonical Link Functions?
Non-canonical link functions are chosen for several reasons:
- Interpretability: In some contexts, the canonical link may not produce coefficients that are easily interpretable. For instance, in binary regression models, the canonical logit link models the log-odds, but a probit or complementary log-log link may provide more natural interpretations based on the data's structure. The log link function is sometimes used for Gamma GLMs for similar reasons.
- Alternative Distributions of Error: Non-canonical links can arise naturally when the errors in the model deviate from the assumptions of the exponential family distribution or when a different functional form better captures the underlying data relationships.
- Improving Fit: When a canonical link fails to adequately capture the relationship between the response and predictors, a non-canonical link may improve model fit by providing a closer alignment to the underlying data-generating process.
Examples
Outside of mixing and matching the above mentioned ones, here are some others that are sometimes used:
Probit
$$g(\mu)=\Phi^{-1}(\mu)$$
Complementary Log-Log (Cloglog)
$$g(\mu)=\log(-\log(1-\mu))$$
Cauchit
$$g(\mu)=\tan(\pi(\mu-0.5))$$
Square Root
$$g(\mu)=\sqrt{\mu}$$
Using GLMs
Iteratively Reweighted Least Squares (IRLS)
The MLE of the coefficients, \(\beta\), can be obtained through the IRLS algorithm, which iteratively solved weighted least-squares problems:
$$\beta^{(k+1)}=\beta^{(k)}+(X^TWX)^{-1}X^TWz$$
where W is a weight matrix and z is the adjuted dependent variable.
GLM Estimators
The distribution of the MLE approaches a multivariate normal distribution centered at the true parameter values:
$$\sqrt{n}(\hat{\beta}-\beta_0)\overset{d}{\to}N(0,\mathcal{I}^{-1}(\beta_0))$$
Generalized Additive Models
The ideas of additive models and GLMs can be combined into GAMs:
$$g(\mu_i)=\beta_0+\sum_js_j(x_{ij})$$