Regularization
Introduction to Regularization
Recommended Prerequisites
- Probability
- MLE
- Marginal Likelihood
- Bayesian Inference
OLS Regularization
Basic
In frequentist statistics, regularization is a technique used to prevent models from overfitting the data by adding a penalty for model complexity. The penalty can take several forms, such as:
- LASSO (L1 regularization): Adds a penalty proportional to the sum of the absolute values of the model coefficients
- Ridge (L2 regularization): Adds a penalty proportional to the sum of the squared model coefficients
For OLS, the objective function that gets minimized is the Residual Sum of Squares:
$$\text{Objective}=\text{RSS}=\sum_{i=1}^n(y_i-\hat{y}_i)^2$$
The LASSO objective function to be minimized for OLS is given by:
$$\text{Objective}=\text{RSS}+\lambda\sum_{j=1}^{p}|\beta_j|$$
The ridge objective function to be minimized for OLS is given by:
$$\text{Objective}=\text{RSS}+\lambda\sum_{j=1}^{p}\beta_j^2$$
where \(\beta\) is the vector of model coefficients and \(\lambda\) is the regularization parameter controlling the strength of the penalty.
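As a minimal sketch of these three objectives in practice, the following assumes scikit-learn is available and uses illustrative synthetic data; note that scikit-learn's `alpha` plays the role of \(\lambda\), but its exact scaling of the RSS term differs between `Ridge` and `Lasso`, so the values are not directly comparable across estimators.
```python
# Sketch: OLS, ridge, and LASSO on the same synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_beta = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
y = X @ true_beta + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)   # minimizes RSS only
ridge = Ridge(alpha=1.0).fit(X, y)   # RSS + alpha * sum(beta_j^2)
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty; can drive some beta_j to exactly 0

print("OLS:  ", ols.coef_)
print("Ridge:", ridge.coef_)
print("LASSO:", lasso.coef_)
```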
Elastic Net
The two penalties can be combined into Elastic Net regularization, which has both an L1 and an L2 penalty term:
$$\text{Objective}=\text{RSS}+\lambda_1\sum_{j=1}^{p}|\beta_j|+\lambda_2\sum_{j=1}^p\beta_{j}^{2}$$
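A brief sketch using scikit-learn's `ElasticNet` (assumed tooling): rather than separate \(\lambda_1\) and \(\lambda_2\), scikit-learn parameterizes the penalty with an overall strength `alpha` and a mixing weight `l1_ratio`.
```python
# Sketch: elastic net combines L1 and L2 penalties in one estimator.
# l1_ratio=1.0 is pure L1 (LASSO-like); l1_ratio=0.0 is pure L2.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("Elastic Net coefficients:", enet.coef_)
```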
Regularization in Other Forms of Optimization
Regularization terms can be added to other objective functions as well. In general, a regularized loss has the form:
$$\text{Loss}_{\text{regularized}}(\beta)=\text{Loss}(\beta)+\lambda R(\beta)$$
Example: Logistic Regression
For example, the regularized objective for logistic regression is:
$$\min_{\beta}\left\{-\sum_{i=1}^{n}[y_i\log p_i+(1-y_i)\log(1-p_i)]+\lambda\sum_{j=1}^p R(\beta_j)\right\}$$
where \(p_i\) is the predicted probability for observation \(i\) and \(R(\beta_j)\) is the regularization term, e.g. \(|\beta_j|\) for L1 or \(\beta_j^2\) for L2.
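A minimal sketch with scikit-learn's `LogisticRegression` (assumed tooling, illustrative synthetic data); note that scikit-learn uses `C = 1/λ`, so smaller `C` means a stronger penalty.
```python
# Sketch: L1- and L2-regularized logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
logits = X @ np.array([2.0, 0.0, -1.5, 0.0, 1.0])
y = (rng.random(200) < 1 / (1 + np.exp(-logits))).astype(int)

l2_model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
l1_model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)

print("L2 coefficients:", l2_model.coef_.ravel())
print("L1 coefficients:", l1_model.coef_.ravel())  # some may be exactly 0
```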
Bayesian
From a Bayesian point of view, regularization arises naturally through the prior distribution on the parameters. Regularization can be interpreted as placing a prior belief on the parameters, which encodes our preference for simpler models or smaller parameter values.
Two priors that we will go over are the Gaussian prior and the Laplace prior.
Gaussian Prior
In Bayesian inference, placing a Gaussian prior on the parameters of a linear regression model leads to Ridge regression.
If we place a zero-mean normal prior on each parameter \(\beta_j\):
$$P(\beta_j)=\frac{1}{\sqrt{2\pi\tau^2}}\exp\left(-\frac{\beta_j^2}{2\tau^2}\right)$$
This prior expresses the belief that the parameters are likely to be small, centered around zero, with \(\tau^2\) controlling the prior variance.
For linear regression with Gaussian noise of variance \(\sigma^2\), maximizing the posterior is equivalent to minimizing the negative log posterior. Multiplying through by \(2\sigma^2\) gives the Maximum a Posteriori (MAP) objective:
$$\text{MAP Objective}=\text{RSS}+\frac{\sigma^2}{\tau^2}\sum_{j=1}^p\beta_j^2$$
Comparing objective functions, we can see that \(\lambda=\frac{\sigma^2}{\tau^2}\): a tighter prior (smaller \(\tau^2\)) corresponds to stronger regularization.
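A small numerical sketch of this correspondence (numpy only, illustrative \(\sigma\) and \(\tau\) values): the closed-form ridge solution with \(\lambda=\sigma^2/\tau^2\) is the MAP estimate under the zero-mean Gaussian prior.
```python
# Sketch: ridge with lambda = sigma^2 / tau^2 is the MAP estimate under a
# zero-mean N(0, tau^2) prior on each beta_j, with Gaussian noise of
# variance sigma^2. The sigma and tau values below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
sigma, tau = 0.5, 1.0
y = X @ beta_true + rng.normal(scale=sigma, size=n)

lam = sigma**2 / tau**2  # regularization strength implied by the prior
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Minimizing (1/(2*sigma^2)) * RSS + (1/(2*tau^2)) * sum(beta^2) directly
# yields the same minimizer.
print("Ridge / MAP estimate:", beta_ridge)
```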
Laplace Prior
Placing a zero-mean (centered) Laplace prior on the parameters corresponds to LASSO regression. The Laplace prior with rate parameter \(b\) is given by:
$$P(\beta_j)=\frac{b}{2}\exp\left(-b|\beta_j|\right)$$
Following the same derivation as for the Gaussian prior, the MAP objective is:
$$\text{MAP Objective}=\text{RSS}+2\sigma^2 b\sum_{j=1}^p|\beta_j|$$
which is the LASSO objective with \(\lambda=2\sigma^2 b\).
The Laplace distribution has a sharper peak and heavier tails than a Gaussian. This leads to sparse solutions: some parameters are set exactly to zero while others remain relatively large in magnitude. A Gaussian prior, by contrast, strongly penalizes large coefficients but only shrinks them toward zero without setting any of them exactly to zero.
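A small sketch of this difference in shrinkage behaviour (assumed scikit-learn tooling, illustrative penalty strengths and synthetic data):
```python
# Sketch: on the same data, the L1 (Laplace-prior) penalty tends to zero
# out weak coefficients, while the L2 (Gaussian-prior) penalty only
# shrinks them toward zero. Penalty strengths below are illustrative.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 1.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)

print("Ridge coefficients:", np.round(ridge.coef_, 3))  # small but nonzero
print("LASSO coefficients:", np.round(lasso.coef_, 3))  # exact zeros typically appear
print("Exact zeros (ridge, lasso):", np.sum(ridge.coef_ == 0), np.sum(lasso.coef_ == 0))
```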
Regularization Practice Problems