Distribution Textbook (Work in Progress)

by John Della Rosa

Regularization

Introduction to Regularization

Recommended Prerequisites

  1. Probability
  2. MLE
  3. Marginal Likelihood
  4. Bayesian Inference

OLS Regularization

Basic

In frequentist statistics, regularization is a technique used to prevent a model from overfitting the data by adding a penalty for model complexity to the objective function. For OLS, the unpenalized objective is the Residual Sum of Squares: $$\text{Objective}=\text{RSS}=\sum_{i=1}^n(y_i-\hat{y}_i)^2$$ The LASSO objective adds an L1 penalty on the coefficients: $$\text{Objective}=\text{RSS}+\lambda\sum_{j=1}^{p}|\beta_j|$$ The ridge objective adds an L2 penalty: $$\text{Objective}=\text{RSS}+\lambda\sum_{j=1}^{p}\beta_j^2$$ where \(\beta\) is the vector of model coefficients and \(\lambda\geq 0\) is the regularization parameter controlling the strength of the penalty.
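
As a rough illustration (a sketch on synthetic data, assuming scikit-learn is installed; note that scikit-learn's parameterization differs from the formulas above by constant factors, e.g., Lasso scales the RSS term by 1/(2n)), the three objectives can be fit and compared as follows:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data for illustration: 100 observations, 5 predictors,
# two of which have true coefficient zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)    # minimizes RSS only
lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty: tends to set some coefficients to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)    # L2 penalty: shrinks coefficients toward 0

print("OLS:  ", ols.coef_)
print("LASSO:", lasso.coef_)
print("Ridge:", ridge.coef_)
```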

Elastic Net

The two penalties can be combined into Elastic Net regularization, which includes both an L1 and an L2 penalty term: $$\text{Objective}=\text{RSS}+\lambda_1\sum_{j=1}^{p}|\beta_j|+\lambda_2\sum_{j=1}^p\beta_{j}^{2}$$
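
A corresponding sketch with scikit-learn's ElasticNet (again on synthetic data; scikit-learn expresses the two penalties through alpha and l1_ratio rather than separate \(\lambda_1,\lambda_2\)):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

# Penalty used by scikit-learn:
#   alpha * l1_ratio * sum(|beta_j|) + 0.5 * alpha * (1 - l1_ratio) * sum(beta_j**2),
# so alpha and l1_ratio together determine lambda_1 and lambda_2 above.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("Elastic Net:", enet.coef_)
```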

Regularization in Other Forms of Optimization

Regularization is not limited to OLS; a penalty term can be added to any objective function: $$\text{Loss}_{\text{regularized}}(\beta)=\text{Loss}(\beta)+\lambda R(\beta)$$ where \(R(\beta)\) is the penalty, e.g., the L1 or L2 norm of \(\beta\).
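
In code, this recipe amounts to evaluating the loss and then adding \(\lambda R(\beta)\); a minimal sketch (the function names here are illustrative, not from any particular library):

```python
import numpy as np

def regularized_loss(beta, loss, lam, penalty):
    """Generic regularized objective: Loss(beta) + lambda * R(beta)."""
    return loss(beta) + lam * penalty(beta)

# Example usage with a squared-error loss and an L2 penalty on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

rss = lambda b: np.sum((y - X @ b) ** 2)
l2 = lambda b: np.sum(b ** 2)
print(regularized_loss(np.zeros(5), rss, lam=1.0, penalty=l2))
```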

Example: Logistic Regression

For example, the regularized objective for logistic regression is: $$\min_{\beta}\left\{-\sum_{i=1}^{n}[y_i\log p_i+(1-y_i)\log(1-p_i)]+\lambda\sum_{j=1}^p R(\beta_j)\right\}$$ where \(p_i\) is the model's predicted probability that \(y_i=1\) and \(R(\beta_j)\) is the regularization term, e.g., \(|\beta_j|\) or \(\beta_j^2\).
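
A sketch of penalized logistic regression with scikit-learn (synthetic data; scikit-learn expresses the penalty strength through C, which acts roughly like \(1/\lambda\), so smaller C means stronger regularization):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X @ np.array([2.0, 0.0, -1.5, 0.0, 1.0]) + rng.normal(size=200) > 0).astype(int)

# L1-penalized (R = |beta_j|) and L2-penalized (R = beta_j^2) fits.
logit_l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)
logit_l2 = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
print("L1 coefficients:", logit_l1.coef_)   # may contain exact zeros
print("L2 coefficients:", logit_l2.coef_)
```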

Bayesian

From a Bayesian point of view, regularization arises naturally through the prior distribution on the parameters: the prior encodes our preference for simpler models or smaller parameter values. Two priors that we will go over are the Gaussian prior and the Laplace prior.

Gaussian Prior

In Bayesian inference, placing a Gaussian prior on the parameters of a linear regression model leads to ridge regression. Suppose we place a zero-mean normal prior on each parameter \(\beta_j\): $$P(\beta_j)=\frac{1}{\sqrt{2\pi\tau^2}}\exp\left(-\frac{\beta_j^2}{2\tau^2}\right)$$ This prior expresses the belief that the parameters are likely to be small and centered around zero, with \(\tau^2\) controlling the prior variance. For linear regression with Gaussian noise of variance \(\sigma^2\), the negative log posterior is, up to additive constants, $$\frac{1}{2\sigma^2}\text{RSS}+\frac{1}{2\tau^2}\sum_{j=1}^p\beta_j^2$$ so the Maximum a Posteriori (MAP) estimate of \(\beta\) minimizes $$\text{MAP Objective}=\text{RSS}+\frac{\sigma^2}{\tau^2}\sum_{j=1}^p\beta_j^2$$ Comparing with the ridge objective, we see that \(\lambda=\frac{\sigma^2}{\tau^2}\): a tighter prior (smaller \(\tau^2\)) corresponds to stronger regularization.
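
A quick numerical sanity check of this correspondence (a sketch on synthetic data, assuming scikit-learn; the values of \(\sigma^2\) and \(\tau^2\) are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

sigma2, tau2 = 0.25, 1.0          # noise variance and prior variance
lam = sigma2 / tau2               # implied ridge penalty

# MAP / ridge estimate in closed form: (X'X + lambda I)^{-1} X'y
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
beta_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta_map, beta_ridge))   # True (up to solver tolerance)
```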

Laplace Prior

Placing a (centered) Laplace prior on the parameters corresponds to LASSO regression. The Laplace prior with rate parameter \(b\) is given by: $$P(\beta_j)=\frac{b}{2}\exp\left(-b|\beta_j|\right)$$ Following the same argument as above, the MAP estimate minimizes $$\text{MAP Objective}=\text{RSS}+\lambda\sum_{j=1}^p|\beta_j|,\quad \lambda=2\sigma^2 b$$ The Laplace distribution has a sharper peak at zero and heavier tails than a Gaussian. As a result, the Laplace prior tends to set some parameters to exactly zero while leaving large coefficients relatively untouched, whereas the Gaussian prior strongly penalizes large coefficients but never shrinks them all the way to zero.
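
To make the "sharper peak, heavier tails" comparison concrete, the sketch below evaluates both densities at a few points with the variances matched (a Laplace distribution with rate \(b\) has variance \(2/b^2\), so \(b=\sqrt{2}/\tau\) is an illustrative matching, not something fixed by the text):

```python
import numpy as np

tau = 1.0                      # Gaussian prior standard deviation
b = np.sqrt(2) / tau           # Laplace rate with the same variance
beta = np.array([0.0, 1.0, 3.0])

gaussian = np.exp(-beta**2 / (2 * tau**2)) / np.sqrt(2 * np.pi * tau**2)
laplace = (b / 2) * np.exp(-b * np.abs(beta))

# The Laplace prior puts more mass both at zero (sharper peak)
# and far from zero (heavier tails) than the variance-matched Gaussian.
print("Gaussian:", gaussian)   # approx [0.399, 0.242, 0.004]
print("Laplace: ", laplace)    # approx [0.707, 0.172, 0.010]
```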

Regularization Practice Problems