Distribution Textbook (Work in Progress)

by John Della Rosa

Regularization

Introduction to Regularization

Recommended Prerequisites

  1. Probability
  2. MLE
  3. Marginal Likelihood
  4. Bayesian Inference

OLS Regularization

Basic

In frequentist statistics, regularization is a technique used to prevent a model from overfitting the data by adding a penalty for model complexity to the objective function. For OLS, the unpenalized objective is the Residual Sum of Squares: $$\text{Objective}=\text{RSS}=\sum_{i=1}^n(y_i-\hat{y}_i)^2$$ The LASSO objective adds an L1 penalty on the coefficients: $$\text{Objective}=\text{RSS}+\lambda\sum_{j=1}^{p}|\beta_j|$$ The ridge objective adds an L2 penalty: $$\text{Objective}=\text{RSS}+\lambda\sum_{j=1}^{p}\beta_j^2$$ where \(\beta\) is the vector of model coefficients and \(\lambda\geq 0\) is the regularization parameter controlling the strength of the penalty.
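
As a rough illustration (a sketch on synthetic data, assuming scikit-learn is installed; note that scikit-learn's parameterization differs from the formulas above by constant factors, e.g., Lasso scales the RSS term by 1/(2n)), the three objectives can be fit and compared as follows:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data for illustration: 100 observations, 5 predictors,
# two of which have true coefficient zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)    # minimizes RSS only
lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty: tends to set some coefficients to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)    # L2 penalty: shrinks coefficients toward 0

print("OLS:  ", ols.coef_)
print("LASSO:", lasso.coef_)
print("Ridge:", ridge.coef_)
```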

Elastic Net

The two penalties can be combined into Elastic Net regularization, which includes both an L1 and an L2 penalty term: $$\text{Objective}=\text{RSS}+\lambda_1\sum_{j=1}^{p}|\beta_j|+\lambda_2\sum_{j=1}^p\beta_{j}^{2}$$
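
A corresponding sketch with scikit-learn's ElasticNet (again on synthetic data; scikit-learn expresses the two penalties through alpha and l1_ratio rather than separate \(\lambda_1,\lambda_2\)):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

# Penalty used by scikit-learn:
#   alpha * l1_ratio * sum(|beta_j|) + 0.5 * alpha * (1 - l1_ratio) * sum(beta_j**2),
# so alpha and l1_ratio together determine lambda_1 and lambda_2 above.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("Elastic Net:", enet.coef_)
```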

Regularization in Other Forms of Optimization

Regularization is not limited to OLS; a penalty term can be added to any objective function: $$\text{Loss}_{\text{regularized}}(\beta)=\text{Loss}(\beta)+\lambda R(\beta)$$ where \(R(\beta)\) is the penalty, e.g., the L1 or L2 norm of \(\beta\).
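
In code, this recipe amounts to evaluating the loss and then adding \(\lambda R(\beta)\); a minimal sketch (the function names here are illustrative, not from any particular library):

```python
import numpy as np

def regularized_loss(beta, loss, lam, penalty):
    """Generic regularized objective: Loss(beta) + lambda * R(beta)."""
    return loss(beta) + lam * penalty(beta)

# Example usage with a squared-error loss and an L2 penalty on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

rss = lambda b: np.sum((y - X @ b) ** 2)
l2 = lambda b: np.sum(b ** 2)
print(regularized_loss(np.zeros(5), rss, lam=1.0, penalty=l2))
```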

Example: Logistic Regression

For example, the regularized objective for logistic regression is: $$\min_{\beta}\left\{-\sum_{i=1}^{n}[y_i\log p_i+(1-y_i)\log(1-p_i)]+\lambda\sum_{j=1}^p R(\beta_j)\right\}$$ where \(p_i\) is the model's predicted probability that \(y_i=1\) and \(R(\beta_j)\) is the regularization term, e.g., \(|\beta_j|\) or \(\beta_j^2\).
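
A sketch of penalized logistic regression with scikit-learn (synthetic data; scikit-learn expresses the penalty strength through C, which acts roughly like \(1/\lambda\), so smaller C means stronger regularization):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X @ np.array([2.0, 0.0, -1.5, 0.0, 1.0]) + rng.normal(size=200) > 0).astype(int)

# L1-penalized (R = |beta_j|) and L2-penalized (R = beta_j^2) fits.
logit_l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)
logit_l2 = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
print("L1 coefficients:", logit_l1.coef_)   # may contain exact zeros
print("L2 coefficients:", logit_l2.coef_)
```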

Bayesian

From a Bayesian point of view, regularization arises naturally through the prior distribution on the parameters: the prior encodes our preference for simpler models or smaller parameter values. Two priors that we will go over are the Gaussian prior and the Laplace prior.

Gaussian Prior

In Bayesian inference, placing a Gaussian prior on the parameters of a linear regression model leads to ridge regression. Suppose we place a zero-mean normal prior on each parameter \(\beta_j\): $$P(\beta_j)=\frac{1}{\sqrt{2\pi\tau^2}}\exp\left(-\frac{\beta_j^2}{2\tau^2}\right)$$ This prior expresses the belief that the parameters are likely to be small and centered around zero, with \(\tau^2\) controlling the prior variance. For linear regression with Gaussian noise of variance \(\sigma^2\), the negative log posterior is, up to additive constants, $$\frac{1}{2\sigma^2}\text{RSS}+\frac{1}{2\tau^2}\sum_{j=1}^p\beta_j^2$$ so the Maximum a Posteriori (MAP) estimate of \(\beta\) minimizes $$\text{MAP Objective}=\text{RSS}+\frac{\sigma^2}{\tau^2}\sum_{j=1}^p\beta_j^2$$ Comparing with the ridge objective, we see that \(\lambda=\frac{\sigma^2}{\tau^2}\): a tighter prior (smaller \(\tau^2\)) corresponds to stronger regularization.
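
A quick numerical sanity check of this correspondence (a sketch on synthetic data, assuming scikit-learn; the values of \(\sigma^2\) and \(\tau^2\) are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

sigma2, tau2 = 0.25, 1.0          # noise variance and prior variance
lam = sigma2 / tau2               # implied ridge penalty

# MAP / ridge estimate in closed form: (X'X + lambda I)^{-1} X'y
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
beta_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta_map, beta_ridge))   # True (up to solver tolerance)
```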

Laplace Prior

Placing a (centered) Laplace prior on the parameters corresponds to LASSO regression. The Laplace prior with rate parameter \(b\) is given by: $$P(\beta_j)=\frac{b}{2}\exp\left(-b|\beta_j|\right)$$ Following the same argument as above, the MAP estimate minimizes $$\text{MAP Objective}=\text{RSS}+\lambda\sum_{j=1}^p|\beta_j|,\quad \lambda=2\sigma^2 b$$ The Laplace distribution has a sharper peak at zero and heavier tails than a Gaussian. As a result, the Laplace prior tends to set some parameters to exactly zero while leaving large coefficients relatively untouched, whereas the Gaussian prior strongly penalizes large coefficients but never shrinks them all the way to zero.
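
To make the "sharper peak, heavier tails" comparison concrete, the sketch below evaluates both densities at a few points with the variances matched (a Laplace distribution with rate \(b\) has variance \(2/b^2\), so \(b=\sqrt{2}/\tau\) is an illustrative matching, not something fixed by the text):

```python
import numpy as np

tau = 1.0                      # Gaussian prior standard deviation
b = np.sqrt(2) / tau           # Laplace rate with the same variance
beta = np.array([0.0, 1.0, 3.0])

gaussian = np.exp(-beta**2 / (2 * tau**2)) / np.sqrt(2 * np.pi * tau**2)
laplace = (b / 2) * np.exp(-b * np.abs(beta))

# The Laplace prior puts more mass both at zero (sharper peak)
# and far from zero (heavier tails) than the variance-matched Gaussian.
print("Gaussian:", gaussian)   # approx [0.399, 0.242, 0.004]
print("Laplace: ", laplace)    # approx [0.707, 0.172, 0.010]
```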

Regularization Practice Problems