Distribution Textbook (Work in Progress)

by John Della Rosa

Modeling Zero-Inflated Data

Zero-Inflated Data

Count data often contain far more zeros than a standard count distribution predicts. Traditional count models, such as Poisson or negative binomial regression, may fail to handle this excess of zeros adequately, resulting in poor model fit and biased parameter estimates. Two popular extensions of count models address this issue: zero-inflated models and hurdle models. Both explicitly account for the overabundance of zeros and are particularly useful when zeros arise from a different process than the positive counts.

Recommended Prerequisites

  1. Probability
  2. Probability II
  3. Mixture Distributions
  4. Compound Distributions
  5. GLMs

Zero-Inflated Model

Explanation

Zero-inflated models are designed to handle count data that exhibits more zeros than would be expected under a standard Poisson or negative binomial distribution. In a zero-inflated model, the data is assumed to be generated by a two-part process:
  1. One process generates structural zeros (i.e., zeros that cannot be attributed to the counting process).
  2. The other process generates both zeros and positive counts from a count distribution (Poisson or negative binomial).
The key idea is that there is a latent binary process determining whether an observation is a structural zero or comes from the count distribution.
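The two-part process above can be simulated directly. The following is a minimal sketch, assuming a Poisson count process; the function names, the parameter values (`p_structural = 0.3`, `lam = 2.0`), and the use of Knuth's multiplication method for Poisson sampling are illustrative choices rather than part of the text.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_zero_inflated(p_structural, lam, rng):
    """Latent binary step first, then the count process."""
    if rng.random() < p_structural:
        return 0                      # structural zero
    return sample_poisson(lam, rng)   # may still be 0 (a zero from the count process)

p_structural, lam = 0.3, 2.0          # assumed values for illustration
rng = random.Random(0)
draws = [sample_zero_inflated(p_structural, lam, rng) for _ in range(100_000)]

# Empirical share of zeros versus the share implied by the two-part process.
zero_share = sum(d == 0 for d in draws) / len(draws)
expected_zero_share = p_structural + (1 - p_structural) * math.exp(-lam)
print(zero_share, expected_zero_share)
```

The empirical share of zeros should land close to the structural probability plus the count distribution's own chance of producing a zero, which is why the two kinds of zeros cannot be told apart from the observed value alone.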

Zero-Inflated Poisson (ZIP)

The Zero-Inflated Poisson (ZIP) model is an extension of the Poisson regression model that allows for excess zeros. The ZIP model assumes that, with probability \(\pi\), the response variable takes a zero value (structural zero), and with probability \(1-\pi\), the response follows a Poisson distribution.

The PMF for the ZIP model is given by: $$P_{\text{ZIP}}(k)= \begin{cases} (1-\pi)\frac{\lambda^ke^{-\lambda}}{k!}, & k\geq 1\\ \pi+(1-\pi)e^{-\lambda}, & k=0 \end{cases}$$
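As a quick numerical check of this PMF, here is a minimal sketch using only the Python standard library. The function name `zip_pmf` and the parameter values are assumptions made for illustration.

```python
import math

def zip_pmf(k, pi, lam):
    """PMF of the zero-inflated Poisson as written above."""
    poisson = math.exp(-lam) * lam**k / math.factorial(k)
    if k == 0:
        return pi + (1 - pi) * poisson   # structural zero or Poisson zero
    return (1 - pi) * poisson            # positive counts come only from the Poisson part

pi, lam = 0.2, 3.0                       # assumed values for illustration
probs = [zip_pmf(k, pi, lam) for k in range(60)]  # tail beyond 59 is negligible for lam = 3

total = sum(probs)                                  # should be 1 up to rounding
mean = sum(k * p for k, p in enumerate(probs))      # should match (1 - pi) * lam
print(total, mean, zip_pmf(0, pi, lam), math.exp(-lam))
```

The last two printed values compare the ZIP probability of zero with the plain Poisson probability of zero for the same \(\lambda\); the former is larger whenever \(\pi>0\).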

Interpretation and Tying it Back

This is something we have already seen in the mixture distribution chapter. The ZIP is a mixture distribution: with weight \(\pi\) it follows the degenerate distribution \(P(Y=0)=1\), and with weight \(1-\pi\) it follows a standard Poisson distribution. Because both components can produce a zero, their contributions add, which gives the above formula for \(P(Y=0)\).
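The mixture view also gives the first two moments with little effort. As a short worked derivation (treating \(\pi\) and \(\lambda\) as fixed, with no covariates), the point mass at zero contributes nothing to either moment, so only the Poisson component matters: $$\begin{aligned} E[Y] &= \pi\cdot 0+(1-\pi)\lambda=(1-\pi)\lambda,\\ E[Y^2] &= (1-\pi)(\lambda+\lambda^2),\\ \operatorname{Var}(Y) &= E[Y^2]-E[Y]^2=(1-\pi)\lambda(1+\pi\lambda). \end{aligned}$$

The ratio \(\operatorname{Var}(Y)/E[Y]=1+\pi\lambda\) exceeds 1 whenever \(\pi>0\), so zero inflation by itself makes the ZIP overdispersed relative to a plain Poisson.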

Zero-Inflated Negative Binomial Model

Similar to what we saw with GLMs, the Poisson is sometimes inadequate when there is overdispersion, which leads to choosing negative binomial regression over Poisson regression. The same logic applies to zero-inflated models. The PMF of the Zero-Inflated Negative Binomial (ZINB) model is: $$P_{\text{ZINB}}(k)= \begin{cases} (1-\pi)\frac{\Gamma(k+r)}{k!\,\Gamma(r)}\left(\frac{r}{r+m}\right)^r\left(\frac{m}{r+m}\right)^k, & k\geq 1\\ \pi+(1-\pi)\left(\frac{r}{r+m}\right)^r, & k=0 \end{cases}$$ where \(m\) is the mean of the negative binomial distribution and \(r\) is the dispersion parameter.
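The ZINB PMF in the mean and dispersion parameterization can be evaluated as in the sketch below. The function names and parameter values are assumptions for illustration, and the gamma-function ratio is computed on the log scale with `math.lgamma` to avoid overflow for large counts.

```python
import math

def nb_pmf(k, r, m):
    """Negative binomial PMF with mean m and dispersion r."""
    log_coef = math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
    log_prob = log_coef + r * math.log(r / (r + m)) + k * math.log(m / (r + m))
    return math.exp(log_prob)

def zinb_pmf(k, pi, r, m):
    """Zero-inflated negative binomial: point mass at zero mixed with nb_pmf."""
    base = (1 - pi) * nb_pmf(k, r, m)
    return pi + base if k == 0 else base

pi, r, m = 0.2, 1.5, 4.0                 # assumed values for illustration
probs = [zinb_pmf(k, pi, r, m) for k in range(500)]  # tail beyond 499 is negligible here

total = sum(probs)                                  # should be 1 up to rounding
mean = sum(k * p for k, p in enumerate(probs))      # should match (1 - pi) * m
print(total, mean)
```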

Hurdle Models

While zero-inflated models assume that zeros come from two different processes (structural zeros and sampling zeros), hurdle models attribute every zero to a single binary process. Hurdle models also involve two stages, but they assume that once the hurdle (zero vs. positive counts) is crossed, the positive counts follow a truncated count distribution (typically Poisson or negative binomial) that cannot produce a zero.

A hurdle model can be described in two parts:
  1. A binary process models whether the response is zero or positive. This part of the model typically uses a binomial or logistic regression.
  2. A truncated count distribution (e.g., truncated Poisson or truncated negative binomial) models the positive counts. The distribution is truncated at zero, meaning that it only generates positive counts.
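The two parts can be sketched as a sampler. This is a minimal illustration assuming a Poisson count distribution; the names `p_zero` and `lam`, their values, and the rejection step used to obtain the zero-truncated Poisson are illustrative choices, not a prescribed implementation.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_hurdle_poisson(p_zero, lam, rng):
    """Part 1: binary zero/positive decision. Part 2: zero-truncated Poisson."""
    if rng.random() < p_zero:
        return 0                      # the only way to obtain a zero
    while True:                       # rejection sampling: discard zeros
        k = sample_poisson(lam, rng)
        if k > 0:
            return k

p_zero, lam = 0.4, 2.0                # assumed values for illustration
rng = random.Random(1)
draws = [sample_hurdle_poisson(p_zero, lam, rng) for _ in range(100_000)]

zero_share = sum(d == 0 for d in draws) / len(draws)
positives = [d for d in draws if d > 0]
positive_mean = sum(positives) / len(positives)
print(zero_share, positive_mean, lam / (1 - math.exp(-lam)))
```

The share of zeros should be close to `p_zero` regardless of `lam`, and the mean of the positive draws should be close to the zero-truncated Poisson mean \(\lambda/(1-e^{-\lambda})\).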

Hurdle Poisson Model

$$P_{\text{HP}}(k)= \begin{cases} (1-\pi)\frac{\lambda^ke^{-\lambda}}{(1-e^{-\lambda})k!}, & k\geq 1\\ \pi, & k=0 \end{cases}$$
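A minimal sketch of this PMF in Python follows; the function name `hurdle_poisson_pmf` and the parameter values are assumptions for illustration. The factor \(1-e^{-\lambda}\) in the denominator renormalizes the Poisson over \(k\geq 1\).

```python
import math

def hurdle_poisson_pmf(k, pi, lam):
    """Hurdle Poisson PMF as written above: pi at zero, rescaled Poisson elsewhere."""
    if k == 0:
        return pi
    truncated = math.exp(-lam) * lam**k / (math.factorial(k) * (1 - math.exp(-lam)))
    return (1 - pi) * truncated

pi, lam = 0.2, 3.0                       # assumed values for illustration
probs = [hurdle_poisson_pmf(k, pi, lam) for k in range(60)]  # tail beyond 59 is negligible

total = sum(probs)                                  # should be 1 up to rounding
mean = sum(k * p for k, p in enumerate(probs))      # (1 - pi) * lam / (1 - exp(-lam))
print(total, mean)
```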

Summary

Comparison

Note that under the hurdle model the probability of observing a 0 is simply \(\pi\). The hurdle model directly reweights the original PMF: it sets the probability of zero to \(\pi\) and rescales the PMF over the nonzero support so that it sums to \(1-\pi\). In contrast, the zero-inflated model is built as a mixture distribution and distinguishes between different kinds of zeros.
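The relationship between the two PMFs can be made explicit with a short worked check. Write \(\pi_H\) for the hurdle model's zero probability (a symbol introduced here only to distinguish it from the ZIP mixing weight \(\pi\)) and hold \(\lambda\) fixed with no covariates. Choosing $$\pi_H=\pi+(1-\pi)e^{-\lambda}\quad\Longrightarrow\quad 1-\pi_H=(1-\pi)(1-e^{-\lambda}),$$ the hurdle Poisson probability for \(k\geq 1\) becomes $$(1-\pi_H)\frac{\lambda^ke^{-\lambda}}{(1-e^{-\lambda})k!}=(1-\pi)\frac{\lambda^ke^{-\lambda}}{k!},$$ which is exactly the ZIP probability. For fixed parameters, then, every ZIP distribution is also a hurdle Poisson distribution. The converse does not hold: a hurdle model with \(\pi_H<e^{-\lambda}\) has fewer zeros than the plain Poisson, which no ZIP with \(\pi\in[0,1]\) can reproduce.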

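The choice between zero-inflated models and hurdle models depends on the underlying data and context. In general, a zero-inflated model suits data in which zeros can arise both structurally and from the counting process itself, whereas a hurdle model suits data in which every zero comes from a single binary process and the count distribution describes only the positive values.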

Zero-Inflated Model Visualizer

[Interactive plot: probability mass function (PMF) of the selected zero-inflated distribution, with an adjustable zero-inflation parameter (shown at 0.2).]

Zero-Inflated Data Modeling Exercises