Modeling Zero-Inflated Data
Zero-Inflated Data
Count data often contain more zeros than a standard count distribution predicts. Traditional count models, such as Poisson or negative binomial regression, may fail to adequately handle this excess of zeros, resulting in poor model fit and biased parameter estimates. Two popular extensions of count models address this issue: zero-inflated models and hurdle models. These models explicitly account for the overabundance of zeros and are particularly useful when zeros arise from a different process than the positive counts.
Recommended Prerequisites
- Probability
- Probability II
- Mixture Distributions
- Compound Distributions
Zero-Inflated Model
Explanation
Zero-inflated models are designed to handle count data that exhibits more zeros than would be expected under a standard Poisson or negative binomial distribution. In a zero-inflated model, the data is assumed to be generated by a two-part process:
- One process generates structural zeros (i.e., zeros that cannot be attributed to the counting process).
- The other process generates both zeros and positive counts from a count distribution (Poisson or negative binomial).
The key idea is that there is a latent binary process determining whether an observation is a structural zero or comes from the count distribution.
Zero-Inflated Poisson (ZIP)
The Zero-Inflated Poisson (ZIP) model is an extension of the Poisson regression model that allows for excess zeros. The ZIP model assumes that, with probability \(\pi\), the response variable takes a zero value (structural zero), and with probability \(1-\pi\), the response follows a Poisson distribution.
The PMF for the ZIP model is given by:
$$P_{\text{ZIP}}(k)= \begin{cases}
(1-\pi)\frac{\lambda^ke^{-\lambda}}{k!}, & k\geq 1\\
\pi+(1-\pi)e^{-\lambda}, & k=0
\end{cases}$$
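As a minimal sketch, the PMF above can be evaluated directly with the Python standard library. The function name `zip_pmf` and the parameter values are illustrative choices, not part of the original text.

```python
import math

def zip_pmf(k: int, pi: float, lam: float) -> float:
    """P(Y = k) for a zero-inflated Poisson with inflation pi and rate lam."""
    if k < 0:
        return 0.0
    poisson = math.exp(-lam) * lam**k / math.factorial(k)
    if k == 0:
        # Structural zeros plus zeros produced by the Poisson component.
        return pi + (1 - pi) * poisson
    return (1 - pi) * poisson

# Illustrative parameters (assumed): pi = 0.3, lambda = 2.0.
# The probabilities should sum to 1 up to a negligible truncated tail.
probs = [zip_pmf(k, pi=0.3, lam=2.0) for k in range(60)]
total = sum(probs)
```

With any \(\pi > 0\), the value at \(k=0\) is larger than the plain Poisson value \(e^{-\lambda}\), which is the zero inflation the model is named for.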
Interpretation and Tying it Back
This is something we have already seen in the mixture distribution chapter. The ZIP is a mixture distribution: with weight \(\pi\) it follows the degenerate distribution \(P(Y=0)=1\), and with weight \(1-\pi\) it follows a standard Poisson distribution.
Because both component distributions can produce a zero, \(P(Y=0)\) is the sum of two terms, as in the formula above.
Moments
Using the moment equations for Mixture Distributions, we get the following:
Let Y be a ZIP with Poisson rate \(\lambda\) and zero inflation parameter \(\pi\):
$$\mathbb{E}[Y]=(1-\pi)\lambda$$
$$\text{Var}(Y)=(1-\pi)\lambda(1+\pi\lambda)$$
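As a quick check of the variance formula, the second moment of a mixture is the weighted sum of the component second moments. The point mass at zero contributes nothing, and a Poisson variable \(X\) has \(\mathbb{E}[X^2]=\lambda+\lambda^2\), so:
$$\mathbb{E}[Y^2]=\pi\cdot 0+(1-\pi)(\lambda+\lambda^2)$$
$$\text{Var}(Y)=\mathbb{E}[Y^2]-\mathbb{E}[Y]^2=(1-\pi)(\lambda+\lambda^2)-(1-\pi)^2\lambda^2=(1-\pi)\lambda\left(1+\lambda-(1-\pi)\lambda\right)=(1-\pi)\lambda(1+\pi\lambda)$$
The ratio \(\text{Var}(Y)/\mathbb{E}[Y]=1+\pi\lambda\) exceeds 1 whenever \(\pi>0\), so a ZIP variable is overdispersed relative to a plain Poisson.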
Zero-Inflated Negative Binomial Model
Similar to what we saw with GLMs, sometimes Poisson doesn't cut it when there's overdispersion, which leads to the choice of Negative Binomial regression over Poisson regression. The same logic applies to zero-inflated models, where the Poisson component is replaced by a negative binomial. The PMF for the ZINB model is given by:
$$P_{\text{ZINB}}(k)= \begin{cases}
(1-\pi)\frac{\Gamma(k+r)}{k!\Gamma(r)}\left(\frac{r}{r+m}\right)^r\left(\frac{m}{r+m}\right)^k, & k\geq 1\\
\pi+(1-\pi)\left(\frac{r}{r+m}\right)^r, & k=0
\end{cases}$$
where m is the mean of the negative binomial distribution and r is the dispersion parameter.
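The ZINB PMF can be sketched in the same way. The version below assumes the mean/dispersion parameterization used in the formula (mean m, dispersion r) and works on the log scale with `math.lgamma` so the gamma functions do not overflow. The function name `zinb_pmf` and the parameter values are illustrative.

```python
import math

def zinb_pmf(k: int, pi: float, m: float, r: float) -> float:
    """P(Y = k) for a zero-inflated negative binomial (mean m, dispersion r)."""
    if k < 0:
        return 0.0
    # log of Gamma(k+r) / (k! Gamma(r)) * (r/(r+m))^r * (m/(r+m))^k
    log_nb = (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(r / (r + m))
        + k * math.log(m / (r + m))
    )
    nb = math.exp(log_nb)
    if k == 0:
        return pi + (1 - pi) * nb
    return (1 - pi) * nb

# Illustrative parameters (assumed): pi = 0.2, m = 3.0, r = 1.5.
total = sum(zinb_pmf(k, 0.2, 3.0, 1.5) for k in range(500))
mean = sum(k * zinb_pmf(k, 0.2, 3.0, 1.5) for k in range(500))
```

Here `total` should be close to 1 and `mean` close to \((1-\pi)m\), mirroring the ZIP mean \((1-\pi)\lambda\).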
Sampling from the ZIP Model
Given the two-part generating process described above, the algorithm to sample from the ZIP is fairly intuitive.
Step 1: Binary Zero Inflation Check
Generate a Bernoulli random variable Z with success probability \(\pi\). If Z=1, the sampled value is 0 (i.e., a structural zero).
If Z=0, proceed to Step 2.
Step 2: Sample from the Standard Poisson
Draw a sample from a Poisson distribution with rate \(\lambda\). This draw can itself be 0, which gives a sampling zero.
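The two steps can be sketched with the standard library alone. Python's `random` module has no Poisson sampler, so the helper `sample_poisson` below uses Knuth's multiplication method. That choice is an assumption made to keep the example self-contained, and it is only practical for moderate \(\lambda\). All names, the seed, and the parameter values are illustrative.

```python
import math
import random

def sample_poisson(lam: float, rng: random.Random) -> int:
    """Knuth's multiplication method for a Poisson(lam) draw (moderate lam only)."""
    threshold = math.exp(-lam)
    k = 0
    prod = rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_zip(pi: float, lam: float, rng: random.Random) -> int:
    # Step 1: Bernoulli check; with probability pi emit a structural zero.
    if rng.random() < pi:
        return 0
    # Step 2: otherwise draw from the standard Poisson (which may also give 0).
    return sample_poisson(lam, rng)

rng = random.Random(0)  # fixed seed so the run is reproducible
draws = [sample_zip(0.3, 2.0, rng) for _ in range(100_000)]
mean = sum(draws) / len(draws)
zero_frac = draws.count(0) / len(draws)
```

For a large sample, `mean` should land near \((1-\pi)\lambda\) and `zero_frac` near \(\pi+(1-\pi)e^{-\lambda}\).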
Fitting Parameters
Likelihood Function
$$L(\pi,\lambda)=\prod_{i=1}^n\left[\pi I_{y_i=0}+(1-\pi)\frac{\lambda^{y_i}e^{-\lambda}}{y_i!}\right]$$
where \(I_{y_i=0}\) is an indicator function that equals 1 when \(y_i=0\) and 0 otherwise.
The maximum likelihood estimates have no closed form, so the parameters are usually found iteratively, most often with Expectation-Maximization or Newton-Raphson.
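In practice the likelihood is evaluated on the log scale, which turns the product into a sum and avoids underflow. A minimal sketch, assuming \(0\le\pi<1\) and \(\lambda>0\) (the function name `zip_log_likelihood` is illustrative):

```python
import math

def zip_log_likelihood(ys, pi: float, lam: float) -> float:
    """Log of L(pi, lam) for observed counts ys under the ZIP model."""
    total = 0.0
    for y in ys:
        if y == 0:
            # Indicator is 1: structural zero or Poisson zero.
            total += math.log(pi + (1 - pi) * math.exp(-lam))
        else:
            # Indicator is 0: only the Poisson component contributes.
            log_poisson = -lam + y * math.log(lam) - math.lgamma(y + 1)
            total += math.log(1 - pi) + log_poisson
    return total
```

A function like this can be handed to a general-purpose optimizer, or used to monitor whether an iterative fit increases the likelihood at each step.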
Expectation Maximization for ZIP
E-Step
Calculate the expected value of the latent variable \(Z_i\) (the indicator that observation \(i\) is a structural zero) given the current parameter estimates:
$$\gamma_i=\mathbb{E}[Z_i|y_i]=\begin{cases}\frac{\pi^{(t)}}{\pi^{(t)}+(1-\pi^{(t)})e^{-\lambda^{(t)}}}, & \text{ if }y_i=0\\
0, &\text{ if }y_i\gt 0\end{cases}$$
M-Step
Update the parameter estimates:
$$\pi^{(t+1)}=\frac{\sum_{i=1}^n\gamma_i}{n}$$
$$\lambda^{(t+1)}=\frac{\sum_{i=1}^n(1-\gamma_i)y_i}{\sum_{i=1}^n(1-\gamma_i)}$$
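The E-step and M-step can be combined into a short loop. The sketch below assumes an intercept-only ZIP (no covariates), a starting \(\pi\) strictly between 0 and 1, and data with at least one positive count. The function name, starting values, stopping rule, and example data are all illustrative choices.

```python
import math

def fit_zip_em(ys, pi: float = 0.5, n_iter: int = 1000, tol: float = 1e-12):
    """EM estimates (pi, lam) for an intercept-only ZIP model."""
    n = len(ys)
    n_zero = sum(1 for y in ys if y == 0)
    total = sum(ys)
    if total == 0:
        raise ValueError("need at least one positive count to estimate lam")
    lam = total / n  # start from the sample mean
    for _ in range(n_iter):
        # E-step: gamma_i is the same for every zero and 0 for positive counts.
        gamma_zero = pi / (pi + (1 - pi) * math.exp(-lam))
        sum_gamma = n_zero * gamma_zero
        # M-step: zeros add nothing to sum((1 - gamma_i) * y_i), so the
        # numerator of the lam update is simply the total count.
        new_pi = sum_gamma / n
        new_lam = total / (n - sum_gamma)
        converged = abs(new_pi - pi) < tol and abs(new_lam - lam) < tol
        pi, lam = new_pi, new_lam
        if converged:
            break
    return pi, lam

# Made-up example data: 50 zeros and 50 positive counts.
ys = [0] * 50 + [1] * 15 + [2] * 20 + [3] * 10 + [4] * 5
pi_hat, lam_hat = fit_zip_em(ys)
```

At convergence the fitted model should reproduce the observed share of zeros and the sample mean, since a fixed point of the updates satisfies \(\pi+(1-\pi)e^{-\lambda}=n_0/n\) and \((1-\pi)\lambda=\bar{y}\) when \(\pi>0\).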
Hurdle Models
While zero-inflated models assume that zeros come from two different processes (structural zeros and sampling zeros), hurdle models take a different approach. Hurdle models also involve two stages, but they assume that once the hurdle (zero vs. positive counts) is crossed, the process generating the positive counts follows a truncated count distribution (typically Poisson or negative binomial).
A hurdle model can be described in two parts:
- A binary process models whether the response is zero or positive. This part of the model typically uses a binomial or logistic regression.
- A truncated count distribution (e.g., truncated Poisson or truncated negative binomial) models the positive counts. The distribution is truncated at zero, meaning that it only generates positive counts.
Hurdle Poisson Model
$$P_{\text{HP}}(k)= \begin{cases}
(1-\pi)\frac{\lambda^ke^{-\lambda}}{(1-e^{-\lambda})k!}, & k\geq 1\\
\pi, & k=0
\end{cases}$$
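A minimal sketch of the hurdle Poisson PMF, for comparison with the ZIP version. Here \(\pi\) is the probability of a zero and the positive counts follow a zero-truncated Poisson. The function name and the parameter values are illustrative.

```python
import math

def hurdle_poisson_pmf(k: int, pi: float, lam: float) -> float:
    """P(Y = k) for a hurdle Poisson with zero probability pi and rate lam."""
    if k < 0:
        return 0.0
    if k == 0:
        # All zeros come from the binary (hurdle) part.
        return pi
    poisson = math.exp(-lam) * lam**k / math.factorial(k)
    # Divide by P(X > 0) = 1 - exp(-lam) to truncate the Poisson at zero;
    # -expm1(-lam) computes that quantity accurately for small lam.
    return (1 - pi) * poisson / (-math.expm1(-lam))

# Illustrative parameters (assumed): pi = 0.3, lambda = 2.0.
total = sum(hurdle_poisson_pmf(k, 0.3, 2.0) for k in range(60))
mean = sum(k * hurdle_poisson_pmf(k, 0.3, 2.0) for k in range(60))
```

The positive part sums to \(1-\pi\), so `total` should be close to 1, and `mean` should be close to \((1-\pi)\lambda/(1-e^{-\lambda})\).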
Summary
Comparison
The key difference is that in the hurdle model the probability of getting a 0 is exactly \(\pi\). The hurdle model sets the zero probability to \(\pi\) directly and then rescales the PMF on the non-zero support so that it sums to \(1-\pi\).
By contrast, the zero-inflated model is built as a mixture distribution and distinguishes two kinds of zeros: structural zeros and sampling zeros.
The choice between zero-inflated models and hurdle models depends on the underlying data and context. In general:
- If the data generating process is believed to involve two different sources of zeros (e.g., structural zeros and zeros from the count process), a zero-inflated model may be more appropriate.
- If the zeros arise from a single process and the positive counts are generated once a hurdle is crossed, a hurdle model may be more suitable.