Distribution Textbook (Work in Progress)

by John Della Rosa

Distribution Sampling

Introduction to Distribution Sampling

Recommended Prerequisites

  1. Probability
  2. Probability II
  3. Sampling (Continuous)

Introduction

In this chapter, we will explore the methods used to sample from discrete probability distributions.

As with the continuous case, this chapter will be more applied than theoretical. Because there are fewer common discrete distributions, we will focus more on specifics than in the continuous chapter. The main ones also tend to have common use cases in algorithms.

Common Distributions

We will briefly review the common distributions.

Bernoulli

The most basic, binary case, which I will assume you are familiar with. Its PMF is $$P(X=x)=\begin{cases} p & \text{if }x=1\\ 1-p & \text{if }x=0 \end{cases}$$ Because it corresponds to a binary outcome with a given probability, the Bernoulli distribution is one of the fundamental building blocks. Often a sampler will be available from a library in your language of choice, but it is good to know how to implement one.

Binomial Distribution

The number of successes in n independent Bernoulli trials has a PMF given by $$P(X=k)={n \choose k} p^k(1-p)^{n-k};\quad k=0,1,\dots,n$$

Multinomial Distribution

The multinomial distribution extends the binomial distribution to more than two possible outcomes.

Let n be the total number of trials; \(p_1,p_2,\dots,p_k\) be the probabilities for the k categories such that \(\sum_{i=1}^kp_i=1\); and \(x_1,x_2,\dots,x_k\) be the number of times each category is observed, where \(\sum_{i=1}^kx_i=n\).

The probability of a particular combination is given by the multinomial PMF: $$P(X_1=x_1,X_2=x_2,\dots,X_k=x_k)=\frac{n!}{x_1!x_2!\dots x_k!}p_1^{x_1}p_{2}^{x_2}\dots p_k^{x_k}$$ If you only care about one outcome of many, you essentially reduce it to a binomial case where the two outcomes are the desired category and any other category. Thus, the marginal distribution is given by: $$P(X_i=x_i)={n\choose x_i}p_i^{x_i}(1-p_i)^{n-x_i}$$ This can be a useful relationship for algorithms.
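As a loose illustration of this (a minimal sketch, not code from this book; the helper sample_binomial mirrors the direct-summation sampler covered later in this chapter), you can draw the count for a single category directly from its binomial marginal instead of sampling the whole vector:

import random

def sample_binomial(n, p):
    # Direct summation of n Bernoulli(p) trials (see the Binomial section below)
    return sum(1 for _ in range(n) if random.uniform(0, 1) <= p)

# Hypothetical example: only the count of category i matters
p_vec = [0.2, 0.3, 0.5]
n = 10
i = 2
x_i = sample_binomial(n, p_vec[i])  # marginally Binomial(n, p_i)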

Geometric Distribution

The geometric distribution models the number of trials up to and including the first success in a series of Bernoulli trials, and it can be extended to the negative binomial distribution. $$P(X=k)=(1-p)^{k-1}p;\quad k=1,2,3,\dots$$

Negative Binomial Distribution

Similar to how the binomial is a sum of independent Bernoulli trials, the negative binomial is a sum of independent geometric random variables. This particular distribution has an annoyingly high number of ways to parameterize it, but for the purposes of this book, we will use the convention that it counts the number of failures in a series of Bernoulli trials until you get r successes, where r is a given integer. $$P(X=k)={k+r-1 \choose k}(1-p)^kp^r$$

Poisson Distribution

The Poisson distribution represents the number of events in a fixed interval of time or space when events occur independently at a constant rate \(\lambda\): $$P(X=k)=\frac{\lambda^ke^{-\lambda}}{k!};\quad k=0,1,2,\dots$$

Distribution-Specific Algorithms

While general methods like Inverse Transform Sampling and Rejection Sampling can be applied to any distribution, there are distribution-specific algorithms that offer more efficient sampling for particular distributions. These algorithms often take advantage of the structure of the distribution's probability mass function (PMF) or cumulative distribution function (CDF). Other properties that can be exploited are convolution relationships, such as Bernoulli with binomial, or geometric with negative binomial, as sketched below.
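For instance, here is a minimal sketch (my own function names, assuming the failures-before-r-successes parameterization given earlier) that builds a negative binomial sample by summing r independent counts of failures before a success, i.e. geometric draws shifted to start at 0:

import random

def sample_failures_before_success(p):
    # Count failures in repeated Bernoulli(p) trials until the first success
    failures = 0
    while random.uniform(0, 1) > p:
        failures += 1
    return failures

def sample_negative_binomial(r, p):
    # Convolution relationship: the number of failures before the r-th success
    # is the sum of r independent failures-before-a-success counts
    return sum(sample_failures_before_success(p) for _ in range(r))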

Bernoulli

The Bernoulli distribution can be sampled directly using a uniform random variable.

Generate a U(0,1)

Generate a random uniform number \(U\sim\mathcal{U}(0,1)\)

Compare Against p

If \(U\leq p\), return 1; else, return 0.

Binomial

Direct Summation

We can take advantage of the fact that the sum of n iid Bernoulli(p) random variables follows a Binomial(n, p) distribution.

For \(i=1\) to \(n\), sample \(X_i\sim\) Bernoulli(p). Return the sum of all \(X_i\).
Python Implementation
import random

def sample_bernoulli(p):
    # Compare a uniform draw against p: success (1) if U <= p, failure (0) otherwise
    U = random.uniform(0, 1)
    if U <= p:
        return 1
    else:
        return 0

def sample_binomial(n, p):
    # Direct summation: a Binomial(n, p) sample is the sum of n iid Bernoulli(p) samples
    successes = 0
    for _ in range(n):
        successes += sample_bernoulli(p)
    return successes


This has time complexity of O(n). While this is a simple algorithm and not especially heavy computationally, avoiding the generation of many uniform random variables can sometimes lead to noticeably better performance.

BINV

For large n, an alternative is the BINV algorithm, which applies inverse transform sampling by stepping through the binomial CDF using the recursive relationship between successive binomial probabilities.
Python Implementation
import random

def sample_binomial_binv(n, p):
    # Inverse transform via the recurrence
    # P(X = k) = P(X = k - 1) * (n - k + 1) / k * p / (1 - p); assumes 0 < p < 1
    q = (1 - p)**n      # P(X = 0)
    F = q               # running CDF value F(X)
    X = 0
    U = random.uniform(0, 1)

    while U > F and X < n:   # the bound X < n guards against floating-point round-off
        X += 1
        q = q * (n - X + 1) * p / (X * (1 - p))
        F += q

    return X
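As a quick sanity check (my own snippet, not part of the text), the empirical mean of many draws should sit near np:

# Hypothetical check: with n = 20 and p = 0.3 the mean should be close to 6
draws = [sample_binomial_binv(20, 0.3) for _ in range(100_000)]
print(sum(draws) / len(draws))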

Multinomial

Naive Algorithm

As with the binomial distribution, we can simply sample n times from the associated categorical distribution, whose PMF is $$P(k)=p_k$$ The categorical sampling algorithm is very similar to that of the Bernoulli; the only difference is that we no longer have a single threshold.
Generate a U(0,1)
Same as with the Bernoulli, we generate a realization \(U\sim \mathcal{U}(0,1)\).
Use cumulative probability
While the Bernoulli sampler had a single threshold based on p, here we break the segment from 0 to 1 into k regions. We do this based on something similar to a CDF: $$P(C_i)=\sum_{j=1}^ip_j$$ We then output the sample based on which region it lies in. An example will make this clearer.
Example
Let's say we have k=3, and our p-vector is \([0.2,0.3,0.5]\). Our thresholds are 0.2, 0.5, and 1.0: if \(U\leq 0.2\) we return category 1; if \(0.2<U\leq 0.5\) we return category 2; otherwise we return category 3.
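A minimal sketch of this naive approach (function names are mine): sample one category by scanning the cumulative probabilities, then repeat n times and tally the counts to get a multinomial draw.

import random

def sample_categorical(p):
    # Return the index of the region of [0, 1] that U falls into,
    # using the cumulative probabilities as thresholds
    U = random.uniform(0, 1)
    cumulative = 0.0
    for i, p_i in enumerate(p):
        cumulative += p_i
        if U <= cumulative:
            return i
    return len(p) - 1  # guard against floating-point round-off

def sample_multinomial(n, p):
    # n independent categorical draws, tallied per category
    counts = [0] * len(p)
    for _ in range(n):
        counts[sample_categorical(p)] += 1
    return counts

# Example from above: k = 3 with p = [0.2, 0.3, 0.5]
print(sample_multinomial(10, [0.2, 0.3, 0.5]))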

General Sampling Methods

The basic techniques from continuous sampling can also be applied to discrete distributions.

Inverse Transform Sampling

Inverse Transform Sampling is a widely used technique for generating random samples from a given probability distribution, applicable to both discrete and continuous distributions. Despite the general similarities between the two cases, there are notable differences in implementation due to the nature of the cumulative distribution function (CDF) and the values the random variable can take.

For discrete distributions, the process involves generating a random value from a uniform distribution and finding the smallest discrete value whose cumulative probability is greater than or equal to this uniform random variable. This is fundamentally different from the continuous case, where the inverse CDF is computed analytically or using some root-finding method.

Algorithm

Compute the CDF
Define the cumulative distribution function F(x) for the discrete random variable. $$F(x)=P(X\leq x)$$
Generate a Uniform Random Variable
Draw a random number U from a uniform distribution on the interval [0,1] where \(U\sim\mathcal{U}(0,1)\).
Find Smallest Value of X Satisfying the Criterion
The final step is finding the smallest x in the support such that \(F(x)\geq U\). The sampled value is the smallest such x.
Example
For a geometric distribution with parameter p, $$P(X=k)=(1-p)^{k-1}p\quad k=1,2,3,\dots$$ The corresponding CDF is $$F(k)=1-(1-p)^k$$ To sample, generate \(U\sim\mathcal{U}(0,1)\), and find the smallest k such that: $$1-(1-p)^k\geq U$$ Rearranging, $$k=\left\lceil\frac{\log(1-U)}{\log(1-p)}\right\rceil$$ This is one case where we can get an easy answer through algebra, but this is not typical.
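A minimal sketch of both routes (function names are mine): the generic search for the smallest k with \(F(k)\geq U\), and the closed-form expression derived above.

import math
import random

def sample_geometric_search(p):
    # Generic discrete inverse transform: walk the support until F(k) >= U
    U = random.uniform(0, 1)
    k = 1
    F = p                    # F(1) = 1 - (1 - p)^1 = p
    while U > F:
        k += 1
        F = 1 - (1 - p)**k
    return k

def sample_geometric_closed_form(p):
    # Algebraic inverse: k = ceil(log(1 - U) / log(1 - p));
    # max(1, ...) guards the U = 0 edge case
    U = random.uniform(0, 1)
    return max(1, math.ceil(math.log(1 - U) / math.log(1 - p)))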

Rejection Sampling

Rejection Sampling is a method used when the distribution is difficult to sample from directly, but its PMF can be bounded by a scaled version of a simpler distribution's PMF. This method relies on generating samples from an easy-to-sample proposal distribution and then rejecting some of them based on a condition. This is very similar to the continuous case.

Algorithm

Choose a proposal distribution
The proposal distribution q(x) is the one that you will actually sample from. As in the continuous case, it must satisfy $$p(x)\leq cq(x)\quad\forall x$$ where p is the target PMF.
Generate X from the Proposal Distribution
Sample a value x from the proposal distribution.
Generate a U(0,1)
Generate a uniform random variable \(U\sim\mathcal{U}(0,1)\).
Check Acceptance Criterion
We accept the sample x if $$U\leq \frac{P(X=x)}{cq(x)}$$ If this condition does not hold, repeat the process starting from generating x from the proposal distribution.
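A minimal sketch of discrete rejection sampling (my own example, not from the text): the target is a Binomial(4, 0.3) PMF and the proposal is uniform over its support \(\{0,\dots,4\}\).

import math
import random

def target_pmf(k, n=4, p=0.3):
    # Binomial(n, p) PMF playing the role of the target p(x)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def sample_binomial_rejection(n=4, p=0.3):
    q = 1 / (n + 1)                                            # uniform proposal q(x) on {0, ..., n}
    c = max(target_pmf(k, n, p) for k in range(n + 1)) / q     # smallest constant with p(x) <= c q(x)
    while True:
        x = random.randint(0, n)                               # X from the proposal distribution
        U = random.uniform(0, 1)
        if U <= target_pmf(x, n, p) / (c * q):                 # acceptance check
            return x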

Alias Method

The Alias Method is unique to discrete distributions, so we haven't covered it in the prior chapter. It's useful when the distribution has many, but finitely many, possible values. It performs sampling in O(1) time once the preprocessing is done, which makes it useful when many samples need to be drawn. For other methods, such as inverse transform sampling, the time complexity of drawing a single sample can be O(n) if you need to traverse the support and compare against the CDF.

Pre-processing

The Alias method involves constructing two tables:
  1. Probability Table: probabilities of the enumerated outcomes, scaled so that sampling via a uniform index becomes easy
  2. Alias Table: stores the secondary outcomes that are used when the primary outcome's probability isn't high enough.
Each entry in these tables corresponds to one of the possible outcomes, and the sampling algorithm involves selecting a random index and deciding which outcome to return based on a simple comparison.
Normalize the Probabilities
Scale the original weights so that they sum to 1, then multiply each probability by n so that the average probability becomes 1: $$p_{i}'=p_i\times n$$ Now we have a set of modified probabilities \(p_{1}',p_{2}',\dots,p_{n}'\), where some values will be greater than 1 ("heavy") and others will be less than 1 ("light").

We create two worklists: one holding the "heavy" outcomes and one holding the "light" outcomes.
Construct the Probability and Alias Tables
  1. Probability table: stores values that correspond to the adjusted probabilities for each outcome
  2. Alias table: stores indices for the "alias" outcomes, which represent the secondary outcome that can be sampled if the primary outcome is not chosen.
Now the probabilities have to be balanced out: repeatedly take one light outcome \(\ell\) and one heavy outcome \(h\). Set the Probability Table entry for \(\ell\) to \(p_{\ell}'\), record \(h\) as its alias, and subtract \(1-p_{\ell}'\) from \(p_{h}'\); depending on whether its remaining probability is below or above 1, \(h\) goes back on the light or heavy list. Repeat until both lists are empty. At the end of the process, each entry in the Probability Table will have a value less than or equal to 1, and each entry in the Alias Table will store an alias for the corresponding outcome. If an outcome is left over, its scaled probability should be numerically close to 1; just set that probability to 1 in the Probability Table and have it be its own alias.

Sampling

Choose index
Pick a random index i uniformly from \(\{1,2,\dots,n\}\).
Sample Primary or Secondary
We now simulate a Bernoulli trial with p being the value in the Probability Table for index i. If success, then return i; else, return the alias stored in the Alias table at index i.
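A minimal sketch of the whole procedure (my own function names, following the worklist-based bookkeeping described above), covering both table construction and the O(1) sampling step:

import random

def build_alias_tables(p):
    n = len(p)
    prob = [0.0] * n      # Probability Table (scaled probabilities)
    alias = [0] * n       # Alias Table (secondary outcomes)
    scaled = [p_i * n for p_i in p]

    # Worklists of outcomes with scaled probability below / at-or-above 1
    light = [i for i, s in enumerate(scaled) if s < 1.0]
    heavy = [i for i, s in enumerate(scaled) if s >= 1.0]

    while light and heavy:
        l = light.pop()
        h = heavy.pop()
        prob[l] = scaled[l]            # l keeps its own (deficient) probability ...
        alias[l] = h                   # ... and borrows the rest from h
        scaled[h] -= 1.0 - scaled[l]
        (light if scaled[h] < 1.0 else heavy).append(h)

    # Anything left over is numerically ~1: make it its own alias
    for i in light + heavy:
        prob[i] = 1.0
        alias[i] = i
    return prob, alias

def sample_alias(prob, alias):
    i = random.randrange(len(prob))       # choose an index uniformly
    if random.uniform(0, 1) <= prob[i]:   # Bernoulli trial with prob[i]
        return i
    return alias[i]

# Example usage with p = [0.2, 0.3, 0.5]
prob, alias = build_alias_tables([0.2, 0.3, 0.5])
samples = [sample_alias(prob, alias) for _ in range(10)]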

Discrete Sampling Practice Problems

  1. Use rejection sampling to generate samples from a Beta distribution with \(\alpha=2,\beta=3\). Use a Uniform as the proposal distribution and scale appropriately. For reference, the non-normalized PDF for a Beta Distribution is \(f(x|\alpha,\beta)=x^{\alpha-1}(1-x)^{\beta-1}\).
  2. Use rejection sampling to generate samples from a Raised Cosine distribution with \(\mu=0,s=1\). Use a Uniform as the proposal distribution and scale appropriately. For reference, the PDF for a Raised Cosine Distribution is \(f(x|\mu,s)=\frac{1}{2s}\left[1+\cos\left(\frac{x-\mu}{s}\pi\right)\right]\) with \(x\in[\mu-s,\mu+s]\).
  3. Use rejection sampling to generate samples from a Wigner Semicircle distribution with \(R=2\). Use a Uniform as the proposal distribution and scale appropriately. For reference, the PDF for a Wigner Semicircle Distribution is \(f(x|R)=\frac{2}{\pi R^2}\sqrt{R^2-x^2}\) with \(x\in[-R,R]\).