Distribution Textbook (Work in Progress)

by John Della Rosa

Kernel Density Estimation

Introduction to KDE

Recommended Prerequisites

  1. Probability
  2. Probability II
  3. Empirical Distribution Function
  4. Mixture Distributions

Building on Prior Chapters

In a previous chapter, we used mixture distributions to model a probability density function as a weighted sum of parametric distributions (e.g., Gaussian components). KDE is similar in that it also represents the density as a sum, but here, the sum is over non-parametric kernel functions centered at each data point. While mixture models assign each data point to a specific component, KDE essentially "spreads" each data point over the entire space using a kernel with equal weighting. In some sense, it is like a cross between a mixture distribution and the empirical distribution function: placing evenly weighted components at every data point.

Kernel Density Estimation (KDE) can be viewed as a smooth, continuous generalization of the histogram. Instead of counting data points in fixed bins, KDE places a smooth kernel function at each data point and sums these functions to produce a smooth approximation of the probability density function.

The Equation

The kernel density estimate for a sample of n data points \(x_1, x_2, \dots, x_n\) is defined as: $$\hat{f}_h(x)=\frac{1}{n}\sum_{i=1}^nK_h(x-x_i)$$ where \(K_h(x)=\frac{1}{h}K\left(\frac{x}{h}\right)\) is a kernel function \(K\) scaled by the bandwidth \(h>0\).
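The formula above can be evaluated directly. Below is a minimal sketch using a Gaussian kernel; the function name `kde` and the small sample are illustrative, not from any particular library.

```python
import numpy as np

def kde(x, data, h):
    """Evaluate the Gaussian-kernel KDE at the point(s) x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    # K_h(u) = K(u / h) / h, with K the standard normal density
    u = (x[:, None] - data[None, :]) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return k.sum(axis=1) / (len(data) * h)

# Each of the 4 points contributes a bump of mass 1/4 to the estimate
sample = [1.0, 2.0, 3.5, 4.0]
print(kde([2.0, 3.0], sample, h=0.5))
```

Note that the estimate is itself a valid density: a numerical integration over a wide grid returns a total mass of (approximately) 1.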

Kernel functions

The kernel function \(K_h(\cdot)\) is a symmetric, non-negative function that integrates to 1. It determines the shape of the contribution that each data point makes to the overall density estimate.

Gaussian Kernel

The Gaussian kernel is given by: $$K_h(x)=\frac{1}{\sqrt{2\pi h^2}}\exp\left(-\frac{x^2}{2h^2}\right)$$
Features

  1. Smooth and infinitely differentiable, yielding a smooth density estimate.
  2. Unbounded support: every data point contributes to the estimate at every x, though the contribution decays rapidly with distance.

Uniform kernel

$$K_h(x)=\frac{1}{2h}\quad\text{for }|x|\leq h$$

Epanechnikov Kernel

$$K(x)=\frac{3}{4}(1-x^2)\quad \text{for }|x|\leq 1$$
Features

  1. Compact support (\(|x|\leq 1\)), so each data point influences only a local neighborhood.
  2. Optimal in the sense of minimizing asymptotic mean integrated squared error (AMISE).

Triangular Kernel

$$K(x)=1-|x|\quad\text{for }|x|\leq 1$$
Features

  1. Compact support with a simple, piecewise-linear shape.
  2. Continuous, but not differentiable at \(x=0\) or \(x=\pm 1\).

Biweight (Quartic) Kernel

$$K(x)=\frac{15}{16}(1-x^2)^2\quad\text{for }|x|\leq 1$$

Cosine Kernel

$$K(x)=\frac{\pi}{4}\cos\left(\frac{\pi x}{2}\right)\quad\text{for }|x|\leq 1$$

Laplace Kernel

$$K(x)=\frac{1}{2}e^{-|x|}$$
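The standard (bandwidth-1) forms of the kernels above can be collected in one place, and a quick numerical check confirms that each integrates to 1. A sketch (the dictionary layout is just one convenient organization):

```python
import numpy as np

# Standard-form kernels K(x); the scaled version is K_h(x) = K(x / h) / h.
# The (np.abs(x) <= 1) factor zeroes each compact-support kernel outside [-1, 1].
kernels = {
    "gaussian":     lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi),
    "uniform":      lambda x: 0.5 * (np.abs(x) <= 1),
    "epanechnikov": lambda x: 0.75 * (1 - x**2) * (np.abs(x) <= 1),
    "triangular":   lambda x: (1 - np.abs(x)) * (np.abs(x) <= 1),
    "biweight":     lambda x: (15 / 16) * (1 - x**2) ** 2 * (np.abs(x) <= 1),
    "cosine":       lambda x: (np.pi / 4) * np.cos(np.pi * x / 2) * (np.abs(x) <= 1),
    "laplace":      lambda x: 0.5 * np.exp(-np.abs(x)),
}

# Sanity check: every kernel should integrate to (approximately) 1
grid = np.linspace(-10.0, 10.0, 200001)
dx = grid[1] - grid[0]
for name, K in kernels.items():
    print(f"{name:12s} integrates to {K(grid).sum() * dx:.4f}")
```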

The Bandwidth Parameter h

The bandwidth \(h\) is a crucial parameter in KDE as it controls the spread of the kernel around each data point. A small bandwidth produces a highly detailed estimate that may overfit the data, while a large bandwidth results in an oversmoothed estimate that may fail to capture important features of the distribution.
Silverman's Rule

Silverman's rule of thumb gives a quick default bandwidth, derived by assuming the underlying distribution is roughly Gaussian: $$h=1.06\,\hat{\sigma}\,n^{-1/5}$$ where \(\hat{\sigma}\) is the sample standard deviation.
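The rule is a one-line computation. A sketch (function name is illustrative):

```python
import numpy as np

def silverman_bandwidth(data):
    """Normal-reference rule of thumb: h = 1.06 * sigma_hat * n^(-1/5)."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    sigma = data.std(ddof=1)  # sample standard deviation
    return 1.06 * sigma * n ** (-1 / 5)

# For n = 500 standard-normal draws, sigma_hat is near 1,
# so h should come out near 1.06 * 500^(-0.2) ~ 0.31
rng = np.random.default_rng(0)
sample = rng.normal(size=500)
print(silverman_bandwidth(sample))
```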

Contrasting with Regular Mixture Distributions

As mentioned, mixture distributions and KDE share a conceptual similarity, both combining multiple components to approximate the overall density. However, there are key differences:

  1. Number of components: a mixture model uses a small, fixed number of components; KDE places one component at every data point.
  2. Weights: mixture weights are fitted to the data (e.g., via the EM algorithm); KDE weights every component equally at \(1/n\).
  3. Parameters: mixture components have fitted locations and scales; KDE components are centered at the data points themselves, with a single shared bandwidth \(h\).

Higher Dimensions

The KDE framework extends naturally to multivariate data. For a d-dimensional sample \(\vec{x}_1,\vec{x}_2,\dots,\vec{x}_n\), the multivariate KDE is given by: $$\hat{f}_H(\vec{x})=\frac{1}{n}\sum_{i=1}^{n}|H|^{-1/2}K\left(H^{-1/2}(\vec{x}-\vec{x}_i)\right)$$ where \(H\) is a symmetric, positive-definite bandwidth matrix.
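For a standard multivariate Gaussian kernel \(K(\vec{u})=(2\pi)^{-d/2}\exp(-\|\vec{u}\|^2/2)\), the sum reduces to Gaussian bumps with covariance \(H\). A sketch for a single evaluation point (names illustrative):

```python
import numpy as np

def mv_kde(x, data, H):
    """Multivariate Gaussian-kernel KDE at a single point x.

    x    : length-d evaluation point
    data : (n, d) array of observations
    H    : (d, d) symmetric positive-definite bandwidth matrix
    """
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    H_inv = np.linalg.inv(H)
    diffs = x - data  # (n, d) array of x - x_i
    # Quadratic forms (x - x_i)^T H^{-1} (x - x_i), one per data point
    q = np.einsum("ij,jk,ik->i", diffs, H_inv, diffs)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(H))
    return np.exp(-0.5 * q).sum() / (n * norm)

pts = np.array([[1.0, 2.0], [2.1, 3.3], [2.5, 3.7], [3.0, 4.0], [4.1, 5.0]])
print(mv_kde([2.0, 3.0], pts, H=np.diag([0.5, 0.5])))
```

With a diagonal \(H\), this is equivalent to a product of one-dimensional Gaussian kernels with per-axis bandwidths \(\sqrt{H_{jj}}\).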

Issues

KDE can have problems near the boundaries of the support of the distribution. For example, if the true distribution is restricted to a finite range (e.g., non-negative values), the kernel functions may assign non-zero density outside this range, leading to a biased estimate.

Standard kernel functions are symmetric and have support that extends beyond the data boundary. When estimating the density near a boundary, a portion of the kernel extends outside the domain where the true density is defined (e.g., negative values for a non-negative variable).
Since there is no data beyond the boundary to balance the kernel weight, the estimated density at points near the boundary is biased downward.

Reflection Method

$$\hat{f}_{\text{ref}}(x)=\frac{1}{nh}\sum_{i=1}^n\left[K\left(\frac{x-X_i}{h}\right)+K\left(\frac{x+X_i-2a}{h}\right)\right]\quad\text{for }x\geq a$$ where \(a\) is the boundary point. Each data point is mirrored across the boundary, so the kernel mass that would have leaked past \(a\) is folded back into the valid region.
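The reflection method is a small modification of the basic estimator: each point gets a second kernel term at its mirror image \(2a - X_i\). A sketch with a Gaussian kernel and boundary at zero (names illustrative); the printed check confirms the estimate carries (approximately) unit mass on \([0,\infty)\):

```python
import numpy as np

def kde_reflected(x, data, h, a=0.0):
    """Gaussian-kernel KDE with reflection about the boundary x = a.

    Intended for evaluation points x >= a.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)

    def K(u):
        return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

    u1 = (x[:, None] - data[None, :]) / h          # ordinary kernel term
    u2 = (x[:, None] + data[None, :] - 2 * a) / h  # reflected-copy term
    return (K(u1) + K(u2)).sum(axis=1) / (len(data) * h)

# Non-negative (exponential) data: without reflection, mass would leak below 0
rng = np.random.default_rng(1)
sample = rng.exponential(size=200)
grid = np.linspace(0.0, 12.0, 4801)
total = kde_reflected(grid, sample, h=0.3).sum() * (grid[1] - grid[0])
print(f"mass on [0, 12]: {total:.4f}")
```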

Sampling from a KDE

Sampling from the KDE by inverting its formula for \(\hat{f}_h(x)\) directly would be quite unwieldy. Instead, we can take advantage of the fact that a KDE is a form of mixture distribution, which is easy to sample from.

Step 1: Select a Data Point

Randomly select one of the n data points \(X_1,X_2,\dots, X_n\) with equal probability.

Step 2: Generate Kernel Noise

Generate noise from the kernel distribution (e.g., Gaussian noise if you are using a Gaussian kernel), scaled by the bandwidth h. In other words, once a data point is selected, you sample from the kernel's distribution with the scale parameter determined by the bandwidth.

Step 3: Add the Noise to the Data Point

Add the scaled kernel noise from Step 2 to the data point selected in Step 1. The result is a single sample from the KDE; repeat the three steps for as many samples as needed.
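The three steps above can be sketched in a few lines for a Gaussian kernel (names illustrative); sampling the centers uniformly with replacement implements the equal \(1/n\) mixture weights:

```python
import numpy as np

def sample_kde(data, h, size, seed=None):
    """Draw `size` samples from a Gaussian-kernel KDE fitted to `data`."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Step 1: pick data points uniformly at random, with replacement
    centers = rng.choice(data, size=size, replace=True)
    # Steps 2-3: generate kernel noise, scale by h, add to the centers
    return centers + h * rng.standard_normal(size)

draws = sample_kde([1.0, 2.3, 2.9, 3.7, 4.1], h=0.5, size=10_000, seed=0)
print(draws.mean())  # should land near the data mean of 2.8
```

For a non-Gaussian kernel, only the noise step changes: draw from that kernel's distribution instead of `standard_normal`.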

Interactive Kernel Density Estimate (KDE)


User Guide: Interactive Kernel Density Estimate (KDE) Tool

Welcome to the Kernel Density Estimate (KDE) tool! This guide will help you understand how to use the tool to estimate the probability density function of a dataset using various kernel functions. The tool allows you to input data points, select different kernels, adjust the bandwidth, and visualize multiple KDE curves simultaneously.

Step-by-Step Instructions

1. Entering Data Points

To begin, enter a list of comma-separated data points into the "Data Points" input field. The data points should represent the sample from which you wish to estimate the density.

You can also upload a CSV or text file with comma-separated values by using the "Upload CSV or Text File" option.

2. Selecting a Kernel Function

The tool supports several commonly used kernel functions for density estimation. You can choose the kernel type from the dropdown menu labeled "Select Kernel". Each kernel has different properties, as described earlier in this chapter.

3. Adjusting the Bandwidth (h)

The bandwidth controls the smoothness of the KDE curve. A larger bandwidth results in a smoother curve, while a smaller bandwidth captures more detail but may lead to overfitting. Enter the desired bandwidth value in the "Bandwidth" input field (default: 1.0).

4. Adding Multiple KDE Curves

Once you have selected a kernel and adjusted the bandwidth, click the "Add KDE" button to generate and overlay the KDE curve on the chart. You can add multiple KDE curves with different bandwidths and kernel types. Each curve will be displayed with a unique color.

To remove a KDE curve, click the corresponding button in the "Active KDEs" list below the chart. You can also change the color of a KDE curve using the color selector next to each item in the list.

5. Adjusting the X-Axis Range

You can adjust the range of the x-axis by entering values for "X-Axis Min" and "X-Axis Max". This allows you to zoom in or out on specific regions of the data and KDE curves.

6. Updating the Data

If you wish to modify the data points, either by manually typing them in or uploading a new file, you can click the "Refresh Data" button to update the KDE curves based on the new data. This will ensure that all KDEs correspond to the latest data points.

Advanced Features

1. Kernel-Specific Behavior

Each kernel function has its own characteristics. For example, compact-support kernels such as Epanechnikov and Uniform contribute nothing beyond distance h from each data point, while the Gaussian and Laplace kernels assign some density everywhere.

2. Visualizing Data Points

The original data points are always plotted on the chart as red scatter points. This allows you to see how the KDE curves fit the data and how the smoothing effects of different kernels and bandwidths affect the estimate.

3. Customizing KDE Curves

You can customize the KDE curves by adjusting their color using the color selector next to each curve in the "Active KDEs" list. This is especially useful when you have multiple KDE curves on the same chart.

Frequently Asked Questions (FAQ)

1. How does the bandwidth affect the KDE?

The bandwidth controls how much smoothing is applied to the KDE. A smaller bandwidth captures more local detail, while a larger bandwidth results in a smoother, more global estimate.

2. Can I visualize multiple KDEs at once?

Yes! You can add multiple KDE curves with different kernels and bandwidths. They will be overlaid on the same chart, and each KDE will be listed in the "Active KDEs" section where you can manage them.

3. Can I change the color of the KDE curves?

Yes, each KDE in the "Active KDEs" section has a color picker that allows you to customize the color of that curve. This makes it easier to distinguish between multiple KDEs.

4. What kernels are available in the tool?

The tool includes the following kernels: Gaussian, Epanechnikov, Uniform, Triangular, Biweight (Quartic), Cosine, and Laplace.

5. How do I remove a KDE curve?

Click the "Remove KDE" button next to the corresponding KDE in the "Active KDEs" list to remove it from the chart.

6. How do I upload a file?

Click the "Upload CSV or Text File" button and select a CSV or text file containing comma-separated data points. The file will be uploaded, and the data points will be displayed in the input field. The KDE curves will then update accordingly.

KDE Practice Problems

  1. Basic KDE Construction:
    1. Explain the concept of a kernel density estimate (KDE). How does it differ from a histogram for estimating the underlying probability density function (PDF)?
    2. Given a sample dataset: \[ \{1.0, 2.3, 2.9, 3.7, 4.1, 4.5, 5.0\} \] Calculate the KDE at \(x = 3.0\) using the Gaussian kernel and a bandwidth of \(h = 0.5\).
  2. EDF for Large Datasets:
    1. Discuss the computational challenges of constructing the EDF for very large datasets. What are some methods to improve the efficiency of EDF calculation for large-scale data?
    2. For a dataset of size \(n = 10^6\), simulate a sample from a normal distribution \(N(0, 1)\), compute the EDF, and compare the computational performance with kernel density estimation for the same dataset.
  3. Bandwidth Selection:
    1. Explain the role of the bandwidth parameter \(h\) in KDE. What happens when \(h\) is too small or too large?
    2. For the sample dataset: \[ \{1.1, 2.0, 3.3, 3.9, 4.8, 5.2, 6.1\} \] Plot the KDE for different bandwidth values (\(h = 0.2, 0.5, 1.0\)) using the Gaussian kernel. Compare and describe the differences in the resulting estimates.
  4. Kernel Functions:
    1. List and describe the commonly used kernel functions in KDE (e.g., Gaussian, Epanechnikov, Uniform). What are the advantages and disadvantages of each?
    2. For the dataset: \[ \{0.5, 1.8, 2.2, 3.1, 4.0\} \] Compute the KDE at \(x = 2.5\) using the Epanechnikov kernel with bandwidth \(h = 0.5\).
  5. Comparison of KDE and Histograms:
    1. Given the dataset: \[ \{0.8, 1.5, 2.0, 3.3, 4.2, 4.7, 5.0, 5.5\} \] Plot the histogram with a bin width of 1.0 and overlay the KDE with a Gaussian kernel and bandwidth \(h = 0.7\). Discuss the advantages and disadvantages of KDE compared to histograms in terms of smoothness and sensitivity to bin size.
  6. Multivariate KDE:
    1. Explain how KDE can be extended to multivariate data. How does bandwidth selection differ in this case?
    2. For the 2D dataset: \[ \{(1.0, 2.0), (2.1, 3.3), (2.5, 3.7), (3.0, 4.0), (4.1, 5.0)\} \] Calculate the KDE at \((x_1, x_2) = (2.0, 3.0)\) using the Gaussian kernel with bandwidth matrix \(H = \text{diag}(0.5, 0.5)\).