Distribution Textbook (Work in Progress)

by John Della Rosa

Kernel Density Estimation

Introduction to KDE

Recommended Prerequisites

  1. Probability
  2. Probability II
  3. Mixture Distributions

Building on Prior Chapters

In a previous chapter, we used mixture distributions to model a probability density function as a weighted sum of parametric distributions (e.g., Gaussian components). KDE is similar in that it also represents the density as a sum, but here the sum is over kernel functions, one centered at each data point. While a mixture model fits a small number of components, each with its own estimated weight and parameters, KDE gives every data point its own kernel with equal weight \(1/n\), so each observation is "spread" over a neighborhood whose width is set by the bandwidth.

Kernel Density Estimation (KDE) can be viewed as a smooth, continuous generalization of the histogram. Instead of counting data points in fixed bins, KDE places a smooth kernel function at each data point and sums these functions to produce a smooth approximation of the probability density function.

The Equation

The kernel density estimate for a sample of \(n\) data points \(x_1, x_2, \dots, x_n\) is defined as: $$\hat{f}_h(x)=\frac{1}{n}\sum_{i=1}^nK_h(x-x_i)$$ where \(K_h\) is the kernel function with bandwidth \(h>0\) and \(n\) is the sample size.

Kernel functions

The kernel function \(K_h(\cdot)\) is a symmetric, non-negative function that integrates to 1. It determines the shape of the contribution that each data point makes to the overall density estimate.

Gaussian Kernel

The Gaussian kernel is given by: $$K_h(x)=\frac{1}{\sqrt{2\pi h^2}}\exp\left(-\frac{x^2}{2h^2}\right)$$
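As a minimal sketch of how the estimator and the Gaussian kernel fit together, the following pure-Python snippet evaluates \(\hat{f}_h(x)\) at a point and checks numerically that the estimate integrates to roughly 1. The function names and the sample values are made up for illustration.

```python
import math

def gaussian_kernel(u, h):
    """Gaussian kernel K_h(u): a normal density with mean 0 and standard deviation h."""
    return math.exp(-u * u / (2.0 * h * h)) / math.sqrt(2.0 * math.pi * h * h)

def kde(x, data, h, kernel=gaussian_kernel):
    """Evaluate (1/n) * sum_i K_h(x - x_i) at a single point x."""
    return sum(kernel(x - xi, h) for xi in data) / len(data)

# Illustrative sample (assumed values, not from the text).
sample = [1.0, 1.5, 2.0, 4.0, 4.2]
density_at_2 = kde(2.0, sample, h=0.5)

# Sanity check: a Riemann sum over a wide grid should be close to 1.
grid_step = 0.01
grid = [-5.0 + grid_step * k for k in range(1501)]  # covers [-5, 10]
total_mass = sum(kde(x, sample, 0.5) for x in grid) * grid_step
```

Because each kernel integrates to 1 and the estimate averages \(n\) of them, the total mass is 1 regardless of the sample or the bandwidth.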

Uniform kernel

The uniform kernel is given by: $$K_h(x)=\begin{cases}\frac{1}{2h} & \text{for }|x|\leq h\\ 0 & \text{otherwise}\end{cases}$$
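With the uniform kernel, the estimate at \(x\) simply counts the data points within distance \(h\) of \(x\), which makes the link to histograms concrete. The sketch below uses hypothetical function names and an assumed sample.

```python
def uniform_kernel(u, h):
    """Uniform kernel K_h(u): constant 1/(2h) on [-h, h], zero elsewhere."""
    return 1.0 / (2.0 * h) if abs(u) <= h else 0.0

def uniform_kde(x, data, h):
    """Evaluate (1/n) * sum_i K_h(x - x_i) with the uniform kernel."""
    return sum(uniform_kernel(x - xi, h) for xi in data) / len(data)

# Illustrative sample (assumed values, not from the text).
sample = [1.0, 1.5, 2.0, 4.0, 4.2]

# With h = 0.5, the points 1.0, 1.5 and 2.0 lie within distance h of x = 1.5,
# so the estimate is 3 * (1 / (2 * 0.5)) / 5 = 0.6.
value_at_1_5 = uniform_kde(1.5, sample, h=0.5)

# No data point lies within 0.5 of x = 3.0, so the estimate there is 0.
value_at_3 = uniform_kde(3.0, sample, h=0.5)
```

Unlike a histogram, the "bin" here is re-centered at every evaluation point, but the resulting curve is still a step function.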

The Bandwidth Parameter h

The bandwidth \(h\) is a crucial parameter in KDE as it controls the spread of the kernel around each data point. A small bandwidth produces a highly detailed estimate that may overfit the data, while a large bandwidth results in an oversmoothed estimate that may fail to capture important features of the distribution.
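One rough way to see this trade-off is to count the local maxima (modes) of a Gaussian KDE on a grid for a small and a large bandwidth. The sketch below is illustrative only: the sample, the grid, and the helper `count_modes` are assumptions.

```python
import math

def gaussian_kde(x, data, h):
    """Gaussian KDE evaluated at a single point x."""
    norm = len(data) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-((x - xi) ** 2) / (2.0 * h * h)) for xi in data) / norm

def count_modes(data, h, lo, hi, steps=2000):
    """Count strict local maxima of the KDE on an evenly spaced grid over [lo, hi]."""
    xs = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    ys = [gaussian_kde(x, data, h) for x in xs]
    return sum(1 for k in range(1, steps) if ys[k - 1] < ys[k] > ys[k + 1])

# Illustrative sample (assumed values, not from the text).
sample = [1.0, 1.5, 2.0, 4.0, 4.2]

modes_small_h = count_modes(sample, h=0.05, lo=-2.0, hi=8.0)  # very little smoothing
modes_large_h = count_modes(sample, h=2.0, lo=-2.0, hi=8.0)   # heavy smoothing
```

For this sample, the narrow bandwidth gives a separate bump at each of the five data points, while the wide bandwidth merges everything into a single mode and hides the two visible clusters.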

Contrasting with Regular Mixture Distributions

As mentioned, mixture distributions and KDE share a conceptual similarity, both combining multiple components to approximate the overall density. However, there are key differences:

Higher Dimensions

The KDE framework extends naturally to multivariate data. For a sample of \(d\)-dimensional points \(\vec{x}_1,\vec{x}_2,\dots,\vec{x}_n\), the multivariate KDE is given by: $$\hat{f}_h(\vec{x})=\frac{1}{n}\sum_{i=1}^{n}K_h(\vec{x}-\vec{x}_i)$$ where \(K_h\) is now a kernel defined on \(\mathbb{R}^d\).
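The text does not fix a particular multivariate kernel. As one common choice, the sketch below assumes an isotropic Gaussian kernel with a single bandwidth \(h\) shared by all coordinates, \(K_h(\vec{u})=(2\pi h^2)^{-d/2}\exp\left(-\|\vec{u}\|^2/(2h^2)\right)\). The function names and sample points are hypothetical.

```python
import math

def gaussian_kernel_nd(u, h):
    """Isotropic Gaussian kernel on R^d (assumed form): one bandwidth h for all axes."""
    d = len(u)
    sq_norm = sum(c * c for c in u)
    return math.exp(-sq_norm / (2.0 * h * h)) / (2.0 * math.pi * h * h) ** (d / 2.0)

def kde_nd(x, data, h):
    """Evaluate (1/n) * sum_i K_h(x - x_i) for d-dimensional points given as tuples."""
    total = 0.0
    for xi in data:
        diff = [a - b for a, b in zip(x, xi)]
        total += gaussian_kernel_nd(diff, h)
    return total / len(data)

# Illustrative 2-dimensional sample (assumed values, not from the text).
points = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.5)]
density_at_origin = kde_nd((0.0, 0.0), points, h=0.5)
```

A single shared bandwidth is the simplest option; data whose coordinates have very different scales would usually call for per-coordinate bandwidths or a full bandwidth matrix.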

Issues

KDE can have problems near the boundaries of the support of the distribution. For example, if the true distribution is restricted to a bounded or one-sided range (e.g., non-negative values), kernels centered near the boundary assign non-zero density outside this range. The estimate is then biased there: it places mass where none can occur and typically underestimates the density just inside the boundary.
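The size of this effect can be computed directly for a Gaussian kernel, since the mass that a kernel centered at \(x_i\) places below a boundary \(b\) is \(\Phi((b-x_i)/h)\), where \(\Phi\) is the standard normal CDF. The sketch below uses an assumed non-negative sample and hypothetical helper names.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mass_below(boundary, data, h):
    """Probability mass that a Gaussian KDE with bandwidth h places below `boundary`."""
    return sum(normal_cdf((boundary - xi) / h) for xi in data) / len(data)

# Illustrative non-negative sample (assumed values, not from the text).
nonneg_sample = [0.05, 0.1, 0.3, 0.7, 1.2, 2.5]

# Mass assigned to negative values, where the true density is zero.
leaked = mass_below(0.0, nonneg_sample, h=0.5)
```

For this sample and bandwidth, about a fifth of the estimated probability mass falls below zero. Shrinking \(h\) reduces the leak but brings back the undersmoothing problem described earlier.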

Interactive Kernel Density Estimate (KDE)

User Guide: Interactive Kernel Density Estimate (KDE) Tool

Welcome to the Kernel Density Estimate (KDE) tool! This guide will help you understand how to use the tool to estimate the probability density function of a dataset using various kernel functions. The tool allows you to input data points, select different kernels, adjust the bandwidth, and visualize multiple KDE curves simultaneously.

Step-by-Step Instructions

1. Entering Data Points

To begin, enter a list of comma-separated data points into the "Data Points" input field. The data points should represent the sample from which you wish to estimate the density.

You can also upload a CSV or text file with comma-separated values by using the "Upload CSV or Text File" option.
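The tool's own parsing logic is not shown here. As a rough sketch of what reading such input might look like, the hypothetical helper below turns comma-separated text (typed or read from a file) into a list of numbers.

```python
def parse_points(text):
    """Parse comma-separated numbers, treating newlines as separators and skipping blanks."""
    tokens = text.replace("\n", ",").split(",")
    return [float(tok) for tok in tokens if tok.strip()]

# Example input with uneven spacing, a line break, and a trailing comma.
points = parse_points("1, 2.5,\n3,")
```

Entries that are not valid numbers raise a `ValueError` in this sketch; how the tool itself handles them is not specified.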

2. Selecting a Kernel Function

The tool supports several commonly used kernel functions for density estimation. You can choose the kernel type from the dropdown menu labeled "Select Kernel". Each kernel has different properties:

3. Adjusting the Bandwidth (h)

The bandwidth controls the smoothness of the KDE curve. A larger bandwidth results in a smoother curve, while a smaller bandwidth captures more detail but may lead to overfitting. Enter the desired bandwidth value in the "Bandwidth" input field (default: 1.0).

4. Adding Multiple KDE Curves

Once you have selected a kernel and adjusted the bandwidth, click the "Add KDE" button to generate and overlay the KDE curve on the chart. You can add multiple KDE curves with different bandwidths and kernel types. Each curve will be displayed with a unique color.

To remove a KDE curve, click the "Remove KDE" button next to it in the "Active KDEs" list below the chart. You can also change the color of a KDE curve using the color selector next to each item in the list.

5. Adjusting the X-Axis Range

You can adjust the range of the x-axis by entering values for "X-Axis Min" and "X-Axis Max". This allows you to zoom in or out on specific regions of the data and KDE curves.

6. Updating the Data

After modifying the data points, either by typing them in manually or by uploading a new file, click the "Refresh Data" button to recompute the KDE curves. This ensures that all KDEs correspond to the latest data points.

Advanced Features

1. Kernel-Specific Behavior

Each kernel function has its own characteristics. For example:

2. Visualizing Data Points

The original data points are always plotted on the chart as red scatter points. This allows you to see how the KDE curves fit the data and how the smoothing effects of different kernels and bandwidths affect the estimate.

3. Customizing KDE Curves

You can customize the KDE curves by adjusting their color using the color selector next to each curve in the "Active KDEs" list. This is especially useful when you have multiple KDE curves on the same chart.

Frequently Asked Questions (FAQ)

1. How does the bandwidth affect the KDE?

The bandwidth controls how much smoothing is applied to the KDE. A smaller bandwidth captures more local detail, while a larger bandwidth results in a smoother, more global estimate.

2. Can I visualize multiple KDEs at once?

Yes! You can add multiple KDE curves with different kernels and bandwidths. They will be overlaid on the same chart, and each KDE will be listed in the "Active KDEs" section where you can manage them.

3. Can I change the color of the KDE curves?

Yes, each KDE in the "Active KDEs" section has a color picker that allows you to customize the color of that curve. This makes it easier to distinguish between multiple KDEs.

4. What kernels are available in the tool?

The tool includes the following kernels: Gaussian, Epanechnikov, Uniform, Triangular, Biweight (Quartic), Cosine, and Laplace.
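The tool's exact definitions are not reproduced in this guide. For reference, the sketch below uses the standard textbook forms of these kernels on the standardized scale \(u=x/h\), with \(K_h(x)=K(x/h)/h\), which agrees with the Gaussian and uniform formulas given earlier. The dictionary and function names are hypothetical.

```python
import math

# Standard kernel shapes K(u) on the standardized scale u = x / h (assumed forms).
KERNELS = {
    "gaussian": lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi),
    "epanechnikov": lambda u: 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0,
    "uniform": lambda u: 0.5 if abs(u) <= 1.0 else 0.0,
    "triangular": lambda u: 1.0 - abs(u) if abs(u) <= 1.0 else 0.0,
    "biweight": lambda u: (15.0 / 16.0) * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0,
    "cosine": lambda u: (math.pi / 4.0) * math.cos(math.pi * u / 2.0) if abs(u) <= 1.0 else 0.0,
    "laplace": lambda u: 0.5 * math.exp(-abs(u)),
}

def scaled_kernel(name, x, h):
    """K_h(x) = K(x / h) / h for the named kernel."""
    return KERNELS[name](x / h) / h

def kde(x, data, h, name="gaussian"):
    """Evaluate (1/n) * sum_i K_h(x - x_i) with the named kernel."""
    return sum(scaled_kernel(name, x - xi, h) for xi in data) / len(data)

# Compare two kernels at the same point on an assumed sample.
sample = [1.0, 1.5, 2.0, 4.0, 4.2]
gaussian_value = kde(2.0, sample, h=0.5, name="gaussian")
epanechnikov_value = kde(2.0, sample, h=0.5, name="epanechnikov")
```

The Gaussian and Laplace kernels have unbounded support, so every data point contributes a little everywhere; the other five are exactly zero beyond distance \(h\) from the data point.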

5. How do I remove a KDE curve?

Click the "Remove KDE" button next to the corresponding KDE in the "Active KDEs" list to remove it from the chart.

6. How do I upload a file?

Click the "Upload CSV or Text File" button and select a CSV or text file containing comma-separated data points. The file will be uploaded, and the data points will be displayed in the input field. The KDE curves will then update accordingly.

KDE Practice Problems