Understanding Histograms and Kernel Density Estimation | by Reza Bagheri | Dec, 2023

An in-depth exploration of histograms and KDE

Reza Bagheri
Towards Data Science

A histogram is a graph that visualizes the frequency of numerical data. It is commonly used in data science and statistics to obtain a rough estimate of the distribution of a dataset. Kernel density estimation (KDE) is a method for estimating the probability density function (PDF) of a random variable with an unknown distribution using a random sample drawn from that distribution. Hence, it allows us to infer the probability density of a population from a finite dataset sampled from it. KDE is often used in signal processing and data science as an essential tool for estimating probability densities. This article discusses the math and intuition behind histograms and KDE, along with their advantages and limitations. It also demonstrates how KDE can be implemented from scratch. All figures in this article were created by the author.

Probability density function

Let X be a continuous random variable. The probability that X takes a value in the interval [a, b] can be written as

$$P(a \le X \le b) = \int_a^b f(x)\,dx \tag{1}$$

where f(x) is X's probability density function (PDF). The cumulative distribution function (CDF) of X is defined as:

$$F(x) = P(X \le x)$$

Hence the CDF of X, evaluated at x, is the probability that X will take a value less than or equal to x. Using Equation 1, we can write:

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$$

Using the fundamental theorem of calculus, we can show that

$$f(x) = \frac{dF(x)}{dx}$$

which means that the PDF of X can be determined by taking the derivative of its CDF with respect to x. A histogram is the simplest approach to estimating the PDF from a dataset, and as we show in the next section, it uses Equation 1 for this purpose.
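The relationships above can be checked numerically. The following is a minimal sketch that assumes a standard normal distribution purely as an example; the helper `integrate`, the interval endpoints, and the evaluation point `x0` are illustrative choices, not part of the article. It uses only Python's standard library (`statistics.NormalDist`).

```python
from statistics import NormalDist

def integrate(f, a, b, n=10_000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

# Example distribution (an assumption made for illustration only)
X = NormalDist(mu=0.0, sigma=1.0)
a, b = -1.0, 2.0

# Equation 1: P(a <= X <= b) as the integral of the PDF over [a, b]
prob_from_pdf = integrate(X.pdf, a, b)

# The same probability expressed through the CDF: F(b) - F(a)
prob_from_cdf = X.cdf(b) - X.cdf(a)

# The PDF as the derivative of the CDF, via a central finite difference
x0, eps = 0.5, 1e-5
pdf_from_cdf = (X.cdf(x0 + eps) - X.cdf(x0 - eps)) / (2 * eps)

print(prob_from_pdf, prob_from_cdf)   # the two probabilities should agree closely
print(pdf_from_cdf, X.pdf(x0))        # the two density values should agree closely
```

Both pairs of printed values should match to several decimal places, which is what Equation 1 and the derivative relationship predict.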

Histograms

In Listing 1, we create a bimodal distribution as a mixture of two normal distributions and draw a random sample of size 1000 from it. The two normal distributions are:

$$X_1 \sim \mathcal{N}(0,\, 1), \qquad X_2 \sim \mathcal{N}(4,\, 0.8)$$

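Hence, the means of the normal distributions are 0 and 4, respectively, and their variances are 1 and 0.8, respectively. The mixing coefficients are 0.7 and 0.3, so the PDF of the mixture of these distributions is:

$$f(x) = 0.7\,\mathcal{N}(x \mid 0,\, 1) + 0.3\,\mathcal{N}(x \mid 4,\, 0.8)$$

where $\mathcal{N}(x \mid \mu, \sigma^2)$ denotes the PDF of a normal distribution with mean $\mu$ and variance $\sigma^2$.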
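The mixture density, 0.7 times the first normal PDF plus 0.3 times the second, can be evaluated directly. The sketch below uses only the standard library; the names `mixture_pdf`, `comp1`, and `comp2` and the integration range are assumptions made for illustration. Note that `NormalDist` expects a standard deviation, so the variances are converted with a square root.

```python
from math import sqrt
from statistics import NormalDist

# Components from the text: means 0 and 4, variances 1 and 0.8.
# NormalDist takes a standard deviation, hence the sqrt.
comp1 = NormalDist(mu=0.0, sigma=sqrt(1.0))
comp2 = NormalDist(mu=4.0, sigma=sqrt(0.8))
w1, w2 = 0.7, 0.3  # mixing coefficients

def mixture_pdf(x):
    """PDF of the two-component normal mixture at x."""
    return w1 * comp1.pdf(x) + w2 * comp2.pdf(x)

# Midpoint-rule check that the mixture PDF integrates to about 1.
# The range [-10, 14] is an arbitrary choice wide enough to cover both tails.
lo, hi, n = -10.0, 14.0, 24_000
h = (hi - lo) / n
area = sum(mixture_pdf(lo + (i + 0.5) * h) for i in range(n)) * h

print(area)
print(mixture_pdf(0.0), mixture_pdf(2.0), mixture_pdf(4.0))
```

The density is higher near x = 0 and x = 4 than at x = 2 between them, which is the bimodal shape described above.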

Listing 1 plots this PDF and the sample in Figure 1.
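As a rough illustration of the sampling step, the sketch below draws 1000 points from the mixture and bins them into a density histogram. It is not the article's Listing 1: it uses only the standard library and omits plotting. The seed, the number of bins (20), and all variable names are assumptions.

```python
import random
from math import sqrt

rng = random.Random(0)  # fixed seed for reproducibility (arbitrary choice)
n = 1000                # sample size from the text

def draw():
    """Draw one value: pick a component by its mixing weight, then sample it."""
    if rng.random() < 0.7:
        return rng.gauss(0.0, sqrt(1.0))   # mean 0, variance 1
    return rng.gauss(4.0, sqrt(0.8))       # mean 4, variance 0.8

sample = [draw() for _ in range(n)]

# Density histogram: bin counts divided by (n * bin width),
# so that the bar areas sum to 1.
num_bins = 20  # assumed value
lo, hi = min(sample), max(sample)
width = (hi - lo) / num_bins
counts = [0] * num_bins
for x in sample:
    k = min(int((x - lo) / width), num_bins - 1)  # the maximum goes in the last bin
    counts[k] += 1
density = [c / (n * width) for c in counts]

sample_mean = sum(sample) / n
print(sample_mean)  # should be near 0.7 * 0 + 0.3 * 4 = 1.2
print(sum(d * width for d in density))  # total bar area
```

Dividing each count by `n * width` turns the relative frequency of a bin into an estimate of the average density over that bin, which is how a histogram relates to Equation 1.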
