An in-depth exploration of histograms and KDE
A histogram is a graph that visualizes the frequency of numerical data. It is commonly used in data science and statistics to obtain a rough estimate of the distribution of a dataset. Kernel density estimation (KDE) is a method for estimating the probability density function (PDF) of a random variable with an unknown distribution, using a random sample drawn from that distribution. It therefore lets us infer the probability density of a population from a finite dataset sampled from it, and it is widely used for this purpose in signal processing and data science. This article discusses the math and intuition behind histograms and KDE, along with their advantages and limitations. It also demonstrates how KDE can be implemented in Python from scratch. All figures in this article were created by the author.
Probability density function
Let X be a continuous random variable. The probability that X takes a value in the interval [a, b] can be written as
$$P(a \le X \le b) = \int_a^b f(x)\,dx \tag{1}$$
where f(x) is X's probability density function (PDF). The cumulative distribution function (CDF) of X is defined as:
$$F(x) = P(X \le x)$$
Hence, the CDF of X evaluated at x is the probability that X takes a value less than or equal to x. Using Equation 1, we can write:
$$F(x) = \int_{-\infty}^{x} f(t)\,dt$$
Using the fundamental theorem of calculus, we can show that
$$f(x) = \frac{dF(x)}{dx}$$
which means that the PDF of X can be determined by taking the derivative of its CDF with respect to x. A histogram is the simplest approach to estimating the PDF of a dataset, and as we show in the next section, it uses Equation 1 for this purpose.
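The two relationships above can be checked numerically. The following is a minimal sketch, assuming a standard normal distribution as the example; the helper names `normal_pdf`, `normal_cdf`, and `integrate` are illustrative and not part of the article's listings. It compares the integral of the PDF over [a, b] with the difference of the CDF at the endpoints, and the slope of the CDF at a point with the value of the PDF there.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """PDF of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of the same normal distribution, written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Equation 1: P(a <= X <= b) is the area under the PDF between a and b,
# which should match F(b) - F(a).
a, b = -1.0, 2.0
prob_from_pdf = integrate(normal_pdf, a, b)
prob_from_cdf = normal_cdf(b) - normal_cdf(a)

# The PDF is the derivative of the CDF: approximate the slope of the CDF
# at x0 with a central difference and compare it with the PDF at x0.
x0, eps = 0.5, 1e-5
cdf_slope = (normal_cdf(x0 + eps) - normal_cdf(x0 - eps)) / (2.0 * eps)

print(prob_from_pdf, prob_from_cdf)
print(cdf_slope, normal_pdf(x0))
```

Both pairs of printed values should agree to several decimal places, with small differences coming only from the numerical approximations.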
Histograms
In Listing 1, we create a bimodal distribution as a mixture of two normal distributions and draw a random sample of size 1000 from it. Here we mix two normal distributions:
$$X_1 \sim \mathcal{N}(0,\ 1), \qquad X_2 \sim \mathcal{N}(4,\ 0.8)$$
Hence, the means of the two normal distributions are 0 and 4, respectively, and their variances are 1 and 0.8, respectively. The mixing coefficients are 0.7 and 0.3, so the PDF of the mixture of these distributions is:
$$f(x) = 0.7\,\mathcal{N}(x \mid 0,\ 1) + 0.3\,\mathcal{N}(x \mid 4,\ 0.8)$$
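To make the mixture concrete, here is a small standard-library sketch that evaluates the mixture PDF and draws a sample from it. It is separate from the article's Listing 1 and does no plotting. It assumes that 0.8 is a variance, as the text states, so the standard deviation passed to the sampler is sqrt(0.8). The names `mixture_pdf` and `sample_mixture` are illustrative.

```python
import math
import random

WEIGHTS = (0.7, 0.3)    # mixing coefficients
MEANS = (0.0, 4.0)      # component means
VARIANCES = (1.0, 0.8)  # assumption: 0.8 is a variance, as stated in the text

def normal_pdf(x, mean, variance):
    """PDF of a normal distribution parameterized by mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def mixture_pdf(x):
    """Weighted sum of the two component PDFs."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(WEIGHTS, MEANS, VARIANCES))

def sample_mixture(n, seed=0):
    """Draw n points: pick a component by its weight, then sample from it."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        k = 0 if rng.random() < WEIGHTS[0] else 1
        sample.append(rng.gauss(MEANS[k], math.sqrt(VARIANCES[k])))
    return sample

sample = sample_mixture(1000)
sample_mean = sum(sample) / len(sample)

# The mean of the mixture is 0.7 * 0 + 0.3 * 4 = 1.2, so the sample mean
# should land near that value.
print(sample_mean)

# The density has a peak near each component mean and a dip in between.
print(mixture_pdf(0.0), mixture_pdf(2.0), mixture_pdf(4.0))
```

Sampling in two steps (choose a component, then draw from it) is what makes the result a mixture. Adding a draw from each component together would instead give a single normal distribution.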
Listing 1 plots this PDF and sample in Figure 1.