A Proof of the Central Limit Theorem | by Sachin Date | Apr, 2024

A Proof of the Central Limit Theorem | by Sachin Date | Apr, 2024

Let’s return to our parade of topics. An infinite series forms the basis for generating functions which is the topic I will cover next.

Generating Functions

The trick to understanding Generating is to appreciate the usefulness of a…Label Maker.

A label maker (CC BY 2.0)

Imagine that your job is to label all the shelves of newly constructed libraries, warehouses, storerooms, pretty much anything that requires an extensive application of labels. Anytime they build a new warehouse in Boogersville or revamp a library in Belchertown (I am not entirely making these names up), you get a call to label its shelves.

The Clapp Memorial Library in Belchertown, MA, USA, which I think is one of the finest public library buildings in . (CC BY-SA 3.0)

So imagine then that you just got a call to label out a shiny new warehouse. The aisles in the warehouse go from 1 through 26, and each aisle runs 50 spots deep and 5 shelves tall.

You could just print out 6500 labels like so:

A.1.1, A.1.2,…,A.1.5, A.2.1,…A.2.5,…,A50.1,…,A50.5,
B1.1,…B2.1,…,B50.5,.. and so on until Z.50.5,

And you could present yourself along with your suitcase stuffed with 6500 florescent dye coated labels at your local airport for a flight to Boogersville. It might take you a while to get through airport security.

Or here’s an idea. Why not program the sequence into your label maker? Just carry the label maker with you. At Boogersville, load the machine with a roll of tape, and off you go to the warehouse. At the warehouse, you press a button on the machine, and out flows the entire sequence for aisle ‘A’.

Your label maker is the generating function for this, and other sequences like this one:

A.1.1, A.1.2,…,A.1.5, A.2.1,…A.2.5,…,A50.1,…,A50.5

In math, a generating function is a mathematical function that you for generating sequences of your choosing so that you don’t have to remember the entire sequence.

If your proof uses a sequence of some kind, it’s often easier to substitute the sequence with its generating function. That instantly saves you the trouble of lugging around the entire sequence across your proof. Any operations, like differentiation, that you planned to perform on the sequence, you can instead perform them on its generating function.

But wait there’s more. All of the above advantages are magnified whenever the generating sequence has a closed form like the formula for e to the power x that we saw earlier.

A really simple generating function is the one shown in the figure below for the following infinite sequence: 1,1,1,1,1,…:

A generating function for an infinite sequence of 1s (Image by Author)

As you can see, a generating sequence is actually a series.

A slightly more complex generating sequence, and a famous one, is the one that generates a sequence of (n+1) binomial coefficients:

Sequence of binomial coefficients (Image by Author)

Each coefficient nCk gives you the number of different ways of choosing k out of n objects. The generating function for this sequence is the binomial of (1 + x) to the power n:

The generating function for a sequence of n+1 binomial coefficients

In both examples, it’s the coefficients of the x terms that constitute the sequence. The x terms raised to different powers are there primarily to keep the coefficients apart from each other. Without the x terms, the summation will just fuse all the coefficients into a single number.

The two examples of generating functions I showed you illustrate applications of the modestly named Ordinary Generating Function. The OGF has the following general form:

The Ordinary Generating Function (Image by Author)

Another greatly useful form is the Exponential Generating Function (EGF):

The Exponential Generating Function (Image by Author)

It’s called exponential because the value of the factorial term in the denominator increases at an exponential rate causing the values of the successive terms to diminish at an exponential rate.

The EGF has a remarkably useful property: its k-th derivative, when evaluated at x=0 isolates out the k-th element of the sequence a_k. See below for how the 3rd derivative of the above mentioned EGF when evaluated at x=0 gives you the coefficient a_3. All other terms disappear into nothingness:

Third derivative of the EGF yields a_3 (Image by Author)

Our next topic, the Taylor series, makes use of the EGF.

Taylor series

The Taylor series is a way to approximate a function using an infinite series. The Taylor series for the function f(x) goes like this:

Taylor Series expansion of f(x) at x = a (Image by Author)

In evaluating the first two terms, we use the fact that 0! = 1! = 1.

f⁰(a), f¹(a), f²(a), etc. are the 0-th, 1st, 2nd, etc. derivatives of f(x) evaluated at x=a. f⁰(a) is simple f(a). The value ‘a’ can be anything as long as the function is infinitely differentiable at x = a, that is, it’s k-th derivative exists at x = a for all k from 1 through infinity.

In spite of its startling originality, the Taylor series doesn’t always work well. It creates poor quality approximations for functions such as 1/x or 1/(1-x) which off to infinity at certain points in their domain such as at x = 0, and x = 1 respectively. These are functions with singularities in them. The Taylor series also has a hard time keeping up with functions that fluctuate rapidly. And then there are functions whose Taylor series based expansions will converge at a pace that will make continental drifts seem recklessly fast.

But let’s not be too withering of the Taylor series’ imperfections. What is really astonishing about it is that such an approximation works at all!

The Taylor series happens be to one of the most studied, and most used mathematical artifacts.

On some occasions, the upcoming proof of the CLT being one such occasion, you’ll find it useful to split the Taylor series in two parts as follows:

The Taylor expansion of f(x) series split around the r-th term of the series (Image by Author)

Here, I’ve split the series around the index ‘r’. Let’s call the two pieces T_r(x) and R_r(x). We can express f(x) in terms of the two pieces as follows:

f(x) as the sum of the Taylor polynomial of degree r, and the residual (Image by Author)

T_r(x) is known as the Taylor polynomial of order ‘r ‘ evaluated at x=a.

R_r(x) is the remainder or residual from approximating f(x) using the Taylor polynomial of order ‘r’ evaluated at x=a.

By the way, did you notice a glint of similarity between the structure of the above equation, and the general form of a linear regression model consisting of the observed value y, the modeled value β_capX, and the residual e?

The general form of a linear regression model (Image by Author)

But let’s not dim our focus.

Returning to the topic at hand, Taylor’s theorem, which we’ll use to prove the Central Limit Theorem, is what gives the Taylor’s series its legitimacy. Taylor’s theorem says that as x → a, the remainder term R_r(x) converges to 0 faster than the polynomial (x — a) raised to the power r. Shaped into an equation, the statement of Taylor’s theorem looks like this:

Taylor’s Theorem (Image by Author)

One of the great many uses of the Taylor series lies in creating a generating function for the moments of random variable. Which is what we’ll do next.

Moments and the Moment Generating Function

The k-th moment of a random variable X is the expected value of X raised to the k-th power.

The k-th raw moment of X (Image by Author)

This is known as the k-th raw moment.

The k-th moment of X around some value c is known as the k-th central moment of X. It’s simply the k-th raw moment of (X — c):

The k-th central moment of X around c (Image by Author)

The k-th standardized moment of X is the k-th central moment of X divided by k-th power of the standard deviation of X:

The k-th standardized moment of X (Image by Author)

The first 5 moments of X have specific values or meanings attached to them as follows:

  • The zeroth’s raw and central moments of X are E(X⁰) and E[(X — c)⁰] respectively. Both equate to 1.
  • The 1st raw moment of X is E(X). It’s the mean of X.
  • The second central moment of X around its mean is E[X — E(X)]². It’s the variance of X.
  • The third and fourth standardized moments of X are E[X — E(X)]³/σ³, and E[X — E(X)]⁴/σ⁴. They are the skewness and kurtosis of X respectively. Recall that skewness and kurtosis of X are used by the Jarque-Bera test of normality to test if X is normally distributed.

After the 4th moment, the interpretations become assuredly murky.

With so many moments flying around, wouldn’t it be terrific to have a generating function for them? That’s what the Moment Generating Function (MGF) is for. The Taylor series makes it super-easy to create the MGF. Let’s see how to create it.

We’ll define a new random variable tX where t is a real number. Here’s the Taylor series expansion of e to the power tX evaluated at t = 0:

Taylor series expansion of e to the power tX at t = 0 (Image by Author)

Let’s apply the Expectation operator on both sides of the above equation:

(Image by Author)

By linearity (and ) rule of expectation: E(aX + bY) = aE(X) + bE(Y), we can move the Expectation operator inside the summation as follows:

(Image by Author)

Recall that E(X^k] are the raw moments of X for k = 0,1,23,…

Let’s compare Eq. (2) with the general form of an Exponential Generating Function:

The Exponential Generating Function (Image by Author)

What do we observe? We see that E(X^k] in Eq. (2) are the coefficients a_k in the EGF. Thus Eq. (2) is the generating function for the moments of X, and so the formula for the Moment Generating Function of X is the following:

The formula for the Moment Generating Function of X (Image by Author)

The MGF has many interesting properties. We’ll use a few of them in our proof of the Central Limit Theorem.

Remember how the k-th derivative of the EGF when evaluated at x = 0 gives us the k-th coefficient of the underlying sequence? We’ll use this property of the EGF to pull out the moments of X from its MGF.

The zeroth derivative of the MGF of X evaluated at t = 0 is obtained by simply substituting t = 0 in Eq. (3). M⁰_X(t=0) evaluates to 1. The first, second, third, etc. derivatives of the MGF of X evaluated at t = 0 are denoted by M¹_X(t=0), M²_X(t=0), M³_X(t=0), etc. They evaluate respectively to the first, second, third etc. raw moments of X as shown below:

The derivatives of M(X) evaluate to the moments of X (Image by Author)

This gives us our first interesting and useful property of the MGF. The k-th derivative of the MGF evaluated at t = 0 is the k-th raw moment of X.

The k-th derivative of the MGF of X is the k-th raw moment of X (Image by Author)

The second property of MGFs which we’ll find useful in our upcoming proof is the following: if two random variables X and Y have identical Moment Generating Functions, then X and Y have identical Cumulative Distribution Functions:

Identical MGFs implies identical CDFs (Image by Author)

If X and Y have identical MGFs, it implies that their mean, variance, skewness, kurtosis, and all higher order moments (whatever humanly unfathomable aspects of reality those moments might represent) are all one-is-to-one identical. If every single property exhibited by the shapes of X and Y‘s CDF is correspondingly the same, you’d expect their CDFs to also be identical.

The third property of MGFs we’ll use is the following one that applies to X when X scaled by ‘a’ and translated by ‘b’:

MGF of aX + b (Image by Author)

The fourth property of MGFs that we’ll use applies to the MGF of the sum of ‘n’ , identically distributed random variables:

MGF of sum of n i.i.d. random variables (Image by Author)

A final result, before we prove the CLT, is the MGF of a standard normal random variable N(0, 1) which is the following (you may want to compute this as an exercise):

MGF of a standard normal random variable (Image by Author)

Speaking of the standard normal random variable, as shown in Eq. (4), the first, second, third, and fourth derivatives of the MGF of N(0, 1) when evaluated at t = 0 will give you the first moment (mean) as 0, the second moment (variance) as 1, the third moment (skew) as 0, and the fourth moment (kurtosis) as 1.

And with that, the machinery we need to prove the CLT is in place.

Proof of the Central Limit Theorem

Let X_1, X_2,…,X_n be ‘n’ i. i. d. random variables that form a random sample of size ‘n’. Assume that we’ve drawn this sample from a population that has a mean μ and variance σ².

Let X_bar_n be the sample mean:

The formula for the sample mean (Image by Author)

Let Z_bar_n be the standardized sample mean:

The standardized sample mean (Image by Author)

The Central Limit Theorem states that as ‘n’ tends to infinity, Z_bar_n converges in distribution to N(0, 1), i.e. the CDF of Z_bar_n becomes identical to the CDF of N(0, 1) which is often represented by the Greek letter ϕ (phi):

The CDF of Z_bar_n becomes identical to the CDF of N(0,1) as n → ∞ (Image by Author)

To prove this statement, we’ll use the property of the MGF (see Eq. 5) that if the MGFs of X and Y are identical, then so are their CDFs. Here, it’ll be sufficient to show that as n tends to infinity, the MGF of Z_bar_n converges to the MGF of N(0, 1) which as we know (see Eq. 8) is ‘e’ to the power t²/2. In short, we’d want to prove the following identity:

(Image by Author)

Let’s define a random variable Z_k as follows:

(Image by Author)

We’ll now express the standardized mean Z_bar_n in terms of Z_k as shown below:

(Image by Author)

Next, we apply the MGF operator on both sides of Eq. (9):

(Image by Author)

By construction, Z_1/√n, Z_2/√n, …, Z_n/√n are independent random variables. So we can use property (7a) of MGFs which expresses the MGF of the sum of n independent random variables:

MGF of sum of n independent random variables (Image by Author)

By their definition, Z_1/√n, Z_2/√n, …, Z_n/√n are also identical random variables. So we award ourselves the liberty to assume the following:

Z_1/√n = Z_2/√n = … = Z_n/√n = Z/√n.

Therefore using property (7b) we get:

MGF of n i.i.d. random variables (Image by Author)

Finally, we’ll also use the property (6) to express the MGF of a random variable (in this case, Z) that is scaled by a constant (in this case, 1/√n) as follows:

(Image by Author)

With that, we have converted our original goal of finding the MGF of Z_bar_n into the goal of finding the MGF of Z/√n.

M_Z(t/√n) is a function like any other function that takes (t/√n) as a parameter. So we can create a Taylor series expansion of M_Z(t/√n) at t = 0 as follows:

Taylor series expansion of M_Z(t/√n) at a = 0 (Image by Author)

Next, we split this expansion into two parts. The first part is a finite series of three terms corresponding to k = 0, k = 1, and k = 2. The second part is the remainder of the infinite series:

Taylor series expansion of M_Z(t/√n) at a = 0 split into two parts (Image by Author)

In the above series, M⁰, M¹, M², etc. are the 0-th, 1st, 2nd, and so on derivatives of the Moment Generating Function M_Z(t/√n) evaluated at (t/√n) = 0. We’ve seen that these derivatives of the MGF happen to be the 0-th, 1st, 2nd, etc. moments of Z.

The 0-th moment, M⁰(0), is always 1. Recall that Z is, by its construction, a standard normal random variable. Hence, its first moment (mean), M¹(0), is 0, and its second moment (variance), M²(0), is 1. With these values in hand, we can express the above Taylor series expansion as follows:

After substituting M⁰ = 1, M¹ = 0, and M² = 1 (Image by Author)

Another way to express the above expansion of M_Z is as the sum of a Taylor polynomial of order 2 which captures the first three terms of the expansion, and a residue term that captures the summation:

Taylor series expansion of M_Z(t/√n) at a = 0 expressed as an order-2 Taylor polynomial and a remainder term (Image by Author)

We’ve already evaluated the order-2 Taylor polynomial. So our task of finding the MGF of Z is now further reduced to calculating the remainder term R_2.

Before we tackle the task of R_2, let’s step back and review what we want to prove. We wish to prove that as the sample size ‘n’ tends to infinity, the standardized sample mean Z_bar_n converges in distribution to the standard normal random variable N(0, 1):

The CDF of Z_bar_n becomes identical to the CDF of N(0,1) as n → ∞ (Image by Author)

To prove this we realized that it was sufficient to prove that the MGF of Z_bar_n will converge to the MGF of N(0, 1) as n tends to infinity.

(Image by Author)

And that led us on a quest to find the MGF of Z_bar_n shown first in Eq. (10), and which I am reproducing below for reference:

(Image by Author)

But it is really the limit of this MGF as n tends to infinity that we not only wish to calculate, but also show it to be equal to e to the power t²/2.

To make it to that goal, we’ll unpack and simplify the contents of Eq. (10) by sequentially applying result (12) followed by result (11) as follows:

Here we come to an uncomfortable place in our proof. Look at the equation on the last line in the above panel. You cannot just force the limit on the R.H.S. into the large bracket and zero out the yellow term. The trouble with making such a misinformed move is that there is an ‘n’ looming large in the exponent of the large bracket — the very n that wants to march away to infinity. But now get this: I said you cannot force the limit into the large bracket. I never said you cannot sneak it in.

So we shall make a sly move. We’ll show that the remainder term R_2 colored in yellow independently converges to zero as n tends to infinity no matter what its exponent is. If we succeed in that endeavor, common-sense reasoning suggests that it will be ‘legal’ to extinguish it out of the R.H.S., exponent or no exponent.

To show this, we’ll use Taylor’s theorem which I introduced in Eq. (1), and which I am reproducing below for your reference:

Taylor’s Theorem (Image by Author)

We’ll bring this theorem to bear upon our pursuit by setting x to (t/√n), and r to 2 as follows:

Taylor’s theorem for x = (t/√n), and r = 2 (Image by Author)

Next, we set a = 0, which instantly allows us to switch the limit:

(t/√n) → 0, to,

n → ∞, as follows:

Taylor’s theorem for x = (t/√n), r = 2, and a = 0 (Image by Author)

Now we make an important and not entirely obvious observation. In the above limit, notice how the L.H.S. will tend to zero as long as n tends to infinity independent of what value t has as long as it’s finite. In other words, the L.H.S. will tend to zero for any finite value of t since the limiting behavior is driven entirely by the (√n)² in the denominator. With this revelation comes the to drop t² from the denominator without changing the limiting behavior of the L.H.S. And while we’re at it, let’s also swing over the (√n)² to the numerator as follows:

(Image by Author)

Let this result hang in your mind for a few seconds, for you’ll need it shortly. Meanwhile, let’s return to the limit of the MGF of Z_bar_n as n tends to infinity. We’ll make some more progress on simplifying the R.H.S of this limit, and then sculpting it into a certain shape:

Some algebraic manipulation of the limit of Z_bar_n (Image by Author)

It may not look like it, but with Eq. (14), we are literally two steps away from proving the Central Limit Theorem.

All thanks to Jacob Bernoulli’s blast-from-the-past discovery of the product-series based formula for ‘e’.

So this will be the point to fetch a few balloons, confetti, party horns or whatever.

?

Here, we go:

We’ll use Eq. (13) to extinguish the green colored term in Eq. (14):

After using Eq. (13) in Eq. (14) (Image by Author)

Next we’ll use the following infinite product series for (e to the power x):

Infinite product series for e to the power x (Image by Author)

Get your party horns ready.

In the above equation, set x = t²/2 and substitute this result in the R.H.S. of Eq. (15), and you have proved the Central Limit Theorem:

As the sample size tends to infinity, the MGF of the standardized sample mean equates to the MGF of the standard normal random variable (Image by Author)
(Image by Author)

Source link