In preparation for the next post on the central limit theorem, it’s worth recalling the fundamental results on convergence of the average of a sequence of random variables: the law of large numbers (both weak and strong), and its strengthening to non-IID sequences, the Birkhoff ergodic theorem.
1. Convergence of random variables
First we need to recall the different ways in which a sequence of random variables may converge. Let $X_n$ be a sequence of real-valued random variables and $X$ a single random variable to which we want the sequence $X_n$ to “converge”. There are various ways of formalising this.
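1.1. Almost sure convergence
The strongest notion is almost sure convergence: we write $X_n \xrightarrow{a.s.} X$ if $\mathbb{P}(X_n \to X) = 1$; that is, if the set of outcomes $\omega$ for which $X_n(\omega) \to X(\omega)$ has full probability.
1.2. Convergence in probability
A weaker notion is convergence in probability: we write $X_n \xrightarrow{p} X$ if $\mathbb{P}(|X_n - X| \geq \epsilon) \to 0$ as $n \to \infty$ for every $\epsilon > 0$.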
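Almost sure convergence implies convergence in probability (by Egorov’s theorem), but not vice versa. For example, let $I_n \subset [0,1]$ be any sequence of intervals whose lengths tend to $0$ and such that for every $x \in [0,1]$ the sets
$$\{n : x \in I_n\} \quad\text{and}\quad \{n : x \notin I_n\}$$
are both infinite. Let $\Omega = [0,1]$ with Lebesgue measure and let $X_n$ be the characteristic function of the interval $I_n$. Then $X_n \xrightarrow{p} 0$, because $\mathbb{P}(X_n \neq 0)$ is the length of $I_n$, but $X_n$ does not converge to $0$ almost surely: for every $x$ the sequence $X_n(x)$ takes both values $0$ and $1$ infinitely often.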
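One concrete choice of such intervals is the so-called typewriter sequence, which sweeps the dyadic intervals of length $2^{-k}$ across $[0,1]$ for $k = 0, 1, 2, \dots$. The Python sketch below is only an illustration (the function names and the sample point $x = 1/3$ are our own choices): it shows the interval lengths shrinking while the indicator at a fixed point keeps returning to 1.

```python
from fractions import Fraction

def typewriter_interval(n):
    """Return (left, right) for the n-th interval, n >= 1.

    Write n = 2**k + j with 0 <= j < 2**k; the interval is
    [j / 2**k, (j + 1) / 2**k], which has length 2**-k.
    """
    k = n.bit_length() - 1      # largest k with 2**k <= n
    j = n - 2**k
    return Fraction(j, 2**k), Fraction(j + 1, 2**k)

def indicator(n, x):
    """Value at x of X_n, the characteristic function of the n-th interval."""
    left, right = typewriter_interval(n)
    return 1 if left <= x <= right else 0

x = Fraction(1, 3)   # an arbitrary sample point in [0, 1]
N = 2**10
# Indices n < N at which X_n(x) = 1: one per dyadic level k = 0, ..., 9.
hits = [n for n in range(1, N) if indicator(n, x) == 1]
lengths = [float(r - l) for l, r in map(typewriter_interval, (1, 10, 100, 1000))]

print("interval lengths at n = 1, 10, 100, 1000:", lengths)
print("number of n < 1024 with X_n(x) = 1:", len(hits))
```

Each dyadic level contributes exactly one index $n$ with $X_n(x) = 1$, so the hits never stop, even though the probability of a hit at time $n$ tends to zero.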
1.3. Convergence in distribution
A still weaker notion of convergence is convergence “in distribution”: we write $X_n \xrightarrow{d} X$ if, writing $F_n, F$ for the cumulative distribution functions of $X_n$ and $X$, we have $F_n(t) \to F(t)$ at all $t$ where $F$ is continuous.
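To see why the definition only asks for convergence at continuity points of the limiting CDF, consider the constant random variables $X_n = 1/n$ and $X = 0$ (an example we add here for illustration). The short Python sketch below evaluates the two CDFs at a few points.

```python
def F_n(t, n):
    """CDF of the constant random variable X_n = 1/n."""
    return 1.0 if t >= 1.0 / n else 0.0

def F(t):
    """CDF of the constant random variable X = 0."""
    return 1.0 if t >= 0.0 else 0.0

for t in (-0.1, 0.0, 0.1):
    values = [F_n(t, n) for n in (1, 10, 100, 1000)]
    print(t, values, "limit CDF:", F(t))
```

At $t = 0$, the only discontinuity of the limiting CDF, the values $F_n(0) = 0$ never approach $F(0) = 1$, yet $X_n$ clearly ought to converge to $X$; at every other $t$ the CDFs do converge.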
Convergence in probability implies convergence in distribution, but the converse fails if $X$ is not a.s.-constant. Here is one broad class of examples showing this: suppose $X$ has $\mathbb{P}(X \in I) = \mathbb{P}(X \in -I)$ for every interval $I \subset \mathbb{R}$ (for example, this is true if $X$ is normal with zero mean). Then $X$ and $-X$ have the same CDF, and so any sequence which converges in distribution to one of the two will also converge in distribution to the other; on the other hand, $X_n$ cannot converge in probability to both $X$ and $-X$ unless $X = 0$ a.s.
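A quick simulation makes the symmetric example concrete. In the sketch below (sample size, seed, and the threshold 0.5 are arbitrary choices of ours) we take a standard normal variable and the constant sequence equal to its negative: the empirical CDFs nearly coincide, while the distance between the variable and its negative stays large with high probability.

```python
import random

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(20000)]   # draws of X
negated = [-x for x in samples]                             # draws of X_n = -X

def empirical_cdf(data, t):
    """Fraction of the data that is <= t."""
    return sum(1 for d in data if d <= t) / len(data)

# The two empirical CDFs nearly agree, as the symmetry of X predicts.
max_cdf_gap = max(abs(empirical_cdf(samples, t) - empirical_cdf(negated, t))
                  for t in (-1.0, -0.5, 0.0, 0.5, 1.0))

# But |X_n - X| = 2|X| is not small with high probability.
eps = 0.5
prob_far = sum(1 for x in samples if abs(-x - x) > eps) / len(samples)

print("largest CDF gap on the grid:", max_cdf_gap)
print("estimated P(|X_n - X| > 0.5):", prob_far)
```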
2. Weak law of large numbers
Given a sequence of real-valued random variables $X_n$, we consider the sums
$$S_n = X_1 + X_2 + \cdots + X_n.$$
Then $\frac{1}{n} S_n$ is the average of the first $n$ observations.
Suppose that the sequence $X_n$ is independent and identically distributed (IID) and that $X_1$ is integrable, that is, $\mathbb{E}[|X_1|] < \infty$. Then in particular the mean $m = \mathbb{E}[X_1]$ is finite. The weak law of large numbers says that $\frac{1}{n} S_n$ converges in probability to the constant function $m$. Because the limiting distribution here is a constant, it is enough to show convergence in distribution. This fact leads to a well-known proof of the weak law of large numbers using characteristic functions.
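Before turning to the proof, here is a small Monte Carlo sketch of what the weak law asserts. The distribution (uniform on $[0,1]$, with mean $1/2$), the tolerance 0.05, and the number of trials are all illustrative assumptions of ours; the code estimates the probability that the average of $n$ draws deviates from the mean by at least the tolerance.

```python
import random

random.seed(0)
MEAN = 0.5      # mean of the uniform distribution on [0, 1]
EPS = 0.05
TRIALS = 2000

def sample_average(n):
    """Average of n independent uniform [0, 1] draws."""
    return sum(random.random() for _ in range(n)) / n

def deviation_probability(n):
    """Monte Carlo estimate of P(|S_n / n - MEAN| >= EPS)."""
    far = sum(1 for _ in range(TRIALS) if abs(sample_average(n) - MEAN) >= EPS)
    return far / TRIALS

probs = {n: deviation_probability(n) for n in (10, 100, 1000)}
for n, p in probs.items():
    print(n, p)
```

The estimated deviation probabilities should shrink towards zero as $n$ grows, which is exactly convergence in probability of the averages to the mean.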
If a random variable $X$ is absolutely continuous, that is, if it has a probability density function $f$, then its characteristic function $\varphi_X$ is the Fourier transform of $f$. More generally, the characteristic function of $X$ is
$$\varphi_X(t) = \mathbb{E}[e^{itX}].$$
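Characteristic functions are related to convergence in distribution by Lévy’s continuity theorem, which says (among other things) that $X_n \xrightarrow{d} X$ if and only if $\varphi_{X_n}(t) \to \varphi_X(t)$ for all $t \in \mathbb{R}$. In particular, to prove the weak law of large numbers it suffices to show that the characteristic functions of $\frac{1}{n} S_n$ converge pointwise to the function $t \mapsto e^{itm}$, which is the characteristic function of the constant $m$.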
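Let $\varphi$ be the characteristic function of $X_1$. (Note that each $X_j$ has the same characteristic function because they are identically distributed.) Let $\varphi_n$ be the characteristic function of $\frac{1}{n} S_n$; then, using independence in the second equality,
$$\varphi_n(t) = \mathbb{E}\big[e^{\frac{it}{n}(X_1 + \cdots + X_n)}\big] = \prod_{j=1}^n \mathbb{E}\big[e^{\frac{it}{n} X_j}\big] = \varphi\big(\tfrac{t}{n}\big)^n. \tag{6}$$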
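By Taylor’s theorem and by linearity of expectation, we have for $t \to 0$ that
$$\varphi(t) = \mathbb{E}[e^{itX_1}] = \mathbb{E}[1 + itX_1 + o(t)] = 1 + itm + o(t),$$
and together with (6) this gives, for each fixed $t$,
$$\varphi_n(t) = \Big(1 + \frac{itm}{n} + o\big(\tfrac{1}{n}\big)\Big)^n \to e^{itm} \quad \text{as } n \to \infty,$$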
which completes the proof.
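The limit can be checked numerically in a case where the characteristic function is known in closed form. As an illustrative assumption (not taken from the argument above), take the exponential distribution with rate 1, whose characteristic function is $1/(1 - it)$ and whose mean is 1; the sketch below evaluates the $n$-th power of this function at $t/n$, as in (6), and compares it with $e^{it}$.

```python
import cmath

def phi(t):
    """Characteristic function of an Exponential(1) variable (mean m = 1)."""
    return 1 / (1 - 1j * t)

def phi_avg(t, n):
    """Characteristic function of the average of n IID copies, via (6)."""
    return phi(t / n) ** n

t, m = 2.0, 1.0
target = cmath.exp(1j * t * m)   # characteristic function of the constant m
errors = {n: abs(phi_avg(t, n) - target) for n in (1, 10, 100, 1000, 10000)}
for n, err in errors.items():
    print(n, err)
```

The error decays roughly like $1/n$ here, consistent with the $o(1/n)$ remainder in the Taylor expansion.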
3. Strong law of large numbers and ergodic theorem
The strong law of large numbers states that not only does $\frac{1}{n} S_n$ converge to $m$ in probability, it also converges almost surely. This takes a little more work to prove. Rather than describe a proof here (a nice discussion of both laws, including a different proof of the weak law than the one above, can be found on Terry Tao’s blog), we observe that the strong law of large numbers can be viewed as a special case of the Birkhoff ergodic theorem, and then give a proof of this result. First we state the ergodic theorem (or at least, the version of it that is most relevant for us).
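Theorem 1 (Birkhoff ergodic theorem). Let $(Y, \mathcal{B}, \mu)$ be a probability space and $T\colon Y \to Y$ a measurable transformation. Suppose that $\mu$ is $T$-invariant and ergodic. Then for any $f \in L^1(\mu)$, we have
$$\frac{1}{n} S_n f(x) \to \int f \, d\mu \tag{7}$$
for $\mu$-a.e. $x \in Y$, where $S_n f(x) = f(x) + f(Tx) + \cdots + f(T^{n-1}x)$.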
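To get a feel for the statement, one can compute ergodic averages for a simple system. The sketch below uses rotation of the circle $[0,1)$ by an irrational angle, which preserves Lebesgue measure and is ergodic; this system, the observable $f(x) = x^2$, and the starting point are our own illustrative choices and are not part of the discussion above.

```python
import math

ALPHA = (math.sqrt(5) - 1) / 2   # irrational rotation angle (illustrative choice)

def T(x):
    """Rotation of the circle [0, 1) by ALPHA."""
    return (x + ALPHA) % 1.0

def f(x):
    """Test observable; its integral over [0, 1] is 1/3."""
    return x * x

def birkhoff_average(x, n):
    """Return (1/n) * (f(x) + f(Tx) + ... + f(T^{n-1} x))."""
    total = 0.0
    for _ in range(n):
        total += f(x)
        x = T(x)
    return total / n

averages = {n: birkhoff_average(0.1, n) for n in (10, 1000, 100000)}
for n, avg in averages.items():
    print(n, avg, abs(avg - 1 / 3))
```

The time averages along a single orbit approach the space average $1/3$, which is what the theorem predicts for almost every starting point.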
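Before giving a proof, we describe how the strong law of large numbers is a special case of Theorem 1. Let $X_n$ be a sequence of IID random variables $\Omega \to \mathbb{R}$, and define a map $\pi\colon \Omega \to \mathbb{R}^{\mathbb{N}}$ by
$$\pi(\omega) = (X_1(\omega), X_2(\omega), X_3(\omega), \dots).$$
Let $\nu$ be the probability measure on $\Omega$ that determines $\mathbb{P}$, and let $\mu = \nu \circ \pi^{-1}$ be the corresponding probability measure on $\mathbb{R}^{\mathbb{N}}$.
Because the variables $X_n$ are independent, $\mu$ has the form $\mu = \mu_1 \times \mu_2 \times \cdots$, and because they are identically distributed, all the marginal distributions $\mu_n$ are the same, so in fact $\mu = \rho^{\mathbb{N}}$ for some probability distribution $\rho$ on $\mathbb{R}$.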
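The measure $\mu$ is invariant and ergodic with respect to the dynamics on $\mathbb{R}^{\mathbb{N}}$ given by the shift map $T(x_1, x_2, x_3, \dots) = (x_2, x_3, x_4, \dots)$ (this is an example of a Bernoulli measure). Writing $x = (x_1, x_2, \dots)$ and putting $f(x) = x_1$, we see that for $x = \pi(\omega)$ we have
$$\frac{1}{n} S_n f(x) = \frac{1}{n}(x_1 + \cdots + x_n) = \frac{1}{n}\big(X_1(\omega) + \cdots + X_n(\omega)\big).$$
Moreover $f$ is integrable with $\int f \, d\mu = \mathbb{E}[X_1] = m$. In particular, the convergence in (7) implies the strong law of large numbers.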
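The dictionary between the shift and the running averages can be spelled out in code. The sketch below works with a finite truncation of one sample point (the uniform distribution, the seed, and the truncation length are arbitrary assumptions of ours): it checks that the ergodic sum of the first-coordinate function along the shift is the ordinary partial sum, and then looks at running averages along this single sample path.

```python
import random

random.seed(1)
N = 100000
x = [random.random() for _ in range(N)]   # first N coordinates of a sample point

def shift(seq):
    """Shift map T on a finite truncation of a sequence: drop the first coordinate."""
    return seq[1:]

def f(seq):
    """Observable reading off the first coordinate."""
    return seq[0]

# Check that f(x) + f(Tx) + ... + f(T^{n-1} x) = x_1 + ... + x_n for a small n.
n_small = 5
y, birkhoff_sum = x, 0.0
for _ in range(n_small):
    birkhoff_sum += f(y)
    y = shift(y)
partial_sum = sum(x[:n_small])

# Running averages along this single sample path.
running = {n: sum(x[:n]) / n for n in (10, 1000, N)}

print("ergodic sum vs partial sum:", birkhoff_sum, partial_sum)
for n, avg in running.items():
    print(n, avg)
```

Unlike the weak law, which concerns probabilities over many repetitions, the strong law is a statement about one sample path at a time, which is why a single sequence suffices for this experiment.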
4. Proving the ergodic theorem
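To prove the ergodic theorem, it suffices to consider a function $f \in L^1(\mu)$ with $\int f \, d\mu = 0$ and show that the set
$$A_\epsilon = A_\epsilon(f) = \Big\{x : \limsup_{n \to \infty} \frac{1}{n} S_n f(x) > \epsilon\Big\}$$
has $\mu(A_\epsilon) = 0$ for every $\epsilon > 0$. Indeed, after replacing $f$ by $f - \int f \, d\mu$, the set of points where (7) fails is the (countable) union of the sets $A_{1/k}(f)$ and $A_{1/k}(-f)$ over $k \in \mathbb{N}$, and thus has $\mu$-measure zero if this result holds.
Note that $A_\epsilon$ is $T$-invariant, and so by ergodicity we either have $\mu(A_\epsilon) = 0$ or $\mu(A_\epsilon) = 1$. We assume that $\mu(A_\epsilon) = 1$ and derive a contradiction by showing that this implies $\int f \, d\mu \geq \epsilon > 0$.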
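The key ingredient is the following inequality: for any integrable function $g$, writing $E_g = \{x : \sup_{n \geq 1} S_n g(x) > 0\}$ with $S_n g = g + g \circ T + \cdots + g \circ T^{n-1}$, we have
$$\int_{E_g} g \, d\mu \geq 0; \tag{10}$$
this is the content of the maximal ergodic theorem.
Now consider $g = f - \epsilon$, which is integrable, and so (10) holds for $g$. For every $x \in A_\epsilon$ (that is, whenever $\limsup_n \frac{1}{n} S_n f(x) > \epsilon$) there is some $n$ with $S_n f(x) > n\epsilon$, or equivalently $S_n g(x) > 0$, so $A_\epsilon \subset E_g$ and hence $\mu(E_g) = 1$. Thus (10) gives $\int g \, d\mu \geq 0$, and in particular $\int f \, d\mu \geq \epsilon > 0$, contradicting the assumption on $f$. This proves that $\mu(A_\epsilon) = 0$, which as described above is enough to prove that (7) holds $\mu$-a.e.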