In preparation for the next post on the central limit theorem, it’s worth recalling the fundamental results on convergence of the average of a sequence of random variables: the law of large numbers (both weak and strong), and its strengthening to non-IID sequences, the Birkhoff ergodic theorem.

**1. Convergence of random variables**

First we need to recall the different ways in which a sequence of random variables may converge. Let $Y_n$ be a sequence of real-valued random variables and $Y$ a single random variable to which we want the sequence to “converge”. There are various ways of formalising this.

**1.1. Almost sure convergence**

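The strongest notion of convergence is “almost sure” convergence: we write $Y_n \xrightarrow{\text{a.s.}} Y$ if

$$P(Y_n \to Y) = 1. \tag{1}$$

If $\Omega$ is the probability space on which the random variables are defined and $\nu$ is the probability measure defining $P$, then this condition can be rewritten as

$$\nu\{\omega \in \Omega \mid Y_n(\omega) \to Y(\omega)\} = 1. \tag{2}$$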

**1.2. Convergence in probability**

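A weaker notion of convergence is convergence “in probability”: we write $Y_n \xrightarrow{p} Y$ if

$$P(|Y_n - Y| \geq \epsilon) \to 0 \quad \text{for every } \epsilon > 0. \tag{3}$$

In terms of $\Omega$ and $\nu$, this condition is

$$\nu\{\omega \in \Omega \mid |Y_n(\omega) - Y(\omega)| \geq \epsilon\} \to 0 \quad \text{for every } \epsilon > 0. \tag{4}$$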

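Almost sure convergence implies convergence in probability (by Egorov’s theorem), but not vice versa. For example, let $I_n \subset [0,1]$ be any sequence of intervals whose lengths tend to $0$ and such that for every $x \in [0,1]$ the sets

$$\{n \mid x \in I_n\}, \qquad \{n \mid x \notin I_n\}$$

are both infinite. Let $\Omega = [0,1]$ with Lebesgue measure and let $Y_n$ be the characteristic function of the interval $I_n$. Then $Y_n \xrightarrow{p} 0$, because $P(|Y_n| \geq \epsilon)$ is at most the length of $I_n$, but $Y_n$ does not converge to $0$ almost surely; in fact $Y_n(x)$ does not converge for any $x$.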
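One concrete choice of such intervals is the so-called typewriter sequence, which sweeps across $[0,1]$ with dyadic intervals of shrinking length. The sketch below is only an illustration: the indexing scheme and the names `typewriter_interval` and `Y` are assumptions made here, not part of the original post. It checks that the interval lengths shrink while a fixed point (here $x = 1/3$) keeps being hit, once in every dyadic block.

```python
from fractions import Fraction

def typewriter_interval(n):
    """Return the endpoints (a, b) of the n-th interval I_n = [a, b], n >= 1.

    Block k (k = 0, 1, 2, ...) consists of the 2**k intervals
    [j / 2**k, (j + 1) / 2**k], j = 0, ..., 2**k - 1, listed left to right.
    """
    k = n.bit_length() - 1      # block index: 2**k <= n < 2**(k + 1)
    j = n - 2**k                # position inside block k
    return Fraction(j, 2**k), Fraction(j + 1, 2**k)

def Y(n, x):
    """Characteristic function of I_n evaluated at x."""
    a, b = typewriter_interval(n)
    return 1 if a <= x <= b else 0

x = Fraction(1, 3)
N = 2**12

# Indices n < N at which Y_n(x) = 1: exactly one per dyadic block.
hits = [n for n in range(1, N) if Y(n, x) == 1]

# Lengths of a few intervals; these equal P(Y_n = 1) under Lebesgue measure.
lengths = [typewriter_interval(n)[1] - typewriter_interval(n)[0]
           for n in (1, 10, 100, 1000)]

print("interval lengths:", [float(length) for length in lengths])
print("number of n < 2**12 with Y_n(1/3) = 1:", len(hits))
```

The lengths tend to zero, which gives convergence in probability, while the list of hits keeps growing with the number of blocks, so $Y_n(1/3)$ takes both values $0$ and $1$ infinitely often.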

**1.3. Convergence in distribution**

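A still weaker notion of convergence is convergence “in distribution”: we write $Y_n \xrightarrow{d} Y$ if, writing $F_n, F \colon \mathbb{R} \to [0,1]$ for the cumulative distribution functions of $Y_n$ and $Y$, we have $F_n(t) \to F(t)$ at all $t \in \mathbb{R}$ where $F$ is continuous.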

Convergence in probability implies convergence in distribution, but the converse fails if $Y$ is not a.s.-constant. Here is one broad class of examples showing this: suppose $Y$ has $P(Y \in I) = P(Y \in -I)$ for every interval $I \subset \mathbb{R}$ (for example, this is true if $Y$ is normal with zero mean). Then $Y$ and $-Y$ have the same CDF, and so any sequence which converges in distribution to one of the two will also converge in distribution to the other; on the other hand, $Y_n$ cannot converge in probability to both $Y$ and $-Y$ unless $Y = 0$ a.s.
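A small simulation can make this concrete. The sketch below assumes the simplest instance of the example: $Y$ is standard normal and the sequence is constant, $Y_n = -Y$ for every $n$. The sample size, the seed and the helper name `ecdf` are arbitrary choices made for illustration. The constant sequence trivially converges in distribution to $Y$ (same CDF), yet $|Y_n - Y| = 2|Y|$ does not become small.

```python
import random

random.seed(0)                                     # fixed seed so the run is reproducible
N = 20000
ys = [random.gauss(0.0, 1.0) for _ in range(N)]    # samples of Y ~ N(0, 1)
neg_ys = [-y for y in ys]                          # the corresponding samples of -Y

def ecdf(samples, t):
    """Empirical CDF: fraction of samples that are <= t."""
    return sum(1 for s in samples if s <= t) / len(samples)

# Y and -Y have (approximately) the same empirical CDF at a few test points ...
cdf_gap = max(abs(ecdf(ys, t) - ecdf(neg_ys, t))
              for t in (-1.0, -0.5, 0.0, 0.5, 1.0))

# ... but Y_n = -Y stays far from Y with a fixed positive probability:
# |Y_n - Y| = 2|Y|, so P(|Y_n - Y| >= 1) = P(|Y| >= 1/2), roughly 0.617 for N(0, 1).
far_fraction = sum(1 for y in ys if abs(-y - y) >= 1.0) / N

print("max CDF gap on test points:", cdf_gap)
print("fraction with |Y_n - Y| >= 1:", far_fraction)
```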

**2. Weak law of large numbers**

Given a sequence of real-valued random variables $X_n$, we consider the sums

$$S_n = X_1 + X_2 + \cdots + X_n.$$

Then $\frac{1}{n} S_n$ is the average of the first $n$ observations.

Suppose that the sequence $X_n$ is independent and identically distributed (IID) and that $X_n$ is integrable, that is, $E[|X_n|] < \infty$. Then in particular the mean $m = E[X_n]$ is finite. The weak law of large numbers says that $\frac{1}{n} S_n$ converges in probability to the constant function $m$. Because the limiting distribution here is a constant, it is enough to show convergence in distribution. This fact leads to a well-known proof of the weak law of large numbers using characteristic functions.
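The statement can be explored numerically before proving it. The following Monte Carlo sketch estimates the probability that the average deviates from the mean by at least a tolerance `eps`, for IID Uniform$[0,1]$ variables (mean $1/2$). The distribution, the tolerance, the number of trials and the function name `deviation_probability` are all illustrative assumptions; the estimate is itself random, so it suggests the weak law rather than proving it.

```python
import random

def deviation_probability(n, eps, trials=500, seed=0):
    """Monte Carlo estimate of P(|S_n / n - m| >= eps) for IID Uniform[0, 1] variables."""
    rng = random.Random(seed)   # private generator, so results are reproducible
    m = 0.5                     # mean of Uniform[0, 1]
    bad = 0
    for _ in range(trials):
        average = sum(rng.random() for _ in range(n)) / n
        if abs(average - m) >= eps:
            bad += 1
    return bad / trials

for n in (10, 100, 1000):
    print(n, deviation_probability(n, eps=0.05))
```

The estimated probabilities shrink towards zero as `n` grows, which is what convergence in probability to the constant $1/2$ predicts.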

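If a random variable $Y$ is absolutely continuous, that is, if it has a probability density function $\rho$, then its characteristic function $\varphi_Y$ is the Fourier transform of $\rho$ (up to the choice of sign and normalisation conventions). More generally, the characteristic function of $Y$ is

$$\varphi_Y(t) = E[e^{itY}]. \tag{5}$$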

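Characteristic functions are related to convergence in distribution by Lévy’s continuity theorem, which says (among other things) that $Y_n \xrightarrow{d} Y$ if and only if $\varphi_{Y_n}(t) \to \varphi_Y(t)$ for all $t \in \mathbb{R}$. In particular, to prove the weak law of large numbers it suffices to show that the characteristic functions of $\frac{1}{n} S_n$ converge pointwise to the function $t \mapsto e^{itm}$, which is the characteristic function of the constant $m$.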

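Let $\varphi$ be the characteristic function of $X_n$. (Note that each $X_n$ has the same characteristic function because they are identically distributed.) Let $\varphi_n$ be the characteristic function of $\frac{1}{n} S_n$; then

$$\varphi_n(t) = E\bigl[e^{\frac{it}{n}(X_1 + \cdots + X_n)}\bigr].$$

Because the variables $X_n$ are independent, we have

$$\varphi_n(t) = \prod_{j=1}^n E\bigl[e^{\frac{it}{n} X_j}\bigr] = \varphi\Bigl(\frac{t}{n}\Bigr)^n. \tag{6}$$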

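By Taylor’s theorem and by linearity of expectation, we have for $t \to 0$ that

$$\varphi(t) = E[e^{itX_j}] = E[1 + itX_j + o(t)] = 1 + itm + o(t),$$

and together with (6) this gives, for each fixed $t$,

$$\varphi_n(t) = \Bigl(1 + \frac{itm}{n} + o(1/n)\Bigr)^n \to e^{itm} \quad \text{as } n \to \infty,$$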

which completes the proof.
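The convergence of characteristic functions can be checked numerically in a case where everything is explicit. The sketch below assumes the variables are exponential with rate $1$, for which the characteristic function is $1/(1 - it)$ and the mean is $1$; this choice of distribution and the function names are illustrative assumptions, not taken from the post. It evaluates the $n$-th power of the characteristic function at $t/n$ and compares it with $e^{it}$.

```python
import cmath

def phi_exp(t):
    """Characteristic function of an Exp(1) variable: E[exp(itX)] = 1 / (1 - it)."""
    return 1 / (1 - 1j * t)

def phi_average(t, n):
    """Characteristic function of the average of n IID Exp(1) variables: phi(t/n)**n."""
    return phi_exp(t / n) ** n

t = 2.0
m = 1.0                              # mean of Exp(1)
target = cmath.exp(1j * t * m)       # characteristic function of the constant m

for n in (1, 10, 100, 1000, 10000):
    err = abs(phi_average(t, n) - target)
    print(f"n = {n:6d}   |phi_n(t) - exp(itm)| = {err:.2e}")
```

The error decays roughly like $1/n$, consistent with the next term in the Taylor expansion.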

**3. Strong law of large numbers and ergodic theorem**

The strong law of large numbers states that not only does $\frac{1}{n} S_n$ converge to $m$ in probability, it also converges almost surely. This takes a little more work to prove. Rather than describe a proof here (a nice discussion of both laws, including a different proof of the weak law than the one above, can be found on Terry Tao’s blog), we observe that the strong law of large numbers can be viewed as a special case of the Birkhoff ergodic theorem, and then give a proof of this result. First we state the ergodic theorem (or at least, the version of it that is most relevant for us).

Theorem 1

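Let $(X, \mu)$ be a probability space and $T \colon X \to X$ a measurable transformation. Suppose that $\mu$ is $T$-invariant and ergodic. Then for any $f \in L^1(X, \mu)$, we have

$$\lim_{n \to \infty} \frac{1}{n} S_n f(x) = \int f \, d\mu \tag{7}$$

for $\mu$-a.e. $x \in X$, where $S_n f(x) = f(x) + f(Tx) + \cdots + f(T^{n-1}x)$ is the $n$th Birkhoff sum of $f$.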
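To see the theorem in action on a system that is not IID, one can compute Birkhoff averages for a circle rotation. The sketch below assumes the standard fact that Lebesgue measure on $[0,1)$ is invariant and ergodic for rotation by an irrational angle; the angle, the observable and the starting point are illustrative choices, and in floating point the angle is only an approximation of an irrational number. The observable $\cos^2(2\pi x)$ has integral $1/2$.

```python
import math

def birkhoff_average(f, T, x, n):
    """Return (1/n) * (f(x) + f(Tx) + ... + f(T^{n-1} x))."""
    total = 0.0
    for _ in range(n):
        total += f(x)
        x = T(x)
    return total / n

ALPHA = (math.sqrt(5) - 1) / 2      # floating-point stand-in for an irrational angle

def rotation(x):
    """Circle rotation T(x) = x + ALPHA mod 1."""
    return (x + ALPHA) % 1.0

def f(x):
    """Observable whose integral over [0, 1] is 1/2."""
    return math.cos(2 * math.pi * x) ** 2

for n in (10, 100, 1000, 10000, 100000):
    print(n, birkhoff_average(f, rotation, 0.1, n))
```

The printed averages approach $1/2$, the space average of the observable, even though successive values along the orbit are completely determined by the starting point.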

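Before giving a proof, we describe how the strong law of large numbers is a special case of Theorem 1. Let $X_n \colon \Omega \to \mathbb{R}$ be a sequence of IID integrable random variables, and define a map $\pi \colon \Omega \to \mathbb{R}^{\mathbb{N}}$ by

$$\pi(\omega) = (X_1(\omega), X_2(\omega), X_3(\omega), \dots).$$

Let $\nu$ be the probability measure on $\Omega$ that determines $P$, and let $\mu = \pi_* \nu = \nu \circ \pi^{-1}$ be the corresponding probability measure on $\mathbb{R}^{\mathbb{N}}$.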

Because the variables $X_n$ are independent, $\mu$ has the form $\mu = \mu_1 \times \mu_2 \times \cdots$, and because they are identically distributed, all the marginal distributions $\mu_j$ are the same, so in fact $\mu = p^{\mathbb{N}}$ for some probability distribution $p$ on $\mathbb{R}$.

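The measure $\mu$ is invariant and ergodic with respect to the dynamics on $X = \mathbb{R}^{\mathbb{N}}$ given by the shift map $T(x_1, x_2, x_3, \dots) = (x_2, x_3, x_4, \dots)$ (this is an example of a Bernoulli measure). Writing $x = (x_1, x_2, x_3, \dots)$ and putting $f(x) = x_1$, we see that for $x = \pi(\omega)$ we have

$$\frac{1}{n} \sum_{k=0}^{n-1} f(T^k x) = \frac{1}{n} \bigl(X_1(\omega) + \cdots + X_n(\omega)\bigr) = \frac{1}{n} S_n(\omega).$$

In particular, since $f \in L^1(X, \mu)$ with $\int f \, d\mu = E[X_1] = m$, the convergence in (7) implies the strong law of large numbers.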
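The dictionary between the two pictures can be checked on a finite truncation of a sample sequence. In the sketch below, a tuple of IID Uniform$[0,1]$ draws plays the role of the point in sequence space, `shift` is the left shift, and `first_coordinate` is the observable that reads off the first entry; the names, the distribution and the truncation length are assumptions made for illustration. The Birkhoff average along the shift coincides with the ordinary sample mean.

```python
import math
import random

def shift(x):
    """Left shift (x1, x2, x3, ...) -> (x2, x3, ...), acting on a finite truncation."""
    return x[1:]

def first_coordinate(x):
    """The observable that returns the first entry of the sequence."""
    return x[0]

def birkhoff_average(f, T, x, n):
    """Return (1/n) * (f(x) + f(Tx) + ... + f(T^{n-1} x))."""
    total = 0.0
    for _ in range(n):
        total += f(x)
        x = T(x)
    return total / n

rng = random.Random(1)                          # fixed seed for reproducibility
n = 2000
sample = tuple(rng.random() for _ in range(n))  # truncated point of sequence space

shift_avg = birkhoff_average(first_coordinate, shift, sample, n)
sample_mean = sum(sample) / n

print("Birkhoff average along the shift:", shift_avg)
print("sample mean of the first n draws:", sample_mean)
```

Both numbers agree up to rounding and are close to the mean $1/2$, which is the content of the strong law in this example.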

**4. Proving the ergodic theorem**

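To prove the ergodic theorem, it suffices (replacing $f$ by $f - \int f \, d\mu$) to consider a function $f \in L^1(X, \mu)$ with $\int f \, d\mu = 0$ and show that the set

$$X_\epsilon = \Bigl\{ x \in X \;\Big|\; \limsup_{n \to \infty} \frac{1}{n} S_n f(x) > \epsilon \Bigr\}$$

has $\mu(X_\epsilon) = 0$ for every $\epsilon > 0$. Indeed, the set of points where (7) fails is the (countable) union of the sets $X_{1/k}$, $k \in \mathbb{N}$, for the functions $\pm f$, and thus has $\mu$-measure zero if this result holds.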

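Note that $X_\epsilon$ is $T$-invariant, and so by ergodicity we either have $\mu(X_\epsilon) = 0$ or $\mu(X_\epsilon) = 1$. We assume that $\mu(X_\epsilon) = 1$ and derive a contradiction by showing that this implies $\int f \, d\mu \geq \epsilon > 0$.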

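The assumption on $X_\epsilon$ implies that $\sup_n S_n(f - \epsilon)(x) = \infty$ for $\mu$-a.e. $x$: at each $x \in X_\epsilon$ there is $\delta > 0$ such that $\frac{1}{n} S_n f(x) > \epsilon + \delta$ for infinitely many $n$, and so $S_n(f - \epsilon)(x) > n\delta$ along those $n$. The key step now is to use this fact to show that

$$\int (f - \epsilon) \, d\mu \geq 0; \tag{8}$$

this is the content of the maximal ergodic theorem.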

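Proving the maximal ergodic theorem requires a small trick. Let $g = f - \epsilon$ and let $G_n(x) = \max\{ S_k g(x) \mid 1 \leq k \leq n \}$. Since $S_{k+1} g(x) = g(x) + S_k g(Tx)$, we have $G_{n+1}(x) = g(x) + \max\{0, G_n(Tx)\}$. Then

$$G_{n+1}(x) - G_n(Tx) = g(x) - \min\{0, G_n(Tx)\}, \tag{9}$$

and because $G_n(x) \to \infty$ for $\mu$-a.e. $x$ (and hence, by $T$-invariance of $\mu$, also $G_n(Tx) \to \infty$ for $\mu$-a.e. $x$), this implies that $G_{n+1} - G_n \circ T$ converges $\mu$-a.e. to $g$. Now we want to argue that

$$\int g \, d\mu = \lim_{n \to \infty} \int (G_{n+1} - G_n \circ T) \, d\mu, \tag{10}$$

because the integral on the right is equal to $\int (G_{n+1} - G_n) \, d\mu$ by $T$-invariance of $\mu$, and this integral in turn is non-negative because $G_n$ is non-decreasing in $n$. So if (10) holds, then we have $\int g \, d\mu \geq 0$, which is exactly (8).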

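Pointwise convergence does not always yield convergence of integrals, so to verify (10) we need the Lebesgue dominated convergence theorem. Using (9) and the fact that $G_n \geq G_1 = g$, we have

$$|G_{n+1}(x) - G_n(Tx)| \leq |g(x)| + |\min\{0, G_n(Tx)\}| \leq |g(x)| + |g(Tx)|,$$

which is integrable, and so the argument just given shows that (10) holds and in particular $\int f \, d\mu \geq \epsilon > 0$, contradicting the assumption on $f$. This proves that $\mu(X_\epsilon) = 0$, which as described above is enough to prove that (7) holds $\mu$-a.e.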
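The algebraic identity behind the maximal ergodic theorem, and the domination bound used for dominated convergence, can both be sanity-checked numerically along orbits. The sketch below writes `G(n, x)` for the maximum of the first $n$ Birkhoff sums of a function `g` along the orbit of `x` under a map `T`. The concrete map (a circle rotation), the concrete `g`, the tolerance and the helper name `check` are assumptions chosen only for illustration; a numerical check of this kind is not a substitute for the proof.

```python
import math

ALPHA = (math.sqrt(5) - 1) / 2       # floating-point stand-in for an irrational angle

def T(x):
    """Circle rotation, used here only as a convenient measure-preserving map."""
    return (x + ALPHA) % 1.0

def g(x):
    """An arbitrary bounded test function."""
    return math.sin(2 * math.pi * x) + 0.1

def G(n, x):
    """Maximum over 1 <= k <= n of the Birkhoff sums g(x) + g(Tx) + ... + g(T^{k-1} x)."""
    best = -math.inf
    partial = 0.0
    for _ in range(n):
        partial += g(x)
        x = T(x)
        best = max(best, partial)
    return best

def check(x, n, tol=1e-9):
    """Check the identity G_{n+1}(x) - G_n(Tx) = g(x) - min(0, G_n(Tx)) and the bound by |g(x)| + |g(Tx)|."""
    lhs = G(n + 1, x) - G(n, T(x))
    rhs = g(x) - min(0.0, G(n, T(x)))
    identity_ok = abs(lhs - rhs) < tol
    bound_ok = abs(lhs) <= abs(g(x)) + abs(g(T(x))) + tol
    return identity_ok, bound_ok

results = [check(j / 10 + 0.03, n) for j in range(10) for n in (1, 5, 50)]
print(all(ok for pair in results for ok in pair))
```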


So the Birkhoff ergodic theorem is a strengthening of the strong law of large numbers. Would it be a stretch to say that ergodic theory is an offshoot in a certain direction from probability theory?

In that spirit, to what extent is it necessary for a worker in dynamical systems to know about probability theory? Is it necessary for a person working in dynamical systems to be well-versed in the nitty-gritty details of the proofs of fundamental facts in probability such as the Central Limit Theorem, the Strong and Weak Laws, Markov chains, etc.?

I think it’s reasonable to view ergodic theory as a version of probability theory that studies stochastic processes arising from deterministic transformations equipped with an invariant measure. In particular, it studies processes that are identically distributed but not independent. There are many interesting questions in ergodic theory, and dynamical systems more broadly, that do not require knowledge of the details of the proofs of the basic probabilistic results that you mention. The extension of results from probability theory to dynamical systems is an important part of this area, and to me it is one of the most interesting parts, but it is by no means the only part.
