Entropy of S-gap shifts

1. S-gap shifts

S-gap shifts are a useful example for studying dynamics of shift spaces that are not subshifts of finite type but still exhibit some strong mixing properties. They are defined as follows: given {S\subset \{0,1,2,\dots\}}, let {G\subset \{0,1\}^*} be the set of all words on two symbols of the form {0^n1} with {n\in S} — that is, {n} 0s followed by a single 1. Then let {G^{{\mathbb Z}}} be the set of all bi-infinite sequences of 0s and 1s that can be written as an infinite concatenation of words in {G}, and let {X\subset \{0,1\}^{\mathbb Z}} be the smallest closed shift-invariant set containing {G^{\mathbb Z}}.

Equivalently, {X} is the collection of bi-infinite sequences {x\in \{0,1\}^{\mathbb Z}} for which every subword of the form {10^n1} has {n\in S}. If {S} is finite then {X} is a shift of finite type. We are usually most interested in the case where {S} is infinite — for example, in this paper (arXiv) where Dan Thompson and I considered questions of uniqueness of the measure of maximal entropy. For purposes of this post, {S} may be finite or infinite; it will not matter.

Recall that if {\mathcal{L}_n} denotes the set of words of length {n} that appear somewhere in the shift {X}, then the topological entropy of {X} is {h(X) = \lim \frac 1n \log \#\mathcal{L}_n}. The following result is well-known and often quoted.

Theorem 1 Given {S\subset \{0,1,2,\dots\}}, the topological entropy of the corresponding {S}-gap shift is {h(X) = \log\lambda}, where {\lambda>1} is the unique solution to {\sum_{n\in S} \lambda^{-(n+1)} = 1}.

Note that when {S=\{0,1,2,\dots\}}, the {S}-gap shift is the full shift on two symbols, and the equation has solution {\lambda=2}.
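
For readers who want to experiment, the defining equation is easy to solve numerically. Here is a minimal Python sketch (the function name and the sample choices of {S} are mine, purely for illustration) that finds {\lambda} by bisection; an infinite {S} can be truncated, since the tail of the sum is negligible once {\lambda>1}.

```python
import math

def entropy_of_S_gap(S, lo=1.0 + 1e-9, hi=4.0, iters=200):
    """Solve sum_{n in S} lam^{-(n+1)} = 1 for lam by bisection; return log(lam).

    S should be a finite iterable of non-negative integers with at least two
    elements (truncate an infinite S to get a numerical approximation).
    """
    f = lambda lam: sum(lam ** (-(n + 1)) for n in S) - 1.0
    # f is strictly decreasing in lam, positive just above 1 and negative at hi
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return math.log((lo + hi) / 2)

print(entropy_of_S_gap(range(60)))           # S = {0,1,2,...} truncated: ~log 2
print(entropy_of_S_gap([0, 1]),              # S = {0,1}: no two consecutive 0s,
      math.log((1 + 5 ** 0.5) / 2))          # so lambda is the golden ratio
```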

Despite the fact that Theorem 1 is well-known, I am not aware of a complete proof written anywhere in the literature. In a slightly different language, this result already appears in B. Weiss’ 1970 paper “Intrinsically ergodic systems” [Bull. AMS 76, 1266--1269] as example 3.(3), but no details of the proof are given. It is exercise 4.3.7 in Lind and Marcus’ book “Symbolic dynamics and coding”, and the section preceding it gives ideas as to how the proof may proceed. Finally, a more detailed proof appears in Spandl’s “Computing the topological entropy of subshifts” [Math. Log. Quart. 53 (2007), 493--510], but there is a gap in the proof. The goal of this post is to explain where the gap in Spandl’s proof is, and then to give two other proofs of Theorem 1, one more combinatorial, the other more ergodic theoretic.

2. An incomplete proof

The idea of Spandl’s proof is this. Given an {S}-gap shift, let {\mathcal{R}_n} be the set of words of length {n} that end in the symbol {1}. Every such word is either {0^{n-1}1} or is of the form {w0^k1}, where {w\in \mathcal{R}_{n-(k+1)}} and {k\in S}. Thus we have

\displaystyle  \#\mathcal{R}_n = 1 + \sum_{k=0}^{n-1} {\mathbf{1}}_S(k) \#\mathcal{R}_{n-(k+1)}. \ \ \ \ \ (1)

Moreover, every {w\in \mathcal{L}_n} is either {0^n} or is of the form {v0^{n-k}} for some {v\in \mathcal{R}_k}, so {\#\mathcal{L}_n = 1 + \sum_{k=1}^n \#\mathcal{R}_k}. With a little more work, one can use this together with (1) to get

\displaystyle  \#\mathcal{L}_n = c_n + \sum_{k=0}^{n-1} {\mathbf{1}}_S(k) \#\mathcal{L}_{n-(k+1)},

where {c_n\in [0,n+2]} for all {n}. Dividing through by {\#\mathcal{L}_n} gives

\displaystyle  1 = \frac{c_n}{\#\mathcal{L}_n} + \sum_{k=0}^{n-1} {\mathbf{1}}_S(k)\frac{\#\mathcal{L}_{n-(k+1)}}{\#\mathcal{L}_n} \ \ \ \ \ (2)

Writing {\lambda = e^{h(X)}}, Spandl now says that {\#\mathcal{L}_n} is asymptotically proportional to {\lambda^n}, and so for each {k} the ratio inside the sum converges to {\lambda^{-(k+1)}} as {n\rightarrow\infty}. Since {c_n} is subexponential, this would prove Theorem 1.

The problem is that the ratio {\frac{\#\mathcal{L}_{n-(k+1)}}{\#\mathcal{L}_n}} may not converge as {n\rightarrow\infty}. Indeed, taking {S = \{1,3,5,7,\dots\}} it is not hard to show that when {k} is even, the limit taken along odd values of {n} differs from the limit taken along even values of {n}.
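
This is easy to observe numerically. The following Python sketch (my own helper code) uses the recursion (1) together with {\#\mathcal{L}_n = 1 + \sum_{k=1}^n \#\mathcal{R}_k}, both of which are valid here because {S} is unbounded; for {S=\{1,3,5,\dots\}} one sees the ratio {\#\mathcal{L}_{n-1}/\#\mathcal{L}_n} oscillate between two values along even and odd {n}, even though {\frac 1n\log\#\mathcal{L}_n} still converges to {\log\lambda = \log\sqrt 2}.

```python
import math

S = set(range(1, 60, 2))       # odd gaps 1, 3, 5, ... (truncated; enough for n <= 40)
N = 40

# R[n] = #R_n, the number of admissible words of length n ending in 1, via (1)
R = [0] * (N + 1)
for n in range(1, N + 1):
    R[n] = 1 + sum(R[n - (k + 1)] for k in S if k <= n - 2)

# L[n] = #L_n = 1 + sum_{k=1}^n R[k]  (the extra 1 counts the all-zeros word)
L = [1] * (N + 1)
for n in range(1, N + 1):
    L[n] = 1 + sum(R[k] for k in range(1, n + 1))

lam = math.sqrt(2)             # solves sum_{k odd} lam^{-(k+1)} = 1/(lam^2 - 1) = 1
for n in range(N - 6, N + 1):
    print(n, round(L[n - 1] / L[n], 4), round(math.log(L[n]) / n, 4), round(math.log(lam), 4))
```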

One might observe that in this specific example, the terms where {k} is even do not contribute to the sum in (2), since {S} contains no even numbers. Thus it is plausible to make the following conjecture.

Conjecture. For any {S}, let {d = \mathrm{gcd}(S+1)} and let {h = \log\lambda} be the topological entropy of the corresponding {S}-gap shift. Then for every {j=0,1,\dots,d-1} the limit

\displaystyle  a_j := \lim_{n\rightarrow\infty} \frac{\#\mathcal{L}_{j+dn}}{\lambda^{j+dn}}

exists. In particular, if {k\in S} then

\displaystyle  \frac{\#\mathcal{L}_{n-(k+1)}}{\#\mathcal{L}_n} = \lambda^{-(k+1)}\frac{\#\mathcal{L}_{n-(k+1)}}{\lambda^{n-(k+1)}} \frac{\lambda^n}{\#\mathcal{L}_n} \rightarrow \lambda^{-(k+1)} \text{ as }n\rightarrow\infty.

If the conjecture is true, this would complete Spandl’s proof of Theorem 1. I expect that the conjecture is true but do not know how to prove it. In general, any shift space with entropy {h=\log \lambda} has the property that {\#\mathcal{L}_n\geq \lambda^n}. There are examples of shift spaces where {\#\mathcal{L}_n /\lambda^n} is not bounded above; however, it can be shown that every {S}-gap shift admits an upper bound, so that {\#\mathcal{L}_n/\lambda^n} is bounded away from 0 and {\infty} (this is done in my paper with Dan Thompson). I don’t see how those techniques can extend to a proof of the conjecture.

So instead of patching the hole in this proof, we investigate two others.

3. A combinatorial proof

Given a shift space {X} with language {\mathcal{L}}, let {a_n = \#\mathcal{L}_n} be the number of words of length {n}. Consider the following generating function:

\displaystyle  H(z) = \sum_{n=1}^\infty a_n z^n.

(This is similar to the dynamical zeta function but is somewhat different since we consider all words of length {n} and not just ones that represent periodic orbits of period {n}.) Observe that the radius of convergence of {H} is {\xi := e^{-h(X)}}. Indeed, for {|z|<\xi} we can put {\beta = |z|/\xi < 1} and observe that the quantity {|a_n z^n| = \frac{\#\mathcal{L}_n}{e^{nh(X)}} \beta^n} decays exponentially, since {\#\mathcal{L}_n e^{-nh(X)}} grows subexponentially; similarly, for {|z|>\xi} the terms in the sum grow exponentially.

Now fix an {S}-gap shift and consider the function

\displaystyle  F_1(z) = \sum_{n\in S} z^{n+1}.

Our goal is to find a relationship between {H} and {F_1} allowing us to show that the radius of convergence of {H} is given by the positive solution to {F_1(z)=1}.

First recall the set {G = \{0^n1\mid n\in S\}}. Given integers {k,n}, let {A_n^k} be the number of words in {\{0,1\}^n} that can be written as a concatenation of exactly {k} words from {G}. Note that {A_n^1 = {\mathbf{1}}_S(n-1)}, so that

\displaystyle  F_1(z) = \sum_{n=1}^\infty A_n^1 z^n.

More generally, we consider the power series

\displaystyle  F_k(z) = \sum_{n=1}^\infty A_n^k z^n.

A little thought reveals that {F_k(z)F_\ell(z) = F_{k+\ell}(z)}, since the coefficient of {z^n} in {F_k(z)F_\ell(z)} is given by

\displaystyle  \sum_{m=1}^n A_{n-m}^k A_m^\ell,

which is equal to {A_n^{k+\ell}} (here {m} is the total length of the final {\ell} words from {G}, so that the {k}th word ends at position {n-m}). In particular, we get

\displaystyle  F_k(z) = (F_1(z))^k.
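
The identity {F_k(z) = F_1(z)^k} just says that the coefficient sequence of {F_k} is the {k}-fold convolution of the coefficient sequence of {F_1}. Here is a small Python check (a sketch with an arbitrary sample {S}; all names are mine): it compares the coefficients of {F_1(z)^k}, computed by repeated convolution, against a brute-force count of {A_n^k} obtained by enumerating {k}-tuples of gaps.

```python
from itertools import product

S = {0, 2, 3}      # a sample gap set, just for the check
N = 14             # track coefficients of z^0, ..., z^N

# coefficients of F_1(z) = sum_{s in S} z^{s+1}
a1 = [1 if n - 1 in S else 0 for n in range(N + 1)]

def convolve(a, b):
    """Coefficients of the product of two power series, truncated at degree N."""
    c = [0] * (N + 1)
    for i, ai in enumerate(a):
        if ai:
            for j, bj in enumerate(b[: N + 1 - i]):
                c[i + j] += ai * bj
    return c

def A_bruteforce(k):
    """A_n^k for n <= N: count k-tuples of gaps whose blocks 0^s 1 have total length n."""
    counts = [0] * (N + 1)
    for gaps in product(sorted(S), repeat=k):
        n = sum(s + 1 for s in gaps)
        if n <= N:
            counts[n] += 1
    return counts

coeffs = a1[:]
for k in range(2, 5):
    coeffs = convolve(coeffs, a1)          # coefficients of F_1(z)^k
    print(k, coeffs == A_bruteforce(k))    # should print True for each k
```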

At this point, the natural thing to do is to say that {\#\mathcal{L}_n = \sum_{k\geq 1} A_n^k} and hence {H(z) = \sum_{k\geq 1} F_k(z)}. However, this is not quite correct because {\mathcal{L}_n} includes words that are not complete concatenations of words from {G} and so are not counted by any {A_n^k}. We return to this in a moment, but first point out that if this were true then we would have {H(z) = \sum_{k\geq 1} F_1(z)^k}, and so {H(z)} converges if {F_1(z)<1} and diverges if {F_1(z)>1}, which was our goal.

To make this precise, we observe that every word in {\mathcal{L}_n} is either {0^n} or is of the form {0^i w 0^j}, where {w} is a concatenation of words from {G} and so is counted by {A_{n-(i+j)}^k} for some {k}. Thus we have the bounds

\displaystyle  \sum_{k\geq 1} A_n^k \leq \#\mathcal{L}_n \leq 1 + \sum_{k\geq 1} \sum_{i\geq 0} \sum_{j\geq 0} A_{n-(i+j)}^k.

In particular, for all {z\geq 0} we get

\displaystyle  \sum_{k\geq 1} F_k (z) \leq H(z) \leq \sum_{n\geq 1} z^n + \sum_{k\geq 1} F_k (z) \left(\sum_{i\geq 0} z^i \right) \left(\sum_{j\geq 0} z^j\right) \ \ \ \ \ (3)

Writing {\hat H(z) = \sum_k F_k(z) = \sum_k F_1(z)^k}, we note that for every {z\in [0,1)} we have {\sum_n z^n < \infty} and so {\hat H(z)} converges if and only if {H(z)} converges. Thus {\hat H} and {H} have the same radius of convergence; in particular, the radius of convergence of {H(z)} is the unique positive {z} such that {F_1(z) = \sum_{n\in S} z^{n+1} = 1}, and by the earlier discussion we have {h(X) = \log(1/z)}, proving Theorem 1.

4. An ergodic theory proof

Now we sketch a proof that goes via the variational principle, relating an {S}-gap shift (on the finite alphabet {\{0,1\}}) to a full shift on a (possibly countable) alphabet. The combinatorial proof above is elementary and requires little advanced machinery; this proof, on the other hand, requires a number of (rather deep) results from ergodic theory and thermodynamic formalism, but has the advantage of illuminating various aspects of the structure of {S}-gap shifts.

Let {X} be an {S}-gap shift and let {Y} be the set of sequences {x\in X} which contain the symbol 1 infinitely often both forwards and backwards. By the Poincaré recurrence theorem, {\mu(Y)=1} for every ergodic measure other than the delta-measure on the fixed point {0^\infty}. Note that {Y} is {\sigma}-invariant but not compact (unless {S} is finite).

Let {Z = Y \cap [1]}, and let {F\colon Z\rightarrow Z} be the first return map. Thus {F(x) = \sigma^n(x)} for all {x\in [10^n 1]}. Note that {F} is topologically conjugate to the full shift {S^{\mathbb Z}} on the alphabet {S}, which we allow to be finite or countable. The conjugacy is given by {\pi\colon S^{\mathbb Z}\rightarrow Z} that takes {\vec n = \cdots n_{-1}.n_0n_1\cdots} to {\cdots 10^{n_{-1}}.10^{n_0}10^{n_1}1\cdots}.

Given an ergodic {\sigma}-invariant probability measure {\mu} on {X} with {\mu(Y)=1}, let {\mu_Z} be the induced {F}-invariant measure on {Z}. Then by Abramov’s formula, we have

\displaystyle  h_\mu(X,\sigma) = \mu(Z) h_{\mu_Z}(Z,F).

Associate to each {\mu} the shift-invariant measure {\nu = (\pi^{-1})_*\mu_Z} on {S^{\mathbb Z}}, where {\pi} is the conjugacy above. Then we have

\displaystyle  h_\mu(X,\sigma) = \mu(Z) h_{\nu}(S^{\mathbb Z},\sigma).

Our goal is to relate the topological entropy on {(X,\sigma)} to the topological pressure of a suitable function on {(S^{\mathbb Z},\sigma)}. Let {\varphi\colon S^{\mathbb Z}\rightarrow {\mathbb R}} be the function taking {\vec n} to {n_0 + 1}, and observe that if {\mu} and {\nu} are identified as above, we have {\mu(Z) = \mu([1]) = 1/\int \varphi\,d\nu}, so that

\displaystyle  h_\nu = \left(\int \varphi\,d\nu \right) h_\mu.

At this point we elide some technical details regarding Gurevich pressure, etc., and simply remark that for {t>0} we have

\displaystyle  \begin{aligned} P(S^{\mathbb Z},-t\varphi) &= \lim_{k\rightarrow\infty} \frac 1k \log \sum_{(n_0,\dots,n_{k-1})\in S^k} \sup_{\vec m\in [n_0\cdots n_{k-1}]} e^{-t \sum_{i=0}^{k-1} \varphi(\sigma^i\vec m)} \\ &= \lim_{k\rightarrow\infty} \frac 1k \log \left(\sum_{n_0,\dots,n_{k-1}} e^{-t\sum_{i=0}^{k-1} (n_i + 1)} \right) \\ &= \lim_{k\rightarrow\infty} \frac 1k \log \left( \sum_{n\in S} e^{-t(n+1)} \right)^k = \log \sum_{n\in S} e^{-t(n+1)}, \end{aligned}

while by the variational principle

\displaystyle  P(S^{\mathbb Z},-t\varphi) = \sup_\nu \left(h_\nu - t \int\varphi\,d\nu\right) = \sup_\nu \left(\int\varphi\,d\nu\right) (h_\mu - t).

We conclude that

\displaystyle  \sup_\mu (h_\mu - t) \mu[1]^{-1} = \log \sum_{n\in S} e^{-t(n+1)}. \ \ \ \ \ (4)

Let {t} be such that the right-hand side is equal to 0; to prove Theorem 1 we need to prove that {h(X,\sigma) = t}. First observe that for every {\mu} we have {h_\mu - t \leq 0}, thus {h(X,\sigma) \leq t}. For the other inequality, let {\nu} be the Bernoulli measure on {S^{\mathbb Z}} that assigns weight {e^{-t(n+1)}} to the symbol {n}. Then {\nu} is an equilibrium state for {-t\varphi} on {S^{\mathbb Z}}, and by our choice of {t}, we have

\displaystyle  h_\nu - t\int \varphi\,d\nu = 0,

so that {(\int\varphi\,d\nu)(h_\mu - t) = 0}, and hence {h_\mu=t} for the measure {\mu} that corresponds to {\nu}; in particular the left-hand side of (4) vanishes and the supremum is attained. This shows that {h(X,\sigma) = t} and completes the proof of the theorem.
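
To see the bookkeeping concretely, here is a small Python sketch (helper names are mine, and the sample {S} is arbitrary): it solves {\sum_{n\in S} e^{-t(n+1)} = 1} for {t}, builds the Bernoulli weights {e^{-t(n+1)}}, and checks numerically that {h_\nu = t\int\varphi\,d\nu}, so that {h_\mu = h_\nu / \int\varphi\,d\nu = t}.

```python
import math

S = [1, 3, 5, 7, 9, 11]          # a sample (finite) gap set

def solve_t(S, lo=1e-9, hi=10.0, iters=200):
    """Find t with sum_{n in S} exp(-t(n+1)) = 1, i.e. t = log(lambda) = h(X)."""
    f = lambda t: sum(math.exp(-t * (n + 1)) for n in S) - 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:           # f is decreasing in t
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

t = solve_t(S)
p = {n: math.exp(-t * (n + 1)) for n in S}         # Bernoulli weights on the alphabet S
h_nu = -sum(q * math.log(q) for q in p.values())   # entropy of the Bernoulli measure
mean_phi = sum((n + 1) * q for n, q in p.items())  # int phi dnu = expected return time
print("h_nu             =", h_nu)
print("t * int(phi)dnu  =", t * mean_phi)          # equal, since -log p_n = t(n+1)
print("h_mu = h_nu/int  =", h_nu / mean_phi, "= t =", t)
```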

We note that this proof has the advantage of giving an explicit description of the measure of maximal entropy (MME). With a little more work it can be used to show that the MME is unique and has good statistical properties (central limit theorem, etc.).


Slowly mixing sets

There are two equivalent definitions of mixing for a measure-preserving dynamical system {(X,f,\mu)}. One is in terms of sets:

\displaystyle  \lim_{n\rightarrow\infty} |\mu(A \cap f^{-n}B) - \mu(A)\mu(B)| = 0 \ \ \ \ \ (1)

for all measurable {A,B\subset X}. The other is in terms of functions:

\displaystyle  \lim_{n\rightarrow\infty} \left\lvert \int \varphi \cdot(\psi\circ f^n)\,d\mu - \int\varphi\,d\mu \int\psi\,d\mu\right\rvert = 0 \ \ \ \ \ (2)

for all {\varphi,\psi\in L^2(X,\mu)}. In both cases one may refer to the quantity inside the absolute value as the correlation function, and write it as {C_n(A,B)} or {C_n(\varphi,\psi)}, so that (1) and (2) become {C_n(A,B)\rightarrow 0} and {C_n(\varphi,\psi)\rightarrow 0}, respectively.

Then it is natural to ask how quickly mixing occurs; that is, how quickly the correlation functions {C_n} decay to 0. Is the convergence exponential in {n}? Polynomial? Something else?

In a previous post, we discussed one method for establishing exponential decay of correlations using a spectral gap for a certain operator associated to the system. The result there stated that as long as {\varphi} and {\psi} are chosen from a suitable class of test functions (often Hölder continuous functions), then {C_n(\varphi,\psi)} decays exponentially.

A comment was made in that post that it is important to work with sufficiently regular test functions, because for arbitrary measurable functions, or for the setwise correlations {C_n(A,B)}, the decay may happen arbitrarily slowly even when the system is “as mixing as possible”. But no examples were given there — so I’d like to explain this a little more completely here.

Let {X=\{0,1\}^{\mathbb N}} be the space of infinite sequences of 0s and 1s, and let {f=\sigma} be the shift map. Let {\mu} be {(\frac 12,\frac 12)}-Bernoulli measure, so that writing {[x_1\cdots x_n]} for the {n}-cylinder containing all sequences in {X} that begin with the symbols {x_1,\dots,x_n}, we have {\mu[x_1\cdots x_n] = 2^{-n}}. This is isomorphic to the doubling map on the circle with Lebesgue measure, which was discussed in the previous post and which has exponential decay of correlations for Hölder observables {\varphi,\psi}. Nevertheless, we will show that there are sets {A,B\subset X} such that {C_n(A,B)} decays quite slowly.

We can think of {(X,f,\mu)} as modeling sequences of coin flips, with tails corresponding to the symbol 0, and heads to 1.

Let {B = [1]} be the set of all sequences in {X} that begin with 1. Then {B} corresponds to the event that the first flip is heads, and {\sigma^{-n}B} corresponds to the event that the {(n+1)}st flip is heads.

The description of {A} is a little more involved. Given {j\leq k \in {\mathbb N}}, let {A_{j,k} = \bigcup_{i=j}^{k-1} \sigma^{-i}[1]}, so that {x\in A_{j,k}} if and only if (at least) one of the symbols {x_j,\dots,x_{k-1}} is 1. This corresponds to the event that heads appears at least once on or after the {j}th flip and before the {k}th flip. Note that {\mu(X \setminus A_{j,k}) = 2^{-(k-j)}}, and so {\mu(A_{j,k}) = 1-2^{-(k-j)}}.

Fix an increasing sequence of integers {k_1,k_2,\dots} with {k_1 = 1}, and let {A = \bigcap_{i=1}^\infty A_{k_i, k_{i+1}}} be the set of all sequences {x\in X} such that every subword {x_{k_i} \cdots x_{k_{i+1}-1}} contains at least one occurrence of the symbol 1.

Using the interpretation as coin flips, we see that the events {A_{k_i,k_{i+1}}} are independent, and so

\displaystyle  \mu(A) = \mathop{\mathbb P}(A) = \prod_{i=1}^\infty \mathop{\mathbb P}(A_{k_i,k_{i+1}}) = \prod_{i=1}^\infty (1-2^{-(k_{i+1} - k_i)}). \ \ \ \ \ (3)

We recall the following elementary lemma.

Lemma 1 Let {t_i\in {\mathbb R}} be a sequence of real numbers such that {0\leq t_i < 1} for all {i\in {\mathbb N}}. Then {\prod_{i=1}^\infty (1-t_i) > 0} if and only if {\sum_{i=1}^\infty t_i < \infty}.

It follows from Lemma 1 that {\mu(A)>0} if and only if

\displaystyle  \sum_{i=1}^\infty 2^{-(k_{i+1} - k_i)} < \infty. \ \ \ \ \ (4)

Thus from now on we assume that the sequence {k_i} is chosen such that (4) holds. Note that the correlation function {C_n(A,B)} can be computed via conditional probabilities: using

\displaystyle  \mathop{\mathbb P}(A \mid \sigma^{-n}B) = \frac{\mu(A \cap \sigma^{-n}B)}{\mu(B)}

(notice that this uses shift-invariance of {\mu}), we have

\displaystyle  \begin{aligned} C_n(A,B) &= \mu(A\cap \sigma^{-n}B) - \mu(A)\mu(B) \\ &= \mu(B) (\mathop{\mathbb P}(A \mid \sigma^{-n}B) - \mu(A)). \end{aligned} \ \ \ \ \ (5)

Recalling that {A} is the intersection of the independent events {A_{k_i,k_{i+1}}}, we let {j=j(n)} be such that {n\in [k_j, k_{j+1})}, and observe that {\sigma^{-n}(B) \subset A_{k_j,k_{j+1}}}. Thus

\displaystyle  \mathop{\mathbb P}(A \mid \sigma^{-n}B) = \prod_{i\neq j} (1-2^{-(k_{i+1}-k_i)}) = \frac{\mu(A)}{1-2^{-(k_{j+1}-k_j)}},

where the last equality uses the expression for {\mu(A)} from (3). Together with the expression (5) for {C_n(A,B)} in terms of the conditional probability {\mathop{\mathbb P}(A\mid \sigma^{-n}B)}, this gives

\displaystyle  \begin{aligned} C_n(A,B) &= \mu(A) \mu(B) \left( \frac 1{1-2^{-(k_{j+1}-k_j)}} - 1 \right) \\ &= \mu(A)\mu(B) \frac 1{2^{k_{j+1} - k_j} - 1}, \end{aligned} \ \ \ \ \ (6)

where we emphasise that {j} is a function of {n}.

Example 1 Take {k_i = \lfloor 2 i\log_2 i \rfloor}, so that {k_{i+1} - k_i \approx 2 i \log_2 (1+\frac 1i) + 2\log_2(i+1)}, and {2^{k_{i+1} - k_i} \approx (i+1)^2 (1+\frac 1i)^{2i}}. Since {(1+\frac 1i)^{2i} \rightarrow e^2}, we see that (4) is satisfied, so that {\mu(A)>0}.

Moreover, {j=j(n)} satisfies {2j\log_2 j \leq n}, and so in particular {j\leq n}. This implies that {k_{j+1} - k_j \leq k_{n+1} - k_n}, and so we can estimate the correlation function using (6) by

\displaystyle  C_n(A,B) \geq \mu(A)\mu(B)\frac 1{2^{k_{n+1} - k_n} - 1} \approx \mu(A)\mu(B) \frac 1{n^2}.

The example shows that the correlation function of sets may only decay polynomially, even if the system has exponential decay of correlations for regular observable functions. In fact, we can use the construction above to produce sets with correlations that decay more or less as slowly as we like along some subsequence of times {n}.
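
For a numerical feel for Example 1, one can evaluate (6) directly. The following Python sketch (my own code; the infinite product defining {\mu(A)} is truncated, which only affects the constant) prints {n^2\,C_n(A,B)/(\mu(A)\mu(B))}, which stays bounded away from zero, confirming that the decay is no faster than {1/n^2}.

```python
import math

def k(i):                        # k_i = floor(2 i log_2 i), with k_1 = 1
    return 1 if i == 1 else int(math.floor(2 * i * math.log2(i)))

M = 2000                         # truncation level for the infinite product (3)
mu_A = 1.0
for i in range(1, M):
    mu_A *= 1 - 2.0 ** (-(k(i + 1) - k(i)))
mu_B = 0.5

def C(n):
    """C_n(A,B) from (6): find j with k_j <= n < k_{j+1} and apply the formula."""
    j = 1
    while k(j + 1) <= n:
        j += 1
    return mu_A * mu_B / (2.0 ** (k(j + 1) - k(j)) - 1)

for n in (10, 100, 1000, 10000, 100000):
    print(n, C(n), n * n * C(n) / (mu_A * mu_B))
```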

Theorem 2 Let {\gamma_n>0} be any sequence of positive numbers converging to 0. Then there is a sequence {k_i} such that the sets {A,B} from the construction above have {\limsup_{n\rightarrow\infty} C_n(A,B)/\gamma_n = \infty}.

The result says that no matter how slowly {\gamma_n} converges to 0, we can choose {A,B} such that {C_n(A,B)} is not bounded above by {K\gamma_n} for any constant {K}.

To prove Theorem 2, we will choose {k_i} such that {C_n(A,B)/\gamma_n} is large whenever {n\in [k_j, k_{j+1})} with {j} even. Let {c_i,d_i} be any increasing sequences of integers (conditions on these will be imposed soon) and define {k_i} recursively by

\displaystyle  k_1 = 1, \quad k_{2i} = k_{2i-1} + c_i, \qquad k_{2i+1} = k_{2i} + d_i.

Because {c_i,d_i} are increasing, we have

\displaystyle  \sum_{i=1}^\infty 2^{-(k_{i+1} - k_i)} = \sum_{i=1}^\infty 2^{-c_i} + 2^{-d_i} < \infty,

so (4) holds and {\mu(A)>0}. The idea is that we will take {c_i \gg d_i}, so that when {n\in [k_{2j},k_{2j+1})}, we have {k_{2j+1} - k_{2j} = d_j} small relative to {k_{2j}}, and hence to {n}.

Let’s flesh this out a little. Given {n\in [k_{2j},k_{2j+1})}, it follows from (6) that

\displaystyle  C_n(A,B) = \mu(A)\mu(B) \frac 1{2^{k_{2j+1} - k_{2j}} - 1} = \mu(A)\mu(B) \frac 1{2^{d_j} - 1} \geq \mu(A)\mu(B) 2^{-d_j}. \ \ \ \ \ (7)

On the other hand, {n\geq k_{2j} \geq \sum_{i=1}^j c_i}. Now we may take {d_j} to be any increasing sequence ({d_j=j} will do just fine) and define {c_j} recursively in terms of {c_1,\dots, c_{j-1}} and {d_j}. We do this by observing that since {\gamma_n \rightarrow 0}, given any {c_1,\dots, c_{j-1}} and {d_j}, we can take {c_j} large enough so that for every {n\geq \sum_{i=1}^j c_i}, we have

\displaystyle  j\gamma_n \leq \mu(A) \mu(B) 2^{-d_j}.

In particular, (7) gives {C_n(A,B) \geq j\gamma_n} for every {n\in [k_{2j},k_{2j+1})}. Sending {j\rightarrow\infty} completes the proof of Theorem 2.


Law of large numbers for dependent but uncorrelated random variables

One of the fundamental results in probability theory is the strong law of large numbers, which was discussed in an earlier post under the guise of the Birkhoff ergodic theorem.

Suppose we have a sequence of random variables {t_n} which take values in {{\mathbb N}}. If the sequence {t_n} is independent and identically distributed with {\mathop{\mathbb E}[t_n] = T}, then the strong law of large numbers shows that {\frac 1n(t_1+\cdots + t_n)\rightarrow T} almost surely.

It is often the case, however, that one or both of these conditions fail; there may be some dependence between the variables {t_n}, or the distribution may not be identical for all values of {n}. For example, the following situation arose recently in a project I have been working on: there is a sequence of random variables {t_n} as above, which has the property that no matter what values {t_1,\dots,t_{n-1}} take, the random variable {t_n} has expected value at most {T}. In the language of conditional expectation, we have

\displaystyle  \mathop{\mathbb E}[t_n \mid t_1,\dots, t_{n-1}] \leq T.

This is true despite the fact that the distribution of {t_n} may vary according to the values of {t_1,\dots,t_{n-1}}. In what follows we will write {\mathcal{F}_n} for the {\sigma}-algebra generated by {t_1,\dots, t_n} and write the above condition as

\displaystyle  \mathop{\mathbb E}[t_n \mid \mathcal{F}_{n-1}]\leq T.

It is plausible to conjecture that in this setting one still has {\varlimsup \frac 1n(t_1+\cdots+t_n) \leq T} almost surely. And indeed, it turns out that a statement very close to this can be proved.

Let {\mathop{\mathcal F}_n} be an increasing sequence of {\sigma}-algebras and let {t_n} be a sequence of {{\mathbb N}}-valued random variables such that {t_n} is {\mathop{\mathcal F}_n}-measurable. Suppose that there is {p\colon {\mathbb N} \rightarrow [0,1]} such that

\displaystyle  \mathop{\mathbb P}[t_n=t \mid \mathcal{F}_{n-1}] \leq p(t) \ \ \ \ \ (1)

and moreover,

\displaystyle   T := \sum_{t=1}^\infty t \cdot p(t) < \infty. \ \ \ \ \ (2)
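
To make the setup concrete, here is a small Python simulation sketch (the particular dependence structure is invented for illustration and is not the one from the project mentioned above): the conditional law of {t_n} depends on {t_{n-1}}, but every conditional probability is dominated by {p(t) = 2^{-(t-1)}}, so {T = \sum_t t\,p(t) = 4}, and the running averages stay below {T}.

```python
import random

# dominating sequence: p(t) = 2^{-(t-1)} for t >= 1, so T = sum t p(t) = 4
T = sum(t * 2.0 ** (-(t - 1)) for t in range(1, 200))

def geometric():
    """Geometric(1/2) on {1, 2, 3, ...}: P[X = t] = 2^{-t}."""
    t = 1
    while random.random() < 0.5:
        t += 1
    return t

def next_t(prev):
    # two different conditional laws, both dominated by p(t) = 2^{-(t-1)}:
    #   prev even:  P[t] = 2^{-t}             (mean 2)
    #   prev odd:   P[t] = 2^{-(t-1)}, t >= 2  (mean 3)
    return geometric() if prev % 2 == 0 else geometric() + 1

random.seed(0)
n, t_prev, total = 10 ** 5, 1, 0
for _ in range(n):
    t_prev = next_t(t_prev)
    total += t_prev
print("average of t_1,...,t_n:", total / n, "   T =", T)
```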

The following result is proved by modifying one of the standard proofs of the strong law of large numbers, which can be found, for example, in Billingsley’s book “Probability and Measure”, where it is Theorem 22.1. There is also a discussion on Terry Tao’s blog, which emphasises the role of the main tools in this proof: the moment method and truncation.

Proposition 1 If {t_n} is any sequence of random variables satisfying (1) and (2), then {\varlimsup_{n\rightarrow\infty} \frac 1n \sum_{k=1}^n t_k \leq T} almost surely.

Proof: Consider the truncated random variables {s_n = t_n {\mathbf{1}}_{[t_n\leq n]}}, and note that by (1) and (2) we have

\displaystyle  0 \leq s_n \leq t_n, \quad\text{and hence}\quad \mathop{\mathbb E}[s_n \mid \mathcal{F}_{n-1}] \leq \mathop{\mathbb E}[t_n \mid \mathcal{F}_{n-1}] \leq T.

Now consider the random variables {r_n = s_n + T - \mathop{\mathbb E}[s_n \mid \mathcal{F}_{n-1}]}. Note that {r_n} is {\mathcal{F}_n}-measurable, takes values in {{\mathbb N}}, and moreover

\displaystyle  \mathop{\mathbb E}[r_n \mid \mathcal{F}_{n-1}] = T \text{ for all } n. \ \ \ \ \ (3)

The most important consequence of this is that the sequence {r_n} is uncorrelated — that is, {\mathop{\mathbb E}[r_ir_j] = \mathop{\mathbb E}[r_i]\mathop{\mathbb E}[r_j] = T^2} for all {i\neq j}, which in turn implies that {\mathop{\mathbb E}[(r_i - T)(r_j-T)] = 0}. This means that the variance is additive for the sequence {r_n}:

\displaystyle  \begin{aligned} \mathop{\mathbb E}\Big[ \Big(\sum_{k=1}^n (r_k -T)\Big)^2\Big] &= \sum_{k=1}^n \mathop{\mathbb E}[(r_k -T)^2] + 2\sum_{1\leq i<j\leq n} \mathop{\mathbb E}[(r_i-T)(r_j-T)] \\ &= \sum_{k=1}^n \mathop{\mathbb E}[(r_k -T)^2]. \end{aligned}

Let {X_n = \frac 1n (r_1 + \cdots + r_n) - T}, so that {\mathop{\mathbb E}[X_n] = 0}. We can estimate {\mathop{\mathbb E}[X_n^2]} using the previous computation:

\displaystyle  \begin{aligned} \mathop{\mathbb E}[X_n^2] &= \frac 1{n^2} \sum_{k=1}^n \mathop{\mathbb E}[(r_k-T)^2] = \frac 1{n^2} \sum_{k=1}^n (\mathop{\mathbb E}[r_k^2] - 2T\mathop{\mathbb E}[r_k] + T^2) \\ &= \frac 1{n^2} \left( \sum_{k=1}^n \mathop{\mathbb E}[r_k^2] - nT^2\right) \leq \frac 1{n^2} \sum_{k=1}^n \mathop{\mathbb E}[r_k^2]. \end{aligned} \ \ \ \ \ (4)

Recall that {p(t)} is the sequence bounding {\mathop{\mathbb P}[t_n = t \mid \mathop{\mathcal F}_{n-1}]}, and let {\bar t} be such that {\sum_{t\geq \bar t} p(t) \leq 1}. Let {Y} be a random variable taking the value {t} with probability {p(t)} for {t\geq \bar t} (and, say, the value 0 with the remaining probability). Note that by the definition of {s_n} and {r_n}, we have {r_n \leq s_n + T \leq n+T}, thus

\displaystyle  \mathop{\mathbb E}[r_k^2] = \sum_{t=1}^{k+T} t^2 \mathop{\mathbb P}[r_k = t] \leq \sum_{t=1}^{k+T} t^2 p(t) \leq C + \mathop{\mathbb E}[Y^2 {\mathbf{1}}_{[Y\leq k+T]}]

for some fixed constant {C}. Note that the final expression is non-decreasing in {k}, and so together with (4) we have

\displaystyle  \mathop{\mathbb E}[X_n^2] \leq \frac 1n \left( C' + \mathop{\mathbb E}[Y^2 {\mathbf{1}}_{[Y\leq n+T]}] \right), \ \ \ \ \ (5)

where again {C'} is a fixed constant. Fixing {\varepsilon>0} and using (5) in Chebyshev’s inequality yields

\displaystyle  \mathop{\mathbb P}[|X_n| \geq \varepsilon] \leq \frac 1{\varepsilon^2 n} \left( C' + \mathop{\mathbb E}[Y^2 {\mathbf{1}}_{[Y\leq n+T]}] \right). \ \ \ \ \ (6)

Now let {n_k} be a sequence of integers such that {\lim_k \frac{n_{k+1}}{n_k} =: \alpha > 1}. We see that (6) yields

\displaystyle  \begin{aligned} \sum_{k=1}^\infty \mathop{\mathbb P}[|X_{n_k}| \geq \varepsilon] &\leq \frac 1{\varepsilon^2} \sum_{k=1}^\infty n_k^{-1}\left( C' + \mathop{\mathbb E}[Y^2 {\mathbf{1}}_{[Y\leq n_k+T]}]\right) \\ &\leq C'' + \frac 1{\varepsilon^2} \mathop{\mathbb E} \left[ Y^2 \sum_{k=1}^\infty n_k^{-1} {\mathbf{1}}_{[Y\leq n_k+T]}\right], \end{aligned} \ \ \ \ \ (7)

where {C''<\infty} depends on {\alpha} and on {\varepsilon}. Observe that

\displaystyle  \sum_{k=1}^\infty n_k^{-1} {\mathbf{1}}_{[Y\leq n_k +T]} \leq \sum_{k=k_0(Y)}^\infty n_k^{-1},

where {k_0(Y)} is the minimal value of {k} for which {n_k\geq Y-T}. Because {n_k} grows like {\alpha^k}, the sum is bounded above by a constant times {n_{k_0(Y)}^{-1}}, so that together with (7) we have

\displaystyle  \sum_{k=1}^\infty \mathop{\mathbb P}[|X_{n_k}| \geq \varepsilon] \leq C'' + C''' \mathop{\mathbb E}[Y] < \infty,

where finiteness of {\mathop{\mathbb E}[Y]} is the crucial place where we use (2).

Using the first Borel-Cantelli lemma and taking an intersection over all rational {\varepsilon>0} gives

\displaystyle  \lim_{k\rightarrow\infty} \frac 1{n_k} \sum_{j=1}^{n_k} r_j = T \text{ a.s.} \ \ \ \ \ (8)

Let {Z_n = \sum_{j=1}^n r_j}. Because {r_j\geq 0} we have {Z_{n_k} \leq Z_n \leq Z_{n_{k+1}}} for all {n_k\leq n\leq n_{k+1}}, and in particular

\displaystyle  \frac {n_k}{n_{k+1}} \frac{Z_{n_k}}{n_k} \leq \frac{Z_n}{n} \leq \frac{n_{k+1}}{n_k} \frac{Z_{n_{k+1}}}{n_{k+1}}.

Taking the limit and using (8) gives

\displaystyle  \frac 1\alpha T \leq \varliminf_{n\rightarrow\infty} \frac 1n Z_n \leq \varlimsup_{n\rightarrow\infty} \frac 1n Z_n \leq \alpha T \text{ a.s.}

Taking an intersection over all rational {\alpha>1} gives

\displaystyle  \lim_{n\rightarrow\infty} \frac 1n \sum_{k=1}^n r_k = T \text{ a.s.} \ \ \ \ \ (9)

Finally, we recall from the definition of {s_n} and {r_n} that {t_n\leq r_n} whenever {t_n\leq n}. In particular, we may observe that

\displaystyle  \sum_{n=1}^\infty \mathop{\mathbb P}[t_n > r_n] \leq \sum_{n=1}^\infty \mathop{\mathbb P}[t_n > n] \leq \sum_{n=1}^\infty \sum_{t>n} p(t) \leq \sum_{t=1}^\infty t\, p(t) = T

and apply Borel-Cantelli again to deduce that with probability one, {t_n > r_n} for at most finitely many values of {n}. In particular, (9) implies that

\displaystyle  \varlimsup_{n\rightarrow\infty} \frac 1n \sum_{k=1}^n t_k \leq T \text{ a.s.} \ \ \ \ \ (10)

which completes the proof of Proposition 1. \Box


Equidistribution for random rotations

Two very different types of dynamical behaviour are illustrated by a pair of very well-known examples on the circle: the doubling map and an irrational rotation. On the unit circle in {{\mathbb C}}, the doubling map is given by {z\mapsto z^2}, while an irrational rotation is given by {z\mapsto e^{2\pi i\theta}z} for some irrational {\theta}.

Lebesgue measure (arc length) is invariant for both transformations. For the doubling map, it is just one of many invariant measures; for an irrational rotation, it turns out to be the only invariant measure. We say that the doubling map exhibits hyperbolic behaviour, while the irrational rotation exhibits elliptic behaviour.

Systems with hyperbolicity have many invariant measures, as we saw in a series of previous posts. The goal of this post is to recall a proof that the opposite situation is true for an irrational rotation, and that in particular every orbit equidistributes with respect to Lebesgue measure; then we consider orbits generated by random rotations, where instead of rotating by a fixed angle {\theta}, we rotate by either {\theta_1} or {\theta_2}, with the choice of which to use being made at each time step by flipping a coin.

1. Invariant measures via {C(X)}

First we recall some basic facts from ergodic theory for topological dynamical systems. Given a compact metric space {X} and a continuous map {f\colon X\rightarrow X}, let {\mathcal{M}} denote the space of Borel probability measures on {X}. Writing {C(X)} for the space of all continuous functions {X\rightarrow {\mathbb C}}, recall that {C(X)^*} is the space of all continuous linear functionals {C(X)\rightarrow {\mathbb C}}. Then {C(X)^*} is (isomorphic to) the space of finite complex Borel measures on {X}. (This last assertion uses the fact that {X} is a compact metric space and combines various results from “Linear Operators” by Dunford and Schwartz, but a more precise reference will have to wait until I have the book available to look at.)

Using this fact together with the polar decomposition for finite complex Borel measures, we have the following: for every {L\in C(X)^*}, there is {\mu\in \mathcal{M}} and a measurable function {\theta\colon X\rightarrow {\mathbb R}} such that

\displaystyle  L(\varphi) = \|L\| \int \varphi e^{i\theta} \,d\mu \text{ for all } \varphi\in C(X). \ \ \ \ \ (1)

Note that although {C(X)^*} is endowed with the operator norm, we will usually think of it as a topological vector space with the weak* topology. Thus {\mathcal{M}} embeds naturally into {C(X)^*}, and (1) shows that every element of {C(X)^*} can be described in a canonical way in terms of {\mathcal{M}}.

Let {P\subset C(X)} be a countable set whose span is dense in {C(X)}. Then to every {L\in C(X)^*} we can associate the sequence {\Phi_P(L) = (L(p))_{p\in P} \in {\mathbb C}^P}, where depending on the context we may index using either {{\mathbb N}} or {{\mathbb Z}}. If {P} is bounded then {\Phi_P(L)\in \ell^\infty} for every {L\in C(X)^*}, and so such a {P} defines a linear map {\Phi_P\colon C(X)^*\rightarrow \ell^\infty}.

Because {L} is determined by the values {L(p)} (by linearity and continuity of {L}, and density of the span of {P}), the map {\Phi} is 1-1. In particular, it is an isomorphism onto its image, which we denote by

\displaystyle  V_P := \Phi_P(C(X)^*) \subset \ell^\infty

Note that {V_P \neq \ell^\infty} because {C(X)^*} is separable and {\ell^\infty} is not.

It is straightforward to see that {\Phi_P} is continuous, and its inverse is also continuous on {V_P}. Thus we can translate questions about {C(X)^*}, and in particular about {\mathcal{M}}, into questions about {\ell^\infty}.

Remark 1 It is a nontrivial problem to determine which elements of {\ell^\infty} correspond to elements of {C(X)^*}, and also to determine which of those sequences correspond to actual measures (elements of {\mathcal{M}}). We will not need to address either of these problems here.

The action {f\colon X\rightarrow X} induces an action on {C(X)} by {\varphi\mapsto \varphi\circ f}, and hence on {C(X)^*} by duality. This action {f_*\colon C(X)^*\rightarrow C(X)^*} is given by

\displaystyle  (f_*L)(\varphi) = L(\varphi\circ f). \ \ \ \ \ (2)

In particular, {f} also induces an action {f_*\colon \mathcal{M}\rightarrow\mathcal{M}} by

\displaystyle  \int\varphi\,d(f_*\mu) = \int (\varphi \circ f)\,d\mu. \ \ \ \ \ (3)

A measure {\mu} is {f}-invariant iff it is a fixed point of {f_*}. Let {f_P} be the action induced by {f} on {V_P\subset \ell^\infty}; that is,

\displaystyle  f_P(\Phi_P(\mu)) = \Phi_P(f_*\mu). \ \ \ \ \ (4)

If {P} can be chosen so that {f_P} takes a particularly nice form, then this can be used to understand what invariant measures {f} has, and how empirical measures converge.

Let us say more clearly what is meant by convergence of empirical measures. Given {x\in X} and {n\in {\mathbb N}}, let {\mathcal{E}_n(x) = \frac 1n \sum_{k=0}^{n-1} \delta_{f^kx}} be the empirical measure along the orbit segment {x,f(x),\dots,f^{n-1}(x)}. Let {V(x)\subset \mathcal{M}} be the set of weak* accumulation points of the sequence {\mathcal{E}_n(x)}. By compactness of {\mathcal{M}}, the set {V(x)} is non-empty, and it is a standard exercise to show that every measure in {V(x)} is {f}-invariant.

In particular, if {(X,f)} is uniquely ergodic, then it only has one invariant measure {\mu}, and so {V(x) = \{\mu\}} for every {x\in X}. In this case we have {\mathcal{E}_n(x) \rightarrow \mu} for every {x}, and it is reasonable to ask how quickly this convergence occurs.

2. Irrational rotations

Now we specialise to the case of an irrational rotation. Let {X=S^1\subset {\mathbb C}} be the unit circle, fix {\theta\in {\mathbb R}} irrational, and let {f\colon X\rightarrow X} be given by {f(z) = e^{2\pi i\theta}z}. We will show that {\mu\in \mathcal{M}} is {f}-invariant iff it is Lebesgue measure, and then examine what happens in a broader setting.

Given {n\in {\mathbb Z}}, let {p_n(z) = z^n}, and let {P = \{p_n \mid n\in {\mathbb Z}\}}. Then the span of {P} contains all functions {S^1\rightarrow {\mathbb C}} that are polynomials in {z} and {\bar{z}}, thus it is a subalgebra of {C(S^1)} that contains the constant functions, separates points, and is closed under complex conjugation. By the Stone–Weierstrass theorem, this span is dense in {C(S^1)}, and since {P} is bounded the discussion from above gives an isomorphism {\Phi\colon C(S^1)^* \rightarrow V\subset \ell^\infty}, where we suppress {P} in the notation. This isomorphism is given by

\displaystyle  \Phi(L)_n = L(p_n) \ \ \ \ \ (5)

for a general {L\in C(S^1)^*}, and for {\mu\in \mathcal{M}} we write

\displaystyle  \Phi(\mu)_n = \int_{S^1} z^n \,d\mu(z). \ \ \ \ \ (6)

The sequence {\Phi(\mu)} is the sequence of Fourier coefficients associated to the measure {\mu}. The choice of {p_n} means that the action that {f} induces on {V\subset \ell^\infty} takes a simple form: the Fourier coefficients of {L} and {f_*L} are related by

\displaystyle  \Phi(f_*L)_n = (f_*L)(z^n) = L((f(z))^n) = L\left(e^{2\pi i \theta n} z^n\right) = e^{2\pi i \theta n} \Phi(L)_n.

Thus if {\mu} is invariant, we have {\Phi(\mu)_n = e^{2\pi i \theta n} \Phi(\mu)_n} for all {n\in {\mathbb Z}}. Because {\theta} is irrational, we have {e^{2\pi i\theta n} \neq 1} for all {n\neq 0}, and so {\Phi(\mu)_n=0} for all {n\neq 0}. Thus the only non-zero Fourier coefficient is {\Phi(\mu)_0 = \int {\mathbf{1}} \,d\mu(z) = 1}. Because {\Phi} is an isomorphism between {C(S^1)^*} and {V\subset \ell^\infty}, this shows that the only {\mu\in\mathcal{M}} with {f_*\mu=\mu} is Lebesgue measure. In particular, {f} is uniquely ergodic, with Lebesgue as the only invariant measure, and thus for any {z\in S^1}, the empirical measures {\mathcal{E}_n(z)} converge to Lebesgue.
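
Unique ergodicity can be seen concretely in the Fourier coefficients of the empirical measures: for each fixed {k\neq 0}, the averages {\frac 1n \sum_{j=0}^{n-1} e^{2\pi i k j\theta}} tend to {0}, and their modulus does not depend on the starting point. A short numerical sketch in Python (the value of {\theta} is just a sample irrational):

```python
import cmath, math

theta = math.sqrt(2) - 1          # a sample irrational rotation number
for k in (1, 2, 5):
    for n in (10, 100, 1000, 10000):
        # modulus of the k-th Fourier coefficient of the empirical measure E_n(z)
        a = abs(sum(cmath.exp(2j * math.pi * k * m * theta) for m in range(n))) / n
        print(k, n, a)
```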

3. Random rotations

Consider a sequence of points {z_1,z_2,z_3,\dots\in S^1}, and let {m_n\in \mathcal{M}} be the average of the point masses on the first {n} points of the sequence:

\displaystyle  m_n = \frac 1n \sum_{k=1}^n \delta_{z_k}, \qquad\qquad m_n(\varphi) = \frac 1n \sum_{k=1}^n \varphi(z_k). \ \ \ \ \ (7)

We say that the sequence {z_n} equidistributes if {m_n} converges to Lebesgue on {S^1} in the weak* topology.

The previous sections showed that if the points of the sequence are related by {z_{n+1} = e^{2\pi i \theta} z_n}, where {\theta} is irrational, then the sequence equidistributes. A natural generalisation is to ask what happens when the points {z_n} are related not by a fixed rotation, but by a randomly chosen rotation.

Here is one way of making this precise. Let {\Omega = \{1,2\}^{\mathbb N}} be the set of infinite sequences of 1s and 2s, and let {\mu} be the {\left(\frac 12, \frac 12\right)}-Bernoulli measure on {\Omega}, so that all sequences of length {n} are equally likely. Fix real numbers {\theta_1} and {\theta_2}, and fix {z_1\in S^1}. Given {\omega\in \Omega}, consider the sequence {z_n(\omega)} given by

\displaystyle  z_{n+1}(\omega) = e^{2\pi i \theta_{\omega_n}} z_n(\omega). \ \ \ \ \ (8)

Then one may ask whether or not {z_n(\omega)} equidistributes almost surely (that is, with probability 1 w.r.t. {\mu}). The remainder of this post will be dedicated to proving the following result.

Theorem 1 If either of {\theta_1} or {\theta_2} is irrational, then {z_n(\omega)} equidistributes almost surely.

Remark 2 The proof given here follows a paper by Lagarias and Soundararajan, to which I was referred by Lucia on MathOverflow.

Using Fourier coefficients as in the previous section, we have that {z_n(\omega)} equidistributes iff all the non-constant Fourier coefficients of {m_n(\omega)} converge to zero — that is, iff {\Phi(m_n(\omega))_k \rightarrow 0} as {n\rightarrow\infty} for all {k\neq 0}. This is Weyl’s criterion for equidistribution.

Fix a value of {k\neq 0}, which will be suppressed in the notation from now on. Write {a_n} for the absolute value of the {k}th Fourier coefficient of {m_n(\omega)}, and note that

\displaystyle  a_n := |\Phi(m_n(\omega))_k| = |m_n(\omega)(z^k)| = \frac 1n \left|\sum_{j=1}^n z_j(\omega)^k\right|. \ \ \ \ \ (9)
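
Before turning to the proof, one can check the statement numerically. The following Python sketch (sample values of {\theta_1,\theta_2}, one irrational and one rational) simulates the random orbit (8) and prints {a_n}, which indeed tends to {0}.

```python
import cmath, math, random

theta1, theta2 = math.sqrt(2) - 1, 0.25    # theta1 irrational, theta2 rational
k = 3
random.seed(1)

y = 0.0                                    # z_n = exp(2 pi i y_n)
running = 0.0 + 0.0j
checkpoints = {10, 100, 1000, 10000, 100000}
for n in range(1, 100001):
    running += cmath.exp(2j * math.pi * k * y)
    if n in checkpoints:
        print(n, abs(running) / n)         # a_n, which should tend to 0
    y += theta1 if random.random() < 0.5 else theta2
```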

The outline of the proof is as follows.

  1. Show that there is a constant {C} such that the expected value of {a_n^2} is at most {C/n}.
  2. Given {\delta>0}, show that there is a constant {C'} such that the probability that {a_n} exceeds {\delta} is at most {C'/n}.
  3. Find an exponentially increasing sequence {n_j\rightarrow\infty} such that if {a_{n_j}\leq \delta}, then {a_n\leq 2\delta} for every {n\in [n_j,n_{j+1}]}.
  4. Use the Borel–Cantelli lemma to deduce that with probability 1, {a_{n_j}} exceeds {\delta} only finitely often, hence {a_n} exceeds {2\delta} only finitely often. Since {\delta>0} was arbitrary this shows that {a_n\rightarrow 0}.

Step 1

Given {\xi\in {\mathbb R}}, let {\|\xi\|} denote the distance between {\xi} and the nearest integer.

Lemma 2

\displaystyle  \mathop{\mathbb E}_\mu[a_n^2] \leq \left( 1 + \frac 1{\|k\theta_1\|^2 + \|k\theta_2\|^2}\right) \frac 1n \ \ \ \ \ (10)

Proof: Let {y_1\in {\mathbb R}} be such that {z_1 = e^{2\pi i y_1}}, and define {y_n} recursively by {y_{n+1} = y_n + \theta_{\omega_n}}, so that {z_n = e^{2\pi i y_n}}.

\displaystyle  \begin{aligned} a_n^2 &= \frac 1{n^2} \left\lvert \sum_{j=1}^n z_j^k \right\rvert^2 = \frac 1{n^2} \left(\sum_{\ell=1}^n z_\ell^k\right) \left(\sum_{j=1}^n \overline{z_j^k}\right) \\ &= \frac 1{n^2} \left( n + \sum_{\ell\neq j} z_\ell^k z_j^{-k} \right) = \frac 1{n^2} \left( n + \sum_{\ell\neq j} e^{2\pi i k(y_\ell - y_j)} \right). \end{aligned}

Using the fact that {z+\bar{z} = 2\Re(z)}, we have

\displaystyle  \sum_{\ell \neq j} e^{2\pi i k(y_\ell - y_j)} = 2\Re \sum_{1\leq \ell < j \leq n} e^{2\pi i k(y_j - y_\ell)}.

If {j - \ell = r}, then

\displaystyle  \begin{aligned} \mathop{\mathbb E}_\mu[e^{2\pi i k (y_j - y_\ell)}] &= \frac 1{2^r} \sum_{\omega_\ell,\dots, \omega_{j-1}} e^{2\pi i k \sum_{m=\ell}^{j-1} \theta_{\omega_{m}}} = \frac 1{2^r} \sum \prod_{m=\ell}^{j-1} e^{2\pi i k \theta_{\omega_m}} \\ &= \left(\frac{e^{2\pi i k\theta_1} + e^{2\pi i k\theta_2}}{2}\right)^r, \end{aligned}

where the sums are over all {\omega_m\in \{1,2\}} for {\ell\leq m<j}. Since there are {n-r} pairs {\ell<j} with {j - \ell = r}, we have

\displaystyle  \mathop{\mathbb E}_\mu[a_n^2] = \frac 1{n^2}\left(n + 2\Re \sum_{r=1}^n (n-r)z^r\right), \ \ \ \ \ (11)

where {z=\frac 12(e^{2\pi i k\theta_1} + e^{2\pi i k\theta_2})}. Now

\displaystyle  \begin{aligned} \sum_{r=1}^n (n-r) z^r &= \sum_{r=1}^{n-1} z^r + \sum_{r=1}^{n-2} z^r + \cdots + \sum_{r=1}^1 z^r \\ &= \frac z{1-z}\sum_{s=1}^{n-1} (1-z^s) = \frac z{1-z} \left( n-\sum_{s=0}^{n-1} z^s \right), \end{aligned}

and since {|z|\leq 1} we have

\displaystyle  \left\lvert \sum_{r=1}^n (n-r) z^r \right\rvert \leq \frac{2n}{|1-z|}.

Together with (11), this gives

\displaystyle  \mathop{\mathbb E}_\mu[a_n^2] \leq \frac 1n \left( 1 + \frac 4{|1-z|}\right) = \frac 1n\left( 1 + \frac 8{|2 - e^{2\pi i k \theta_1} - e^{2\pi i k \theta_2}|} \right).

Using the fact that {|w|\geq |\Re w|} and that {\Re(1-e^{2\xi i}) = 1-\cos(2\xi) = 2\sin^2\xi}, we have

\displaystyle  \begin{aligned} |2 - e^{2\pi ik\theta_1} - e^{2\pi ik\theta_2}| &\geq 2(\sin^2 \pi k\theta_1 + \sin^2 \pi k\theta_2), \end{aligned}

and so

\displaystyle  \mathop{\mathbb E}_\mu[a_n^2] \leq \frac 1n\left(1 + \frac 4{\sin^2\pi k\theta_1 + \sin^2\pi k\theta_2}\right)

Finally, for {|\xi|\leq \frac 12} we have {|\sin \pi\xi| \geq |2\xi|}, which proves the bound in (10). \Box

Because one of {\theta_1,\theta_2} is irrational, the denominator in (10) is positive, and so {C := 1 + (\|k\theta_1\|^2 + \|k\theta_2\|^2)^{-1} < \infty}, which completes Step 1.
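
The bound (10) is also easy to test by Monte Carlo. A minimal Python sketch (sample parameters; the helper `dist_to_int` plays the role of {\|\cdot\|}):

```python
import cmath, math, random

theta1, theta2, k, n, trials = math.sqrt(2) - 1, 0.3, 2, 500, 2000
random.seed(0)

def dist_to_int(x):
    return abs(x - round(x))

total = 0.0
for _ in range(trials):
    y, s = 0.0, 0.0 + 0.0j
    for _ in range(n):
        s += cmath.exp(2j * math.pi * k * y)
        y += theta1 if random.random() < 0.5 else theta2
    total += (abs(s) / n) ** 2

bound = (1 + 1 / (dist_to_int(k * theta1) ** 2 + dist_to_int(k * theta2) ** 2)) / n
print("Monte Carlo estimate of E[a_n^2]:", total / trials)
print("bound from (10):                 ", bound)
```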

Step 2

Given {\delta>0}, we have

\displaystyle  \mathop{\mathbb E}[a_n^2] \geq \delta^2 \mathop{\mathbb P}(a_n\geq \delta),

and so by Lemma 2 we have

\displaystyle  \mathop{\mathbb P}(a_n\geq \delta) < \frac C{\delta^2 n}.

Putting {C' = C\delta^{-2}} completes step 2.

Step 3

Given {m\leq n\in {\mathbb N}}, we have

\displaystyle  na_n \leq ma_m + \left|\sum_{\ell=m}^{n-1} z_\ell^k\right| \leq ma_m + n-m,

and so

\displaystyle  a_n \leq \frac{n-m}n + \frac mn a_m.

In particular, if {n\leq (1+\delta)m} for some {\delta>0}, we have

\displaystyle  a_n \leq \delta + a_m. \ \ \ \ \ (12)

Let {n_j} be such that

\displaystyle  1+\frac \delta 2 \leq \frac{n_{j+1}}{n_j} \leq 1+\delta \ \ \ \ \ (13)

for all {j}. If {a_{n_j}\leq \delta}, then (12) implies that {a_n\leq 2\delta} for all {n\in [n_j, n_{j+1}]}.

Step 4

Let {E_j} be the event that {a_{n_j} \geq \delta}. By Step 2, we have {\mathop{\mathbb P}(E_j) \leq C'/n_j}, and because {n_j} increases exponentially in {j} by (13), we have {\sum_{j=1}^\infty \mathop{\mathbb P}(E_j) < \infty}. By the Borel–Cantelli lemma, this implies that with probability 1, there are only finitely many values of {j} for which {a_{n_j}\geq \delta}.

By the previous part, this in turn implies that there are only finitely many values of {n} for which {a_n\geq 2\delta}. In particular, {\varlimsup a_n \leq 2\delta}, and since {\delta>0} was arbitrary, we have {a_n\rightarrow 0}. Thus the sequence {z_n} satisfies Weyl’s criterion almost surely, which completes the proof of Theorem 1.


Fubini foiled

An important issue in hyperbolic dynamics is that of absolute continuity. Suppose some neighbourhood {U} in a smooth manifold {M} is foliated by a collection of smooth submanifolds {\{W_\alpha \mid \alpha\in A\}}, where {A} is some indexing set. (Here “smooth” may mean {C^1}, or {C^2}, or even more regularity depending on the context.)

Fixing a Riemannian metric on {M} gives a notion of volume {m} on the manifold {M}, as well as a notion of “leaf volume” {m_\alpha} on each {W_\alpha}, which is just the volume form coming from the induced metric. Then one wants to understand the relationship between {m} and {m_\alpha}.

The simplest example of this comes when {U=(0,1)^2} is the (open) unit square and {\{W_\alpha\}} is the foliation into horizontal lines, so {A=(0,1)} and {W_\alpha = (0,1)\times \{\alpha\}}. Then Fubini’s theorem says that if {E} is any measurable set, then {m(E)} can be found by integrating the leaf volumes {m_\alpha(E)}. In particular, if {m(E)=0}, then {m_\alpha(E)=0} for almost every {\alpha}, and conversely, if {E} has full measure, then {m_\alpha(E)=1} for almost every {\alpha}.

For the sake of simplicity, let us continue to think of a foliation of the two-dimensional unit square by one-dimensional curves, and assume that these curves are graphs of smooth functions {\phi_\alpha\colon (0,1)\rightarrow (0,1)}. The story in other dimensions is similar.

Writing {\Phi(x,y) = \phi_y(x)} gives a map {\Phi\colon (0,1)^2 \rightarrow (0,1)^2}, and we see that our foliation is the image under {\Phi} of the foliation by horizontal lines. Now we must make a very important point about regularity of foliations. The leaves of the foliation are smooth, and so {\Phi} depends smoothly on {x}. However, no assumption has been made so far on the transverse direction — that is, dependence on {y}. In particular, we cannot assume that {\Phi} depends smoothly on {y}.

If {\Phi} depends smoothly on both {x} and {y}, then we have a smooth foliation, not just smooth leaves, and in this case basic results from calculus show that a version of Fubini’s theorem holds:

\displaystyle  m(E) = \int_A \int_{W_\alpha} \rho_\alpha(z) {\mathbf{1}}_E(z) \,dm_\alpha(z)\,d\alpha,

where the density functions {\rho_\alpha\in L^1(W_\alpha,m_\alpha)} can be determined in terms of the derivative of {\Phi}.

It turns out that when the foliation is by local stable and unstable manifolds of a hyperbolic map, smoothness of the foliation is too much to ask for, but it is still possible to find density functions {\rho_\alpha} such that Fubini’s theorem holds, and the relationship between {m}-null sets and {m_\alpha}-null sets is as expected.

However, there are foliations where the leaves are smooth but the foliation as a whole does not even have the absolute continuity properties described in the previous paragraph. This includes a number of dynamically important examples, namely intermediate foliations (weak stable and unstable, centre directions in partial hyperbolicity) for certain systems. These take some time to set up and study. There is an elementary non-dynamical example first described by Katok; a similar example was described by Milnor (Math. Intelligencer 19(2), 1997), and it is this construction which I want to outline here, since it is elegant, does not require much machinery to understand, and neatly illustrates the possibility that absolute continuity may fail.

The example consists of two constructions: a set {E} and a foliation of the square by a collection of smooth curves {W_\alpha}. These are built in such a way that {E} has full Lebesgue measure ({m(E)=1}) but intersects each curve {W_\alpha} in at most a single point, so that in particular {m_\alpha(E)=0} for all {\alpha}.

1. The set

Given {p\in (0,1)}, consider the piecewise linear map {f_p\colon (0,1)\rightarrow (0,1]} given by

\displaystyle  f_p(y) = \begin{cases} \frac yp & y\leq p \\ \frac {y-p}{1-p} &y >p \end{cases}

It is not hard to see that {f_p} preserves Lebesgue measure {dy} and that {(f_p,dy)} is measure-theoretically conjugate to {(\sigma,\mu_p)}, where {\sigma} is the shift map on {\Sigma=\{0,1\}^{\mathbb N}} and {\mu_p} is the {(p,1-p)}-Bernoulli measure. The conjugacy {\pi_p\colon (0,1) \rightarrow \Sigma} is given by coding trajectories according to whether the {n}th iterate falls into {I_0 = (0,p]} or {I_1=(p,1)}.

Given {y}, let {\beta_n(y)} be the number of times that the points {y,f_p(y),\dots,f_p^{n-1}(y)} land in {I_1}. Then by the strong law of large numbers, Lebesgue almost every {y\in (0,1)} has {\frac 1n \beta_n(y) \rightarrow 1-p}. Let {E} be the set of all pairs {(p,y)} such that this is true; then {m(E) = 1} by Fubini’s theorem (applied to the foliation of the unit square into vertical lines). It only remains to construct a foliation that intersects {E} in a “strange” way.
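
A quick Python sketch of this (sample values of {p}; floating-point orbits of an expanding map are not exact, but their visit frequencies still behave like those of a Lebesgue-typical point):

```python
import random

def f(p, y):
    return y / p if y <= p else (y - p) / (1 - p)

random.seed(0)
for p in (0.3, 0.6, 0.8):
    y = random.random()             # a Lebesgue-typical starting point
    n, hits = 100000, 0
    for _ in range(n):
        hits += 1 if y > p else 0   # did the orbit land in I_1 = (p,1)?
        y = f(p, y)
    print(p, hits / n, "   expected 1 - p =", 1 - p)
```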

2. The foliation

The curves {W_\alpha} will be chosen so that two points {(p,y)} and {(p',y')} lie on the same {W_\alpha} if and only if they are coded by the same sequence in {\Sigma} — that is, if and only if {\pi_p(y) = \pi_{p'}(y')}. Then because the limit of {\frac 1n\beta_n} must be constant on {W_\alpha} if it exists, we deduce that each {W_\alpha} intersects the set {E} at most once.

It remains to see that this condition defines smooth curves. Given {\alpha\in (0,1)}, let {a=\pi_{1/2}(\alpha)=a_1a_2\cdots} be the binary expansion of {\alpha}, so that {\alpha = \sum a_n 2^{-n}}. Now let {W_\alpha = \{(p,y) \mid \pi_p(y)=a\}}.

Fix {p\in (0,1)}, let {y_1=y = \phi_\alpha(p)} be such that {(p,y)\in W_\alpha}, and let {y_{n+1}=f_p(y_n)}, so that {y_n\in I_{a_n}} for all {n}. Thus

\displaystyle  y_{n+1} = \begin{cases} \frac {y_n}{p} & a_n=0 \\ \frac{y_n-p}{1-p} & a_n=1 \end{cases}

For convenience of notation, write {p(0)=p} and {p(1)=1-p}. Then the relationship between {y_{n+1}} and {y_n} can be written as

\displaystyle  y_{n+1} = \frac{y_n - a_n p(0)}{p(a_n)} \qquad\Rightarrow\qquad y_n = p(0)a_n + p(a_n)y_{n+1}.

This can be used to write {y=\phi_\alpha(p)} in terms of the sequence {a_n}. Indeed,

\displaystyle  \begin{aligned} \phi_\alpha(p) &=y = y_1 = p(0)a_1 + p(a_1)y_2 \\ &= p(0)a_1 + p(a_1)\big(p(0) a_2 + p(a_2)y_3\big) \\ &= p(0)\big(a_1 + p(a_1)a_2\big) + p(a_1)p(a_2)y_3 \\ &= p(0)\big(a_1 + p(a_1)a_2 + p(a_1)p(a_2)a_3\big) + p(a_1)p(a_2)p(a_3)y_4, \end{aligned}

and so on, so that writing {\psi_n(p) = p(a_1)p(a_2)\cdots p(a_{n-1})}, one has

\displaystyle  \phi_\alpha(p) = p(0) \sum_{n=1}^\infty \psi_n(p) a_n.

The summands are analytic functions of {p}, and the sum converges uniformly on each interval {[\epsilon,1-\epsilon]}, since for {p\in [\epsilon, 1-\epsilon]} we have {\psi_n(p) \leq (1-\epsilon)^{n-1}}. In fact, this uniform convergence extends to complex values of {p} in the disc of radius {\frac 12 - \epsilon} centred at {\frac 12}, and so by the Weierstrass uniform convergence theorem, the function {\phi_\alpha} is analytic in {p}.
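
The series is also easy to evaluate numerically. The following Python sketch (with a made-up digit string for {\alpha}) computes {y = \phi_\alpha(p)} from a truncated series and checks that the orbit of {y} under {f_p} is coded by the digits {a_n}; round-off under the expanding iteration means only the first twenty or so symbols can be trusted.

```python
def phi(p, a, terms=80):
    """phi_alpha(p) = p(0) * sum_{n>=1} psi_n(p) a_n, where p(0) = p and p(1) = 1 - p."""
    weight = lambda digit: p if digit == 0 else 1 - p
    total, psi = 0.0, 1.0            # psi_1 is the empty product, i.e. 1
    for n in range(terms):
        total += psi * a[n]
        psi *= weight(a[n])
    return p * total

def f(p, y):
    return y / p if y <= p else (y - p) / (1 - p)

p = 0.37
a = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0] * 5   # digits a_1 a_2 ...
y = phi(p, a)
coding = []
for _ in range(20):                  # compare the first 20 symbols of the itinerary
    coding.append(0 if y <= p else 1)
    y = f(p, y)
print(coding == a[:20])              # should print True
```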

As discussed above, {W_\alpha} is the graph of the analytic function {\phi_\alpha}, and each {W_\alpha} intersects {E} at most once, despite the fact that {E} has Lebesgue measure 1 in the unit square. This demonstrates the failure of absolute continuity.


Sharkovsky’s theorem

This post is based on a talk given by Keith Burns in the UH dynamical systems seminar yesterday, in which he presented a streamlined proof of Sharkovsky’s theorem due to him and Boris Hasselblatt.

1. Background and statement of the theorem

We recall the statement of Sharkovsky’s theorem. The setting is a continuous map {f\colon {\mathbb R}\rightarrow {\mathbb R}}, and one is interested in what dynamical properties {f} must necessarily have if we know that {f} has a periodic orbit of some given least period {p} — for example, does existence of such an orbit force the existence of periodic points with least period {q} for various values of {q}? Or does it force the map to have positive entropy?

The assumptions of continuity and one-dimensionality are vital: for discontinuous one-dimensional maps, or for continuous higher-dimensional maps, it is possible to have a point of least period {p} without having points of any other period or having positive entropy.

In one dimension, on the other hand, one can begin with the following simple observation: if {f} has a periodic point with least period 2, then there is {p} such that {f(p) > p=f^2(p)}, and so applying the intermediate value theorem to {f(x)-x}, the map must have a fixed point.

What if {f} has a periodic point {p} with least period 3? Suppose that {p<f(p)<f^2(p)} (the other configurations are similar), and let {I_1=[p,f(p)]} and {I_2=[f(p),f^2(p)]}. Once again using the intermediate value theorem, we see that {f(I_1) \supset I_2} and {f(I_2) \supset I_1 \cup I_2}. Thus we may consider the topological Markov chain {(X,\sigma)} given by the graph in Figure 1, and observe that it embeds into the dynamics of {f}, so that in particular, {f} has periodic points of every order.

Fig 1
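
The periodic orbits of the topological Markov chain in Figure 1 can be counted from powers of its transition matrix: the number of periodic sequences of period {n} is {\mathrm{tr}(A^n)}, which is positive for every {n} and grows like the golden ratio to the power {n}, so the embedded subshift also has positive entropy. A short Python sketch, assuming the graph has vertices {1,2} and edges {1\rightarrow 2}, {2\rightarrow 1}, {2\rightarrow 2}:

```python
def mat_mult(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

A = [[0, 1], [1, 1]]     # transition matrix: edges 1->2, 2->1, 2->2
P = [[1, 0], [0, 1]]
for n in range(1, 11):
    P = mat_mult(P, A)
    print(n, P[0][0] + P[1][1])   # trace(A^n) = number of periodic sequences of period n
```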

Sharkovsky’s theorem generalises the above observations. Consider the following unorthodox total order on the integers:

\displaystyle  \begin{aligned} 3\prec 5&\prec 7\prec \cdots \\ 6\prec 10&\prec 14\prec \cdots \\ 12\prec 20&\prec 28\prec \cdots \\ &\vdots \\ \cdots \prec 16 &\prec 8 \prec 4 \prec 2 \prec 1 \end{aligned}

Then if {f} has a periodic point of least period {p}, it has a point of least period {q} for every {q\succ p}.

The paper by Burns and Hasselblatt includes a discussion of the history of the theorem and the various proofs that have been given. Here we outline the most streamlined proof available at the present time.

2. Sketch of the proof

Given two intervals {I,J\subset {\mathbb R}}, write {I\rightarrow J} if {f(I)\supset J}. As in the case of a period 3 orbit above, the key will be to construct a collection of intervals {I_j} (whose endpoints are certain points of the period {p} orbit) such that the covering relations {I_j\rightarrow I_k} yield a directed graph with periodic paths of all the desired periods.

To deduce existence of various periodic orbits for {f} from these covering relations, one makes the following observations.

  1. If {f^p(I)\supset I}, then {I} contains a fixed point of {f^p}.
  2. If {I_0 \rightarrow I_1 \rightarrow \cdots \rightarrow I_k = I_0}, then {f^k(I_0)\supset I_0}.
  3. The above two observations guarantee existence of {x\in I_0} with {f^k(x)=x}. To guarantee that {x} can be chosen with least period {k}, it suffices to know that some {I_j} has interior disjoint from all the others, and that the endpoints of this {I_j} do not follow the itinerary given by the sequence of covering relations above.

Detailed proofs of the above observations are in the Burns-Hasselblatt paper. With these tools in hand, the proof comes down to constructing the directed graph of covering relations referred to above.

Fig 2

To this end, consider a periodic orbit {P} with {p} points — the green points in Figure 2. These are permuted by {f}; some are moved to the left, some are moved to the right. Of all the points in the orbit that are moved to the right, let {z} be the rightmost, so that all points in the orbit to the right of {z} are moved to the left. Let {z'} be the next point to the right of {z}, and let {\Delta=[z,z']} as in the figure.

Now consider the intervals {I_k = f^k(\Delta)}. Note that {I_0} has the following property: every point of {P\cap I_0} switches sides of {\Delta} — that is, it is mapped from one side of {\Delta} to the other. Let {I_m} be the first interval that does not have this property, and let {x\in I_m\cap P} be a point that does not switch sides of {\Delta} under the action of {f}. For concreteness we consider the case where {x} is to the left of {\Delta}, as in Figure 2.

By construction, all the points between {x} and {\Delta} are mapped to the right of {\Delta}. Of these, let {y} be the point that is mapped furthest to the right.

Write {R_y} for the interval {[y,\infty)}, and consider the intervals {J_k = I_k \cap R_y} for {0\leq k < m}, together with {J_m = [x,y]}. We claim that {J_0\rightarrow J_0}, {J_m\rightarrow J_0}, and {J_k \rightarrow J_{k+1}} for each {0\leq k<m}, so that we have the covering relations shown in Figure 3, which can be used to establish existence of periodic points with all least periods greater than {m}.

Fig 3

So why do the intervals {J_k} have the covering relations shown?

  1. The fact that {J_0\rightarrow J_0} follows from the fact that {J_0=I_0=\Delta} and the endpoints of {\Delta} switch sides of {\Delta}.
  2. The definition of {I_k} gives {I_k\rightarrow I_{k+1}}.
  3. Writing {I_k^R} for the part of {I_k} lying in or to the right of {\Delta} and {I_k^L} for the part lying in or to the left, we see that {I_k^R\rightarrow I_{k+1}^L} and {I_k^L\rightarrow I_{k+1}^R}; this is because for {0\leq k < m}, all points of {P\cap I_k} switch sides of {\Delta} under the action of {f}. (This is the definition of {m}.)
  4. Using similar notation for {J_k^L} and {J_k^R}, we see that {I_k^R = J_k^R} and {J_k^L \rightarrow I_k^R}; this is because {y} was chosen to be the point that maps furthest to the right.
  5. This establishes that {J_k^L \rightarrow J_k^R}, and similarly {J_k^R \rightarrow J_k^L}.

For {k=m} we see that {f(x)} lies to the left of {\Delta} (since {x} does not switch sides), and {f(y)} to the right, so that {f(J_m)\supset \Delta=J_0}; this gives the relation {J_m\rightarrow J_0}.

The extra piece of information in the final item of the list above shows that the graph in Figure 3 can be replaced by the one in Figure 4.

[Figure 4]

Let us sum up what we know. One of the following two cases holds. Case 1: There is some point {x} of the orbit such that {x} and {f(x)} both lie on the same side of {\Delta}. Case 2: Every point of the orbit is mapped to the other side of {\Delta} by {f}.

In Case 1, the arguments above show that for every {q>p} and every even {q<p}, there is an orbit of least period {q}, which is at least as strong as the conclusion in Sharkovsky’s theorem.

In Case 2, one may observe that since {f} permutes the elements of {P} and every element is moved from one side of {\Delta} to the other, {p} must be even. Moreover, the points of {P} lying to the left and to the right of {\Delta} (call these {P^L} and {P^R}) are each invariant under {f^2}, so we can repeat the argument inductively to complete the proof of the theorem.


Central Limit Theorem for dynamical systems using martingales

This post is based on notes from Matt Nicol’s talk at the UH summer school in dynamical systems. The goal is to present the ideas behind a proof of the central limit theorem for dynamical systems using martingale approximations.

1. Conditional expectation

Before we can define and use martingales, we must recall the definition of conditional expectation. Let {(\Omega,\mathop{\mathbb P})} be a probability space, with {\mathop{\mathbb P}} defined on a {\sigma}-algebra {\mathop{\mathcal B}}. Let {\mathop{\mathcal F}\subset \mathop{\mathcal B}} be a sub-{\sigma}-algebra of {\mathop{\mathcal B}}.

Example 1 Consider the doubling map {T\colon [0,1]\rightarrow [0,1]} given by {T(x) = 2x\pmod 1}. Let {\mathop{\mathbb P}} be Lebesgue measure, {\mathop{\mathcal B}} the Borel {\sigma}-algebra, and {\mathop{\mathcal F} = T^{-1}\mathop{\mathcal B} = \{T^{-1}(B) \mid B\in \mathop{\mathcal B}\}}. Then {\mathop{\mathcal F}} is a sub-{\sigma}-algebra of {\mathop{\mathcal B}}, consisting of precisely those sets in {\mathop{\mathcal B}} which are unions of preimage sets {T^{-1}(x)} — that is, those sets {F\in\mathop{\mathcal B}} for which a point {y\in [0,\frac 12]} is in {F} if and only if {y+\frac 12\in F}.

This example extends naturally to yield a decreasing sequence of {\sigma}-algebras

\displaystyle  \mathop{\mathcal B} \supset T^{-1}\mathop{\mathcal B} \supset T^{-2}\mathop{\mathcal B} \supset \cdots.

Given a sub-{\sigma}-algebra {\mathop{\mathcal F}\subset \mathop{\mathcal B}} and a random variable {Y\colon \Omega\rightarrow {\mathbb R}} that is measurable with respect to {\mathop{\mathcal B}}, the conditional expectation of {Y} given {\mathop{\mathcal F}} is any random variable {Z} such that

  1. {Z} is {\mathop{\mathcal F}}-measurable (that is, {Z^{-1}(I)\in\mathop{\mathcal F}} for every interval {I\subset {\mathbb R}}), and
  2. {\int_A Z\,d\mathop{\mathbb P} = \int_A Y\,d\mathop{\mathbb P}} for every {A\in \mathop{\mathcal F}}.

It is not hard to show that these conditions determine {Z(\omega)} for almost every {\omega\in\Omega}, and so the conditional expectation is uniquely defined as a random variable (up to almost-sure equality). We write it as {\mathop{\mathbb E}[Y\mid\mathop{\mathcal F}]}.

A key property is that conditional expectation is linear: for every {\mathop{\mathcal F}\subset \mathop{\mathcal B}}, every {Y_1,Y_2\colon \Omega\rightarrow{\mathbb R}}, and every {a\in {\mathbb R}}, we have

\displaystyle  \mathop{\mathbb E}[aY_1 + Y_2\mid \mathop{\mathcal F}] = a\mathop{\mathbb E}[Y_1\mid\mathop{\mathcal F}] + \mathop{\mathbb E}[Y_2\mid\mathop{\mathcal F}].

Example 2 If {Y} is already {\mathop{\mathcal F}}-measurable, then {\mathop{\mathbb E}[Y\mid\mathop{\mathcal F}]=Y}.

Example 3 At the other extreme, if {Y} and {\mathop{\mathcal F}} are independent — that is, if {\mathop{\mathbb E}[Y{\mathbf{1}}_A]=\mathop{\mathbb E}[Y]\mathop{\mathbb P}[A]} for every {A\in\mathop{\mathcal F}} — then {\mathop{\mathbb E}[Y\mid\mathop{\mathcal F}]} is the constant function {\mathop{\mathbb E}[Y]{\mathbf{1}}}.

Example 4 Suppose {\{\Omega_i\mid i\in{\mathbb N}\}} is a countable partition of {\Omega} such that {\mathop{\mathbb P}[\Omega_i]>0} for every {i}. Let {\mathop{\mathcal F}=\sigma(\Omega_1,\Omega_2,\dots)} be the {\sigma}-algebra generated by the sets {\Omega_i}. Then

\displaystyle  \mathop{\mathbb E}[Y|\mathop{\mathcal F}] = \sum_{i\in{\mathbb N}} \frac{\mathop{\mathbb E}[Y{\mathbf{1}}_{\Omega_i}]}{\mathop{\mathbb P}[\Omega_i]} {\mathbf{1}}_{\Omega_i}.
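As a sanity check on this formula, here is a small Monte Carlo sketch (my own illustration; the specific choices of {\Omega}, partition, and {Y} are made up for the example): take {\Omega=[0,1)} with Lebesgue measure, the partition into the four intervals {[\frac i4,\frac{i+1}4)}, and {Y(\omega)=\omega^2}, and compare the formula with the exact cell averages.

```python
import numpy as np

rng = np.random.default_rng(0)
omega = rng.random(10**6)                  # samples from (Omega, P) = ([0,1), Lebesgue)
Y = omega**2                               # the random variable Y(omega) = omega^2

# Partition Omega_i = [i/4, (i+1)/4), i = 0,1,2,3; F = sigma(Omega_0,...,Omega_3).
cell = np.floor(4 * omega).astype(int)

# On each Omega_i, E[Y|F] is the constant E[Y 1_{Omega_i}] / P[Omega_i];
# the sample mean of Y over the cell estimates exactly this ratio.
estimate = np.array([Y[cell == i].mean() for i in range(4)])

# Exact values: 4 * integral_{i/4}^{(i+1)/4} x^2 dx = ((i+1)^3 - i^3) / 48.
exact = np.array([((i + 1)**3 - i**3) / 48 for i in range(4)])
print(np.round(estimate, 4))               # roughly [0.0208, 0.1458, 0.3958, 0.7708]
print(np.round(exact, 4))
```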

2. Martingales

Now we can define martingales, which are a particular sort of stochastic process (sequence of random variables) with “enough independence” to generalise results from the IID case.

Definition 1 A sequence {Y_n\colon \Omega\rightarrow {\mathbb R}} of random variables is a martingale if

  1. {\mathop{\mathbb E}[|Y_n|]<\infty} for all {n};
  2. there is an increasing sequence of {\sigma}-algebras (a filtration) {\mathop{\mathcal F}_1\subset\mathop{\mathcal F}_2\subset\cdots \subset \mathop{\mathcal B}} such that {Y_n} is measurable with respect to {\mathop{\mathcal F}_n};
  3. the conditional expectations satisfy {\mathop{\mathbb E}[Y_{n+1}|\mathop{\mathcal F}_n] = Y_n}.

The first condition guarantees that everything is in {L^1}. If {\mathop{\mathcal F}_n} is taken to be the {\sigma}-algebra of events that are determined by the first {n} outcomes of a sequence of experiments, then the second condition states that {Y_n} only depends on those first {n} outcomes, while the third condition requires that if the first {n} outcomes are known, then the expected value of {Y_{n+1} - Y_n} is 0.

Example 5 Let {Y_n} be a sequence of fair coin flips — IID random variables taking the values {\pm1} with equal probability. Let {S_n=Y_1+\cdots+Y_n}. As suggested in the previous paragraph, let {\mathop{\mathcal F}_n=\sigma(Y_1,\dots,Y_n)} be the smallest {\sigma}-algebra with respect to which {Y_1,\dots,Y_n} are all measurable. (The sets in {\mathop{\mathcal F}_n} are precisely those sets in {\mathop{\mathcal B}} which are determined by knowing the values of {Y_1,\dots,Y_n}.)

It is easy to see that {S_n} satisfies the first two properties of a martingale, and for the third, we use linearity of expectation and the definition of {\mathop{\mathcal F}_n} to get

\displaystyle  \mathop{\mathbb E}[S_{n+1}|\mathop{\mathcal F}_n] = \mathop{\mathbb E}[Y_1 + \cdots + Y_n|\mathop{\mathcal F}_n] + \mathop{\mathbb E}[Y_{n+1}|\mathop{\mathcal F}_n] = S_n + 0 = S_n.

When {Y_n} is a sequence of random variables for which {S_n = Y_1+\cdots+Y_n} is a martingale, we say that {Y_n} is a martingale difference sequence.

In the previous example the martingale property (the third condition) was a direct consequence of the fact that the random variables {Y_n = S_{n+1}-S_n} were IID. However, there are examples where the martingale differences are not IID.

Example 6 Polya’s urn is a stochastic process defined as follows. Consider an urn containing some number of red and blue balls. At each step, a single ball is drawn at random from the urn, and then returned to the urn, along with a new ball that matches the colour of the one drawn. Let {Y_n} be the fraction of the balls that are red after the {n}th iteration of this process.

Clearly the sequence of random variables {Y_n} is neither independent nor identically distributed. However, it is a martingale, as the following computation shows: suppose that at time {n} there are {p} red balls and {q} blue balls in the urn. (This knowledge represents knowing which element of {\mathop{\mathcal F}_n} we are in.) Then at time {n+1}, there will be {p+1} red balls with probability {\frac{p}{p+q}}, and {p} red balls with probability {\frac{q}{p+q}}. Either way, there will be {p+q+1} total balls, and so the expected fraction of red balls is

\displaystyle  \begin{aligned} \mathop{\mathbb E}[Y_{n+1}|\mathop{\mathcal F}_n] &= \frac{p}{p+q}\cdot\frac{p+1}{p+q+1} + \frac{q}{p+q}\cdot \frac{p}{p+q+1} \\ &= \frac{p(p+q+1)}{(p+q)(p+q+1)} = \frac{p}{p+q} = Y_n. \end{aligned}
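This computation is easy to check empirically. The sketch below (my own illustration) conditions on a particular state of the urn — say {p=3} red and {q=7} blue — simulates one further draw many times, and confirms that the average fraction of red balls after the draw is, up to sampling error, still {p/(p+q)}.

```python
import numpy as np

rng = np.random.default_rng(1)

def polya_step(red, blue, rng):
    """One draw: the drawn ball is returned together with a new ball of the same colour."""
    if rng.random() < red / (red + blue):
        return red + 1, blue
    return red, blue + 1

p, q, trials = 3, 7, 10**5                  # condition on the current state: 3 red, 7 blue
next_fraction = np.empty(trials)
for t in range(trials):
    r, b = polya_step(p, q, rng)
    next_fraction[t] = r / (r + b)

print(next_fraction.mean(), p / (p + q))    # both should be close to 0.3
```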

If we assume that the martingale differences are stationary (that is, the joint distribution of {(Y_{n+1},\dots,Y_{n+k})} does not depend on {n}) and ergodic, then we have the following central limit theorem for martingales, from a 1974 paper of McLeish (we follow some notes by S. Sethuraman for the statement).

Theorem 2 Let {Y_i} be a stationary ergodic sequence such that {\sigma^2 = \mathop{\mathbb E}[Y_i^2]<\infty} and {\mathop{\mathbb E}[Y_{n+1}|\mathop{\mathcal F}_n]=0}, where {\mathop{\mathcal F}_n} is the {\sigma}-algebra generated by {\{Y_i\mid 1\leq i\leq n\}}. Then {S_n = \sum_{i=1}^n Y_i} is a martingale, and {\frac{S_n}{\sigma\sqrt{n}}} converges in distribution to {N(0,1)}.

More sophisticated versions of this result are available, but this simple version will suffice for our needs.
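As a quick numerical sanity check of Theorem 2 in its simplest instance (IID {\pm1} coin flips, so {\sigma=1}), one can compare the empirical distribution of {S_n/\sqrt n} with the standard normal; the sketch below is my own illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, runs = 1000, 20000
Y = rng.choice([-1, 1], size=(runs, n))    # IID fair coin flips: sigma^2 = E[Y_i^2] = 1
Z = Y.sum(axis=1) / np.sqrt(n)             # S_n / (sigma sqrt(n))

# Compare empirical probabilities P[Z <= z] with the standard normal values.
for z, p in [(0.0, 0.5), (1.0, 0.8413), (2.0, 0.9772)]:
    print(z, np.mean(Z <= z), p)
```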

3. Koopman operator and transfer operator

Now we want to apply Theorem 2 to a dynamical system {T\colon X\rightarrow X} with an ergodic invariant measure {\mu} by taking {Y_i=\varphi\circ T^i} for some observable {\varphi\colon X\rightarrow{\mathbb R}}.

To carry this out, we consider two operators on {L^2(\mu)}. First we consider the Koopman operator {U\colon \varphi\mapsto \varphi\circ T}. Then we define the transfer operator {\mathop{\mathcal P}} to be its {L^2} adjoint — that is,

\displaystyle  \int(\mathop{\mathcal P}\varphi)\psi\,d\mu = \int\varphi(\psi\circ T)\,d\mu \ \ \ \ \ (1)

for all {\varphi,\psi\in L^2}. The key result for our purposes is that the operators {\mathop{\mathcal P}} and {U} are one-sided inverses of each other.

Proposition 3 Given {\varphi\in L^2}, we have

  1. {\mathop{\mathcal P} U\varphi=\varphi};
  2. {U\mathop{\mathcal P}\varphi=\mathop{\mathbb E}[\varphi|T^{-1}\mathop{\mathcal B}]}, where {\mathop{\mathcal B}} is the {\sigma}-algebra on which {\mu} is defined.

Proof: For the first claim, we see that for all {\psi\in L^2} we have

\displaystyle  \int\psi\cdot (\mathop{\mathcal P} U\varphi)\,d\mu = \int (\psi\circ T)(\varphi\circ T)\,d\mu = \int\psi\varphi\,d\mu,

where the first equality uses the definition of {\mathop{\mathcal P}} and the second uses the fact that {\mu} is invariant. To prove the second claim, we first observe that given an interval {I\subset{\mathbb R}}, we have

\displaystyle  (U\mathop{\mathcal P} \varphi)^{-1}(I) = T^{-1}((\mathop{\mathcal P} \varphi)^{-1}(I)) \in T^{-1}\mathop{\mathcal B},

since {\mathop{\mathcal P}} maps {\mathop{\mathcal B}}-measurable functions to {\mathop{\mathcal B}}-measurable functions. This shows that {U\mathop{\mathcal P}\varphi} is {T^{-1}\mathop{\mathcal B}}-measurable, and it remains to show that

\displaystyle  \int_{T^{-1}B} U\mathop{\mathcal P}\varphi\,d\mu = \int_{T^{-1}B} \varphi\,d\mu \text{ for all }B\in \mathop{\mathcal B}. \ \ \ \ \ (2)

This follows from a similar computation to the one above: given {B\in \mathop{\mathcal B}} we have

\displaystyle  \begin{aligned} \int_{T^{-1}B} U\mathop{\mathcal P}\varphi\,d\mu &= \int ((\mathop{\mathcal P}\varphi)\circ T) \cdot {\mathbf{1}}_{T^{-1}B}\,d\mu = \int((\mathop{\mathcal P}\varphi)\circ T) ({\mathbf{1}}_B \circ T)\,d\mu \\ &= \int(\mathop{\mathcal P}\varphi) {\mathbf{1}}_B\,d\mu = \int\varphi\cdot ({\mathbf{1}}_B\circ T)\,d\mu = \int_{T^{-1}B} \varphi\,d\mu, \end{aligned}

which establishes (2) and completes the proof. \Box

We see from Proposition 3 that a function has zero conditional expectation with respect to {T^{-1}\mathop{\mathcal B}} if and only if it is in the kernel of {\mathop{\mathcal P}}. In particular, if {\mathop{\mathcal P} h = 0} then {\sum_{j=1}^n h\circ T^j} is a martingale; this will be a key tool in the next section.

Example 7 Let {X=[0,1]} and {T} be the doubling map. Let {\mu} be Lebesgue measure. For convenience of notation we consider the {L^2} space of complex-valued functions on {X}; the functions {\varphi_n\colon x\mapsto e^{2\pi inx}} form an orthonormal basis for this space. A simple calculation shows that

\displaystyle  U\varphi_n(x) = \varphi_n(2x) = e^{2\pi in(2x)} = \varphi_{2n}(x),

so {U\colon \varphi_n\rightarrow \varphi_{2n}}. For the transfer operator we obtain {\mathop{\mathcal P}\colon \varphi_{2n}\rightarrow \varphi_n}, while for odd values of {n} we have

\displaystyle  \begin{aligned} \mathop{\mathcal P}\varphi_n(x) &= \frac 12 \left(\varphi_n\left(\frac x2\right) + \varphi_n\left(\frac{1+x}{2}\right)\right) \\ &= \frac 12 \left( e^{\pi i n x} + e^{\pi i n x} e^{\pi i n}\right) = 0. \end{aligned}
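The computation in Example 7 uses the explicit formula for the transfer operator of the doubling map with respect to Lebesgue measure, namely {(\mathop{\mathcal P}\varphi)(x) = \frac12\big(\varphi(\frac x2)+\varphi(\frac{x+1}2)\big)}, the average of {\varphi} over the two preimages of {x}. Here is a small numerical sketch (my own illustration) checking the adjoint relation (1) and the action {\mathop{\mathcal P}\colon\varphi_{2n}\mapsto\varphi_n}, {\mathop{\mathcal P}\varphi_n=0} for odd {n}.

```python
import numpy as np

N = 200000
x = np.arange(N) / N                       # uniform grid on [0,1)
dx = 1.0 / N
T = lambda t: (2 * t) % 1.0
mode = lambda n: (lambda t: np.exp(2j * np.pi * n * t))   # the Fourier mode phi_n

def P(phi):
    """Transfer operator of the doubling map w.r.t. Lebesgue: average over the two preimages."""
    return lambda t: 0.5 * (phi(t / 2) + phi((t + 1) / 2))

# Adjoint relation (1), tested with phi = phi_2 and psi = phi_{-1}.
phi2, psi = mode(2), mode(-1)
lhs = np.sum(P(phi2)(x) * psi(x)) * dx     # int (P phi) psi d(mu)
rhs = np.sum(phi2(x) * psi(T(x))) * dx     # int phi (psi o T) d(mu)
print(abs(lhs - rhs))                      # essentially zero (roundoff)

# P phi_{2n} = phi_n, and P phi_n = 0 for odd n (here n = 3).
print(np.max(np.abs(P(mode(6))(x) - mode(3)(x))))   # essentially zero
print(np.max(np.abs(P(mode(3))(x))))                # essentially zero
```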

4. Martingale approximation and CLT

The machinery of the Koopman and transfer operators from the previous section can be used to apply the martingale central limit theorem to observations of dynamical systems via the technique of martingale approximation, which was introduced by M. Gordin in 1969.

The idea is that if {\mathop{\mathcal P}^n\varphi\rightarrow 0} quickly enough for functions {\varphi} with {\int\varphi\,d\mu=0}, then we can approximate the sequence {\varphi\circ T^n} with a martingale sequence {h\circ T^n}.

More precisely, suppose that the sequence {\|\mathop{\mathcal P}^n\varphi\|_2} is summable; then we can define an {L^2} function {g} by

\displaystyle  g = \sum_{n=1}^\infty \mathop{\mathcal P}^n\varphi. \ \ \ \ \ (3)

Let {h = \varphi + g - g\circ T\in L^2}. We claim that {S_nh = \sum_{j=1}^n h\circ T^j} is a martingale. Indeed,

\displaystyle  \mathop{\mathcal P} h = \mathop{\mathcal P}\varphi + \mathop{\mathcal P} \sum_{n\geq 1} \mathop{\mathcal P}^n\varphi - \mathop{\mathcal P} U \sum_{n\geq 1} \mathop{\mathcal P}^n\varphi,

and since {\mathop{\mathcal P} U} is the identity we see that the last term is just {\sum_n \mathop{\mathcal P}^n \varphi}, so that {\mathop{\mathcal P} h = 0}.

Proposition 3 now implies that {\mathop{\mathbb E}[h|T^{-1}\mathop{\mathcal B}]=0}, and we conclude that {S_nh} is a martingale, so by the martingale CLT {\frac{S_n h}{\sigma\sqrt{n}}} converges in distribution to {N(0,1)}, where {\sigma^2 = \mathop{\mathbb E}[h^2] = \int h^2\,d\mu}.
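To see the construction in a completely explicit case (my own illustration): for the doubling map with Lebesgue measure and the observable {\varphi(x)=\cos 4\pi x}, the series (3) has only one nonzero term, since {\mathop{\mathcal P}\varphi=\cos 2\pi x} and {\mathop{\mathcal P}^2\varphi=0}. Thus {g=\cos 2\pi x} and {h = \varphi + g - g\circ T = \cos 2\pi x}, which is indeed in the kernel of {\mathop{\mathcal P}}. The sketch below verifies this numerically.

```python
import numpy as np

T = lambda x: (2 * x) % 1.0

def P(phi):
    """Transfer operator of the doubling map w.r.t. Lebesgue measure."""
    return lambda x: 0.5 * (phi(x / 2) + phi((x + 1) / 2))

phi = lambda x: np.cos(4 * np.pi * x)       # the observable; its Lebesgue integral is 0

# g = sum_{n >= 1} P^n phi; here the series terminates: P phi = cos(2 pi x), P^2 phi = 0.
Pphi = P(phi)
g = lambda x: Pphi(x) + P(Pphi)(x)

# h = phi + g - g o T should be a martingale difference, i.e. P h = 0.
h = lambda x: phi(x) + g(x) - g(T(x))

x = np.linspace(0, 1, 1001)
print(np.max(np.abs(P(h)(x))))                         # essentially zero (roundoff)
print(np.max(np.abs(h(x) - np.cos(2 * np.pi * x))))    # h equals cos(2 pi x), up to roundoff
```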

Now we want to apply this result to obtain information about {\varphi} itself, and in particular about {S_n\varphi = \sum_{j=1}^n \varphi\circ T^j}. We have {\varphi = h + g\circ T - g}, and so

\displaystyle  S_n\varphi = S_n h + \sum_{j=1}^n (g\circ T^{j+1} - g\circ T^j) = S_nh + g\circ T^{n+1} - g\circ T.

This yields

\displaystyle  \frac{S_n\varphi}{\sigma\sqrt n} = \frac{S_n h}{\sigma\sqrt n} + \frac{g\circ T^{n+1} - g\circ T}{\sigma\sqrt n},

and since {g\in L^2}, the {L^2} norm of the last term is at most {2\|g\|_2/(\sigma\sqrt n)}, so it goes to 0 in probability; this yields the central limit theorem for {S_n\varphi}.
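Purely as a numerical illustration (my own; the choices below are not from the talk), one can watch this central limit behaviour in a simulation. Iterating {x\mapsto 2x \bmod 1} in floating point collapses to {0} after about 50 steps, so the sketch instead uses the logistic map {x\mapsto 4x(1-x)}, which is smoothly semiconjugate to the doubling map, with the observable {\varphi(x)=x-\frac12} (zero mean for the invariant density {\frac 1{\pi\sqrt{x(1-x)}}}); the Birkhoff sums are standardised empirically rather than by computing {\sigma}.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: 4 * x * (1 - x)               # smoothly semiconjugate to the doubling map
phi = lambda x: x - 0.5                     # zero mean for the invariant density 1/(pi sqrt(x(1-x)))

n, runs, burn = 2000, 5000, 100
x = rng.random(runs)                        # many independent initial conditions
for _ in range(burn):                       # discard a transient
    x = f(x)

S = np.zeros(runs)
for _ in range(n):                          # Birkhoff sums S_n phi along each orbit
    S += phi(x)
    x = f(x)

Z = S / np.sqrt(n)
Z = (Z - Z.mean()) / Z.std()                # standardise empirically instead of computing sigma
for z, p in [(0.0, 0.5), (1.0, 0.8413), (2.0, 0.9772)]:
    print(z, np.mean(Z <= z), p)            # empirical vs standard normal probabilities
```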

Remark 1 There is a technical problem we have glossed over, which is that the sequence of {\sigma}-algebras {T^{-n}\mathop{\mathcal B}} is decreasing, not increasing as is required by the definition of a martingale. One solution to this is to pass to the natural extension {\hat T} and to consider the functions {h\circ \hat{T}^{-j}} and the {\sigma}-algebras {\hat{\mathop{\mathcal F}}_j = \hat{T}^j\mathop{\mathcal B}}. Another solution is to use reverse martingales, but we do not discuss this here.

Example 8 Let {X=[0,1]} and {T\colon X\rightarrow X} be an intermittent type (Manneville–Pomeau) map given by

\displaystyle  T(x) = \begin{cases} x(1 + (2x)^\alpha) & x\in [0,\frac 12), \\ 2x-1 & x\in [\frac 12,1], \end{cases}

where {0<\alpha<1} is a fixed parameter. It can be shown that {T} has a unique absolutely continuous invariant probability measure {\mu}, and that the transfer operator {\mathop{\mathcal P}} has the following contraction property: for every sufficiently regular (say, Hölder continuous) {\varphi} with {\int \varphi\,d\mu=0}, there is {C>0} such that {\|\mathop{\mathcal P}^n\varphi\|_2 \leq Cn^{-\gamma}}, where {\gamma = \frac 12(\frac 1\alpha -1)}.

For small values of {\alpha} (one needs {\gamma>1}, that is, {\alpha<\frac 13}), this shows that {\|\mathop{\mathcal P}^n\varphi\|_2} is summable, and consequently such observables satisfy the CLT by the above discussion.
