Entropy of S-gap shifts

1. S-gap shifts

S-gap shifts are a useful example for studying dynamics of shift spaces that are not subshifts of finite type but still exhibit some strong mixing properties. They are defined as follows: given {S\subset \{0,1,2,\dots,\}}, let {G\subset \{0,1\}^*} be the set of all words on two symbols of the form {0^n1} — that is, {n} 0s followed by a single 1. (Edit 8/2/15: As Steve Kass pointed out in his comment, we need to specify here that {0^n1\in G \Leftrightarrow n\in S}.) Then let {G^{{\mathbb Z}}} be the set of all bi-infinite sequences of 0s and 1s that can be written as an infinite concatenation of words in {G}, and let {X\subset \{0,1\}^{\mathbb Z}} be the smallest closed shift-invariant set containing {G^{\mathbb Z}}.

Equivalently, {X} is the collection of bi-infinite sequences {x\in \{0,1\}^{\mathbb Z}} for which every subword of the form {10^n1} has {n\in S}. If {S} is finite then {X} is a shift of finite type. We are usually most interested in the case where {S} is infinite — for example, in this paper (arXiv) where Dan Thompson and I considered questions of uniqueness of the measure of maximal entropy. For purposes of this post, {S} may be finite or infinite, it will not matter.

Recall that if {\mathcal{L}_n} denotes the set of words of length {n} that appear somewhere in the shift {X}, then the topological entropy of {X} is {h(X) = \lim \frac 1n \log \#\mathcal{L}_n}. The following result is well-known and often quoted.

Theorem 1 Given {S\subset \{0,1,2,\dots\}}, the topological entropy of the corresponding {S}-gap shift is {h(X) = \log\lambda}, where {\lambda>1} is the unique solution to {\sum_{n\in S} \lambda^{-(n+1)} = 1}.

Note that when {S=\{0,1,2,\dots\}}, the {S}-gap shift is the full shift on two symbols, and the equation has solution {\lambda=2}.

Despite the fact that Theorem 1 is well-known, I am not aware of a complete proof written anywhere in the literature. In a slightly different language, this result already appears in B. Weiss’ 1970 paper “Intrinsically ergodic systems” [Bull. AMS 76, 1266–1269] as example 3.(3), but no details of the proof are given. It is exercise 4.3.7 in Lind and Marcus’ book “Symbolic dynamics and coding”, and the section preceding it gives ideas as to how the proof may proceed. Finally, a more detailed proof appears in Spandl’s “Computing the topological entropy of subshifts” [Math. Log. Quart. 53 (2007), 493–510], but there is a gap in the proof. The goal of this post is to explain where the gap in Spandl’s proof is, and then to give two other proofs of Theorem 1, one more combinatorial, the other more ergodic theoretic.

2. An incomplete proof

The idea of Spandl’s proof is this. Given an {S}-gap shift, let {\mathcal{R}_n} be the set of words of length {n} that end in the symbol {1}. Every such word is either {0^{n-1}1} or is of the form {w0^k1}, where {w\in \mathcal{R}_{n-(k+1)}} and {k\in S}. Thus we have

\displaystyle \#\mathcal{R}_n = 1 + \sum_{k=0}^{n-1} {\mathbf{1}}_S(k) \#\mathcal{R}_{n-(k+1)}. \ \ \ \ \ (1)

Moreover, every {w\in \mathcal{L}_n} is either {0^n} or is of the form {v0^{n-k}} for some {v\in \mathcal{R}_k}, so {\#\mathcal{L}_n = 1 + \sum_{k=1}^n \#\mathcal{R}_k}. With a little more work, one can use this together with (1) to get

\displaystyle \#\mathcal{L}_n = c_n + \sum_{k=0}^{n-1} {\mathbf{1}}_S(k) \#\mathcal{L}_{n-(k+1)},

where {c_n\in [0,n+2]} for all {n}. Dividing through by {\#\mathcal{L}_n} gives

\displaystyle 1 = \frac{c_n}{\#\mathcal{L}_n} + \sum_{k=0}^{n-1} {\mathbf{1}}_S(k)\frac{\#\mathcal{L}_{n-(k+1)}}{\#\mathcal{L}_n} \ \ \ \ \ (2)

Writing {\lambda = e^{h(X)}}, Spandl now says that {\#\mathcal{L}_n} is asymptotically proportional to {\lambda^n}, and so for each {k} the ratio inside the sum converges to {\lambda^{-(k+1)}} as {n\rightarrow\infty}. Since {c_n} is subexponential, this would prove Theorem 1.

The problem is that the ratio {\frac{\#\mathcal{L}_{n-(k+1)}}{\#\mathcal{L}_n}} may not converge as {n\rightarrow\infty}. Indeed, taking {S = \{1,3,5,7,\dots\}} it is not hard to show that when {k} is even, the limit taken along odd values of {n} differs from the limit taken along even values of {n}.

One might observe that in this specific example, the terms where {k} is even do not contribute to the sum in (2), since {S} contains no even numbers. Thus it is plausible to make the following conjecture.

Conjecture. For any {S}, let {d = \mathrm{gcd}(S+1)} and let {h = \log\lambda} be the topological entropy of the corresponding {S}-gap shift. Then for every {j=0,1,\dots,d-1} the limit

\displaystyle a_j := \lim_{n\rightarrow\infty} \frac{\#\mathcal{L}_{j+dn}}{\lambda^{j+dn}}

exists. In particular, if {k\in S} then

\displaystyle \frac{\#\mathcal{L}_{n-(k+1)}}{\#\mathcal{L}_n} = \lambda^{-(k+1)}\frac{\#\mathcal{L}_{n-(k+1)}}{\lambda^{n-(k+1)}} \frac{\lambda^n}{\#\mathcal{L}_n} \rightarrow \lambda^{-(k+1)} \text{ as }n\rightarrow\infty.

If the conjecture is true, this would complete Spandl’s proof of Theorem 1. I expect that the conjecture is true but do not know how to prove it. In general, any shift space with entropy {h=\log \lambda} has the property that {\#\mathcal{L}_n\geq \lambda^n}. There are examples of shift spaces where {\#\mathcal{L}_n /\lambda^n} is not bounded above; however, it can be shown that every {S}-gap shift admits an upper bound, so that {\#\mathcal{L}_n/\lambda^n} is bounded away from 0 and {\infty} (this is done in my paper with Dan Thompson). I don’t see how those techniques can extend to a proof of the conjecture.

So instead of patching the hole in this proof, we investigate two others.

3. A combinatorial proof

Given a shift space {X} with language {\mathcal{L}}, let {a_n = \#\mathcal{L}_n} be the number of words of length {n}. Consider the following generating function:

\displaystyle H(z) = \sum_{n=1}^\infty a_n z^n.

(This is similar to the dynamical zeta function but is somewhat different since we consider all words of length {n} and not just ones that represent periodic orbits of period {n}.) Observe that the radius of convergence of {H} is {\xi := e^{-h(X)}}. Indeed, for {|z|<\xi} we can put {\beta = |z|/\xi < 1} and observe that the quantity {|a_n z^n| = \frac{\#\mathcal{L}_n}{e^{nh(X)}} \beta^n} decays exponentially; similarly, for {|z|>\xi} the terms in the sum grow exponentially.

Now fix an {S}-gap shift and consider the function

\displaystyle F_1(z) = \sum_{n\in S} z^{n+1}.

Our goal is to find a relationship between {H} and {F_1} allowing us to show that the radius of convergence of {H} is given by the positive solution to {F_1(z)=1}.

First recall the set {G = \{0^n1\mid n\in S\}}. Given integers {k,n}, let {A_n^k} be the number of words in {\{0,1\}^n} that can be written as a concatenation of exactly {k} words from {G}. Note that {A_n^1 = {\mathbf{1}}_S(n-1)}, so that

\displaystyle F_1(z) = \sum_{n=1}^\infty A_n^1 z^n.

More generally, we consider the power series

\displaystyle F_k(z) = \sum_{n=1}^\infty A_n^k z^n.

A little thought reveals that {F_k(z)F_\ell(z) = F_{k+\ell}(z)}, since the coefficient of {z^n} in {F_k(z)F_\ell(z)} is given by

\displaystyle \sum_{m=1}^n A_{n-m}^k A_m^\ell,

which is equal to {A_n^{k+\ell}} (here {m} represents the location where the {\ell}th element of {G} ends). In particular, we get

\displaystyle F_k(z) = (F_1(z))^k.

At this point, the natural thing to do is to say that {\#\mathcal{L}_n = \sum_{k\geq 1} A_n^k} and hence {H(z) = \sum_{k\geq 1} F_k(z)}. However, this is not quite correct because {\mathcal{L}_n} includes words that are not complete concatenations of words from {G} and so are not counted by any {A_n^k}. We return to this in a moment, but first point out that if this were true then we would have {H(z) = \sum_{k\geq 1} F_1(z)^k}, and so {H(z)} converges if {F_1(z)<1} and diverges if {F_1(z)>1}, which was our goal.

To make this precise, we observe that every word in {\mathcal{L}_n} is either {0^n} or is of the form {0^i w 0^j} where {w\in A_{n-(i+j)}^k} for some {k}. Thus we have the bounds

\displaystyle \sum_{k\geq 1} A_n^k \leq \#\mathcal{L}_n \leq 1 + \sum_{k\geq 1} \sum_{i\geq 0} \sum_{j\geq 0} A_{n-(i+j)}^k.

In particular, for all {z\geq 0} we get

\displaystyle \sum_{k\geq 1} F_k (z) \leq H(z) \leq \sum_{n\geq 1} z^n + \sum_{k\geq 1} F_k (z) \left(\sum_{i\geq 0} z^i \right) \left(\sum_{i\geq 0} z^j\right) \ \ \ \ \ (3)

Writing {\hat H(z) = \sum_k F_k(z) = \sum_k F_1(z)^k}, we note that for every {z\in [0,1)} we have {\sum_n z^n < \infty} and so {\hat H(z)} converges if and only if {H(z)} converges. Thus {\hat H} and {H} have the same radius of convergence; in particular, the radius of convergence of {H(z)} is the unique {z} such that {F_1(z) = \sum_{n\in S} z^{n+1} = 1}, and by the earlier discussion we have {h(X) = \log(1/z)}, proving Theorem 1.

4. An ergodic theory proof

Now we sketch a proof that goes via the variational principle, relating an {S}-gap shift (on the finite alphabet {\{0,1\}}) to a full shift on a (possibly countable) alphabet. The combinatorial proof above is elementary and requires little advanced machinery; this proof, on the other hand, requires a number of (rather deep) results from ergodic theory and thermodynamic formalism, but has the advantage of illuminating various aspects of the structure of {S}-gap shifts.

Let {X} be an {S}-gap shift and let {Y} be the set of sequences {x\in X} which contain the symbol 1 infinitely often both forwards and backwards. By the Poincaré recurrence theorem, {\mu(Y)=1} for every ergodic measure other than the delta-measure on the fixed point {0^\infty}. Note that {Y} is {\sigma}-invariant but not compact (unless {S} is finite).

Let {Z = Y \cap [1]}, and let {F\colon Z\rightarrow Z} be the first return map. Thus {F(x) = \sigma^n(x)} for all {x\in [10^n 1]}. Note that {F} is topologically conjugate to the full shift {S^{\mathbb Z}} on the alphabet {S}, which we allow to be finite or countable. The conjugacy is given by {\pi\colon S^{\mathbb Z}\rightarrow Z} that takes {\vec n = \cdots n_{-1}.n_0n_1\cdots} to {\cdots 10^{n_{-1}}.10^{n_0}10^{n_1}1\cdots}.

Given an ergodic {\sigma}-invariant probability measure {\mu} on {X} with {\mu(Y)=1}, let {\mu_Z} be the induced {F}-invariant measure on {Z}. Then by Abramov’s formula, we have

\displaystyle h_\mu(X,\sigma) = \mu(Z) h_{\mu_Z}(Z,F).

Associate to each {\mu} the shift-invariant measure on {S^{\mathbb Z}} given by {h^{-1}_*\mu_Z}. Then we have

\displaystyle h_\mu(X,\sigma) = \mu(Z) h_{\nu}(S^{\mathbb Z},\sigma).

Our goal is to relate the topological entropy on {(X,\sigma)} to the topological pressure of a suitable function on {(S^{\mathbb Z},\sigma)}. Let {\varphi\colon S^{\mathbb Z}\rightarrow {\mathbb R}} be the function taking {\vec n} to {n_0 + 1}, and observe that if {\mu} and {\nu} are identified as above, we have {\mu(Z) = \mu([1]) = 1/\int \varphi\,d\nu}, so that

\displaystyle h_\nu = \left(\int \varphi\,d\nu \right) h_\mu.

At this point we elide some technical details regarding Gurevich pressure, etc., and simply remark that for {t>0} we have

\displaystyle \begin{aligned} P(S^{\mathbb Z},-t\varphi) &= \lim_{k\rightarrow\infty} \frac 1k \log \sum_{(n_0,\dots,n_{k-1})\in S^k} \sup_{\vec m\in [n_0\cdots n_{k-1}]} e^{-t \sum_{i=0}^{k-1} \varphi(\sigma^i\vec m)} \\ &= \lim_{k\rightarrow\infty} \frac 1k \log \left(\sum_{n_0,\dots,n_{k-1}} e^{-t\sum_{i=0}^{k-1} (n_i + 1)} \right) \\ &= \lim_{k\rightarrow\infty} \frac 1k \log \left( \sum_{n\in S} e^{-t(n+1)} \right)^k = \log \sum_{n\in S} e^{-t(n+1)}, \end{aligned}

while by the variational principle

\displaystyle P(S^{\mathbb Z},-t\varphi) = \sup_\nu \left(h_\nu - t \int\varphi\,d\nu\right) = \sup_\nu \left(\int\varphi\,d\nu\right) (h_\mu - t).

We conclude that

\displaystyle \sup_\mu (h_\mu - t) \mu[1]^{-1} = \log \sum_{n\in S} e^{-t(n+1)}. \ \ \ \ \ (4)

Let {t} be such that the right-hand side is equal to 0; to prove Theorem 1 we need to prove that {h(X,\sigma) = t}. First observe that for every {\mu} we have {h_\mu - t \leq 0}, thus {h(X,\sigma) \leq t}. For the other inequality, let {\nu} be the Bernoulli measure on {S^{\mathbb Z}} that assigns weight {e^{-t(n+1)}} to the symbol {n}. Then {\nu} is an equilibrium state for {-t\varphi} on {S^{\mathbb Z}}, and by our choice of {t}, we have

\displaystyle h_\nu - \int \varphi\,d\nu = 0,

so that in particular the left hand side of (4) vanishes and we get {h_\mu=t} for the measure {\mu} that corresponds to {\nu}. This shows that {h(X,\sigma) = t} and completes the proof of the theorem.

We note that this proof has the advantage of giving an explicit description of the MME. With a little more work it can be used to show that the MME is unique and has good statistical properties (central limit theorem, etc.).

Posted in ergodic theory, examples | Tagged , | 2 Comments

Slowly mixing sets

There are two equivalent definitions of mixing for a measure-preserving dynamical system {(X,f,\mu)}. One is in terms of sets:

\displaystyle  \lim_{n\rightarrow\infty} |\mu(A \cap f^{-n}B) - \mu(A)\mu(B)| = 0 \ \ \ \ \ (1)

for all measurable {A,B\subset X}. The other is in terms of functions:

\displaystyle  \lim_{n\rightarrow\infty} \left\lvert \int \varphi \cdot(\psi\circ f^n)\,d\mu - \int\varphi\,d\mu \int\psi\,d\mu\right\rvert = 0 \ \ \ \ \ (2)

for all {\varphi,\psi\in L^1(X,\mu)}. In both cases one may refer to the quantity inside the absolute value as the correlation function, and write it as {C_n(A,B)} or {C_n(\varphi,\psi)}, so that (1) and (2) become {C_n(A,B)\rightarrow 0} and {C_n(\varphi,\psi)\rightarrow 0}, respectively.

Then it is natural to ask how quickly mixing occurs; that is, how quickly the correlation functions {C_n} decay to 0. Is the convergence exponential in {n}? Polynomial? Something else?

In a previous post, we discussed one method for establishing exponential decay of correlations using a spectral gap for a certain operator associated to the system. The result there stated that as long as {\varphi} and {\psi} are chosen from a suitable class of test functions (often Hölder continuous functions), then {C_n(\varphi,\psi)} decays exponentially.

A comment was made in that post that it is important to work with sufficiently regular test functions, because for arbitrary measurable functions, or for the setwise correlations {C_n(A,B)}, the decay may happen arbitrarily slowly even when the system is “as mixing as possible”. But no examples were given there — so I’d like to explain this a little more completely here.

Let {X=\{0,1\}^{\mathbb N}} be the space of infinite sequences of 0s and 1s, and let {f=\sigma} be the shift map. Let {\mu} be {(\frac 12,\frac 12)}-Bernoulli measure, so that writing {[x_1\cdots x_n]} for the {n}-cylinder containing all sequences in {X} that begin with the symbols {x_1,\dots,x_n}, we have {\mu[x_1\cdots x_n] = 2^{-n}}. This is isomorphic to the doubling map on the circle with Lebesgue measure, which was discussed in the previous post and which has exponential decay of correlations for Hölder observables {\varphi,\psi}. Nevertheless, we will show that there are sets {A,B\subset X} such that {C_n(A,B)} decays quite slowly.

We can think of {(X,f,\mu)} as modeling sequences of coin flips, with tails corresponding to the symbol 0, and heads to 1.

Let {B = [1]} be the set of all sequences in {X} that begin with 1. Then {B} corresponds to the event that the first flip is heads, and {\sigma^{-n}B} corresponds to the event that the {(n+1)}st flip is heads.

The description of {A} is a little more involved. Given {j\leq k \in {\mathbb N}}, let {A_{j,k} = \bigcup_{i=j}^{k-1} \sigma^{-i}[1]}, so that {x\in A_{j,k}} if and only if (at least) one of the symbols {x_j,\dots,x_{k-1}} is 1. This corresponds to the event that heads appears at least once on or after the {j}th flip and before the {k}th flip. Note that {\mu(X \setminus A_{j,k}) = 2^{-(k-j)}}, and so {\mu(X \setminus A_{j,k}) = 1-2^{-(k-j)}}.

Fix an increasing sequence of integers {k_1,k_2,\dots} with {k_1 = 1}, and let {A = \bigcap_{i=1}^\infty A_{k_i, k_{i+1}}} be the set of all sequences {x\in X} such that every subword {x_{k_i} \cdots x_{k_{i+1}-1}} contains at least one occurrence of the symbol 1.

Using the interpretation as coin flips, we see that the events {A_{k_i,k_{i+1}-1}} are independent, and so

\displaystyle  \mu(A) = \mathop{\mathbb P}(A) = \prod_{i=1}^\infty \mathop{\mathbb P}(A_{k_i,k_{i+1}}) = \prod_{i=1}^\infty (1-2^{-(k_{i+1} - k_i)}). \ \ \ \ \ (3)

We recall the following elementary lemma.

Lemma 1 Let {t_i\in {\mathbb R}} be a sequence of real numbers such that {0\leq t_i < 1} for all {i\in {\mathbb N}}. Then {\prod_{i=1}^\infty (1-t_i) > 0} if and only if {\sum_{i=1}^\infty t_i < \infty}.

It follows from Lemma 1 that {\mu(A)>0} if and only if

\displaystyle  \sum_{i=1}^\infty 2^{-(k_{i+1} - k_i)} < \infty. \ \ \ \ \ (4)

Thus from now on we assume that the sequence {k_i} is chosen such that (4) holds. Note that the correlation function {C_n(A,B)} can be computed via conditional probabilities: using

\displaystyle  \mathop{\mathbb P}(A \mid \sigma^{-n}B) = \frac{\mu(A \cap \sigma^{-n}B)}{\mu(B)}

(notice that this uses shift-invariance of {\mu}), we have

\displaystyle  \begin{aligned} C_n(A,B) &= \mu(A\cap \sigma^{-n}B) - \mu(A)\mu(B) \\ &= \mu(B) (\mathop{\mathbb P}(A \mid \sigma^{-n}B) - \mu(A)). \end{aligned} \ \ \ \ \ (5)

Recalling that {A} is the intersection of the independent events {A_{k_i,k_{i+1}}}, we let {j=j(n)} be such that {n\in [k_j, k_{j+1})}, and observe that {\sigma^{-n}(B) \subset A_{k_j,k_{j+1}}}. Thus

\displaystyle  \mathop{\mathbb P}(A \mid \sigma^{-n}B) = \prod_{i\neq j} (1-2^{-(k_{i+1}-k_i)}) = \frac{\mu(A)}{1-2^{-(k_{j+1}-k_j)}},

where the last equality uses the expression for {\mu(A)} from (3). Together with the expression (5) for {C_n(A,B)} in terms of the conditional probability {\mathop{\mathbb P}(A\mid \sigma^{-n}B)}, this gives

\displaystyle  \begin{aligned} C_n(A,B) &= \mu(A) \mu(B) \left( \frac 1{1-2^{-(k_{j+1}-k_j)}} - 1 \right) \\ &= \mu(A)\mu(B) \frac 1{2^{k_{j+1} - k_j} - 1}, \end{aligned} \ \ \ \ \ (6)

where we emphasise that {j} is a function of {n}.

Example 1 Take {k_i = \lfloor 2 i\log_2 i \rfloor}, so that {k_{i+1} - k_i \approx 2 i \log_2 (1+\frac 1i) + 2\log_2(i+1)}, and {2^{k_{i+1} - k_i} \approx (i+1)^2 (1+\frac 1i)^{2i}}. Since {(1+\frac 1i)^{2i} \rightarrow e^2}, we see that (4) is satisfied, so that {\mu(A)>0}.

Moreover, {j=j(n)} satisfies {2j\log_2 j \leq n}, and so in particular {j\leq n}. This implies that {k_{j+1} - k_j \leq k_{n+1} - k_n}, and so we can estimate the correlation function using (6) by

\displaystyle  C_n(A,B) \geq \mu(A)\mu(B)\frac 1{2^{k_{n+1} - k_n} - 1} \approx \mu(A)\mu(B) \frac 1{n^2}.

The example shows that the correlation function of sets may only decay polynomially, even if the system has exponential decay of correlations for regular observable functions. In fact, we can use the construction above to produce sets with correlations that decay more or less as slowly as we like along some subsequence of times {n}.

Theorem 2 Let {\gamma_n>0} be any sequence of positive numbers converging to 0. Then there is a sequence {k_i} such that the sets {A,B} from the construction above have {\limsup_{n\rightarrow\infty} C_n(A,B)/\gamma_n = \infty}.

The result says that no matter how slowly {\gamma_n} converges to 0, we can choose {A,B} such that {C_n(A,B)} is not bounded above by {K\gamma_n} for any constant {K}.

To prove Theorem 2, we will choose {k_i} such that {C_n(A,B)/\gamma_n} is large whenever {n\in [k_j, k_{j+1})} with {j} even. Let {c_i,d_i} be any increasing sequences of integers (conditions on these will be imposed soon) and define {k_i} recursively by

\displaystyle  k_1 = 1, \quad k_{2i} = k_{2i-1} + c_i, \qquad k_{2i+1} = k_{2i} + d_i.

Because {c_i,d_i} are increasing, we have

\displaystyle  \sum_{i=1}^\infty 2^{-(k_{i+1} - k_i)} = \sum_{i=1}^\infty 2^{-c_i} + 2^{-d_i} < \infty,

so (4) holds and {\mu(A)>0}. The idea is that we will take {c_i \gg d_i}, so that when {n\in [k_{2j},k_{2j+1})}, we have {k_{2j+1} - k_{2j} = d_j} small relative to {k_{2j}}, and hence to {n}.

Let’s flesh this out a little. Given {n\in [k_{2j},k_{2j+1})}, it follows from (6) that

\displaystyle  C_n(A,B) = \mu(A)\mu(B) \frac 1{2^{k_{j+1} - k_j} - 1} \geq \mu(A)\mu(B) 2^{-d_j}. \ \ \ \ \ (7)

On the other hand, {n\geq k_{2j} \geq \sum_{i=1}^j c_i}. Now we may take {d_j} to be any increasing sequence ({d_j=j} will do just fine) and define {c_j} recursively in terms of {c_1,\dots, c_{j-1}} and {d_j}. We do this by observing that since {\gamma_n \rightarrow 0}, given any {c_1,\dots, c_{j-1}} and {d_j}, we can take {c_j} large enough so that for every {n\geq \sum_{i=1}^j c_i}, we have

\displaystyle  j\gamma_n \leq \mu(A) \mu(B) 2^{-d_j}.

In particular, (7) gives {C_n(A,B) \geq j\gamma_n} for every {n\in [k_{2j},k_{2j+1})}. Sending {j\rightarrow\infty} completes the proof of Theorem 2.

Posted in ergodic theory, examples | Tagged | 1 Comment

Law of large numbers for dependent but uncorrelated random variables

One of the fundamental results in probability theory is the strong law of large numbers, which was discussed in an earlier post under the guise of the Birkhoff ergodic theorem.

Suppose we have a sequence of random variables {t_n} which take values in {{\mathbb N}}. If the sequence {t_n} is independent and identically distributed with {\mathop{\mathbb E}[t_n] = T}, then the strong law of large numbers shows that {\frac 1n(t_1+\cdots + t_n)\rightarrow T} almost surely.

It is often the case, however, that one or both of these conditions fail; there may be some dependence between the variables {t_n}, or the distribution may not be identical for all values of {n}. For example, the following situation arose recently in a project I have been working on: there is a sequence of random variables {t_n} as above, which has the property that no matter what values {t_1,\dots,t_{n-1}} take, the random variable {t_n} has expected value at most {T}. In the language of conditional expectation, we have

\displaystyle  \mathop{\mathbb E}[t_n \mid t_1,\dots, t_{n-1}] \leq T.

This is true despite the fact that the distribution of {t_n} may vary according to the values of {t_1,\dots,t_{n-1}}. In what follows we will write {\mathcal{F}_n} for the {\sigma}-algebra generated by {t_1,\dots, t_n} and write the above condition as

\displaystyle  \mathop{\mathbb E}[t_n \mid \mathcal{F}_{n-1}]\leq T.

It is plausible to conjecture that in this setting one still has {\varlimsup \frac 1n(t_1+\cdots+t_n) \leq T} almost surely. And indeed, it turns out that a statement very close to this can be proved.

Let {\mathop{\mathcal F}_n} be an increasing sequence of {\sigma}-algebras and let {t_n} be a sequence of {{\mathbb N}}-values random variables such that {t_n} is {\mathop{\mathcal F}_n}-measurable. Suppose that there is {p\colon {\mathbb N} \rightarrow [0,1]} such that

\displaystyle  \mathop{\mathbb P}[t_n=t \mid \mathcal{F}_{n-1}] \leq p(t) \ \ \ \ \ (1)

and moreover,

\displaystyle   T := \sum_{t=1}^\infty t \cdot p(t) < \infty. \ \ \ \ \ (2)

The following result is proved by modifying one of the standard proofs of the strong law of large numbers, which can be found, for example, in Billingsley’s book “Probability and Measure”, where it is Theorem 22.1. There is also a discussion on Terry Tao’s blog, which emphasises the role of the main tools in this proof: the moment method and truncation.

Proposition 1 If {t_n} is any sequence of random variables satisfying (1) and (2), then {\varlimsup_{n\rightarrow\infty} \frac 1n \sum_{k=1}^n t_k \leq T} almost surely.

Proof: Consider the truncated random variables {s_n = t_n {\mathbf{1}}_{[t_n\leq n]}}, and note that by (1) and (2) we have

\displaystyle  0 \leq s_n \leq t_n \Rightarrow \mathop{\mathbb E}[s_n \mid \mathcal{F}_{n-1}] \leq T.

Now consider the random variables {r_n = s_n + T - \mathop{\mathbb E}[s_n \mid \mathcal{F}_{n-1}]}. Note that {r_n} is {\mathcal{F}_n}-measurable, takes values in {{\mathbb N}}, and moreover

\displaystyle  \mathop{\mathbb E}[r_n \mid \mathcal{F}_{n-1}] = T \text{ for all } n. \ \ \ \ \ (3)

The most important consequence of this is that the sequence {r_n} is uncorrelated — that is, {\mathop{\mathbb E}[r_ir_j] = \mathop{\mathbb E}[r_i]\mathop{\mathbb E}[r_j] = T^2} for all {i\neq j}, which in turn implies that {\mathop{\mathbb E}[(r_i - T)(r_j-T)] = 0}. This means that the variance is additive for the sequence {r_n}:

\displaystyle  \begin{aligned} \mathop{\mathbb E}\Big[ \Big(\sum_{k=1}^n (r_k -T)\Big)^2\Big] &= \sum_{k=1}^n \mathop{\mathbb E}[(r_k -T)^2] + 2\sum_{1\leq i<j\leq n} \mathop{\mathbb E}[(r_i-T)(r_j-T)] \\ &= \sum_{k=1}^n \mathop{\mathbb E}[(r_k -T)^2]. \end{aligned}

Let {X_n = \frac 1n (r_1 + \cdots + r_n) - T}, so that {\mathop{\mathbb E}[X_N] = 0}. We can estimate {\mathop{\mathbb E}[X_n^2]} using the previous computation:

\displaystyle  \begin{aligned} \mathop{\mathbb E}[X_n^2] &= \frac 1{n^2} \sum_{k=1}^n \mathop{\mathbb E}[(r_k-T)^2] = \frac 1{n^2} \sum_{k=1}^n (\mathop{\mathbb E}[r_k^2] - 2T\mathop{\mathbb E}[r_k] + T^2) \\ &= \frac 1{n^2} \left( nT^2 + \sum_{k=1}^n \mathop{\mathbb E}[r_k^2]\right). \end{aligned} \ \ \ \ \ (4)

Recall that {p(t)} is the sequence bounding {\mathop{\mathbb P}[t_n = t \mid \mathop{\mathcal F}_{n-1}]}, and let {\bar t} be such that {\sum_{t\geq \bar t} p(t) \leq 1}. Let {Y} be a random variable taking the value {t} with probability {p(t)} for {t\geq \bar t}. Note that by the definition of {s_n} and {r_n}, we have {r_n \leq s_n + T \leq n+T}, thus

\displaystyle  \mathop{\mathbb E}[r_k^2] = \sum_{t=1}^{k+T} t^2 \mathop{\mathbb P}[r_k = t] \leq \sum_{t=1}^{k+T} t^2 p(t) \leq C + \mathop{\mathbb E}[Y^2 {\mathbf{1}}_{[Y\leq k+T]}]

for some fixed constant {C}. Note that the final expression is non-decreasing in {k}, and so together with (4) we have

\displaystyle  \mathop{\mathbb E}[X_n^2] \leq \frac 1n \left( C' + \mathop{\mathbb E}[Y^2 {\mathbf{1}}_{[Y\leq n+T]}] \right), \ \ \ \ \ (5)

where again {C'} is a fixed constant. Fixing {\varepsilon>0} and using (5) in Chebyshev’s inequality yields

\displaystyle  \mathop{\mathbb P}[|X_n| \geq \varepsilon] \leq \frac 1{\varepsilon^2 n} \left( C' + \mathop{\mathbb E}[Y^2 {\mathbf{1}}_{[Y\leq n+T]}] \right). \ \ \ \ \ (6)

Now let {n_k} be a sequence of integers such that {\lim_k \frac{n_{k+1}}{n_k} =: \alpha > 1}. We see that (6) yields

\displaystyle  \begin{aligned} \sum_{k=1}^\infty \mathop{\mathbb P}[|X_{n_k}| \geq \varepsilon] &\leq \frac 1{\varepsilon^2} \sum_{k=1}^\infty n_k^{-1}\left( C' + \mathop{\mathbb E}[Y^2 {\mathbf{1}}_{[Y\leq n_k+R]}]\right) \\ &\leq C'' + \frac 1{\varepsilon^2} \mathop{\mathbb E} \left[ Y^2 \sum_{k=1}^\infty n_k^{-1} {\mathbf{1}}_{[Y\leq n_k+T]}\right], \end{aligned} \ \ \ \ \ (7)

where {C''<\infty} depends on {\alpha} and on {\varepsilon}. Observe that

\displaystyle  \sum_{k=1}^\infty n_k^{-1} {\mathbf{1}}_{[Y\leq n_k +T]} \leq \sum_{k=k_0(Y)}^\infty n_k^{-1},

where {k_0(Y)} is the minimal value of {k} for which {n_k\geq Y-T}. Because {n_k} grows like {\alpha^k}, the sum is bounded above by a constant times {n_{k_0(Y)}^{-1}}, so that together with (7) we have

\displaystyle  \sum_{k=1}^\infty \mathop{\mathbb P}[|X_{n_k}| \geq \varepsilon] \leq C'' + C''' \mathop{\mathbb E}[Y] < \infty,

where finiteness of {\mathop{\mathbb E}[Y]} is the crucial place where we use (2).

Using the first Borel-Cantelli lemma and taking an intersection over all rational {\varepsilon>0} gives

\displaystyle  \lim_{n\rightarrow\infty} \frac 1{n_k} \sum_{j=1}^{n_k} r_j = T \text{ a.s.} \ \ \ \ \ (8)

Let {Z_n = \sum_{j=1}^n r_j}. Because {r_j\geq 0} we have {Z_{n_k} \leq Z_n \leq Z_{n_{k+1}}} for all {n_k\leq n\leq n_{k+1}}, and in particular

\displaystyle  \frac {n_k}{n_{k+1}} \frac{Z_{n_k}}{n_k} \leq \frac{Z_n}{n} \leq \frac{n_{k+1}}{n_k} \frac{Z_{n_{k+1}}}{n_{k+1}}.

Taking the limit and using (8) gives

\displaystyle  \frac 1\alpha T \leq \varliminf_{k\rightarrow\infty} \frac 1k Z_k \leq \varlimsup_{k\rightarrow\infty} \frac 1k Z_k \leq \alpha T \text{ a.s.}

Taking an intersection over all rational {\alpha>1} gives

\displaystyle  \lim_{n\rightarrow\infty} \frac 1n \sum_{k=1}^n r_k = T \text{ a.s.} \ \ \ \ \ (9)

Finally, we recall from the definition of {s_n} and {r_n} that {t_n\leq r_n} whenever {t_n\leq n}. In particular, we may observe that

\displaystyle  \sum_{n=1}^\infty \mathop{\mathbb P}[t_n > r_n] \leq \sum_{n=1}^\infty \mathop{\mathbb P}[t_n > n] \leq \mathop{\mathbb E}[t_n] \leq T

and apply Borel-Cantelli again to deduce that with probability one, {t_n > r_n} for at most finitely many values of {n}. In particular, (9) implies that

\displaystyle  \varlimsup_{n\rightarrow\infty} \frac 1n \sum_{k=1}^n t_k \leq T \text{ a.s.} \ \ \ \ \ (10)

which completes the proof of Proposition 1. \Box

Posted in statistical laws, theorems | Tagged | 4 Comments

Equidistribution for random rotations

Two very different types of dynamical behaviour are illustrated by a pair of very well-known examples on the circle: the doubling map and an irrational rotation. On the unit circle in {{\mathbb C}}, the doubling map is given by {z\mapsto z^2}, while an irrational rotation is given by {z\mapsto e^{2\pi i\theta}z} for some irrational {\theta}.

Lebesgue measure (arc length) is invariant for both transformations. For the doubling map, it is just one of many invariant measures; for an irrational rotation, it turns out to be the only invariant measure. We say that the doubling map exhibits hyperbolic behaviour, while the irrational rotation exhibits elliptic behaviour.

Systems with hyperbolicity have many invariant measures, as we saw in a series of previous posts. The goal of this post is to recall a proof that the opposite situation is true for an irrational rotation, and that in particular every orbit equidistributes with respect to Lebesgue measure; then we consider orbits generated by random rotations, where instead of rotating by a fixed angle {\theta}, we rotate by either {\theta_1} or {\theta_2}, with the choice of which to use being made at each time step by flipping a coin.

1. Invariant measures via {C(X)}

First we recall some basic facts from ergodic theory for topological dynamical systems. Given a compact metric space {X} and a continuous map {f\colon X\rightarrow X}, let {\mathcal{M}} denote the space of Borel probability measures on {X}. Writing {C(X)} for the space of all continuous functions {X\rightarrow {\mathbb C}}, recall that {C(X)^*} is the space of all continuous linear functionals {C(X)\rightarrow {\mathbb C}}. Then {C(X)^*} is (isomorphic to) the space of finite complex Borel measures on {X}. (This last assertion uses the fact that {X} is a compact metric space and combines various results from “Linear Operators” by Dunford and Schwartz, but a more precise reference will have to wait until I have the book available to look at.)

Using this fact together with the polar decomposition for finite complex Borel measures, we have the following: for every {L\in C(X)^*}, there is {\mu\in \mathcal{M}} and a measurable function {\theta\colon X\rightarrow {\mathbb R}} such that

\displaystyle  L(\varphi) = \|L\| \int \varphi e^{i\theta} \,d\mu \text{ for all } \varphi\in C(X). \ \ \ \ \ (1)

Note that although {C(X)^*} is endowed with the operator norm, we will usually think of it as a topological vector space with the weak* topology. Thus {\mathcal{M}} embeds naturally into {C(X)^*}, and (1) shows that every element of {C(X)^*} can be described in a canonical way in terms of {\mathcal{M}}.

Let {P\subset C(X)} be a countable set whose span is dense in {C(X)}. Then to every {L\in C(X)^*} we can associate the sequence {\Phi_P(L) = \{ L(p) \mid p\in P\} \subset {\mathbb C}^P}, where depending on the context we may index using either {{\mathbb N}} or {{\mathbb Z}}. If {P} is bounded then {\Phi_P(L)\in \ell^\infty} for every {L\in C(X)^*}, and so such a {P} defines a linear map {\Phi_P\colon C(X)^*\rightarrow \ell^\infty}.

Because {L} is determined by the values {L(p)} (by linearity and continuity of {L}, and density of the span of {P}), the map {\Phi} is 1-1. In particular, it is an isomorphism onto its image, which we denote by

\displaystyle  V_P := \Phi_P(C(X)^*) \subset \ell^\infty

Note that {V_P \neq \ell^\infty} because {C(X)^*} is separable and {\ell^\infty} is not.

It is straightforward to see that {\Phi_P} is continuous, and its inverse is also continuous on {V_P}. Thus we can translate questions about {C(X)^*}, and in particular about {\mathcal{M}}, into questions about {\ell^\infty}.

Remark 1 It is a nontrivial problem to determine which elements of {\ell^\infty} correspond to elements of {C(X)^*}, and also to determine which of those sequences correspond to actual measures (elements of {\mathcal{M}}). We will not need to address either of these problems here.

The action {f\colon X\rightarrow X} induces an action on {C(X)} by {\varphi\mapsto \varphi\circ f}, and hence on {C(X)^*} by duality. This action {f_*\colon C(X)^*\rightarrow C(X)^*} is given by

\displaystyle  (f_*L)(\varphi) = L(\varphi\circ f). \ \ \ \ \ (2)

In particular, {f} also induces an action {f_*\colon \mathcal{M}\rightarrow\mathcal{M}} by

\displaystyle  \int\varphi\,d(f_*\mu) = \int (\varphi \circ f)\,d\mu. \ \ \ \ \ (3)

A measure {\mu} is {f}-invariant iff it is a fixed point of {f_*}. Let {f_P} be the action induced by {f} on {V_P\subset \ell^\infty}; that is,

\displaystyle  f_P(\Phi_P(\mu)) = \Phi_P(f_*\mu). \ \ \ \ \ (4)

If {P} can be chosen so that {f_P} takes a particularly nice form, then this can be used to understand what invariant measures {f} has, and how empirical measures converge.

Let us say more clearly what is meant by convergence of empirical measures. Given {x\in X} and {n\in {\mathbb N}}, let {\mathcal{E}_n(x) = \frac 1n \sum_{k=0}^{n-1} \delta_{f^kx}} be the empirical measure along the orbit segment {x,f(x),\dots,f^{n-1}(x)}. Let {V(x)\subset \mathcal{M}} be the set of weak* accumulation points of the sequence {\mathcal{E}_n(x)}. By compactness of {\mathcal{M}}, the set {V(x)} is non-empty, and it is a standard exercise to show that every measure in {V(x)} is {f}-invariant.

In particular, if {(X,f)} is uniquely ergodic, then it only has one invariant measure {\mu}, and so {V(x) = \{\mu\}} for every {x\in X}. In this case we have {\mathcal{E}_n(x) \rightarrow \mu} for every {x}, and it is reasonable to ask how quickly this convergence occurs.

2. Irrational rotations

Now we specialise to the case of an irrational rotation. Let {X=S^1\subset {\mathbb C}} be the unit circle, fix {\theta\in {\mathbb R}} irrational, and let {f\colon X\rightarrow X} be given by {f(z) = e^{2\pi i\theta}z}. We will show that {\mu\in \mathcal{M}} is {f}-invariant iff it is Lebesgue measure, and then examine what happens in a broader setting.

Given {n\in {\mathbb Z}}, let {p_n(z) = z^n}, and let {P = \{p_n \mid n\in {\mathbb Z}\}}. Then the span of {P} contains all functions {S^1\rightarrow {\mathbb C}} that are polynomials in {z} and {\bar{z}}, thus it is a subalgebra of {C(S^1)} that contains the constant functions, separates points, and is closed under complex conjugation. By the Stone–Weierstrass theorem, this span is dense in {C(S^1)}, and since {P} is bounded the discussion from above gives an isomorphism {\Phi\colon C(S^1)^* \rightarrow V\subset \ell^\infty}, where we suppress {P} in the notation. This isomorphism is given by

\displaystyle  \Phi(L)_n = L(p_n) \ \ \ \ \ (5)

for a general {L\in C(S^1)^*}, and for {\mu\in \mathcal{M}} we write

\displaystyle  \Phi(\mu)_n = \int_{S^1} z^n \,d\mu(z). \ \ \ \ \ (6)

The sequence {\Phi(\mu)} is the sequence of Fourier coefficients associated to the measure {\mu}. The choice of {p_n} means that the action {f} induces on {V\subset \ell^\infty} takes a simple form: the Fourier coefficients of {L} and {f_*L} are related by

\displaystyle  \Phi(f_*L)_n = (f_*L)(z^n) = L((f(z))^n) = L\left(e^{2\pi i \theta n} z^n\right) = e^{2\pi i \theta n} \Phi(L)_n.

Thus if {\mu} is invariant, we have {\Phi(\mu)_n = e^{2\pi i \theta n} \Phi(\mu)_n} for all {n=0,1,2,\dots}. Because {\theta} is irrational, we have {e^{2\pi i\theta n} \neq 1} for all {n\neq 0}, and so {\Phi(\mu)_n=0}. Thus the only non-zero Fourier coefficient is {\Phi(\mu)_0 = \int {\mathbf{1}} \,d\mu(z) = 1}. Because {\Phi} is an isomorphism between {C(S^1)^*} and {V\subset \ell^\infty}, this shows that the only {L\in C(S^1)^*} with {f_*L=L} is Lebesgue measure. In particular, {f} is uniquely ergodic, with Lebesgue as the only invariant measure, and thus for any {z\in S^1}, the empirical measures {\mathcal{E}_n(z)} converge to Lebesgue.

3. Random rotations

Consider a sequence of points {z_1,z_2,z_3,\dots\in S^1}, and let {m_n\in \mathcal{M}} be the average of the point masses on the first {n} points of the sequence:

\displaystyle  m_n = \frac 1n \sum_{k=1}^n \delta_{z_k}, \qquad\qquad m_n(\varphi) = \frac 1n \sum_{k=1}^n \varphi(z_k). \ \ \ \ \ (7)

We say that the sequence {z_n} equidistributes if {m_n} converges to Lebesgue on {S^1} in the weak* topology.

The previous sections showed that if the points of the sequence are related by {z_{n+1} = e^{2\pi i \theta} z_n}, where {\theta} is irrational, then the sequence equidistributes. A natural generalisation is to ask what happens when the points {z_n} are related not by a fixed rotation, but by a randomly chosen rotation.

Here is one way of making this precise. Let {\Omega = \{1,2\}^{\mathbb N}} be the set of infinite sequences of 1s and 2s, and let {\mu} be the {\left(\frac 12, \frac 12\right)}-Bernoulli measure on {\Omega}, so that all sequences of length {n} are equally likely. Fix real numbers {\theta_1} and {\theta_2}, and fix {z_1\in S^1}. Given {\omega\in \Omega}, consider the sequence {z_n(\omega)} given by

\displaystyle  z_{n+1}(\omega) = e^{2\pi i \theta_{\omega_n}} z_n(\omega). \ \ \ \ \ (8)

Then one may ask whether or not {z_n(\omega)} equidistributes almost surely (that is, with probability 1 w.r.t. {\mu}). The remainder of this post will be dedicated to proving the following result.

Theorem 1 If either of {\theta_1} or {\theta_2} is irrational, then {z_n(\omega)} equidistributes almost surely.

Remark 2 The proof given here follows a paper by Lagarias and Soundararajan, to which I was referred by Lucia on MathOverflow.

Using Fourier coefficients as in the previous section, we have that {z_n(\omega)} equidistributes iff all the non-constant Fourier coefficients of {m_n(\omega)} converge to zero — that is, iff {\Phi(z_n(\omega))_k \rightarrow 0} as {n\rightarrow\infty} for all {k\neq 0}. This is Weyl’s criterion for equidistribution.

Fix a value of {k\neq 0}, which will be suppressed in the notation from now on. Write {a_n} for the absolute value of the {k}th Fourier coefficient of {m_n(\omega)}, and note that

\displaystyle  a_n := |\Phi(z_n(\omega))_k| = |m_n(\omega)(z^k)| = \frac 1n \left|\sum_{j=1}^n z_j(\omega)^k\right|. \ \ \ \ \ (9)

The outline of the proof is as follows.

  1. Show that there is a constant {C} such that the expected value of {a_n} is at most {C/n}.
  2. Given {\delta>0}, show that there is a constant {C'} such that the probability that {a_n} exceeds {\delta} is at most {C'/n}.
  3. Find an exponentially increasing sequence {n_j\rightarrow\infty} such that if {a_{n_j}\leq \delta}, then {a_n\leq 2\delta} for every {n\in [n_j,n_{j+1}]}.
  4. Use the Borel–Cantelli lemma to deduce that with probability 1, {a_{n_j}} exceeds {\delta} only finitely often, hence {a_n} exceeds {2\delta} only finitely often. Since {\delta>0} was arbitrary this shows that {a_n\rightarrow 0}.

Step 1

Given {\xi\in {\mathbb R}}, let {\|\xi\|} denote the distance between {\xi} and the nearest integer.

Lemma 2

\displaystyle  \mathop{\mathbb E}_\mu[a_n^2] \leq \left( 1 + \frac 1{\|k\theta_1\|^2 + \|k\theta_2\|^2}\right) \frac 1n \ \ \ \ \ (10)

Proof: Let {y_1\in {\mathbb R}} be such that {z_1 = e^{2\pi i y_1}}, and define {y_n} recursively by {y_{n+1} = y_n + \theta_{\omega_n}}, so that {z_n = e^{2\pi i y_n}}.

\displaystyle  \begin{aligned} a_n^2 &= \frac 1{n^2} \left\lvert \sum_{j=1}^n z_j^k \right\rvert^2 = \frac 1{n^2} \left(\sum_{\ell=1}^n z_\ell^k\right) \left(\sum_{j=1}^n \overline{z_j^k}\right) \\ &= \frac 1{n^2} \left( n + \sum_{\ell\neq j} z_\ell^k z_j^{-k} \right) = \frac 1{n^2} \left( n + \sum_{\ell\neq j} e^{2\pi i k(y_\ell - y_j)} \right). \end{aligned}

Using the fact that {z+\bar{z} = 2\Re(z)}, we have

\displaystyle  \sum_{\ell \neq j} e^{2\pi i k(y_\ell - y_j)} = 2\Re \sum_{1\leq \ell < j \leq n} e^{2\pi i k(y_\ell - y_j)}.

If {\ell - j = r}, then

\displaystyle  \begin{aligned} \mathop{\mathbb E}_\mu[e^{2\pi i k (y_\ell - y_j)}] &= \frac 1{2^r} \sum_{\omega_\ell,\dots, \omega_{j-1}} e^{2\pi i k \sum_{m=\ell}^{j-1} \theta_{\omega_{m}}} = \frac 1{2^r} \sum \prod_{m=\ell}^{j-1} e^{2\pi i k \theta_{\omega_m}} \\ &= \left(\frac{e^{2\pi i k\theta_1} + e^{2\pi i k\theta_2}}{2}\right)^r, \end{aligned}

where the sums are over all {\omega_m\in \{1,2\}} for {\ell\leq m<j}. Since there are {n-r} values of {\ell,j} with {\ell - j = r}, we have

\displaystyle  \mathop{\mathbb E}_\mu[a_n^2] = \frac 1{n^2}\left(n + 2\Re \sum_{r=1}^n (n-r)z^r\right), \ \ \ \ \ (11)

where {z=\frac 12(e^{2\pi i k\theta_1} + e^{2\pi i k\theta_2})}. Now

\displaystyle  \begin{aligned} \sum_{r=1}^n (n-r) z^r &= \sum_{r=1}^{n-1} z_r + \sum_{r=1}^{n-2} z^r + \cdots + \sum_{r=1}^1 z^r \\ &= \frac z{1-z}\sum_{s=1}^{n-1} (1-z^s) = \frac z{1-z} \left( n-\sum_{s=0}^{n-1} z^s \right). \end{aligned}

and since {|z|\leq 1} we have

\displaystyle  \left\lvert \sum_{r=1}^n (n-r) z^r \right\rvert \leq \frac{2n}{|1-z|}.

Together with (11), this gives

\displaystyle  \mathop{\mathbb E}_\mu[a_n^2] \leq \frac 1n \left( 1 + \frac 4{|1-z|}\right) = \frac 1n\left( 1 + \frac 8{|2 - e^{2\pi i k \theta_1} - e^{2\pi i k \theta_2}|} \right).

Using the fact that {|w|\geq |\Re w|} and that {\Re(1-e^{2\xi i}) \geq 1-\cos(2\xi) = 2\sin^2\xi}, we have

\displaystyle  \begin{aligned} |2 - e^{2\pi ik\theta_1} - e^{2\pi ik\theta_2}| &\geq 2(\sin^2 \pi k\theta_1 + \sin^2 \pi k\theta_2), \end{aligned}

and so

\displaystyle  \mathop{\mathbb E}_\mu[a_n^2] \leq \frac 1n\left(1 + \frac 4{\sin^2\pi k\theta_1 + \sin^2\pi k\theta_2}\right)

Finally, for {|\xi|\leq \frac 12} we have {|\sin \pi\xi| \geq |2\xi|}, which proves the bound in (10). \Box

Because one of {\theta_1,\theta_2} is irrational, the denominator in (10) is positive, and so {C := 1 + (\|k\theta_1\|^2 + \|k\theta_2\|^2)^{-1} < \infty}, which completes Step 1.

Step 2

Given {\delta>0}, we have

\displaystyle  \mathop{\mathbb E}[a_n^2] \geq \delta^2 \mathop{\mathbb P}(a_n\geq \delta),

and so by Lemma 2 we have

\displaystyle  \mathop{\mathbb P}(a_n\geq \delta) < \frac C{\delta^2 n}.

Putting {C' = C\delta^{-2}} completes step 2.

Step 3

Given {m\leq n\in {\mathbb N}}, we have

\displaystyle  na_n \leq ma_m + \left|\sum_{\ell=m}^{n-1} z_\ell^k\right| \leq ma_m + n-m,

and so

\displaystyle  a_n \leq \frac{n-m}n + \frac mn a_m.

In particular, if {n\leq (1+\delta)m} for some {\delta>0}, we have

\displaystyle  a_n \leq \delta + a_m. \ \ \ \ \ (12)

Let {n_j} be such that

\displaystyle  1+\frac \delta 2 \leq \frac{n_{j+1}}{n_j} \leq 1+\delta \ \ \ \ \ (13)

for all {j}. If {a_{n_j}\leq \delta}, then (12) implies that {a_n\leq 2\delta} for all {n\in [n_j, n_{j+1}]}.

Step 4

Let {E_j} be the event that {a_{n_j} \geq \delta}. By Part 2, we have {\mathop{\mathbb P}(E_j) \leq C'/n_j}, and because {n_j} increases exponentially in {j} by (13), we have {\sum_{j=1}^\infty \mathop{\mathbb P}(E_j) < \infty}. By the Borel–Cantelli lemma, this implies that with probability 1, there are only finitely many values of {j} for which {a_{n_j}\geq \delta}.

By the previous part, this in turn implies that there are only finitely many values of {n} for which {a_n\geq 2\delta}. In particular, {\varlimsup a_n \leq 2\delta}, and since {\delta>0} was arbitrary, we have {a_n\rightarrow 0}. Thus the sequence {z_n} satisfies Weyl’s criterion almost surely, which completes the proof of Theorem 1.

Posted in ergodic theory, examples, random dynamics, statistical laws | Tagged | Leave a comment

Fubini foiled

An important issue in hyperbolic dynamics is that of absolute continuity. Suppose some neighbourhood {U} of a smooth manifold {M} is foliated by a collection of smooth submanifolds {\{W_\alpha \mid \alpha\in A\}}, where {A} is some indexing set. (Here “smooth” may mean {C^1}, or {C^2}, or even more regularity depending on the context.)

Fixing a Riemannian metric on {M} gives a notion of volume {m} on the manifold {M}, as well as a notion of “leaf volume” {m_\alpha} on each {W_\alpha}, which is just the volume form coming from the induced metric. Then one wants to understand the relationship between {m} and {m_\alpha}.

The simplest example of this comes when {U=(0,1)^2} is the (open) unit square and {\{W_\alpha\}} is the foliation into horizontal lines, so {I=(0,1)} and {W_\alpha = (0,1)\times \{\alpha\}}. Then Fubini’s theorem says that if {E} is any measurable set, then {m(E)} can be found by integrating the leaf volumes {m_\alpha(E)}. In particular, if {m(E)=0}, then {m_\alpha(E)=0} for almost every {\alpha}, and conversely, if {E} has full measure, then {m_\alpha(E)=1} for almost every {\alpha}.

For the sake of simplicity, let us continue to think of a foliation of the two-dimensional unit square by one-dimensional curves, and assume that these curves are graphs of smooth functions {\phi_\alpha\colon (0,1)\rightarrow (0,1)}. The story in other dimensions is similar.

Writing {\Phi(x,y) = \phi_y(x)} gives a map {\Phi\colon (0,1)^2 \rightarrow (0,1)^2}, and we see that our foliation is the image under {\Phi} of the foliation by horizontal lines. Now we must make a very important point about regularity of foliations. The leaves of the foliation are smooth, and so {\Phi} depends smoothly on {x}. However, no assumption has been made so far on the transverse direction — that is, dependence on {y}. In particular, we cannot assume that {\Phi} depends smoothly on {y}.

If {\Phi} depends smoothly on both {x} and {y}, then we have a smooth foliation, not just smooth leaves, and in this case basic results from calculus show that a version of Fubini’s theorem holds:

\displaystyle  m(E) = \int_A \int_{W_\alpha} \rho_\alpha(z) {\mathbf{1}}_E(z) \,dm_\alpha(z)\,d\alpha,

where the density functions {\rho_\alpha\in L^1(W_\alpha,m_\alpha)} can be determined in terms of the derivative of {\Phi}.

It turns out that when the foliation is by local stable and unstable manifolds of a hyperbolic map, smoothness of the foliation is too much to ask for, but it is still possible to find density functions {\rho_\alpha} such that Fubini’s theorem holds, and the relationship between {m}-null sets and {m_\alpha}-null sets is as expected.

However, there are foliations where the leaves are smooth but the foliation as a whole does not even have the absolute continuity properties described in the previous paragraph. This includes a number of dynamically important examples, namely intermediate foliations (weak stable and unstable, centre directions in partial hyperbolicity) for certain systems. These take some time to set up and study. There is an elementary non-dynamical example first described by Katok; a similar example was described by Milnor (Math. Intelligencer 19(2), 1997), and it is this construction which I want to outline here, since it is elegant, does not require much machinery to understand, and neatly illustrates the possibility that absolute continuity may fail.

The example consists of two constructions: a set {E} and a foliation of the square by a collection of smooth curves {W_\alpha}. These are built in such a way that {E} has full Lebesgue measure ({m(E)=1}) but intersects each curve {W_\alpha} in at most a single point, so that in particular {m_\alpha(E)=0} for all {\alpha}.

1. The set

Given {p\in (0,1)}, consider the piecewise linear map {f_p\colon (0,1)\rightarrow (0,1]} given by

\displaystyle  f_p(y) = \begin{cases} \frac yp & y\leq p \\ \frac {y-p}{1-p} &y >p \end{cases}

It is not hard to see that {f_p} preserves Lebesgue measure {dy} and that {(f_p,dx)} is measure-theoretically conjugate to {(\sigma,\mu_p)}, where {\sigma} is the shift map on {\Sigma=\{0,1\}^{\mathbb N}} and {\mu_p} is the {(p,1-p)}-Bernoulli measure. The conjugacy {\pi_p\colon (0,1) \rightarrow \Sigma} is given by coding trajectories according to whether the {n}th iterate falls into {I_0 = (0,p]} or {I_1=(p,1)}.

Given {y}, let {\beta_n(y)} be the number of times that {y,f_p(y),\dots,f_p^{n-1}(y)} lands in {I_1}. Then by the strong law of large numbers, Lebesgue almost every {y\in (0,1)} has {\frac 1n \beta_n(y) \rightarrow 1-p}. Let {E} be the set of all pairs {(p,y)} such that this is true; then {m(E) = 1} by Fubini’s theorem (applied to the foliation of the unit square into vertical lines). It only remains to construct a foliation that intersects {E} in a “strange” way.

2. The foliation

The curves {W_\alpha} will be chosen so that two points {(p,y)} and {(p',y')} lie on the same {W_\alpha} if and only if they are coded by the same sequence in {\Sigma} — that is, if and only if {\pi_p(y) = \pi_{p'}(y')}. Then because the limit of {\frac 1n\beta_n} must be constant on {W_\alpha} if it exists, we deduce that each {W_\alpha} intersects the set {E} at most once.

It remains to see that this condition defines smooth curves. Given {\alpha\in (0,1)}, let {a=\pi_{1/2}(\alpha)=a_1a_2\cdots} be the binary expansion of {\alpha}, so that {\alpha = \sum a_n 2^{-n}}. Now let {W_\alpha = \{(p,y) \mid \pi_p(y)=a\}}.

Fix {p\in (0,1)}, let {y_1=y = \phi_\alpha(p)} be such that {(p,y)\in W_\alpha}, and let {y_{n+1}=f_p(y_n)}, so that {y_n\in I_{a_n}} for all {n}. Thus

\displaystyle  y_{n+1} = \begin{cases} \frac {y_n}{p} & a_n=0 \\ \frac{y_n-p}{1-p} & a_n=1 \end{cases}

For convenience of notation, write {p(0)=p} and {p(1)=1-p}. Then the relationship between {y_{n+1}} and {y_n} can be written as

\displaystyle  y_{n+1} = \frac{y_n - a_n p(0)}{p(a_n)} \qquad\Rightarrow\qquad y_n = p(0)a_n + p(a_n)y_{n+1}.

This can be used to write {y=\phi_\alpha(p)} in terms of the sequence {a_n}. Indeed,

\displaystyle  \begin{aligned} \phi_\alpha(p) &=y = y_1 = p(0)a_1 + p(a_1)y_2 \\ &= p(0)a_1 + p(a_1)\big(p(0) a_2 + p(a_2)y_3\big) \\ &= p(0)\big(a_1 + p(a_1)a_2\big) + p(a_1)p(a_2)y_3 \\ &= p(0)\big(a_1 + p(a_1)a_2 + p(a_1)p(a_2)a_3\big) + p(a_1)p(a_2)p(a_3)y_4, \end{aligned}

and so on, so that writing {\psi_n(p) = p(a_1)p(a_2)\cdots p(a_{n-1})}, one has

\displaystyle  \phi_\alpha(p) = p(0) \sum_{n=1}^\infty \psi_n(p) a_n.

The summands are analytic functions of {p}, and the sum converges uniformly on each interval {[\epsilon,1-\epsilon]}, since for {p\in [\epsilon, 1-\epsilon]} we have {\psi_n(p) \leq (1-\epsilon)^n}. In fact, this uniform convergence extends to complex values of {p} in the disc of radius {\frac 12 - \epsilon} centred at {\frac 12}, and so by the Weierstrass uniform convergence theorem, the function {\phi_\alpha} is analytic in {p}.

As discussed above, {W_\alpha} is the graph of the analytic function {\phi_\alpha}, and each {W_\alpha} intersects {E} at most once, despite the fact that {E} has Lebesgue measure 1 in the unit square. This demonstrates the failure of absolute continuity.

Posted in ergodic theory, examples, smooth dynamics | Tagged | 8 Comments

Sharkovsky’s theorem

This post is based on a talk given by Keith Burns in the UH dynamical systems seminar yesterday, in which he presented a streamlined proof of Sharkovsky’s theorem due to him and Boris Hasselblatt.

1. Background and statement of the theorem

We recall the statement of Sharkovsky’s theorem. The setting is a continuous map {f\colon {\mathbb R}\rightarrow {\mathbb R}}, and one is interested in what dynamical properties {f} must necessarily have if we know that {f} has a periodic orbit of some given least period {p} — for example, does existence of such an orbit force the existence of periodic points with least period {q} for various values of {q}? Or does it force the map to have positive entropy?

The assumptions of continuity and one-dimensionality are vital: for discontinuous one-dimensional maps, or for continuous higher-dimensional maps, it is possible to have a point of least period {p} without having points of any other period or having positive entropy.

In one dimension, on the other hand, one can begin with the following simple observation: if {f} has a periodic point with least period 2, then there is {p} such that {f(p) > p=f^2(p)}, and so applying the intermediate value theorem to {f(x)-x}, the map must have a fixed point.

What if {f} has a periodic point {p} with least period 3? Suppose that {p<f(p)<f^2(p)} (the other configurations are similar), and let {I_1=[p,f(p)]} and {I_2=[f(p),f^2(p)]}. Once again using the intermediate value theorem, we see that {f(I_1) \supset I_2} and {f(I_2) \supset I_1 \cup I_2}. Thus we may consider the topological Markov chain {(X,\sigma)} given by the graph in Figure 1, and observe that it embeds into the dynamics of {f}, so that in particular, {f} has periodic points of every order.

Fig 1

Sharkovsky’s theorem generalises the above observations. Consider the following unorthodox total order on the integers:

\displaystyle  \begin{aligned} 3\prec 5&\prec 7\prec \cdots \\ 6\prec 10&\prec 14\prec \cdots \\ 12\prec 20&\prec 28\prec \cdots \\ &\vdots \\ \cdots \prec 16 &\prec 8 \prec 4 \prec 2 \prec 1 \end{aligned}

Then if {f} has a periodic point of least period {p}, it has a point of least period {q} for every {q\succ p}.

The paper by Burns and Hasselblatt includes a discussion of the history of the theorem and the various proofs that have been given. Here we outline the most streamlined proof available at the present time.

2. Sketch of the proof

Given two intervals {I,J\subset {\mathbb R}}, write {I\rightarrow J} if {f(I)\supset J}. As in the case of a period 3 orbit above, the key will be to construct a collection of intervals {I_j} (whose endpoints are certain points of the period {p} orbit) such that the covering relations {I_j\rightarrow I_k} yield a directed graph with periodic paths of all the desired periods.

To deduce existence of various periodic orbits for {f} from these covering relations, one makes the following observations.

  1. If {f^p(I)\supset I}, then {I} contains a fixed point of {f^p}.
  2. If {I_0 \rightarrow I_1 \rightarrow \cdots \rightarrow I_k}, then {f^k(I_0)\supset I_0}.
  3. The above two observations guarantee existence of {x\in I_0} with {f^k(x)=x}. To guarantee that {x} can be chosen with least period {k}, it suffices to know that some {I_j} to have disjoint interior from all the others, and that the endpoints of this {I_j} do not follow the itinerary given by the sequence of covering relations above.

Detailed proofs of the above observations are in the Burns-Hasselblatt paper. With these tools in hand, the proof comes down to constructing the directed graph of covering relations referred to above.

Fig 2

To this end, consider a periodic orbit {P} with {p} points — the green points in Figure 2. These are permuted by {f}; some are moved to the left, some are moved to the right. Of all the points in the orbit that are moved to the right, let {z} be the rightmost, so that all points in the orbit to the right of {z} are moved to the left. Let {z'} be the next point to the right of {z}, and let {\Delta=[z,z']} as in the figure.

Now consider the intervals {I_k = f^k(\Delta)}. Note that {I_0} has the following property: every point in {P} lying to the left of {\Delta} moves to the right of {\Delta}, and vice versa. Let {I_m} be the first interval that does not have this property, and let {x\in I_m\cap P} be a point that does not switch sides of {\Delta} under the action of {f}. For concreteness we consider the case where {x} is to the left of {\Delta}, as in Figure 2.

By construction, all the points between {x} and {\Delta} are mapped to the right of {\Delta}. Of these, let {y} be the point that is mapped furthest to the right.

Write {R_y} for the interval {[y,\infty)}, and consider the intervals {J_k = I_k \cap R_y} for {0\leq k < m}, together with {J_m = [x,y]}. We claim that {J_0\rightarrow J_0}, {J_m\rightarrow J_0}, and {J_k \rightarrow J_{k+1}} for each {0\leq k<m}, so that we have the covering relations shown in Figure 3, which can be used to establish existence of periodic points with all least periods greater than {m}.

Fig 3

So why do the intervals {J_k} have the covering relations shown?

  1. The fact that {J_0\rightarrow J_0} follows from the fact that {J_0=I_0=\Delta} and the endpoints of {\Delta} switch sides of {\Delta}.
  2. The definition of {I_k} gives {I_k\rightarrow I_{k+1}}.
  3. Writing {I_k^R} for the part of {J_k} lying in or to the right of {\Delta} and {I_k^L} for the part lying in or to the left, we see that {I_k^R\rightarrow I_{k+1}^L} and {I_k^L\rightarrow I_{k+1}^R}; this is because for {0\leq k < m}, all points of {P\cap I_k} switch sides of {\Delta} under the action of {f}. (This is the definition of {m}.)
  4. Using similar notation for {J_k^L} and {J_k^R}, we see that {I_k^R = J_k^R} and {J_k^L \rightarrow I_k^R}; this is because {y} was chosen to be the point that maps furthest to the right.
  5. This establishes that {J_k^L \rightarrow J_k^R}, and similarly {J_k^R \rightarrow J_k^L}.

For {k=m} we see that {f(x)} lies to the left of {\Delta}, and {f(y)} to the right, so that {f(J_m)\supset \Delta=I_0}.

The extra piece of information in the final item of the list above shows that the graph in Figure 3 can be replaced by the one in Figure 4.

Fig 4

Let us sum up what we know. One of the following two cases holds. Case 1: There is some point {x} of the orbit such that {x} and {f(x)} both lie on the same side of {\Delta}. Case 2: Every point of the orbit is mapped to the other side of {\Delta} by {f}.

In Case 1, the arguments above show that for every {q>p} and every even {q<p}, there is an orbit of least period {q}, which is at least as strong as the conclusion in Sharkovsky’s theorem.

In Case 2, one may observe that since {f} permutes the elements of {P} and every element is moved from one side of {\Delta} to the other, {p} must be even. Moreover, by considering {P^L} and {P^R} separately, as invariant sets for {f^2}, we can repeat the argument inductively to complete the proof of the theorem.

Posted in theorems | Tagged | Leave a comment

Central Limit Theorem for dynamical systems using martingales

This post is based on notes from Matt Nicol’s talk at the UH summer school in dynamical systems. The goal is to present the ideas behind a proof of the central limit theorem for dynamical systems using martingale approximations.

1. Conditional expectation

Before we can define and use martingales, we must recall the definition of conditional expectation. Let {(\Omega,\mathop{\mathbb P})} be a probability space, with {\mathop{\mathbb P}} defined on a {\sigma}-algebra {\mathop{\mathcal B}}. Let {\mathop{\mathcal F}\subset \mathop{\mathcal B}} be a sub-{\sigma}-algebra of {\mathop{\mathcal B}}.

Example 1 Consider the doubling map {T\colon [0,1]\rightarrow [0,1]} given by {T(x) = 2x\pmod 1}. Let {\mathop{\mathbb P}} be Lebesgue measure, {\mathop{\mathcal B}} the Borel {\sigma}-algebra, and {\mathop{\mathcal F} = T^{-1}\mathop{\mathcal B} = \{T^{-1}(B) \mid B\in \mathop{\mathcal B}\}}. Then {\mathop{\mathcal F}} is a sub-{\sigma}-algebra of {\mathop{\mathcal B}}, consisting of precisely those sets in {\mathop{\mathcal B}} which are unions of preimage sets {T^{-1}(x)} — that is, those sets {F\in\mathop{\mathcal B}} for which a point {y\in [0,\frac 12]} is in {F} if and only if {y+\frac 12\in F}.

This example extends naturally to yield a decreasing sequence of {\sigma}-algebras

\displaystyle  \mathop{\mathcal B} \supset T^{-1}\mathop{\mathcal B} \supset T^{-2}\mathop{\mathcal B} \supset \cdots.

Given a sub-{\sigma}-algebra {\mathop{\mathcal F}\subset \mathop{\mathcal B}} and a random variable {Y\colon \Omega\rightarrow {\mathbb R}} that is measurable with respect to {\mathop{\mathcal B}}, the conditional expectation of {Y} given {\mathop{\mathcal F}} is any random variable {Z} such that

  1. {Z} if {\mathop{\mathcal F}}-measurable (that is, {Z^{-1}(I)\in\mathop{\mathcal F}} for every interval {I\subset {\mathbb R}}), and
  2. {\int_A Z\,d\mathop{\mathbb P} = \int_A Y\,d\mathop{\mathbb P}} for every {A\in \mathop{\mathcal F}}.

It is not hard to show that these conditions characterise {Z} for an almost-sure choice of {\omega\in\Omega}, and so the conditional expectation is uniquely defined as a random variable. We write it as {\mathop{\mathbb E}[Y\mid\mathop{\mathcal F}]}.

A key property is that conditional expectation is linear: for every {\mathop{\mathcal F}\subset \mathop{\mathcal B}}, every {Y_1,Y_2\colon \Omega\rightarrow{\mathbb R}}, and every {a\in {\mathbb R}}, we have

\displaystyle  \mathop{\mathbb E}[aY_1 + Y_2\mid \mathop{\mathcal F}] = a\mathop{\mathbb E}[Y_1\mid\mathop{\mathcal F}] + \mathop{\mathbb E}[Y_2\mid\mathop{\mathcal F}].

Example 2 If {Y} is already {\mathop{\mathcal F}}-measurable, then {\mathop{\mathbb E}[Y\mid\mathop{\mathcal F}]=Y}.

Example 3 At the other extreme, if {Y} and {\mathop{\mathcal F}} are independent — that is, if {\mathop{\mathbb E}[Y{\mathbf{1}}_A]=\mathop{\mathbb E}[Y]\mathop{\mathbb P}[A]} for every {A\in\mathop{\mathcal F}} — then {\mathop{\mathbb E}[Y\mid\mathop{\mathcal F}]} is the constant function {\mathop{\mathbb E}[Y]{\mathbf{1}}}.

Example 4 Suppose {\{\Omega_i\mid i\in{\mathbb N}\}} is a countable partition of {\Omega} such that {\mathop{\mathbb P}[\Omega_i]>0} for every {i}. Let {\mathop{\mathcal F}=\sigma(\Omega_1,\Omega_2,\dots)} be the {\sigma}-algebra generated by the sets {\Omega_i}. Then

\displaystyle  \mathop{\mathbb E}[Y|\mathop{\mathcal F}] = \sum_{i\in{\mathbb N}} \frac{\mathop{\mathbb E}[Y{\mathbf{1}}_{\Omega_i}]}{\mathop{\mathbb P}[\Omega_i]} {\mathbf{1}}_{\Omega_i}.

2. Martingales

Now we can define martingales, which are a particular sort of stochastic process (sequence of random variables) with “enough independence” to generalise results from the IID case.

Definition 1 A sequence {Y_n\colon \Omega\rightarrow {\mathbb R}} of random variables is a martingale if

  1. {\mathop{\mathbb E}[|Y_n|]<\infty} for all {n};
  2. there is an increasing sequence of {\sigma}-algebras (a filtration) {\mathop{\mathcal F}_1\subset\mathop{\mathcal F}_2\subset\cdots \subset \mathop{\mathcal B}} such that {Y_n} is measurable with respect to {\mathop{\mathcal F}_n};
  3. the conditional expectations satisfy {\mathop{\mathbb E}[Y_{n+1}|\mathop{\mathcal F}_n] = Y_n}.

The first condition guarantees that everything is in {L^1}. If {\mathop{\mathcal F}_n} is taken to be the {\sigma}-algebra of events that are determined by the first {n} outcomes of a sequence of experiments, then the second condition states that {Y_n} only depends on those first {n} outcomes, while the third condition requires that if the first {n} outcomes are known, then the expected value of {Y_{n+1} - Y_n} is 0.

Example 5 Let {Y_n} be a sequence of fair coin flips — IID random variables taking the values {\pm1} with equal probability. Let {S_n=Y_1+\cdots+Y_n}. As suggested in the previous paragraph, let {\mathop{\mathcal F}_n=\sigma(Y_1,\dots,Y_n)} be the smallest {\sigma}-algebra with respect to which {Y_1,\dots,Y_n} are all measurable. (The sets in {\mathop{\mathcal F}_n} are precisely those sets in {\mathop{\mathcal B}} which are determined by knowing the values of {Y_1,\dots,Y_n}.)

It is easy to see that {S_n} satisfies the first two properties of a martingale, and for the third, we use linearity of expectation and the definition of {\mathop{\mathcal F}_n} to get

\displaystyle  \mathop{\mathbb E}[S_{n+1}|\mathop{\mathcal F}_n] = \mathop{\mathbb E}[Y_1 + \cdots + Y_n|\mathop{\mathcal F}_n] + \mathop{\mathbb E}[Y_{n+1}|\mathop{\mathcal F}_n] = S_n + 0 = S_n.

When {Y_n} is a sequence of random variables for which {S_n = Y_1+\cdots+Y_n} is a martingale, we say that the sequence {Y_n} is a martingale difference.

In the previous example the martingale property (the third condition) was a direct consequence of the fact that the random variables {Y_n = S_{n+1}-S_n} were IID. However, there are examples where the martingale differences are not IID.

Example 6 Polya’s urn is a stochastic process defined as follows. Consider an urn containing some number of red and blue balls. At each step, a single ball is drawn at random from the urn, and then returned to the urn, along with a new ball that matches the colour of the one drawn. Let {Y_n} be the fraction of the balls that are red after the {n}th iteration of this process.

Clearly the sequence of random variables {Y_n} is neither independent nor identically distributed. However, it is a martingale, as the following computation shows: suppose that at time {n} there are {p} red balls and {q} blue balls in the urn. (This knowledge represents knowing which element of {\mathop{\mathcal F}_n} we are in.) Then at time {n+1}, there will be {p+1} red balls with probability {\frac{p}{p+q}}, and {p} red balls with probability {\frac{q}{p+q}}. Either way, there will be {p+q+1} total balls, and so the expected fraction of red balls is

\displaystyle  \begin{aligned} \mathop{\mathbb E}[Y_{n+1}|\mathop{\mathcal F}_n] &= \frac{p}{p+q}\cdot\frac{p+1}{p+q+1} + \frac{q}{p+q}\cdot \frac{p}{p+q+1} \\ &= \frac{p(p+q+1)}{(p+q)(p+q+1)} = \frac{p}{p+q} = Y_n. \end{aligned}

If we assume that the martingale differences are stationary (that is, identically distributed) and ergodic, then we have the following central limit theorem for martingales, from a 1974 paper of McLeish (we follow some notes by S. Sethuraman for the statement).

Theorem 2 Let {Y_i} be a stationary ergodic sequence such that {\sigma^2 = \mathop{\mathbb E}[Y_i^2]<\infty} and {\mathop{\mathbb E}[Y_{n+1}|\mathop{\mathcal F}_n]=0}, where {\mathop{\mathcal F}_n} is the {\sigma}-algebra generated by {\{Y_i\mid 1\leq i\leq n\}}. Then {S_n = \sum_{i=1}^n Y_i} is a martingale, and {\frac{S_n}{\sigma\sqrt{n}}} converges in distribution to {N(0,1)}.

More sophisticated versions of this result are available, but this simple version will suffice for our needs.

3. Koopman operator and transfer operator

Now we want to apply 2 to a dynamical system {T\colon X\rightarrow X} with an ergodic measure {\mu} by taking {Y_i=\varphi\circ T^i} for some observable {\varphi\colon X\rightarrow{\mathbb R}}.

To carry this out, we consider two operators on {L^2(\mu)}. First we consider the Koopman operator {U\colon \varphi\mapsto \varphi\circ T}. Then we define the transfer operator {\mathop{\mathcal P}} to be its {L^2} adjoint — that is,

\displaystyle  \int(\mathop{\mathcal P}\varphi)\psi\,d\mu = \int\varphi(\psi\circ T)\,d\mu \ \ \ \ \ (1)

for all {\varphi,\psi\in L^2}. The key result for our purposes is that the operators {\mathop{\mathcal P}} and {U} are one-sided inverses of each other.

Proposition 3 Given {\varphi\in L^2}, we have

  1. {\mathop{\mathcal P} U\varphi=\varphi};
  2. {U\mathop{\mathcal P}\varphi=\mathop{\mathbb E}[\varphi|T^{-1}\mathop{\mathcal B}]}, where {\mathop{\mathcal B}} is the {\sigma}-algebra on which {\mu} is defined.

Proof: For the first claim, we see that for all {\psi\in L^2} we have

\displaystyle  \int\psi\cdot (\mathop{\mathcal P} U\varphi)\,d\mu = \int (\psi\circ T)(\varphi\circ T)\,d\mu = \int\psi\varphi\,d\mu,

where the first equality uses the definition of {\mathop{\mathcal P}} and the second uses the fact that {\mu} is invariant. To prove the second claim, we first observe that given an interval {I\subset{\mathbb R}}, we have

\displaystyle  (\mathop{\mathcal P} U\varphi)^{-1}(I) = T^{-1}((\mathop{\mathcal P} \varphi)^{-1}(I)) \in T^{-1}\mathop{\mathcal B},

since {\mathop{\mathcal P}} maps {\mathop{\mathcal B}}-measurable functions to {\mathop{\mathcal B}}-measurable functions. This shows that {\mathop{\mathcal P} U\varphi} is {T^{-1}\mathop{\mathcal B}}-measurable, and it remains to show that

\displaystyle  \int_{T^{-1}B} U\mathop{\mathcal P}\varphi\,d\mu = \int_{T^{-1}B} \varphi\,d\mu \text{ for all }B\in \mathop{\mathcal B}. \ \ \ \ \ (2)

This follows from a similar computation to the one above: given {B\in \mathop{\mathcal B}} we have

\displaystyle  \begin{aligned} \int_{T^{-1}B} U\mathop{\mathcal P}\varphi\,d\mu &= \int ((\mathop{\mathcal P}\varphi)\circ T) \cdot {\mathbf{1}}_{T^{-1}B}\,d\mu = \int((\mathop{\mathcal P}\varphi)\circ T) ({\mathbf{1}}_B \circ T)\,d\mu \\ &= \int(\mathop{\mathcal P}\varphi) {\mathbf{1}}_B\,d\mu = \int\varphi\cdot ({\mathbf{1}}_B\circ T)\,d\mu = \int_{T^{-1}B} \varphi\,d\mu, \end{aligned}

which establishes (2) and completes the proof. \Box

We see from Proposition 3 that a function has zero conditional expectation with respect to {T^{-1}\mathop{\mathcal B}} if and only if it is in the kernel of {\mathop{\mathcal P}}. In particular, if {\mathop{\mathcal P} h = 0} then {\sum_{j=1}^n h\circ T^j} is a martingale; this will be a key tool in the next section.

Example 7 Let {X=[0,1]} and {T} be the doubling map. Let {\mu} be Lebesgue measure. For convenience of notation we consider the {L^2} space of complex-valued functions on {X}; the functions {\varphi_n\colon x\mapsto e^{2\pi inx}} form an orthonormal basis for this space. A simple calculation shows that

\displaystyle  U\varphi_n(x) = \varphi_n(2x) = e^{2\pi in(2x)} = \varphi_{2n}(x),

so {U\colon \varphi_n\rightarrow \varphi_{2n}}. For the transfer operator we obtain {\mathop{\mathcal P}\colon \varphi_{2n}\rightarrow \varphi_n}, while for odd values of {n} we have

\displaystyle  \begin{aligned} \mathop{\mathcal P}\varphi_n(x) &= \frac 12 \left(\varphi_n\left(\frac x2\right) + \varphi_n\left(\frac{1+x}{2}\right)\right) \\ &= \frac 12 \left( e^{\pi i n x} + e^{\pi i n x} e^{\pi i n}\right) = 0. \end{aligned}

4. Martingale approximation and CLT

The machinery of the Koopman and transfer operators from the previous section can be used to apply the martingale central limit theorem to observations of dynamical systems via the technique of martingale approximation, which was introduced by M. Gordin in 1969.

The idea is that if {\mathop{\mathcal P}^n\varphi\rightarrow 0} quickly enough for functions {\varphi} with {\int\varphi\,d\mu=0}, then we can approximate the sequence {\varphi\circ T^n} with a martingale sequence {h\circ T^n}.

More precisely, suppose that the sequence {\|\mathop{\mathcal P}^n\varphi\|_2} is summable; then we can define an {L^2} function {g} by

\displaystyle  g = \sum_{n=1}^\infty \mathop{\mathcal P}^n\varphi. \ \ \ \ \ (3)

Let {h = \varphi + g - g\circ T\in L^2}. We claim that {S_nh = \sum_{j=1}^n h\circ T^j} is a martingale. Indeed,

\displaystyle  \mathop{\mathcal P} h = \mathop{\mathcal P}\varphi + \mathop{\mathcal P} \sum_{n\geq 1} \mathop{\mathcal P}^n\varphi - \mathop{\mathcal P} U \sum_{n\geq 1} \mathop{\mathcal P}^n\varphi,

and since {\mathop{\mathcal P} U} is the identity we see that the last term is just {\sum_n \mathop{\mathcal P}^n \varphi}, so that {\mathop{\mathcal P} h = 0}.

Proposition 3 now implies that {\mathop{\mathbb E}[h|T^{-1}\mathop{\mathcal B}]=0}, and we conclude that {S_nh} is a martingale, so by the martingale CLT {\frac{S_n h}{\sigma\sqrt{n}}} converges in distribution to {N(0,1)}, where {\sigma^2 = \mathop{\mathbb E}[h^2] = \int h^2\,d\mu}.

Now we want to apply this result to obtain information about {\varphi} itself, and in particular about {S_n\varphi = \sum_{j=1}^n \varphi\circ T^j}. We have {\varphi = h + g\circ T - g}, and so

\displaystyle  S_n\varphi = S_n h + \sum_{j=1}^n (g\circ T^{j+1} - g\circ T^j) = S_nh + g\circ T^{n+1} - g.

This yields

\displaystyle  \frac{S_n\varphi}{\sigma\sqrt n} = \frac{S_n h}{\sigma\sqrt n} + \frac{g\circ T^{n+1} - g}{\sigma\sqrt n},

and the last term goes to 0 in probability, which yields the central limit theorem for {S_n\varphi}.

Remark 1 There is a technical problem we have glossed over, which is that the sequence of {\sigma}-algebras {T^{-n}\mathop{\mathcal B}} is decreasing, not increasing as is required by the definition of a martingale. One solution to this is to pass to the natural extension {\hat T} and to consider the functions {h\circ \hat{T}^{-j}} and the {\sigma}-algebras {\hat{\mathop{\mathcal F}}_j = \hat{T}^j\mathop{\mathcal B}}. Another solution is to use reverse martingales, but we do not discuss this here.

Example 8 Let {X=[0,1]} and {T\colon X\rightarrow X} be an intermittent type (Manneville–Pomeau) map given by

\displaystyle  T(x) = \begin{cases} x(1 - (2x)^\alpha) & x\in [0,\frac 12), \\ 2x-1 & x\in [\frac 12,1], \end{cases}

where {0<\alpha<1} is a fixed parameter. It can be shown that {T} has a unique absolutely continuous invariant probability measure {\mu}, and that the transfer operator {\mathop{\mathcal P}} has the following contraction property: for every {\varphi\in L^2} with {\int \varphi\,d\mu=0}, there is {C\in{\mathbb R}} such that {\|\mathop{\mathcal P}^n\varphi\|_2 \leq Cn^{-\gamma}}, where {\gamma = \frac 12(\frac 1\alpha -1)}.

For small values of {\alpha}, this shows that {\|\mathop{\mathcal P}^n\varphi\|_2} is summable, and consequently {\mu} satisfies the CLT by the above discussion.

Posted in ergodic theory, statistical laws, theorems | Tagged , | 4 Comments

The Perron-Frobenius theorem and the Hilbert metric

In the last post, we introduced basic properties of convex cones and the Hilbert metric. In this post, we look at how these tools can be used to obtain an explicit estimate on the rate of convergence in the Perron–Frobenius theorem.

1. Perron–Frobenius theorem

We start by stating a version of the Perron–Frobenius theorem. Let {A} be a {d\times d} stochastic matrix, where here we use this to mean that the entries of {A} are non-negative, and every column sums to 1: {A_{ij}\in[0,1]} for all {i,j}, and {\sum_{i=1}^d A_{ij} = 1} for all {j}. Thus the columns of {A} are probability vectors.

Such a matrix {A} describes a weighted random walk on {d} sites: if the walker is presently at site {j}, then {A_{ij}} gives the probability that he will move to site {i} at the next step. Thus if we interpret a probability vector {v} as giving the probability of the walker being at site {j} with probability {v_j}, then {v\mapsto Av} gives the evolution of this probability under one step of the random walk.

Now one version of the Perron–Frobenius theorem is as follows: If {A} is a stochastic matrix with {A>0} (that is, {A_{ij}>0} for all {i,j}), then there is exactly one probability vector {\pi} that is an eigenvector for {A}. Moreover, the eigenvalue associated to this eigenvector is 1, the eigenvalue 1 is simple, and all other eigenvalues have modulus {<1}. In particular, given any {v\in[0,\infty)^2} we have {A^n v \rightarrow \pi} exponentially quickly.

The eigenvector {\pi} is the stationary distribution for the random walk (Markov chain) given by {A}, and the convergence result states that any initial distribution converges to the stationary distribution under iteration of the process.

The assumption that {A>0} is quite strong: for the random walk, this says that the walker can get from any site to any other site in a single step. A more general condition is that {A} is primitive: that is, there exists {N\in{\mathbb N}} such that {A^N>0}. This says that there is a time {N} such that by taking {N} steps, the walker can get from any site to any other site. The same result as above holds in this case too.

In fact, the result holds in the even more general case when {A} is irreducible: for every {i,j} there exists {N} such that {(A^N)_{ij}>0}. This says that the walker can get from every site to every other site, but removes the assumption that there is a single time {N} that works for all site. For example, consider a random walk on a chessboard, where the walker is allowed to move one square horizontally or vertically at each step. Then for a sufficiently large even value of {N}, the walker can get from any white square to any other white square, but to get to a black square requires an odd value of {N}.

2. Rate of decay using convex cones

As stated above, the Perron–Frobenius theorem does not give any result on the rate with which {A^n v} converges to {\pi}. One way to give an estimate on this rate is to use convex cones and the Hilbert metric, which were discussed in the last post.

2.1. A cone and a metric

Let {{\mathop{\mathcal C}}} be the convex cone {[0,\infty)^d\subset{\mathbb R}^d}. We want an estimate on the diameter of {A({\mathop{\mathcal C}})} in the Hilbert metric {d_{\mathop{\mathcal C}}}. Recall that this metric is given by {d_{\mathop{\mathcal C}}(v,w)=\log(\beta/\alpha)}, where

\displaystyle  \begin{aligned} \beta &= \inf \{\mu>0 \mid \mu v - w \in {\mathop{\mathcal C}}\},\\ \alpha &= \sup\{\lambda>0\mid w-\lambda v\in {\mathop{\mathcal C}}\}. \end{aligned}

Another way of interpreting the cone {{\mathop{\mathcal C}}} is in terms of the partial order it places on {V}, which is given by {v\preceq w \Leftrightarrow w-v\in {\mathop{\mathcal C}}\cup\{0\}}. We see that {\beta} and {\alpha} can be characterised as

\displaystyle  \alpha = \sup\{\lambda \mid \lambda w \preceq v\},\qquad \beta = \inf\{\mu \mid v\preceq \mu w\}.

In our present example, we see that the cone {{\mathop{\mathcal C}}=[0,\infty)^d} induces the partial order {v\preceq w \Leftrightarrow v_i\leq w_i\ \forall i}. Thus

\displaystyle  \alpha = \sup \{\lambda \mid \lambda w_i\leq v_i\ \forall i\} = \min_{1\leq i\leq d} \frac{v_i}{w_i}, \ \ \ \ \ (1)

and similarly {\beta = \max_{1\leq i\leq d} \frac{v_i}{w_i}}.

2.2. Diameter of {A({\mathop{\mathcal C}})}

Now we need to determine the diameter {\Delta} of {A({\mathop{\mathcal C}})} in the Hilbert metric {d_{\mathop{\mathcal C}}}. If {\Delta<\infty}, then the theorem of Birkhoff from the previous post will imply that {d_{\mathop{\mathcal C}}} contracts distances by a factor of {\tanh(\Delta/4)<1}.

Let {e_i} be the standard basis vectors in {{\mathbb R}^d}. Because {d_{\mathop{\mathcal C}}} is projective we can compute {\Delta} by considering {d_{\mathop{\mathcal C}}(Av,Aw)} where {\sum v_i = \sum w_j = 1}. Using the triangle inequality, we have

\displaystyle  \begin{aligned} d_{\mathop{\mathcal C}}(Av,Aw) &= d_{\mathop{\mathcal C}}\left(A\sum v_i e_i, A\sum w_j e_j\right) = d_{\mathop{\mathcal C}}\left(\sum v_i (Ae_i), \sum w_j (Ae_j) \right) \\ &\leq \sum_{i,j} v_i w_j d_{\mathop{\mathcal C}}(Ae_i,Ae_j) \leq \max_{i,j} d_{\mathop{\mathcal C}}(Ae_i,Ae_j), \end{aligned}

so it suffices to consider {d_{\mathop{\mathcal C}}(Ae_i,Ae_j)} for {1\leq i,j\leq d}. But {Ae_i} is just the {i}th column of the matrix {A}, so writing {A=[v^1 \cdots v^n]}, where {v^i} is the {i}th column vector, we see that

\displaystyle  \Delta \leq \max_{i,j} d_{\mathop{\mathcal C}}(v^i,v^j). \ \ \ \ \ (2)

2.3. Contraction under multiplication by {A}

Now we have a very concrete procedure for estimating the amount of contraction in the {d_{\mathop{\mathcal C}}} metric under multiplication by {A}:

  1. estimate {\Delta} using (2) and the expression for {d_{\mathop{\mathcal C}}} in (1) and the discussion preceding it;
  2. get a contraction rate of {\tanh(\Delta/4)<1}.

From (1) and the discussion preceding it, the distance {d_{\mathop{\mathcal C}}(v^i,v^j)} is given as

\displaystyle  d_{\mathop{\mathcal C}}(v^i,v^j) = \log\beta - \log\alpha = \log\left(\max_{1\leq k\leq d} \frac{v^i_k}{v^j_k}\cdot \max_{1\leq k\leq d} \frac{v^j_k}{v^i_k}\right). \ \ \ \ \ (3)

Let {\Lambda = \tanh(\Delta/4)}. To write an explicit estimate for {\Lambda}, we use

\displaystyle  \Lambda = \frac{e^{\Delta/4} - e^{-\Delta/4}}{e^{\Delta/4} + e^{-\Delta/4}} = \frac{1 - e^{-\Delta/2}}{1+e^{-\Delta/2}} \leq \frac{1-s}{1+s}, \ \ \ \ \ (4)

where {s<1} is any estimate we can obtain satisfying {e^{-\Delta/2}\geq s}. From (3) and (2), we have

\displaystyle  e^{-\Delta/2} \geq \max_{i,j}\sqrt{ \min_k\left(\frac {v_k^i}{v_k^j}\right) \min_k \left(\frac{v_k^j}{v_k^i}\right)} =:s. \ \ \ \ \ (5)

This allows us to obtain estimates on {d_{\mathop{\mathcal C}}(A^nv, A^nw)}. However, we want to estimate {d(A^nv,A^nw)} in a more familiar metric, such as one coming from a norm. We can relate the two by observing that if {v,w\in(0,1]^d}, then

\displaystyle  \begin{aligned} d_{\mathop{\mathcal C}}(v,w) &= \log \max_k \left(\frac {v_k}{w_k}\right) + \log \max_k \left(\frac{w_k}{v_k}\right) \\ &\geq \max_k|\log v_k - \log w_k| \geq \max_k |v_k-w_k| = \|v-w\|_{L^\infty}, \end{aligned}

where the last inequality uses the fact that {\log } has derivative {\geq 1} on {(0,1]}. Since {A} maps the unit simplex to itself (because {A} is stochastic), we see that

\displaystyle  \|A^nv-A^nw\|_{L^\infty} \leq d_{\mathop{\mathcal C}}(A^nv,A^nw) \leq C\Lambda^n, \ \ \ \ \ (6)

where {\Lambda} is given by (4) and (5), and where we can take either {C=d_{\mathop{\mathcal C}}(v,w)} or {C=\Delta/\Lambda} (since {d_{\mathop{\mathcal C}}(Av,Aw)\leq \Delta}), whichever gives the better bound. Since all norms on {{\mathbb R}^d} are equivalent, we have a similar bound in any norm.

3. Nonnegative matrices

The analysis in the previous section required {A} to be positive ({A_{ij}>0} for all {i,j}). A more general condition is that {A} is nonnegative and primitive: that is, {A_{ij}\geq 0} for all {i,j}, and moreover there exists {N} such that {A^N>0}.

If {A_{ij}=0} for some {i,j}, then it is easy to see from the calculations in the previous section that {A({\mathop{\mathcal C}})} has infinite diameter in the Hilbert metric, so the above arguments do not apply directly. However, they do apply to {A^N} when {A^N>0}, and so we fix {N} for which this is true, and we obtain {\Lambda<1} such that {d_{\mathop{\mathcal C}}(A^Nv, A^Nw) \leq \Lambda d_{\mathop{\mathcal C}}(v,w)} for all {v,w\in{\mathop{\mathcal C}}}.

Moreover, let {L\in{\mathbb R}} be such that {\|A^r\|\leq L} for all {0\leq r<N}. Then for any {n\in{\mathbb N}} we can write {A^n = A^{kN+r}} for some {0\leq r<N}, so that

\displaystyle  \|A^nv-A^nw\| = \|A^r(A^{kN}v - A^{kN}w)\| \leq LC \Lambda^k,

where {C} is as in (6). Thus we conclude that asymptotically, {A^nv} approaches the eigenvector with contraction rate {\Lambda^{1/N}}.

To see this in action, consider a Markov chain with transition matrix

\displaystyle  A=\begin{pmatrix} \frac 12 & 1 \\ \frac 12 & 0\end{pmatrix}.

That is, from the first state the walker transitions to either state with probability 1/2, while from the second state the walker always returns to the first state. Since the transition from the second state to itself is forbidden, {A({\mathop{\mathcal C}})} has infinite diameter. However, the two-step transition matrix is

\displaystyle  A^2 = \begin{pmatrix} \frac 34 & \frac 12 \\ \frac 14 & \frac 12 \end{pmatrix},

for which we can compute

\displaystyle  s = \sqrt{\frac{1/4}{1/2} \cdot \frac{1/2}{3/4}} = \frac 1{\sqrt{6}} \quad \Rightarrow\quad \Lambda \leq \frac{\sqrt{6}-1}{\sqrt{6}+1}.

Thus the estimate on {A^2} gives us a definite rate of contraction, which the estimate from {A} does not.

It can be useful to use the estimate on {A^N} even when {A>0}. For example, if we consider the Markov chain with transition matrix

\displaystyle  A = \begin{pmatrix} \frac 15 & \frac 9{10} \\ \frac 45 & \frac 1{10} \end{pmatrix},

then we have

\displaystyle  s = \sqrt{\frac{1/5}{9/10}\cdot \frac{1/10}{4/5}} = \sqrt{\frac 29 \cdot \frac 18} = \frac 16 \quad\Rightarrow\quad \Lambda\leq \frac 57 \approx .714

as the rate of contraction, while considering

\displaystyle  A^2 = \begin{pmatrix} \frac{19}{25} & \frac{27}{100} \\ \frac{6}{25} & \frac{73}{100} \end{pmatrix}


\displaystyle  s = \sqrt{\frac{27/100}{19/25}\cdot\frac{6/25}{73/100}} \approx .3418 \quad\Rightarrow\quad \Lambda \leq \frac{.6582}{1.3418}\approx .4906 \approx (.7)^2,

a better estimate than we obtained from considering {A} itself.

Posted in theorems | Tagged , , , | Leave a comment

Convex cones and the Hilbert metric

Having spent some time discussing spectral methods and coupling techniques as tools for studying the statistical properties of dynamical systems, we turn now to a third approach, based on convex cones and the Hilbert metric. This post is based on Will Ott’s talk from March 25.

1. Basic definitions

Let {V} be a vector space over the reals. Ultimately we will be most interested in the case when {V} is a function space, such as {L^1} or {BV}, but for now we make the definitions in the general context.

Definition 1 A subset {C\subset V} is a convex cone (or positive cone) if

  1. {C\cap (-C) = \emptyset};
  2. {\lambda C = C} for each {\lambda>0};
  3. {C} is convex; and
  4. for all {f,g\in C} and {\alpha\in {\mathbb R}}, we have the following property: if {\alpha_n\rightarrow \alpha} and {g-\alpha_n f\in C} for every {n}, then {g-\alpha f\in C\cup \{0\}}.

The first three conditions are very geometric and in some sense guarantee that {C} “looks like a cone should look”. The last condition is more topological; if {V} is a topological vector space and {C\cup \{0\}} is a closed subset of {V}, then this condition holds, but we stress that the condition itself is actually weaker than this and is phrased without reference to any topology on {V}.

Example 1 Let {V=BV([0,1],{\mathbb R})} be the space of all real-valued functions on the unit interval with bounded variation, and let {C = \{ \varphi\in V \mid \varphi\geq 0, \varphi\not\equiv 0\}}. Then {C} is a convex cone.

We see immediately from this example that the notion of convex cone is relevant to the sorts of questions we want to ask about invariant measures of a dynamical system, because this set {C} is exactly the set of density functions that arises when we are searching for an absolutely continuous invariant measure.

This suggests that we will ultimately want to consider the action of some operator {L\colon C\rightarrow C}, and in particular may want to find a fixed point of this action (for a suitable operator {L}). One of the most powerful methods for finding a fixed point is to find a metric in which {L} acts as a contraction, and this is accomplished by the Hilbert metric, which we now introduce.

Definition 2 Fix a convex cone {C\subset V}. Given {\varphi,\psi\in C}, let

\displaystyle  \begin{aligned} \beta(\varphi,\psi) &= \inf \{\mu>0 \mid \mu\varphi - \psi\in C\},\\ \alpha(\varphi,\psi) &= \sup \{\lambda>0 \mid \psi - \lambda\varphi \in C\}, \end{aligned} \ \ \ \ \ (1)

with {\alpha=0} and/or {\beta=\infty} if the corresponding set is empty. The cone distance between {\varphi} and {\psi} is

\displaystyle  d_C(\varphi,\psi) = \log \left( \frac{\beta(\varphi,\psi)}{\alpha(\varphi,\psi)}\right). \ \ \ \ \ (2)

The distance {d_C} is also called the Hilbert (projective) metric.

Several remarks are now in order. First we observe that although {V} may be infinite-dimensional, the distance {d_C(\varphi,\psi)} is completely determined in terms of the two-dimensional subspace spanned by {\varphi} and {\psi}, and in particular by the points shown in Figure 1 — in the figure, the lines {0A} and {0B} are the boundary of this two-dimensional cross-section of {C}. The lines {0X} and {Y\psi} are parallel, as are the lines {0A} and {\psi X}; then we have

\displaystyle  \alpha = \frac{|\psi Y|}{|0\varphi|} \text{ and } \beta = \frac{|0X|}{|0\varphi|}.

Fig 1

An alternate description of {d_C} is available in terms of this more geometric description. Let {\ell} be the line through {\varphi} and {\psi}, and let {A,B} be the points where this line intersects the boundary of {C}. We see from Figure 1 that the triangles {BY\psi} and {B0\varphi} are similar, so

\displaystyle  \alpha = \frac{|\psi Y|}{|0\varphi|} = \frac{|B\psi|}{|B\varphi|}.

Furthermore, {\varphi 0A} and {\varphi X\psi} are similar, so

\displaystyle  \beta = \frac{|0X|}{|0\varphi|} = 1 + \frac{|\varphi X|}{|0\varphi|} = 1 + \frac{|\psi\varphi|}{|A\varphi|} = \frac{|A\psi|}{|A\varphi|}.

Thus {d_C} can be given in terms of the cross-ratio of the points {\varphi,\psi,A,B}:

\displaystyle  \frac \beta\alpha = \frac{|A\psi|}{|A\varphi|}\frac{|B\varphi|}{|B\psi|} = (\varphi,\psi;A,B).

We have

\displaystyle  d_C(\varphi,\psi) = \log(\varphi,\psi;A,B). \ \ \ \ \ (3)

Note that it is possible that the line {\ell} does not intersect the boundary of {C} twice; this corresponds to the case when either {\alpha=0} or {\beta=\infty} (or both) in (1), and in this case {d_C(\varphi,\psi)=\infty}.

This situation occurs, for example, when we take {V=BV([0,1],{\mathbb R})} and {C} as in the example above, and consider {\varphi,\psi\in C} with disjoint supports — that is, {\varphi(x)\psi(x)=0} for all {x}. In this case {\alpha=0} and {\beta=\infty} so the cone distance between {\varphi} and {\psi} is infinite.

Because of this phenomenon, {d_C} is not a true metric. Moreover, we observe that {d_C} is projective: {d_C(\varphi,\lambda\varphi)=0} for every {\lambda>0}.

An important property of the Hilbert metric is the following theorem, due to Birkhoff, which states that a linear map from one convex cone to another is a contraction whenever its image has finite diameter.

Theorem 3 Let {C_1\subset V_1} and {C_2\subset V_2} be convex cones, and let {L\colon V_1\rightarrow V_2} be a linear map such that {L(C_1)\subset C_2}. (This is a sort of `positivity’ condition.) Let

\displaystyle  \Delta = \sup_{\hat\varphi,\hat\psi\in L(C_1)} d_{C_2}(\hat\varphi,\hat\psi).

Then for all {\varphi,\psi\in C_1}, we have

\displaystyle  d_{C_2}(L\varphi,L\psi)\leq \tanh\left(\frac \Delta4\right) d_{C_1}(\varphi,\psi), \ \ \ \ \ (4)

where we use the convention that {\tanh\infty=1}.

We also want to relate {d_C} to a more familiar norm. Say that a norm {\|\cdot\|} on {V} is adapted if the following is true: whenever {\varphi,\psi\in V} are such that {\varphi-\psi\in C} and {\varphi+\psi\in C}, we have {\|\psi\|\leq\|\varphi\|}.

Example 2 On {BV}, the {L^1} norm is adapted, but the BV norm is not.

The following lemma, due to Liverani, Saussol, and Vaienti, relates the cone metric to an adapted norm.

Lemma 4 Let {\|\cdot\|} be an adapted norm on {V} and {C\subset V} a convex cone. Then for all {\varphi,\psi\in C} with {\|\varphi\|=\|\psi\|>0}, we have

\displaystyle  \|\varphi-\psi\| \leq \left(e^{d_C(\varphi,\psi)} - 1\right) \|\varphi\|. \ \ \ \ \ (5)

Convex cones and the Hilbert metric are well suited to studying nonequilibrium open systems. Consider the following setting. Let {X} be a Riemannian manifold, {\lambda} volume on {X}, and {\hat f_i\colon X\rightarrow X} a diffeomorphism. For {m\in {\mathbb N}}, let {\hat F_m = \hat f_m \circ \cdots \circ \hat f_1}. This is a nonequilibrium closed system. (Nonequilibrium because the map changes at each time step, closed because every point can be iterated arbitrarily many times.)

Now consider sets {H_j\subset X}, which we interpret as a “hole” at time {j}. The time-{m} survivor set is

\displaystyle  S_m = X\setminus \bigcup_{i=1}^m \hat F_i^{-1}(H_i),

the set of points that do not fall into a hole before time {m}. Let {F_m = \hat F_m|_{S_m}}. We refer to the pair {(F_m, H_m)} as a nonequilibrium open dynamical system.

We would like an analogue of decay of correlations for such systems. Let {\varphi_0,\psi_0} be two probability density functions on {X}, and evolve these under {(F_m, H_m)}. We expect that {\|\varphi_t\|_{L^1(\lambda)} < 1} because there is a positive probability of falling into a hole.

Let {\hat{\mathcal{P}}_j} be the Perron–Frobenius operator for the closed system {\hat f_j} (with respect to {\lambda}). Then to the open system {f_j} we can associate the operator

\displaystyle  \mathcal{P}_j(\varphi) = \hat{\mathcal{P}}_j(\varphi) {\mathbf{1}}_{X\setminus H_j}.

Definition 5 We say that {(F_m,H_m)} exhibits conditional memory loss in the statistical sense if for all suitably chosen {\varphi_0, \psi_0}, we have

\displaystyle  \lim_{t\rightarrow\infty} \left\| \frac{\varphi_t}{\|\varphi_t\|_{L^1(\lambda)}} - \frac{\psi_t}{\|\psi_t\|_{L^1(\lambda)}} \right\|_{L^1(\lambda)} = 0.

The idea of this definition is that before comparing the probabilities, we need to first condition on the event that the trajectory survives. Next time we will investigate this property for piecewise expanding interval maps using the Lasota–Yorke inequality, where the holes {H_j} are small and vary slowly.

Posted in functional analysis | Tagged , | 3 Comments

Spectral methods 3 – central limit theorem

With the previous post on convergence of random variables, the law of large numbers, and Birkhoff’s ergodic theorem as background, we return to the spectral methods discussed in the first two posts in this series. This post is based on Andrew Török’s talk from March 4 and gives a proof of the central limit theorem using the spectral gap property.

1. Central limit theorem for IID

Now we recall the statement of the central limit theorem (CLT) and give a proof in the case of IID (independent identically distributed) random variables.

The weak law of large numbers says that if {X_n} is a sequence of IID random variables with {\mathop{\mathbb E}[X_n] = 0}, then writing {S_n = \sum_{k=0}^{n-1}X_k}, the time averages {\frac 1n S_n} converge to {0} in probability, or equivalently (since the limit is a constant), in distribution. In the case when {\sigma^2 = \mathop{\mathbb E}[X^2]<\infty}, the central limit theorem strengthens this to the result that the sequence {\frac 1{\sqrt{n}} S_n} converges in distribution to {N(0,\sigma^2)}, the normal distribution with mean {0} and variance {\sigma^2}. That is, we have

\displaystyle  \mathop{\mathbb P}\left( \frac 1{\sqrt n} S_n \leq c\right) \xrightarrow{n\rightarrow\infty} \frac 1{\sigma\sqrt{2\pi}} \int_{-\infty}^c e^{-t^2/2\sigma^2}\,dt \ \ \ \ \ (1)

for every {c\in {\mathbb R}}.

This can be established by the same method as we used last time for the proof of the weak law of large numbers, by studying the characteristic functions of {\frac 1{\sqrt n}S_n} and {N(0,\sigma^2)}. The characteristic function of {N(0,\sigma^2)} is

\displaystyle  \psi(t) = e^{-\frac 12 \sigma^2 t^2}. \ \ \ \ \ (2)

Arguing as in the proof of the weak law of large numbers in the previous post, we write {\varphi_n} for the characteristic function of {\frac 1{\sqrt n} S_n} and observe that

\displaystyle  \varphi_n(t) = \mathop{\mathbb E}[e^{it \frac 1{\sqrt n} (X_1+\cdots + X_n)}] = \prod_{j=1}^n \mathop{\mathbb E}[e^{\frac{it}{\sqrt n} X_j}] = \varphi\left(\frac {t}{\sqrt n}\right)^n, \ \ \ \ \ (3)

where {\varphi} is the characteristic function of the {X_j} (which are identically distributed), and the second equality uses the fact that the {X_j} are independent.

Now by Taylor’s theorem, we have

\displaystyle  \begin{aligned} \varphi(t/\sqrt{n}) &= \mathop{\mathbb E}[e^{\frac {it}{\sqrt n}X_j}] = 1 + \frac{it}{\sqrt n} \mathop{\mathbb E}[X_j] - \frac{t^2}{2n} \mathop{\mathbb E}[X_j^2] + o(t^2) \\ &= 1 - \frac {t^2}{2n}\sigma^2 + o(t^2), \end{aligned}

using the fact that the {X_j} have mean 0 and variance {\sigma^2}. Thus we conclude from (3) that

\displaystyle  \varphi_n(t) = \left(1 - \frac{t^2\sigma^2}{2n} + o(t^2)\right)^n \xrightarrow{n\rightarrow\infty} e^{-\frac 12 t^2\sigma^2} = \psi(t),

which completes the proof of the CLT in the IID case.

2. CLT with spectral gap

To translate the CLT into the language of dynamical systems, we consider a space {X} and a map {T\colon X\rightarrow X} with an invariant measure {\mu}. In general there may be many {T}-invariant measures, and so it is important to choose a suitable measure {\mu}. For example, when {X} is an interval and {f} is piecewise expanding, we are most interested in the case when {\mu} is an acip.

Given a measurable function {f\colon X\rightarrow {\mathbb R}}, the sequence of functions {f}, {f\circ T}, {f\circ T^2, \dots} defines a sequence of identically distributed random variables on {X}. However, they are not independent, and so we need some information about the decay of correlations between them. In particular, we can replicate the proof from the previous section as long as the transfer operator has a spectral gap.

Let’s make this precise in the case when {T} is a piecewise expanding interval map, so the Lasota–Yorke inequality we discussed in an earlier post yields a spectral gap for the transfer operator {\mathcal{P}_T} acting on {BV}, the space of functions of bounded variation, and in particular establishes the existence of an acip {\mu}.

Theorem 1 Let {T} be a piecewise expanding interval map and {\mu} the acip constructed before. Suppose that {\mu} is mixing. Then {\mu} satisfies the central limit theorem as follows: given any {f\in BV} with {\int f\,d\mu = 0} and writing {S_nf(x) = \sum_{k=0}^{n-1} f\circ T^k}, we have

\displaystyle  \mu \left\{ x \mid \frac 1{\sqrt{n}} S_nf(x) \leq c \right\} \xrightarrow{n\rightarrow\infty} \frac 1{\sigma\sqrt{2\pi}} \int_{-\infty}^c e^{-x^2/2\sigma^2}\,dx \ \ \ \ \ (4)

for all {c\in {\mathbb R}}, where {\sigma} is given by the Green–Kubo formula

\displaystyle  \sigma^2 = \sum_{n\in{\mathbb Z}} \int_X f \cdot (f\circ T^n) \,d\mu, \ \ \ \ \ (5)

and {\sigma=0} if and only if there exists {g\in BV} and {c\in {\mathbb R}} such that {f=c+g\circ T - g}.

Before proving the theorem, we make some remarks concerning the Green–Kubo formula (5). First, note that the sum converges as soon as we establish exponential decay of correlations for functions in {BV}, since each integral in the sum is just the correlation function at time {n}. Second, note that if we replace the functions {f\circ T^n} with independent random variables, then all the terms with {n\neq 0} vanish, and the {n=0} term is just the variance {\mathop{\mathbb E}[X^2]}, as in the previous section.

Note also that using (5), {\sigma^2} can be written as

\displaystyle  \sigma^2 = \lim_{n\rightarrow\infty} \frac 1n \int (S_n f)^2\,d\mu.

Now we prove the central limit theorem (4). As in the IID case, we use the characteristic functions

\displaystyle  \begin{aligned} \psi(t) &= e^{-\sigma^2 t^2/2}, \\ \varphi_n(t) &= \mathop{\mathbb E}_\mu[e^{it(S_nf)/\sqrt{n}}] = \int e^{\frac{it}{\sqrt n} S_nf} \,d\mu, \end{aligned}

where {\psi(t)} is the characteristic function of the normal distribution and {\varphi_n} is the characteristic function of {\frac 1{\sqrt n}S_nf}, so it suffices to show that {\varphi_n(t)\rightarrow \psi(t)} for all {t}.

To prove this convergence of the characteristic functions, we use the following procedure.

  1. Write the characteristic functions {\varphi_n} in terms of a twisted transfer operator {\mathcal{P}_{f,t}}, where {f} is the function we are investigating in the CLT, and {t\approx 0} is a small real parameter. The operator {\mathcal{P}_{f,t}} is a small perturbation of the transfer operator {\mathcal{P}_T}.
  2. Use perturbation theory of operators to show that {\mathcal{P}_{f,t}} has a spectral gap and to derive asymptotics for the leading eigenvalue {\lambda(t)}. In particular, relate {\lambda'(0)} and {\lambda''(0)} to the mean and variance of the limiting distribution.

First we define the transfer operator itself by the implicit equation

\displaystyle  \int (\mathcal{P}_T g)\cdot h\,d\mu = \int g\cdot (h\circ T)\,d\mu \ \ \ \ \ (6)

for all {g\in L^1(\mu)} and {h\in L^\infty}. Note that this is different from the transfer operator defined by integrating with respect to Lebesgue measure in (6) — it is a worthwhile exercise to determine the precise relationship between the two.

More directly, the transfer operator can be defined by

\displaystyle  \mathcal{P}_T g(x) = \sum_{y\in T^{-1}(x)} \frac{g(y)h(y)}{|T'(y)|}, \ \ \ \ \ (7)

where {h} is the density of {\mu} with respect to Lebesgue measure.

Now given {f\in BV} and {t\in{\mathbb R}}, we define the twisted transfer operator by

\displaystyle  \mathcal{P}_{f,t}g = \mathcal{P}_T(e^{itf}g). \ \ \ \ \ (8)

To see the utility of this definition, we first note that

\displaystyle  \int\mathcal{P}_{f,t}(g)\,d\mu = \int \mathcal{P}_T(e^{itf}g){\mathbf{1}}\,d\mu =\int e^{itf} g\,d\mu,

and so by induction we have

\displaystyle  \int \mathcal{P}_{f,t}^n(g) \,d\mu = \int e^{itS_nf} g\,d\mu.

In particular, considering the characteristic function {\varphi_n}, we have

\displaystyle  \varphi_n(t) = \int e^{\frac {it}{\sqrt n} S_n f}\,d\mu = \int \mathcal{P}_{f,\frac{t}{\sqrt n}}^n ({\mathbf{1}})\,d\mu, \ \ \ \ \ (9)

which accomplishes the first stage of the proof — writing the characteristic function in terms of the twisted transfer operator.

For the second stage of the proof, we consider the twisted transfer operator as a perturbation of {\mathcal{P}_T}. From the Lasota–Yorke inequality and the fact that {\mu} is mixing, we know that the spectrum of {\mathcal{P}_T} has the form {\{1\}\cup Z}, where {Z} is contained in a disc of radius {r<1} centred at the origin.

By the perturbation theory of linear operators, the spectrum of {\mathcal{P}_{f,t}} has the same form for small enough {|t|}: there is a leading eigenvalue {\lambda(t)} that is close to {1}, and the rest of the spectrum is contained in the disc of radius {r}. Moreover, the leading eigenvalue satisfies (Edit: see the end of the post for a proof)

\displaystyle  \lambda'(0) = \int (if) \,d\mu = 0


\displaystyle  \lambda''(0) = \lim_{n\rightarrow\infty} \frac 1n \int (S_n(if))^2\,d\mu = -\sigma^2,

which is the origin of the expression in the Green–Kubo formula.

Now we use the Riesz functional calculus, whose general ideas we briefly recall here. Let {X} be a Banach space and {\mathcal{B}(X)} the space of bounded linear operators on {X}. Given {S\in \mathcal{B}(X)}, let {\sigma\subset{\mathbb C}} be the spectrum of {S}. Then there is a unique way to associate to each analytic function {g\colon \sigma\rightarrow{\mathbb C}} an operator {g(S)} such that the map {g\mapsto g(S)} is a homomorphism mapping the constant function to the identity operator and the identity function to {S}.

This mapping can be defined by integrating around a curve {\gamma} surrounding the spectrum {\sigma} (this is similar to the Cauchy formula from complex analysis):

\displaystyle  g(S) = \frac 1{2\pi i} \int_\gamma (S-zI)^{-1} g(z)\,d\gamma,

where we recall that {S-zI} is invertible for all {z} in the resolvent {{\mathbb C}\setminus\sigma}. If we take {g} to be the characteristic function of part of the spectrum, we obtain a projection to the eigenspace associated with that part.

In particular, considering the operator {\mathcal{P}_{f,t}}, we may set {g = {\mathbf{1}}_{\lambda(t)}} and obtain a projection {\Pi_t} onto the eigenspace of {\lambda(t)}. Similarly, setting {g(z)=z{\mathbf{1}}_{Z(t)}(z)}, where {Z(t)} is the part of the spectrum contained in a disc of radius {r<1}, we get an operator {R_t} such that

\displaystyle  \Pi_t R_t = R_t\Pi_t = 0, \qquad \|R_t\|_{BV} < r.

Moreover, we have

\displaystyle  \mathcal{P}_{f,t} = \lambda(t) \Pi_t + R_t,

which allows us to write the operator in (9) as

\displaystyle  \begin{aligned} \mathcal{P}_{f,\frac{t}{\sqrt n}}^n &= \lambda \left( \frac t{\sqrt n}\right)^n \Pi_{\frac t{\sqrt n}} + R_{\frac t{\sqrt n}}^n \\ &= \left( 1 + \lambda'(0) \frac t{\sqrt n} + \frac{\lambda''(0)}2 \frac {t^2} n + o\left( \frac {t^2} n\right) \right)^n \Pi_\frac t{\sqrt n} + R_\frac t{\sqrt n}^n \\ &= \left( 1 - \frac{\sigma^2 t^2}{2n} +o\left( \frac {t^2} n\right) \right)^n\left(\Pi_0 + O\left( \frac t{\sqrt n}\right)\right) + R_\frac t{\sqrt n}^n \\ &\xrightarrow{n\rightarrow\infty} e^{-t^2\sigma^2/2}\Pi_0, \end{aligned}

using the fact that {\|R_t\|_{BV} < r < 1}. Now (9) yields

\displaystyle  \varphi_n(t) \rightarrow e^{-t^2\sigma^2/2} \int \Pi_0({\mathbf{1}})\,d\mu = e^{-t^2\sigma^2/2} = \psi(t),

which completes the proof of the CLT.

3. Proof of formulas for derivatives of {\lambda}

The formulas given above for {\lambda'(0)} and {\lambda''(0)} were not explained. Here is a derivation of these formulas.

Let {g_t} be the eigenfunction of {\mathop{\mathcal P}_{f,t}} corresponding to the eigenvalue {\lambda(t)}. That is, {g_t} satisfies

\displaystyle  \mathcal{P}_T(e^{itf} g_t) = \lambda(t) g_t.

Multiplying by a test function {h} and integrating against {\mu} gives

\displaystyle  \int \mathcal{P}_T(e^{itf} g_t) h \,d\mu = \lambda(t) \int g_t h \,d\mu.

Recalling the definition of {\mathcal{P}_T}, this gives

\displaystyle  \int (e^{itf} g_t) (h\circ T) \,d\mu = \lambda(t) \int g_t h \,d\mu. \ \ \ \ \ (10)

Let {g'_t = \frac d{dt} g_t} and {g''_t = \frac {d^2}{dt^2} g_t}. Then differentiating (10) with respect to {t} gives

\displaystyle  \int (if) (e^{itf} g_t) (h\circ T) \,d\mu + \int (e^{itf} g'_t)(h\circ T)\,d\mu = \lambda'(t) \int g_t h\,d\mu + \lambda(t) \int g_t' h\,d\mu. \ \ \ \ \ (11)

Setting {t=0} and using the fact that {\lambda(0)=1} and {g_0\equiv 1}, we get

\displaystyle  \int(if)(h\circ T) \,d\mu + \int (q)(h\circ T) \,d\mu = \lambda'(0) \int h\,d\mu + \int(q) (h)\,d\mu,

where we write {q = \frac d{dt}g_t|_{t=0}}.

Putting {h\equiv 1} gives the expression for {\lambda'(0)}. Before finding {\lambda''(0)}, we observe that the above equation can also be used to find {\int qh \,d\mu}, which will be important later on. Indeed, using the assumption that {\int f\,d\mu = 0}, we have {\lambda'(0)=0}, and so the above equation becomes

\displaystyle  \int(if)(h\circ T) \,d\mu + \int (q)(h\circ T) \,d\mu = \int(q) (h)\,d\mu. \ \ \ \ \ (12)

Similarly, replacing {h} with the test functions {h\circ T^k} for {k\geq 1}, (12) gives

\displaystyle  \begin{aligned} \int(if)(h\circ T^2) \,d\mu + \int (q)(h\circ T^2) \,d\mu &= \int(q) (h\circ T)\,d\mu, \\ \int(if)(h\circ T^3) \,d\mu + \int (q)(h\circ T^3) \,d\mu &= \int(q) (h\circ T^2)\,d\mu, \end{aligned}

and so on. Observe that {\sum_{k\geq 1} (q)(h\circ T^k)\,d\mu} converges because we have exponential decay of correlations. Thus we may add the above equations (infinitely many of them) and subtract this sum from both sides to obtain

\displaystyle  \int qh\,d\mu = \sum_{k\geq 1} \int (if) (h\circ T^k)\,d\mu. \ \ \ \ \ (13)

Now we can find the expression for {\lambda''(0)}. Set {h\equiv 1} in (11) and differentiate to get

\displaystyle  \int \left((if)^2 (e^{itf} g_t) + 2(if) (e^{itf} g'_t) + e^{itf} g''_t)\right)\,d\mu = \int (\lambda''(t) g_t + 2\lambda'(t) g'_t + \lambda(t) g''_t)\,d\mu. \ \ \ \ \ (14)

At {t=0} we see that the terms containing {g_t''} are equal, while {\lambda'(0)=0} by the assumption that {\int f\,d\mu = 0}, and so (14) gives

\displaystyle  \int (-f^2)\,d\mu + 2\int (if)(q) \,d\mu = \lambda''(t). \ \ \ \ \ (15)

From (13), we have

\displaystyle  \int(if)(q)\,d\mu = \sum_{k\geq 1} \int(if) (if\circ T^k)\,d\mu,

which together with (15) suffices to complete the proof of the expression for {\lambda''(0)}.

Posted in statistical laws, theorems | Tagged , | 8 Comments