Slowly mixing sets

There are two equivalent definitions of mixing for a measure-preserving dynamical system {(X,f,\mu)}. One is in terms of sets:

\displaystyle  \lim_{n\rightarrow\infty} |\mu(A \cap f^{-n}B) - \mu(A)\mu(B)| = 0 \ \ \ \ \ (1)

for all measurable {A,B\subset X}. The other is in terms of functions:

\displaystyle  \lim_{n\rightarrow\infty} \left\lvert \int \varphi \cdot(\psi\circ f^n)\,d\mu - \int\varphi\,d\mu \int\psi\,d\mu\right\rvert = 0 \ \ \ \ \ (2)

for all {\varphi,\psi\in L^2(X,\mu)}. In both cases one may refer to the quantity inside the absolute value as the correlation function, and write it as {C_n(A,B)} or {C_n(\varphi,\psi)}, so that (1) and (2) become {C_n(A,B)\rightarrow 0} and {C_n(\varphi,\psi)\rightarrow 0}, respectively.

Then it is natural to ask how quickly mixing occurs; that is, how quickly the correlation functions {C_n} decay to 0. Is the convergence exponential in {n}? Polynomial? Something else?

In a previous post, we discussed one method for establishing exponential decay of correlations using a spectral gap for a certain operator associated to the system. The result there stated that as long as {\varphi} and {\psi} are chosen from a suitable class of test functions (often Hölder continuous functions), then {C_n(\varphi,\psi)} decays exponentially.

A comment was made in that post that it is important to work with sufficiently regular test functions, because for arbitrary measurable functions, or for the setwise correlations {C_n(A,B)}, the decay may happen arbitrarily slowly even when the system is “as mixing as possible”. But no examples were given there — so I’d like to explain this a little more completely here.

Let {X=\{0,1\}^{\mathbb N}} be the space of infinite sequences of 0s and 1s, and let {f=\sigma} be the shift map. Let {\mu} be {(\frac 12,\frac 12)}-Bernoulli measure, so that writing {[x_1\cdots x_n]} for the {n}-cylinder containing all sequences in {X} that begin with the symbols {x_1,\dots,x_n}, we have {\mu[x_1\cdots x_n] = 2^{-n}}. This is isomorphic to the doubling map on the circle with Lebesgue measure, which was discussed in the previous post and which has exponential decay of correlations for Hölder observables {\varphi,\psi}. Nevertheless, we will show that there are sets {A,B\subset X} such that {C_n(A,B)} decays quite slowly.

We can think of {(X,f,\mu)} as modeling sequences of coin flips, with tails corresponding to the symbol 0, and heads to 1.

Let {B = [1]} be the set of all sequences in {X} that begin with 1. Then {B} corresponds to the event that the first flip is heads, and {\sigma^{-n}B} corresponds to the event that the {(n+1)}st flip is heads.

The description of {A} is a little more involved. Given {j\leq k \in {\mathbb N}}, let {A_{j,k} = \bigcup_{i=j}^{k-1} \sigma^{-i}[1]}, so that {x\in A_{j,k}} if and only if (at least) one of the symbols {x_j,\dots,x_{k-1}} is 1. This corresponds to the event that heads appears at least once on or after the {j}th flip and before the {k}th flip. Note that {\mu(X \setminus A_{j,k}) = 2^{-(k-j)}}, and so {\mu(A_{j,k}) = 1-2^{-(k-j)}}.

Fix an increasing sequence of integers {k_1,k_2,\dots} with {k_1 = 1}, and let {A = \bigcap_{i=1}^\infty A_{k_i, k_{i+1}}} be the set of all sequences {x\in X} such that every subword {x_{k_i} \cdots x_{k_{i+1}-1}} contains at least one occurrence of the symbol 1.

Using the interpretation as coin flips, we see that the events {A_{k_i,k_{i+1}}} are independent, and so

\displaystyle  \mu(A) = \mathop{\mathbb P}(A) = \prod_{i=1}^\infty \mathop{\mathbb P}(A_{k_i,k_{i+1}}) = \prod_{i=1}^\infty (1-2^{-(k_{i+1} - k_i)}). \ \ \ \ \ (3)

We recall the following elementary lemma.

Lemma 1 Let {t_i\in {\mathbb R}} be a sequence of real numbers such that {0\leq t_i < 1} for all {i\in {\mathbb N}}. Then {\prod_{i=1}^\infty (1-t_i) > 0} if and only if {\sum_{i=1}^\infty t_i < \infty}.

It follows from Lemma 1 that {\mu(A)>0} if and only if

\displaystyle  \sum_{i=1}^\infty 2^{-(k_{i+1} - k_i)} < \infty. \ \ \ \ \ (4)

Thus from now on we assume that the sequence {k_i} is chosen such that (4) holds. Note that the correlation function {C_n(A,B)} can be computed via conditional probabilities: using

\displaystyle  \mathop{\mathbb P}(A \mid \sigma^{-n}B) = \frac{\mu(A \cap \sigma^{-n}B)}{\mu(B)}

(notice that this uses shift-invariance of {\mu}), we have

\displaystyle  \begin{aligned} C_n(A,B) &= \mu(A\cap \sigma^{-n}B) - \mu(A)\mu(B) \\ &= \mu(B) (\mathop{\mathbb P}(A \mid \sigma^{-n}B) - \mu(A)). \end{aligned} \ \ \ \ \ (5)

Recalling that {A} is the intersection of the independent events {A_{k_i,k_{i+1}}}, we let {j=j(n)} be such that {n\in [k_j, k_{j+1})}, and observe that {\sigma^{-n}(B) \subset A_{k_j,k_{j+1}}}. Thus

\displaystyle  \mathop{\mathbb P}(A \mid \sigma^{-n}B) = \prod_{i\neq j} (1-2^{-(k_{i+1}-k_i)}) = \frac{\mu(A)}{1-2^{-(k_{j+1}-k_j)}},

where the last equality uses the expression for {\mu(A)} from (3). Together with the expression (5) for {C_n(A,B)} in terms of the conditional probability {\mathop{\mathbb P}(A\mid \sigma^{-n}B)}, this gives

\displaystyle  \begin{aligned} C_n(A,B) &= \mu(A) \mu(B) \left( \frac 1{1-2^{-(k_{j+1}-k_j)}} - 1 \right) \\ &= \mu(A)\mu(B) \frac 1{2^{k_{j+1} - k_j} - 1}, \end{aligned} \ \ \ \ \ (6)

where we emphasise that {j} is a function of {n}.

Example 1 Take {k_i = \lfloor 2 i\log_2 i \rfloor}, so that {k_{i+1} - k_i \approx 2 i \log_2 (1+\frac 1i) + 2\log_2(i+1)}, and {2^{k_{i+1} - k_i} \approx (i+1)^2 (1+\frac 1i)^{2i}}. Since {(1+\frac 1i)^{2i} \rightarrow e^2}, we see that (4) is satisfied, so that {\mu(A)>0}.

Moreover, {j=j(n)} satisfies {2j\log_2 j \leq n}, and so in particular {j\leq n}. This implies that {k_{j+1} - k_j \leq k_{n+1} - k_n}, and so we can estimate the correlation function using (6) by

\displaystyle  C_n(A,B) \geq \mu(A)\mu(B)\frac 1{2^{k_{n+1} - k_n} - 1} \approx \mu(A)\mu(B) \frac 1{n^2}.
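Here is a quick numerical sketch (Python, purely illustrative) of Example 1: it evaluates {C_n(A,B)} exactly via formula (6), truncating the infinite product (3) and starting the sequence at {i=2} so that the first gap is positive; neither convenience affects the asymptotics.

```python
import math

# Gaps of k_i = floor(2 i log2 i); start at i = 2 so that the gaps are positive.
ks = [math.floor(2 * i * math.log2(i)) for i in range(2, 2000)]

# mu(A) from the product (3), truncated; mu(B) = 1/2 for the cylinder B = [1].
muA = 1.0
for i in range(len(ks) - 1):
    muA *= 1.0 - 2.0 ** -(ks[i + 1] - ks[i])
muB = 0.5

def corr(n):
    """C_n(A,B) from formula (6): j = j(n) is the index with k_j <= n < k_{j+1}."""
    j = max(i for i in range(len(ks) - 1) if ks[i] <= n)
    return muA * muB / (2.0 ** (ks[j + 1] - ks[j]) - 1)

for n in (10, 100, 1000, 10000):
    # n^2 * C_n stays bounded away from zero: the decay is no faster than 1/n^2.
    print(n, corr(n), n ** 2 * corr(n))
```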

The example shows that the correlation function of sets may only decay polynomially, even if the system has exponential decay of correlations for regular observable functions. In fact, we can use the construction above to produce sets with correlations that decay more or less as slowly as we like along some subsequence of times {n}.

Theorem 2 Let {\gamma_n>0} be any sequence of positive numbers converging to 0. Then there is a sequence {k_i} such that the sets {A,B} from the construction above have {\limsup_{n\rightarrow\infty} C_n(A,B)/\gamma_n = \infty}.

The result says that no matter how slowly {\gamma_n} converges to 0, we can choose {A,B} such that {C_n(A,B)} is not bounded above by {K\gamma_n} for any constant {K}.

To prove Theorem 2, we will choose {k_i} such that {C_n(A,B)/\gamma_n} is large whenever {n\in [k_j, k_{j+1})} with {j} even. Let {c_i,d_i} be any increasing sequences of integers (conditions on these will be imposed soon) and define {k_i} recursively by

\displaystyle  k_1 = 1, \quad k_{2i} = k_{2i-1} + c_i, \qquad k_{2i+1} = k_{2i} + d_i.

Because {c_i,d_i} are increasing, we have

\displaystyle  \sum_{i=1}^\infty 2^{-(k_{i+1} - k_i)} = \sum_{i=1}^\infty \left( 2^{-c_i} + 2^{-d_i} \right) < \infty,

so (4) holds and {\mu(A)>0}. The idea is that we will take {c_i \gg d_i}, so that when {n\in [k_{2j},k_{2j+1})}, we have {k_{2j+1} - k_{2j} = d_j} small relative to {k_{2j}}, and hence to {n}.

Let’s flesh this out a little. Given {n\in [k_{2j},k_{2j+1})}, it follows from (6) that

\displaystyle  C_n(A,B) = \mu(A)\mu(B) \frac 1{2^{k_{2j+1} - k_{2j}} - 1} = \mu(A)\mu(B) \frac 1{2^{d_j} - 1} \geq \mu(A)\mu(B) 2^{-d_j}. \ \ \ \ \ (7)

On the other hand, {n\geq k_{2j} \geq \sum_{i=1}^j c_i}. Now we may take {d_j} to be any increasing sequence ({d_j=j} will do just fine) and define {c_j} recursively in terms of {c_1,\dots, c_{j-1}} and {d_j}. We do this by observing that since {\gamma_n \rightarrow 0}, given any {c_1,\dots, c_{j-1}} and {d_j}, we can take {c_j} large enough so that for every {n\geq \sum_{i=1}^j c_i}, we have

\displaystyle  j\gamma_n \leq \mu(A) \mu(B) 2^{-d_j}.

(Note that this choice of {c_j} is not circular, even though {\mu(A)} depends on the entire sequence: since the {c_i} and {d_i} are increasing sequences of positive integers we have {c_i,d_i\geq i}, so {\mu(A) \geq \prod_{i=1}^\infty (1-2^{-i})^2 > 0}, and this a priori lower bound can be used in place of {\mu(A)} in the previous display.) In particular, (7) gives {C_n(A,B) \geq j\gamma_n} for every {n\in [k_{2j},k_{2j+1})}. Sending {j\rightarrow\infty} completes the proof of Theorem 2.
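For what it’s worth, here is a small Python sketch of this recursion. The function and variable names are mine; the sketch assumes {\gamma} is nonincreasing (so that a supremum over {n\geq N} is just {\gamma_N}) and uses the a priori lower bound for {\mu(A)\mu(B)} noted above.

```python
import math

def build_k(gamma, J, muAB_lower=0.04):
    """Sketch of the recursive choice of k_i in the proof of Theorem 2.

    gamma      : a nonincreasing function n -> gamma_n with gamma_n -> 0
                 (monotonicity is assumed so that sup_{n >= N} gamma_n = gamma(N)).
    muAB_lower : an a priori lower bound for mu(A)mu(B); since c_i, d_i >= i we may
                 use (1/2) * prod_i (1 - 2^{-i})^2 > 0.04.
    Returns [k_1, ..., k_{2J+1}] with k_{2j} = k_{2j-1} + c_j, k_{2j+1} = k_{2j} + d_j.
    """
    k, sum_c, c_prev = [1], 0, 0
    for j in range(1, J + 1):
        d_j = j                         # "d_j = j will do just fine"
        c_j = c_prev + 1                # keep the c_j strictly increasing
        # enlarge c_j until  j * gamma_n <= muAB_lower * 2^{-d_j}  for all n >= c_1 + ... + c_j
        while j * gamma(sum_c + c_j) > muAB_lower * 2.0 ** (-d_j):
            c_j *= 2
        sum_c, c_prev = sum_c + c_j, c_j
        k.append(k[-1] + c_j)           # k_{2j}
        k.append(k[-1] + d_j)           # k_{2j+1}
    return k

# Example: gamma_n = n^{-1/4} converges to zero quite slowly.
print(build_k(lambda n: n ** -0.25, J=5))
```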

Posted in ergodic theory, examples

Law of large numbers for dependent but uncorrelated random variables

One of the fundamental results in probability theory is the strong law of large numbers, which was discussed in an earlier post under the guise of the Birkhoff ergodic theorem.

Suppose we have a sequence of random variables {t_n} which take values in {{\mathbb N}}. If the sequence {t_n} is independent and identically distributed with {\mathop{\mathbb E}[t_n] = T}, then the strong law of large numbers shows that {\frac 1n(t_1+\cdots + t_n)\rightarrow T} almost surely.

It is often the case, however, that one or both of these conditions fail; there may be some dependence between the variables {t_n}, or the distribution may not be identical for all values of {n}. For example, the following situation arose recently in a project I have been working on: there is a sequence of random variables {t_n} as above, which has the property that no matter what values {t_1,\dots,t_{n-1}} take, the random variable {t_n} has expected value at most {T}. In the language of conditional expectation, we have

\displaystyle  \mathop{\mathbb E}[t_n \mid t_1,\dots, t_{n-1}] \leq T.

This is true despite the fact that the distribution of {t_n} may vary according to the values of {t_1,\dots,t_{n-1}}. In what follows we will write {\mathcal{F}_n} for the {\sigma}-algebra generated by {t_1,\dots, t_n} and write the above condition as

\displaystyle  \mathop{\mathbb E}[t_n \mid \mathcal{F}_{n-1}]\leq T.

It is plausible to conjecture that in this setting one still has {\varlimsup \frac 1n(t_1+\cdots+t_n) \leq T} almost surely. And indeed, it turns out that a statement very close to this can be proved.

Let {\mathop{\mathcal F}_n} be an increasing sequence of {\sigma}-algebras and let {t_n} be a sequence of {{\mathbb N}}-valued random variables such that {t_n} is {\mathop{\mathcal F}_n}-measurable. Suppose that there is {p\colon {\mathbb N} \rightarrow [0,1]} such that

\displaystyle  \mathop{\mathbb P}[t_n=t \mid \mathcal{F}_{n-1}] \leq p(t) \ \ \ \ \ (1)

and moreover,

\displaystyle   T := \sum_{t=1}^\infty t \cdot p(t) < \infty. \ \ \ \ \ (2)

The following result is proved by modifying one of the standard proofs of the strong law of large numbers, which can be found, for example, in Billingsley’s book “Probability and Measure”, where it is Theorem 22.1. There is also a discussion on Terry Tao’s blog, which emphasises the role of the main tools in this proof: the moment method and truncation.

Proposition 1 If {t_n} is any sequence of random variables satisfying (1) and (2), then {\varlimsup_{n\rightarrow\infty} \frac 1n \sum_{k=1}^n t_k \leq T} almost surely.

Proof: Consider the truncated random variables {s_n = t_n {\mathbf{1}}_{[t_n\leq n]}}, and note that by (1) and (2) we have

\displaystyle  0 \leq s_n \leq t_n \Rightarrow \mathop{\mathbb E}[s_n \mid \mathcal{F}_{n-1}] \leq T.

Now consider the random variables {r_n = s_n + T - \mathop{\mathbb E}[s_n \mid \mathcal{F}_{n-1}]}. Note that {r_n} is {\mathcal{F}_n}-measurable, is non-negative (because {s_n\geq 0} and {\mathop{\mathbb E}[s_n\mid\mathcal{F}_{n-1}]\leq T}), and moreover

\displaystyle  \mathop{\mathbb E}[r_n \mid \mathcal{F}_{n-1}] = T \text{ for all } n. \ \ \ \ \ (3)

The most important consequence of this is that the sequence {r_n} is uncorrelated — that is, {\mathop{\mathbb E}[r_ir_j] = \mathop{\mathbb E}[r_i]\mathop{\mathbb E}[r_j] = T^2} for all {i\neq j} (for {i<j}, condition on {\mathcal{F}_{j-1}} and use (3) together with the tower property), which in turn implies that {\mathop{\mathbb E}[(r_i - T)(r_j-T)] = 0}. This means that the variance is additive for the sequence {r_n}:

\displaystyle  \begin{aligned} \mathop{\mathbb E}\Big[ \Big(\sum_{k=1}^n (r_k -T)\Big)^2\Big] &= \sum_{k=1}^n \mathop{\mathbb E}[(r_k -T)^2] + 2\sum_{1\leq i<j\leq n} \mathop{\mathbb E}[(r_i-T)(r_j-T)] \\ &= \sum_{k=1}^n \mathop{\mathbb E}[(r_k -T)^2]. \end{aligned}

Let {X_n = \frac 1n (r_1 + \cdots + r_n) - T}, so that {\mathop{\mathbb E}[X_n] = 0}. We can estimate {\mathop{\mathbb E}[X_n^2]} using the previous computation:

\displaystyle  \mathop{\mathbb E}[X_n^2] = \frac 1{n^2} \sum_{k=1}^n \mathop{\mathbb E}[(r_k-T)^2] = \frac 1{n^2} \sum_{k=1}^n (\mathop{\mathbb E}[r_k^2] - T^2) \leq \frac 1{n^2} \sum_{k=1}^n \mathop{\mathbb E}[r_k^2]. \ \ \ \ \ (4)

Recall that {p(t)} is the sequence bounding {\mathop{\mathbb P}[t_n = t \mid \mathop{\mathcal F}_{n-1}]}, and let {\bar t} be such that {\sum_{t\geq \bar t} p(t) \leq 1}. Let {Y} be a random variable taking the value {t} with probability {p(t)} for each {t\geq \bar t} (and the value 0 with the remaining probability). By the definition of {s_n} and {r_n}, we have {0\leq r_n \leq s_n + T \leq n+T}, {\mathop{\mathbb E}[s_k]\leq T}, and {\mathop{\mathbb P}[s_k = t] \leq p(t)} for every {1\leq t\leq k}, thus

\displaystyle  \mathop{\mathbb E}[r_k^2] \leq \mathop{\mathbb E}[(s_k+T)^2] \leq \sum_{t=1}^{k} t^2 p(t) + 3T^2 \leq C + \mathop{\mathbb E}[Y^2 {\mathbf{1}}_{[Y\leq k+T]}]

for some fixed constant {C}. Note that the final expression is non-decreasing in {k}, and so together with (4) we have

\displaystyle  \mathop{\mathbb E}[X_n^2] \leq \frac 1n \left( C' + \mathop{\mathbb E}[Y^2 {\mathbf{1}}_{[Y\leq n+T]}] \right), \ \ \ \ \ (5)

where again {C'} is a fixed constant. Fixing {\varepsilon>0} and using (5) in Chebyshev’s inequality yields

\displaystyle  \mathop{\mathbb P}[|X_n| \geq \varepsilon] \leq \frac 1{\varepsilon^2 n} \left( C' + \mathop{\mathbb E}[Y^2 {\mathbf{1}}_{[Y\leq n+T]}] \right). \ \ \ \ \ (6)

Now let {n_k} be a sequence of integers such that {\lim_k \frac{n_{k+1}}{n_k} =: \alpha > 1}. We see that (6) yields

\displaystyle  \begin{aligned} \sum_{k=1}^\infty \mathop{\mathbb P}[|X_{n_k}| \geq \varepsilon] &\leq \frac 1{\varepsilon^2} \sum_{k=1}^\infty n_k^{-1}\left( C' + \mathop{\mathbb E}[Y^2 {\mathbf{1}}_{[Y\leq n_k+T]}]\right) \\ &\leq C'' + \frac 1{\varepsilon^2} \mathop{\mathbb E} \left[ Y^2 \sum_{k=1}^\infty n_k^{-1} {\mathbf{1}}_{[Y\leq n_k+T]}\right], \end{aligned} \ \ \ \ \ (7)

where {C''<\infty} depends on {\alpha} and on {\varepsilon}. Observe that

\displaystyle  \sum_{k=1}^\infty n_k^{-1} {\mathbf{1}}_{[Y\leq n_k +T]} \leq \sum_{k=k_0(Y)}^\infty n_k^{-1},

where {k_0(Y)} is the minimal value of {k} for which {n_k\geq Y-T}. Because {n_k} grows like {\alpha^k}, the sum is bounded above by a constant times {n_{k_0(Y)}^{-1}}, so that together with (7) we have

\displaystyle  \sum_{k=1}^\infty \mathop{\mathbb P}[|X_{n_k}| \geq \varepsilon] \leq C'' + C''' \mathop{\mathbb E}[Y] < \infty,

where finiteness of {\mathop{\mathbb E}[Y]} is the crucial place where we use (2).

Using the first Borel-Cantelli lemma and taking an intersection over all rational {\varepsilon>0} gives

\displaystyle  \lim_{k\rightarrow\infty} \frac 1{n_k} \sum_{j=1}^{n_k} r_j = T \text{ a.s.} \ \ \ \ \ (8)

Let {Z_n = \sum_{j=1}^n r_j}. Because {r_j\geq 0} we have {Z_{n_k} \leq Z_n \leq Z_{n_{k+1}}} for all {n_k\leq n\leq n_{k+1}}, and in particular

\displaystyle  \frac {n_k}{n_{k+1}} \frac{Z_{n_k}}{n_k} \leq \frac{Z_n}{n} \leq \frac{n_{k+1}}{n_k} \frac{Z_{n_{k+1}}}{n_{k+1}}.

Taking the limit and using (8) gives

\displaystyle  \frac 1\alpha T \leq \varliminf_{n\rightarrow\infty} \frac 1n Z_n \leq \varlimsup_{n\rightarrow\infty} \frac 1n Z_n \leq \alpha T \text{ a.s.}

Taking an intersection over all rational {\alpha>1} gives

\displaystyle  \lim_{n\rightarrow\infty} \frac 1n \sum_{k=1}^n r_k = T \text{ a.s.} \ \ \ \ \ (9)

Finally, we recall from the definition of {s_n} and {r_n} that {t_n\leq r_n} whenever {t_n\leq n}. In particular, we may observe that

\displaystyle  \sum_{n=1}^\infty \mathop{\mathbb P}[t_n > r_n] \leq \sum_{n=1}^\infty \mathop{\mathbb P}[t_n > n] \leq \sum_{n=1}^\infty \sum_{t>n} p(t) = \sum_{t=1}^\infty (t-1)p(t) \leq T

and apply Borel-Cantelli again to deduce that with probability one, {t_n > r_n} for at most finitely many values of {n}. In particular, (9) implies that

\displaystyle  \varlimsup_{n\rightarrow\infty} \frac 1n \sum_{k=1}^n t_k \leq T \text{ a.s.} \ \ \ \ \ (10)

which completes the proof of Proposition 1. \Box
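To illustrate Proposition 1, here is a small Python simulation with a dependent sequence of my own devising: {t_n} is geometric with success probability {3/4} or {1/2} depending on the parity of {t_{n-1}}, so that {\mathop{\mathbb P}[t_n=t\mid \mathcal{F}_{n-1}]\leq 2^{1-t} =: p(t)} and {T = \sum_t t\, 2^{1-t} = 4}.

```python
import random

random.seed(0)

def sample_geometric(q):
    """Sample t in {1,2,3,...} with P[t] = q(1-q)^(t-1)."""
    t = 1
    while random.random() > q:
        t += 1
    return t

# A dependent sequence: the parameter of t_n depends on t_{n-1}, but is always
# at least 1/2, so P[t_n = t | past] <= (1/2)^(t-1) = p(t) and T = sum t*p(t) = 4.
N = 100_000
t_prev, running_sum = 1, 0
for n in range(1, N + 1):
    q = 0.75 if t_prev % 2 else 0.5
    t_prev = sample_geometric(q)
    running_sum += t_prev
    if n % 25_000 == 0:
        print(n, running_sum / n)   # the averages stay below T = 4, as the proposition predicts
```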

Posted in statistical laws, theorems

Equidistribution for random rotations

Two very different types of dynamical behaviour are illustrated by a pair of very well-known examples on the circle: the doubling map and an irrational rotation. On the unit circle in {{\mathbb C}}, the doubling map is given by {z\mapsto z^2}, while an irrational rotation is given by {z\mapsto e^{2\pi i\theta}z} for some irrational {\theta}.

Lebesgue measure (arc length) is invariant for both transformations. For the doubling map, it is just one of many invariant measures; for an irrational rotation, it turns out to be the only invariant measure. We say that the doubling map exhibits hyperbolic behaviour, while the irrational rotation exhibits elliptic behaviour.

Systems with hyperbolicity have many invariant measures, as we saw in a series of previous posts. The goal of this post is to recall a proof that the opposite situation is true for an irrational rotation, and that in particular every orbit equidistributes with respect to Lebesgue measure; then we consider orbits generated by random rotations, where instead of rotating by a fixed angle {\theta}, we rotate by either {\theta_1} or {\theta_2}, with the choice of which to use being made at each time step by flipping a coin.

1. Invariant measures via {C(X)}

First we recall some basic facts from ergodic theory for topological dynamical systems. Given a compact metric space {X} and a continuous map {f\colon X\rightarrow X}, let {\mathcal{M}} denote the space of Borel probability measures on {X}. Writing {C(X)} for the space of all continuous functions {X\rightarrow {\mathbb C}}, recall that {C(X)^*} is the space of all continuous linear functionals {C(X)\rightarrow {\mathbb C}}. Then {C(X)^*} is (isomorphic to) the space of finite complex Borel measures on {X}. (This last assertion uses the fact that {X} is a compact metric space and combines various results from “Linear Operators” by Dunford and Schwartz, but a more precise reference will have to wait until I have the book available to look at.)

Using this fact together with the polar decomposition for finite complex Borel measures, we have the following: for every {L\in C(X)^*}, there is {\mu\in \mathcal{M}} and a measurable function {\theta\colon X\rightarrow {\mathbb R}} such that

\displaystyle  L(\varphi) = \|L\| \int \varphi e^{i\theta} \,d\mu \text{ for all } \varphi\in C(X). \ \ \ \ \ (1)

Note that although {C(X)^*} is endowed with the operator norm, we will usually think of it as a topological vector space with the weak* topology. Thus {\mathcal{M}} embeds naturally into {C(X)^*}, and (1) shows that every element of {C(X)^*} can be described in a canonical way in terms of {\mathcal{M}}.

Let {P\subset C(X)} be a countable set whose span is dense in {C(X)}. Then to every {L\in C(X)^*} we can associate the sequence {\Phi_P(L) = \{ L(p) \mid p\in P\} \subset {\mathbb C}^P}, where depending on the context we may index using either {{\mathbb N}} or {{\mathbb Z}}. If {P} is bounded then {\Phi_P(L)\in \ell^\infty} for every {L\in C(X)^*}, and so such a {P} defines a linear map {\Phi_P\colon C(X)^*\rightarrow \ell^\infty}.

Because {L} is determined by the values {L(p)} (by linearity and continuity of {L}, and density of the span of {P}), the map {\Phi_P} is 1-1. In particular, it is a linear isomorphism onto its image, which we denote by

\displaystyle  V_P := \Phi_P(C(X)^*) \subset \ell^\infty

Note that in general {V_P} is a proper subset of {\ell^\infty}: for the circle with {P = \{z^n\}} (the choice we use in the next section), not every bounded sequence arises as the sequence of Fourier coefficients of a complex Borel measure.

It is straightforward to see that {\Phi_P} is continuous, and its inverse is also continuous on {V_P}. Thus we can translate questions about {C(X)^*}, and in particular about {\mathcal{M}}, into questions about {\ell^\infty}.

Remark 1 It is a nontrivial problem to determine which elements of {\ell^\infty} correspond to elements of {C(X)^*}, and also to determine which of those sequences correspond to actual measures (elements of {\mathcal{M}}). We will not need to address either of these problems here.

The action {f\colon X\rightarrow X} induces an action on {C(X)} by {\varphi\mapsto \varphi\circ f}, and hence on {C(X)^*} by duality. This action {f_*\colon C(X)^*\rightarrow C(X)^*} is given by

\displaystyle  (f_*L)(\varphi) = L(\varphi\circ f). \ \ \ \ \ (2)

In particular, {f} also induces an action {f_*\colon \mathcal{M}\rightarrow\mathcal{M}} by

\displaystyle  \int\varphi\,d(f_*\mu) = \int (\varphi \circ f)\,d\mu. \ \ \ \ \ (3)

A measure {\mu} is {f}-invariant iff it is a fixed point of {f_*}. Let {f_P} be the action induced by {f} on {V_P\subset \ell^\infty}; that is,

\displaystyle  f_P(\Phi_P(\mu)) = \Phi_P(f_*\mu). \ \ \ \ \ (4)

If {P} can be chosen so that {f_P} takes a particularly nice form, then this can be used to understand what invariant measures {f} has, and how empirical measures converge.

Let us say more clearly what is meant by convergence of empirical measures. Given {x\in X} and {n\in {\mathbb N}}, let {\mathcal{E}_n(x) = \frac 1n \sum_{k=0}^{n-1} \delta_{f^kx}} be the empirical measure along the orbit segment {x,f(x),\dots,f^{n-1}(x)}. Let {V(x)\subset \mathcal{M}} be the set of weak* accumulation points of the sequence {\mathcal{E}_n(x)}. By compactness of {\mathcal{M}}, the set {V(x)} is non-empty, and it is a standard exercise to show that every measure in {V(x)} is {f}-invariant.

In particular, if {(X,f)} is uniquely ergodic, then it only has one invariant measure {\mu}, and so {V(x) = \{\mu\}} for every {x\in X}. In this case we have {\mathcal{E}_n(x) \rightarrow \mu} for every {x}, and it is reasonable to ask how quickly this convergence occurs.

2. Irrational rotations

Now we specialise to the case of an irrational rotation. Let {X=S^1\subset {\mathbb C}} be the unit circle, fix {\theta\in {\mathbb R}} irrational, and let {f\colon X\rightarrow X} be given by {f(z) = e^{2\pi i\theta}z}. We will show that {\mu\in \mathcal{M}} is {f}-invariant iff it is Lebesgue measure, and then examine what happens in a broader setting.

Given {n\in {\mathbb Z}}, let {p_n(z) = z^n}, and let {P = \{p_n \mid n\in {\mathbb Z}\}}. Then the span of {P} contains all functions {S^1\rightarrow {\mathbb C}} that are polynomials in {z} and {\bar{z}}, thus it is a subalgebra of {C(S^1)} that contains the constant functions, separates points, and is closed under complex conjugation. By the Stone–Weierstrass theorem, this span is dense in {C(S^1)}, and since {P} is bounded the discussion from above gives an isomorphism {\Phi\colon C(S^1)^* \rightarrow V\subset \ell^\infty}, where we suppress {P} in the notation. This isomorphism is given by

\displaystyle  \Phi(L)_n = L(p_n) \ \ \ \ \ (5)

for a general {L\in C(S^1)^*}, and for {\mu\in \mathcal{M}} we write

\displaystyle  \Phi(\mu)_n = \int_{S^1} z^n \,d\mu(z). \ \ \ \ \ (6)

The sequence {\Phi(\mu)} is the sequence of Fourier coefficients associated to the measure {\mu}. The choice of {p_n} means that the action {f} induces on {V\subset \ell^\infty} takes a simple form: the Fourier coefficients of {L} and {f_*L} are related by

\displaystyle  \Phi(f_*L)_n = (f_*L)(z^n) = L((f(z))^n) = L\left(e^{2\pi i \theta n} z^n\right) = e^{2\pi i \theta n} \Phi(L)_n.

Thus if {\mu} is invariant, we have {\Phi(\mu)_n = e^{2\pi i \theta n} \Phi(\mu)_n} for all {n\in{\mathbb Z}}. Because {\theta} is irrational, we have {e^{2\pi i\theta n} \neq 1} for all {n\neq 0}, and so {\Phi(\mu)_n=0} for all {n\neq 0}. Thus the only non-zero Fourier coefficient is {\Phi(\mu)_0 = \int {\mathbf{1}} \,d\mu(z) = 1}. Because {\Phi} is an isomorphism between {C(S^1)^*} and {V\subset \ell^\infty}, this shows that the only {\mu\in\mathcal{M}} with {f_*\mu = \mu} is Lebesgue measure (more generally, the only invariant elements of {C(S^1)^*} are the scalar multiples of Lebesgue). In particular, {f} is uniquely ergodic, with Lebesgue as the only invariant measure, and thus for any {z\in S^1}, the empirical measures {\mathcal{E}_n(z)} converge to Lebesgue.
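As a quick sanity check, here is a short Python computation of the nonzero Fourier coefficients of the empirical measures {\mathcal{E}_n(z)} for a rotation by {\theta=\sqrt 2} (any irrational value would do); unique ergodicity says they should tend to zero.

```python
import cmath
import math

theta = math.sqrt(2)                    # an irrational rotation number
rot = cmath.exp(2j * math.pi * theta)

z = 1 + 0j                              # starting point on the unit circle
sums = {k: 0j for k in (1, 2, 5)}       # running sums of z^k along the orbit
for n in range(1, 100_001):
    for k in sums:
        sums[k] += z ** k
    z *= rot
    if n in (1000, 10_000, 100_000):
        # k-th Fourier coefficient of the empirical measure E_n(z); tends to 0 for k != 0
        print(n, {k: abs(s) / n for k, s in sums.items()})
```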

3. Random rotations

Consider a sequence of points {z_1,z_2,z_3,\dots\in S^1}, and let {m_n\in \mathcal{M}} be the average of the point masses on the first {n} points of the sequence:

\displaystyle  m_n = \frac 1n \sum_{k=1}^n \delta_{z_k}, \qquad\qquad m_n(\varphi) = \frac 1n \sum_{k=1}^n \varphi(z_k). \ \ \ \ \ (7)

We say that the sequence {z_n} equidistributes if {m_n} converges to Lebesgue on {S^1} in the weak* topology.

The previous sections showed that if the points of the sequence are related by {z_{n+1} = e^{2\pi i \theta} z_n}, where {\theta} is irrational, then the sequence equidistributes. A natural generalisation is to ask what happens when the points {z_n} are related not by a fixed rotation, but by a randomly chosen rotation.

Here is one way of making this precise. Let {\Omega = \{1,2\}^{\mathbb N}} be the set of infinite sequences of 1s and 2s, and let {\mu} be the {\left(\frac 12, \frac 12\right)}-Bernoulli measure on {\Omega}, so that all sequences of length {n} are equally likely. Fix real numbers {\theta_1} and {\theta_2}, and fix {z_1\in S^1}. Given {\omega\in \Omega}, consider the sequence {z_n(\omega)} given by

\displaystyle  z_{n+1}(\omega) = e^{2\pi i \theta_{\omega_n}} z_n(\omega). \ \ \ \ \ (8)

Then one may ask whether or not {z_n(\omega)} equidistributes almost surely (that is, with probability 1 w.r.t. {\mu}). The remainder of this post will be dedicated to proving the following result.

Theorem 1 If either of {\theta_1} or {\theta_2} is irrational, then {z_n(\omega)} equidistributes almost surely.

Remark 2 The proof given here follows a paper by Lagarias and Soundararajan, to which I was referred by Lucia on MathOverflow.

Using Fourier coefficients as in the previous section, we have that {z_n(\omega)} equidistributes iff all the non-constant Fourier coefficients of {m_n(\omega)} converge to zero — that is, iff {\Phi(m_n(\omega))_k \rightarrow 0} as {n\rightarrow\infty} for all {k\neq 0}. This is Weyl’s criterion for equidistribution.

Fix a value of {k\neq 0}, which will be suppressed in the notation from now on. Write {a_n} for the absolute value of the {k}th Fourier coefficient of {m_n(\omega)}, and note that

\displaystyle  a_n := |\Phi(m_n(\omega))_k| = |m_n(\omega)(z^k)| = \frac 1n \left|\sum_{j=1}^n z_j(\omega)^k\right|. \ \ \ \ \ (9)
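Before going through the proof, here is a quick Python simulation of the scheme (8) and of {a_n}, with the illustrative choices {\theta_1=\sqrt 2}, {\theta_2=1/4} (so only one of the two is irrational) and {k=3}.

```python
import cmath
import math
import random

random.seed(1)
rot = [cmath.exp(2j * math.pi * t) for t in (math.sqrt(2), 0.25)]  # theta_1, theta_2

k = 3                                  # a fixed nonzero frequency
z, S = 1 + 0j, 0j                      # z_1 = 1 and the running sum of z_j^k
for n in range(1, 200_001):
    S += z ** k
    z *= rot[random.randrange(2)]      # rotate by theta_1 or theta_2, chosen by a coin flip
    if n in (2000, 20_000, 200_000):
        print(n, abs(S) / n)           # a_n, which should tend to zero
```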

The outline of the proof is as follows.

  1. Show that there is a constant {C} such that the expected value of {a_n^2} is at most {C/n}.
  2. Given {\delta>0}, show that there is a constant {C'} such that the probability that {a_n} exceeds {\delta} is at most {C'/n}.
  3. Find an exponentially increasing sequence {n_j\rightarrow\infty} such that if {a_{n_j}\leq \delta}, then {a_n\leq 2\delta} for every {n\in [n_j,n_{j+1}]}.
  4. Use the Borel–Cantelli lemma to deduce that with probability 1, {a_{n_j}} exceeds {\delta} only finitely often, hence {a_n} exceeds {2\delta} only finitely often. Since {\delta>0} was arbitrary this shows that {a_n\rightarrow 0}.

Step 1

Given {\xi\in {\mathbb R}}, let {\|\xi\|} denote the distance between {\xi} and the nearest integer.

Lemma 2

\displaystyle  \mathop{\mathbb E}_\mu[a_n^2] \leq \left( 1 + \frac 1{\|k\theta_1\|^2 + \|k\theta_2\|^2}\right) \frac 1n \ \ \ \ \ (10)

Proof: Let {y_1\in {\mathbb R}} be such that {z_1 = e^{2\pi i y_1}}, and define {y_n} recursively by {y_{n+1} = y_n + \theta_{\omega_n}}, so that {z_n = e^{2\pi i y_n}}.

\displaystyle  \begin{aligned} a_n^2 &= \frac 1{n^2} \left\lvert \sum_{j=1}^n z_j^k \right\rvert^2 = \frac 1{n^2} \left(\sum_{\ell=1}^n z_\ell^k\right) \left(\sum_{j=1}^n \overline{z_j^k}\right) \\ &= \frac 1{n^2} \left( n + \sum_{\ell\neq j} z_\ell^k z_j^{-k} \right) = \frac 1{n^2} \left( n + \sum_{\ell\neq j} e^{2\pi i k(y_\ell - y_j)} \right). \end{aligned}

Using the fact that {z+\bar{z} = 2\Re(z)}, we have

\displaystyle  \sum_{\ell \neq j} e^{2\pi i k(y_\ell - y_j)} = 2\Re \sum_{1\leq \ell < j \leq n} e^{2\pi i k(y_j - y_\ell)}.

If {j - \ell = r}, then

\displaystyle  \begin{aligned} \mathop{\mathbb E}_\mu[e^{2\pi i k (y_j - y_\ell)}] &= \frac 1{2^r} \sum_{\omega_\ell,\dots, \omega_{j-1}} e^{2\pi i k \sum_{m=\ell}^{j-1} \theta_{\omega_{m}}} = \frac 1{2^r} \sum \prod_{m=\ell}^{j-1} e^{2\pi i k \theta_{\omega_m}} \\ &= \left(\frac{e^{2\pi i k\theta_1} + e^{2\pi i k\theta_2}}{2}\right)^r, \end{aligned}

where the sums are over all {\omega_m\in \{1,2\}} for {\ell\leq m<j}. Since there are {n-r} pairs {\ell<j} with {j-\ell = r}, we have

\displaystyle  \mathop{\mathbb E}_\mu[a_n^2] = \frac 1{n^2}\left(n + 2\Re \sum_{r=1}^n (n-r)z^r\right), \ \ \ \ \ (11)

where {z=\frac 12(e^{2\pi i k\theta_1} + e^{2\pi i k\theta_2})}. Now

\displaystyle  \begin{aligned} \sum_{r=1}^n (n-r) z^r &= \sum_{r=1}^{n-1} z^r + \sum_{r=1}^{n-2} z^r + \cdots + \sum_{r=1}^1 z^r \\ &= \frac z{1-z}\sum_{s=1}^{n-1} (1-z^s) = \frac z{1-z} \left( n-\sum_{s=0}^{n-1} z^s \right), \end{aligned}

and since {|z|\leq 1} we have

\displaystyle  \left\lvert \sum_{r=1}^n (n-r) z^r \right\rvert \leq \frac{2n}{|1-z|}.

Together with (11), this gives

\displaystyle  \mathop{\mathbb E}_\mu[a_n^2] \leq \frac 1n \left( 1 + \frac 4{|1-z|}\right) = \frac 1n\left( 1 + \frac 8{|2 - e^{2\pi i k \theta_1} - e^{2\pi i k \theta_2}|} \right).

Using the fact that {|w|\geq |\Re w|} and that {\Re(1-e^{2\xi i}) = 1-\cos(2\xi) = 2\sin^2\xi}, we have

\displaystyle  \begin{aligned} |2 - e^{2\pi ik\theta_1} - e^{2\pi ik\theta_2}| &\geq 2(\sin^2 \pi k\theta_1 + \sin^2 \pi k\theta_2), \end{aligned}

and so

\displaystyle  \mathop{\mathbb E}_\mu[a_n^2] \leq \frac 1n\left(1 + \frac 4{\sin^2\pi k\theta_1 + \sin^2\pi k\theta_2}\right)

Finally, for {|\xi|\leq \frac 12} we have {|\sin \pi\xi| \geq |2\xi|}, which proves the bound in (10). \Box

Because one of {\theta_1,\theta_2} is irrational, the denominator in (10) is positive, and so {C := 1 + (\|k\theta_1\|^2 + \|k\theta_2\|^2)^{-1} < \infty}, which completes Step 1.

Step 2

Given {\delta>0}, we have

\displaystyle  \mathop{\mathbb E}[a_n^2] \geq \delta^2 \mathop{\mathbb P}(a_n\geq \delta),

and so by Lemma 2 we have

\displaystyle  \mathop{\mathbb P}(a_n\geq \delta) \leq \frac C{\delta^2 n}.

Putting {C' = C\delta^{-2}} completes step 2.

Step 3

Given {m\leq n\in {\mathbb N}}, we have

\displaystyle  na_n \leq ma_m + \left|\sum_{\ell=m+1}^{n} z_\ell^k\right| \leq ma_m + n-m,

and so

\displaystyle  a_n \leq \frac{n-m}n + \frac mn a_m.

In particular, if {n\leq (1+\delta)m} for some {\delta>0}, we have

\displaystyle  a_n \leq \delta + a_m. \ \ \ \ \ (12)

Let {n_j} be such that

\displaystyle  1+\frac \delta 2 \leq \frac{n_{j+1}}{n_j} \leq 1+\delta \ \ \ \ \ (13)

for all {j}. If {a_{n_j}\leq \delta}, then (12) implies that {a_n\leq 2\delta} for all {n\in [n_j, n_{j+1}]}.

Step 4

Let {E_j} be the event that {a_{n_j} \geq \delta}. By Step 2, we have {\mathop{\mathbb P}(E_j) \leq C'/n_j}, and because {n_j} increases exponentially in {j} by (13), we have {\sum_{j=1}^\infty \mathop{\mathbb P}(E_j) < \infty}. By the Borel–Cantelli lemma, this implies that with probability 1, there are only finitely many values of {j} for which {a_{n_j}\geq \delta}.

By the previous part, this in turn implies that there are only finitely many values of {n} for which {a_n\geq 2\delta}. In particular, {\varlimsup a_n \leq 2\delta}, and since {\delta>0} was arbitrary, we have {a_n\rightarrow 0}. Thus the sequence {z_n} satisfies Weyl’s criterion almost surely, which completes the proof of Theorem 1.

Posted in ergodic theory, examples, random dynamics, statistical laws

Fubini foiled

An important issue in hyperbolic dynamics is that of absolute continuity. Suppose some open set {U} in a smooth manifold {M} is foliated by a collection of smooth submanifolds {\{W_\alpha \mid \alpha\in A\}}, where {A} is some indexing set. (Here “smooth” may mean {C^1}, or {C^2}, or even more regularity depending on the context.)

Fixing a Riemannian metric on {M} gives a notion of volume {m} on the manifold {M}, as well as a notion of “leaf volume” {m_\alpha} on each {W_\alpha}, which is just the volume form coming from the induced metric. Then one wants to understand the relationship between {m} and {m_\alpha}.

The simplest example of this comes when {U=(0,1)^2} is the (open) unit square and {\{W_\alpha\}} is the foliation into horizontal lines, so that the index set is {A=(0,1)} and {W_\alpha = (0,1)\times \{\alpha\}}. Then Fubini’s theorem says that if {E} is any measurable set, then {m(E)} can be found by integrating the leaf volumes {m_\alpha(E)}. In particular, if {m(E)=0}, then {m_\alpha(E)=0} for almost every {\alpha}, and conversely, if {E} has full measure, then {m_\alpha(E)=1} for almost every {\alpha}.

For the sake of simplicity, let us continue to think of a foliation of the two-dimensional unit square by one-dimensional curves, and assume that these curves are graphs of smooth functions {\phi_\alpha\colon (0,1)\rightarrow (0,1)}. The story in other dimensions is similar.

Writing {\Phi(x,y) = (x,\phi_y(x))} gives a map {\Phi\colon (0,1)^2 \rightarrow (0,1)^2}, and we see that our foliation is the image under {\Phi} of the foliation by horizontal lines. Now we must make a very important point about regularity of foliations. The leaves of the foliation are smooth, and so {\Phi} depends smoothly on {x}. However, no assumption has been made so far on the transverse direction — that is, dependence on {y}. In particular, we cannot assume that {\Phi} depends smoothly on {y}.

If {\Phi} depends smoothly on both {x} and {y}, then we have a smooth foliation, not just smooth leaves, and in this case basic results from calculus show that a version of Fubini’s theorem holds:

\displaystyle  m(E) = \int_A \int_{W_\alpha} \rho_\alpha(z) {\mathbf{1}}_E(z) \,dm_\alpha(z)\,d\alpha,

where the density functions {\rho_\alpha\in L^1(W_\alpha,m_\alpha)} can be determined in terms of the derivative of {\Phi}.

It turns out that when the foliation is by local stable and unstable manifolds of a hyperbolic map, smoothness of the foliation is too much to ask for, but it is still possible to find density functions {\rho_\alpha} such that Fubini’s theorem holds, and the relationship between {m}-null sets and {m_\alpha}-null sets is as expected.

However, there are foliations where the leaves are smooth but the foliation as a whole does not even have the absolute continuity properties described in the previous paragraph. This includes a number of dynamically important examples, namely intermediate foliations (weak stable and unstable, centre directions in partial hyperbolicity) for certain systems. These take some time to set up and study. There is an elementary non-dynamical example first described by Katok; a similar example was described by Milnor (Math. Intelligencer 19(2), 1997), and it is this construction which I want to outline here, since it is elegant, does not require much machinery to understand, and neatly illustrates the possibility that absolute continuity may fail.

The example consists of two constructions: a set {E} and a foliation of the square by a collection of smooth curves {W_\alpha}. These are built in such a way that {E} has full Lebesgue measure ({m(E)=1}) but intersects each curve {W_\alpha} in at most a single point, so that in particular {m_\alpha(E)=0} for all {\alpha}.

1. The set

Given {p\in (0,1)}, consider the piecewise linear map {f_p\colon (0,1)\rightarrow (0,1]} given by

\displaystyle  f_p(y) = \begin{cases} \frac yp & y\leq p \\ \frac {y-p}{1-p} &y >p \end{cases}

It is not hard to see that {f_p} preserves Lebesgue measure {dy} and that {(f_p,dy)} is measure-theoretically conjugate to {(\sigma,\mu_p)}, where {\sigma} is the shift map on {\Sigma=\{0,1\}^{\mathbb N}} and {\mu_p} is the {(p,1-p)}-Bernoulli measure. The conjugacy {\pi_p\colon (0,1) \rightarrow \Sigma} is given by coding trajectories according to whether the {n}th iterate falls into {I_0 = (0,p]} or {I_1=(p,1)}.

Given {y}, let {\beta_n(y)} be the number of times that {y,f_p(y),\dots,f_p^{n-1}(y)} land in {I_1}. Then by the strong law of large numbers, Lebesgue almost every {y\in (0,1)} has {\frac 1n \beta_n(y) \rightarrow 1-p}. Let {E} be the set of all pairs {(p,y)} such that this is true; then {m(E) = 1} by Fubini’s theorem (applied to the foliation of the unit square into vertical lines). It only remains to construct a foliation that intersects {E} in a “strange” way.
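Here is a quick (floating-point, so only approximate) Python check that Lebesgue-typical points have {\frac 1n\beta_n(y)\rightarrow 1-p}; the particular values of {p} and the number of iterates are arbitrary choices for the illustration.

```python
import random

random.seed(2)

def f(p, y):
    """The piecewise linear map f_p from above."""
    return y / p if y <= p else (y - p) / (1 - p)

n = 200_000
for p in (0.3, 0.62, 0.85):
    y = random.random()                # a Lebesgue-typical starting point
    beta = 0
    for _ in range(n):
        beta += y > p                  # count visits to I_1 = (p, 1)
        y = f(p, y)
    print(p, beta / n)                 # approximately 1 - p
```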

2. The foliation

The curves {W_\alpha} will be chosen so that two points {(p,y)} and {(p',y')} lie on the same {W_\alpha} if and only if they are coded by the same sequence in {\Sigma} — that is, if and only if {\pi_p(y) = \pi_{p'}(y')}. Then because the limit of {\frac 1n\beta_n} must be constant on {W_\alpha} if it exists, we deduce that each {W_\alpha} intersects the set {E} at most once.

It remains to see that this condition defines smooth curves. Given {\alpha\in (0,1)}, let {a=\pi_{1/2}(\alpha)=a_1a_2\cdots} be the binary expansion of {\alpha}, so that {\alpha = \sum a_n 2^{-n}}. Now let {W_\alpha = \{(p,y) \mid \pi_p(y)=a\}}.

Fix {p\in (0,1)}, let {y_1=y = \phi_\alpha(p)} be such that {(p,y)\in W_\alpha}, and let {y_{n+1}=f_p(y_n)}, so that {y_n\in I_{a_n}} for all {n}. Thus

\displaystyle  y_{n+1} = \begin{cases} \frac {y_n}{p} & a_n=0 \\ \frac{y_n-p}{1-p} & a_n=1 \end{cases}

For convenience of notation, write {p(0)=p} and {p(1)=1-p}. Then the relationship between {y_{n+1}} and {y_n} can be written as

\displaystyle  y_{n+1} = \frac{y_n - a_n p(0)}{p(a_n)} \qquad\Rightarrow\qquad y_n = p(0)a_n + p(a_n)y_{n+1}.

This can be used to write {y=\phi_\alpha(p)} in terms of the sequence {a_n}. Indeed,

\displaystyle  \begin{aligned} \phi_\alpha(p) &=y = y_1 = p(0)a_1 + p(a_1)y_2 \\ &= p(0)a_1 + p(a_1)\big(p(0) a_2 + p(a_2)y_3\big) \\ &= p(0)\big(a_1 + p(a_1)a_2\big) + p(a_1)p(a_2)y_3 \\ &= p(0)\big(a_1 + p(a_1)a_2 + p(a_1)p(a_2)a_3\big) + p(a_1)p(a_2)p(a_3)y_4, \end{aligned}

and so on, so that writing {\psi_n(p) = p(a_1)p(a_2)\cdots p(a_{n-1})}, one has

\displaystyle  \phi_\alpha(p) = p(0) \sum_{n=1}^\infty \psi_n(p) a_n.

The summands are analytic functions of {p}, and the sum converges uniformly on each interval {[\epsilon,1-\epsilon]}, since for {p\in [\epsilon, 1-\epsilon]} we have {\psi_n(p) \leq (1-\epsilon)^{n-1}}. In fact, this uniform convergence extends to complex values of {p} in the disc of radius {\frac 12 - \epsilon} centred at {\frac 12}, and so by the Weierstrass uniform convergence theorem, the function {\phi_\alpha} is analytic in {p}.
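For a concrete illustration, here is a short Python computation (the names and the choice {\alpha = 2/3} are mine) of {\phi_\alpha(p)} from the series above, together with a check that the point {(p,\phi_\alpha(p))} really is coded by the sequence {a} under {f_p}.

```python
def phi(digits, p, terms=200):
    """phi_alpha(p) = p(0) * sum_{n>=1} psi_n(p) a_n, where p(0)=p, p(1)=1-p and
    psi_n(p) = p(a_1)...p(a_{n-1}); digits(n) returns the binary digit a_n."""
    total, psi = 0.0, 1.0              # psi_1 is the empty product
    for n in range(1, terms + 1):
        a_n = digits(n)
        total += psi * a_n
        psi *= p if a_n == 0 else 1 - p
    return p * total

def f(p, y):                           # the map f_p from section 1
    return y / p if y <= p else (y - p) / (1 - p)

digits = lambda n: n % 2               # alpha = 2/3 = 0.101010..._2
for p in (0.3, 0.5, 0.8):
    y = phi(digits, p)
    coding = []
    for n in range(1, 21):
        coding.append(1 if y > p else 0)
        y = f(p, y)
    print(p, phi(digits, p), coding == [digits(n) for n in range(1, 21)])
```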

As discussed above, {W_\alpha} is the graph of the analytic function {\phi_\alpha}, and each {W_\alpha} intersects {E} at most once, despite the fact that {E} has Lebesgue measure 1 in the unit square. This demonstrates the failure of absolute continuity.

Posted in ergodic theory, examples, smooth dynamics

Sharkovsky’s theorem

This post is based on a talk given by Keith Burns in the UH dynamical systems seminar yesterday, in which he presented a streamlined proof of Sharkovsky’s theorem due to him and Boris Hasselblatt.

1. Background and statement of the theorem

We recall the statement of Sharkovsky’s theorem. The setting is a continuous map {f\colon {\mathbb R}\rightarrow {\mathbb R}}, and one is interested in what dynamical properties {f} must necessarily have if we know that {f} has a periodic orbit of some given least period {p} — for example, does existence of such an orbit force the existence of periodic points with least period {q} for various values of {q}? Or does it force the map to have positive entropy?

The assumptions of continuity and one-dimensionality are vital: for discontinuous one-dimensional maps, or for continuous higher-dimensional maps, it is possible to have a point of least period {p} without having points of any other period or having positive entropy.

In one dimension, on the other hand, one can begin with the following simple observation: if {f} has a periodic point with least period 2, then there is {p} such that {f(p) > p=f^2(p)}, and so applying the intermediate value theorem to {f(x)-x}, the map must have a fixed point.

What if {f} has a periodic point {p} with least period 3? Suppose that {p<f(p)<f^2(p)} (the other configurations are similar), and let {I_1=[p,f(p)]} and {I_2=[f(p),f^2(p)]}. Once again using the intermediate value theorem, we see that {f(I_1) \supset I_2} and {f(I_2) \supset I_1 \cup I_2}. Thus we may consider the topological Markov chain {(X,\sigma)} given by the graph in Figure 1, and observe that it embeds into the dynamics of {f}, so that in particular, {f} has periodic points of every order.

Fig 1

Sharkovsky’s theorem generalises the above observations. Consider the following unorthodox total order on the positive integers:

\displaystyle  \begin{aligned} 3\prec 5&\prec 7\prec \cdots \\ 6\prec 10&\prec 14\prec \cdots \\ 12\prec 20&\prec 28\prec \cdots \\ &\vdots \\ \cdots \prec 16 &\prec 8 \prec 4 \prec 2 \prec 1 \end{aligned}

Then if {f} has a periodic point of least period {p}, it has a point of least period {q} for every {q\succ p}.
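The order is easy to work with computationally: write each number as {2^a m} with {m} odd, and compare. Here is a small Python sketch of such a comparison (the function name is mine).

```python
def sharkovsky_precedes(p, q):
    """True if p comes strictly before q in the order above (p ≺ q), so that by the
    theorem a least period p forces a least period q."""
    def split(n):                      # write n = 2^a * m with m odd
        a = 0
        while n % 2 == 0:
            n //= 2
            a += 1
        return a, n
    (ap, mp), (aq, mq) = split(p), split(q)
    if mp > 1 and mq > 1:              # neither is a power of two:
        return (ap, mp) < (aq, mq)     # smaller power of two first, then smaller odd part
    if mp > 1 and mq == 1:             # powers of two come after everything else ...
        return True
    if mp == 1 and mq == 1:            # ... and among themselves in decreasing order
        return p > q
    return False                       # a power of two never precedes a non-power of two

print(sharkovsky_precedes(3, 2024))    # True: period 3 forces every other period
print(sharkovsky_precedes(10, 12))     # True: 10 = 2*5 comes before 12 = 4*3
print(sharkovsky_precedes(8, 6))       # False: a period-8 orbit need not force period 6
```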

The paper by Burns and Hasselblatt includes a discussion of the history of the theorem and the various proofs that have been given. Here we outline the most streamlined proof available at the present time.

2. Sketch of the proof

Given two intervals {I,J\subset {\mathbb R}}, write {I\rightarrow J} if {f(I)\supset J}. As in the case of a period 3 orbit above, the key will be to construct a collection of intervals {I_j} (whose endpoints are certain points of the period {p} orbit) such that the covering relations {I_j\rightarrow I_k} yield a directed graph with periodic paths of all the desired periods.

To deduce existence of various periodic orbits for {f} from these covering relations, one makes the following observations.

  1. If {f^p(I)\supset I}, then {I} contains a fixed point of {f^p}.
  2. If {I_0 \rightarrow I_1 \rightarrow \cdots \rightarrow I_{k-1} \rightarrow I_0} is a loop of covering relations, then {f^k(I_0)\supset I_0}.
  3. The above two observations guarantee existence of {x\in I_0} with {f^k(x)=x}. To guarantee that {x} can be chosen with least period {k}, it suffices to know that some {I_j} has interior disjoint from all the others, and that the endpoints of this {I_j} do not follow the itinerary given by the sequence of covering relations above.

Detailed proofs of the above observations are in the Burns-Hasselblatt paper. With these tools in hand, the proof comes down to constructing the directed graph of covering relations referred to above.

Fig 2

To this end, consider a periodic orbit {P} with {p} points — the green points in Figure 2. These are permuted by {f}; some are moved to the left, some are moved to the right. Of all the points in the orbit that are moved to the right, let {z} be the rightmost, so that all points in the orbit to the right of {z} are moved to the left. Let {z'} be the next point to the right of {z}, and let {\Delta=[z,z']} as in the figure.

Now consider the intervals {I_k = f^k(\Delta)}. Say that a point {w\in P} switches sides of {\Delta} if either {w\leq z} and {f(w)\geq z'}, or {w\geq z'} and {f(w)\leq z}. Note that {I_0} has the following property: every point of {P\cap I_0} switches sides of {\Delta} (indeed, {P\cap I_0 = \{z,z'\}}, and {f(z)\geq z'} while {f(z')\leq z}). Let {I_m} be the first of the intervals {I_k} that does not have this property, and let {x\in I_m\cap P} be a point that does not switch sides of {\Delta} under the action of {f}. For concreteness we consider the case where {x} is to the left of {\Delta}, as in Figure 2.

By construction, all the points of {P} between {x} and {\Delta} are mapped to the right of {\Delta}. Of these, let {y} be the point that is mapped furthest to the right.

Write {R_y} for the interval {[y,\infty)}, and consider the intervals {J_k = I_k \cap R_y} for {0\leq k < m}, together with {J_m = [x,y]}. We claim that {J_0\rightarrow J_0}, {J_m\rightarrow J_0}, and {J_k \rightarrow J_{k+1}} for each {0\leq k<m}, so that we have the covering relations shown in Figure 3, which can be used to establish existence of periodic points with all least periods greater than {m}.

Fig 3

So why do the intervals {J_k} have the covering relations shown?

  1. The fact that {J_0\rightarrow J_0} follows from the fact that {J_0=I_0=\Delta} and the endpoints of {\Delta} switch sides of {\Delta}.
  2. The definition of {I_k} gives {I_k\rightarrow I_{k+1}}.
  3. Writing {I_k^R} for the part of {I_k} lying in or to the right of {\Delta} and {I_k^L} for the part lying in or to the left, we see that {I_k^R\rightarrow I_{k+1}^L} and {I_k^L\rightarrow I_{k+1}^R}; this is because for {0\leq k < m}, all points of {P\cap I_k} switch sides of {\Delta} under the action of {f}. (This is the definition of {m}.)
  4. Using similar notation for {J_k^L} and {J_k^R}, we see that {I_k^R = J_k^R} and {J_k^L \rightarrow I_k^R}; this is because {y} was chosen to be the point that maps furthest to the right.
  5. This establishes that {J_k^L \rightarrow J_k^R}, and similarly {J_k^R \rightarrow J_k^L}.

For {k=m} we see that {f(x)} lies to the left of {\Delta}, and {f(y)} to the right, so that {f(J_m)\supset \Delta=I_0}.

The extra piece of information in the final item of the list above shows that the graph in Figure 3 can be replaced by the one in Figure 4.

Fig 4

Let us sum up what we know. One of the following two cases holds. Case 1: There is some point {x} of the orbit such that {x} and {f(x)} both lie on the same side of {\Delta}. Case 2: Every point of the orbit is mapped to the other side of {\Delta} by {f}.

In Case 1, the arguments above show that for every {q>p} and every even {q<p}, there is an orbit of least period {q}, which is at least as strong as the conclusion in Sharkovsky’s theorem.

In Case 2, one may observe that since {f} permutes the elements of {P} and every element is moved from one side of {\Delta} to the other, {p} must be even. Moreover, by considering {P^L} and {P^R} (the points of {P} lying to the left and to the right of {\Delta}, respectively) separately, as invariant sets for {f^2}, we can repeat the argument inductively to complete the proof of the theorem.

Posted in theorems

Central Limit Theorem for dynamical systems using martingales

This post is based on notes from Matt Nicol’s talk at the UH summer school in dynamical systems. The goal is to present the ideas behind a proof of the central limit theorem for dynamical systems using martingale approximations.

1. Conditional expectation

Before we can define and use martingales, we must recall the definition of conditional expectation. Let {(\Omega,\mathop{\mathbb P})} be a probability space, with {\mathop{\mathbb P}} defined on a {\sigma}-algebra {\mathop{\mathcal B}}. Let {\mathop{\mathcal F}\subset \mathop{\mathcal B}} be a sub-{\sigma}-algebra of {\mathop{\mathcal B}}.

Example 1 Consider the doubling map {T\colon [0,1]\rightarrow [0,1]} given by {T(x) = 2x\pmod 1}. Let {\mathop{\mathbb P}} be Lebesgue measure, {\mathop{\mathcal B}} the Borel {\sigma}-algebra, and {\mathop{\mathcal F} = T^{-1}\mathop{\mathcal B} = \{T^{-1}(B) \mid B\in \mathop{\mathcal B}\}}. Then {\mathop{\mathcal F}} is a sub-{\sigma}-algebra of {\mathop{\mathcal B}}, consisting of precisely those sets in {\mathop{\mathcal B}} which are unions of preimage sets {T^{-1}(x)} — that is, those sets {F\in\mathop{\mathcal B}} for which a point {y\in [0,\frac 12]} is in {F} if and only if {y+\frac 12\in F}.

This example extends naturally to yield a decreasing sequence of {\sigma}-algebras

\displaystyle  \mathop{\mathcal B} \supset T^{-1}\mathop{\mathcal B} \supset T^{-2}\mathop{\mathcal B} \supset \cdots.

Given a sub-{\sigma}-algebra {\mathop{\mathcal F}\subset \mathop{\mathcal B}} and an integrable random variable {Y\colon \Omega\rightarrow {\mathbb R}} that is measurable with respect to {\mathop{\mathcal B}}, the conditional expectation of {Y} given {\mathop{\mathcal F}} is any random variable {Z} such that

  1. {Z} is {\mathop{\mathcal F}}-measurable (that is, {Z^{-1}(I)\in\mathop{\mathcal F}} for every interval {I\subset {\mathbb R}}), and
  2. {\int_A Z\,d\mathop{\mathbb P} = \int_A Y\,d\mathop{\mathbb P}} for every {A\in \mathop{\mathcal F}}.

It is not hard to show that these conditions determine {Z} up to almost sure equality (any two such {Z} agree for almost every {\omega\in\Omega}), and so the conditional expectation is well defined as a random variable. We write it as {\mathop{\mathbb E}[Y\mid\mathop{\mathcal F}]}.

A key property is that conditional expectation is linear: for every {\mathop{\mathcal F}\subset \mathop{\mathcal B}}, every {Y_1,Y_2\colon \Omega\rightarrow{\mathbb R}}, and every {a\in {\mathbb R}}, we have

\displaystyle  \mathop{\mathbb E}[aY_1 + Y_2\mid \mathop{\mathcal F}] = a\mathop{\mathbb E}[Y_1\mid\mathop{\mathcal F}] + \mathop{\mathbb E}[Y_2\mid\mathop{\mathcal F}].

Example 2 If {Y} is already {\mathop{\mathcal F}}-measurable, then {\mathop{\mathbb E}[Y\mid\mathop{\mathcal F}]=Y}.

Example 3 At the other extreme, if {Y} and {\mathop{\mathcal F}} are independent — that is, if {\mathop{\mathbb E}[Y{\mathbf{1}}_A]=\mathop{\mathbb E}[Y]\mathop{\mathbb P}[A]} for every {A\in\mathop{\mathcal F}} — then {\mathop{\mathbb E}[Y\mid\mathop{\mathcal F}]} is the constant function {\mathop{\mathbb E}[Y]{\mathbf{1}}}.

Example 4 Suppose {\{\Omega_i\mid i\in{\mathbb N}\}} is a countable partition of {\Omega} such that {\mathop{\mathbb P}[\Omega_i]>0} for every {i}. Let {\mathop{\mathcal F}=\sigma(\Omega_1,\Omega_2,\dots)} be the {\sigma}-algebra generated by the sets {\Omega_i}. Then

\displaystyle  \mathop{\mathbb E}[Y|\mathop{\mathcal F}] = \sum_{i\in{\mathbb N}} \frac{\mathop{\mathbb E}[Y{\mathbf{1}}_{\Omega_i}]}{\mathop{\mathbb P}[\Omega_i]} {\mathbf{1}}_{\Omega_i}.
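A quick Monte Carlo illustration of this formula (the partition of {(0,1]} into the intervals {(\frac 1{i+1},\frac 1i]} and the observable {Y(\omega)=\omega^2} are arbitrary choices for the example):

```python
import random
from collections import defaultdict

random.seed(3)

Y = lambda w: w * w
cell = lambda w: int(1 / w)            # index i of the cell (1/(i+1), 1/i] containing w

tot, cnt = defaultdict(float), defaultdict(int)
for _ in range(500_000):
    w = 1.0 - random.random()          # uniform on (0,1]
    i = cell(w)
    tot[i] += Y(w)
    cnt[i] += 1

# E[Y|F] is constant on each cell Omega_i, equal to E[Y 1_{Omega_i}] / P[Omega_i].
for i in (1, 2, 3):
    lo, hi = 1 / (i + 1), 1 / i
    exact = (hi ** 3 - lo ** 3) / (3 * (hi - lo))
    print(i, tot[i] / cnt[i], exact)   # empirical cell average vs the exact value
```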

2. Martingales

Now we can define martingales, which are a particular sort of stochastic process (sequence of random variables) with “enough independence” to generalise results from the IID case.

Definition 1 A sequence {Y_n\colon \Omega\rightarrow {\mathbb R}} of random variables is a martingale if

  1. {\mathop{\mathbb E}[|Y_n|]<\infty} for all {n};
  2. there is an increasing sequence of {\sigma}-algebras (a filtration) {\mathop{\mathcal F}_1\subset\mathop{\mathcal F}_2\subset\cdots \subset \mathop{\mathcal B}} such that {Y_n} is measurable with respect to {\mathop{\mathcal F}_n};
  3. the conditional expectations satisfy {\mathop{\mathbb E}[Y_{n+1}|\mathop{\mathcal F}_n] = Y_n}.

The first condition guarantees that everything is in {L^1}. If {\mathop{\mathcal F}_n} is taken to be the {\sigma}-algebra of events that are determined by the first {n} outcomes of a sequence of experiments, then the second condition states that {Y_n} only depends on those first {n} outcomes, while the third condition requires that if the first {n} outcomes are known, then the expected value of {Y_{n+1} - Y_n} is 0.

Example 5 Let {Y_n} be a sequence of fair coin flips — IID random variables taking the values {\pm1} with equal probability. Let {S_n=Y_1+\cdots+Y_n}. As suggested in the previous paragraph, let {\mathop{\mathcal F}_n=\sigma(Y_1,\dots,Y_n)} be the smallest {\sigma}-algebra with respect to which {Y_1,\dots,Y_n} are all measurable. (The sets in {\mathop{\mathcal F}_n} are precisely those sets in {\mathop{\mathcal B}} which are determined by knowing the values of {Y_1,\dots,Y_n}.)

It is easy to see that {S_n} satisfies the first two properties of a martingale, and for the third, we use linearity of expectation and the definition of {\mathop{\mathcal F}_n} to get

\displaystyle  \mathop{\mathbb E}[S_{n+1}|\mathop{\mathcal F}_n] = \mathop{\mathbb E}[Y_1 + \cdots + Y_n|\mathop{\mathcal F}_n] + \mathop{\mathbb E}[Y_{n+1}|\mathop{\mathcal F}_n] = S_n + 0 = S_n.

When {Y_n} is a sequence of random variables for which {S_n = Y_1+\cdots+Y_n} is a martingale, we say that the sequence {Y_n} is a martingale difference.

In the previous example the martingale property (the third condition) was a direct consequence of the fact that the random variables {Y_n = S_{n+1}-S_n} were IID. However, there are examples where the martingale differences are not IID.

Example 6 Polya’s urn is a stochastic process defined as follows. Consider an urn containing some number of red and blue balls. At each step, a single ball is drawn at random from the urn, and then returned to the urn, along with a new ball that matches the colour of the one drawn. Let {Y_n} be the fraction of the balls that are red after the {n}th iteration of this process.

Clearly the sequence of random variables {Y_n} is neither independent nor identically distributed. However, it is a martingale, as the following computation shows: suppose that at time {n} there are {p} red balls and {q} blue balls in the urn. (This knowledge represents knowing which element of {\mathop{\mathcal F}_n} we are in.) Then at time {n+1}, there will be {p+1} red balls with probability {\frac{p}{p+q}}, and {p} red balls with probability {\frac{q}{p+q}}. Either way, there will be {p+q+1} total balls, and so the expected fraction of red balls is

\displaystyle  \begin{aligned} \mathop{\mathbb E}[Y_{n+1}|\mathop{\mathcal F}_n] &= \frac{p}{p+q}\cdot\frac{p+1}{p+q+1} + \frac{q}{p+q}\cdot \frac{p}{p+q+1} \\ &= \frac{p(p+q+1)}{(p+q)(p+q+1)} = \frac{p}{p+q} = Y_n. \end{aligned}
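Here is a tiny Monte Carlo version of this computation (the urn state {p=3}, {q=2} is an arbitrary choice): conditionally on that state, the expected fraction of red balls after one more draw agrees with the current fraction.

```python
import random

random.seed(4)

p, q = 3, 2                            # current urn state: p red, q blue
trials = 200_000
total = 0.0
for _ in range(trials):
    if random.random() < p / (p + q):  # a red ball is drawn and duplicated
        total += (p + 1) / (p + q + 1)
    else:                              # a blue ball is drawn and duplicated
        total += p / (p + q + 1)
print(total / trials, p / (p + q))     # E[Y_{n+1} | F_n] ≈ Y_n = 3/5
```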

If we assume that the martingale differences are stationary (that is, the joint distribution of {(Y_{n+1},\dots,Y_{n+k})} does not depend on {n}) and ergodic, then we have the following central limit theorem for martingales, from a 1974 paper of McLeish (we follow some notes by S. Sethuraman for the statement).

Theorem 2 Let {Y_i} be a stationary ergodic sequence such that {\sigma^2 = \mathop{\mathbb E}[Y_i^2]<\infty} and {\mathop{\mathbb E}[Y_{n+1}|\mathop{\mathcal F}_n]=0}, where {\mathop{\mathcal F}_n} is the {\sigma}-algebra generated by {\{Y_i\mid 1\leq i\leq n\}}. Then {S_n = \sum_{i=1}^n Y_i} is a martingale, and {\frac{S_n}{\sigma\sqrt{n}}} converges in distribution to {N(0,1)}.

More sophisticated versions of this result are available, but this simple version will suffice for our needs.

3. Koopman operator and transfer operator

Now we want to apply Theorem 2 to a dynamical system {T\colon X\rightarrow X} with an ergodic invariant measure {\mu} by taking {Y_i=\varphi\circ T^i} for some observable {\varphi\colon X\rightarrow{\mathbb R}}.

To carry this out, we consider two operators on {L^2(\mu)}. First we consider the Koopman operator {U\colon \varphi\mapsto \varphi\circ T}. Then we define the transfer operator {\mathop{\mathcal P}} to be its {L^2} adjoint — that is,

\displaystyle  \int(\mathop{\mathcal P}\varphi)\psi\,d\mu = \int\varphi(\psi\circ T)\,d\mu \ \ \ \ \ (1)

for all {\varphi,\psi\in L^2}. The key result for our purposes is that the operators {\mathop{\mathcal P}} and {U} are one-sided inverses of each other.

Proposition 3 Given {\varphi\in L^2}, we have

  1. {\mathop{\mathcal P} U\varphi=\varphi};
  2. {U\mathop{\mathcal P}\varphi=\mathop{\mathbb E}[\varphi|T^{-1}\mathop{\mathcal B}]}, where {\mathop{\mathcal B}} is the {\sigma}-algebra on which {\mu} is defined.

Proof: For the first claim, we see that for all {\psi\in L^2} we have

\displaystyle  \int\psi\cdot (\mathop{\mathcal P} U\varphi)\,d\mu = \int (\psi\circ T)(\varphi\circ T)\,d\mu = \int\psi\varphi\,d\mu,

where the first equality uses the definition of {\mathop{\mathcal P}} and the second uses the fact that {\mu} is invariant. To prove the second claim, we first observe that given an interval {I\subset{\mathbb R}}, we have

\displaystyle  (U\mathop{\mathcal P} \varphi)^{-1}(I) = T^{-1}((\mathop{\mathcal P} \varphi)^{-1}(I)) \in T^{-1}\mathop{\mathcal B},

since {\mathop{\mathcal P}} maps {\mathop{\mathcal B}}-measurable functions to {\mathop{\mathcal B}}-measurable functions. This shows that {U\mathop{\mathcal P} \varphi} is {T^{-1}\mathop{\mathcal B}}-measurable, and it remains to show that

\displaystyle  \int_{T^{-1}B} U\mathop{\mathcal P}\varphi\,d\mu = \int_{T^{-1}B} \varphi\,d\mu \text{ for all }B\in \mathop{\mathcal B}. \ \ \ \ \ (2)

This follows from a similar computation to the one above: given {B\in \mathop{\mathcal B}} we have

\displaystyle  \begin{aligned} \int_{T^{-1}B} U\mathop{\mathcal P}\varphi\,d\mu &= \int ((\mathop{\mathcal P}\varphi)\circ T) \cdot {\mathbf{1}}_{T^{-1}B}\,d\mu = \int((\mathop{\mathcal P}\varphi)\circ T) ({\mathbf{1}}_B \circ T)\,d\mu \\ &= \int(\mathop{\mathcal P}\varphi) {\mathbf{1}}_B\,d\mu = \int\varphi\cdot ({\mathbf{1}}_B\circ T)\,d\mu = \int_{T^{-1}B} \varphi\,d\mu, \end{aligned}

which establishes (2) and completes the proof. \Box

We see from Proposition 3 that a function has zero conditional expectation with respect to {T^{-1}\mathop{\mathcal B}} if and only if it is in the kernel of {\mathop{\mathcal P}}. In particular, if {\mathop{\mathcal P} h = 0} then {\sum_{j=1}^n h\circ T^j} is a martingale; this will be a key tool in the next section.

Example 7 Let {X=[0,1]} and {T} be the doubling map. Let {\mu} be Lebesgue measure. For convenience of notation we consider the {L^2} space of complex-valued functions on {X}; the functions {\varphi_n\colon x\mapsto e^{2\pi inx}} form an orthonormal basis for this space. A simple calculation shows that

\displaystyle  U\varphi_n(x) = \varphi_n(2x) = e^{2\pi in(2x)} = \varphi_{2n}(x),

so {U\colon \varphi_n\mapsto \varphi_{2n}}. For the transfer operator we obtain {\mathop{\mathcal P}\colon \varphi_{2n}\mapsto \varphi_n}, while for odd values of {n} we have

\displaystyle  \begin{aligned} \mathop{\mathcal P}\varphi_n(x) &= \frac 12 \left(\varphi_n\left(\frac x2\right) + \varphi_n\left(\frac{1+x}{2}\right)\right) \\ &= \frac 12 \left( e^{\pi i n x} + e^{\pi i n x} e^{\pi i n}\right) = 0, \end{aligned}

since {e^{\pi i n} = -1} when {n} is odd.
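
These formulas are easy to verify numerically; the following short Python sketch (my own check, not part of the original example) evaluates {U} and {\mathop{\mathcal P}} on a grid using the pointwise formula for {\mathop{\mathcal P}} displayed above, and confirms that {U\varphi_3=\varphi_6}, {\mathop{\mathcal P}\varphi_6=\varphi_3}, and {\mathop{\mathcal P}\varphi_3=0}.

    import numpy as np

    x = np.linspace(0, 1, 1000, endpoint=False)

    def phi(n):
        return lambda t: np.exp(2j * np.pi * n * t)

    def transfer(f, t):
        # (P f)(t) = (f(t/2) + f((t+1)/2)) / 2 for the doubling map with Lebesgue measure
        return 0.5 * (f(t / 2) + f((t + 1) / 2))

    print(np.max(np.abs(phi(3)(2 * x % 1) - phi(6)(x))))    # U phi_3 = phi_6  (~ 0)
    print(np.max(np.abs(transfer(phi(6), x) - phi(3)(x))))  # P phi_6 = phi_3  (~ 0)
    print(np.max(np.abs(transfer(phi(3), x))))              # P phi_3 = 0      (~ 0)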

4. Martingale approximation and CLT

The machinery of the Koopman and transfer operators from the previous section can be used to apply the martingale central limit theorem to observations of dynamical systems via the technique of martingale approximation, which was introduced by M. Gordin in 1969.

The idea is that if {\mathop{\mathcal P}^n\varphi\rightarrow 0} quickly enough for functions {\varphi} with {\int\varphi\,d\mu=0}, then we can approximate the sequence {\varphi\circ T^n} with a martingale sequence {h\circ T^n}.

More precisely, suppose that the sequence {\|\mathop{\mathcal P}^n\varphi\|_2} is summable; then we can define an {L^2} function {g} by

\displaystyle  g = \sum_{n=1}^\infty \mathop{\mathcal P}^n\varphi. \ \ \ \ \ (3)

Let {h = \varphi + g - g\circ T\in L^2}. We claim that {S_nh = \sum_{j=1}^n h\circ T^j} is a martingale. Indeed,

\displaystyle  \mathop{\mathcal P} h = \mathop{\mathcal P}\varphi + \mathop{\mathcal P} \sum_{n\geq 1} \mathop{\mathcal P}^n\varphi - \mathop{\mathcal P} U \sum_{n\geq 1} \mathop{\mathcal P}^n\varphi,

and since {\mathop{\mathcal P} U} is the identity we see that the last term is just {\sum_{n\geq 1} \mathop{\mathcal P}^n \varphi = \mathop{\mathcal P}\varphi + \sum_{n\geq 2}\mathop{\mathcal P}^n\varphi}, which is exactly the sum of the first two terms; thus {\mathop{\mathcal P} h = 0}.

Proposition 3 now implies that {\mathop{\mathbb E}[h|T^{-1}\mathop{\mathcal B}]=0}, and we conclude that {S_nh} is a martingale, so by the martingale CLT {\frac{S_n h}{\sigma\sqrt{n}}} converges in distribution to {N(0,1)}, where {\sigma^2 = \mathop{\mathbb E}[h^2] = \int h^2\,d\mu}.
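
Here is a small Python sketch of the construction (my own illustration) in the setting of Example 7, with the hypothetical choice {\varphi(x)=\cos(4\pi x)}: then {\mathop{\mathcal P}\varphi=\cos(2\pi x)} and {\mathop{\mathcal P}^2\varphi=0}, so the series (3) defining {g} has a single term, and one can check on a grid that {h=\varphi+g-g\circ T} satisfies {\mathop{\mathcal P} h\approx 0}.

    import numpy as np

    x = np.linspace(0, 1, 2000, endpoint=False)

    def transfer(f, t):
        # transfer operator of the doubling map with respect to Lebesgue measure
        return 0.5 * (f(t / 2) + f((t + 1) / 2))

    phi = lambda t: np.cos(4 * np.pi * t)
    g = lambda t: np.cos(2 * np.pi * t)            # g = sum_{n>=1} P^n phi = P phi here
    h = lambda t: phi(t) + g(t) - g(2 * t % 1)     # h = phi + g - g o T

    print(np.max(np.abs(transfer(h, x))))          # ~ 0, so P h = 0 and S_n h is a martingale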

Now we want to apply this result to obtain information about {\varphi} itself, and in particular about {S_n\varphi = \sum_{j=1}^n \varphi\circ T^j}. We have {\varphi = h + g\circ T - g}, and so

\displaystyle  S_n\varphi = S_n h + \sum_{j=1}^n (g\circ T^{j+1} - g\circ T^j) = S_nh + g\circ T^{n+1} - g\circ T.

This yields

\displaystyle  \frac{S_n\varphi}{\sigma\sqrt n} = \frac{S_n h}{\sigma\sqrt n} + \frac{g\circ T^{n+1} - g\circ T}{\sigma\sqrt n},

and the last term goes to 0 in probability (its {L^2} norm is bounded by {2\|g\|_2}, independently of {n}, so dividing by {\sqrt n} sends it to 0 in {L^2}), which yields the central limit theorem for {S_n\varphi}.

Remark 1 There is a technical problem we have glossed over, which is that the sequence of {\sigma}-algebras {T^{-n}\mathop{\mathcal B}} is decreasing, not increasing as is required by the definition of a martingale. One solution to this is to pass to the natural extension {\hat T} and to consider the functions {h\circ \hat{T}^{-j}} and the {\sigma}-algebras {\hat{\mathop{\mathcal F}}_j = \hat{T}^j\mathop{\mathcal B}}. Another solution is to use reverse martingales, but we do not discuss this here.

Example 8 Let {X=[0,1]} and {T\colon X\rightarrow X} be an intermittent type (Manneville–Pomeau) map given by

\displaystyle  T(x) = \begin{cases} x(1 + (2x)^\alpha) & x\in [0,\frac 12), \\ 2x-1 & x\in [\frac 12,1], \end{cases}

where {0<\alpha<1} is a fixed parameter. It can be shown that {T} has a unique absolutely continuous invariant probability measure {\mu}, and that the transfer operator {\mathop{\mathcal P}} has the following contraction property: for every {\varphi\in L^2} with {\int \varphi\,d\mu=0}, there is a constant {C>0} such that {\|\mathop{\mathcal P}^n\varphi\|_2 \leq Cn^{-\gamma}}, where {\gamma = \frac 12(\frac 1\alpha -1)}.

When {\gamma>1}, that is, when {\alpha<\frac 13}, this shows that {\|\mathop{\mathcal P}^n\varphi\|_2} is summable, and consequently such observables {\varphi} satisfy the CLT by the above discussion.
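
For readers who want to experiment, here is a rough Python sketch (my own, and purely heuristic: floating-point orbits of a chaotic map are not true orbits) that iterates this map for the hypothetical choice {\alpha=1/4} and histograms a long orbit; the resulting density estimate is largest near 0, consistent with the invariant density of such maps blowing up at the neutral fixed point.

    import numpy as np

    alpha = 0.25   # a small parameter value, chosen so that the CLT discussion above applies

    def T(x):
        return x * (1 + (2 * x) ** alpha) if x < 0.5 else 2 * x - 1

    rng = np.random.default_rng(1)
    x = rng.random()
    orbit = np.empty(10**6)
    for i in range(orbit.size):
        orbit[i] = x
        x = T(x)

    # crude estimate of the invariant density: expect large values in the bins near 0
    hist, edges = np.histogram(orbit[1000:], bins=20, range=(0.0, 1.0), density=True)
    print(np.round(hist, 2))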


The Perron–Frobenius theorem and the Hilbert metric

In the last post, we introduced basic properties of convex cones and the Hilbert metric. In this post, we look at how these tools can be used to obtain an explicit estimate on the rate of convergence in the Perron–Frobenius theorem.

1. Perron–Frobenius theorem

We start by stating a version of the Perron–Frobenius theorem. Let {A} be a {d\times d} stochastic matrix, where here we use this to mean that the entries of {A} are non-negative, and every column sums to 1: {A_{ij}\in[0,1]} for all {i,j}, and {\sum_{i=1}^d A_{ij} = 1} for all {j}. Thus the columns of {A} are probability vectors.

Such a matrix {A} describes a weighted random walk on {d} sites: if the walker is presently at site {j}, then {A_{ij}} gives the probability of moving to site {i} at the next step. Thus if a probability vector {v} describes the walker's position, meaning that the walker is at site {j} with probability {v_j}, then {v\mapsto Av} gives the evolution of this distribution under one step of the random walk.

Now one version of the Perron–Frobenius theorem is as follows: If {A} is a stochastic matrix with {A>0} (that is, {A_{ij}>0} for all {i,j}), then there is exactly one probability vector {\pi} that is an eigenvector for {A}. Moreover, the eigenvalue associated to this eigenvector is 1, the eigenvalue 1 is simple, and all other eigenvalues have modulus {<1}. In particular, given any probability vector {v\in[0,\infty)^d} we have {A^n v \rightarrow \pi} exponentially quickly.

The eigenvector {\pi} is the stationary distribution for the random walk (Markov chain) given by {A}, and the convergence result states that any initial distribution converges to the stationary distribution under iteration of the process.
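
Here is a quick Python illustration (my own, with a hypothetical {3\times 3} positive stochastic matrix) of the convergence {A^nv\rightarrow\pi}: we compute {\pi} as the eigenvector for the eigenvalue 1 and watch the distance from {A^nv} to {\pi} shrink.

    import numpy as np

    # hypothetical positive column-stochastic matrix (each column sums to 1)
    A = np.array([[0.5, 0.3, 0.2],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.3, 0.5]])

    # stationary distribution: eigenvector for the eigenvalue 1, scaled to sum to 1
    w, V = np.linalg.eig(A)
    pi = np.real(V[:, np.argmax(np.real(w))])
    pi = pi / pi.sum()

    v = np.array([1.0, 0.0, 0.0])   # walker starts at site 1
    for n in (1, 5, 10, 20):
        print(n, np.max(np.abs(np.linalg.matrix_power(A, n) @ v - pi)))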

The assumption that {A>0} is quite strong: for the random walk, this says that the walker can get from any site to any other site in a single step. A more general condition is that {A} is primitive: that is, there exists {N\in{\mathbb N}} such that {A^N>0}. This says that there is a time {N} such that by taking {N} steps, the walker can get from any site to any other site. The same result as above holds in this case too.

In fact, much of the result survives in the even more general case when {A} is irreducible: for every {i,j} there exists {N} such that {(A^N)_{ij}>0}. This says that the walker can get from every site to every other site, but removes the assumption that there is a single time {N} that works for all sites. In this case there is still a unique stationary probability vector and the eigenvalue 1 is simple, but the remaining eigenvalues may have modulus 1, so {A^nv} need not converge to {\pi}. For example, consider a random walk on a chessboard, where the walker is allowed to move one square horizontally or vertically at each step. Then for a sufficiently large even value of {N}, the walker can get from any white square to any other white square, but to get to a black square requires an odd value of {N}; a walker started on a white square is on white squares at even times and on black squares at odd times, so its distribution oscillates rather than converging.

2. Rate of decay using convex cones

As stated above, the Perron–Frobenius theorem does not give any result on the rate with which {A^n v} converges to {\pi}. One way to give an estimate on this rate is to use convex cones and the Hilbert metric, which were discussed in the last post.

2.1. A cone and a metric

Let {{\mathop{\mathcal C}}} be the convex cone {[0,\infty)^d\subset{\mathbb R}^d}. We want an estimate on the diameter of {A({\mathop{\mathcal C}})} in the Hilbert metric {d_{\mathop{\mathcal C}}}. Recall that this metric is given by {d_{\mathop{\mathcal C}}(v,w)=\log(\beta/\alpha)}, where

\displaystyle  \begin{aligned} \beta &= \inf \{\mu>0 \mid \mu v - w \in {\mathop{\mathcal C}}\},\\ \alpha &= \sup\{\lambda>0\mid w-\lambda v\in {\mathop{\mathcal C}}\}. \end{aligned}

Another way of interpreting the cone {{\mathop{\mathcal C}}} is in terms of the partial order it places on {{\mathbb R}^d}, which is given by {v\preceq w \Leftrightarrow w-v\in {\mathop{\mathcal C}}\cup\{0\}}. Exchanging the roles of {v} and {w} in the definition above changes {\alpha} and {\beta} individually, but not the ratio {\beta/\alpha}, so we may equally characterise them as

\displaystyle  \alpha = \sup\{\lambda \mid \lambda w \preceq v\},\qquad \beta = \inf\{\mu \mid v\preceq \mu w\}.

In our present example, we see that the cone {{\mathop{\mathcal C}}=[0,\infty)^d} induces the partial order {v\preceq w \Leftrightarrow v_i\leq w_i\ \forall i}. Thus

\displaystyle  \alpha = \sup \{\lambda \mid \lambda w_i\leq v_i\ \forall i\} = \min_{1\leq i\leq d} \frac{v_i}{w_i}, \ \ \ \ \ (1)

and similarly {\beta = \max_{1\leq i\leq d} \frac{v_i}{w_i}}.
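
This formula is simple enough to code directly; the following Python helper (my own sketch, assuming strictly positive entries) computes {d_{\mathop{\mathcal C}}(v,w)=\log\beta-\log\alpha} for the cone {[0,\infty)^d} and illustrates that the metric is projective.

    import numpy as np

    def hilbert_metric(v, w):
        # d_C(v, w) = log( max_i(v_i / w_i) * max_i(w_i / v_i) ) for C = [0, infinity)^d,
        # assuming every entry of v and w is strictly positive
        v, w = np.asarray(v, float), np.asarray(w, float)
        return np.log(np.max(v / w) * np.max(w / v))

    print(hilbert_metric([1, 2, 3], [2, 4, 6]))   # 0.0: proportional vectors are distance 0
    print(hilbert_metric([1, 1], [1, 2]))         # log 2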

2.2. Diameter of {A({\mathop{\mathcal C}})}

Now we need to determine the diameter {\Delta} of {A({\mathop{\mathcal C}})} in the Hilbert metric {d_{\mathop{\mathcal C}}}. If {\Delta<\infty}, then the theorem of Birkhoff from the previous post will imply that {A} contracts {d_{\mathop{\mathcal C}}}-distances by a factor of {\tanh(\Delta/4)<1}.

Let {e_i} be the standard basis vectors in {{\mathbb R}^d}. Because {d_{\mathop{\mathcal C}}} is projective, we can compute {\Delta} by considering {d_{\mathop{\mathcal C}}(Av,Aw)} for {v,w\in{\mathop{\mathcal C}}} with {\sum_i v_i = \sum_j w_j = 1}. Using the triangle inequality, we have

\displaystyle  \begin{aligned} d_{\mathop{\mathcal C}}(Av,Aw) &= d_{\mathop{\mathcal C}}\left(A\sum v_i e_i, A\sum w_j e_j\right) = d_{\mathop{\mathcal C}}\left(\sum v_i (Ae_i), \sum w_j (Ae_j) \right) \\ &\leq \sum_{i,j} v_i w_j d_{\mathop{\mathcal C}}(Ae_i,Ae_j) \leq \max_{i,j} d_{\mathop{\mathcal C}}(Ae_i,Ae_j), \end{aligned}

so it suffices to consider {d_{\mathop{\mathcal C}}(Ae_i,Ae_j)} for {1\leq i,j\leq d}. But {Ae_i} is just the {i}th column of the matrix {A}, so writing {A=[v^1 \cdots v^d]}, where {v^i} is the {i}th column vector, we see that

\displaystyle  \Delta \leq \max_{i,j} d_{\mathop{\mathcal C}}(v^i,v^j). \ \ \ \ \ (2)

2.3. Contraction under multiplication by {A}

Now we have a very concrete procedure for estimating the amount of contraction in the {d_{\mathop{\mathcal C}}} metric under multiplication by {A}:

  1. estimate {\Delta} using (2) and the expression for {d_{\mathop{\mathcal C}}} in (1) and the discussion preceding it;
  2. get a contraction rate of {\tanh(\Delta/4)<1}.

From (1) and the discussion preceding it, the distance {d_{\mathop{\mathcal C}}(v^i,v^j)} is given as

\displaystyle  d_{\mathop{\mathcal C}}(v^i,v^j) = \log\beta - \log\alpha = \log\left(\max_{1\leq k\leq d} \frac{v^i_k}{v^j_k}\cdot \max_{1\leq k\leq d} \frac{v^j_k}{v^i_k}\right). \ \ \ \ \ (3)

Let {\Lambda = \tanh(\Delta/4)}. To write an explicit estimate for {\Lambda}, we use

\displaystyle  \Lambda = \frac{e^{\Delta/4} - e^{-\Delta/4}}{e^{\Delta/4} + e^{-\Delta/4}} = \frac{1 - e^{-\Delta/2}}{1+e^{-\Delta/2}} \leq \frac{1-s}{1+s}, \ \ \ \ \ (4)

where {s<1} is any estimate we can obtain satisfying {e^{-\Delta/2}\geq s}. From (3) and (2), we have

\displaystyle  e^{-\Delta/2} \geq \min_{i,j}\sqrt{ \min_k\left(\frac {v_k^i}{v_k^j}\right) \min_k \left(\frac{v_k^j}{v_k^i}\right)} =:s. \ \ \ \ \ (5)
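
The whole recipe fits in a few lines of Python; the sketch below (my own, for a matrix with strictly positive entries) computes {s} from the columns as in (5) and returns the resulting bound {(1-s)/(1+s)\geq\Lambda} from (4), applied to a hypothetical stochastic matrix.

    import numpy as np

    def contraction_rate(A):
        # bound Lambda = tanh(Delta/4) <= (1 - s)/(1 + s), where s is computed
        # from the columns of A as in (5); requires all entries of A to be positive
        A = np.asarray(A, float)
        s = 1.0
        for i in range(A.shape[1]):
            for j in range(i + 1, A.shape[1]):
                r = A[:, i] / A[:, j]
                s = min(s, np.sqrt(np.min(r) * np.min(1 / r)))
        return (1 - s) / (1 + s)

    A = np.array([[0.7, 0.4],
                  [0.3, 0.6]])       # a hypothetical positive stochastic matrix
    print(contraction_rate(A))       # ~ .30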

The estimates (4) and (5) allow us to bound {d_{\mathop{\mathcal C}}(A^nv, A^nw)}. However, we want to estimate {d(A^nv,A^nw)} in a more familiar metric, such as one coming from a norm. We can relate the two by observing that if {v,w\in(0,1]^d} have equal coordinate sums (so that both maxima below are at least 1), then

\displaystyle  \begin{aligned} d_{\mathop{\mathcal C}}(v,w) &= \log \max_k \left(\frac {v_k}{w_k}\right) + \log \max_k \left(\frac{w_k}{v_k}\right) \\ &\geq \max_k|\log v_k - \log w_k| \geq \max_k |v_k-w_k| = \|v-w\|_{L^\infty}, \end{aligned}

where the last inequality uses the fact that {\log } has derivative {\geq 1} on {(0,1]}. Since {A} maps the unit simplex to itself (because {A} is stochastic), we see that

\displaystyle  \|A^nv-A^nw\|_{L^\infty} \leq d_{\mathop{\mathcal C}}(A^nv,A^nw) \leq C\Lambda^n, \ \ \ \ \ (6)

where {\Lambda} is given by (4) and (5), and where we can take either {C=d_{\mathop{\mathcal C}}(v,w)} (when {v,w} have strictly positive entries, so that this is finite) or {C=\Delta/\Lambda} (since {d_{\mathop{\mathcal C}}(Av,Aw)\leq \Delta}), whichever gives the better bound. Since all norms on {{\mathbb R}^d} are equivalent, we have a similar bound in any norm.
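
As a quick sanity check of (6), here is a short Python sketch (my own, with a hypothetical positive stochastic matrix and two strictly positive probability vectors) comparing {\|A^nv-A^nw\|_{L^\infty}} with the bound {C\Lambda^n} for {C=d_{\mathop{\mathcal C}}(v,w)}.

    import numpy as np

    A = np.array([[0.7, 0.4],
                  [0.3, 0.6]])       # hypothetical positive stochastic matrix
    v = np.array([0.9, 0.1])
    w = np.array([0.1, 0.9])

    def dC(a, b):
        return np.log(np.max(a / b) * np.max(b / a))

    r = A[:, 0] / A[:, 1]
    s = np.sqrt(np.min(r) * np.min(1 / r))
    Lam = (1 - s) / (1 + s)          # Lambda from (4) and (5)
    C = dC(v, w)                     # finite since v and w are strictly positive

    for n in range(1, 6):
        An = np.linalg.matrix_power(A, n)
        # the first printed number stays below the second, as (6) predicts
        print(n, np.max(np.abs(An @ v - An @ w)), C * Lam ** n)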

3. Nonnegative matrices

The analysis in the previous section required {A} to be positive ({A_{ij}>0} for all {i,j}). A more general condition is that {A} is nonnegative and primitive: that is, {A_{ij}\geq 0} for all {i,j}, and moreover there exists {N} such that {A^N>0}.

If {A_{ij}=0} for some {i,j}, then it is easy to see from the calculations in the previous section that {A({\mathop{\mathcal C}})} has infinite diameter in the Hilbert metric, so the above arguments do not apply directly. However, they do apply to {A^N} when {A^N>0}, and so we fix {N} for which this is true, and we obtain {\Lambda<1} such that {d_{\mathop{\mathcal C}}(A^Nv, A^Nw) \leq \Lambda d_{\mathop{\mathcal C}}(v,w)} for all {v,w\in{\mathop{\mathcal C}}}.

Moreover, let {L\in{\mathbb R}} be such that {\|A^r\|\leq L} for all {0\leq r<N}. Then for any {n\in{\mathbb N}} we can write {n = kN+r} for some {k\in{\mathbb N}} and {0\leq r<N}, so that

\displaystyle  \|A^nv-A^nw\| = \|A^r(A^{kN}v - A^{kN}w)\| \leq LC \Lambda^k,

where {C} is as in (6). Thus we conclude that asymptotically, {A^nv} approaches the eigenvector {\pi} at an exponential rate, with per-step contraction factor roughly {\Lambda^{1/N}}.

To see this in action, consider a Markov chain with transition matrix

\displaystyle  A=\begin{pmatrix} \frac 12 & 1 \\ \frac 12 & 0\end{pmatrix}.

That is, from the first state the walker transitions to either state with probability 1/2, while from the second state the walker always returns to the first state. Since the transition from the second state to itself is forbidden, {A({\mathop{\mathcal C}})} has infinite diameter. However, the two-step transition matrix is

\displaystyle  A^2 = \begin{pmatrix} \frac 34 & \frac 12 \\ \frac 14 & \frac 12 \end{pmatrix},

for which we can compute

\displaystyle  s = \sqrt{\frac{1/4}{1/2} \cdot \frac{1/2}{3/4}} = \frac 1{\sqrt{3}} \quad \Rightarrow\quad \Lambda \leq \frac{\sqrt{3}-1}{\sqrt{3}+1} = 2-\sqrt{3} \approx .27.

Thus the estimate on {A^2} gives us a definite rate of contraction, which the estimate from {A} does not.
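
A short Python check of this computation (my own, just reproducing the arithmetic above):

    import numpy as np

    A = np.array([[0.5, 1.0],
                  [0.5, 0.0]])
    A2 = A @ A                          # [[0.75, 0.5], [0.25, 0.5]]

    r = A2[:, 0] / A2[:, 1]             # ratios of the two columns of A^2
    s = np.sqrt(np.min(r) * np.min(1 / r))
    print(s, (1 - s) / (1 + s))         # ~ .577 = 1/sqrt(3) and ~ .268 = 2 - sqrt(3)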

It can be useful to use the estimate on {A^N} even when {A>0}. For example, if we consider the Markov chain with transition matrix

\displaystyle  A = \begin{pmatrix} \frac 15 & \frac 9{10} \\ \frac 45 & \frac 1{10} \end{pmatrix},

then we have

\displaystyle  s = \sqrt{\frac{1/5}{9/10}\cdot \frac{1/10}{4/5}} = \sqrt{\frac 29 \cdot \frac 18} = \frac 16 \quad\Rightarrow\quad \Lambda\leq \frac 57 \approx .714

as the rate of contraction, while considering

\displaystyle  A^2 = \begin{pmatrix} \frac{19}{25} & \frac{27}{100} \\ \frac{6}{25} & \frac{73}{100} \end{pmatrix}

gives

\displaystyle  s = \sqrt{\frac{27/100}{19/25}\cdot\frac{6/25}{73/100}} \approx .3418 \quad\Rightarrow\quad \Lambda \leq \frac{.6582}{1.3418}\approx .4906 \approx (.7)^2,

a better estimate than we obtained from considering {A} itself.
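
The comparison is easy to reproduce in Python (my own sketch, using the same column-ratio formula as above): the one-step bound is about {.714}, while the two-step bound of about {.49} corresponds to roughly {.70} per step.

    import numpy as np

    def rate(M):
        # Lambda <= (1 - s)/(1 + s), with s computed from the two columns of M as in (5)
        r = M[:, 0] / M[:, 1]
        s = np.sqrt(np.min(r) * np.min(1 / r))
        return (1 - s) / (1 + s)

    A = np.array([[0.2, 0.9],
                  [0.8, 0.1]])
    print(rate(A))                 # ~ .714 = 5/7 (one-step estimate)
    print(rate(A @ A))             # ~ .491 (two-step estimate)
    print(np.sqrt(rate(A @ A)))    # ~ .700 per step, slightly better than .714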
