Lebesgue probability spaces, part I

In various areas of mathematics, classification theorems give a more or less complete understanding of what kinds of behaviour are possible. For example, in linear algebra we learn that up to isomorphism, {{\mathbb R}^n} is the only real vector space with dimension {n}, and every linear operator on a finite-dimensional vector space can be put into Jordan normal form via a change of coordinates; this means that many questions in linear algebra can be answered by understanding properties of Jordan normal form. A similar classification result is available in measure theory, but the preliminaries are a little more involved. In this and the next post I will describe the classification result for complete probability spaces, which gives conditions under which such a space is equivalent to the unit interval with Lebesgue measure.

The main original references for these results are a 1942 paper by Halmos and von Neumann [“Operator methods in classical mechanics. II”, Ann. of Math. (2) 43 (1942), 332–350] and a 1949 paper by Rokhlin [“On the fundamental ideas of measure theory”, Mat. Sbornik N.S. 25(67) (1949), 107–150; English translation in Amer. Math. Soc. Translation 1952 (1952), no. 71, 55 pp.]. I will refer to these as [HvN] and [Ro], respectively.

1. Equivalence of measure spaces

First we must establish exactly which class of measure spaces we work with, and under what conditions two measure spaces will be thought of as equivalent. Let {I} be the unit interval and {\mathop{\mathcal B},\mathop{\mathcal L}} the Borel and Lebesgue {\sigma}-algebras, respectively; let {m} be Lebesgue measure (on either of these). To avoid having to distinguish between {(I,\mathop{\mathcal B},m)} and {(I,\mathop{\mathcal L},m)}, let us agree to only work with complete measure spaces; this is no great loss, since given an arbitrary measure space {(X,\mathcal{A},\mu)} we can pass to its completion {(X,\overline{\mathcal{A}},\overline\mu)}.

The most obvious notion of isomorphism is that two complete measure spaces {(X,\mathcal{A},\mu)} and {(X',\mathcal{A}',\mu')} are isomorphic if there is a bijection {f\colon X\rightarrow X'} such that {f,f^{-1}} are measurable and {f_*\mu = \mu'}; that is, given {A\subset X} we have {A\in \mathcal{A}} if and only if {f(A) \in \mathcal{A}'}, and in this case {\mu(A) = \mu'(f(A))}.

In the end we want to loosen this definition a little bit. For example, consider the space {X = \{0,1\}^{\mathbb N}} of all infinite binary sequences, equipped with the Borel {\sigma}-algebra {\mathop{\mathcal B}} associated to the product topology (or if you prefer, the metric {d(x,y) = e^{-\min\{n \mid x_n \neq y_n\}}}). Let {\mu} be the {(\frac 12,\frac 12)}-Bernoulli measure on {(X,\mathop{\mathcal B})}; that is, for each {w\in \{0,1\}^n} the cylinder {[w] = \{x\in X \mid x_1 \cdots x_n = w\}} gets weight {\mu[w] = 2^{-n}}. Then there is a natural correspondence between the completion {(X,\overline{\mathop{\mathcal B}},\overline\mu)} and {(I,\mathop{\mathcal L},m)} given by

\displaystyle  \begin{aligned} f\colon X&\rightarrow I \\ x &\mapsto \sum_{n=1}^\infty x_n 2^{-n}. \end{aligned}

By looking at dyadic intervals {[\frac k{2^n}, \frac{k+1}{2^n}] \subset I} one can readily verify that {f_* \overline\mu = m}; however, {f} is not a bijection because for every {w\in \{0,1\}^n} we have {f(w01^\infty) = f(w10^\infty)}.
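As a quick sanity check (a numerical sketch with truncated sums, not anything needed for the argument), one can verify the collision {f(w01^\infty) = f(w10^\infty)} directly:

```python
# f(x) = sum_n x_n 2^{-n}, truncated at finitely many terms.
def f(x, terms=60):
    return sum(x[n] / 2 ** (n + 1) for n in range(min(len(x), terms)))

w = [1, 0, 1]                  # an arbitrary finite prefix
a = w + [0] + [1] * 60         # the sequence w 0 1 1 1 ...
b = w + [1] + [0] * 60         # the sequence w 1 0 0 0 ...
print(f(a), f(b))              # both equal 11/16 = 0.6875 up to truncation error
```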

The points at which {f} is non-injective form a {\mu}-null set (since there are only countably many of them), so from the point of view of measure theory, it is natural to disregard them. This motivates the following definition.

Definition 1 Two measure spaces {(X,\mathcal{A},\mu)} and {(X',\mathcal{A}',\mu')} are isomorphic mod 0 if there are measurable sets {E \subset X} and {E'\subset X'} such that {\mu(X\setminus E) = \mu'(X'\setminus E') = 0}, together with a bijection {f\colon E\rightarrow E'} such that {f,f^{-1}} are measurable and {f_*(\mu|_E) = \mu'|_{E'}}.

From now on we will be interested in the question of classifying complete measure spaces up to isomorphism mod 0. The example above suggests that {(I,\mathop{\mathcal L},m)} is a reasonable candidate for a `canonical’ complete measure space that many others are equivalent to, and we will see that this is indeed the case.

Notice that the total measure {\mu(X)} is clearly an invariant of isomorphism mod 0, and hence we restrict our attention to probability spaces, for which {\mu(X)=1}.

2. Separability, etc.

Let {(X,\mathcal{A},\mu)} be a probability space. We describe several related conditions that all give senses in which {\mathcal{A}} can be understood via countable objects.

The {\sigma}-algebra {\mathcal{A}} carries a natural pseudo-metric given by {\rho(A,B) = \mu(A\Delta B)}. Write {A\sim B} if {\rho(A,B)=0}; this is an equivalence relation on {\mathcal{A}}, and we write {\hat{\mathcal{A}}} for the space of equivalence classes. The function {\rho} induces a metric {\hat\rho} on {\hat{\mathcal{A}}} in the natural way, and we say that {(X,\mathcal{A},\mu)} is separable if the metric space {(\hat{\mathcal{A}},\hat\rho)} is separable; that is, if it has a countable dense subset.
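To get a feel for this metric, here is a small numerical sketch in {([0,1],\mathop{\mathcal L},m)}: intervals with dyadic endpoints (a countable family) approximate the arbitrarily chosen set {E=[0,1/\pi]} as well as we like in {\rho}, which is the first step toward seeing that this space is separable.

```python
from math import pi

# rho(E, F) = m(E Δ F); for E = [0, t] and F_n = [0, k/2^n] this is |t - k/2^n|.
t = 1 / pi
for n in (4, 8, 16, 32):
    k = round(t * 2 ** n)
    print(n, abs(t - k / 2 ** n))   # at most 2^{-(n+1)}, so rho(E, F_n) -> 0
```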

Another countability condition is this: call {\mathcal{A}} “countably generated” if there is a countable subset {\Gamma \subset \mathcal{A}} such that {\mathcal{A} = \sigma(\Gamma)} is the smallest {\sigma}-algebra containing {\Gamma}. We write (CG) for this property; for example, the Borel {\sigma}-algebra in {[0,1]} satisfies (CG) because we can take {\Gamma} to be the set of all intervals with rational endpoints. (In [HvN], such an {\mathcal{A}} is called “strictly separable”, but we avoid the word “separable” as we have already used it in connection with the metric space {(\hat{\mathcal{A}},\hat\rho)}.)

In and of itself, (CG) is not quite the right sort of property for our current discussion, because it does not hold when we pass to the completion; the Lebesgue {\sigma}-algebra {\mathop{\mathcal L}} is not countably generated (one can prove this using cardinality estimates). Let us say that {\mathcal{A}} satisfies property (CG0) (for “countably generated mod 0”) if there is a countably generated {\sigma}-algebra {\Sigma \subset \mathcal{A}} with the property that for every {E\in \mathcal{A}}, there is {F\in \Sigma} with {\mu(E\Delta F) = 0}. In other words, we have {\hat{\mathcal{A}} = \hat\Sigma}. Note that {\mathop{\mathcal L}} is countably generated mod 0 by taking {\Sigma = \mathop{\mathcal B}}. (In [HvN], such an {\mathcal{A}} is called “separable”; the same property is used in §2.1 of [Ro] with the label {(L')}, rendered in a font that I will not attempt to duplicate here.)

In fact, the approximation of {\mathop{\mathcal L}} by {\mathop{\mathcal B}} satisfies an extra condition. Let us write (CG0+) for the following condition on {\mathcal{A}}: there is a countably generated {\Sigma \subset \mathcal{A}} such that for every {E\in \mathcal{A}}, there is {F\in \Sigma} with {E\subset F} and {\mu(F\setminus E)=0}. This is satisfied by {\mathop{\mathcal L}} with {\Sigma = \mathop{\mathcal B}}, since every Lebesgue measurable set is contained in a Borel set with the same measure. (In [HvN], such an {\mathcal{A}} is called “properly separable”; the same property is used in §2.1 of [Ro] with the label {(L)}.)

The four properties introduced above are related as follows.

\displaystyle  \textbf{(CG)} \Rightarrow \textbf{(CG0+)} \Rightarrow \textbf{(CG0)} \Leftrightarrow \text{separable}

The first two implications are immediate, and their converses fail in general:

  • The Lebesgue {\sigma}-algebra {\mathop{\mathcal L}} satisfies (CG0+) but not (CG).
  • Let {\mathcal{A} = \{ A \subset [0,1] \mid A\in \mathop{\mathcal L}, \mu(A)=0 \text{ or } 1\}}. Then {\mathcal{A}} satisfies (CG0) but not (CG0+).

Now we prove that (CG0) and separability are equivalent. First note that if {\Gamma \subset \mathcal{A}} is a countable subset, then the algebra {\mathop{\mathcal F}} generated by {\Gamma} is also countable; in particular, {(X,\mathcal{A},\mu)} is separable if and only if there is a countable algebra {\mathop{\mathcal F}\subset \mathcal{A}} that is dense with respect to {\rho}, and similarly in the definition of (CG0) the generating set can be taken to be an algebra. To show equivalence of (CG0) and separability it suffices to show that given an algebra {\mathop{\mathcal F} \subset \mathcal{A}} and {E\in \mathcal{A}}, we have

\displaystyle  (E\in \overline{\mathop{\mathcal F}}) \Leftrightarrow (\text{there is } A\in \sigma(\mathop{\mathcal F}) \text{ with } \rho(E,A) = \mu(E\Delta A) = 0). \ \ \ \ \ (1)

First we prove {(\Leftarrow)} by proving that {\overline{\mathop{\mathcal F}}} is a {\sigma}-algebra, and hence contains {\sigma(\mathop{\mathcal F})}; this will show that (CG0) implies separability.

  • Closure under {{}^c}: if {E\in \overline{\mathop{\mathcal F}}} then there are {A_n\in \mathop{\mathcal F}} such that {\rho(A_n,E) \rightarrow 0}. Since {\rho(A_n^c, E^c) = \rho(A_n, E)} and {A_n^c\in \mathop{\mathcal F}} (since it is an algebra), this gives {E^c\in \overline{\mathop{\mathcal F}}}.
  • Closure under {\cup}: given {E_1,E_2,\dots \in \overline{\mathop{\mathcal F}}}, let {E = \bigcup_n E_n}. To show that {E \in \overline{\mathop{\mathcal F}}}, note that given any {\epsilon>0}, there are {A_n\in \mathop{\mathcal F}} such that {\rho(A_n,E_n) < \epsilon 2^{-n}}. Let {F_N = \bigcup_{n=1}^N E_n} and {B_N = \bigcup_{n=1}^N A_n}; note that

    \displaystyle  \rho(F_N,B_N) \leq \sum_{n=1}^N \rho(E_n,A_n) < \epsilon.

    Moreover by continuity from below we have {\lim \mu(F_N) = \mu(E)}, so {\lim \rho(E,F_N)=0}, and thus for sufficiently large {N} we have {\rho(E,B_N) \leq \rho(E,F_N) + \rho(F_N,B_N) < 2\epsilon}. This holds for all {\epsilon>0}, so {E\in \overline{\mathop{\mathcal F}}}.

Now we prove {(\Rightarrow)}, thus proving that {\overline{\mathop{\mathcal F}}} is “large enough” that separability implies (CG0). Given any {E\in \overline{\mathop{\mathcal F}}}, there are {A_n\in \mathop{\mathcal F}} such that {\rho(A_n,E)\leq 2^{-n}}. Let { A = \bigcap_{N\in {\mathbb N}} \bigcup_{n\geq N} A_n \in \sigma(\mathop{\mathcal F}). } We get

\displaystyle  \begin{aligned} \mu(A \cap E^c) &= \mu\big(\bigcap_N \bigcup_{n\geq N} (A_n \cap E^c)\big) = \lim_{N\rightarrow\infty} \mu\big( \bigcup_{n\geq N} (A_n \cap E^c) \big) \\ &\leq \lim_{N\rightarrow\infty} \sum_{n\geq N} \mu(A_n \cap E^c) \leq \lim_{N\rightarrow\infty} 2^{1-N} = 0, \end{aligned}

and similarly, {A^c \cap E = \bigcup_N \bigcap_{n\geq N} (A_n^c \cap E)}, which gives

\displaystyle  \mu(A^c \cap E) \leq \sum_N \mu\big( \bigcap_{n\geq N} A_n^c \cap E\big) \leq \sum_N \limsup_{n\rightarrow\infty} \mu(A_n^c \cap E) = 0.

Then {\rho(A,E) = 0}, which completes the proof of {(\Rightarrow)}.

The first half of the argument above (the {\Leftarrow} direction) appears in this MathOverflow answer to a question discussing the relationship between different notions of separability, which ultimately inspired this post. That answer (by Joel David Hamkins) also suggests one further notion of “countably generated”, distinct from all of the above; say that {(X,\mathcal{A},\mu)} satisfies (CCG) (for “completion of countably generated”) if there is a countably generated {\sigma}-algebra {\Sigma \subset \mathcal{A}} such that {\mathcal{A} \subset \overline{\Sigma}}, where {\overline{\Sigma}} is the completion of {\Sigma} with respect to the measure {\mu}. One quickly sees that

\displaystyle  \textbf{(CG)} \Rightarrow \textbf{(CCG)} \Rightarrow \textbf{(CG0)}.

Both reverse implications fail; the Lebesgue {\sigma}-algebra satisfies (CCG) but not {\textbf{(CG)}}, and an example satisfying separability (and hence (CG0)) but not (CCG) was given in that same MathOverflow answer (the example involves ordinal numbers and something called the “club filter”, which I will not go into here).

3. Abstract {\sigma}-algebras

It is worth looking at some of the previous arguments through a different lens, that will also appear next time when we discuss the classification problem.

Recall the space of equivalence classes {\hat{\mathcal{A}}} from earlier, where {A\sim B} means that {\mu(A\Delta B) = 0}. Although elements of {\hat{\mathcal{A}}} are not subsets of {X}, we can still speak of the “union” of two such elements by choosing representatives from the respective equivalence classes; that is, given {\hat A, \hat B\in \hat{\mathcal{A}}}, we choose representatives {A\in \hat A} and {B\in \hat B} (so {A,B\in \mathcal{A}}), and consider the “union” of {\hat A} and {\hat B } to be the equivalence class of {A\cup B}; write this as {\hat A\vee \hat B}. One can easily check that this is well-defined; if {A_1\sim A_2} and {B_1\sim B_2}, then {(A_1\cup B_1) \sim (A_2 \cup B_2)}.

This shows that {\cup} induces a binary operation {\vee} on the space {\hat{\mathcal{A}}}; similarly, {\cap} induces a binary operation {\wedge}, complementation {A\mapsto A^c} induces an operation {\hat A \mapsto \hat A'}, and set inclusion {A\subset B} induces a partial order {\hat A \leq \hat B}. These give {\hat{\mathcal{A}}} the structure of a Boolean algebra; say that {\Sigma} is an abstract Boolean algebra if it has a partial order {\leq}, binary operations {\vee}, {\wedge}, and a unary operation {'}, satisfying the same rules as inclusion, union, intersection, and complementation:

  1. {A \vee B} is the join of {A} and {B} (the minimal element such that {A,B \leq A\vee B}), and {A\wedge B} is the meet of {A} and {B} (the maximal element such that {A,B \geq A\wedge B});
  2. the distributive laws {A\vee (B\wedge C) = (A\vee B) \wedge (A\vee C)} and {A\wedge (B\vee C) = (A\wedge B) \vee (A\wedge C)} hold;
  3. there is a maximal element {X} whose complement {X'} is the minimal element;
  4. {A\wedge A' = X'} and {A\vee A' = X}.

For the form of this list I have followed this blog post by Terry Tao, which gives a good in-depth discussion of some other issues relating to concrete and abstract Boolean algebras and {\sigma}-algebras.

Exercise 1 Using the four axioms above, prove the following properties:

  • {A'} is the unique element satisfying (4) — that is, if {A\vee B =X} and {A\wedge B = X'}, then {B=A'};
  • {(A')' = A};
  • de Morgan’s laws: {(A\vee B)' = A' \wedge B'} and {(A\wedge B)' = A' \vee B'}.

If you get stuck, see Chapter IV, Lemma 1.2 in A Course in Universal Algebra by Burris and Sankappanavar.

In fact {\hat{\mathcal{A}}} inherits just a little bit more, since {\vee} (and hence {\wedge}) can be iterated countably many times. We add this as a fifth axiom, and say that an abstract Boolean algebra {\Sigma} is an abstract {\sigma}-algebra if in addition to (1)–(4) it satisfies

  5. any countable family {A_1,A_2,\dots \in \Sigma} has a least upper bound {\bigvee_n A_n} and a greatest lower bound {\bigwedge_n A_n}.

A measured abstract {\sigma}-algebra is a pair {(\Sigma,\mu)}, where {\Sigma} is an abstract {\sigma}-algebra and {\mu\colon \Sigma\rightarrow [0,\infty]} is a function satisfying the usual properties: {\mu(X')=0} and {\mu(\bigvee_n A_n) = \sum_n \mu(A_n)} whenever {A_i \wedge A_j = X'} for all {i\neq j}. (Note that {X'} is playing the role of {\emptyset}, but we avoid the latter notation to remind ourselves that elements of {\Sigma} do not need to be represented as subsets of some ambient space.)

The operations {\vee,\wedge,'} induce a binary operator {\Delta} on {\Sigma} by

\displaystyle  A \Delta B = (A\wedge B') \vee (B \wedge A'),

which is the abstract analogue of the symmetric difference, and so a measured abstract {\sigma}-algebra carries a pseudo-metric {\rho} defined by

\displaystyle  \rho(A,B) = \mu(A \Delta B).

If {(\Sigma,\mu)} has the property that {\mu(A)>0} for all {A\neq X'}, then this becomes a genuine metric.

In particular, if {(X,\mathcal{A},\mu)} is a measure space and {\Sigma = \hat{\mathcal{A}}} is the space of equivalence classes modulo {\sim} (equivalence mod 0), then {\mu} induces a function {\hat{\mathcal{A}} \rightarrow [0,\infty]}, which we continue to denote by {\mu}, such that {(\hat{\mathcal{A}},\mu)} is a measured abstract {\sigma}-algebra; this has the property that {\mu(A)>0} for all non-trivial {A\in \hat{\mathcal{A}}}, and so it defines a metric {\rho} as above.
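For a concrete toy model, here is a sketch of a (finite) measured Boolean algebra — all subsets of a four-point space with the uniform measure — together with the induced {\Delta} and {\rho}; since every non-empty set has positive measure here, {\rho} is a genuine metric. The function names are my own choices for illustration.

```python
from itertools import combinations

X = frozenset(range(4))                 # toy space; mu = uniform measure
def join(A, B): return A | B            # plays the role of v
def meet(A, B): return A & B            # plays the role of ^
def comp(A):    return X - A            # plays the role of '
def mu(A):      return len(A) / len(X)

def delta(A, B):                        # A Δ B = (A ^ B') v (B ^ A')
    return join(meet(A, comp(B)), meet(B, comp(A)))
def rho(A, B):  return mu(delta(A, B))

print(rho(frozenset({0, 1}), frozenset({1, 2})))     # 0.5 = mu({0, 2})

# mu(A) > 0 for every non-empty A, so rho is a metric, not just a pseudo-metric:
elts = [frozenset(s) for r in range(5) for s in combinations(range(4), r)]
assert all((rho(A, B) == 0) == (A == B) for A in elts for B in elts)
```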

Given an abstract {\sigma}-algebra {\Sigma} and a subset {\Gamma \subset \Sigma}, the algebra ({\sigma}-algebra) generated by {\Gamma\subset \Sigma} is the smallest algebra ({\sigma}-algebra) in {\Sigma} that contains {\Gamma}. Now we can interpret the equivalence (1) from the previous section (which drove the correspondence between (CG0) and separability) in terms of the measured abstract {\sigma}-algebra {\hat{\mathcal{A}}}.

Proposition 2 Let {(\Sigma,\mu)} be a measured abstract {\sigma}-algebra with no non-trivial null sets. Then for any algebra {\mathop{\mathcal F} \subset \Sigma}, we have {\overline{\mathop{\mathcal F}} = \sigma(\mathop{\mathcal F})}; that is, the {\rho}-closure of {\mathop{\mathcal F}} is equal to the {\sigma}-algebra generated by {\mathop{\mathcal F}}.

Next time we will see how separability (or equivalently, (CG0)) can be used to give a classification result for measured abstract {\sigma}-algebras, which at first requires us to take the abstract point of view introduced in this section. Finally, we will see what is needed to go from there to a similar result for probability spaces.


Unique MMEs with specification – an alternate proof

The variational principle for topological entropy says that if {X} is a compact metric space and {f\colon X\rightarrow X} is a continuous map, then {h_{\mathrm{top}}(f) = \sup_\mu h_\mu(f)}, where {h_{\mathrm{top}}} is the topological entropy, and the supremum is taken over all {f}-invariant Borel probability measures. A measure achieving this supremum is called a measure of maximal entropy (MME for short), and it is interesting to understand when a system has a unique MME.

Let’s look at this question in the setting of symbolic dynamics. Let {A} be a finite set, which we call an alphabet, and let {A^{\mathbb N}} be the set of all infinite sequences of symbols in {A}. This is a compact metric space in a standard way, and the shift map {\sigma} defined by {(\sigma(x))_n = x_{n+1}} is continuous. We consider a compact {\sigma}-invariant set {X\subset A^{\mathbb N}} and ask whether or not {(X,\sigma)} has a unique MME.

When {X} is a mixing subshift of finite type (SFT), this was answered in the 1960s by Parry; there is a unique MME, and it can be obtained by considering the transition matrix for {X} and using some tools from linear algebra. A different proof was given by Bowen in 1974 using the specification property; this holds for all mixing SFTs, but also for a more general class of shift spaces.

The purpose of this note is to describe a variant of Bowen’s proof; roughly speaking, we follow Bowen’s proof for the first half of the argument, then give a shorter version of the second half of the argument, which follows comments made in conversation with Dan Thompson and Mike Hochman.

1. The strategy

The language of the shift space {X} is the set of all finite words that appear in some element of {X}. That is, given {n\in {\mathbb N}} we write {\mathcal{L}_n = \{w\in A^n \mid [w]\neq\emptyset\}}, where {[w] = \{x\in X \mid x_1\cdots x_n = w\}}. Then we write {\mathcal{L} = \bigcup_n \mathcal{L}_n}. Write {|w|} for the length of {w}.

In the setting of symbolic dynamics, the topological transitivity property takes the following form: {X} is transitive if and only if for every {u,v\in \mathcal{L}} there is {w\in \mathcal{L}} such that {uwv\in \mathcal{L}}. Transitivity by itself gives no control over the length of {w}. We say that a shift space {X} has specification if there is {\tau\in {\mathbb N}} such that for every {u,v\in \mathcal{L}} there is {w\in \mathcal{L}} with {|w|=\tau} such that {uwv\in \mathcal{L}}; thus specification can be thought of as a uniform transitivity property.

Let {h = h_{\mathrm{top}}(\sigma) = \lim_{n\rightarrow\infty} \frac 1n \log \#\mathcal{L}_n} be the topological entropy of {(X,\sigma)}. Bowen’s proof of uniqueness has the following structure:

  1. Show that there is {C>0} such that {e^{nh} \leq \#\mathcal{L}_n \leq C e^{nh}} for every {n}.
  2. Prove that for every {\alpha>0} there is {\beta>0} such that if {\mu} is an MME and {\mathcal{D}_n \subset \mathcal{L}_n} is a collection of words with {\mu(\mathcal{D}_n) \geq \alpha} (writing {\mu(\mathcal{D}_n)} for the measure of the union of the corresponding cylinders), then we have {\#\mathcal{D}_n \geq \beta e^{nh}}. This can be thought of as a uniform version of the Katok entropy formula.
  3. Follow the proof of the variational principle to explicitly construct an MME {\mu}; then use the specification property and the counting estimates on {\#\mathcal{L}_n} to show that {\mu} has the Gibbs property and is ergodic.
  4. Show that if {\nu} is another ergodic MME, then the uniform Katok entropy formula for {\nu} and the Gibbs property for {\mu} cannot hold simultaneously; this contradiction proves uniqueness.

The proof we give here follows the above structure exactly for steps 1 and 2. Instead of steps 3 and 4, though, we give the following argument: if {\nu\neq \mu} are two distinct ergodic MMEs, then we can use the uniform Katok entropy formula for {\nu,\mu} together with the specification property to create more entropy in the system, a contradiction. In the next section we make all of this precise.

The advantage of this proof is that it allows us to replace step 3 of Bowen’s proof with a different argument that may be easier and less technical (depending on one’s taste). The disadvantage is that it does not include a proof of the Gibbs property for {\mu}, which is useful to know about in various settings.

2. The proof

2.1. Counting estimates

Write {\mathcal{L}_{m} \mathcal{L}_n} for the set of all words of the form {vw}, where {v\in \mathcal{L}_m} and {w\in \mathcal{L}_n}. Then it is clear that {\mathcal{L}_{m+n} \subset \mathcal{L}_m \mathcal{L}_n}, so {\#\mathcal{L}_{m+n} \leq (\#\mathcal{L}_m)(\#\mathcal{L}_n)}. In particular, {\log \#\mathcal{L}_n} is subadditive, so by Fekete’s lemma {h = \lim \frac 1n \log \#\mathcal{L}_n = \inf_n \frac 1n \log \#\mathcal{L}_n}, and we get {\#\mathcal{L}_n \geq e^{nh}} for every {n}.

The upper bound requires the specification property. Define a map {(\mathcal{L}_n)^k \rightarrow \mathcal{L}_{k(n+\tau)}} by sending {w_1,\dots,w_k} to {w_1 u_1 w_2 u_2 \cdots w_k u_k}, where {u_i \in \mathcal{L}_\tau} are provided by specification. This map is 1-1 so {\#\mathcal{L}_{k(n+\tau)} \geq (\#\mathcal{L}_n)^k}. Taking logs gives

\displaystyle  \frac 1{k(n+\tau)} \log\#\mathcal{L}_{k(n+\tau)} \geq \frac 1{n+\tau} \log \#\mathcal{L}_n,

and sending {k\rightarrow\infty} takes the left-hand side to {h}, so {\#\mathcal{L}_n \leq e^{(n+\tau)h}}; this gives the counting bounds we claimed earlier, with {C= e^{\tau h}}.

The gluing construction in the previous paragraph will be used later on when we need to derive a contradiction by producing extra entropy.
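As a concrete illustration of these counting bounds (a sketch; the golden-mean shift does not appear elsewhere in this post, but it is a standard example with specification), consider the shift of all binary sequences with no subword {11}: it has specification with {\tau=1}, since any two admissible words can be glued by a single 0, and its entropy is the logarithm of the golden ratio.

```python
from math import exp, log

def num_words(n):
    """#L_n for the golden-mean shift: binary words with no subword 11."""
    a, b = 1, 1                 # words of length 1 ending in 0, ending in 1
    for _ in range(n - 1):
        a, b = a + b, a         # may append 0 to anything, 1 only after a 0
    return a + b                # = F_{n+2}, the (n+2)nd Fibonacci number

h, tau = log((1 + 5 ** 0.5) / 2), 1    # h = log(golden ratio), gluing time 1
for n in (5, 10, 20, 40):
    print(n, exp(n * h) <= num_words(n) <= exp((n + tau) * h))   # True each time
```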

2.2. Uniform Katok formula

Now suppose {\mu} is any MME, and given {w\in \mathcal{L}} write {\mu(w) = \mu([w])}. Similarly write {\mu(\mathcal{D}_n) = \mu(\bigcup_{w\in \mathcal{D}_n} [w])}. Applying Fekete’s lemma to the subadditive sequence {H_n(\mu) = \sum_{w\in \mathcal{L}_n} -\mu(w) \log \mu(w)}, whose limit {\lim \frac 1n H_n(\mu) = h_\mu(\sigma) = h} is also its infimum, we get {nh \leq H_n(\mu)}.

Given {\mathcal{D}_n} with {\mu(\mathcal{D}_n)\geq \alpha}, we have

\displaystyle  \begin{aligned} nh &\leq \sum_{w\in \mathcal{D}_n} - \mu(w)\log \mu(w) + \sum_{w\in \mathcal{D}_n^c} -\mu(w)\log\mu(w) \\ &= \mu(\mathcal{D}_n)\sum_{w\in \mathcal{D}_n} - \frac{\mu(w)}{\mu(\mathcal{D}_n)} \log \frac{\mu(w)}{\mu(\mathcal{D}_n)} + \mu(\mathcal{D}_n^c)\sum_{w\in \mathcal{D}_n^c} - \frac{\mu(w)}{\mu(\mathcal{D}_n^c)} \log \frac{\mu(w)}{\mu(\mathcal{D}_n^c)} \\ &\qquad - \mu(\mathcal{D}_n)\log\mu(\mathcal{D}_n) - \mu(\mathcal{D}_n^c)\log\mu(\mathcal{D}_n^c) \\ &\leq \mu(\mathcal{D}_n) \log\#\mathcal{D}_n + \mu(\mathcal{D}_n^c) \log \#\mathcal{D}_n^c + \log 2 \\ &\leq \mu(\mathcal{D}_n) \log\#\mathcal{D}_n + (1-\mu(\mathcal{D}_n)) (nh + \log C) + \log 2 \\ &= \mu(\mathcal{D}_n) (\log \#\mathcal{D}_n - nh - \log C) + nh + \log (2C). \end{aligned}

Solving for {\#\mathcal{D}_n} gives

\displaystyle  \log\#\mathcal{D}_n \geq nh + \log C - \frac{\log(2C)}{\mu(\mathcal{D}_n)} \geq nh + \log C - \frac{\log(2C)}{\alpha},

so taking {\beta} with {\log\beta = \log C - \log(2C)/\alpha} gives {\#\mathcal{D}_n \geq \beta e^{nh}}, verifying the uniform Katok formula.

Note that the argument here does not rely directly on the specification property; it only uses the fact that {\mu} is an MME and that we have the uniform upper bound {\#\mathcal{L}_n \leq Ce^{nh}}. In fact, if one is a little more careful with the computations it is possible to show that we can take {\beta = \beta(\alpha) = C^{1-\frac 1\alpha} e^{-\frac{h(\alpha)}\alpha}}, where {h(\alpha) = -\alpha\log\alpha - (1-\alpha)\log(1-\alpha)}. This has the pleasing property that {\beta\rightarrow 1} as {\alpha\rightarrow 1}, so the lower bound for {\#\mathcal{D}_n} converges to the lower bound for {\#\mathcal{L}_n}.
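A quick numerical check of this refined formula (a sketch, with the hypothetical placeholder values {\tau=1} and {h=\log 2}, so {C = e^{\tau h} = 2}):

```python
from math import exp, log

C = 2.0                                   # stand-in for C = e^{tau h}
def h2(a):  return -a * log(a) - (1 - a) * log(1 - a)   # h(alpha)
def beta(a): return C ** (1 - 1 / a) * exp(-h2(a) / a)

for a in (0.5, 0.9, 0.99, 0.999):
    print(a, beta(a))                     # increases toward 1 as alpha -> 1
```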

2.3. Producing more entropy

Now suppose {\mu,\nu} are two distinct ergodic MMEs. Then {\mu\perp \nu} so there are disjoint sets {Y,Z\subset X} with {\mu(Y) = 1} and {\nu(Z)=1}. Fix {\delta>0} and let {Y'\subset Y} and {Z'\subset Z} be compact sets such that {\mu(Y')> 1-\delta} and {\nu(Z')>1-\delta}. Then the distance between {Y',Z'} is positive, so there is {N\in {\mathbb N}} such that for every {n\geq N-\tau}, no {n}-cylinder intersects both {Y',Z'}. In particular, putting {\mathcal{Y}_n = \{w\in \mathcal{L}_n \mid [w] \cap Y' \neq\emptyset\}}, and similarly for {\mathcal{Z}_n} with {Z'}, we have

  • {\mathcal{Y}_n \cap \mathcal{Z}_n = \emptyset} for every {n\geq N-\tau}, and
  • {\mu(\mathcal{Y}_n) \geq 1-\delta} and {\nu(\mathcal{Z}_n) \geq 1-\delta}, so by the uniform Katok formula we have {\#\mathcal{Y}_n,\#\mathcal{Z}_n \geq \beta e^{nh}}, where {\beta = \beta(1-\delta) > 0}.

Up to now all of the arguments that we have made appear in Bowen’s proof; the approximation argument just given appears in step 4 of his proof, where one uses the Gibbs property of the constructed MME to derive a contradiction with the uniform Katok formula. It is at this point that our argument diverges from Bowen’s, since we have not proved a Gibbs property. Instead, we apply the specification property to the collections of words {\mathcal{Y},\mathcal{Z} \subset \mathcal{L}} to get a lower bound on {\#\mathcal{L}_n} that grows more quickly than {e^{nh}}, a contradiction.

Before giving the correct argument, let me describe three incorrect arguments that illustrate the main ideas and also show why certain technical points arise in the final argument; in particular this will describe the process of arriving at the proof, which I think is worthwhile.

First wrong argument

Here is a natural way to proceed. With {N,\mathcal{Y},\mathcal{Z}} as above, consider for each {k\in {\mathbb N}} and each {\omega\in \{0,1\}^k} the set of all words

\displaystyle  w^1 u^1 w^2 u^2 \cdots w^k u^k

where {w^i \in \mathcal{Y}_{N-\tau}} if {\omega_i = 0} and {\mathcal{Z}_{N-\tau}} if {\omega_i=1}, and {u^i \in \mathcal{L}_\tau} are the gluing words provided by specification. Note that each choice of {\omega} produces at least {(\beta e^{(N-\tau)h})^k} words in {\mathcal{L}_{kN}}. Moreover, the collections of words produced by different choices of {\omega} are disjoint, because {\mathcal{Y}_{N-\tau}} and {\mathcal{Z}_{N-\tau}} are disjoint. We conclude that

\displaystyle  \#\mathcal{L}_{kN} \geq 2^k \beta^k e^{k(N-\tau)h},

so

\displaystyle  \tfrac 1{kN}\log \#\mathcal{L}_{kN} \geq h + \tfrac 1N(\log(2\beta) - \tau h).

If {\log(2\beta) > \tau h} then this would be enough to show that {h_\mathrm{top} > h}, a contradiction. Unfortunately, since {\beta< 1} this argument cannot work once {\tau h \geq \log 2}, so we must try again…

Second wrong argument

Looking at the above attempted proof, one may describe the problem as follows: each of the words {u^i} makes us lose a factor of {e^{-\tau h}} from our estimate on {\#\mathcal{L}_{kN}}. If {\omega_i = \omega_{i+1}}, then instead of letting {w^i} and {w^{i+1}} range over {\mathcal{Y}_{N-\tau}} and then gluing them, we could replace the words {w^i u^i w^{i+1}} with the words {v\in \mathcal{Y}_{2N-\tau}}. In particular, this would replace the estimate {\beta^2 e^{2(N-\tau)h}} with the estimate {\beta e^{(2N - \tau)h}}.

This suggests that given {\omega\in \{0,1\}^k}, we should only keep track of the set {J \subset \{1,\dots, k\}} of indices {j} for which {\omega_{j+1} \neq \omega_j}, since if {\omega_{j+1} = \omega_j} we can avoid losing the factor of {e^{-\tau h}} by avoiding the gluing process.

So, let’s try it. Given {J = \{j_1 < \cdots < j_\ell \} \subset \{1, \dots, k\}}, let {m_i = j_{i+1} - j_i} (with {m_0=j_1} and {m_\ell = k - j_\ell}), and consider the map

\displaystyle  \pi_J \colon \mathcal{Y}_{m_0 N - \tau} \times \mathcal{Z}_{m_1 N - \tau} \times \mathcal{Y}_{m_2 N - \tau} \times \cdots \times \mathcal{Y}_{m_{\ell-1}N - \tau} \times \mathcal{Z}_{m_\ell N} \rightarrow \mathcal{L}_{kN}

given by specification (whether the product ends with {\mathcal{Y}} or {\mathcal{Z}} depends on the parity of {\ell}).

Let {\mathcal{D}^J_{kN}} be the image of {\pi_J}, and note that

\displaystyle  \#\mathcal{D}_{kN}^J \geq \left(\prod_{i=0}^{\ell-1} \beta e^{(m_iN - \tau) h}\right)\beta e^{m_\ell N h} = \beta^{\ell + 1} e^{(kN - \ell\tau)h}. \ \ \ \ \ (1)

If the collections {\mathcal{D}_{kN}^J}, {\mathcal{D}_{kN}^{J'}} were disjoint for different choices of {J,J'}, then we could fix {\gamma>0} and sum (1) over all {J} with {\#J \leq \gamma k} to get

\displaystyle  \#\mathcal{L}_{kN} \geq \left(\sum_{\ell = 0}^{\gamma k} {k \choose \ell} \right) \beta^{\gamma k} e^{(kN - \gamma k \tau)h} \geq e^{(-\gamma \log \gamma)k} e^{k(\gamma\log\beta + Nh - \gamma\tau h)},

where we use Stirling’s approximation for the last inequality. Taking logs and dividing by {kN} gives

\displaystyle  \tfrac 1{kN} \log \#\mathcal{L}_{kN} \geq h + \tfrac \gamma N ( -\log \gamma + \log\beta - \tau h).

For {\gamma>0} sufficiently small this is {> h}, which gives {h_\mathrm{top} > h}, a contradiction.

The problem with this argument is that the collections {\mathcal{D}_{kN}^J} need not be disjoint for different choices of {J}; this is because {w\in \mathcal{Y}_{mN-\tau}} may have {w_{[jN, (j+1)N-\tau)} \in \mathcal{Z}} for some value of {j}, so that we cannot necessarily recover {J} uniquely from knowing {w\in \mathcal{D}_{kN}^J}.

Third wrong argument

Let’s try to address the issue just raised, that we cannot recover {J} uniquely from {w\in \mathcal{D}_{kN}^J} because {w\in \mathcal{Y}_{mN}} may have subwords in {\mathcal{Z}}. We address this by only using words {w} where there are `not many’ such subwords. More precisely, given {w\in \mathcal{Y}_{n}} for {n\geq N-\tau}, let

\displaystyle  B(w) = \{ 0 \leq j < \tfrac{n+\tau}N \mid w_{[jN, (j+1)N-\tau)}\in \mathcal{Z} \}

be the set of `bad’ times, and similarly with the roles of {\mathcal{Y}} and {\mathcal{Z}} reversed. Let {b(w) = \#B(w)}, and observe that since the cylinders in {\mathcal{Z}_{N-\tau}} are disjoint from those in {\mathcal{Y}_{N-\tau}}, whose union has {\mu}-measure at least {1-\delta}, invariance of {\mu} gives

\displaystyle  \mu\{w\in \mathcal{Y}_{n} \mid j\in B(w)\} \leq \delta

for every {0\leq j < \frac{n+\tau}N}. We conclude that

\displaystyle  \sum_{w\in \mathcal{Y}_{n}} b(w)\mu(w) \leq \sum_{j=0}^{\lfloor\frac{n+\tau}N\rfloor-1} \mu\{w\in \mathcal{Y}_{n} \mid j\in B(w)\} \leq \delta \tfrac{n+\tau}N.

Let {\mathcal{Y}'_{n} = \{w\in \mathcal{Y}_{n} \mid b(w) \leq 2\delta \frac{n+\tau}N\}}, and note that

\displaystyle  \delta \tfrac{n+\tau}N \geq \sum_{w\in \mathcal{Y}_{n} \setminus \mathcal{Y}_{n}'} \mu(w) 2\delta \tfrac{n+\tau}N,

so

\displaystyle  \tfrac 12 \geq \mu(\mathcal{Y}_n) - \mu(\mathcal{Y}_n') \geq 1-\delta - \mu(\mathcal{Y}_n').

We conclude that

\displaystyle  \mu(\mathcal{Y}_{n}') \geq \tfrac 12 - \delta,

so taking {\beta = \beta(\frac 12 - \delta)}, the uniform Katok estimate gives {\#\mathcal{Y}_{n}' \geq \beta e^{nh}}. A similar definition and argument gives {\mathcal{Z}_{n}'}.

Now we repeat the argument from the previous section (the second wrong argument) using {\mathcal{Y}',\mathcal{Z}'} in place of {\mathcal{Y},\mathcal{Z}}. Given {J\subset \{1,\dots, k\}}, let {\mathcal{D}_{kN}^J} be the image of the map {\pi_J} with the corresponding restricted domain, and note that the estimate (1) still holds (with the new value of {\beta}).

The final piece of the puzzle is to take {v \in \mathcal{D}_{kN}^J} and estimate how many other collections {\mathcal{D}_{kN}^{J'}} can contain it; that is, how many possibilities there are for {J} once we know {v}. We would like to do the following: write {v = w^1 u^1 w^2 u^2 \cdots w^\ell} and then argue that {J'} must be contained in the union of the sets of times corresponding to the {B(w^i)}. The problem is that this only works when there is a disagreement between which of the sets {\mathcal{Y}',\mathcal{Z}'} the maps {\pi_J,\pi_{J'}} are trying to use, and so I cannot quite make this give us the bounds we want.

The correct argument

To fix this final issue we change the construction slightly; instead of letting {J} mark the times where we transition between {\mathcal{Y}',\mathcal{Z}'}, we do a construction where at each {j\in J} we transition from {\mathcal{Y}'} to {\mathcal{Z}} and then immediately back to {\mathcal{Y}'}. Then {v \in \mathcal{D}_{kN}^J \cap \mathcal{D}_{kN}^{J'}} will impose strong conditions on the set {J'}.

Let’s make everything precise and give the complete argument. As in the very beginning of this section we fix {\delta>0} and let {Y',Z'} be disjoint compact sets with {\mu(Y'),\nu(Z')>1-\delta}, where {\mu,\nu} are distinct ergodic MMEs. Let {N} be such that for every {n\geq N-\tau}, no {n}-cylinder intersects both {Y',Z'}.

Let {\mathcal{Y}_n = \{w\in \mathcal{L}_n \mid [w] \cap Y' \neq \emptyset\}}, and similarly for {\mathcal{Z}_n}. We repeat verbatim part of the argument from the last section; given {w\in \mathcal{Y}_{n}} for {n\geq N-\tau}, let

\displaystyle  B(w) = \{ 0 \leq j < \tfrac{n+\tau}N \mid w_{[jN, (j+1)N-\tau)}\in \mathcal{Z} \}

be the set of `bad’ times. Let {b(w) = \#B(w)}, and observe that since the cylinders in {\mathcal{Z}_{N-\tau}} are disjoint from those in {\mathcal{Y}_{N-\tau}}, whose union has {\mu}-measure at least {1-\delta}, invariance of {\mu} gives

\displaystyle  \mu\{w\in \mathcal{Y}_{n} \mid j\in B(w)\} \leq \delta

for every {0\leq j < \frac{n+\tau}N}. We conclude that

\displaystyle  \sum_{w\in \mathcal{Y}_{n}} b(w)\mu(w) \leq \sum_{j=0}^{\lfloor\frac{n+\tau}N\rfloor-1} \mu\{w\in \mathcal{Y}_{n} \mid j\in B(w)\} \leq \delta \tfrac{n+\tau}N.

Let {\mathcal{Y}'_{n} = \{w\in \mathcal{Y}_{n} \mid b(w) \leq 2\delta \frac{n+\tau}N\}}, and note that

\displaystyle  \delta \tfrac{n+\tau}N \geq \sum_{w\in \mathcal{Y}_{n} \setminus \mathcal{Y}_{n}'} \mu(w) 2\delta \tfrac{n+\tau}N,

so

\displaystyle  \tfrac 12 \geq \mu(\mathcal{Y}_n) - \mu(\mathcal{Y}_n') \geq 1-\delta - \mu(\mathcal{Y}_n').

We conclude that

\displaystyle  \mu(\mathcal{Y}_{n}') \geq \tfrac 12 - \delta,

so taking {\beta = \beta(\frac 12 - \delta)}, the uniform Katok estimate gives {\#\mathcal{Y}_{n}' \geq \beta e^{nh}}.

Now given {k\in {\mathbb N}} and {J = \{j_1 < j_2< \cdots < j_\ell\} \subset \{1,\dots,k\}}, let {m_i = j_{i+1} - j_i} for {0\leq i\leq \ell} (putting {j_0=0} and {j_{\ell+1} = k}) and define a map

\displaystyle  \pi_J \colon \prod_{i=0}^{\ell} (\mathcal{Y}_{(m_i-1)N - \tau}' \times \mathcal{Z}_{N-\tau}) \rightarrow \mathcal{L}_{kN}

by the specification property, so that for {v^i \in \mathcal{Y}'_{(m_i-1)N-\tau}} and {w^i\in \mathcal{Z}_{N-\tau}} we have

\displaystyle  \pi_J(v^0,w^0,\dots,v^\ell,w^\ell) = v^0 u^0 w^0 \hat u^0 v^1 \cdots v^\ell u^\ell w^\ell \hat u^\ell,

where {u^i,\hat u^i} have length {\tau} and are provided by specification. Let {\mathcal{D}_{kN}^J} be the image of {\pi_J} and note that

\displaystyle  \#\mathcal{D}_{kN}^J \geq \beta^{2\ell+2} e^{(kN - (2\ell +2)\tau)h}. \ \ \ \ \ (2)

Given {J} and {x\in \mathcal{D}_{kN}^J}, let {\mathcal{J}(x,J)} denote the set of {J' \subset \{1,\dots, k\}} such that {x\in \mathcal{D}_{kN}^{J'}}. Writing {x = v^0 u^0 w^0 \hat u^0 \cdots v^\ell u^\ell w^\ell \hat u^\ell}, note that {x\in \mathcal{D}_{kN}^{J'}} implies that for each {j\in J'}, either {j\in J} or there are consecutive elements {j_i < j_{i+1}} of {J} such that {j_i < j < j_{i+1}}, and in this latter case we have that {v^i_{[(j - j_i)N, (j-j_i + 1)N-\tau)} \in \mathcal{Z}_{N-\tau}}, so {j-j_i \in B(v^i)}. We conclude that

\displaystyle  J' \subset J \cup \left(\bigcup_{i=0}^\ell j_i + B(v^i)\right).

By our choice of {\mathcal{Y}'}, for each {x} the set on the right has at most {\ell + 2\delta k} elements. In particular, we have

\displaystyle  \#\mathcal{J}(x,J) \leq 2^{\ell + 2\delta k}.

This bounds the number of {\mathcal{D}_{kN}^J} that can contain a given {x\in \mathcal{L}_{kN}}, and since there are {\geq e^{(-\gamma\log\gamma)k}} distinct choices of {J} with {\#J \leq \gamma k}, the bound in (2) gives

\displaystyle  \#\mathcal{L}_{kN} \geq 2^{-(\gamma + 2\delta)k} e^{(-\gamma\log\gamma)k}\beta^{2\gamma k+2} e^{(kN - (2\gamma k+2)\tau)h}.

Taking logs gives

\displaystyle  \log \#\mathcal{L}_{kN} \geq kNh -(\gamma + 2\delta)k\log 2 - (\gamma\log\gamma)k + (2\gamma k + 2)\log\beta - (2\gamma k + 2)\tau h.

Dividing by {kN} and sending {k\rightarrow\infty} gives

\displaystyle  h_\mathrm{top} \geq h + \tfrac 1N(-\gamma\log\gamma - (\gamma + 2\delta)\log 2 + 2\gamma\log\beta - 2\gamma\tau h).

Putting {\gamma = \delta} this gives

\displaystyle  h_\mathrm{top} \geq h + \tfrac \delta N(-\log\delta - \log 8 + 2\log\beta - 2\tau h).

Thus it suffices to make the appropriate choice of {\delta} at the beginning of the proof. More precisely, let {\beta' = \beta(\frac 14)} be as in the uniform Katok lemma, and let {\delta\in (0,\frac 14)} be small enough that {-\log\delta > \log 8 - 2\log\beta' + 2\tau h}. Then {\beta(\frac 12-\delta) \geq \beta'} and so the estimate above gives

\displaystyle  h_\mathrm{top} \geq h + \tfrac\delta N(-\log\delta - \log 8 + 2\log\beta' - 2\tau h) > h,

which contradicts our original assumption that {h} was the topological entropy. This contradiction shows that there is a unique measure of maximal entropy.


De Bruijn graphs and entropy at finite scales

Let {\mathcal{A}} be a finite set, which we call an alphabet, and let {x \in \mathcal{A}^{\mathbb N}} be an infinite sequence of letters from {\mathcal{A}}. It is natural to ask how complex the sequence {x} is: for example, if the alphabet is {\{H,T\}}, then we expect a typical sequence produced by flipping a coin to be in some sense more complex than the sequence {x=HHHHH\dots}.

One important way of making this notion precise is the entropy of the shift space generated by {x}, a notion coming from symbolic dynamics. Let {a_k(x)} be the number of words of length {k} (that is, elements of {\mathcal{A}^k}) that appear as subwords of {x}. Clearly we have {1\leq a_k(x) \leq (\#\mathcal{A})^k}. Roughly speaking, the entropy of {x} is the exponential growth rate of {a_k(x)}. More precisely, we write

\displaystyle  h(x) := \varlimsup_{k\rightarrow\infty} \tfrac 1k \log a_k(x).

Of course, in practice it is often the case that one does not have an infinite sequence, but merely a very long one. For example, it has been suggested that entropy (and a related quantity, the topological pressure) can play a role in the analysis of DNA sequences; see [D. Koslicki, “Topological entropy of DNA sequences”, Bioinformatics 27(8), 2011, p. 1061–1067] and [D. Koslicki and D.J. Thompson, “Coding sequence density estimation via topological pressure”, J. Math. Biol. 70, 2015, p. 45–69]. In this case we have {\mathcal{A} = \{A,C,G,T\}} and are dealing with sequences {x} whose length is large, but finite.

Given a sequence {x} of length {n}, one can try to get some reasonable `entropy-like’ quantity by fixing some {k} and putting {h(x) = \frac 1k\log a_k(x)}. But what should we take {k} to be? If we take {k} too small we will get an overestimate (with {k=1} we will probably just find out that {x} contains every letter of {\mathcal{A}}), but if we take {k} too large we get an underestimate (with {k=n} we have {a_k(x)=1} so {h=0}).

The convention proposed by Koslicki in the first paper above is to let {k} be the largest number for which it is possible that some word of length {n} contains every word of length {k}. With this choice, if every word of length {k} actually appears in {x}, then {h(x)} achieves its maximum value {\log \#\mathcal{A}}; if some words do not appear, then {h(x)<\log \#\mathcal{A}}.

What is the relationship between {n} and {k} that guarantees existence of a word of length {n} containing every word of length {k}? Let {p=\#\mathcal{A}} and note that there are {p^k} words of length {k}; if {w\in \mathcal{A}^n} contains every such word then we must have {n\geq p^k + k-1}, since the length-{k} subwords of {w} are precisely {(w_1\cdots w_k)}, {(w_2\cdots w_{k+1})}, \dots, {(w_{n-k+1} \cdots w_n)}, so we must have {n-k+1 \geq p^k}.

The converse implication is a little harder, though. Given {k}, let {n=p^k + k-1}. Is it necessarily true that there is a word {w\in \mathcal{A}^n} that contains every word of length {k} as a subword? After all, once {w_1\cdots w_k} is determined, there are not many possibilities for the word {w_2\cdots w_{k+1}}; can we navigate these restrictions successfully?

It is useful to rephrase the problem in the language of graph theory (what follows can be found in the proof of Lemma 1 in Koslicki’s paper). Let {G_k} be the directed graph defined as follows:

  • the vertex set is {\mathcal{A}^k}, so each vertex corresponds to a word of length {k};
  • there is an edge from {u\in \mathcal{A}^k} to {v\in \mathcal{A}^k} if and only if {u_1 v = uv_k}, that is, if {u_2\cdots u_k = v_1\cdots v_{k-1}}.

The graph {G_k} is the {k}-dimensional De Bruijn graph of {p} symbols. Recall that a Hamiltonian path in a graph is a path that visits each vertex exactly once. Thus the question above, regarding existence of a word in {\mathcal{A}^n} that contains every word in {\mathcal{A}^k}, where {n=p^k + k-1}, is equivalent to asking for the existence of a Hamiltonian path in the De Bruijn graph.

There is a correspondence between {G_k} and {G_{k-1}}; vertices in {G_k} correspond to edges in {G_{k-1}} (since both are labeled by elements of {\mathcal{A}^k}). Thus a Hamiltonian path in {G_k} corresponds to an Eulerian path in {G_{k-1}}; that is, a path that visits every edge exactly once.

This correspondence is very useful, since in general the problem of determining whether a Hamiltonian path exists is hard (NP-complete), while it is easy to check existence of an Eulerian path in a directed graph: a sufficient condition is that the graph be connected and that every vertex have in-degree equal to its out-degree (a slight weakening of these conditions is both necessary and sufficient). This is the case for De Bruijn graphs, which are strongly connected, and in which every vertex has {p=\#\mathcal{A}} edges coming in and going out. Thus {G_{k-1}} has an Eulerian path, which corresponds to a Hamiltonian path for {G_k}. This answers the original question, demonstrating that for every {k}, there is a word {w} of length {n=p^k+k-1} such that {w} contains every word of length {k} as a subword.
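Here is a short Python sketch of this construction (assuming {k\geq 2}; the stack-based Eulerian-circuit routine is Hierholzer’s algorithm, and the function name is my own, not from Koslicki’s paper):

```python
from itertools import product

def de_bruijn_word(alphabet, k):
    """Return a word of length p^k + k - 1 (p = len(alphabet)) containing
    every length-k word exactly once, via an Eulerian circuit in G_{k-1}.
    Assumes k >= 2."""
    edges = {u: list(alphabet)
             for u in map(''.join, product(alphabet, repeat=k - 1))}
    stack, circuit = [alphabet[0] * (k - 1)], []
    while stack:                        # Hierholzer's algorithm
        u = stack[-1]
        if edges[u]:
            a = edges[u].pop()          # traverse the unused edge labeled a
            stack.append(u[1:] + a)     # its endpoint in G_{k-1}
        else:
            circuit.append(stack.pop())
    circuit.reverse()
    # Spell the word: the start vertex, then the last letter of each vertex.
    return circuit[0] + ''.join(v[-1] for v in circuit[1:])

w = de_bruijn_word('01', 3)            # p = 2, k = 3, so |w| = 2^3 + 3 - 1 = 10
assert len(w) == 10
assert len({w[i:i + 3] for i in range(8)}) == 8   # all 8 binary 3-words appear
print(w)                               # '0011101000' (one valid answer)
```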


Entropy of S-gap shifts

1. S-gap shifts

S-gap shifts are a useful example for studying dynamics of shift spaces that are not subshifts of finite type but still exhibit some strong mixing properties. They are defined as follows: given {S\subset \{0,1,2,\dots\}}, let {G = \{0^n1 \mid n\in S\} \subset \{0,1\}^*} be the set of all words on two symbols consisting of {n} 0s followed by a single 1, where {n\in S}. (Edit 8/2/15: Thanks to Steve Kass for pointing out in his comment that the requirement {0^n1\in G \Leftrightarrow n\in S} needs to be made explicit here.) Then let {G^{{\mathbb Z}}} be the set of all bi-infinite sequences of 0s and 1s that can be written as an infinite concatenation of words in {G}, and let {X\subset \{0,1\}^{\mathbb Z}} be the smallest closed shift-invariant set containing {G^{\mathbb Z}}.

Equivalently, {X} is the collection of bi-infinite sequences {x\in \{0,1\}^{\mathbb Z}} for which every subword of the form {10^n1} has {n\in S}. If {S} is finite then {X} is a shift of finite type. We are usually most interested in the case where {S} is infinite — for example, in this paper (arXiv) where Dan Thompson and I considered questions of uniqueness of the measure of maximal entropy. For purposes of this post, {S} may be finite or infinite; it will not matter.

Recall that if {\mathcal{L}_n} denotes the set of words of length {n} that appear somewhere in the shift {X}, then the topological entropy of {X} is {h(X) = \lim \frac 1n \log \#\mathcal{L}_n}. The following result is well-known and often quoted.

Theorem 1 Given {S\subset \{0,1,2,\dots\}}, the topological entropy of the corresponding {S}-gap shift is {h(X) = \log\lambda}, where {\lambda>1} is the unique solution to {\sum_{n\in S} \lambda^{-(n+1)} = 1}.

Note that when {S=\{0,1,2,\dots\}}, the {S}-gap shift is the full shift on two symbols, and the equation has solution {\lambda=2}.
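Although the proof occupies the rest of this post, the defining equation itself is easy to solve numerically. Here is a bisection sketch (the function name is my own; for infinite {S} one works with a truncation, which only perturbs {\lambda} slightly):

```python
def sgap_lambda(S, tol=1e-12):
    """Solve sum_{n in S} lam**-(n+1) = 1 for lam > 1 by bisection;
    the entropy is then log(lam). S should be finite (or truncated)."""
    f = lambda lam: sum(lam ** -(n + 1) for n in S) - 1
    lo, hi = 1.0 + 1e-9, 2.0
    while f(hi) > 0:                  # f is decreasing in lam
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return (lo + hi) / 2

print(sgap_lambda(range(60)))         # truncated full shift: ~2.0
print(sgap_lambda(range(1, 60, 2)))   # odd gaps: ~1.41421... = sqrt(2)
```

The second value matches {\lambda = \sqrt 2} for {S = \{1,3,5,\dots\}}, which reappears in the next section.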

Despite the fact that Theorem 1 is well-known, I am not aware of a complete proof written anywhere in the literature. In a slightly different language, this result already appears in B. Weiss’ 1970 paper “Intrinsically ergodic systems” [Bull. AMS 76, 1266–1269] as example 3.(3), but no details of the proof are given. It is exercise 4.3.7 in Lind and Marcus’ book “Symbolic dynamics and coding”, and the section preceding it gives ideas as to how the proof may proceed. Finally, a more detailed proof appears in Spandl’s “Computing the topological entropy of subshifts” [Math. Log. Quart. 53 (2007), 493–510], but there is a gap in the proof. The goal of this post is to explain where the gap in Spandl’s proof is, and then to give two other proofs of Theorem 1, one more combinatorial, the other more ergodic theoretic.

2. An incomplete proof

The idea of Spandl’s proof is this. Given an {S}-gap shift, let {\mathcal{R}_n} be the set of words of length {n} that end in the symbol {1}. Every such word is either {0^{n-1}1} or is of the form {w0^k1}, where {w\in \mathcal{R}_{n-(k+1)}} and {k\in S}. Thus we have

\displaystyle \#\mathcal{R}_n = 1 + \sum_{k=0}^{n-1} {\mathbf{1}}_S(k) \#\mathcal{R}_{n-(k+1)}. \ \ \ \ \ (1)

Moreover, every {w\in \mathcal{L}_n} is either {0^n} or is of the form {v0^{n-k}} for some {v\in \mathcal{R}_k}, so {\#\mathcal{L}_n = 1 + \sum_{k=1}^n \#\mathcal{R}_k}. With a little more work, one can use this together with (1) to get

\displaystyle \#\mathcal{L}_n = c_n + \sum_{k=0}^{n-1} {\mathbf{1}}_S(k) \#\mathcal{L}_{n-(k+1)},

where {c_n\in [0,n+2]} for all {n}. Dividing through by {\#\mathcal{L}_n} gives

\displaystyle 1 = \frac{c_n}{\#\mathcal{L}_n} + \sum_{k=0}^{n-1} {\mathbf{1}}_S(k)\frac{\#\mathcal{L}_{n-(k+1)}}{\#\mathcal{L}_n} \ \ \ \ \ (2)

Writing {\lambda = e^{h(X)}}, Spandl now says that {\#\mathcal{L}_n} is asymptotically proportional to {\lambda^n}, and so for each {k} the ratio inside the sum converges to {\lambda^{-(k+1)}} as {n\rightarrow\infty}. Since {c_n} is subexponential, this would prove Theorem 1.

The problem is that the ratio {\frac{\#\mathcal{L}_{n-(k+1)}}{\#\mathcal{L}_n}} may not converge as {n\rightarrow\infty}. Indeed, taking {S = \{1,3,5,7,\dots\}} it is not hard to show that when {k} is even, the limit taken along odd values of {n} differs from the limit taken along even values of {n}.
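This oscillation is easy to observe numerically using the recursion (1) together with {\#\mathcal{L}_n = 1 + \sum_{k=1}^n \#\mathcal{R}_k} (a sketch; here {\lambda=\sqrt 2}, since {\sum_{n\text{ odd}} \lambda^{-(n+1)} = 1/(\lambda^2-1) = 1}):

```python
import math

S = set(range(1, 101, 2))          # S = {1, 3, 5, ...}, truncated at 99
lam = math.sqrt(2)                 # exact value of lambda for this S

N = 30
R = [0] * (N + 1)                  # R[n] = #R_n, words of length n ending in 1
for n in range(1, N + 1):
    R[n] = 1 + sum(R[n - (k + 1)] for k in S if k + 1 < n)
L = [1 + sum(R[1:n + 1]) for n in range(N + 1)]   # L[n] = #L_n

for n in range(24, 30):
    print(n, L[n] / lam ** n)      # two different limits along even and odd n
```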

One might observe that in this specific example, the terms where {k} is even do not contribute to the sum in (2), since {S} contains no even numbers. Thus it is plausible to make the following conjecture.

Conjecture. For any {S}, let {d = \mathrm{gcd}(S+1)} and let {h = \log\lambda} be the topological entropy of the corresponding {S}-gap shift. Then for every {j=0,1,\dots,d-1} the limit

\displaystyle a_j := \lim_{n\rightarrow\infty} \frac{\#\mathcal{L}_{j+dn}}{\lambda^{j+dn}}

exists. In particular, if {k\in S} then

\displaystyle \frac{\#\mathcal{L}_{n-(k+1)}}{\#\mathcal{L}_n} = \lambda^{-(k+1)}\frac{\#\mathcal{L}_{n-(k+1)}}{\lambda^{n-(k+1)}} \frac{\lambda^n}{\#\mathcal{L}_n} \rightarrow \lambda^{-(k+1)} \text{ as }n\rightarrow\infty.

If the conjecture is true, this would complete Spandl’s proof of Theorem 1. I expect that the conjecture is true but do not know how to prove it. In general, any shift space with entropy {h=\log \lambda} has the property that {\#\mathcal{L}_n\geq \lambda^n}. There are examples of shift spaces where {\#\mathcal{L}_n /\lambda^n} is not bounded above; however, it can be shown that every {S}-gap shift admits an upper bound, so that {\#\mathcal{L}_n/\lambda^n} is bounded away from 0 and {\infty} (this is done in my paper with Dan Thompson). I don’t see how those techniques can extend to a proof of the conjecture.

So instead of patching the hole in this proof, we investigate two others.

3. A combinatorial proof

Given a shift space {X} with language {\mathcal{L}}, let {a_n = \#\mathcal{L}_n} be the number of words of length {n}. Consider the following generating function:

\displaystyle H(z) = \sum_{n=1}^\infty a_n z^n.

(This is similar to the dynamical zeta function but is somewhat different since we consider all words of length {n} and not just ones that represent periodic orbits of period {n}.) Observe that the radius of convergence of {H} is {\xi := e^{-h(X)}}. Indeed, for {|z|<\xi} we can put {\beta = |z|/\xi < 1} and observe that the quantity {|a_n z^n| = \frac{\#\mathcal{L}_n}{e^{nh(X)}} \beta^n} decays exponentially; similarly, for {|z|>\xi} the terms in the sum grow exponentially.

Now fix an {S}-gap shift and consider the function

\displaystyle F_1(z) = \sum_{n\in S} z^{n+1}.

Our goal is to find a relationship between {H} and {F_1} allowing us to show that the radius of convergence of {H} is given by the positive solution to {F_1(z)=1}.

First recall the set {G = \{0^n1\mid n\in S\}}. Given integers {k,n}, let {A_n^k} be the number of words in {\{0,1\}^n} that can be written as a concatenation of exactly {k} words from {G}. Note that {A_n^1 = {\mathbf{1}}_S(n-1)}, so that

\displaystyle F_1(z) = \sum_{n=1}^\infty A_n^1 z^n.

More generally, we consider the power series

\displaystyle F_k(z) = \sum_{n=1}^\infty A_n^k z^n.

A little thought reveals that {F_k(z)F_\ell(z) = F_{k+\ell}(z)}, since the coefficient of {z^n} in {F_k(z)F_\ell(z)} is given by

\displaystyle \sum_{m=1}^n A_{n-m}^k A_m^\ell,

which is equal to {A_n^{k+\ell}} (here {m} represents the length of the suffix formed by the last {\ell} of the {k+\ell} concatenated words from {G}). In particular, we get

\displaystyle F_k(z) = (F_1(z))^k.
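This identity is easy to test numerically; here is a sketch for the (arbitrarily chosen) example {S=\{1,2\}}, comparing the convolution coefficients against a direct count of concatenations:

```python
from itertools import product

S = {1, 2}                               # so G = {01, 001}
M = 12                                   # compare coefficients up to degree M
A1 = [1 if n - 1 in S else 0 for n in range(M)]   # A^1_n = 1_S(n-1)

def convolve(P, Q):                      # coefficients of the product series
    return [sum(P[n - m] * Q[m] for m in range(n + 1)) for n in range(M)]

A = {1: A1}
for k in range(2, 5):
    A[k] = convolve(A[k - 1], A1)        # coefficients of F_1(z)^k

def direct(n, k):                        # count concatenations of k words of G
    return sum(1 for gaps in product(S, repeat=k)
               if sum(g + 1 for g in gaps) == n)

assert all(A[k][n] == direct(n, k) for k in range(1, 5) for n in range(M))
```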

At this point, the natural thing to do is to say that {\#\mathcal{L}_n = \sum_{k\geq 1} A_n^k} and hence {H(z) = \sum_{k\geq 1} F_k(z)}. However, this is not quite correct because {\mathcal{L}_n} includes words that are not complete concatenations of words from {G} and so are not counted by any {A_n^k}. We return to this in a moment, but first point out that if this were true then we would have {H(z) = \sum_{k\geq 1} F_1(z)^k}, and so {H(z)} converges if {F_1(z)<1} and diverges if {F_1(z)>1}, which was our goal.

To make this precise, we observe that every word in {\mathcal{L}_n} is either {0^n} or is of the form {0^i w 0^j}, where {w} is a concatenation of exactly {k} words from {G} for some {k} (and hence is one of the {A_{n-(i+j)}^k} words counted above). Thus we have the bounds

\displaystyle \sum_{k\geq 1} A_n^k \leq \#\mathcal{L}_n \leq 1 + \sum_{k\geq 1} \sum_{i\geq 0} \sum_{j\geq 0} A_{n-(i+j)}^k.

In particular, for all {z\geq 0} we get

\displaystyle \sum_{k\geq 1} F_k (z) \leq H(z) \leq \sum_{n\geq 1} z^n + \sum_{k\geq 1} F_k (z) \left(\sum_{i\geq 0} z^i \right) \left(\sum_{j\geq 0} z^j\right) \ \ \ \ \ (3)

Writing {\hat H(z) = \sum_k F_k(z) = \sum_k F_1(z)^k}, we note that for every {z\in [0,1)} we have {\sum_n z^n < \infty} and so {\hat H(z)} converges if and only if {H(z)} converges. Thus {\hat H} and {H} have the same radius of convergence; in particular, the radius of convergence of {H(z)} is the unique {z} such that {F_1(z) = \sum_{n\in S} z^{n+1} = 1}, and by the earlier discussion we have {h(X) = \log(1/z)}, proving Theorem 1.

4. An ergodic theory proof

Now we sketch a proof that goes via the variational principle, relating an {S}-gap shift (on the finite alphabet {\{0,1\}}) to a full shift on a (possibly countable) alphabet. The combinatorial proof above is elementary and requires little advanced machinery; this proof, on the other hand, requires a number of (rather deep) results from ergodic theory and thermodynamic formalism, but has the advantage of illuminating various aspects of the structure of {S}-gap shifts.

Let {X} be an {S}-gap shift and let {Y} be the set of sequences {x\in X} which contain the symbol 1 infinitely often both forwards and backwards. By the Poincaré recurrence theorem, {\mu(Y)=1} for every ergodic measure other than the delta-measure on the fixed point {0^\infty}. Note that {Y} is {\sigma}-invariant but not compact (unless {S} is finite).

Let {Z = Y \cap [1]}, and let {F\colon Z\rightarrow Z} be the first return map. Thus {F(x) = \sigma^n(x)} for all {x\in [10^n 1]}. Note that {F} is topologically conjugate to the full shift {S^{\mathbb Z}} on the alphabet {S}, which we allow to be finite or countable. The conjugacy is given by {\pi\colon S^{\mathbb Z}\rightarrow Z} that takes {\vec n = \cdots n_{-1}.n_0n_1\cdots} to {\cdots 10^{n_{-1}}.10^{n_0}10^{n_1}1\cdots}.

Given an ergodic {\sigma}-invariant probability measure {\mu} on {X} with {\mu(Y)=1}, let {\mu_Z} be the induced {F}-invariant measure on {Z}. Then by Abramov’s formula, we have

\displaystyle h_\mu(X,\sigma) = \mu(Z) h_{\mu_Z}(Z,F).

Associate to each {\mu} the shift-invariant measure {\nu = (\pi^{-1})_*\mu_Z} on {S^{\mathbb Z}}. Then we have

\displaystyle h_\mu(X,\sigma) = \mu(Z) h_{\nu}(S^{\mathbb Z},\sigma).

Our goal is to relate the topological entropy on {(X,\sigma)} to the topological pressure of a suitable function on {(S^{\mathbb Z},\sigma)}. Let {\varphi\colon S^{\mathbb Z}\rightarrow {\mathbb R}} be the function taking {\vec n} to {n_0 + 1}, so that {\varphi} records the return time to {Z}, and observe that if {\mu} and {\nu} are identified as above, then Kac’s formula gives {\mu(Z) = \mu([1]) = 1/\int \varphi\,d\nu}, so that

\displaystyle h_\nu = \left(\int \varphi\,d\nu \right) h_\mu.

At this point we elide some technical details regarding Gurevich pressure, etc., and simply remark that for {t>0} we have

\displaystyle \begin{aligned} P(S^{\mathbb Z},-t\varphi) &= \lim_{k\rightarrow\infty} \frac 1k \log \sum_{(n_0,\dots,n_{k-1})\in S^k} \sup_{\vec m\in [n_0\cdots n_{k-1}]} e^{-t \sum_{i=0}^{k-1} \varphi(\sigma^i\vec m)} \\ &= \lim_{k\rightarrow\infty} \frac 1k \log \left(\sum_{n_0,\dots,n_{k-1}} e^{-t\sum_{i=0}^{k-1} (n_i + 1)} \right) \\ &= \lim_{k\rightarrow\infty} \frac 1k \log \left( \sum_{n\in S} e^{-t(n+1)} \right)^k = \log \sum_{n\in S} e^{-t(n+1)}, \end{aligned}

while by the variational principle

\displaystyle P(S^{\mathbb Z},-t\varphi) = \sup_\nu \left(h_\nu - t \int\varphi\,d\nu\right) = \sup_\nu \left(\int\varphi\,d\nu\right) (h_\mu - t).

We conclude that

\displaystyle \sup_\mu (h_\mu - t) \mu[1]^{-1} = \log \sum_{n\in S} e^{-t(n+1)}. \ \ \ \ \ (4)

Let {t} be such that the right-hand side is equal to 0; to prove Theorem 1 we need to prove that {h(X,\sigma) = t}. First observe that for every {\mu} we have {h_\mu - t \leq 0}, thus {h(X,\sigma) \leq t}. For the other inequality, let {\nu} be the Bernoulli measure on {S^{\mathbb Z}} that assigns weight {e^{-t(n+1)}} to the symbol {n}. Then {\nu} is an equilibrium state for {-t\varphi} on {S^{\mathbb Z}}, and by our choice of {t}, we have

\displaystyle h_\nu - t\int \varphi\,d\nu = 0,

so that in particular the left hand side of (4) vanishes and we get {h_\mu=t} for the measure {\mu} that corresponds to {\nu}. This shows that {h(X,\sigma) = t} and completes the proof of the theorem.

We note that this proof has the advantage of giving an explicit description of the MME. With a little more work it can be used to show that the MME is unique and has good statistical properties (central limit theorem, etc.).


Slowly mixing sets

There are two equivalent definitions of mixing for a measure-preserving dynamical system {(X,f,\mu)}. One is in terms of sets:

\displaystyle  \lim_{n\rightarrow\infty} |\mu(A \cap f^{-n}B) - \mu(A)\mu(B)| = 0 \ \ \ \ \ (1)

for all measurable {A,B\subset X}. The other is in terms of functions:

\displaystyle  \lim_{n\rightarrow\infty} \left\lvert \int \varphi \cdot(\psi\circ f^n)\,d\mu - \int\varphi\,d\mu \int\psi\,d\mu\right\rvert = 0 \ \ \ \ \ (2)

for all {\varphi,\psi\in L^2(X,\mu)}. In both cases one may refer to the quantity inside the absolute value as the correlation function, and write it as {C_n(A,B)} or {C_n(\varphi,\psi)}, so that (1) and (2) become {C_n(A,B)\rightarrow 0} and {C_n(\varphi,\psi)\rightarrow 0}, respectively.

Then it is natural to ask how quickly mixing occurs; that is, how quickly the correlation functions {C_n} decay to 0. Is the convergence exponential in {n}? Polynomial? Something else?

In a previous post, we discussed one method for establishing exponential decay of correlations using a spectral gap for a certain operator associated to the system. The result there stated that as long as {\varphi} and {\psi} are chosen from a suitable class of test functions (often Hölder continuous functions), then {C_n(\varphi,\psi)} decays exponentially.

A comment was made in that post that it is important to work with sufficiently regular test functions, because for arbitrary measurable functions, or for the setwise correlations {C_n(A,B)}, the decay may happen arbitrarily slowly even when the system is “as mixing as possible”. But no examples were given there — so I’d like to explain this a little more completely here.

Let {X=\{0,1\}^{\mathbb N}} be the space of infinite sequences of 0s and 1s, and let {f=\sigma} be the shift map. Let {\mu} be {(\frac 12,\frac 12)}-Bernoulli measure, so that writing {[x_1\cdots x_n]} for the {n}-cylinder containing all sequences in {X} that begin with the symbols {x_1,\dots,x_n}, we have {\mu[x_1\cdots x_n] = 2^{-n}}. This is isomorphic to the doubling map on the circle with Lebesgue measure, which was discussed in the previous post and which has exponential decay of correlations for Hölder observables {\varphi,\psi}. Nevertheless, we will show that there are sets {A,B\subset X} such that {C_n(A,B)} decays quite slowly.

We can think of {(X,f,\mu)} as modeling sequences of coin flips, with tails corresponding to the symbol 0, and heads to 1.

Let {B = [1]} be the set of all sequences in {X} that begin with 1. Then {B} corresponds to the event that the first flip is heads, and {\sigma^{-n}B} corresponds to the event that the {(n+1)}st flip is heads.

The description of {A} is a little more involved. Given {j\leq k \in {\mathbb N}}, let {A_{j,k} = \bigcup_{i=j}^{k-1} \sigma^{-i}[1]}, so that {x\in A_{j,k}} if and only if (at least) one of the symbols {x_{j+1},\dots,x_{k}} is 1. This corresponds to the event that heads appears at least once after the {j}th flip and no later than the {k}th flip. Note that {\mu(X \setminus A_{j,k}) = 2^{-(k-j)}}, and so {\mu(A_{j,k}) = 1-2^{-(k-j)}}.

Fix an increasing sequence of integers {k_1,k_2,\dots} with {k_1 = 1}, and let {A = \bigcap_{i=1}^\infty A_{k_i, k_{i+1}}} be the set of all sequences {x\in X} such that every subword {x_{k_i+1} \cdots x_{k_{i+1}}} contains at least one occurrence of the symbol 1.

Using the interpretation as coin flips, we see that the events {A_{k_i,k_{i+1}}} are independent, and so

\displaystyle  \mu(A) = \mathop{\mathbb P}(A) = \prod_{i=1}^\infty \mathop{\mathbb P}(A_{k_i,k_{i+1}}) = \prod_{i=1}^\infty (1-2^{-(k_{i+1} - k_i)}). \ \ \ \ \ (3)

We recall the following elementary lemma.

Lemma 1 Let {t_i\in {\mathbb R}} be a sequence of real numbers such that {0\leq t_i < 1} for all {i\in {\mathbb N}}. Then {\prod_{i=1}^\infty (1-t_i) > 0} if and only if {\sum_{i=1}^\infty t_i < \infty}.

It follows from Lemma 1 that {\mu(A)>0} if and only if

\displaystyle  \sum_{i=1}^\infty 2^{-(k_{i+1} - k_i)} < \infty. \ \ \ \ \ (4)
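Here is a quick numerical illustration of Lemma 1 and condition (4), as a Python sketch; the two gap sequences are arbitrary choices.

def partial_product(gap, N):
    # partial product in (3) over the first N blocks
    p = 1.0
    for i in range(1, N + 1):
        p *= 1.0 - 2.0 ** (-gap(i))
    return p

# constant gaps: the sum in (4) diverges, and the product tends to 0
print([partial_product(lambda i: 3, N) for N in (10, 100, 1000)])
# growing gaps k_{i+1} - k_i = i: the sum in (4) is finite, and the
# product stays bounded away from 0 (about 0.2888)
print([partial_product(lambda i: i, N) for N in (10, 100, 1000)])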

Thus from now on we assume that the sequence {k_i} is chosen such that (4) holds. Note that the correlation function {C_n(A,B)} can be computed via conditional probabilities: using

\displaystyle  \mathop{\mathbb P}(A \mid \sigma^{-n}B) = \frac{\mu(A \cap \sigma^{-n}B)}{\mu(B)}

(notice that this uses shift-invariance of {\mu}), we have

\displaystyle  \begin{aligned} C_n(A,B) &= \mu(A\cap \sigma^{-n}B) - \mu(A)\mu(B) \\ &= \mu(B) (\mathop{\mathbb P}(A \mid \sigma^{-n}B) - \mu(A)). \end{aligned} \ \ \ \ \ (5)

Recalling that {A} is the intersection of the independent events {A_{k_i,k_{i+1}}}, we let {j=j(n)} be such that {n\in [k_j, k_{j+1})}, and observe that {\sigma^{-n}(B) \subset A_{k_j,k_{j+1}}}. Thus

\displaystyle  \mathop{\mathbb P}(A \mid \sigma^{-n}B) = \prod_{i\neq j} (1-2^{-(k_{i+1}-k_i)}) = \frac{\mu(A)}{1-2^{-(k_{j+1}-k_j)}},

where the last equality uses the expression for {\mu(A)} from (3). Together with the expression (5) for {C_n(A,B)} in terms of the conditional probability {\mathop{\mathbb P}(A\mid \sigma^{-n}B)}, this gives

\displaystyle  \begin{aligned} C_n(A,B) &= \mu(A) \mu(B) \left( \frac 1{1-2^{-(k_{j+1}-k_j)}} - 1 \right) \\ &= \mu(A)\mu(B) \frac 1{2^{k_{j+1} - k_j} - 1}, \end{aligned} \ \ \ \ \ (6)

where we emphasise that {j} is a function of {n}.
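Before looking at a concrete gap sequence, here is a Monte Carlo sanity check of (5) and (6), as a Python sketch. The gap sequence and the time {n} are toy choices, and {A} is truncated to four blocks; by independence, the finite-block version of (6) is still exact.

import random

random.seed(0)
k = [1, 3, 6, 10, 14]                    # k_1, ..., k_5, with gaps 2, 3, 4, 4
gaps = [k[i + 1] - k[i] for i in range(4)]
n = 7                                    # n lies in [k_3, k_4) = [6, 10), so j(n) = 3

def in_A(x):
    # x[m] stores the coordinate x_{m+1}; block i needs a 1 among x_{k_i + 1}, ..., x_{k_{i+1}}
    return all(any(x[m] for m in range(k[i], k[i + 1])) for i in range(4))

trials, hits_A, hits_AB = 10**6, 0, 0
for _ in range(trials):
    x = [random.getrandbits(1) for _ in range(k[-1])]
    a = in_A(x)
    hits_A += a
    hits_AB += a and x[n]                # x[n] stores x_{n+1}, i.e. the event sigma^{-n}B
mu_A = 1.0
for g in gaps:
    mu_A *= 1.0 - 2.0 ** (-g)
print(hits_AB / trials - 0.5 * hits_A / trials)   # empirical C_n(A,B)
print(0.5 * mu_A / (2 ** gaps[2] - 1))            # prediction from (6): about 0.0192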

Example 1 Take {k_i = \lfloor 2 i\log_2 i \rfloor}, so that {k_{i+1} - k_i \approx 2 i \log_2 (1+\frac 1i) + 2\log_2(i+1)}, and {2^{k_{i+1} - k_i} \approx (i+1)^2 (1+\frac 1i)^{2i}}. Since {(1+\frac 1i)^{2i} \rightarrow e^2}, we see that (4) is satisfied, so that {\mu(A)>0}.

Moreover, {j=j(n)} satisfies {2j\log_2 j \leq n}, and so in particular {j\leq n}. This implies that {k_{j+1} - k_j \leq k_{n+1} - k_n}, and so we can estimate the correlation function using (6) by

\displaystyle  C_n(A,B) \geq \mu(A)\mu(B)\frac 1{2^{k_{n+1} - k_n} - 1} \approx \mu(A)\mu(B) \frac 1{n^2}.
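As a sanity check on the rate, the following Python sketch evaluates the right-hand side of (6), dropping the factor {\mu(A)\mu(B)}, for this gap sequence. The quantity {n^2 C_n(A,B)/(\mu(A)\mu(B))} stays bounded away from 0 and grows only slowly, so the decay is no faster than {1/n^2} up to logarithmic factors.

import math

def k(i):
    # k_i = floor(2 i log2 i), with k_1 = 1 as in the construction
    return 1 if i == 1 else math.floor(2 * i * math.log2(i))

def corr_over_muAmuB(n):
    # find j = j(n) with k_j <= n < k_{j+1}, then apply (6)
    j = 1
    while k(j + 1) <= n:
        j += 1
    return 1.0 / (2.0 ** (k(j + 1) - k(j)) - 1)

for n in (10, 100, 1000, 10000):
    print(n, n * n * corr_over_muAmuB(n))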

The example shows that the correlation function of sets may only decay polynomially, even if the system has exponential decay of correlations for regular observable functions. In fact, we can use the construction above to produce sets with correlations that decay more or less as slowly as we like along some subsequence of times {n}.

Theorem 2 Let {\gamma_n>0} be any sequence of positive numbers converging to 0. Then there is a sequence {k_i} such that the sets {A,B} from the construction above have {\limsup_{n\rightarrow\infty} C_n(A,B)/\gamma_n = \infty}.

The result says that no matter how slowly {\gamma_n} converges to 0, we can choose {A,B} such that {C_n(A,B)} is not bounded above by {K\gamma_n} for any constant {K}.

To prove Theorem 2, we will choose {k_i} such that {C_n(A,B)/\gamma_n} is large whenever {n\in [k_j, k_{j+1})} with {j} even. Let {c_i,d_i} be any increasing sequences of integers (conditions on these will be imposed soon) and define {k_i} recursively by

\displaystyle  k_1 = 1, \quad k_{2i} = k_{2i-1} + c_i, \qquad k_{2i+1} = k_{2i} + d_i.

Because {c_i,d_i} are increasing, we have

\displaystyle  \sum_{i=1}^\infty 2^{-(k_{i+1} - k_i)} = \sum_{i=1}^\infty \left( 2^{-c_i} + 2^{-d_i} \right) < \infty,

so (4) holds and {\mu(A)>0}. The idea is that we will take {c_i \gg d_i}, so that when {n\in [k_{2j},k_{2j+1})}, we have {k_{2j+1} - k_{2j} = d_j} small relative to {k_{2j}}, and hence to {n}.

Let’s flesh this out a little. Given {n\in [k_{2j},k_{2j+1})}, it follows from (6) that

\displaystyle  C_n(A,B) = \mu(A)\mu(B) \frac 1{2^{k_{2j+1} - k_{2j}} - 1} \geq \mu(A)\mu(B) 2^{-d_j}. \ \ \ \ \ (7)

On the other hand, {n\geq k_{2j} \geq \sum_{i=1}^j c_i}. Now we may take {d_j} to be any increasing sequence ({d_j=j} will do just fine) and define {c_j} recursively in terms of {c_1,\dots, c_{j-1}} and {d_j}. (Note that {\mu(A)} depends on the whole sequence {k_i}, but since increasing integer sequences satisfy {c_i,d_i\geq i}, we have the a priori bound {\mu(A)\geq \prod_{i=1}^\infty (1-2^{-i})^2 > 0}, and we may use this lower bound in place of {\mu(A)} in what follows.) We do this by observing that since {\gamma_n \rightarrow 0}, given any {c_1,\dots, c_{j-1}} and {d_j}, we can take {c_j} large enough so that for every {n\geq \sum_{i=1}^j c_i}, we have

\displaystyle  j\gamma_n \leq \mu(A) \mu(B) 2^{-d_j}.

In particular, (7) gives {C_n(A,B) \geq j\gamma_n} for every {n\in [k_{2j},k_{2j+1})}. Sending {j\rightarrow\infty} completes the proof of Theorem 2.
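To get a feeling for how quickly the gaps {c_j} must grow, here is a small Python sketch for the concrete (and entirely arbitrary) choice {\gamma_n = 1/\log n} (for {n\geq 2}) and {d_j = j}, using the a priori bound {\mu(A)\mu(B) \geq m_0 := \frac 12 \prod_{i\geq 1}(1-2^{-i})^2} from the remark above. The requirement {j\gamma_n \leq m_0 2^{-d_j}} for all {n\geq k_{2j}} forces {k_{2j} \geq e^{j2^j/m_0}}, so the script prints {\log_{10}} of this threshold rather than the threshold itself.

import math

m0 = 0.5 * math.prod((1 - 2.0 ** (-i)) ** 2 for i in range(1, 60))   # about 0.0417
for j in range(1, 6):
    # need j * gamma_n <= m0 * 2^{-j} for all n >= k_{2j},
    # i.e. k_{2j} >= exp(j * 2^j / m0); print log10 of that threshold
    print(j, j * 2 ** j / (m0 * math.log(10)))

Already at {j=5} the threshold exceeds {10^{1600}}, which gives some idea of how extreme the construction must be to beat even a logarithmically decaying {\gamma_n}.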


Law of large numbers for dependent but uncorrelated random variables

One of the fundamental results in probability theory is the strong law of large numbers, which was discussed in an earlier post under the guise of the Birkhoff ergodic theorem.

Suppose we have a sequence of random variables {t_n} which take values in {{\mathbb N}}. If the sequence {t_n} is independent and identically distributed with {\mathop{\mathbb E}[t_n] = T}, then the strong law of large numbers shows that {\frac 1n(t_1+\cdots + t_n)\rightarrow T} almost surely.

It is often the case, however, that one or both of these conditions fail; there may be some dependence between the variables {t_n}, or the distribution may not be identical for all values of {n}. For example, the following situation arose recently in a project I have been working on: there is a sequence of random variables {t_n} as above, which has the property that no matter what values {t_1,\dots,t_{n-1}} take, the random variable {t_n} has expected value at most {T}. In the language of conditional expectation, we have

\displaystyle  \mathop{\mathbb E}[t_n \mid t_1,\dots, t_{n-1}] \leq T.

This is true despite the fact that the distribution of {t_n} may vary according to the values of {t_1,\dots,t_{n-1}}. In what follows we will write {\mathcal{F}_n} for the {\sigma}-algebra generated by {t_1,\dots, t_n} and write the above condition as

\displaystyle  \mathop{\mathbb E}[t_n \mid \mathcal{F}_{n-1}]\leq T.

It is plausible to conjecture that in this setting one still has {\varlimsup \frac 1n(t_1+\cdots+t_n) \leq T} almost surely. And indeed, it turns out that a statement very close to this can be proved.

Let {\mathop{\mathcal F}_n} be an increasing sequence of {\sigma}-algebras and let {t_n} be a sequence of {{\mathbb N}}-valued random variables such that {t_n} is {\mathop{\mathcal F}_n}-measurable. Suppose that there is {p\colon {\mathbb N} \rightarrow [0,1]} such that

\displaystyle  \mathop{\mathbb P}[t_n=t \mid \mathcal{F}_{n-1}] \leq p(t) \ \ \ \ \ (1)

and moreover,

\displaystyle   T := \sum_{t=1}^\infty t \cdot p(t) < \infty. \ \ \ \ \ (2)

The following result is proved by modifying one of the standard proofs of the strong law of large numbers, which can be found, for example, in Billingsley’s book “Probability and Measure”, where it is Theorem 22.1. There is also a discussion on Terry Tao’s blog, which emphasises the role of the main tools in this proof: the moment method and truncation.

Proposition 1 If {t_n} is any sequence of random variables satisfying (1) and (2), then {\varlimsup_{n\rightarrow\infty} \frac 1n \sum_{k=1}^n t_k \leq T} almost surely.

Proof: Consider the truncated random variables {s_n = t_n {\mathbf{1}}_{[t_n\leq n]}}, and note that by (1) and (2) we have

\displaystyle  0 \leq s_n \leq t_n \Rightarrow \mathop{\mathbb E}[s_n \mid \mathcal{F}_{n-1}] \leq T.

Now consider the random variables {r_n = s_n + T - \mathop{\mathbb E}[s_n \mid \mathcal{F}_{n-1}]}. Note that {r_n} is {\mathcal{F}_n}-measurable, is non-negative (because {\mathop{\mathbb E}[s_n \mid \mathcal{F}_{n-1}] \leq T}), and moreover

\displaystyle  \mathop{\mathbb E}[r_n \mid \mathcal{F}_{n-1}] = T \text{ for all } n. \ \ \ \ \ (3)

The most important consequence of this is that the sequence {r_n} is uncorrelated: for {i<j}, conditioning on {\mathcal{F}_{j-1}} gives {\mathop{\mathbb E}[r_ir_j] = \mathop{\mathbb E}[r_i \mathop{\mathbb E}[r_j \mid \mathcal{F}_{j-1}]] = T\mathop{\mathbb E}[r_i] = T^2 = \mathop{\mathbb E}[r_i]\mathop{\mathbb E}[r_j]}, which in turn implies that {\mathop{\mathbb E}[(r_i - T)(r_j-T)] = 0}. This means that the variance is additive for the sequence {r_n}:

\displaystyle  \begin{aligned} \mathop{\mathbb E}\Big[ \Big(\sum_{k=1}^n (r_k -T)\Big)^2\Big] &= \sum_{k=1}^n \mathop{\mathbb E}[(r_k -T)^2] + 2\sum_{1\leq i<j\leq n} \mathop{\mathbb E}[(r_i-T)(r_j-T)] \\ &= \sum_{k=1}^n \mathop{\mathbb E}[(r_k -T)^2]. \end{aligned}

Let {X_n = \frac 1n (r_1 + \cdots + r_n) - T}, so that {\mathop{\mathbb E}[X_n] = 0}. We can estimate {\mathop{\mathbb E}[X_n^2]} using the previous computation:

\displaystyle  \begin{aligned} \mathop{\mathbb E}[X_n^2] &= \frac 1{n^2} \sum_{k=1}^n \mathop{\mathbb E}[(r_k-T)^2] = \frac 1{n^2} \sum_{k=1}^n (\mathop{\mathbb E}[r_k^2] - 2T\mathop{\mathbb E}[r_k] + T^2) \\ &= \frac 1{n^2} \left( \sum_{k=1}^n \mathop{\mathbb E}[r_k^2] - nT^2\right). \end{aligned} \ \ \ \ \ (4)

Recall that {p(t)} is the sequence bounding {\mathop{\mathbb P}[t_n = t \mid \mathop{\mathcal F}_{n-1}]}, and let {\bar t} be such that {\sum_{t\geq \bar t} p(t) \leq 1}. Let {Y} be a random variable taking the value {t} with probability {p(t)} for {t\geq \bar t} (and the value 0 with the remaining probability). Note that by the definition of {s_n} and {r_n}, we have {r_n \leq s_n + T \leq n+T}, and that {\mathop{\mathbb P}[s_k = t] \leq \mathop{\mathbb P}[t_k = t] \leq p(t)} for every {t\geq 1}; since {\mathop{\mathbb E}[s_k] \leq T} as well, this gives

\displaystyle  \mathop{\mathbb E}[r_k^2] \leq \mathop{\mathbb E}[(s_k+T)^2] \leq 3T^2 + \sum_{t=1}^{k} t^2 p(t) \leq C + \mathop{\mathbb E}[Y^2 {\mathbf{1}}_{[Y\leq k+T]}]

for some fixed constant {C}. Note that the final expression is non-decreasing in {k}, and so together with (4) we have

\displaystyle  \mathop{\mathbb E}[X_n^2] \leq \frac 1n \left( C' + \mathop{\mathbb E}[Y^2 {\mathbf{1}}_{[Y\leq n+T]}] \right), \ \ \ \ \ (5)

where again {C'} is a fixed constant. Fixing {\varepsilon>0} and using (5) in Chebyshev’s inequality yields

\displaystyle  \mathop{\mathbb P}[|X_n| \geq \varepsilon] \leq \frac 1{\varepsilon^2 n} \left( C' + \mathop{\mathbb E}[Y^2 {\mathbf{1}}_{[Y\leq n+T]}] \right). \ \ \ \ \ (6)

Now let {n_k} be a sequence of integers such that {\lim_k \frac{n_{k+1}}{n_k} =: \alpha > 1}. We see that (6) yields

\displaystyle  \begin{aligned} \sum_{k=1}^\infty \mathop{\mathbb P}[|X_{n_k}| \geq \varepsilon] &\leq \frac 1{\varepsilon^2} \sum_{k=1}^\infty n_k^{-1}\left( C' + \mathop{\mathbb E}[Y^2 {\mathbf{1}}_{[Y\leq n_k+T]}]\right) \\ &\leq C'' + \frac 1{\varepsilon^2} \mathop{\mathbb E} \left[ Y^2 \sum_{k=1}^\infty n_k^{-1} {\mathbf{1}}_{[Y\leq n_k+T]}\right], \end{aligned} \ \ \ \ \ (7)

where {C''<\infty} depends on {\alpha} and on {\varepsilon}. Observe that

\displaystyle  \sum_{k=1}^\infty n_k^{-1} {\mathbf{1}}_{[Y\leq n_k +T]} \leq \sum_{k=k_0(Y)}^\infty n_k^{-1},

where {k_0(Y)} is the minimal value of {k} for which {n_k\geq Y-T}. Because {n_k} grows like {\alpha^k}, the sum is bounded above by a constant times {n_{k_0(Y)}^{-1}}, so that together with (7) we have

\displaystyle  \sum_{k=1}^\infty \mathop{\mathbb P}[|X_{n_k}| \geq \varepsilon] \leq C'' + C''' \mathop{\mathbb E}[Y] < \infty,

where finiteness of {\mathop{\mathbb E}[Y]} is the crucial place where we use (2).

Using the first Borel-Cantelli lemma and taking an intersection over all rational {\varepsilon>0} gives

\displaystyle  \lim_{k\rightarrow\infty} \frac 1{n_k} \sum_{j=1}^{n_k} r_j = T \text{ a.s.} \ \ \ \ \ (8)

Let {Z_n = \sum_{j=1}^n r_j}. Because {r_j\geq 0} we have {Z_{n_k} \leq Z_n \leq Z_{n_{k+1}}} for all {n_k\leq n\leq n_{k+1}}, and in particular

\displaystyle  \frac {n_k}{n_{k+1}} \frac{Z_{n_k}}{n_k} \leq \frac{Z_n}{n} \leq \frac{n_{k+1}}{n_k} \frac{Z_{n_{k+1}}}{n_{k+1}}.

Taking the limit and using (8) gives

\displaystyle  \frac 1\alpha T \leq \varliminf_{n\rightarrow\infty} \frac 1n Z_n \leq \varlimsup_{n\rightarrow\infty} \frac 1n Z_n \leq \alpha T \text{ a.s.}

Taking an intersection over all rational {\alpha>1} gives

\displaystyle  \lim_{n\rightarrow\infty} \frac 1n \sum_{k=1}^n r_k = T \text{ a.s.} \ \ \ \ \ (9)

Finally, we recall from the definition of {s_n} and {r_n} that {t_n\leq r_n} whenever {t_n\leq n}. In particular, using (1) we may observe that

\displaystyle  \sum_{n=1}^\infty \mathop{\mathbb P}[t_n > r_n] \leq \sum_{n=1}^\infty \mathop{\mathbb P}[t_n > n] \leq \sum_{n=1}^\infty \sum_{t > n} p(t) = \sum_{t=1}^\infty (t-1)\,p(t) \leq T < \infty

and apply Borel-Cantelli again to deduce that with probability one, {t_n > r_n} for at most finitely many values of {n}. In particular, (9) implies that

\displaystyle  \varlimsup_{n\rightarrow\infty} \frac 1n \sum_{k=1}^n t_k \leq T \text{ a.s.} \ \ \ \ \ (10)

which completes the proof of Proposition 1. \Box
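To illustrate Proposition 1, here is a toy Python simulation. The dependent sequence here is a made-up example: the conditional law of {t_n} given the past depends on {t_{n-1}}, but both branches are dominated by {p(t) = 2^{1-t}}, so (1) and (2) hold with {T = \sum_t t\cdot 2^{1-t} = 4}.

import random

def step(prev):
    # conditional law of t_n given t_{n-1} = prev; both branches satisfy
    # P[t_n = t | past] <= p(t) = 2^{1-t}, as required by (1)
    if prev == 1:
        t = 1                      # geometric law: P[t] = 2^{-t} <= p(t)
        while random.random() < 0.5:
            t += 1
        return t
    # P[1] = 3/4 <= p(1) = 1 and P[2] = 1/4 <= p(2) = 1/2
    return 1 if random.random() < 0.75 else 2

random.seed(0)
t, total = 1, 0
for n in range(1, 10**6 + 1):
    t = step(t)
    total += t
    if n % 250000 == 0:
        print(n, total / n)        # running averages; they stay below T = 4

Note that the proposition only promises {\varlimsup \frac 1n \sum_{k=1}^n t_k \leq T}; here the actual averages hover well below 4, since the bound (1) is far from tight.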


Equidistribution for random rotations

Two very different types of dynamical behaviour are illustrated by a pair of very well-known examples on the circle: the doubling map and an irrational rotation. On the unit circle in {{\mathbb C}}, the doubling map is given by {z\mapsto z^2}, while an irrational rotation is given by {z\mapsto e^{2\pi i\theta}z} for some irrational {\theta}.

Lebesgue measure (arc length) is invariant for both transformations. For the doubling map, it is just one of many invariant measures; for an irrational rotation, it turns out to be the only invariant measure. We say that the doubling map exhibits hyperbolic behaviour, while the irrational rotation exhibits elliptic behaviour.

Systems with hyperbolicity have many invariant measures, as we saw in a series of previous posts. The goal of this post is to recall a proof that the opposite is true for an irrational rotation: Lebesgue is the unique invariant measure, and in particular every orbit equidistributes with respect to Lebesgue measure. Then we consider orbits generated by random rotations, where instead of rotating by a fixed angle {\theta}, we rotate by either {\theta_1} or {\theta_2}, with the choice of which to use being made at each time step by flipping a coin.

1. Invariant measures via {C(X)}

First we recall some basic facts from ergodic theory for topological dynamical systems. Given a compact metric space {X} and a continuous map {f\colon X\rightarrow X}, let {\mathcal{M}} denote the space of Borel probability measures on {X}. Writing {C(X)} for the space of all continuous functions {X\rightarrow {\mathbb C}}, recall that {C(X)^*} is the space of all continuous linear functionals {C(X)\rightarrow {\mathbb C}}. Then {C(X)^*} is (isomorphic to) the space of finite complex Borel measures on {X}. (This last assertion uses the fact that {X} is a compact metric space and combines various results from “Linear Operators” by Dunford and Schwartz, but a more precise reference will have to wait until I have the book available to look at.)

Using this fact together with the polar decomposition for finite complex Borel measures, we have the following: for every {L\in C(X)^*}, there is {\mu\in \mathcal{M}} and a measurable function {\theta\colon X\rightarrow {\mathbb R}} such that

\displaystyle  L(\varphi) = \|L\| \int \varphi e^{i\theta} \,d\mu \text{ for all } \varphi\in C(X). \ \ \ \ \ (1)

Note that although {C(X)^*} is endowed with the operator norm, we will usually think of it as a topological vector space with the weak* topology. Thus {\mathcal{M}} embeds naturally into {C(X)^*}, and (1) shows that every element of {C(X)^*} can be described in a canonical way in terms of {\mathcal{M}}.

Let {P\subset C(X)} be a countable set whose span is dense in {C(X)}. Then to every {L\in C(X)^*} we can associate the sequence {\Phi_P(L) = \{ L(p) \mid p\in P\} \subset {\mathbb C}^P}, where depending on the context we may index using either {{\mathbb N}} or {{\mathbb Z}}. If {P} is bounded then {\Phi_P(L)\in \ell^\infty} for every {L\in C(X)^*}, and so such a {P} defines a linear map {\Phi_P\colon C(X)^*\rightarrow \ell^\infty}.

Because {L} is determined by the values {L(p)} (by linearity and continuity of {L}, and density of the span of {P}), the map {\Phi_P} is 1-1. In particular, it is an isomorphism onto its image, which we denote by

\displaystyle  V_P := \Phi_P(C(X)^*) \subset \ell^\infty.

Note that {V_P \neq \ell^\infty} because {C(X)^*} is separable and {\ell^\infty} is not.

It is straightforward to see that {\Phi_P} is continuous, and its inverse is also continuous on {V_P}. Thus we can translate questions about {C(X)^*}, and in particular about {\mathcal{M}}, into questions about {\ell^\infty}.

Remark 1 It is a nontrivial problem to determine which elements of {\ell^\infty} correspond to elements of {C(X)^*}, and also to determine which of those sequences correspond to actual measures (elements of {\mathcal{M}}). We will not need to address either of these problems here.

The action {f\colon X\rightarrow X} induces an action on {C(X)} by {\varphi\mapsto \varphi\circ f}, and hence on {C(X)^*} by duality. This action {f_*\colon C(X)^*\rightarrow C(X)^*} is given by

\displaystyle  (f_*L)(\varphi) = L(\varphi\circ f). \ \ \ \ \ (2)

In particular, {f} also induces an action {f_*\colon \mathcal{M}\rightarrow\mathcal{M}} by

\displaystyle  \int\varphi\,d(f_*\mu) = \int (\varphi \circ f)\,d\mu. \ \ \ \ \ (3)

A measure {\mu} is {f}-invariant iff it is a fixed point of {f_*}. Let {f_P} be the action induced by {f} on {V_P\subset \ell^\infty}; that is,

\displaystyle  f_P(\Phi_P(\mu)) = \Phi_P(f_*\mu). \ \ \ \ \ (4)

If {P} can be chosen so that {f_P} takes a particularly nice form, then this can be used to understand what invariant measures {f} has, and how empirical measures converge.

Let us say more clearly what is meant by convergence of empirical measures. Given {x\in X} and {n\in {\mathbb N}}, let {\mathcal{E}_n(x) = \frac 1n \sum_{k=0}^{n-1} \delta_{f^kx}} be the empirical measure along the orbit segment {x,f(x),\dots,f^{n-1}(x)}. Let {V(x)\subset \mathcal{M}} be the set of weak* accumulation points of the sequence {\mathcal{E}_n(x)}. By compactness of {\mathcal{M}}, the set {V(x)} is non-empty, and it is a standard exercise to show that every measure in {V(x)} is {f}-invariant.

In particular, if {(X,f)} is uniquely ergodic, then it only has one invariant measure {\mu}, and so {V(x) = \{\mu\}} for every {x\in X}. In this case we have {\mathcal{E}_n(x) \rightarrow \mu} for every {x}, and it is reasonable to ask how quickly this convergence occurs.

2. Irrational rotations

Now we specialise to the case of an irrational rotation. Let {X=S^1\subset {\mathbb C}} be the unit circle, fix {\theta\in {\mathbb R}} irrational, and let {f\colon X\rightarrow X} be given by {f(z) = e^{2\pi i\theta}z}. We will show that {\mu\in \mathcal{M}} is {f}-invariant iff it is Lebesgue measure, and then examine what happens in a broader setting.

Given {n\in {\mathbb Z}}, let {p_n(z) = z^n}, and let {P = \{p_n \mid n\in {\mathbb Z}\}}. Then the span of {P} contains all functions {S^1\rightarrow {\mathbb C}} that are polynomials in {z} and {\bar{z}}, thus it is a subalgebra of {C(S^1)} that contains the constant functions, separates points, and is closed under complex conjugation. By the Stone–Weierstrass theorem, this span is dense in {C(S^1)}, and since {P} is bounded the discussion from above gives an isomorphism {\Phi\colon C(S^1)^* \rightarrow V\subset \ell^\infty}, where we suppress {P} in the notation. This isomorphism is given by

\displaystyle  \Phi(L)_n = L(p_n) \ \ \ \ \ (5)

for a general {L\in C(S^1)^*}, and for {\mu\in \mathcal{M}} we write

\displaystyle  \Phi(\mu)_n = \int_{S^1} z^n \,d\mu(z). \ \ \ \ \ (6)

The sequence {\Phi(\mu)} is the sequence of Fourier coefficients associated to the measure {\mu}. The choice of {p_n} means that the action that {f} induces on {V\subset \ell^\infty} takes a simple form: the Fourier coefficients of {L} and {f_*L} are related by

\displaystyle  \Phi(f_*L)_n = (f_*L)(z^n) = L((f(z))^n) = L\left(e^{2\pi i \theta n} z^n\right) = e^{2\pi i \theta n} \Phi(L)_n.

Thus if {\mu} is invariant, we have {\Phi(\mu)_n = e^{2\pi i \theta n} \Phi(\mu)_n} for all {n\in {\mathbb Z}}. Because {\theta} is irrational, we have {e^{2\pi i\theta n} \neq 1} for all {n\neq 0}, and so {\Phi(\mu)_n=0} for every {n\neq 0}. Thus the only non-zero Fourier coefficient is {\Phi(\mu)_0 = \int {\mathbf{1}} \,d\mu(z) = 1}. Because {\Phi} is an isomorphism between {C(S^1)^*} and {V\subset \ell^\infty}, this shows that the only {L\in C(S^1)^*} with {f_*L=L} and {L({\mathbf{1}})=1} is Lebesgue measure. In particular, {f} is uniquely ergodic, with Lebesgue as the only invariant probability measure, and thus for any {z\in S^1}, the empirical measures {\mathcal{E}_n(z)} converge to Lebesgue.
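This convergence is easy to watch numerically. The following Python sketch computes the modulus of the {k}th Fourier coefficient of {\mathcal{E}_n(z)}, which up to the unimodular factor {z^k} is the Weyl sum {\frac 1n \sum_{j=0}^{n-1} e^{2\pi i k\theta j}}; the rotation number {\theta = \sqrt 2} is an arbitrary choice. For a rotation these coefficients decay like {C_k/n}, as one sees by summing the geometric series.

import cmath, math

theta = math.sqrt(2)                 # any irrational rotation number
for k in (1, 2, 5):                  # non-zero Fourier modes
    for n in (10**2, 10**3, 10**4, 10**5):
        s = sum(cmath.exp(2j * math.pi * k * theta * j) for j in range(n))
        # modulus of the k-th Fourier coefficient of E_n(z); same for every start z
        print(k, n, abs(s) / n)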

3. Random rotations

Consider a sequence of points {z_1,z_2,z_3,\dots\in S^1}, and let {m_n\in \mathcal{M}} be the average of the point masses on the first {n} points of the sequence:

\displaystyle  m_n = \frac 1n \sum_{k=1}^n \delta_{z_k}, \qquad\qquad m_n(\varphi) = \frac 1n \sum_{k=1}^n \varphi(z_k). \ \ \ \ \ (7)

We say that the sequence {z_n} equidistributes if {m_n} converges to Lebesgue on {S^1} in the weak* topology.

The previous sections showed that if the points of the sequence are related by {z_{n+1} = e^{2\pi i \theta} z_n}, where {\theta} is irrational, then the sequence equidistributes. A natural generalisation is to ask what happens when the points {z_n} are related not by a fixed rotation, but by a randomly chosen rotation.

Here is one way of making this precise. Let {\Omega = \{1,2\}^{\mathbb N}} be the set of infinite sequences of 1s and 2s, and let {\mu} be the {\left(\frac 12, \frac 12\right)}-Bernoulli measure on {\Omega}, so that all sequences of length {n} are equally likely. Fix real numbers {\theta_1} and {\theta_2}, and fix {z_1\in S^1}. Given {\omega\in \Omega}, consider the sequence {z_n(\omega)} given by

\displaystyle  z_{n+1}(\omega) = e^{2\pi i \theta_{\omega_n}} z_n(\omega). \ \ \ \ \ (8)

Then one may ask whether or not {z_n(\omega)} equidistributes almost surely (that is, with probability 1 w.r.t. {\mu}). The remainder of this post will be dedicated to proving the following result.

Theorem 1 If either of {\theta_1} or {\theta_2} is irrational, then {z_n(\omega)} equidistributes almost surely.

Remark 2 The proof given here follows a paper by Lagarias and Soundararajan, to which I was referred by Lucia on MathOverflow.

Using Fourier coefficients as in the previous section, we have that {z_n(\omega)} equidistributes iff all the non-constant Fourier coefficients of {m_n(\omega)} converge to zero; that is, iff {\Phi(m_n(\omega))_k \rightarrow 0} as {n\rightarrow\infty} for all {k\neq 0}. This is Weyl's criterion for equidistribution.

Fix a value of {k\neq 0}, which will be suppressed in the notation from now on. Write {a_n} for the absolute value of the {k}th Fourier coefficient of {m_n(\omega)}, and note that

\displaystyle  a_n := |\Phi(m_n(\omega))_k| = |m_n(\omega)(z^k)| = \frac 1n \left|\sum_{j=1}^n z_j(\omega)^k\right|. \ \ \ \ \ (9)
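Before outlining the proof, here is a quick Python simulation suggesting that {a_n\rightarrow 0} almost surely; the angles and the mode {k} are arbitrary choices, and {y_n} is the lift of {z_n} used in the proof of Lemma 2 below.

import cmath, math, random

theta1, theta2 = math.sqrt(2), 0.25       # one irrational angle is enough
k = 1
random.seed(1)
y, S = 0.0, 0j                            # z_n = e^{2 pi i y_n}, starting from z_1 = 1
for n in range(1, 10**5 + 1):
    S += cmath.exp(2j * math.pi * k * y)  # add z_n^k to the sum in (9)
    if n in (10**2, 10**3, 10**4, 10**5):
        print(n, abs(S) / n)              # a_n; tends to 0 as n grows
    y += theta1 if random.random() < 0.5 else theta2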

The outline of the proof is as follows.

  1. Show that there is a constant {C} such that the expected value of {a_n^2} is at most {C/n}.
  2. Given {\delta>0}, show that there is a constant {C'} such that the probability that {a_n} exceeds {\delta} is at most {C'/n}.
  3. Find an exponentially increasing sequence {n_j\rightarrow\infty} such that if {a_{n_j}\leq \delta}, then {a_n\leq 2\delta} for every {n\in [n_j,n_{j+1}]}.
  4. Use the Borel–Cantelli lemma to deduce that with probability 1, {a_{n_j}} exceeds {\delta} only finitely often, hence {a_n} exceeds {2\delta} only finitely often. Since {\delta>0} was arbitrary this shows that {a_n\rightarrow 0}.

Step 1

Given {\xi\in {\mathbb R}}, let {\|\xi\|} denote the distance between {\xi} and the nearest integer.

Lemma 2

\displaystyle  \mathop{\mathbb E}_\mu[a_n^2] \leq \left( 1 + \frac 1{\|k\theta_1\|^2 + \|k\theta_2\|^2}\right) \frac 1n \ \ \ \ \ (10)

Proof: Let {y_1\in {\mathbb R}} be such that {z_1 = e^{2\pi i y_1}}, and define {y_n} recursively by {y_{n+1} = y_n + \theta_{\omega_n}}, so that {z_n = e^{2\pi i y_n}}. Then

\displaystyle  \begin{aligned} a_n^2 &= \frac 1{n^2} \left\lvert \sum_{j=1}^n z_j^k \right\rvert^2 = \frac 1{n^2} \left(\sum_{\ell=1}^n z_\ell^k\right) \left(\sum_{j=1}^n \overline{z_j^k}\right) \\ &= \frac 1{n^2} \left( n + \sum_{\ell\neq j} z_\ell^k z_j^{-k} \right) = \frac 1{n^2} \left( n + \sum_{\ell\neq j} e^{2\pi i k(y_\ell - y_j)} \right). \end{aligned}

Using the fact that {z+\bar{z} = 2\Re(z)}, we have

\displaystyle  \sum_{\ell \neq j} e^{2\pi i k(y_\ell - y_j)} = 2\Re \sum_{1\leq \ell < j \leq n} e^{2\pi i k(y_\ell - y_j)}.

If {j - \ell = r > 0}, then {y_\ell - y_j = -\sum_{m=\ell}^{j-1} \theta_{\omega_m}}, and so

\displaystyle  \begin{aligned} \mathop{\mathbb E}_\mu[e^{2\pi i k (y_\ell - y_j)}] &= \frac 1{2^r} \sum_{\omega_\ell,\dots, \omega_{j-1}} e^{-2\pi i k \sum_{m=\ell}^{j-1} \theta_{\omega_{m}}} = \frac 1{2^r} \sum \prod_{m=\ell}^{j-1} e^{-2\pi i k \theta_{\omega_m}} \\ &= \overline{\left(\frac{e^{2\pi i k\theta_1} + e^{2\pi i k\theta_2}}{2}\right)^r}, \end{aligned}

where the sums are over all {\omega_m\in \{1,2\}} for {\ell\leq m<j}. Since there are {n-r} pairs {\ell<j} with {j - \ell = r}, and since a complex number and its conjugate have the same real part, we have

\displaystyle  \mathop{\mathbb E}_\mu[a_n^2] = \frac 1{n^2}\left(n + 2\Re \sum_{r=1}^n (n-r)z^r\right), \ \ \ \ \ (11)

where {z=\frac 12(e^{2\pi i k\theta_1} + e^{2\pi i k\theta_2})}. Now

\displaystyle  \begin{aligned} \sum_{r=1}^n (n-r) z^r &= \sum_{r=1}^{n-1} z^r + \sum_{r=1}^{n-2} z^r + \cdots + \sum_{r=1}^1 z^r \\ &= \frac z{1-z}\sum_{s=1}^{n-1} (1-z^s) = \frac z{1-z} \left( n-\sum_{s=0}^{n-1} z^s \right), \end{aligned}

and since {|z|\leq 1} (and {z\neq 1}, because at least one of {\theta_1,\theta_2} is irrational) we have

\displaystyle  \left\lvert \sum_{r=1}^n (n-r) z^r \right\rvert \leq \frac{2n}{|1-z|}.

Together with (11), this gives

\displaystyle  \mathop{\mathbb E}_\mu[a_n^2] \leq \frac 1n \left( 1 + \frac 4{|1-z|}\right) = \frac 1n\left( 1 + \frac 8{|2 - e^{2\pi i k \theta_1} - e^{2\pi i k \theta_2}|} \right).

Using the fact that {|w|\geq |\Re w|} and that {\Re(1-e^{2i\xi}) = 1-\cos(2\xi) = 2\sin^2\xi}, we have

\displaystyle  \begin{aligned} |2 - e^{2\pi ik\theta_1} - e^{2\pi ik\theta_2}| &\geq 2(\sin^2 \pi k\theta_1 + \sin^2 \pi k\theta_2), \end{aligned}

and so

\displaystyle  \mathop{\mathbb E}_\mu[a_n^2] \leq \frac 1n\left(1 + \frac 4{\sin^2\pi k\theta_1 + \sin^2\pi k\theta_2}\right).

Finally, for {|\xi|\leq \frac 12} we have {|\sin \pi\xi| \geq |2\xi|}; applying this with {\xi = \|k\theta_i\|} (note that {\sin^2 \pi k\theta_i = \sin^2 \pi \|k\theta_i\|} by periodicity) proves the bound in (10). \Box

Because one of {\theta_1,\theta_2} is irrational, the denominator in (10) is positive, and so {C := 1 + (\|k\theta_1\|^2 + \|k\theta_2\|^2)^{-1} < \infty}, which completes Step 1.
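One can also check Lemma 2 numerically by evaluating (11) exactly; here is a Python sketch, with the angles and the mode {k} again arbitrary choices.

import cmath, math

theta1, theta2, k = math.sqrt(2), 0.25, 1
z = 0.5 * (cmath.exp(2j * math.pi * k * theta1) + cmath.exp(2j * math.pi * k * theta2))

def second_moment(n):
    # exact value of E_mu[a_n^2] from (11)
    s = sum((n - r) * z ** r for r in range(1, n))
    return (n + 2 * s.real) / n**2

def dist(x):
    # ||x||: distance from x to the nearest integer
    return abs(x - round(x))

C = 1 + 1 / (dist(k * theta1) ** 2 + dist(k * theta2) ** 2)
for n in (10, 100, 1000):
    print(n, second_moment(n), C / n)    # the bound (10) dominates the exact value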

Step 2

Given {\delta>0}, we have

\displaystyle  \mathop{\mathbb E}[a_n^2] \geq \delta^2 \mathop{\mathbb P}(a_n\geq \delta),

and so by Lemma 2 we have

\displaystyle  \mathop{\mathbb P}(a_n\geq \delta) \leq \frac C{\delta^2 n}.

Putting {C' = C\delta^{-2}} completes step 2.

Step 3

Given {m\leq n\in {\mathbb N}}, we have

\displaystyle  na_n \leq ma_m + \left|\sum_{\ell=m+1}^{n} z_\ell^k\right| \leq ma_m + n-m,

and so

\displaystyle  a_n \leq \frac{n-m}n + \frac mn a_m.

In particular, if {n\leq (1+\delta)m} for some {\delta>0}, we have

\displaystyle  a_n \leq \delta + a_m. \ \ \ \ \ (12)

Let {n_j} be such that

\displaystyle  1+\frac \delta 2 \leq \frac{n_{j+1}}{n_j} \leq 1+\delta \ \ \ \ \ (13)

for all {j}. If {a_{n_j}\leq \delta}, then (12) implies that {a_n\leq 2\delta} for all {n\in [n_j, n_{j+1}]}.

Step 4

Let {E_j} be the event that {a_{n_j} \geq \delta}. By Step 2, we have {\mathop{\mathbb P}(E_j) \leq C'/n_j}, and because {n_j} increases exponentially in {j} by (13), we have {\sum_{j=1}^\infty \mathop{\mathbb P}(E_j) < \infty}. By the Borel–Cantelli lemma, this implies that with probability 1, there are only finitely many values of {j} for which {a_{n_j}\geq \delta}.

By Step 3, this in turn implies that there are only finitely many values of {n} for which {a_n\geq 2\delta}. In particular, {\varlimsup a_n \leq 2\delta}, and since {\delta>0} was arbitrary, we have {a_n\rightarrow 0}. Thus the sequence {z_n} satisfies Weyl's criterion almost surely, which completes the proof of Theorem 1.
