## Fubini foiled

An important issue in hyperbolic dynamics is that of absolute continuity. Suppose some neighbourhood ${U}$ of a smooth manifold ${M}$ is foliated by a collection of smooth submanifolds ${\{W_\alpha \mid \alpha\in A\}}$, where ${A}$ is some indexing set. (Here “smooth” may mean ${C^1}$, or ${C^2}$, or even more regularity depending on the context.)

Fixing a Riemannian metric on ${M}$ gives a notion of volume ${m}$ on the manifold ${M}$, as well as a notion of “leaf volume” ${m_\alpha}$ on each ${W_\alpha}$, which is just the volume form coming from the induced metric. Then one wants to understand the relationship between ${m}$ and ${m_\alpha}$.

The simplest example of this comes when ${U=(0,1)^2}$ is the (open) unit square and ${\{W_\alpha\}}$ is the foliation into horizontal lines, so ${A=(0,1)}$ and ${W_\alpha = (0,1)\times \{\alpha\}}$. Then Fubini’s theorem says that if ${E}$ is any measurable set, then ${m(E)}$ can be found by integrating the leaf volumes ${m_\alpha(E)}$. In particular, if ${m(E)=0}$, then ${m_\alpha(E)=0}$ for almost every ${\alpha}$, and similarly, if ${E}$ has full measure, then ${m_\alpha(E)=1}$ for almost every ${\alpha}$.

For the sake of simplicity, let us continue to think of a foliation of the two-dimensional unit square by one-dimensional curves, and assume that these curves are graphs of smooth functions ${\phi_\alpha\colon (0,1)\rightarrow (0,1)}$. The story in other dimensions is similar.

Writing ${\Phi(x,y) = \phi_y(x)}$ gives a map ${\Phi\colon (0,1)^2 \rightarrow (0,1)^2}$, and we see that our foliation is the image under ${\Phi}$ of the foliation by horizontal lines. Now we must make a very important point about regularity of foliations. The leaves of the foliation are smooth, and so ${\Phi}$ depends smoothly on ${x}$. However, no assumption has been made so far on the transverse direction — that is, dependence on ${y}$. In particular, we cannot assume that ${\Phi}$ depends smoothly on ${y}$.

If ${\Phi}$ depends smoothly on both ${x}$ and ${y}$, then we have a smooth foliation, not just smooth leaves, and in this case basic results from calculus show that a version of Fubini’s theorem holds:

$\displaystyle m(E) = \int_A \int_{W_\alpha} \rho_\alpha(z) {\mathbf{1}}_E(z) \,dm_\alpha(z)\,d\alpha,$

where the density functions ${\rho_\alpha\in L^1(W_\alpha,m_\alpha)}$ can be determined in terms of the derivative of ${\Phi}$.

It turns out that when the foliation is by local stable and unstable manifolds of a hyperbolic map, smoothness of the foliation is too much to ask for, but it is still possible to find density functions ${\rho_\alpha}$ such that Fubini’s theorem holds, and the relationship between ${m}$-null sets and ${m_\alpha}$-null sets is as expected.

However, there are foliations where the leaves are smooth but the foliation as a whole does not even have the absolute continuity properties described in the previous paragraph. This includes a number of dynamically important examples, namely intermediate foliations (weak stable and unstable, centre directions in partial hyperbolicity) for certain systems. These take some time to set up and study. There is an elementary non-dynamical example first described by Katok; a similar example was described by Milnor (Math. Intelligencer 19(2), 1997), and it is this construction which I want to outline here, since it is elegant, does not require much machinery to understand, and neatly illustrates the possibility that absolute continuity may fail.

The example consists of two constructions: a set ${E}$ and a foliation of the square by a collection of smooth curves ${W_\alpha}$. These are built in such a way that ${E}$ has full Lebesgue measure (${m(E)=1}$) but intersects each curve ${W_\alpha}$ in at most a single point, so that in particular ${m_\alpha(E)=0}$ for all ${\alpha}$.

1. The set

Given ${p\in (0,1)}$, consider the piecewise linear map ${f_p\colon (0,1)\rightarrow (0,1]}$ given by

$\displaystyle f_p(y) = \begin{cases} \frac yp & y\leq p \\ \frac {y-p}{1-p} &y >p \end{cases}$

It is not hard to see that ${f_p}$ preserves Lebesgue measure ${dy}$ and that ${(f_p,dy)}$ is measure-theoretically conjugate to ${(\sigma,\mu_p)}$, where ${\sigma}$ is the shift map on ${\Sigma=\{0,1\}^{\mathbb N}}$ and ${\mu_p}$ is the ${(p,1-p)}$-Bernoulli measure. The conjugacy ${\pi_p\colon (0,1) \rightarrow \Sigma}$ is given by coding trajectories according to whether the ${n}$th iterate falls into ${I_0 = (0,p]}$ or ${I_1=(p,1)}$.

Given ${y}$, let ${\beta_n(y)}$ be the number of times that ${y,f_p(y),\dots,f_p^{n-1}(y)}$ lands in ${I_1}$. Then by the strong law of large numbers, Lebesgue almost every ${y\in (0,1)}$ has ${\frac 1n \beta_n(y) \rightarrow 1-p}$. Let ${E}$ be the set of all pairs ${(p,y)}$ such that this is true; then ${m(E) = 1}$ by Fubini’s theorem (applied to the foliation of the unit square into vertical lines). It only remains to construct a foliation that intersects ${E}$ in a “strange” way.

2. The foliation

The curves ${W_\alpha}$ will be chosen so that two points ${(p,y)}$ and ${(p',y')}$ lie on the same ${W_\alpha}$ if and only if they are coded by the same sequence in ${\Sigma}$ — that is, if and only if ${\pi_p(y) = \pi_{p'}(y')}$. Then because the limit of ${\frac 1n\beta_n}$ must be constant on ${W_\alpha}$ if it exists, we deduce that each ${W_\alpha}$ intersects the set ${E}$ at most once.

It remains to see that this condition defines smooth curves. Given ${\alpha\in (0,1)}$, let ${a=\pi_{1/2}(\alpha)=a_1a_2\cdots}$ be the binary expansion of ${\alpha}$, so that ${\alpha = \sum a_n 2^{-n}}$. Now let ${W_\alpha = \{(p,y) \mid \pi_p(y)=a\}}$.

Fix ${p\in (0,1)}$, let ${y_1=y = \phi_\alpha(p)}$ be such that ${(p,y)\in W_\alpha}$, and let ${y_{n+1}=f_p(y_n)}$, so that ${y_n\in I_{a_n}}$ for all ${n}$. Thus

$\displaystyle y_{n+1} = \begin{cases} \frac {y_n}{p} & a_n=0 \\ \frac{y_n-p}{1-p} & a_n=1 \end{cases}$

For convenience of notation, write ${p(0)=p}$ and ${p(1)=1-p}$. Then the relationship between ${y_{n+1}}$ and ${y_n}$ can be written as

$\displaystyle y_{n+1} = \frac{y_n - a_n p(0)}{p(a_n)} \qquad\Rightarrow\qquad y_n = p(0)a_n + p(a_n)y_{n+1}.$

This can be used to write ${y=\phi_\alpha(p)}$ in terms of the sequence ${a_n}$. Indeed,

$\displaystyle \begin{aligned} \phi_\alpha(p) &=y = y_1 = p(0)a_1 + p(a_1)y_2 \\ &= p(0)a_1 + p(a_1)\big(p(0) a_2 + p(a_2)y_3\big) \\ &= p(0)\big(a_1 + p(a_1)a_2\big) + p(a_1)p(a_2)y_3 \\ &= p(0)\big(a_1 + p(a_1)a_2 + p(a_1)p(a_2)a_3\big) + p(a_1)p(a_2)p(a_3)y_4, \end{aligned}$

and so on, so that writing ${\psi_n(p) = p(a_1)p(a_2)\cdots p(a_{n-1})}$, one has

$\displaystyle \phi_\alpha(p) = p(0) \sum_{n=1}^\infty \psi_n(p) a_n.$

The summands are analytic functions of ${p}$, and the sum converges uniformly on each interval ${[\epsilon,1-\epsilon]}$, since for ${p\in [\epsilon, 1-\epsilon]}$ we have ${\psi_n(p) \leq (1-\epsilon)^{n-1}}$. In fact, this uniform convergence extends to complex values of ${p}$ in the disc of radius ${\frac 12 - \epsilon}$ centred at ${\frac 12}$, and so by the Weierstrass uniform convergence theorem, the function ${\phi_\alpha}$ is analytic in ${p}$.
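The series for ${\phi_\alpha(p)}$ can be checked by direct computation. The following Python sketch truncates the sum after finitely many symbols (which is the same as padding the sequence with zeros) and then recovers the symbols by iterating ${f_p}$. The function names `phi`, `f` and `code` are illustrative choices, and exact rational arithmetic is used so that no rounding enters.

```python
from fractions import Fraction

def phi(a, p):
    """Truncated series p(0) * sum_n psi_n(p) * a_n for a finite 0/1 list a."""
    total = Fraction(0)
    psi = Fraction(1)                        # psi_1 is the empty product
    for symbol in a:
        total += psi * symbol
        psi *= p if symbol == 0 else 1 - p   # multiply by p(a_n)
    return p * total

def f(y, p):
    """The piecewise linear map f_p."""
    return y / p if y <= p else (y - p) / (1 - p)

def code(y, p, k):
    """First k symbols of the itinerary of y under f_p (0 for I_0, 1 for I_1)."""
    symbols = []
    for _ in range(k):
        symbols.append(0 if y <= p else 1)
        y = f(y, p)
    return symbols

a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
for p in (Fraction(1, 3), Fraction(1, 2), Fraction(3, 4)):
    y = phi(a, p)
    # Only the first 8 symbols are compared: the truncated tail is all zeros,
    # so the last symbols sit on the boundary between I_0 and I_1.
    print(p, y, code(y, p, 8) == a[:8])
```

For ${p=\frac 12}$ the function `phi` simply evaluates a binary expansion, which matches the definition of ${a=\pi_{1/2}(\alpha)}$ above.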

As discussed above, ${W_\alpha}$ is the graph of the analytic function ${\phi_\alpha}$, and each ${W_\alpha}$ intersects ${E}$ at most once, despite the fact that ${E}$ has Lebesgue measure 1 in the unit square. This demonstrates the failure of absolute continuity.

## Sharkovsky’s theorem

This post is based on a talk given by Keith Burns in the UH dynamical systems seminar yesterday, in which he presented a streamlined proof of Sharkovsky’s theorem due to him and Boris Hasselblatt.

1. Background and statement of the theorem

We recall the statement of Sharkovsky’s theorem. The setting is a continuous map ${f\colon {\mathbb R}\rightarrow {\mathbb R}}$, and one is interested in what dynamical properties ${f}$ must necessarily have if we know that ${f}$ has a periodic orbit of some given least period ${p}$ — for example, does existence of such an orbit force the existence of periodic points with least period ${q}$ for various values of ${q}$? Or does it force the map to have positive entropy?

The assumptions of continuity and one-dimensionality are vital: for discontinuous one-dimensional maps, or for continuous higher-dimensional maps, it is possible to have a point of least period ${p}$ without having points of any other period or having positive entropy.

In one dimension, on the other hand, one can begin with the following simple observation: if ${f}$ has a periodic point with least period 2, then there is ${p}$ such that ${f(p) > p=f^2(p)}$, and so applying the intermediate value theorem to ${f(x)-x}$, the map must have a fixed point.

What if ${f}$ has a periodic point ${p}$ with least period 3? Suppose that ${p < f(p) < f^2(p)}$ (the other configurations are similar), and let ${I_1=[p,f(p)]}$ and ${I_2=[f(p),f^2(p)]}$. Once again using the intermediate value theorem, we see that ${f(I_1) \supset I_2}$ and ${f(I_2) \supset I_1 \cup I_2}$. Thus we may consider the topological Markov chain ${(X,\sigma)}$ given by the graph in Figure 1, and observe that it embeds into the dynamics of ${f}$, so that in particular, ${f}$ has periodic points of every period.

Fig 1
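As a small sanity check on the graph, one can count its closed paths. The Python sketch below encodes the covering relations ${I_1\rightarrow I_2}$, ${I_2\rightarrow I_1}$, ${I_2\rightarrow I_2}$ as a 0/1 matrix `M` (a name chosen here for illustration) and computes the trace of ${M^n}$, which is the number of closed paths of length ${n}$. A positive count for every ${n}$ is a necessary ingredient only: it does not by itself show that the corresponding periodic point has least period ${n}$.

```python
def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# M[i][j] = 1 when f(I_{i+1}) covers I_{j+1}.
M = [[0, 1],   # I_1 -> I_2
     [1, 1]]   # I_2 -> I_1 and I_2 -> I_2

def closed_paths(M, n):
    """Number of closed paths of length n in the graph, i.e. trace of M^n."""
    power = M
    for _ in range(n - 1):
        power = mat_mul(power, M)
    return sum(power[i][i] for i in range(len(M)))

counts = [closed_paths(M, n) for n in range(1, 9)]
print(counts)
```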

Sharkovsky’s theorem generalises the above observations. Consider the following unorthodox total order on the positive integers:

$\displaystyle \begin{aligned} 3\prec 5&\prec 7\prec \cdots \\ 6\prec 10&\prec 14\prec \cdots \\ 12\prec 20&\prec 28\prec \cdots \\ &\vdots \\ \cdots \prec 16 &\prec 8 \prec 4 \prec 2 \prec 1 \end{aligned}$

Then if ${f}$ has a periodic point of least period ${p}$, it has a point of least period ${q}$ for every ${q\succ p}$.
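The order is easy to encode. Writing ${n=2^k m}$ with ${m}$ odd, the rows above are indexed by ${k}$ and ordered by ${m}$ within each row, with the pure powers of two placed last in decreasing order. The following Python sketch turns this into a sort key; the names `sharkovsky_key` and `forced_periods` are illustrative.

```python
def sharkovsky_key(n):
    """Sort key realising the order above for a positive integer n."""
    k = 0
    while n % 2 == 0:
        n //= 2
        k += 1
    if n > 1:
        return (0, k, n)    # 2^k * odd: rows ordered by k, then by the odd part
    return (1, -k)          # pure powers of two come last, in decreasing order

def forced_periods(p, limit):
    """All q <= limit that come after p in the order (periods forced by p)."""
    return [q for q in range(1, limit + 1)
            if sharkovsky_key(q) > sharkovsky_key(p)]

print(sorted(range(1, 13), key=sharkovsky_key))
print(forced_periods(6, 12))
```

Sorting ${1,\dots,12}$ with this key lists the odd numbers from 3 upwards first and ends with ${8,4,2,1}$, in agreement with the display above.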

The paper by Burns and Hasselblatt includes a discussion of the history of the theorem and the various proofs that have been given. Here we outline the most streamlined proof available at the present time.

2. Sketch of the proof

Given two intervals ${I,J\subset {\mathbb R}}$, write ${I\rightarrow J}$ if ${f(I)\supset J}$. As in the case of a period 3 orbit above, the key will be to construct a collection of intervals ${I_j}$ (whose endpoints are certain points of the period ${p}$ orbit) such that the covering relations ${I_j\rightarrow I_k}$ yield a directed graph with periodic paths of all the desired periods.

To deduce existence of various periodic orbits for ${f}$ from these covering relations, one makes the following observations.

1. If ${f^p(I)\supset I}$, then ${I}$ contains a fixed point of ${f^p}$.
2. If ${I_0 \rightarrow I_1 \rightarrow \cdots \rightarrow I_k}$, then ${f^k(I_0)\supset I_k}$; in particular, if the path is a loop with ${I_k=I_0}$, then ${f^k(I_0)\supset I_0}$.
3. For such a loop, the above two observations guarantee existence of ${x\in I_0}$ with ${f^k(x)=x}$. To guarantee that ${x}$ can be chosen with least period ${k}$, it suffices to know that some ${I_j}$ has interior disjoint from all the others, and that the endpoints of this ${I_j}$ do not follow the itinerary given by the sequence of covering relations above.

Detailed proofs of the above observations are in the Burns-Hasselblatt paper. With these tools in hand, the proof comes down to constructing the directed graph of covering relations referred to above.

Fig 2

To this end, consider a periodic orbit ${P}$ with ${p}$ points — the green points in Figure 2. These are permuted by ${f}$; some are moved to the left, some are moved to the right. Of all the points in the orbit that are moved to the right, let ${z}$ be the rightmost, so that all points in the orbit to the right of ${z}$ are moved to the left. Let ${z'}$ be the next point to the right of ${z}$, and let ${\Delta=[z,z']}$ as in the figure.

Now consider the intervals ${I_k = f^k(\Delta)}$. Note that ${I_0}$ has the following property: every point in ${P}$ lying to the left of ${\Delta}$ moves to the right of ${\Delta}$, and vice versa. Let ${I_m}$ be the first interval that does not have this property, and let ${x\in I_m\cap P}$ be a point that does not switch sides of ${\Delta}$ under the action of ${f}$. For concreteness we consider the case where ${x}$ is to the left of ${\Delta}$, as in Figure 2.

By construction, all the points between ${x}$ and ${\Delta}$ are mapped to the right of ${\Delta}$. Of these, let ${y}$ be the point that is mapped furthest to the right.

Write ${R_y}$ for the interval ${[y,\infty)}$, and consider the intervals ${J_k = I_k \cap R_y}$ for ${0\leq k < m}$, together with ${J_m = [x,y]}$. We claim that ${J_0\rightarrow J_0}$, ${J_m\rightarrow J_0}$, and ${J_k \rightarrow J_{k+1}}$ for each ${0\leq k < m}$, so that we have the covering relations shown in Figure 3, which can be used to establish existence of periodic points with all least periods greater than ${m}$.

Fig 3

So why do the intervals ${J_k}$ have the covering relations shown?

1. The fact that ${J_0\rightarrow J_0}$ follows from the fact that ${J_0=I_0=\Delta}$ and the endpoints of ${\Delta}$ switch sides of ${\Delta}$.
2. The definition of ${I_k}$ gives ${I_k\rightarrow I_{k+1}}$.
3. Writing ${I_k^R}$ for the part of ${I_k}$ lying in or to the right of ${\Delta}$ and ${I_k^L}$ for the part lying in or to the left, we see that ${I_k^R\rightarrow I_{k+1}^L}$ and ${I_k^L\rightarrow I_{k+1}^R}$; this is because for ${0\leq k < m}$, all points of ${P\cap I_k}$ switch sides of ${\Delta}$ under the action of ${f}$. (This is the definition of ${m}$.)
4. Using similar notation for ${J_k^L}$ and ${J_k^R}$, we see that ${I_k^R = J_k^R}$ and ${J_k^L \rightarrow I_{k+1}^R}$; this is because ${y}$ was chosen to be the point that maps furthest to the right.
5. This establishes that ${J_k^L \rightarrow J_{k+1}^R}$, and similarly ${J_k^R \rightarrow J_{k+1}^L}$.

For ${k=m}$ we see that ${f(x)}$ lies to the left of ${\Delta}$, and ${f(y)}$ to the right, so that ${f(J_m)\supset \Delta=I_0}$.

The extra piece of information in the final item of the list above shows that the graph in Figure 3 can be replaced by the one in Figure 4.

Fig 4

Let us sum up what we know. One of the following two cases holds. Case 1: There is some point ${x}$ of the orbit such that ${x}$ and ${f(x)}$ both lie on the same side of ${\Delta}$. Case 2: Every point of the orbit is mapped to the other side of ${\Delta}$ by ${f}$.

In Case 1, the arguments above show that for every ${q>p}$ and every even ${q<p}$, there is an orbit of least period ${q}$, which is at least as strong as the conclusion in Sharkovsky’s theorem.

In Case 2, one may observe that since ${f}$ permutes the elements of ${P}$ and every element is moved from one side of ${\Delta}$ to the other, ${p}$ must be even. Moreover, by considering ${P^L}$ and ${P^R}$ separately, as invariant sets for ${f^2}$, we can repeat the argument inductively to complete the proof of the theorem.

## Central Limit Theorem for dynamical systems using martingales

This post is based on notes from Matt Nicol’s talk at the UH summer school in dynamical systems. The goal is to present the ideas behind a proof of the central limit theorem for dynamical systems using martingale approximations.

1. Conditional expectation

Before we can define and use martingales, we must recall the definition of conditional expectation. Let ${(\Omega,\mathop{\mathbb P})}$ be a probability space, with ${\mathop{\mathbb P}}$ defined on a ${\sigma}$-algebra ${\mathop{\mathcal B}}$. Let ${\mathop{\mathcal F}\subset \mathop{\mathcal B}}$ be a sub-${\sigma}$-algebra of ${\mathop{\mathcal B}}$.

Example 1 Consider the doubling map ${T\colon [0,1]\rightarrow [0,1]}$ given by ${T(x) = 2x\pmod 1}$. Let ${\mathop{\mathbb P}}$ be Lebesgue measure, ${\mathop{\mathcal B}}$ the Borel ${\sigma}$-algebra, and ${\mathop{\mathcal F} = T^{-1}\mathop{\mathcal B} = \{T^{-1}(B) \mid B\in \mathop{\mathcal B}\}}$. Then ${\mathop{\mathcal F}}$ is a sub-${\sigma}$-algebra of ${\mathop{\mathcal B}}$, consisting of precisely those sets in ${\mathop{\mathcal B}}$ which are unions of preimage sets ${T^{-1}(x)}$ — that is, those sets ${F\in\mathop{\mathcal B}}$ for which a point ${y\in [0,\frac 12]}$ is in ${F}$ if and only if ${y+\frac 12\in F}$.

This example extends naturally to yield a decreasing sequence of ${\sigma}$-algebras

$\displaystyle \mathop{\mathcal B} \supset T^{-1}\mathop{\mathcal B} \supset T^{-2}\mathop{\mathcal B} \supset \cdots.$

Given a sub-${\sigma}$-algebra ${\mathop{\mathcal F}\subset \mathop{\mathcal B}}$ and a random variable ${Y\colon \Omega\rightarrow {\mathbb R}}$ that is measurable with respect to ${\mathop{\mathcal B}}$, the conditional expectation of ${Y}$ given ${\mathop{\mathcal F}}$ is any random variable ${Z}$ such that

1. ${Z}$ is ${\mathop{\mathcal F}}$-measurable (that is, ${Z^{-1}(I)\in\mathop{\mathcal F}}$ for every interval ${I\subset {\mathbb R}}$), and
2. ${\int_A Z\,d\mathop{\mathbb P} = \int_A Y\,d\mathop{\mathbb P}}$ for every ${A\in \mathop{\mathcal F}}$.

It is not hard to show that these conditions determine ${Z}$ up to almost-sure equality, and so the conditional expectation is uniquely defined as a random variable (modulo null sets). We write it as ${\mathop{\mathbb E}[Y\mid\mathop{\mathcal F}]}$.

A key property is that conditional expectation is linear: for every ${\mathop{\mathcal F}\subset \mathop{\mathcal B}}$, every ${Y_1,Y_2\colon \Omega\rightarrow{\mathbb R}}$, and every ${a\in {\mathbb R}}$, we have

$\displaystyle \mathop{\mathbb E}[aY_1 + Y_2\mid \mathop{\mathcal F}] = a\mathop{\mathbb E}[Y_1\mid\mathop{\mathcal F}] + \mathop{\mathbb E}[Y_2\mid\mathop{\mathcal F}].$

Example 2 If ${Y}$ is already ${\mathop{\mathcal F}}$-measurable, then ${\mathop{\mathbb E}[Y\mid\mathop{\mathcal F}]=Y}$.

Example 3 At the other extreme, if ${Y}$ and ${\mathop{\mathcal F}}$ are independent — that is, if ${\mathop{\mathbb E}[Y{\mathbf{1}}_A]=\mathop{\mathbb E}[Y]\mathop{\mathbb P}[A]}$ for every ${A\in\mathop{\mathcal F}}$ — then ${\mathop{\mathbb E}[Y\mid\mathop{\mathcal F}]}$ is the constant function ${\mathop{\mathbb E}[Y]{\mathbf{1}}}$.

Example 4 Suppose ${\{\Omega_i\mid i\in{\mathbb N}\}}$ is a countable partition of ${\Omega}$ such that ${\mathop{\mathbb P}[\Omega_i]>0}$ for every ${i}$. Let ${\mathop{\mathcal F}=\sigma(\Omega_1,\Omega_2,\dots)}$ be the ${\sigma}$-algebra generated by the sets ${\Omega_i}$. Then

$\displaystyle \mathop{\mathbb E}[Y|\mathop{\mathcal F}] = \sum_{i\in{\mathbb N}} \frac{\mathop{\mathbb E}[Y{\mathbf{1}}_{\Omega_i}]}{\mathop{\mathbb P}[\Omega_i]} {\mathbf{1}}_{\Omega_i}.$
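On a finite sample space the formula in Example 4 can be evaluated directly. The Python sketch below is one way to do this; the function name `conditional_expectation` and the die example are assumptions made for illustration. It averages ${Y}$ over each block of the partition and checks the defining integral identity on each block.

```python
from fractions import Fraction

def conditional_expectation(prob, Y, partition):
    """E[Y | F] for F generated by a partition of a finite sample space.

    prob and Y are dicts keyed by outcome; partition is a list of sets.
    """
    Z = {}
    for block in partition:
        mass = sum(prob[w] for w in block)                  # P[Omega_i] > 0
        average = sum(Y[w] * prob[w] for w in block) / mass
        for w in block:
            Z[w] = average                                  # constant on the block
    return Z

# A fair six-sided die, partitioned into odd and even outcomes.
prob = {w: Fraction(1, 6) for w in range(1, 7)}
Y = {w: Fraction(w) for w in range(1, 7)}
partition = [{1, 3, 5}, {2, 4, 6}]
Z = conditional_expectation(prob, Y, partition)

for block in partition:
    # Integrals of Z and Y agree over every generating set of F.
    print(sum(Z[w] * prob[w] for w in block) == sum(Y[w] * prob[w] for w in block))
```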

2. Martingales

Now we can define martingales, which are a particular sort of stochastic process (sequence of random variables) with “enough independence” to generalise results from the IID case.

Definition 1 A sequence ${Y_n\colon \Omega\rightarrow {\mathbb R}}$ of random variables is a martingale if

1. ${\mathop{\mathbb E}[|Y_n|]<\infty}$ for all ${n}$;
2. there is an increasing sequence of ${\sigma}$-algebras (a filtration) ${\mathop{\mathcal F}_1\subset\mathop{\mathcal F}_2\subset\cdots \subset \mathop{\mathcal B}}$ such that ${Y_n}$ is measurable with respect to ${\mathop{\mathcal F}_n}$;
3. the conditional expectations satisfy ${\mathop{\mathbb E}[Y_{n+1}|\mathop{\mathcal F}_n] = Y_n}$.

The first condition guarantees that everything is in ${L^1}$. If ${\mathop{\mathcal F}_n}$ is taken to be the ${\sigma}$-algebra of events that are determined by the first ${n}$ outcomes of a sequence of experiments, then the second condition states that ${Y_n}$ only depends on those first ${n}$ outcomes, while the third condition requires that if the first ${n}$ outcomes are known, then the expected value of ${Y_{n+1} - Y_n}$ is 0.

Example 5 Let ${Y_n}$ be a sequence of fair coin flips — IID random variables taking the values ${\pm1}$ with equal probability. Let ${S_n=Y_1+\cdots+Y_n}$. As suggested in the previous paragraph, let ${\mathop{\mathcal F}_n=\sigma(Y_1,\dots,Y_n)}$ be the smallest ${\sigma}$-algebra with respect to which ${Y_1,\dots,Y_n}$ are all measurable. (The sets in ${\mathop{\mathcal F}_n}$ are precisely those sets in ${\mathop{\mathcal B}}$ which are determined by knowing the values of ${Y_1,\dots,Y_n}$.)

It is easy to see that ${S_n}$ satisfies the first two properties of a martingale, and for the third, we use linearity of expectation and the definition of ${\mathop{\mathcal F}_n}$ to get

$\displaystyle \mathop{\mathbb E}[S_{n+1}|\mathop{\mathcal F}_n] = \mathop{\mathbb E}[Y_1 + \cdots + Y_n|\mathop{\mathcal F}_n] + \mathop{\mathbb E}[Y_{n+1}|\mathop{\mathcal F}_n] = S_n + 0 = S_n.$

When ${Y_n}$ is a sequence of random variables for which ${S_n = Y_1+\cdots+Y_n}$ is a martingale, we say that the sequence ${Y_n}$ is a martingale difference.

In the previous example the martingale property (the third condition) was a direct consequence of the fact that the random variables ${Y_n = S_n-S_{n-1}}$ were IID with mean zero. However, there are examples where the martingale differences are not IID.

Example 6 Polya’s urn is a stochastic process defined as follows. Consider an urn containing some number of red and blue balls. At each step, a single ball is drawn at random from the urn, and then returned to the urn, along with a new ball that matches the colour of the one drawn. Let ${Y_n}$ be the fraction of the balls that are red after the ${n}$th iteration of this process.

Clearly the sequence of random variables ${Y_n}$ is neither independent nor identically distributed. However, it is a martingale, as the following computation shows: suppose that at time ${n}$ there are ${p}$ red balls and ${q}$ blue balls in the urn. (This knowledge represents knowing which element of ${\mathop{\mathcal F}_n}$ we are in.) Then at time ${n+1}$, there will be ${p+1}$ red balls with probability ${\frac{p}{p+q}}$, and ${p}$ red balls with probability ${\frac{q}{p+q}}$. Either way, there will be ${p+q+1}$ total balls, and so the expected fraction of red balls is

$\displaystyle \begin{aligned} \mathop{\mathbb E}[Y_{n+1}|\mathop{\mathcal F}_n] &= \frac{p}{p+q}\cdot\frac{p+1}{p+q+1} + \frac{q}{p+q}\cdot \frac{p}{p+q+1} \\ &= \frac{p(p+q+1)}{(p+q)(p+q+1)} = \frac{p}{p+q} = Y_n. \end{aligned}$
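The same computation can be confirmed exactly for a range of urn compositions. This is a minimal Python sketch using rational arithmetic; the function name `expected_next_fraction` is chosen here for illustration.

```python
from fractions import Fraction

def expected_next_fraction(p, q):
    """E[Y_{n+1} | urn holds p red and q blue balls], computed exactly."""
    total = p + q
    draw_red = Fraction(p, total) * Fraction(p + 1, total + 1)
    draw_blue = Fraction(q, total) * Fraction(p, total + 1)
    return draw_red + draw_blue

# The conditional expectation equals the current fraction p / (p + q).
checks = [expected_next_fraction(p, q) == Fraction(p, p + q)
          for p in range(1, 20) for q in range(1, 20)]
print(all(checks))
```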

If we assume that the martingale differences are stationary (that is, identically distributed) and ergodic, then we have the following central limit theorem for martingales, from a 1974 paper of McLeish (we follow some notes by S. Sethuraman for the statement).

Theorem 2 Let ${Y_i}$ be a stationary ergodic sequence such that ${\sigma^2 = \mathop{\mathbb E}[Y_i^2]<\infty}$ and ${\mathop{\mathbb E}[Y_{n+1}|\mathop{\mathcal F}_n]=0}$, where ${\mathop{\mathcal F}_n}$ is the ${\sigma}$-algebra generated by ${\{Y_i\mid 1\leq i\leq n\}}$. Then ${S_n = \sum_{i=1}^n Y_i}$ is a martingale, and ${\frac{S_n}{\sigma\sqrt{n}}}$ converges in distribution to ${N(0,1)}$.

More sophisticated versions of this result are available, but this simple version will suffice for our needs.

3. Koopman operator and transfer operator

Now we want to apply Theorem 2 to a dynamical system ${T\colon X\rightarrow X}$ with an ergodic measure ${\mu}$ by taking ${Y_i=\varphi\circ T^i}$ for some observable ${\varphi\colon X\rightarrow{\mathbb R}}$.

To carry this out, we consider two operators on ${L^2(\mu)}$. First we consider the Koopman operator ${U\colon \varphi\mapsto \varphi\circ T}$. Then we define the transfer operator ${\mathop{\mathcal P}}$ to be its ${L^2}$ adjoint — that is,

$\displaystyle \int(\mathop{\mathcal P}\varphi)\psi\,d\mu = \int\varphi(\psi\circ T)\,d\mu \ \ \ \ \ (1)$

for all ${\varphi,\psi\in L^2}$. The key result for our purposes is that the operators ${\mathop{\mathcal P}}$ and ${U}$ are one-sided inverses of each other.

Proposition 3 Given ${\varphi\in L^2}$, we have

1. ${\mathop{\mathcal P} U\varphi=\varphi}$;
2. ${U\mathop{\mathcal P}\varphi=\mathop{\mathbb E}[\varphi|T^{-1}\mathop{\mathcal B}]}$, where ${\mathop{\mathcal B}}$ is the ${\sigma}$-algebra on which ${\mu}$ is defined.

Proof: For the first claim, we see that for all ${\psi\in L^2}$ we have

$\displaystyle \int\psi\cdot (\mathop{\mathcal P} U\varphi)\,d\mu = \int (\psi\circ T)(\varphi\circ T)\,d\mu = \int\psi\varphi\,d\mu,$

where the first equality uses the definition of ${\mathop{\mathcal P}}$ and the second uses the fact that ${\mu}$ is invariant. To prove the second claim, we first observe that given an interval ${I\subset{\mathbb R}}$, we have

$\displaystyle (U\mathop{\mathcal P}\varphi)^{-1}(I) = T^{-1}((\mathop{\mathcal P} \varphi)^{-1}(I)) \in T^{-1}\mathop{\mathcal B},$

since ${\mathop{\mathcal P}}$ maps ${\mathop{\mathcal B}}$-measurable functions to ${\mathop{\mathcal B}}$-measurable functions. This shows that ${U\mathop{\mathcal P}\varphi}$ is ${T^{-1}\mathop{\mathcal B}}$-measurable, and it remains to show that

$\displaystyle \int_{T^{-1}B} U\mathop{\mathcal P}\varphi\,d\mu = \int_{T^{-1}B} \varphi\,d\mu \text{ for all }B\in \mathop{\mathcal B}. \ \ \ \ \ (2)$

This follows from a similar computation to the one above: given ${B\in \mathop{\mathcal B}}$ we have

$\displaystyle \begin{aligned} \int_{T^{-1}B} U\mathop{\mathcal P}\varphi\,d\mu &= \int ((\mathop{\mathcal P}\varphi)\circ T) \cdot {\mathbf{1}}_{T^{-1}B}\,d\mu = \int((\mathop{\mathcal P}\varphi)\circ T) ({\mathbf{1}}_B \circ T)\,d\mu \\ &= \int(\mathop{\mathcal P}\varphi) {\mathbf{1}}_B\,d\mu = \int\varphi\cdot ({\mathbf{1}}_B\circ T)\,d\mu = \int_{T^{-1}B} \varphi\,d\mu, \end{aligned}$

which establishes (2) and completes the proof. $\Box$

We see from Proposition 3 that a function has zero conditional expectation with respect to ${T^{-1}\mathop{\mathcal B}}$ if and only if it is in the kernel of ${\mathop{\mathcal P}}$. In particular, if ${\mathop{\mathcal P} h = 0}$ then ${\sum_{j=1}^n h\circ T^j}$ is a martingale; this will be a key tool in the next section.

Example 7 Let ${X=[0,1]}$ and ${T}$ be the doubling map. Let ${\mu}$ be Lebesgue measure. For convenience of notation we consider the ${L^2}$ space of complex-valued functions on ${X}$; the functions ${\varphi_n\colon x\mapsto e^{2\pi inx}}$ form an orthonormal basis for this space. A simple calculation shows that

$\displaystyle U\varphi_n(x) = \varphi_n(2x) = e^{2\pi in(2x)} = \varphi_{2n}(x),$

so ${U\colon \varphi_n\rightarrow \varphi_{2n}}$. For the transfer operator we obtain ${\mathop{\mathcal P}\colon \varphi_{2n}\rightarrow \varphi_n}$, while for odd values of ${n}$ we have

$\displaystyle \begin{aligned} \mathop{\mathcal P}\varphi_n(x) &= \frac 12 \left(\varphi_n\left(\frac x2\right) + \varphi_n\left(\frac{1+x}{2}\right)\right) \\ &= \frac 12 \left( e^{\pi i n x} + e^{\pi i n x} e^{\pi i n}\right) = 0. \end{aligned}$
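The action of both operators on Fourier modes can be checked numerically. The Python sketch below implements ${U}$ and ${\mathop{\mathcal P}}$ for the doubling map as operations on functions, using the averaging formula over the two preimages from Example 7; the names `koopman` and `transfer` and the sample points are illustrative.

```python
import cmath

def phi(n):
    """Fourier mode x -> exp(2 pi i n x)."""
    return lambda x: cmath.exp(2j * cmath.pi * n * x)

def koopman(func):
    """U func = func o T for the doubling map T(x) = 2x mod 1."""
    return lambda x: func((2 * x) % 1.0)

def transfer(func):
    """P func(x): average of func over the two preimages x/2 and (x+1)/2."""
    return lambda x: 0.5 * (func(x / 2) + func((x + 1) / 2))

samples = [0.0, 0.1, 0.37, 0.5, 0.82]
# P kills odd modes, sends phi_{2n} to phi_n, and is a left inverse of U.
odd_killed = max(abs(transfer(phi(3))(x)) for x in samples)
even_halved = max(abs(transfer(phi(6))(x) - phi(3)(x)) for x in samples)
left_inverse = max(abs(transfer(koopman(phi(5)))(x) - phi(5)(x)) for x in samples)
print(odd_killed, even_halved, left_inverse)
```

All three printed quantities should be zero up to floating-point rounding.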

4. Martingale approximation and CLT

The machinery of the Koopman and transfer operators from the previous section can be used to apply the martingale central limit theorem to observations of dynamical systems via the technique of martingale approximation, which was introduced by M. Gordin in 1969.

The idea is that if ${\mathop{\mathcal P}^n\varphi\rightarrow 0}$ quickly enough for functions ${\varphi}$ with ${\int\varphi\,d\mu=0}$, then we can approximate the sequence ${\varphi\circ T^n}$ with a martingale sequence ${h\circ T^n}$.

More precisely, suppose that the sequence ${\|\mathop{\mathcal P}^n\varphi\|_2}$ is summable; then we can define an ${L^2}$ function ${g}$ by

$\displaystyle g = \sum_{n=1}^\infty \mathop{\mathcal P}^n\varphi. \ \ \ \ \ (3)$

Let ${h = \varphi + g - g\circ T\in L^2}$. We claim that ${S_nh = \sum_{j=1}^n h\circ T^j}$ is a martingale. Indeed,

$\displaystyle \mathop{\mathcal P} h = \mathop{\mathcal P}\varphi + \mathop{\mathcal P} \sum_{n\geq 1} \mathop{\mathcal P}^n\varphi - \mathop{\mathcal P} U \sum_{n\geq 1} \mathop{\mathcal P}^n\varphi,$

and since ${\mathop{\mathcal P} U}$ is the identity we see that the last term is just ${\sum_n \mathop{\mathcal P}^n \varphi}$, so that ${\mathop{\mathcal P} h = 0}$.

Proposition 3 now implies that ${\mathop{\mathbb E}[h|T^{-1}\mathop{\mathcal B}]=0}$, and we conclude that ${S_nh}$ is a martingale, so by the martingale CLT ${\frac{S_n h}{\sigma\sqrt{n}}}$ converges in distribution to ${N(0,1)}$, where ${\sigma^2 = \mathop{\mathbb E}[h^2] = \int h^2\,d\mu}$.

Now we want to apply this result to obtain information about ${\varphi}$ itself, and in particular about ${S_n\varphi = \sum_{j=1}^n \varphi\circ T^j}$. We have ${\varphi = h + g\circ T - g}$, and so

$\displaystyle S_n\varphi = S_n h + \sum_{j=1}^n (g\circ T^{j+1} - g\circ T^j) = S_nh + g\circ T^{n+1} - g\circ T.$

This yields

$\displaystyle \frac{S_n\varphi}{\sigma\sqrt n} = \frac{S_n h}{\sigma\sqrt n} + \frac{g\circ T^{n+1} - g\circ T}{\sigma\sqrt n},$

and the last term goes to 0 in probability, which yields the central limit theorem for ${S_n\varphi}$.
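A concrete instance may help. For the doubling map with Lebesgue measure, take the observable ${\varphi(x)=\cos(4\pi x)}$ (an example chosen here, not taken from the talk). Then ${\mathop{\mathcal P}\varphi = \cos(2\pi x)}$ and ${\mathop{\mathcal P}^2\varphi=0}$, so the sum (3) has a single term and ${h=\cos(2\pi x)}$. The Python sketch below checks numerically that ${\mathop{\mathcal P}h=0}$ and that ${S_n\varphi - S_nh}$ stays bounded by ${2\sup|g|=2}$, which is what makes the correction term vanish after dividing by ${\sqrt n}$.

```python
import math

def T(x):
    return (2 * x) % 1.0                      # doubling map

def P(func):
    """Transfer operator of the doubling map with Lebesgue measure."""
    return lambda x: 0.5 * (func(x / 2) + func((x + 1) / 2))

phi = lambda x: math.cos(4 * math.pi * x)     # mean-zero observable (assumed example)
g = P(phi)                                    # P^2 phi = 0, so g = P phi
h = lambda x: phi(x) + g(x) - g(T(x))

def birkhoff_sum(func, x, n):
    """S_n func (x) = sum_{j=1}^n func(T^j x), following the indexing above."""
    total = 0.0
    for _ in range(n):
        x = T(x)
        total += func(x)
    return total

x0 = 0.1234
gaps = [abs(birkhoff_sum(phi, x0, n) - birkhoff_sum(h, x0, n)) for n in (1, 5, 10, 20)]
print(abs(P(h)(0.3)), gaps)
```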

Remark 1 There is a technical problem we have glossed over, which is that the sequence of ${\sigma}$-algebras ${T^{-n}\mathop{\mathcal B}}$ is decreasing, not increasing as is required by the definition of a martingale. One solution to this is to pass to the natural extension ${\hat T}$ and to consider the functions ${h\circ \hat{T}^{-j}}$ and the ${\sigma}$-algebras ${\hat{\mathop{\mathcal F}}_j = \hat{T}^j\mathop{\mathcal B}}$. Another solution is to use reverse martingales, but we do not discuss this here.

Example 8 Let ${X=[0,1]}$ and ${T\colon X\rightarrow X}$ be an intermittent type (Manneville–Pomeau) map given by

$\displaystyle T(x) = \begin{cases} x(1 + (2x)^\alpha) & x\in [0,\frac 12), \\ 2x-1 & x\in [\frac 12,1], \end{cases}$

where ${0<\alpha<1}$ is a fixed parameter. It can be shown that ${T}$ has a unique absolutely continuous invariant probability measure ${\mu}$, and that the transfer operator ${\mathop{\mathcal P}}$ has the following contraction property: for every ${\varphi\in L^2}$ with ${\int \varphi\,d\mu=0}$, there is ${C\in{\mathbb R}}$ such that ${\|\mathop{\mathcal P}^n\varphi\|_2 \leq Cn^{-\gamma}}$, where ${\gamma = \frac 12(\frac 1\alpha -1)}$.

For ${\alpha<\frac 13}$ we have ${\gamma>1}$, so ${\|\mathop{\mathcal P}^n\varphi\|_2}$ is summable, and consequently ${\varphi}$ satisfies the CLT with respect to ${\mu}$ by the above discussion.


## The Perron-Frobenius theorem and the Hilbert metric

In the last post, we introduced basic properties of convex cones and the Hilbert metric. In this post, we look at how these tools can be used to obtain an explicit estimate on the rate of convergence in the Perron–Frobenius theorem.

1. Perron–Frobenius theorem

We start by stating a version of the Perron–Frobenius theorem. Let ${A}$ be a ${d\times d}$ stochastic matrix, where here we use this to mean that the entries of ${A}$ are non-negative, and every column sums to 1: ${A_{ij}\in[0,1]}$ for all ${i,j}$, and ${\sum_{i=1}^d A_{ij} = 1}$ for all ${j}$. Thus the columns of ${A}$ are probability vectors.

Such a matrix ${A}$ describes a weighted random walk on ${d}$ sites: if the walker is presently at site ${j}$, then ${A_{ij}}$ gives the probability that he will move to site ${i}$ at the next step. Thus if we interpret a probability vector ${v}$ as saying that the walker is at site ${j}$ with probability ${v_j}$, then ${v\mapsto Av}$ gives the evolution of this distribution under one step of the random walk.

Now one version of the Perron–Frobenius theorem is as follows: If ${A}$ is a stochastic matrix with ${A>0}$ (that is, ${A_{ij}>0}$ for all ${i,j}$), then there is exactly one probability vector ${\pi}$ that is an eigenvector for ${A}$. Moreover, the eigenvalue associated to this eigenvector is 1, the eigenvalue 1 is simple, and all other eigenvalues have modulus ${<1}$. In particular, given any probability vector ${v\in[0,\infty)^d}$ we have ${A^n v \rightarrow \pi}$ exponentially quickly.

The eigenvector ${\pi}$ is the stationary distribution for the random walk (Markov chain) given by ${A}$, and the convergence result states that any initial distribution converges to the stationary distribution under iteration of the process.
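The convergence is easy to observe numerically. The following Python sketch iterates ${v\mapsto Av}$ for a hypothetical ${2\times 2}$ positive stochastic matrix, chosen only for illustration, whose stationary vector ${\pi=(\frac 56,\frac 16)}$ can be verified by hand.

```python
def apply(A, v):
    """Matrix-vector product Av for a matrix stored as a list of rows."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]

# A positive column-stochastic matrix: each column sums to 1.
A = [[0.9, 0.5],
     [0.1, 0.5]]

v = [0.0, 1.0]                      # walker starts at site 2 with certainty
for _ in range(60):
    v = apply(A, v)

pi = [5 / 6, 1 / 6]                 # solves A pi = pi with entries summing to 1
error = max(abs(x - y) for x, y in zip(v, pi))
print(v, error)
```

For this matrix the second eigenvalue is ${0.4}$ (the trace minus 1), so the error shrinks by roughly that factor at each step.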

The assumption that ${A>0}$ is quite strong: for the random walk, this says that the walker can get from any site to any other site in a single step. A more general condition is that ${A}$ is primitive: that is, there exists ${N\in{\mathbb N}}$ such that ${A^N>0}$. This says that there is a time ${N}$ such that by taking ${N}$ steps, the walker can get from any site to any other site. The same result as above holds in this case too.

In fact, existence and uniqueness of the stationary vector ${\pi}$ hold in the even more general case when ${A}$ is irreducible: for every ${i,j}$ there exists ${N}$ such that ${(A^N)_{ij}>0}$. This says that the walker can get from every site to every other site, but removes the assumption that there is a single time ${N}$ that works for all sites; the price is that ${A^n v}$ itself need not converge, because the walk may be periodic. For example, consider a random walk on a chessboard, where the walker is allowed to move one square horizontally or vertically at each step. Then for a sufficiently large even value of ${N}$, the walker can get from any white square to any other white square, but to get to a black square requires an odd value of ${N}$.
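The smallest instance of the chessboard phenomenon is a walk on two sites that always switches site. The sketch below (an illustrative example of our own, with the hypothetical names `swap` and `orbit`) shows that this irreducible but non-primitive matrix still fixes the uniform distribution, while a point mass simply oscillates, so periodicity does matter for the convergence statement.

```python
def step(A, v):
    """Return A v for a column-stochastic matrix A."""
    d = len(v)
    return [sum(A[i][j] * v[j] for j in range(d)) for i in range(d)]


# The walker always moves to the other site: irreducible, but no power is positive.
swap = [[0.0, 1.0],
        [1.0, 0.0]]

# The uniform distribution is stationary.
uniform = [0.5, 0.5]
print(step(swap, uniform))

# A point mass at site 1 alternates between the two sites forever.
orbit = [[1.0, 0.0]]
for _ in range(5):
    orbit.append(step(swap, orbit[-1]))
print(orbit)
```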

2. Rate of decay using convex cones

As stated above, the Perron–Frobenius theorem does not give any result on the rate with which ${A^n v}$ converges to ${\pi}$. One way to give an estimate on this rate is to use convex cones and the Hilbert metric, which were discussed in the last post.

2.1. A cone and a metric

Let ${{\mathop{\mathcal C}}}$ be the convex cone ${[0,\infty)^d\subset{\mathbb R}^d}$. We want an estimate on the diameter of ${A({\mathop{\mathcal C}})}$ in the Hilbert metric ${d_{\mathop{\mathcal C}}}$. Recall that this metric is given by ${d_{\mathop{\mathcal C}}(v,w)=\log(\beta/\alpha)}$, where

$\displaystyle \begin{aligned} \beta &= \inf \{\mu>0 \mid \mu v - w \in {\mathop{\mathcal C}}\},\\ \alpha &= \sup\{\lambda>0\mid w-\lambda v\in {\mathop{\mathcal C}}\}. \end{aligned}$

Another way of interpreting the cone ${{\mathop{\mathcal C}}}$ is in terms of the partial order it places on ${{\mathbb R}^d}$, which is given by ${v\preceq w \Leftrightarrow w-v\in {\mathop{\mathcal C}}\cup\{0\}}$. We see that ${\beta}$ and ${\alpha}$ can be characterised as

$\displaystyle \alpha = \sup\{\lambda \mid \lambda v \preceq w\},\qquad \beta = \inf\{\mu \mid w\preceq \mu v\}.$

In our present example, we see that the cone ${{\mathop{\mathcal C}}=[0,\infty)^d}$ induces the partial order ${v\preceq w \Leftrightarrow v_i\leq w_i\ \forall i}$. Thus

$\displaystyle \alpha = \sup \{\lambda \mid \lambda v_i\leq w_i\ \forall i\} = \min_{1\leq i\leq d} \frac{w_i}{v_i}, \ \ \ \ \ (1)$

and similarly ${\beta = \max_{1\leq i\leq d} \frac{w_i}{v_i}}$. Since ${d_{\mathop{\mathcal C}}}$ is symmetric, exchanging the roles of ${v}$ and ${w}$ does not change the distance.
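In coordinates the Hilbert metric on the positive orthant is therefore just the logarithm of the largest coordinate ratio divided by the smallest. A minimal sketch, assuming all entries are strictly positive (the function name `hilbert_distance` is ours):

```python
import math


def hilbert_distance(v, w):
    """Hilbert metric on the cone [0, inf)^d, for vectors with strictly positive entries."""
    ratios = [vi / wi for vi, wi in zip(v, w)]
    # beta / alpha is (largest ratio) / (smallest ratio), whichever way round we divide.
    return math.log(max(ratios) / min(ratios))


# Projectivity: rescaling a vector does not change the distance.
print(hilbert_distance([1.0, 2.0], [2.0, 4.0]))

# Two probability vectors; the coordinate ratios are 3/2 and 1/2, so the distance is log 3.
print(hilbert_distance([0.75, 0.25], [0.5, 0.5]))
```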

2.2. Diameter of ${A({\mathop{\mathcal C}})}$

Now we need to determine the diameter ${\Delta}$ of ${A({\mathop{\mathcal C}})}$ in the Hilbert metric ${d_{\mathop{\mathcal C}}}$. If ${\Delta<\infty}$, then the theorem of Birkhoff from the previous post will imply that ${A}$ contracts distances in ${d_{\mathop{\mathcal C}}}$ by a factor of ${\tanh(\Delta/4)<1}$.

Let ${e_i}$ be the standard basis vectors in ${{\mathbb R}^d}$. Because ${d_{\mathop{\mathcal C}}}$ is projective we can compute ${\Delta}$ by considering ${d_{\mathop{\mathcal C}}(Av,Aw)}$ where ${\sum v_i = \sum w_j = 1}$. Using the triangle inequality, we have

$\displaystyle \begin{aligned} d_{\mathop{\mathcal C}}(Av,Aw) &= d_{\mathop{\mathcal C}}\left(A\sum v_i e_i, A\sum w_j e_j\right) = d_{\mathop{\mathcal C}}\left(\sum v_i (Ae_i), \sum w_j (Ae_j) \right) \\ &\leq \sum_{i,j} v_i w_j d_{\mathop{\mathcal C}}(Ae_i,Ae_j) \leq \max_{i,j} d_{\mathop{\mathcal C}}(Ae_i,Ae_j), \end{aligned}$

so it suffices to consider ${d_{\mathop{\mathcal C}}(Ae_i,Ae_j)}$ for ${1\leq i,j\leq d}$. But ${Ae_i}$ is just the ${i}$th column of the matrix ${A}$, so writing ${A=[v^1 \cdots v^d]}$, where ${v^i}$ is the ${i}$th column vector, we see that

$\displaystyle \Delta \leq \max_{i,j} d_{\mathop{\mathcal C}}(v^i,v^j). \ \ \ \ \ (2)$

2.3. Contraction under multiplication by ${A}$

Now we have a very concrete procedure for estimating the amount of contraction in the ${d_{\mathop{\mathcal C}}}$ metric under multiplication by ${A}$:

1. estimate ${\Delta}$ using (2) and the expression for ${d_{\mathop{\mathcal C}}}$ in (1) and the discussion preceding it;
2. get a contraction rate of ${\tanh(\Delta/4)<1}$.

From (1) and the discussion preceding it, the distance ${d_{\mathop{\mathcal C}}(v^i,v^j)}$ is given as

$\displaystyle d_{\mathop{\mathcal C}}(v^i,v^j) = \log\beta - \log\alpha = \log\left(\max_{1\leq k\leq d} \frac{v^i_k}{v^j_k}\cdot \max_{1\leq k\leq d} \frac{v^j_k}{v^i_k}\right). \ \ \ \ \ (3)$

Let ${\Lambda = \tanh(\Delta/4)}$. To write an explicit estimate for ${\Lambda}$, we use

$\displaystyle \Lambda = \frac{e^{\Delta/4} - e^{-\Delta/4}}{e^{\Delta/4} + e^{-\Delta/4}} = \frac{1 - e^{-\Delta/2}}{1+e^{-\Delta/2}} \leq \frac{1-s}{1+s}, \ \ \ \ \ (4)$

where ${s<1}$ is any estimate we can obtain satisfying ${e^{-\Delta/2}\geq s}$. From (3) and (2), we have

$\displaystyle e^{-\Delta/2} \geq \min_{i,j}\sqrt{ \min_k\left(\frac {v_k^i}{v_k^j}\right) \min_k \left(\frac{v_k^j}{v_k^i}\right)} =:s. \ \ \ \ \ (5)$

This allows us to obtain estimates on ${d_{\mathop{\mathcal C}}(A^nv, A^nw)}$. However, we want to estimate ${d(A^nv,A^nw)}$ in a more familiar metric, such as one coming from a norm. We can relate the two by observing that if ${v,w\in(0,1]^d}$ are probability vectors (so that both maxima below are at least 1), then

$\displaystyle \begin{aligned} d_{\mathop{\mathcal C}}(v,w) &= \log \max_k \left(\frac {v_k}{w_k}\right) + \log \max_k \left(\frac{w_k}{v_k}\right) \\ &\geq \max_k|\log v_k - \log w_k| \geq \max_k |v_k-w_k| = \|v-w\|_{L^\infty}, \end{aligned}$

where the last inequality uses the fact that ${\log }$ has derivative ${\geq 1}$ on ${(0,1]}$. Since ${A}$ maps the unit simplex to itself (because ${A}$ is stochastic), we see that

$\displaystyle \|A^nv-A^nw\|_{L^\infty} \leq d_{\mathop{\mathcal C}}(A^nv,A^nw) \leq C\Lambda^n, \ \ \ \ \ (6)$

where ${\Lambda}$ is given by (4) and (5), and where we can take either ${C=d_{\mathop{\mathcal C}}(v,w)}$ or ${C=\Delta/\Lambda}$ (since ${d_{\mathop{\mathcal C}}(Av,Aw)\leq \Delta}$), whichever gives the better bound. Since all norms on ${{\mathbb R}^d}$ are equivalent, we have a similar bound in any norm.
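The whole procedure is easy to test numerically. The sketch below uses a hypothetical ${3\times 3}$ positive stochastic matrix and two hypothetical starting distributions (none of these numbers come from the post), computes ${s=\exp(-\tfrac 12\max_{i,j} d_{\mathop{\mathcal C}}(v^i,v^j))}$ as dictated by (2) and (3), and checks the bound (6) with ${C=d_{\mathop{\mathcal C}}(v,w)}$ for the first twenty steps.

```python
import math


def step(A, v):
    """Return A v for a column-stochastic matrix A."""
    d = len(v)
    return [sum(A[i][j] * v[j] for j in range(d)) for i in range(d)]


def hilbert_distance(v, w):
    """Hilbert metric on the positive orthant (strictly positive entries assumed)."""
    ratios = [a / b for a, b in zip(v, w)]
    return math.log(max(ratios) / min(ratios))


def contraction_bound(A):
    """Upper bound (1 - s) / (1 + s) for Lambda, with s = exp(-max column distance / 2)."""
    d = len(A)
    cols = [[A[k][i] for k in range(d)] for i in range(d)]
    s = min(math.exp(-hilbert_distance(vi, vj) / 2) for vi in cols for vj in cols)
    return (1 - s) / (1 + s)


# Hypothetical positive matrix; every column sums to 1.
A = [[0.5, 0.2, 0.3],
     [0.3, 0.6, 0.3],
     [0.2, 0.2, 0.4]]
v = [0.7, 0.2, 0.1]
w = [0.1, 0.3, 0.6]

Lam = contraction_bound(A)
C = hilbert_distance(v, w)

checks = []
for n in range(21):
    sup_dist = max(abs(a - b) for a, b in zip(v, w))
    checks.append(sup_dist <= C * Lam ** n)  # inequality (6) at step n
    v, w = step(A, v), step(A, w)

print(Lam, all(checks))
```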

3. Nonnegative matrices

The analysis in the previous section required ${A}$ to be positive (${A_{ij}>0}$ for all ${i,j}$). A more general condition is that ${A}$ is nonnegative and primitive: that is, ${A_{ij}\geq 0}$ for all ${i,j}$, and moreover there exists ${N}$ such that ${A^N>0}$.

If ${A_{ij}=0}$ for some ${i,j}$, then it is easy to see from the calculations in the previous section that ${A({\mathop{\mathcal C}})}$ has infinite diameter in the Hilbert metric, so the above arguments do not apply directly. However, they do apply to ${A^N}$ when ${A^N>0}$, and so we fix ${N}$ for which this is true, and we obtain ${\Lambda<1}$ such that ${d_{\mathop{\mathcal C}}(A^Nv, A^Nw) \leq \Lambda d_{\mathop{\mathcal C}}(v,w)}$ for all ${v,w\in{\mathop{\mathcal C}}}$.

Moreover, let ${L\in{\mathbb R}}$ be such that ${\|A^r\|\leq L}$ for all ${0\leq r<N}$. Then for any ${n\in{\mathbb N}}$ we can write ${A^n = A^{kN+r}}$ for some ${0\leq r<N}$, so that

$\displaystyle \|A^nv-A^nw\| = \|A^r(A^{kN}v - A^{kN}w)\| \leq LC \Lambda^k,$

where ${C}$ is as in (6). Thus we conclude that asymptotically, ${A^nv}$ approaches the eigenvector with contraction rate ${\Lambda^{1/N}}$.

To see this in action, consider a Markov chain with transition matrix

$\displaystyle A=\begin{pmatrix} \frac 12 & 1 \\ \frac 12 & 0\end{pmatrix}.$

That is, from the first state the walker transitions to either state with probability 1/2, while from the second state the walker always returns to the first state. Since the transition from the second state to itself is forbidden, ${A({\mathop{\mathcal C}})}$ has infinite diameter. However, the two-step transition matrix is

$\displaystyle A^2 = \begin{pmatrix} \frac 34 & \frac 12 \\ \frac 14 & \frac 12 \end{pmatrix},$

for which we can compute

$\displaystyle s = \sqrt{\frac{1/4}{1/2} \cdot \frac{1/2}{3/4}} = \sqrt{\frac 12 \cdot \frac 23} = \frac 1{\sqrt{3}} \quad \Rightarrow\quad \Lambda \leq \frac{\sqrt{3}-1}{\sqrt{3}+1} \approx .268.$

Thus the estimate on ${A^2}$ gives us a definite rate of contraction, which the estimate from ${A}$ does not.
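For larger primitive matrices one can search for the first positive power mechanically. The following is a minimal sketch in exact rational arithmetic; the helper names (`mat_mul`, `first_positive_power`, `s_squared`) and the cut-off `max_power` are our own choices, not anything from the post.

```python
from fractions import Fraction as F


def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    d = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(d)) for j in range(d)] for i in range(d)]


def is_positive(M):
    return all(x > 0 for row in M for x in row)


def first_positive_power(A, max_power=50):
    """Return (N, A^N) for the smallest N <= max_power with A^N > 0, or None if there is none."""
    P = A
    for N in range(1, max_power + 1):
        if is_positive(P):
            return N, P
        P = mat_mul(P, A)
    return None


def s_squared(M):
    """Exact value of s^2 = exp(-max_{i,j} d_C(v^i, v^j)) for the columns v^i of a positive matrix."""
    d = len(M)
    cols = [[M[k][i] for k in range(d)] for i in range(d)]
    return min(
        min(a / b for a, b in zip(vi, vj)) * min(b / a for a, b in zip(vi, vj))
        for vi in cols for vj in cols
    )


# The chain above: the second state always returns to the first.
A = [[F(1, 2), F(1)],
     [F(1, 2), F(0)]]
N, AN = first_positive_power(A)
print(N, AN, s_squared(AN))  # N = 2 and s^2 = 1/3 for this chain
```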

It can be useful to use the estimate on ${A^N}$ even when ${A>0}$. For example, if we consider the Markov chain with transition matrix

$\displaystyle A = \begin{pmatrix} \frac 15 & \frac 9{10} \\ \frac 45 & \frac 1{10} \end{pmatrix},$

then we have

$\displaystyle s = \sqrt{\frac{1/5}{9/10}\cdot \frac{1/10}{4/5}} = \sqrt{\frac 29 \cdot \frac 18} = \frac 16 \quad\Rightarrow\quad \Lambda\leq \frac 57 \approx .714$

as the rate of contraction, while considering

$\displaystyle A^2 = \begin{pmatrix} \frac{19}{25} & \frac{27}{100} \\ \frac{6}{25} & \frac{73}{100} \end{pmatrix}$

gives

$\displaystyle s = \sqrt{\frac{27/100}{19/25}\cdot\frac{6/25}{73/100}} \approx .3418 \quad\Rightarrow\quad \Lambda \leq \frac{.6582}{1.3418}\approx .4906 \approx (.7)^2,$

a better estimate than we obtained from considering ${A}$ itself: two steps of the bound for ${A}$ only give ${(5/7)^2\approx .510}$.
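It is instructive to compare these bounds with the true rate. For a ${2\times 2}$ column-stochastic matrix the eigenvalues are ${1}$ and ${\mathrm{tr}(A)-1}$, so here the second eigenvalue has modulus ${7/10}$. The sketch below (helper names are ours) recomputes the numbers above in exact arithmetic and converts the two-step bound into a per-step rate.

```python
import math
from fractions import Fraction as F


def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    d = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(d)) for j in range(d)] for i in range(d)]


def s_squared(M):
    """Exact s^2 = exp(-max_{i,j} d_C(v^i, v^j)) for the columns v^i of a positive matrix."""
    d = len(M)
    cols = [[M[k][i] for k in range(d)] for i in range(d)]
    return min(
        min(a / b for a, b in zip(vi, vj)) * min(b / a for a, b in zip(vi, vj))
        for vi in cols for vj in cols
    )


def rate_bound(M):
    """The bound (1 - s) / (1 + s) from (4)."""
    s = math.sqrt(s_squared(M))
    return (1 - s) / (1 + s)


A = [[F(1, 5), F(9, 10)],
     [F(4, 5), F(1, 10)]]
A2 = mat_mul(A, A)

bound_one_step = rate_bound(A)                 # bound on contraction per step, from A
bound_two_steps = rate_bound(A2)               # bound on contraction per two steps, from A^2
per_step_from_A2 = math.sqrt(bound_two_steps)  # equivalent per-step rate
second_eigenvalue = abs(A[0][0] + A[1][1] - 1) # |trace - 1| for a 2x2 stochastic matrix

print(bound_one_step, per_step_from_A2, float(second_eigenvalue))
```

In this example the per-step rate obtained from ${A^2}$ sits just above the modulus of the second eigenvalue, so the cone estimate loses very little.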

## Convex cones and the Hilbert metric

Having spent some time discussing spectral methods and coupling techniques as tools for studying the statistical properties of dynamical systems, we turn now to a third approach, based on convex cones and the Hilbert metric. This post is based on Will Ott’s talk from March 25.

1. Basic definitions

Let ${V}$ be a vector space over the reals. Ultimately we will be most interested in the case when ${V}$ is a function space, such as ${L^1}$ or ${BV}$, but for now we make the definitions in the general context.

Definition 1 A subset ${C\subset V}$ is a convex cone (or positive cone) if

1. ${C\cap (-C) = \emptyset}$;
2. ${\lambda C = C}$ for each ${\lambda>0}$;
3. ${C}$ is convex; and
4. for all ${f,g\in C}$ and ${\alpha\in {\mathbb R}}$, we have the following property: if ${\alpha_n\rightarrow \alpha}$ and ${g-\alpha_n f\in C}$ for every ${n}$, then ${g-\alpha f\in C\cup \{0\}}$.

The first three conditions are very geometric and in some sense guarantee that ${C}$ “looks like a cone should look”. The last condition is more topological; if ${V}$ is a topological vector space and ${C\cup \{0\}}$ is a closed subset of ${V}$, then this condition holds, but we stress that the condition itself is actually weaker than this and is phrased without reference to any topology on ${V}$.

Example 1 Let ${V=BV([0,1],{\mathbb R})}$ be the space of all real-valued functions on the unit interval with bounded variation, and let ${C = \{ \varphi\in V \mid \varphi\geq 0, \varphi\not\equiv 0\}}$. Then ${C}$ is a convex cone.

We see immediately from this example that the notion of convex cone is relevant to the sorts of questions we want to ask about invariant measures of a dynamical system, because this set ${C}$ is exactly the set of density functions that arises when we are searching for an absolutely continuous invariant measure.

This suggests that we will ultimately want to consider the action of some operator ${L\colon C\rightarrow C}$, and in particular may want to find a fixed point of this action (for a suitable operator ${L}$). One of the most powerful methods for finding a fixed point is to find a metric in which ${L}$ acts as a contraction, and this is accomplished by the Hilbert metric, which we now introduce.

Definition 2 Fix a convex cone ${C\subset V}$. Given ${\varphi,\psi\in C}$, let

$\displaystyle \begin{aligned} \beta(\varphi,\psi) &= \inf \{\mu>0 \mid \mu\varphi - \psi\in C\},\\ \alpha(\varphi,\psi) &= \sup \{\lambda>0 \mid \psi - \lambda\varphi \in C\}, \end{aligned} \ \ \ \ \ (1)$

with ${\alpha=0}$ and/or ${\beta=\infty}$ if the corresponding set is empty. The cone distance between ${\varphi}$ and ${\psi}$ is

$\displaystyle d_C(\varphi,\psi) = \log \left( \frac{\beta(\varphi,\psi)}{\alpha(\varphi,\psi)}\right). \ \ \ \ \ (2)$

The distance ${d_C}$ is also called the Hilbert (projective) metric.

Several remarks are now in order. First we observe that although ${V}$ may be infinite-dimensional, the distance ${d_C(\varphi,\psi)}$ is completely determined in terms of the two-dimensional subspace spanned by ${\varphi}$ and ${\psi}$, and in particular by the points shown in Figure 1 — in the figure, the lines ${0A}$ and ${0B}$ are the boundary of this two-dimensional cross-section of ${C}$. The lines ${0X}$ and ${Y\psi}$ are parallel, as are the lines ${0A}$ and ${\psi X}$; then we have

$\displaystyle \alpha = \frac{|\psi Y|}{|0\varphi|} \text{ and } \beta = \frac{|0X|}{|0\varphi|}.$

Fig 1

An alternate description of ${d_C}$ is available in terms of this more geometric description. Let ${\ell}$ be the line through ${\varphi}$ and ${\psi}$, and let ${A,B}$ be the points where this line intersects the boundary of ${C}$. We see from Figure 1 that the triangles ${BY\psi}$ and ${B0\varphi}$ are similar, so

$\displaystyle \alpha = \frac{|\psi Y|}{|0\varphi|} = \frac{|B\psi|}{|B\varphi|}.$

Furthermore, ${\varphi 0A}$ and ${\varphi X\psi}$ are similar, so

$\displaystyle \beta = \frac{|0X|}{|0\varphi|} = 1 + \frac{|\varphi X|}{|0\varphi|} = 1 + \frac{|\psi\varphi|}{|A\varphi|} = \frac{|A\psi|}{|A\varphi|}.$

Thus ${d_C}$ can be given in terms of the cross-ratio of the points ${\varphi,\psi,A,B}$:

$\displaystyle \frac \beta\alpha = \frac{|A\psi|}{|A\varphi|}\frac{|B\varphi|}{|B\psi|} = (\varphi,\psi;A,B).$

We have

$\displaystyle d_C(\varphi,\psi) = \log(\varphi,\psi;A,B). \ \ \ \ \ (3)$

Note that it is possible that the line ${\ell}$ does not intersect the boundary of ${C}$ twice; this corresponds to the case when either ${\alpha=0}$ or ${\beta=\infty}$ (or both) in (1), and in this case ${d_C(\varphi,\psi)=\infty}$.

This situation occurs, for example, when we take ${V=BV([0,1],{\mathbb R})}$ and ${C}$ as in the example above, and consider ${\varphi,\psi\in C}$ with disjoint supports — that is, ${\varphi(x)\psi(x)=0}$ for all ${x}$. In this case ${\alpha=0}$ and ${\beta=\infty}$ so the cone distance between ${\varphi}$ and ${\psi}$ is infinite.

Because of this phenomenon, ${d_C}$ is not a true metric. Moreover, we observe that ${d_C}$ is projective: ${d_C(\varphi,\lambda\varphi)=0}$ for every ${\lambda>0}$.

An important property of the Hilbert metric is the following theorem, due to Birkhoff, which states that a linear map from one convex cone to another is a contraction whenever its image has finite diameter.

Theorem 3 Let ${C_1\subset V_1}$ and ${C_2\subset V_2}$ be convex cones, and let ${L\colon V_1\rightarrow V_2}$ be a linear map such that ${L(C_1)\subset C_2}$. (This is a sort of `positivity’ condition.) Let

$\displaystyle \Delta = \sup_{\hat\varphi,\hat\psi\in L(C_1)} d_{C_2}(\hat\varphi,\hat\psi).$

Then for all ${\varphi,\psi\in C_1}$, we have

$\displaystyle d_{C_2}(L\varphi,L\psi)\leq \tanh\left(\frac \Delta4\right) d_{C_1}(\varphi,\psi), \ \ \ \ \ (4)$

where we use the convention that ${\tanh\infty=1}$.

We also want to relate ${d_C}$ to a more familiar norm. Say that a norm ${\|\cdot\|}$ on ${V}$ is adapted if the following is true: whenever ${\varphi,\psi\in V}$ are such that ${\varphi-\psi\in C}$ and ${\varphi+\psi\in C}$, we have ${\|\psi\|\leq\|\varphi\|}$.

Example 2 On ${BV}$, the ${L^1}$ norm is adapted, but the BV norm is not.

The following lemma, due to Liverani, Saussol, and Vaienti, relates the cone metric to an adapted norm.

Lemma 4 Let ${\|\cdot\|}$ be an adapted norm on ${V}$ and ${C\subset V}$ a convex cone. Then for all ${\varphi,\psi\in C}$ with ${\|\varphi\|=\|\psi\|>0}$, we have

$\displaystyle \|\varphi-\psi\| \leq \left(e^{d_C(\varphi,\psi)} - 1\right) \|\varphi\|. \ \ \ \ \ (5)$
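A finite-dimensional sanity check may help here. On ${{\mathbb R}^d}$ with the cone of nonnegative, nonzero vectors, the ${\ell^1}$ norm is adapted (if ${\varphi\pm\psi\geq 0}$ coordinatewise then ${|\psi_i|\leq\varphi_i}$), so Lemma 4 applies to probability vectors. The sketch below tests (5) on a few hypothetical pairs; this setting and all names are our own illustration, not the BV setting of the lemma.

```python
import math


def l1_norm(v):
    return sum(abs(x) for x in v)


def hilbert_distance(v, w):
    """Hilbert metric on the positive orthant (strictly positive entries assumed)."""
    ratios = [a / b for a, b in zip(v, w)]
    return math.log(max(ratios) / min(ratios))


def lemma_gap(v, w):
    """Right-hand side minus left-hand side of (5); should be nonnegative."""
    lhs = l1_norm([a - b for a, b in zip(v, w)])
    rhs = (math.exp(hilbert_distance(v, w)) - 1) * l1_norm(v)
    return rhs - lhs


# Pairs of probability vectors, so the two norms agree as the lemma requires.
pairs = [
    ([0.7, 0.2, 0.1], [0.1, 0.3, 0.6]),
    ([0.5, 0.3, 0.2], [0.49, 0.31, 0.2]),
    ([0.25, 0.25, 0.5], [0.25, 0.25, 0.5]),
]
gaps = [lemma_gap(v, w) for v, w in pairs]
print(gaps)
```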

Convex cones and the Hilbert metric are well suited to studying nonequilibrium open systems. Consider the following setting. Let ${X}$ be a Riemannian manifold, ${\lambda}$ the volume on ${X}$, and ${\hat f_i\colon X\rightarrow X}$ a diffeomorphism for each ${i\in{\mathbb N}}$. For ${m\in {\mathbb N}}$, let ${\hat F_m = \hat f_m \circ \cdots \circ \hat f_1}$. This is a nonequilibrium closed system. (Nonequilibrium because the map changes at each time step, closed because every point can be iterated arbitrarily many times.)

Now consider sets ${H_j\subset X}$, which we interpret as a “hole” at time ${j}$. The time-${m}$ survivor set is

$\displaystyle S_m = X\setminus \bigcup_{i=1}^m \hat F_i^{-1}(H_i),$

the set of points that do not fall into a hole before time ${m}$. Let ${F_m = \hat F_m|_{S_m}}$. We refer to the pair ${(F_m, H_m)}$ as a nonequilibrium open dynamical system.

We would like an analogue of decay of correlations for such systems. Let ${\varphi_0,\psi_0}$ be two probability density functions on ${X}$, and evolve these under ${(F_m, H_m)}$. We expect that ${\|\varphi_t\|_{L^1(\lambda)} < 1}$ because there is a positive probability of falling into a hole.

Let ${\hat{\mathcal{P}}_j}$ be the Perron–Frobenius operator for the closed system ${\hat f_j}$ (with respect to ${\lambda}$). Then to the open system ${f_j}$ we can associate the operator

$\displaystyle \mathcal{P}_j(\varphi) = \hat{\mathcal{P}}_j(\varphi) {\mathbf{1}}_{X\setminus H_j}.$

Definition 5 We say that ${(F_m,H_m)}$ exhibits conditional memory loss in the statistical sense if for all suitably chosen ${\varphi_0, \psi_0}$, we have

$\displaystyle \lim_{t\rightarrow\infty} \left\| \frac{\varphi_t}{\|\varphi_t\|_{L^1(\lambda)}} - \frac{\psi_t}{\|\psi_t\|_{L^1(\lambda)}} \right\|_{L^1(\lambda)} = 0.$

The idea of this definition is that before comparing the probabilities, we need to first condition on the event that the trajectory survives. Next time we will investigate this property for piecewise expanding interval maps using the Lasota–Yorke inequality, where the holes ${H_j}$ are small and vary slowly.


## Spectral methods 3 – central limit theorem

With the previous post on convergence of random variables, the law of large numbers, and Birkhoff’s ergodic theorem as background, we return to the spectral methods discussed in the first two posts in this series. This post is based on Andrew Török’s talk from March 4 and gives a proof of the central limit theorem using the spectral gap property.

1. Central limit theorem for IID

Now we recall the statement of the central limit theorem (CLT) and give a proof in the case of IID (independent identically distributed) random variables.

The weak law of large numbers says that if ${X_n}$ is a sequence of IID random variables with ${\mathop{\mathbb E}[X_n] = 0}$, then writing ${S_n = \sum_{k=1}^{n}X_k}$, the time averages ${\frac 1n S_n}$ converge to ${0}$ in probability, or equivalently (since the limit is a constant), in distribution. In the case when ${\sigma^2 = \mathop{\mathbb E}[X_n^2]<\infty}$, the central limit theorem strengthens this to the result that the sequence ${\frac 1{\sqrt{n}} S_n}$ converges in distribution to ${N(0,\sigma^2)}$, the normal distribution with mean ${0}$ and variance ${\sigma^2}$. That is, we have

$\displaystyle \mathop{\mathbb P}\left( \frac 1{\sqrt n} S_n \leq c\right) \xrightarrow{n\rightarrow\infty} \frac 1{\sigma\sqrt{2\pi}} \int_{-\infty}^c e^{-t^2/2\sigma^2}\,dt \ \ \ \ \ (1)$

for every ${c\in {\mathbb R}}$.

This can be established by the same method as we used last time for the proof of the weak law of large numbers, by studying the characteristic functions of ${\frac 1{\sqrt n}S_n}$ and ${N(0,\sigma^2)}$. The characteristic function of ${N(0,\sigma^2)}$ is

$\displaystyle \psi(t) = e^{-\frac 12 \sigma^2 t^2}. \ \ \ \ \ (2)$

Arguing as in the proof of the weak law of large numbers in the previous post, we write ${\varphi_n}$ for the characteristic function of ${\frac 1{\sqrt n} S_n}$ and observe that

$\displaystyle \varphi_n(t) = \mathop{\mathbb E}[e^{it \frac 1{\sqrt n} (X_1+\cdots + X_n)}] = \prod_{j=1}^n \mathop{\mathbb E}[e^{\frac{it}{\sqrt n} X_j}] = \varphi\left(\frac {t}{\sqrt n}\right)^n, \ \ \ \ \ (3)$

where ${\varphi}$ is the characteristic function of the ${X_j}$ (which are identically distributed), and the second equality uses the fact that the ${X_j}$ are independent.

Now by Taylor’s theorem, we have

$\displaystyle \begin{aligned} \varphi(t/\sqrt{n}) &= \mathop{\mathbb E}[e^{\frac {it}{\sqrt n}X_j}] = 1 + \frac{it}{\sqrt n} \mathop{\mathbb E}[X_j] - \frac{t^2}{2n} \mathop{\mathbb E}[X_j^2] + o\left(\frac{t^2}n\right) \\ &= 1 - \frac {t^2}{2n}\sigma^2 + o\left(\frac{t^2}n\right), \end{aligned}$

using the fact that the ${X_j}$ have mean 0 and variance ${\sigma^2}$. Thus we conclude from (3) that

$\displaystyle \varphi_n(t) = \left(1 - \frac{t^2\sigma^2}{2n} + o\left(\frac{t^2}n\right)\right)^n \xrightarrow{n\rightarrow\infty} e^{-\frac 12 t^2\sigma^2} = \psi(t),$

which completes the proof of the CLT in the IID case.
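For a concrete case, take fair coin flips ${X_j=\pm 1}$ (our choice of example): then ${\varphi(t)=\cos t}$ and ${\sigma^2=1}$, so (3) says ${\varphi_n(t)=\cos(t/\sqrt n)^n}$, which should approach ${e^{-t^2/2}}$. A short numerical sketch:

```python
import math


def phi_n(t, n):
    """Characteristic function of S_n / sqrt(n) for IID fair +-1 coin flips."""
    return math.cos(t / math.sqrt(n)) ** n


def psi(t):
    """Characteristic function of the standard normal distribution N(0, 1)."""
    return math.exp(-t * t / 2)


t = 1.0
errors = [abs(phi_n(t, n) - psi(t)) for n in (10, 100, 1000, 10000)]
print(errors)  # the gap shrinks as n grows
```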

2. CLT with spectral gap

To translate the CLT into the language of dynamical systems, we consider a space ${X}$ and a map ${T\colon X\rightarrow X}$ with an invariant measure ${\mu}$. In general there may be many ${T}$-invariant measures, and so it is important to choose a suitable measure ${\mu}$. For example, when ${X}$ is an interval and ${T}$ is piecewise expanding, we are most interested in the case when ${\mu}$ is an acip (absolutely continuous invariant probability measure).

Given a measurable function ${f\colon X\rightarrow {\mathbb R}}$, the sequence of functions ${f}$, ${f\circ T}$, ${f\circ T^2, \dots}$ defines a sequence of identically distributed random variables on ${X}$. However, they are not independent, and so we need some information about the decay of correlations between them. In particular, we can replicate the proof from the previous section as long as the transfer operator has a spectral gap.

Let’s make this precise in the case when ${T}$ is a piecewise expanding interval map, so the Lasota–Yorke inequality we discussed in an earlier post yields a spectral gap for the transfer operator ${\mathcal{P}_T}$ acting on ${BV}$, the space of functions of bounded variation, and in particular establishes the existence of an acip ${\mu}$.

Theorem 1 Let ${T}$ be a piecewise expanding interval map and ${\mu}$ the acip constructed before. Suppose that ${\mu}$ is mixing. Then ${\mu}$ satisfies the central limit theorem as follows: given any ${f\in BV}$ with ${\int f\,d\mu = 0}$ and writing ${S_nf(x) = \sum_{k=0}^{n-1} f(T^k x)}$, we have

$\displaystyle \mu \left\{ x \mid \frac 1{\sqrt{n}} S_nf(x) \leq c \right\} \xrightarrow{n\rightarrow\infty} \frac 1{\sigma\sqrt{2\pi}} \int_{-\infty}^c e^{-x^2/2\sigma^2}\,dx \ \ \ \ \ (4)$

for all ${c\in {\mathbb R}}$, where ${\sigma}$ is given by the Green–Kubo formula

$\displaystyle \sigma^2 = \sum_{n\in{\mathbb Z}} \int_X f \cdot (f\circ T^n) \,d\mu, \ \ \ \ \ (5)$

and ${\sigma=0}$ if and only if there exists ${g\in BV}$ and ${c\in {\mathbb R}}$ such that ${f=c+g\circ T - g}$.

Before proving the theorem, we make some remarks concerning the Green–Kubo formula (5). First, note that the sum converges as soon as we establish exponential decay of correlations for functions in ${BV}$, since each integral in the sum is just the correlation function at time ${n}$. Second, note that if we replace the functions ${f\circ T^n}$ with independent random variables, then all the terms with ${n\neq 0}$ vanish, and the ${n=0}$ term is just the variance ${\mathop{\mathbb E}[X^2]}$, as in the previous section.

Note also that using (5), ${\sigma^2}$ can be written as

$\displaystyle \sigma^2 = \lim_{n\rightarrow\infty} \frac 1n \int (S_n f)^2\,d\mu.$

Now we prove the central limit theorem (4). As in the IID case, we use the characteristic functions

$\displaystyle \begin{aligned} \psi(t) &= e^{-\sigma^2 t^2/2}, \\ \varphi_n(t) &= \mathop{\mathbb E}_\mu[e^{it(S_nf)/\sqrt{n}}] = \int e^{\frac{it}{\sqrt n} S_nf} \,d\mu, \end{aligned}$

where ${\psi(t)}$ is the characteristic function of the normal distribution and ${\varphi_n}$ is the characteristic function of ${\frac 1{\sqrt n}S_nf}$, so it suffices to show that ${\varphi_n(t)\rightarrow \psi(t)}$ for all ${t}$.

To prove this convergence of the characteristic functions, we use the following procedure.

1. Write the characteristic functions ${\varphi_n}$ in terms of a twisted transfer operator ${\mathcal{P}_{f,t}}$, where ${f}$ is the function we are investigating in the CLT, and ${t\approx 0}$ is a small real parameter. The operator ${\mathcal{P}_{f,t}}$ is a small perturbation of the transfer operator ${\mathcal{P}_T}$.
2. Use perturbation theory of operators to show that ${\mathcal{P}_{f,t}}$ has a spectral gap and to derive asymptotics for the leading eigenvalue ${\lambda(t)}$. In particular, relate ${\lambda'(0)}$ and ${\lambda''(0)}$ to the mean and variance of the limiting distribution.

First we define the transfer operator itself by the implicit equation

$\displaystyle \int (\mathcal{P}_T g)\cdot h\,d\mu = \int g\cdot (h\circ T)\,d\mu \ \ \ \ \ (6)$

for all ${g\in L^1(\mu)}$ and ${h\in L^\infty}$. Note that this is different from the transfer operator obtained by integrating with respect to Lebesgue measure instead of ${\mu}$ in (6); it is a worthwhile exercise to determine the precise relationship between the two.

More directly, the transfer operator can be defined by

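$\displaystyle \mathcal{P}_T g(x) = \frac 1{h(x)} \sum_{y\in T^{-1}(x)} \frac{g(y)h(y)}{|T'(y)|}, \ \ \ \ \ (7)$

where ${h}$ now denotes the density of ${\mu}$ with respect to Lebesgue measure (not the test function in (6)), and ${x}$ ranges over points with ${h(x)>0}$.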

Now given ${f\in BV}$ and ${t\in{\mathbb R}}$, we define the twisted transfer operator by

$\displaystyle \mathcal{P}_{f,t}g = \mathcal{P}_T(e^{itf}g). \ \ \ \ \ (8)$

To see the utility of this definition, we first note that

$\displaystyle \int\mathcal{P}_{f,t}(g)\,d\mu = \int \mathcal{P}_T(e^{itf}g){\mathbf{1}}\,d\mu =\int e^{itf} g\,d\mu,$

and so by induction we have

$\displaystyle \int \mathcal{P}_{f,t}^n(g) \,d\mu = \int e^{itS_nf} g\,d\mu.$

In particular, considering the characteristic function ${\varphi_n}$, we have

$\displaystyle \varphi_n(t) = \int e^{\frac {it}{\sqrt n} S_n f}\,d\mu = \int \mathcal{P}_{f,\frac{t}{\sqrt n}}^n ({\mathbf{1}})\,d\mu, \ \ \ \ \ (9)$

which accomplishes the first stage of the proof — writing the characteristic function in terms of the twisted transfer operator.
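The identity behind (9) can be checked by hand in a simple case. In the sketch below we assume the doubling map ${T(x)=2x \bmod 1}$, for which Lebesgue measure is invariant with density ${1}$, so that (7) reduces to averaging over the two preimages ${x/2}$ and ${(x+1)/2}$, and we take ${f=1}$ on ${[0,\frac 12)}$ and ${f=-1}$ on ${[\frac 12,1)}$. This example and all function names are ours. Here ${\mathcal{P}_{f,t}{\mathbf{1}}}$ is the constant ${\cos t}$, so both sides of the identity equal ${(\cos t)^n}$.

```python
import cmath
import math


def T(x):
    """Doubling map on [0, 1)."""
    return (2 * x) % 1.0


def f(x):
    """Mean-zero step observable: +1 on [0, 1/2), -1 on [1/2, 1)."""
    return 1.0 if x < 0.5 else -1.0


def twisted(g, t):
    """Return the function P_{f,t} g = P_T(e^{itf} g) for the doubling map."""
    def Pg(x):
        y0, y1 = x / 2, (x + 1) / 2  # the two preimages of x, each with |T'| = 2
        return 0.5 * (cmath.exp(1j * t * f(y0)) * g(y0) + cmath.exp(1j * t * f(y1)) * g(y1))
    return Pg


def birkhoff_side(t, n, m=12):
    """Integral of e^{it S_n f}, computed over midpoints of dyadic intervals of length 2^-m (n <= m)."""
    N = 2 ** m
    total = 0
    for j in range(N):
        x = (j + 0.5) / N
        s = 0.0
        for _ in range(n):
            s += f(x)
            x = T(x)
        total += cmath.exp(1j * t * s)
    return total / N


def operator_side(t, n, x=0.3):
    """Value of P_{f,t}^n 1 at x; in this example the function is constant, so this is its integral."""
    g = lambda y: 1.0
    for _ in range(n):
        g = twisted(g, t)
    return g(x)


t, n = 0.7, 8
print(birkhoff_side(t, n), operator_side(t, n), math.cos(t) ** n)
```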

For the second stage of the proof, we consider the twisted transfer operator as a perturbation of ${\mathcal{P}_T}$. From the Lasota–Yorke inequality and the fact that ${\mu}$ is mixing, we know that the spectrum of ${\mathcal{P}_T}$ has the form ${\{1\}\cup Z}$, where ${Z}$ is contained in a disc of radius ${r<1}$ centred at the origin.

By the perturbation theory of linear operators, the spectrum of ${\mathcal{P}_{f,t}}$ has the same form for small enough ${|t|}$: there is a leading eigenvalue ${\lambda(t)}$ that is close to ${1}$, and the rest of the spectrum is contained in the disc of radius ${r}$. Moreover, the leading eigenvalue satisfies (Edit: see the end of the post for a proof)

$\displaystyle \lambda'(0) = \int (if) \,d\mu = 0$

and

$\displaystyle \lambda''(0) = \lim_{n\rightarrow\infty} \frac 1n \int (S_n(if))^2\,d\mu = -\sigma^2,$

which is the origin of the expression in the Green–Kubo formula.

Now we use the Riesz functional calculus, whose general ideas we briefly recall here. Let ${X}$ be a Banach space and ${\mathcal{B}(X)}$ the space of bounded linear operators on ${X}$. Given ${S\in \mathcal{B}(X)}$, let ${\sigma\subset{\mathbb C}}$ be the spectrum of ${S}$. Then there is a unique way to associate to each function ${g}$ that is analytic on a neighbourhood of ${\sigma}$ an operator ${g(S)}$ such that the map ${g\mapsto g(S)}$ is a homomorphism mapping the constant function ${1}$ to the identity operator and the identity function to ${S}$.

This mapping can be defined by integrating around a curve ${\gamma}$ surrounding the spectrum ${\sigma}$ (this is similar to the Cauchy formula from complex analysis):

$\displaystyle g(S) = \frac 1{2\pi i} \int_\gamma (zI-S)^{-1} g(z)\,dz,$

where we recall that ${zI-S}$ is invertible for all ${z}$ in the resolvent set ${{\mathbb C}\setminus\sigma}$. If we take ${g}$ to be the characteristic function of part of the spectrum, we obtain a projection to the eigenspace associated with that part.

In particular, considering the operator ${\mathcal{P}_{f,t}}$, we may set ${g = {\mathbf{1}}_{\lambda(t)}}$ and obtain a projection ${\Pi_t}$ onto the eigenspace of ${\lambda(t)}$. Similarly, setting ${g(z)=z{\mathbf{1}}_{Z(t)}(z)}$, where ${Z(t)}$ is the part of the spectrum contained in a disc of radius ${r<1}$, we get an operator ${R_t}$ such that

$\displaystyle \Pi_t R_t = R_t\Pi_t = 0, \qquad \|R_t\|_{BV} < r.$

Moreover, we have

$\displaystyle \mathcal{P}_{f,t} = \lambda(t) \Pi_t + R_t,$

which allows us to write the operator in (9) as

$\displaystyle \begin{aligned} \mathcal{P}_{f,\frac{t}{\sqrt n}}^n &= \lambda \left( \frac t{\sqrt n}\right)^n \Pi_{\frac t{\sqrt n}} + R_{\frac t{\sqrt n}}^n \\ &= \left( 1 + \lambda'(0) \frac t{\sqrt n} + \frac{\lambda''(0)}2 \frac {t^2} n + o\left( \frac {t^2} n\right) \right)^n \Pi_\frac t{\sqrt n} + R_\frac t{\sqrt n}^n \\ &= \left( 1 - \frac{\sigma^2 t^2}{2n} +o\left( \frac {t^2} n\right) \right)^n\left(\Pi_0 + O\left( \frac t{\sqrt n}\right)\right) + R_\frac t{\sqrt n}^n \\ &\xrightarrow{n\rightarrow\infty} e^{-t^2\sigma^2/2}\Pi_0, \end{aligned}$

using the fact that ${\|R_t\|_{BV} < r < 1}$. Now (9) yields

$\displaystyle \varphi_n(t) \rightarrow e^{-t^2\sigma^2/2} \int \Pi_0({\mathbf{1}})\,d\mu = e^{-t^2\sigma^2/2} = \psi(t),$

which completes the proof of the CLT.

3. Proof of formulas for derivatives of ${\lambda}$

The formulas given above for ${\lambda'(0)}$ and ${\lambda''(0)}$ were not explained. Here is a derivation of these formulas.

Let ${g_t}$ be the eigenfunction of ${\mathop{\mathcal P}_{f,t}}$ corresponding to the eigenvalue ${\lambda(t)}$. That is, ${g_t}$ satisfies

$\displaystyle \mathcal{P}_T(e^{itf} g_t) = \lambda(t) g_t.$

Multiplying by a test function ${h}$ and integrating against ${\mu}$ gives

$\displaystyle \int \mathcal{P}_T(e^{itf} g_t) h \,d\mu = \lambda(t) \int g_t h \,d\mu.$

Recalling the definition of ${\mathcal{P}_T}$, this gives

$\displaystyle \int (e^{itf} g_t) (h\circ T) \,d\mu = \lambda(t) \int g_t h \,d\mu. \ \ \ \ \ (10)$

Let ${g'_t = \frac d{dt} g_t}$ and ${g''_t = \frac {d^2}{dt^2} g_t}$. Then differentiating (10) with respect to ${t}$ gives

$\displaystyle \int (if) (e^{itf} g_t) (h\circ T) \,d\mu + \int (e^{itf} g'_t)(h\circ T)\,d\mu = \lambda'(t) \int g_t h\,d\mu + \lambda(t) \int g_t' h\,d\mu. \ \ \ \ \ (11)$

Setting ${t=0}$ and using the fact that ${\lambda(0)=1}$ and ${g_0\equiv 1}$, we get

$\displaystyle \int(if)(h\circ T) \,d\mu + \int (q)(h\circ T) \,d\mu = \lambda'(0) \int h\,d\mu + \int(q) (h)\,d\mu,$

where we write ${q = \frac d{dt}g_t|_{t=0}}$.

Putting ${h\equiv 1}$ gives the expression for ${\lambda'(0)}$. Before finding ${\lambda''(0)}$, we observe that the above equation can also be used to find ${\int qh \,d\mu}$, which will be important later on. Indeed, using the assumption that ${\int f\,d\mu = 0}$, we have ${\lambda'(0)=0}$, and so the above equation becomes

$\displaystyle \int(if)(h\circ T) \,d\mu + \int (q)(h\circ T) \,d\mu = \int(q) (h)\,d\mu. \ \ \ \ \ (12)$

Similarly, replacing ${h}$ with the test functions ${h\circ T^k}$ for ${k\geq 1}$, (12) gives

$\displaystyle \begin{aligned} \int(if)(h\circ T^2) \,d\mu + \int (q)(h\circ T^2) \,d\mu &= \int(q) (h\circ T)\,d\mu, \\ \int(if)(h\circ T^3) \,d\mu + \int (q)(h\circ T^3) \,d\mu &= \int(q) (h\circ T^2)\,d\mu, \end{aligned}$

and so on. Observe that ${\sum_{k\geq 1} \int q\,(h\circ T^k)\,d\mu}$ converges because we have exponential decay of correlations. Thus we may add the above equations (infinitely many of them) and subtract this sum from both sides to obtain

$\displaystyle \int qh\,d\mu = \sum_{k\geq 1} \int (if) (h\circ T^k)\,d\mu. \ \ \ \ \ (13)$

Now we can find the expression for ${\lambda''(0)}$. Set ${h\equiv 1}$ in (11) and differentiate to get

$\displaystyle \int \left((if)^2 (e^{itf} g_t) + 2(if) (e^{itf} g'_t) + (e^{itf} g''_t)\right)\,d\mu = \int (\lambda''(t) g_t + 2\lambda'(t) g'_t + \lambda(t) g''_t)\,d\mu. \ \ \ \ \ (14)$

At ${t=0}$ we see that the terms containing ${g_t''}$ are equal, while ${\lambda'(0)=0}$ by the assumption that ${\int f\,d\mu = 0}$, and so (14) gives

$\displaystyle \int (-f^2)\,d\mu + 2\int (if)(q) \,d\mu = \lambda''(0). \ \ \ \ \ (15)$

From (13), we have

$\displaystyle \int(if)(q)\,d\mu = \sum_{k\geq 1} \int(if) (if\circ T^k)\,d\mu,$

which together with (15) suffices to complete the proof of the expression for ${\lambda''(0)}$.


## Laws of large numbers and Birkhoff’s ergodic theorem

In preparation for the next post on the central limit theorem, it’s worth recalling the fundamental results on convergence of the average of a sequence of random variables: the law of large numbers (both weak and strong), and its strengthening to non-IID sequences, the Birkhoff ergodic theorem.

1. Convergence of random variables

First we need to recall the different ways in which a sequence of random variables may converge. Let ${Y_n}$ be a sequence of real-valued random variables and ${Y}$ a single random variable to which we want the sequence ${Y_n}$ to “converge”. There are various ways of formalising this.

1.1. Almost sure convergence

The strongest notion of convergence is “almost sure” convergence: we write ${Y_n\xrightarrow{a.s.} Y}$ if

$\displaystyle \mathop{\mathbb P}(Y_n \rightarrow Y) = 1. \ \ \ \ \ (1)$

If ${\Omega}$ is the probability space on which the random variables are defined and ${\nu}$ is the probability measure defining ${\mathop{\mathbb P}}$, then this condition can be rewritten as

$\displaystyle \nu\{\omega\in \Omega \mid Y_n(\omega) \rightarrow Y(\omega)\} = 1. \ \ \ \ \ (2)$

1.2. Convergence in probability

A weaker notion of convergence is convergence “in probability”: we write ${Y_n\xrightarrow{p} Y}$ if

$\displaystyle \mathop{\mathbb P}(|Y_n-Y|\geq \epsilon) \rightarrow 0 \text{ for any } \epsilon>0. \ \ \ \ \ (3)$

In terms of ${\Omega}$ and ${\nu}$, this condition is

$\displaystyle \nu\{\omega\in \Omega \mid |Y_n(\omega) - Y(\omega)| \geq \epsilon\} \rightarrow 0 \ \ \ \ \ (4)$

Almost sure convergence implies convergence in probability (by Egorov’s theorem), but not vice versa. For example, let ${I_n\subset[0,1]}$ be any sequence of intervals whose lengths tend to zero and such that for every ${x\in [0,1]}$ the sets

$\displaystyle \{n\mid x\in I_n\},\qquad \{n\mid x\notin I_n\}$

are both infinite. Let ${\Omega=[0,1]}$ and let ${Y_n = {\mathbf{1}}_{I_n}}$ be the characteristic function of the interval ${I_n}$. Then ${Y_n\xrightarrow{p} 0}$ but ${Y_n\not\xrightarrow{a.s.}0}$.
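One concrete choice of intervals can be sketched in a few lines of Python. This is only an illustration under an assumption of ours: we take ${I_n}$ to run through the dyadic intervals of length ${2^{-k}}$ from left to right, for ${k=0,1,2,\dots}$ in turn (the helper name `typewriter_interval` is hypothetical, not from the text above). The lengths shrink to zero, which gives convergence in probability, while every point is hit and missed infinitely often.

```python
from fractions import Fraction

def typewriter_interval(n):
    """Return the endpoints (a, b) of I_n for n >= 1.

    Block k (k = 0, 1, 2, ...) lists the 2**k dyadic intervals
    [j / 2**k, (j + 1) / 2**k], j = 0, ..., 2**k - 1, from left to right.
    """
    k = n.bit_length() - 1          # block index: 2**k <= n < 2**(k + 1)
    j = n - 2**k                    # position of I_n inside block k
    width = Fraction(1, 2**k)
    return j * width, (j + 1) * width

def indicator(n, x):
    """Y_n(x) = 1 if x lies in I_n, else 0."""
    a, b = typewriter_interval(n)
    return 1 if a <= x <= b else 0

x = Fraction(1, 3)
N = 2**12                           # covers blocks k = 0, ..., 11 completely
hits = [n for n in range(1, N) if indicator(n, x) == 1]
misses = [n for n in range(1, N) if indicator(n, x) == 0]

# The length of I_n is the probability that Y_n is nonzero; it tends to 0.
lengths = [b - a for a, b in map(typewriter_interval, (1, 10, 100, 1000))]
print(len(hits), len(misses), lengths)
```

The point ${x=1/3}$ lands in exactly one interval of each block, so the list of hits keeps growing with ${N}$, and so does the list of misses.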

1.3. Convergence in distribution

A still weaker notion of convergence is convergence “in distribution”: we write ${Y_n\xrightarrow{d} Y}$ if, writing ${F_n, F\colon {\mathbb R}\rightarrow[0,1]}$ for the cumulative distribution functions of ${Y_n}$ and ${Y}$, we have ${F_n(t)\rightarrow F(t)}$ at all ${t}$ where ${F}$ is continuous.

Convergence in probability implies convergence in distribution, but the converse fails if ${Y}$ is not a.s.-constant. Here is one broad class of examples showing this: suppose ${Y\colon \Omega\rightarrow {\mathbb R}}$ has ${\mathop{\mathbb P}(Y\in A) = \mathop{\mathbb P}(Y\in -A)}$ for every interval ${A\subset {\mathbb R}}$ (for example, this is true if ${Y}$ is normal with zero mean). Then ${-Y}$ and ${Y}$ have the same CDF, and so any sequence which converges in distribution to one of the two will also converge in distribution to the other; on the other hand, ${Y_n}$ cannot converge in probability to both ${Y}$ and ${-Y}$ unless ${Y=0}$ a.s.

2. Weak law of large numbers

Given a sequence of real-valued random variables ${X_n}$, we consider the sums

$\displaystyle S_n = X_1 + X_2 + \cdots + X_n.$

Then ${\frac 1n S_n}$ is the average of the first ${n}$ observations.

Suppose that the sequence ${X_n}$ is independent and identically distributed (IID) and that ${X_n}$ is integrable — that is, ${\mathop{\mathbb E}(|X_n|) < \infty}$. Then in particular the mean ${\mu = \mathop{\mathbb E}(X_n)}$ is finite. The weak law of large numbers says that ${\frac 1n S_n}$ converges in probability to the constant function ${\mu}$. Because the limiting distribution here is a constant, it is enough to show convergence in distribution. This fact leads to a well-known proof of the weak law of large numbers using characteristic functions.

If a random variable ${Y}$ is absolutely continuous — that is, if it has a probability density function ${f}$ — then its characteristic function ${\varphi_Y}$ is the Fourier transform of ${f}$. More generally, the characteristic function of ${Y}$ is

$\displaystyle \varphi_Y(t) = \mathop{\mathbb E}(e^{itY}). \ \ \ \ \ (5)$

Characteristic functions are related to convergence in distribution by Lévy’s continuity theorem, which says (among other things) that ${Y_n\xrightarrow{d} Y}$ if and only if ${\varphi_{Y_n}(t)\rightarrow\varphi_Y(t)}$ for all ${t\in{\mathbb R}}$. In particular, to prove the weak law of large numbers it suffices to show that the characteristic functions of ${\frac 1n S_n}$ converge pointwise to the function ${e^{it\mu}}$.

Let ${\varphi}$ be the characteristic function of ${X_n}$. (Note that each ${X_n}$ has the same characteristic function because they are identically distributed.) Let ${\varphi_n}$ be the characteristic function of ${\frac 1n S_n}$ — then

$\displaystyle \varphi_n(t) = \mathop{\mathbb E}(e^{\frac{it}{n} (X_1 + \cdots + X_n)}).$

Because the variables ${X_n}$ are independent, we have

$\displaystyle \varphi_n(t) = \prod_{j=1}^n \mathop{\mathbb E}(e^{\frac{it}n X_j}) = \varphi\left(\frac tn\right)^n. \ \ \ \ \ (6)$

By Taylor’s theorem and by linearity of expectation, we have for ${t\approx 0}$ that

$\displaystyle \varphi(t) = \mathop{\mathbb E}(e^{itX_j}) = \mathop{\mathbb E}(1 + itX_j + o(t)) = 1 + it\mu + o(t),$

and together with (6) this gives

$\displaystyle \varphi_n(t) = \left( 1 + \frac{it\mu}n + o(t/n)\right)^n \rightarrow e^{it\mu},$

which completes the proof.
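The convergence in the last display can be checked numerically in a case where ${\varphi}$ is known in closed form. The sketch below assumes, purely for illustration, that the ${X_n}$ are exponential with rate 1, so that ${\mu=1}$ and ${\varphi(t) = 1/(1-it)}$; the function names are ours.

```python
import cmath

def phi_exponential(t):
    """Characteristic function of an Exp(1) random variable: 1 / (1 - it)."""
    return 1 / (1 - 1j * t)

def phi_average(t, n):
    """Characteristic function of S_n / n for IID Exp(1) summands, using (6)."""
    return phi_exponential(t / n) ** n

mu = 1.0                              # mean of Exp(1)
t = 2.0
target = cmath.exp(1j * t * mu)       # the limit e^{it mu}
sample_sizes = (10, 100, 1000, 10000)
errors = [abs(phi_average(t, n) - target) for n in sample_sizes]
for n, err in zip(sample_sizes, errors):
    print(n, err)
```

For this distribution the error shrinks roughly like ${t^2/(2n)}$, which is consistent with the ${o(t/n)}$ term above for each fixed ${t}$.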

3. Strong law of large numbers and ergodic theorem

The strong law of large numbers states that not only does ${\frac 1n S_n}$ converge to ${\mu}$ in probability, it also converges almost surely. This takes a little more work to prove. Rather than describe a proof here (a nice discussion of both laws, including a different proof of the weak law than the one above, can be found on Terry Tao’s blog), we observe that the strong law of large numbers can be viewed as a special case of the Birkhoff ergodic theorem, and then give a proof of this result. First we state the ergodic theorem (or at least, the version of it that is most relevant for us).

Theorem 1
Let ${(X,\mathcal{F},\mu)}$ be a probability space and ${f\colon X\rightarrow X}$ a measurable transformation. Suppose that ${\mu}$ is ${f}$-invariant and ergodic. Then for any ${\varphi\in L^1(\mu)}$, we have

$\displaystyle \frac 1n S_n\varphi(x) \rightarrow \int \varphi\,d\mu \ \ \ \ \ (7)$

for ${\mu}$-a.e. ${x\in X}$, where ${S_n\varphi(x) = \varphi(x) + \varphi(fx) + \cdots + \varphi(f^{n-1}x)}$.

Before giving a proof, we describe how the strong law of large numbers is a special case of Theorem 1. Let ${X_n}$ be a sequence of IID random variables ${\Omega\rightarrow {\mathbb R}}$, and define a map ${\pi\colon \Omega\rightarrow X:={\mathbb R}^{\mathbb N}}$ by

$\displaystyle \pi(\omega) = (X_1(\omega),X_2(\omega),\dots).$

Let ${\nu}$ be the probability measure on ${\Omega}$ that determines ${\mathop{\mathbb P}}$, and let ${\mu = \pi_*\nu = \nu \circ \pi^{-1}}$ be the corresponding probability measure on ${X}$.

Because the variables ${X_n}$ are independent, ${\mu}$ has the form ${\mu = \nu_1 \times \nu_2 \times \cdots}$, where ${\nu_j}$ is the distribution of ${X_j}$ on ${{\mathbb R}}$, and because they are identically distributed, all the marginal distributions ${\nu_j}$ are the same, so in fact ${\mu=\nu_1^{\mathbb N}}$.

The measure ${\mu}$ is invariant and ergodic with respect to the dynamics on ${X}$ given by the shift map ${f(x_1,x_2,x_3,\dots) = (x_2,x_3,x_4,\dots)}$ (this is an example of a Bernoulli measure). Writing ${x=(x_1,x_2,x_3,\dots)\in X}$ and putting ${\varphi(x) = x_1}$, we see that for ${x=\pi(\omega)}$ we have

$\displaystyle X_1(\omega) + \cdots + X_n(\omega) = S_n\varphi(x).$

In particular, the convergence in (7) implies the strong law of large numbers.
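The identification of partial sums with Birkhoff sums along the shift can be made concrete with a small simulation. This is a sketch under assumptions of ours: the ${X_j}$ are taken uniform on ${[0,1]}$, and the infinite sequence ${\pi(\omega)}$ is truncated to finitely many coordinates.

```python
import random

def shift(x):
    """Shift map f(x_1, x_2, x_3, ...) = (x_2, x_3, ...) on a finite truncation."""
    return x[1:]

def phi(x):
    """Observable phi(x) = x_1, the first coordinate."""
    return x[0]

def birkhoff_sum(x, n):
    """S_n phi(x) = phi(x) + phi(f x) + ... + phi(f^{n-1} x)."""
    total = 0.0
    for _ in range(n):
        total += phi(x)
        x = shift(x)
    return total

rng = random.Random(0)              # fixed seed, only for reproducibility
n = 2000
# A finite piece of pi(omega) = (X_1(omega), X_2(omega), ...), X_j uniform on [0, 1].
x = tuple(rng.random() for _ in range(n))

partial_sum = sum(x[:n])            # X_1 + ... + X_n
print(abs(birkhoff_sum(x, n) - partial_sum))   # agreement up to rounding
print(birkhoff_sum(x, n) / n)                  # close to the mean 1/2
```

Slicing the tuple at each step is wasteful but mirrors the definition of the shift directly; an index offset would do the same job faster.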

4. Proving the ergodic theorem

To prove the ergodic theorem, it suffices to consider a function ${\varphi\in L^1(\mu)}$ with ${\int\varphi\,d\mu=0}$ and show that the set

$\displaystyle X_\varepsilon = \left\{ x\in X \mid \varlimsup_{n\rightarrow\infty} \frac 1n S_n\varphi(x) > \varepsilon \right\}$

has ${\mu(X_\varepsilon)=0}$ for every ${\varepsilon>0}$. Indeed, the set of points where (7) fails is the (countable) union of the sets ${X_{1/k}}$ for the functions ${\pm (\varphi - \int\varphi\,d\mu)}$, and thus has ${\mu}$-measure zero if this result holds.

Note that ${X_\varepsilon}$ is ${f}$-invariant, and so by ergodicity we either have ${\mu(X_\varepsilon)=0}$ or ${\mu(X_\varepsilon)=1}$. We assume that ${\mu(X_\varepsilon)=1}$ and derive a contradiction by showing that this implies ${\int\varphi\,d\mu > 0}$.

The assumption on ${\mu(X_\varepsilon)}$ implies that ${\varlimsup_{n\rightarrow\infty} S_n(\varphi - \varepsilon)(x) = \infty}$ for ${\mu}$-a.e. ${x}$. The key step now is to use this fact to show that

$\displaystyle \int\varphi\,d\mu\geq\varepsilon; \ \ \ \ \ (8)$

this is the content of the maximal ergodic theorem.

Proving the maximal ergodic theorem requires a small trick. Let ${\psi = \varphi-\varepsilon}$ and let ${\psi_n(x) = \max \{ S_k\psi(x) \mid 1\leq k\leq n\}}$. Because ${S_{k+1}\psi = \psi + (S_k\psi)\circ f}$, we have

$\displaystyle \psi_{n+1} = \psi + \max\{0,\, \psi_n\circ f\}, \ \ \ \ \ (9)$

and because ${\psi_n(x)\rightarrow\infty}$ for ${\mu}$-a.e. ${x}$, this implies that ${\psi_{n+1} - \psi_n\circ f}$ converges ${\mu}$-a.e. to ${\psi}$. Now we want to argue that

$\displaystyle \int\psi\,d\mu = \lim_{n\rightarrow\infty} \int (\psi_{n+1} - \psi_n\circ f)\,d\mu, \ \ \ \ \ (10)$

because the integral on the right is equal to ${\int(\psi_{n+1} - \psi_n)\,d\mu}$ by ${f}$-invariance of ${\mu}$, and this integral in turn is non-negative because ${\psi_n}$ is non-decreasing. So if (10) holds, then we have ${\int \psi\,d\mu\geq 0}$, which implies (8).

Pointwise convergence does not always yield convergence of integrals, so to verify (10) we need the Lebesgue dominated convergence theorem. Using (9) we have

$\displaystyle \begin{aligned} \psi_{n+1} - \psi_n \circ f &= \psi + \max\{0,\, -\psi_n\circ f\} \\ &\leq \psi + \max\{0,\, -\psi\circ f\}, \end{aligned}$

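which is integrable. Since (9) also gives ${\psi_{n+1} - \psi_n\circ f \geq \psi}$, the functions ${\psi_{n+1} - \psi_n\circ f}$ are bounded above and below by integrable functions, so dominated convergence applies and (10) holds. The argument just given then shows that ${\int \varphi\,d\mu\geq\varepsilon}$, contradicting the assumption on ${\varphi}$. This proves that ${\mu(X_\varepsilon)=0}$, which as described above is enough to prove that (7) holds ${\mu}$-a.e.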
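The recursion (9) and the two-sided bound used for dominated convergence are purely algebraic, so they can be sanity-checked on a toy system with exact integer arithmetic. In the sketch below everything concrete is an assumption of ours: the phase space is ${{\mathbb Z}/101}$, the map is a rotation, ${\psi}$ is an arbitrary integer-valued function, and ${\psi_n}$ is taken as the maximum of ${S_k\psi}$ over ${1\leq k\leq n}$, the range for which (9) holds as an identity.

```python
M = 101                             # size of the toy phase space Z/M

def f(x):
    """A rotation on Z/M, standing in for the measure-preserving map."""
    return (x + 37) % M

def psi(x):
    """An arbitrary integer-valued observable taking both signs."""
    return (x * x) % 7 - 3

def birkhoff_sum(x, k):
    """S_k psi(x) = psi(x) + psi(f x) + ... + psi(f^{k-1} x)."""
    total = 0
    for _ in range(k):
        total += psi(x)
        x = f(x)
    return total

def psi_max(x, n):
    """psi_n(x) = max of S_k psi(x) over 1 <= k <= n."""
    return max(birkhoff_sum(x, k) for k in range(1, n + 1))

def check(n_max=30):
    """Verify (9) and the bounds psi <= psi_{n+1} - psi_n o f <= psi + max(0, -psi o f)."""
    for n in range(1, n_max + 1):
        for x in range(M):
            lhs = psi_max(x, n + 1)
            if lhs != psi(x) + max(0, psi_max(f(x), n)):
                return False
            diff = lhs - psi_max(f(x), n)
            if not (psi(x) <= diff <= psi(x) + max(0, -psi(f(x)))):
                return False
    return True

print(check())
```

A check like this says nothing about the limiting argument, of course; it only confirms that the identities feeding into it are stated with the right index range.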
