This week’s post continues last week’s discussion of Markov chains and mixing times, and introduces the idea of coupling as a method for estimating mixing times. We remark that some nice notes on the subject of coupling (and others) can be found on Steve Lalley’s web page — of particular relevance for our purposes here are the notes on “Convergence rates of Markov chains”. A more thorough and complete reference is the book Markov Chains and Mixing Times by D. Levin, Y. Peres, and E. Wilmer.
1. Markov chains as stochastic processes
In the previous post, we characterised a Markov chain as a pair , where is a finite or countable state space and is a matrix whose entries represent the transition probabilities between the various states in . This then allowed us to interpret the Markov chain as a (deterministic) map , where is the simplex of probability measures on , and the map is given by right multiplication by the matrix .
Another way to describe a Markov chain is using the language of stochastic processes. A stochastic process is a sequence of random variables , where is a probability space. Such a process is said to be a Markov chain with state space and transition matrix if for any , we have
Note that there is a potential source of confusion here in the terminology — to each pair we can associate many distinct stochastic processes satisfying (1). Indeed, there is one such process (up to isomorphism) for every initial probability distribution (the probabilities on ). Thus whenever the term “Markov chain” is used, here or in the literature, one should be careful to determine whether or not the object defined refers to a single stochastic process with specified initial distribution, or to a collection of stochastic processes all evolving according to the same transition probabilities.
In the language of the previous post, this distinction takes the following form: the pair determines a map , and the individual stochastic processes just described correspond to fixing an initial distribution and considering the single trajectory of this map given by . Thus “Markov chain” may refer either to a single orbit of this map, or to the space of all orbits of the map.
We will not attempt to introduce new terminology to resolve this ambiguity here. Rather, we shall let the context indicate which interpretation is meant — “the Markov chain ” will refer to a single stochastic process (including a fixed initial distribution), while “the Markov chain ” will refer to the set of all stochastic processes distributed according to (1). We will sometimes say that “ is a Markov chain over ” if it is a stochastic process satisfying (1).
Although it is customary to refer to a random process simply by the random variable , we remark that a key ingredient in the process is the probability distribution on the measurable space . In our setting we can always take to be the space of infinite sequences of states in , and then is simply the map that picks out the th coordinate of the sequence. We will adopt this convention throughout and will write for the probability distribution on that is implicit in every mention of the random variable , and we will sometimes refer to as “a Markov chain over ”.
Consider a Markov chain . We want to understand the mixing time of this Markov chain — that is, if is the stationary distribution and we write for the distribution associated to starting in state (with probability 1) and evolving the Markov chain for steps, then we want to understand the function
where we recall that is the total variation distance .
Roughly speaking, the idea behind coupling is to run two copies of the Markov chain simultaneously in such a way that each copy obeys the original transition probabilities, but the two copies nevertheless “communicate” in such a way that they eventually mirror each other. Then the mixing time can be estimated in terms of the probability that the two copies take at least time before this mirroring begins.
Remark 1 One can also consider couplings where is not required to be a Markov chain, but only a stochastic process with state space whose marginal distributions are Markov chains over . We will not use such couplings, though.
Recall that we use the convention that the random variables and are defined as the th coordinate projections on , and is defined similarly on . Thus all of the information about the Markov chains , , is really carried by the probability distributions , on and on .
The easiest way to produce a coupling is to let and be any Markov chains over , and then let be the product measure on . This corresponds to running two copies of the Markov chain simultaneously and completely independently, without any interaction. However, this is far from being the only coupling available, and indeed there are other couplings that are far more useful for our applications. An important part of the power of the coupling method is that we can choose any distribution which has and as its marginals — that is,
This allows us to introduce some dependence between and , and in particular to carry out the following general scheme:
- Define a coupling such that we eventually have with probability 1 — that is, as .
- Bound the total variation distance that appears in (2) in terms of , where is a Markov chain starting in state and starts in the stationary distribution .
- Use this bound to estimate the mixing time .
Before discussing how to bound total variation distance in terms of convergence times for couplings, we describe an example that illustrates how the two Markov chains and can be chosen to have some dependence.
Let be the Markov chain describing the top-down shuffle on a deck of cards, which we discussed last time. That is, is the set of all permutations of the cards, and the transition probabilities are given by if can be reached from by removing a single card from the deck and placing it on top, and otherwise. One can define a coupling as follows: the Markov chains and each progress from one step to the next by selecting a random card from the deck and placing it on top. Let progress from one step to the next by selecting a single card from the deck at random, and then moving that card to the top of the deck for both and . Then it is clear that both and evolve according to the transition probabilities , but the probability measure is not the direct product of and , because the chains do not evolve independently.
Note that it is the card, and not the position, which is the same between the two decks — in particular, after this process happens once, both decks and have the same top card. After it happens twice, they have the same top two cards — unless at the second step we happen to pick the same card we did at the first. In general, we may write for the number of cards at the top of the deck which we know to agree between and . Then and evolves according to the following rule: if a new card is picked at step (that has not been picked before), then , and otherwise .
In particular, we see that once every card in the deck has been chosen at least once, we have , so that . Later we will get an estimate on the probability that every card has been chosen by time , which will let us estimate . First we give an argument that this latter probability gives us a bound on the total variation distance.
3. Total variation distance and couplings
Ultimately we want to estimate , where and are the distributions at time that result from the initial distributions and . The key is the following lemma.
Proof: Given any , we have
In particular, if we take to be the initial distribution concentrated on a single state , and to be the stationary distribution , then Lemma 1 gives us the estimate
and so we have proved the following result.
4. Application to card shuffling
Now we can estimate the mixing time for the card shuffling example. Based on Proposition 2, we can estimate by first estimating , which as we saw above is the same as the probability that not every card has been selected by time . The problem of determining how long it takes for this to happen is known as the coupon collector’s problem.
If we run the Markov chain for steps, then the probability that a specific card has not yet been selected to be moved to the top of the deck is
Thus the probability that not every card has been selected is , and we get
Note that this is independent of the starting distributions for and . We want to choose such that this bound is , since then Proposition 2 will give . So we solve the inequality for , and obtain
which gives the rough bound
(Note that this bound depends on being reasonably large.) For example, when and , we get , indicating that 361 shuffles (about 7 times the size of the deck) is enough to guarantee that for any event we specify, the probability of that event when drawing from our shuffled deck is within of the probability when drawing from a truly random deck.