De Bruijn graphs and entropy at finite scales

Let {\mathcal{A}} be a finite set, which we call an alphabet, and let {x \in \mathcal{A}^{\mathbb N}} be an infinite sequence of letters from {A}. It is natural to ask how complex the sequence {x} is: for example, if the alphabet is {\{H,T\}}, then we expect a typical sequence produced by flipping a coin to be in some sense more complex than the sequence {x=HHHHH\dots}.

One important way of making this notion precise is the entropy of the shift space generated by {x}, a notion coming from symbolic dynamics. Let {a_k(x)} be the number of words of length {k} (that is, elements of {A^k}) that appear as subwords of {x}. Clearly we have {1\leq a_k(x) \leq (\#\mathcal{A})^k}. Roughly speaking, the entropy of {x} is the exponential growth rate of {a_k(x)}. More precisely, we write

\displaystyle  h(x) := \varlimsup_{k\rightarrow\infty} \tfrac 1k \log a_k(x).

Of course, in practice it is often the case that one does not have an infinite sequence, but merely a very long one. For example, it has been suggested that entropy (and a related quantity, the topological pressure) can play a role in the analysis of DNA sequences; see [D. Koslicki, “Topological entropy of DNA sequences”, Bioinformatics 27(8), 2011, p. 1061–1067] and [D. Koslicki and D.J. Thompson, “Coding sequence density estimation via topological pressure”, J. Math. Biol. 70, 2015, p. 45–69]. In this case we have {\mathcal{A} = \{A,C,G,T\}} and are dealing with sequences {x} whose length is large, but finite.

Given a sequence {x} with length {n}, one can try to get some reasonable `entropy-like’ quantity by fixing some {k} and putting {h(x) = \frac 1k\log a_k(x)}. But what should we take {k} to be? If we take {k} to be too small we will get an overestimate (with {k=1} we will probably just find out that {x} contains every letter of {\mathcal{A}}), but if we take {k} too small we get an underestimate (with {k=n} we have {a_k(x)=1} so {h=0}).

The convention proposed by Koslicki in the first paper above is to let {k} be the largest number such that there is some word of length {n} that contains every word of length {k}. If this actually happens, then {h(x)} achieves its maximum value {\log \#\mathcal{A}}; if some words do not appear, then {h(x)<\log \#\mathcal{A}}.

What is the relationship between {n} and {k} that guarantees existence of a word of length {n} containing every word of length {k}? Let {p=\#\mathcal{A}} and note that there are {p^k} words of length {k}; if {w\in \mathcal{A}^n} contains every such word then we must have {n\geq p^k + k-1}, since the length-{k} subwords of {w} are precisely {(w_1\cdots w_k)}, {(w_2\cdots w_k)}, \dots, {(w_{n-k+1} \cdots w_n)}, so we must have {n-k+1 \geq p^k}.

The converse implication is a little harder, though. Given {k}, let {n=p^k + k-1}. Is it necessarily true that there is a word {w\in \mathcal{A}^n} that contains every subword of length {k}? After all, once {w_1\cdots w_k} is determined, there are not many possibilities for the word {w_2\cdots w_{k+1}}; can we navigate these restrictions successfully?

It is useful to rephrase the problem in the language of graph theory (what follows can be found in the proof of Lemma 1 in Koslicki’s paper). Let {G_k} be the directed graph defined as follows:

  • the vertex set is {\mathcal{A}^k}, so each vertex corresponds to a word of length {k};
  • there is an edge from {u\in \mathcal{A}^k} to {v\in \mathcal{A}^k} if and only if {u_1 v = uv_k}, that is, if {u_2\cdots u_k = v_1\cdots v_{k-1}}.

The graph {G_k} is the {k}-dimensional De Bruijn graph of {p} symbols. Recall that a Hamiltonian path in a graph is a path that visits each vertex exactly once. Thus the question above, regarding existence of a word in {\mathcal{A}^n} that contains every word in {\mathcal{A}^k}, where {n=p^k + k-1}, is equivalent to asking for the existence of a Hamiltonian path in the De Bruijn graph.

There is a correspondence between {G_k} and {G_{k-1}}; vertices in {G_k} correspond to edges in {G_{k-1}} (since both are labeled by elements of {\mathcal{A}^k}). Thus a Hamiltonian path in {G_k} corresponds to an Eulerian path in {G_{k-1}}; that is, a path that visits every edge exactly once.

This correspondence is very useful, since in general the problem of determining whether a Hamiltonian path exists is hard (NP-complete), while it is easy to check existence of an Eulerian path in a directed graph: a sufficient condition is that every vertex have in-degree equal to its out-degree (and a slight weakening of this condition is both necessary and sufficient). This is the case for De Bruijn graphs, where every vertex has {p=\#\mathcal{A}} edges coming in and going out. Thus {G_{k-1}} has an Eulerian path, which corresponds to a Hamiltonian path for {G_k}. This answers the original question, demonstrating that for every {k}, there is a word {w} of length {n=p^k+k-1} such that {w} contains every word of length {k} as a subword.


About Vaughn Climenhaga

I'm an assistant professor of mathematics at the University of Houston. I'm interested in dynamical systems, ergodic theory, thermodynamic formalism, dimension theory, multifractal analysis, non-uniform hyperbolicity, and things along those lines.
This entry was posted in Uncategorized. Bookmark the permalink.

One Response to De Bruijn graphs and entropy at finite scales

  1. Jairo B. says:

    Nice post. A related (equivalent?) notion of “recursive complexity” of a finite cyclic word appears in a 1992 paper by U.M. Maurer:
    He looks for the least k such that each symbol of the word can be computed from the previous k symbols; equivalently, the least k such that the word corresponds to a simple cycle on the graph G_k.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s