## De Bruijn graphs and entropy at finite scales

Let ${\mathcal{A}}$ be a finite set, which we call an alphabet, and let ${x \in \mathcal{A}^{\mathbb N}}$ be an infinite sequence of letters from ${A}$. It is natural to ask how complex the sequence ${x}$ is: for example, if the alphabet is ${\{H,T\}}$, then we expect a typical sequence produced by flipping a coin to be in some sense more complex than the sequence ${x=HHHHH\dots}$.

One important way of making this notion precise is the entropy of the shift space generated by ${x}$, a notion coming from symbolic dynamics. Let ${a_k(x)}$ be the number of words of length ${k}$ (that is, elements of ${A^k}$) that appear as subwords of ${x}$. Clearly we have ${1\leq a_k(x) \leq (\#\mathcal{A})^k}$. Roughly speaking, the entropy of ${x}$ is the exponential growth rate of ${a_k(x)}$. More precisely, we write

$\displaystyle h(x) := \varlimsup_{k\rightarrow\infty} \tfrac 1k \log a_k(x).$

Of course, in practice it is often the case that one does not have an infinite sequence, but merely a very long one. For example, it has been suggested that entropy (and a related quantity, the topological pressure) can play a role in the analysis of DNA sequences; see [D. Koslicki, “Topological entropy of DNA sequences”, Bioinformatics 27(8), 2011, p. 1061–1067] and [D. Koslicki and D.J. Thompson, “Coding sequence density estimation via topological pressure”, J. Math. Biol. 70, 2015, p. 45–69]. In this case we have ${\mathcal{A} = \{A,C,G,T\}}$ and are dealing with sequences ${x}$ whose length is large, but finite.

Given a sequence ${x}$ with length ${n}$, one can try to get some reasonable `entropy-like’ quantity by fixing some ${k}$ and putting ${h(x) = \frac 1k\log a_k(x)}$. But what should we take ${k}$ to be? If we take ${k}$ to be too small we will get an overestimate (with ${k=1}$ we will probably just find out that ${x}$ contains every letter of ${\mathcal{A}}$), but if we take ${k}$ too small we get an underestimate (with ${k=n}$ we have ${a_k(x)=1}$ so ${h=0}$).

The convention proposed by Koslicki in the first paper above is to let ${k}$ be the largest number such that there is some word of length ${n}$ that contains every word of length ${k}$. If this actually happens, then ${h(x)}$ achieves its maximum value ${\log \#\mathcal{A}}$; if some words do not appear, then ${h(x)<\log \#\mathcal{A}}$.

What is the relationship between ${n}$ and ${k}$ that guarantees existence of a word of length ${n}$ containing every word of length ${k}$? Let ${p=\#\mathcal{A}}$ and note that there are ${p^k}$ words of length ${k}$; if ${w\in \mathcal{A}^n}$ contains every such word then we must have ${n\geq p^k + k-1}$, since the length-${k}$ subwords of ${w}$ are precisely ${(w_1\cdots w_k)}$, ${(w_2\cdots w_k)}$, \dots, ${(w_{n-k+1} \cdots w_n)}$, so we must have ${n-k+1 \geq p^k}$.

The converse implication is a little harder, though. Given ${k}$, let ${n=p^k + k-1}$. Is it necessarily true that there is a word ${w\in \mathcal{A}^n}$ that contains every subword of length ${k}$? After all, once ${w_1\cdots w_k}$ is determined, there are not many possibilities for the word ${w_2\cdots w_{k+1}}$; can we navigate these restrictions successfully?

It is useful to rephrase the problem in the language of graph theory (what follows can be found in the proof of Lemma 1 in Koslicki’s paper). Let ${G_k}$ be the directed graph defined as follows:

• the vertex set is ${\mathcal{A}^k}$, so each vertex corresponds to a word of length ${k}$;
• there is an edge from ${u\in \mathcal{A}^k}$ to ${v\in \mathcal{A}^k}$ if and only if ${u_1 v = uv_k}$, that is, if ${u_2\cdots u_k = v_1\cdots v_{k-1}}$.

The graph ${G_k}$ is the ${k}$-dimensional De Bruijn graph of ${p}$ symbols. Recall that a Hamiltonian path in a graph is a path that visits each vertex exactly once. Thus the question above, regarding existence of a word in ${\mathcal{A}^n}$ that contains every word in ${\mathcal{A}^k}$, where ${n=p^k + k-1}$, is equivalent to asking for the existence of a Hamiltonian path in the De Bruijn graph.

There is a correspondence between ${G_k}$ and ${G_{k-1}}$; vertices in ${G_k}$ correspond to edges in ${G_{k-1}}$ (since both are labeled by elements of ${\mathcal{A}^k}$). Thus a Hamiltonian path in ${G_k}$ corresponds to an Eulerian path in ${G_{k-1}}$; that is, a path that visits every edge exactly once.

This correspondence is very useful, since in general the problem of determining whether a Hamiltonian path exists is hard (NP-complete), while it is easy to check existence of an Eulerian path in a directed graph: a sufficient condition is that every vertex have in-degree equal to its out-degree (and a slight weakening of this condition is both necessary and sufficient). This is the case for De Bruijn graphs, where every vertex has ${p=\#\mathcal{A}}$ edges coming in and going out. Thus ${G_{k-1}}$ has an Eulerian path, which corresponds to a Hamiltonian path for ${G_k}$. This answers the original question, demonstrating that for every ${k}$, there is a word ${w}$ of length ${n=p^k+k-1}$ such that ${w}$ contains every word of length ${k}$ as a subword.