Let $A$ be a finite set, which we call an alphabet, and let $x = x_1 x_2 x_3 \cdots$ be an infinite sequence of letters from $A$. It is natural to ask how complex the sequence $x$ is: for example, if the alphabet is $\{0,1\}$, then we expect a typical sequence produced by flipping a coin to be in some sense more complex than the sequence $010101010101\cdots$.
One important way of making this notion precise is the entropy of the shift space generated by $x$, a notion coming from symbolic dynamics. Let $\phi(n)$ be the number of words of length $n$ (that is, elements of $A^n$) that appear as subwords of $x$. Clearly we have $\phi(n) \le (\#A)^n$. Roughly speaking, the entropy of $x$ is the exponential growth rate of $\phi(n)$. More precisely, we write $h(x) = \lim_{n\to\infty} \frac{1}{n} \log \phi(n)$.
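For a finite prefix of a sequence, the count $\phi(n)$ and the ratio $\frac{1}{n}\log\phi(n)$ are easy to compute directly. Here is a minimal sketch (the names `phi` and `entropy_estimate` are ours, just for illustration):

```python
import math

def phi(x, n):
    """Number of distinct words of length n appearing as subwords of x."""
    return len({x[i:i + n] for i in range(len(x) - n + 1)})

def entropy_estimate(x, n):
    """The ratio (1/n) log phi(n) for a finite word x."""
    return math.log(phi(x, n)) / n

# The periodic sequence 010101... has only two subwords of each length n >= 1,
# so its entropy estimate tends to zero as n grows.
periodic = "01" * 50
print(phi(periodic, 3))   # 2  (only "010" and "101" occur)
```

A coin-flip sequence of the same length would typically realize many more subwords, giving a larger estimate.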
Of course, in practice it is often the case that one does not have an infinite sequence, but merely a very long one. For example, it has been suggested that entropy (and a related quantity, the topological pressure) can play a role in the analysis of DNA sequences; see [D. Koslicki, “Topological entropy of DNA sequences”, Bioinformatics 27(8), 2011, p. 1061–1067] and [D. Koslicki and D.J. Thompson, “Coding sequence density estimation via topological pressure”, J. Math. Biol. 70, 2015, p. 45–69]. In this case we have $A = \{A, C, G, T\}$ and are dealing with sequences whose length is large, but finite.
Given a sequence $x$ with length $\ell$, one can try to get some reasonable `entropy-like’ quantity by fixing some $n$ and putting $h = \frac{1}{n}\log\phi(n)$. But what should we take $n$ to be? If we take $n$ too small we will get an overestimate (with $n = 1$ we will probably just find out that $x$ contains every letter of $A$, so that $h = \log \#A$), but if we take $n$ too large we get an underestimate (with $n = \ell$ we have $\phi(n) = 1$, so $h = 0$).
The convention proposed by Koslicki in the first paper above is to let $n$ be the largest number such that there is some word of length $\ell$ that contains every word of length $n$. If this actually happens for $x$ itself, then $\phi(n)$ achieves its maximum value $(\#A)^n$; if some words do not appear, then $\phi(n) < (\#A)^n$.
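Writing $d = \#A$, this convention can be sketched as follows (assuming the relation $\ell \ge d^n + n - 1$ worked out below; the normalization of the logarithm by $\log d$, so that the maximum value is $1$, matches Koslicki's use of $\log_4$ for DNA, and the function names are ours):

```python
import math

def koslicki_n(ell, d):
    """Largest n such that a word of length ell can contain every word of
    length n over a d-letter alphabet, i.e. largest n with d**n + n - 1 <= ell.
    (Assumes ell >= d, so that n = 1 is admissible.)"""
    n = 1
    while d ** (n + 1) + n <= ell:   # test whether n + 1 still works
        n += 1
    return n

def finite_entropy(x, d):
    """Entropy-like quantity (1/n) log_d phi(n) for a finite word x."""
    n = koslicki_n(len(x), d)
    words = {x[i:i + n] for i in range(len(x) - n + 1)}
    return math.log(len(words), d) / n

# "00110" has length 2**2 + 2 - 1 = 5 and contains all four words of
# length 2, so the estimate is maximal (equal to 1, up to rounding).
print(finite_entropy("00110", 2))
```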
What is the relationship between $\ell$ and $n$ that guarantees existence of a word of length $\ell$ containing every word of length $n$? Let $d = \#A$ and note that there are $d^n$ words of length $n$; if $x = x_1 \cdots x_\ell$ contains every such word then we must have $\ell \ge d^n + n - 1$, since the length-$n$ subwords of $x$ are precisely $x_1 \cdots x_n$, $x_2 \cdots x_{n+1}$, \dots, $x_{\ell - n + 1} \cdots x_\ell$, of which there are $\ell - n + 1$, so we must have $\ell - n + 1 \ge d^n$.
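The window count in this argument is easy to check numerically; a tiny sketch with a DNA-style alphabet (the word chosen here is arbitrary):

```python
# A word of length ell has exactly ell - n + 1 subwords of length n,
# counted with multiplicity.
x = "ACGTACG"          # ell = 7
n = 3
windows = [x[i:i + n] for i in range(len(x) - n + 1)]
print(windows)         # ['ACG', 'CGT', 'GTA', 'TAC', 'ACG']
print(len(windows))    # ell - n + 1 = 5
```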
The converse implication is a little harder, though. Given $n$, let $\ell = d^n + n - 1$. Is it necessarily true that there is a word $x \in A^\ell$ that contains every subword of length $n$? After all, such a word has exactly $d^n$ subwords of length $n$ counted with multiplicity, so each word of length $n$ must appear exactly once; once an initial segment of $x$ is determined, there are not many possibilities for the rest of the word; can we navigate these restrictions successfully?
It is useful to rephrase the problem in the language of graph theory (what follows can be found in the proof of Lemma 1 in Koslicki’s paper). Let $B(d,n)$ be the directed graph defined as follows:
- the vertex set is $A^n$, so each vertex corresponds to a word of length $n$;
- there is an edge from $v = v_1 \cdots v_n$ to $w = w_1 \cdots w_n$ if and only if $v_2 \cdots v_n = w_1 \cdots w_{n-1}$, that is, if $v_{i+1} = w_i$ for each $1 \le i \le n-1$.
The graph $B(d,n)$ is the $n$-dimensional De Bruijn graph of $d$ symbols. Recall that a Hamiltonian path in a graph is a path that visits each vertex exactly once. Thus the question above, regarding existence of a word in $A^\ell$ that contains every word in $A^n$, where $\ell = d^n + n - 1$, is equivalent to asking for the existence of a Hamiltonian path in the De Bruijn graph.
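The De Bruijn graph is straightforward to build explicitly; a small sketch (the function name is ours):

```python
from itertools import product

def de_bruijn_graph(alphabet, n):
    """Successor map of the n-dimensional De Bruijn graph on the given
    alphabet: there is an edge w -> v iff v = w[1:] + a for some letter a."""
    vertices = ["".join(p) for p in product(alphabet, repeat=n)]
    return {w: [w[1:] + a for a in alphabet] for w in vertices}

g = de_bruijn_graph("01", 2)
print(g["01"])   # ['10', '11']
```

Note that every vertex has out-degree $d$ by construction, and in-degree $d$ as well (the predecessors of $v_1 \cdots v_n$ are the $d$ words $a\, v_1 \cdots v_{n-1}$).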
There is a correspondence between $B(d,n)$ and $B(d,n-1)$; vertices in $B(d,n)$ correspond to edges in $B(d,n-1)$, since both are labeled by elements of $A^n$ (the edge from $w_1 \cdots w_{n-1}$ to $w_2 \cdots w_n$ is labeled by the word $w_1 \cdots w_n$). Thus a Hamiltonian path in $B(d,n)$ corresponds to an Eulerian path in $B(d,n-1)$; that is, a path that visits every edge exactly once.
This correspondence is very useful, since in general the problem of determining whether a Hamiltonian path exists is hard (NP-complete), while it is easy to check existence of an Eulerian path in a directed graph: a sufficient condition is that the graph be connected and every vertex have in-degree equal to its out-degree (and a slight weakening of this condition is both necessary and sufficient). This is the case for De Bruijn graphs, which are connected and in which every vertex has $d$ edges coming in and $d$ edges going out. Thus $B(d,n-1)$ has an Eulerian path, which corresponds to a Hamiltonian path for $B(d,n)$. This answers the original question, demonstrating that for every $n$, there is a word $x$ of length $\ell = d^n + n - 1$ such that $x$ contains every word of length $n$ as a subword.
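The Eulerian path can also be found constructively, which produces such a word explicitly. The following sketch runs Hierholzer's algorithm on $B(d, n-1)$, where each out-edge of a vertex is recorded simply as the appended letter; this is the standard De Bruijn sequence construction, not code from Koslicki's paper, and the function name is ours:

```python
from itertools import product

def de_bruijn_word(alphabet, n):
    """A word of length d**n + n - 1 (d = len(alphabet), n >= 2) containing
    every word of length n exactly once, via an Eulerian circuit in the
    De Bruijn graph on words of length n - 1 (Hierholzer's algorithm)."""
    # Unused out-edges of each vertex; appending a letter traverses an edge.
    unused = {"".join(p): list(alphabet)
              for p in product(alphabet, repeat=n - 1)}
    start = alphabet[0] * (n - 1)
    stack, trail = [start], []
    while stack:
        v = stack[-1]
        if unused[v]:
            stack.append(v[1:] + unused[v].pop())   # follow an unused edge
        else:
            trail.append(stack.pop())               # vertex is exhausted
    trail.reverse()   # vertices of the Eulerian circuit, in order
    # Read off the word: the start vertex, then the last letter of each step.
    return trail[0] + "".join(v[-1] for v in trail[1:])

w = de_bruijn_word("01", 3)
print(len(w))                                        # 2**3 + 3 - 1 = 10
print(sorted({w[i:i + 3] for i in range(len(w) - 2)}))   # all 8 words
```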