Consider a Lempel-Ziv compression scheme in which we process a given text left to right, each time finding the longest prefix of the unprocessed input that is completely contained in the processed input and writing down where the previous occurrence was.
Formally, suppose a string is given and we processed
and we are about to process
th position. Find the longest prefix of
that is a substring of
. If we the longest such prefix is of positive length, output the length
and position of the previous occurrence and set
. If the longest prefix is of size zero output
and set
. Denote by
the number of codewords produced by the above algorithm on input
.
Let and
be two strings. An easy observation shows that
, i.e., the compressed size of a file never decreases when we add more stuff to the end. My question is whether adding more stuff at the beginning of a file decreases its compressed size; i.e., whether
holds or not.
Well, consider the sequence of strings
for
![]()
(where the operation between literals is string concatenation.) For instance . We see that
. Create a sequence of strings
where
is obtained by removing the first symbol in
. It is not hard to show that
. In particular we have
for any
. So we can reduce the compressed size of a file by adding more stuff at the beginning. Now here is my question
Question: Find the asymptotically smallest function
such that there exists a constant
for which
holds for all strings
.
I do not know what the optimal is, but in the rest of this post I will prove that
. The proof has two parts. First we show that adding a prefix longer than
won’t help reduce the compressed size. In the second part, we show that for all prefixes of a length bounded by a polynomial in
, taking
as stated above suffices.
Lemma 1: For any given
and
with
, there exists an
such that
and
.
This lemma simply says that if there were counter examples to , then there are counter examples with
.
In the Lempel-Ziv algorithm that we are using now, there are two types of codewords output. The insert codewords that are output when the next symbol is seen for the first time, so that we have to insert that symbol and copy codewords that are output when the next few chars occurred before and we refer to them.
Proof:Suppose string ,
and Lempel-Ziv codewords
for
are given. Consider the decompression of
using the codewords and assume that we processed the first codeword
after which
is completely constructed. If some symbols of
are generated by
paint these symbols to gray. Now continue decompression by processing
. If a codeword copies some symbols from
, paint to gray all those symbols in
that are copied. Let
. Now the number of gray symbols is at most
.
Consider the codewords , from which
is produced. Now as long as there is a gray symbol we do the following: Find a gray symbol
. If the symbol is produced by an insert codeword, paint
to black. Else, if the symbol is produced by a copy codeword, paint to gray the symbol from which
is copied. Further paint
to black. Since there were at most
gray points at the beginning and
, when the process terminates there will be at most
black symbols.
Create a string by removing from
all symbols that are not painted to black. Now we claim that
. To show this we go through the codewords
left to right and create at most
codewords
that decompress into
. Suppose we processed
and now will process
and produced codewords
so far. We maintain two invariants
- All black symbols in the prefix of
that is produced by codewords
are also produced by the codewords
- The black points produced by the codewords
are all consecutive and are in the same order as they are in
.
If there is no black point in , do nothing. Else, if
is an insert codeword, we output a similar insert codeword
. In the last case, where
is a copy codeword that contains black points, we know by invariants 1 and 2 that the black points in
occur consecutively in the text produced by
. We output a codeword that copies them, right after the last symbol produced by
. It is easily observed that invariants 1 and 2 are maintained. QED
The second part of the proof will make use of some “high technology”. We will use a characterization of the LZ compression which is derived by a technique called locally consistent parsing. This characterization and lcp technique is due to Cenk Sahinalp and Uzi Vishkin. We will use the result as black box however a good explanation can be found in papers Substring Compression Problems and The string edit distance matching problem with moves by Muthu and Cormode.
It turns out that for any given string we can construct a parse tree that
- changes very little when we do some edit operations on text
- the number of unique substrings that the nodes of the tree span approximates the LZ compressibility well.
Let be the number of unique substrings that the nodes of the tree span. We have
. Now we state second part of the proof.
Lemma 2: For any
satisfying
for a polynomial
, it holds that
.
Proof: We have . Also
. Since
, we have
. QED
So in this post, we establish
Proposition 1: For all string
and
, it holds that
The sequence establishes a constant gap, i.e.,
. I would very much like to find a sequence that produces super-constant gap or prove that
is sufficient.