Improving Compression by Prefixing

Consider a Lempel-Ziv compression scheme in which we process a given text left to right, each time finding the longest prefix of the unprocessed input that is completely contained in the processed input and writing down where the previous occurrence was.

Formally, suppose a string w[1..n] is given and we processed w[1..i-1] and we are about to process ith position. Find the longest prefix of w[i..n] that is a substring of w[1..i-1]. If we the longest such prefix is of positive length, output the length l and position of the previous occurrence and set i = i + l. If the longest prefix is of size zero output w[i] and set i = i + 1. Denote by LZ(w) the number of codewords produced by the above algorithm on input w.

Let x and w be two strings. An easy observation shows that LZ(w) \leq LZ(wx), i.e., the compressed size of a file never decreases when we add more stuff to the end. My question is whether adding more stuff at the beginning of a file decreases its compressed size; i.e., whether LZ(w) \leq LZ(xw) holds or not.

Well, consider the sequence of strings

g_0 = 0

g_n = g_{n-1} g_{n-1} n for n>0

(where the operation between literals is string concatenation.) For instance g_2 = 0010012 . We see that LZ(g_n) = 2n + 1. Create a sequence of strings g'_n where g'_i is obtained by removing the first symbol in g_i. It is not hard to show that LZ(g'_n) = 3n - 1. In particular we have LZ(g'_n) > LZ(0 g'_n) for any n \geq 3. So we can reduce the compressed size of a file by adding more stuff at the beginning. Now here is my question

Question: Find the asymptotically smallest function f such that there exists a constant c for which LZ(w) \leq LZ(xw)\cdot f(|w|) \cdot c holds for all strings x, w.

I do not know what the optimal f is, but in the rest of this post I will prove that f \in O(\log n\log^* n). The proof has two parts. First we show that adding a prefix longer than |w|^2 won’t help reduce the compressed size. In the second part, we show that for all prefixes of a length bounded by a polynomial in |w|, taking f as stated above suffices.

Lemma 1: For any given w and x with LZ(x) \leq LZ(w), there exists an x' such that LZ(x'w) \leq LZ(xw) and |x'| \leq |w|^2.

This lemma simply says that if there were counter examples to LZ(w) \leq LZ(xw)\cdot f(|w|) \cdot c , then there are counter examples with |x| \leq |w|^2.

In the Lempel-Ziv algorithm that we are using now, there are two types of codewords output. The insert codewords that are output when the next symbol is seen for the first time, so that we have to insert that symbol and copy codewords that are output when the next few chars occurred before and we refer to them.

Proof:Suppose string x, w and Lempel-Ziv codewords c_1, c_2, \ldots, c_t for xw are given. Consider the decompression of xw using the codewords and assume that we processed the first codeword c_i after which x is completely constructed. If some symbols of w are generated by c_i paint these symbols to gray. Now continue decompression by processing c_{i+1}, \ldots. If a codeword copies some symbols from x, paint to gray all those symbols in x that are copied. Let |w|=n. Now the number of gray symbols is at most n.

Consider the codewords c_1, c_2, \ldots, c_i, from which x is produced. Now as long as there is a gray symbol we do the following: Find a gray symbol s. If the symbol is produced by an insert codeword, paint s to black. Else, if the symbol is produced by a copy codeword, paint to gray the symbol from which s is copied. Further paint s to black. Since there were at most n gray points at the beginning and LZ(x)\leq LZ(w), when the process terminates there will be at most n^2 black symbols.

Create a string x' by removing from x all symbols that are not painted to black. Now we claim that LZ(x') \leq LZ(x). To show this we go through the codewords c_1, c_2, \ldots c_i left to right and create at most i codewords d_1, d_2, \ldots, d_j that decompress into x'. Suppose we processed c_1, \ldots, c_{i-1} and now will process c_i and produced codewords d_1, d_2, \ldots d_{j-1} so far. We maintain two invariants

  1. All black symbols in the prefix of x that is produced by codewords c_1, \ldots, c_{i-1} are also produced by the codewords d_1, \ldots d_{j-1}
  2. The black points produced by the codewords d_1, \ldots d_{j-1} are all consecutive and are in the same order as they are in x.

If there is no black point in c_i, do nothing. Else, if c_i is an insert codeword, we output a similar insert codeword d_j. In the last case, where c_i is a copy codeword that contains black points, we know by invariants 1 and 2 that the black points in c_i occur consecutively in the text produced by d_1, \ldots d_j. We output a codeword that copies them, right after the last symbol produced by d_1, \ldots d_j. It is easily observed that invariants 1 and 2 are maintained. QED

The second part of the proof will make use of some “high technology”. We will use a characterization of the LZ compression which is derived by a technique called locally consistent parsing. This characterization and lcp technique is due to Cenk Sahinalp and Uzi Vishkin. We will use the result as black box however a good explanation can be found in papers Substring Compression Problems and The string edit distance matching problem with moves by Muthu and Cormode.

It turns out that for any given string we can construct a parse tree that

  • changes very little when we do some edit operations on text
  • the number of unique substrings that the nodes of the tree span approximates the LZ compressibility well.

Let UL(x) be the number of unique substrings that the nodes of the tree span. We have LZ(x) \leq UL(x) \leq LZ(x) \cdot O(\log |x| \log^*|x|). Now we state second part of the proof.

Lemma 2: For any x satisfying |x| < p(|w|) for a polynomial p, it holds that LZ(w) \leq LZ(xw)\cdot O(\log |w|\log^*|w|).

Proof: We have LZ(w) \leq \mathit{UL}(w) \leq \mathit{UL(xw)} + O(\log |xw| \log^*|xw|). Also \mathit{UL(xw)} \leq  LZ(xw) \cdot O(\log |xw| \log^*|xw|). Since |x| \leq p(|w|), we have LZ(w) \leq LZ(xw) \cdot O(\log |w| \log^*|w|). QED

So in this post, we establish

Proposition 1: For all string x and w, it holds that LZ(w) \leq LZ(xw)\cdot O(\log|w|\log^*|w|)

The sequence g'_n establishes a constant gap, i.e., LZ(0g'_n) / LZ(g'_n) \geq 3/2 - \epsilon. I would very much like to find a sequence that produces super-constant gap or prove that f = O(1) is sufficient.

Advertisement

Leave a Comment

Filed under Uncategorized

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s