Improving Compression by Prefixing

Consider a Lempel-Ziv compression scheme in which we process a given text left to right, each time finding the longest prefix of the unprocessed input that is completely contained in the processed input and writing down where the previous occurrence was.

Formally, suppose a string w[1..n] is given and we processed w[1..i-1] and we are about to process ith position. Find the longest prefix of w[i..n] that is a substring of w[1..i-1]. If we the longest such prefix is of positive length, output the length l and position of the previous occurrence and set i = i + l. If the longest prefix is of size zero output w[i] and set i = i + 1. Denote by LZ(w) the number of codewords produced by the above algorithm on input w.

Let x and w be two strings. An easy observation shows that LZ(w) \leq LZ(wx), i.e., the compressed size of a file never decreases when we add more stuff to the end. My question is whether adding more stuff at the beginning of a file decreases its compressed size; i.e., whether LZ(w) \leq LZ(xw) holds or not.

Well, consider the sequence of strings

g_0 = 0

g_n = g_{n-1} g_{n-1} n for n>0

(where the operation between literals is string concatenation.) For instance g_2 = 0010012 . We see that LZ(g_n) = 2n + 1. Create a sequence of strings g'_n where g'_i is obtained by removing the first symbol in g_i. It is not hard to show that LZ(g'_n) = 3n - 1. In particular we have LZ(g'_n) > LZ(0 g'_n) for any n \geq 3. So we can reduce the compressed size of a file by adding more stuff at the beginning. Now here is my question

Question: Find the asymptotically smallest function f such that there exists a constant c for which LZ(w) \leq LZ(xw)\cdot f(|w|) \cdot c holds for all strings x, w.

I do not know what the optimal f is, but in the rest of this post I will prove that f \in O(\log n\log^* n). The proof has two parts. First we show that adding a prefix longer than |w|^2 won’t help reduce the compressed size. In the second part, we show that for all prefixes of a length bounded by a polynomial in |w|, taking f as stated above suffices.

Lemma 1: For any given w and x with LZ(x) \leq LZ(w), there exists an x' such that LZ(x'w) \leq LZ(xw) and |x'| \leq |w|^2.

This lemma simply says that if there were counter examples to LZ(w) \leq LZ(xw)\cdot f(|w|) \cdot c , then there are counter examples with |x| \leq |w|^2.

In the Lempel-Ziv algorithm that we are using now, there are two types of codewords output. The insert codewords that are output when the next symbol is seen for the first time, so that we have to insert that symbol and copy codewords that are output when the next few chars occurred before and we refer to them.

Proof:Suppose string x, w and Lempel-Ziv codewords c_1, c_2, \ldots, c_t for xw are given. Consider the decompression of xw using the codewords and assume that we processed the first codeword c_i after which x is completely constructed. If some symbols of w are generated by c_i paint these symbols to gray. Now continue decompression by processing c_{i+1}, \ldots. If a codeword copies some symbols from x, paint to gray all those symbols in x that are copied. Let |w|=n. Now the number of gray symbols is at most n.

Consider the codewords c_1, c_2, \ldots, c_i, from which x is produced. Now as long as there is a gray symbol we do the following: Find a gray symbol s. If the symbol is produced by an insert codeword, paint s to black. Else, if the symbol is produced by a copy codeword, paint to gray the symbol from which s is copied. Further paint s to black. Since there were at most n gray points at the beginning and LZ(x)\leq LZ(w), when the process terminates there will be at most n^2 black symbols.

Create a string x' by removing from x all symbols that are not painted to black. Now we claim that LZ(x') \leq LZ(x). To show this we go through the codewords c_1, c_2, \ldots c_i left to right and create at most i codewords d_1, d_2, \ldots, d_j that decompress into x'. Suppose we processed c_1, \ldots, c_{i-1} and now will process c_i and produced codewords d_1, d_2, \ldots d_{j-1} so far. We maintain two invariants

  1. All black symbols in the prefix of x that is produced by codewords c_1, \ldots, c_{i-1} are also produced by the codewords d_1, \ldots d_{j-1}
  2. The black points produced by the codewords d_1, \ldots d_{j-1} are all consecutive and are in the same order as they are in x.

If there is no black point in c_i, do nothing. Else, if c_i is an insert codeword, we output a similar insert codeword d_j. In the last case, where c_i is a copy codeword that contains black points, we know by invariants 1 and 2 that the black points in c_i occur consecutively in the text produced by d_1, \ldots d_j. We output a codeword that copies them, right after the last symbol produced by d_1, \ldots d_j. It is easily observed that invariants 1 and 2 are maintained. QED

The second part of the proof will make use of some “high technology”. We will use a characterization of the LZ compression which is derived by a technique called locally consistent parsing. This characterization and lcp technique is due to Cenk Sahinalp and Uzi Vishkin. We will use the result as black box however a good explanation can be found in papers Substring Compression Problems and The string edit distance matching problem with moves by Muthu and Cormode.

It turns out that for any given string we can construct a parse tree that

  • changes very little when we do some edit operations on text
  • the number of unique substrings that the nodes of the tree span approximates the LZ compressibility well.

Let UL(x) be the number of unique substrings that the nodes of the tree span. We have LZ(x) \leq UL(x) \leq LZ(x) \cdot O(\log |x| \log^*|x|). Now we state second part of the proof.

Lemma 2: For any x satisfying |x| < p(|w|) for a polynomial p, it holds that LZ(w) \leq LZ(xw)\cdot O(\log |w|\log^*|w|).

Proof: We have LZ(w) \leq \mathit{UL}(w) \leq \mathit{UL(xw)} + O(\log |xw| \log^*|xw|). Also \mathit{UL(xw)} \leq  LZ(xw) \cdot O(\log |xw| \log^*|xw|). Since |x| \leq p(|w|), we have LZ(w) \leq LZ(xw) \cdot O(\log |w| \log^*|w|). QED

So in this post, we establish

Proposition 1: For all string x and w, it holds that LZ(w) \leq LZ(xw)\cdot O(\log|w|\log^*|w|)

The sequence g'_n establishes a constant gap, i.e., LZ(0g'_n) / LZ(g'_n) \geq 3/2 - \epsilon. I would very much like to find a sequence that produces super-constant gap or prove that f = O(1) is sufficient.

Leave a Comment

Filed under Uncategorized

Interval Longest Common Prefix Problem

Consider a string S over the alphabet \sigma. It is well known that the longest common prefix of any two given substrings of S can be found in constant time, after a linear time preprocessing. The preprocessing involves constructing the suffix array SA and the LCP array, calculating the inverse of the suffix array SA and preparing the LCP array for constant time range minimum queries (RMQ). After this, we can find the longest common prefix of two substrings S[i_1:j_1] and S[i_2:j_2] by

\displaystyle \mathrm{rmq}_{\mathop{LCP}}(\mathop{\overline{SA}}[i_1],\mathop{\overline{SA}}[i_2])

Of course, if the found value is greater than the length of either substring, the length of the longest common prefix is equal to the length of the shorter string. Since the end points of the given substrings play insignificant role, defining the longest common prefix problem on suffixes of S is as general. We denote by S_k the suffix of S which starts at position k. Let lcp(i,j) denote the longest common prefix of S_i and S_j. (Note that this is different from the LCP[] array above.)

In a paper called Substring Compression Problems by Cormode and Muthukrishnan, the following problem is introduced. Preprocess a given string S, so that queries of the form ilcp([i:j],k) can be answered quickly, where

\displaystyle \mathop{ilcp}([i:j],k)=\max_{i\leq l \leq j}\mathop{lcp}(l, k)

For reasons I will mention later in this post, ilcp() query is often called with k=j+1. This may admit more efficient solutions than ilcp, so we define the restricted longest common prefix problem as:

\displaystyle \mathop{rilcp}([i:j])= \mathop{ilcp}([i:j],j+1)

Let us make few observations. Note if \mathop{\overline{SA}}[a]\le\mathop{\overline{SA}}[b]\le\mathop{\overline{SA}}[c], then

\displaystyle \mathop{lcp}(a,b) \geq \mathop{lcp}(a,c) and also
\displaystyle \mathop{lcp}(b,c) \geq \mathop{lcp}(a,c)

This means that to compute a ilcp query, it is sufficient to perform two lcp operations

  1. \mathop{lcp}(k,l), where S_l is the lexicographically the greatest suffix in S_i,S_{i+1},\ldots,S_j which is lexicographically smaller than S_k
  2. \mathop{lcp}(k,r), where S_r is the lexicographically the smallest suffix in S_i,S_{i+1},\ldots,S_j which is lexicographically greater than S_k

This translates the ilcp problem to finding the greatest value smaller than k in substring [i:j] of a permutation P (which, in ilcp case, is the inverse suffix array). Further, in rilcp problem we know that k = P[j+1]. Let us consider some naive solutions. We say that an algorithm has complexity \langle f(n), g(n) \rangle if the preprocessing takes f(n) time and a query takes g(n) time. Then precomputing all possible queries results in a \langle O(n^2), O(1) \rangle algorithm for the rilcp problem. It is also possible to obtain a \langle O(n^2), O(1) \rangle algorithm for the ilcp problem, by computing n RMQ arrays upper bounded by k for k=1,2,…,n.

Yet another naive method came across my mind is to divide P into \sqrt{n \log n} size non-overlapping chunks and prepare a sorted copy of each chunk so that in a full chunk, we can search for the greatest value smaller than k, for any k in log n time. Given an interval [i:j] and a k, we need to search at most \frac{n}{\sqrt{n\log n}} full chunks and two partial chunks. To search in a partial chunk we just go through all elements one by one. This results in a \langle O(n), O(\sqrt{n\log n}) \rangle algorithm.

Apparently, neither O(n^2) space nor O(\sqrt{n\log n}) query time is feasible, so we seek for some method which uses n.polylog(n) space, while having decent query times. Here is what we do: We create a binary search tree and store in each node the sorted copy of the elements in its subtree. Any interval is concatenation of at most \log n nodes and we can search for the smallest element greater than some value in a node in \log n time, using the sorted list that we created. So, we obtain an algorithm with O(n\log n) space and O(\log^2 n) query time.

Now we will improve the query time using the fact that there are very efficient algorithms to check if there exist an element in the sublist [i:j] with value in the range [s:t]. This is a well studied problem, called orthogonal range searching. Hence, to solve the former problem we can ask if there is an element in [i:j] with value in [1:k] and depending on the answer we can proceed with the value range [k/2:k] or [1:k/2] and so on to perform a binary search. It turns out that orthogonal search can be done in loglog n time using linear space when all dimentions are integers, which leads to a \langle O(n\log n), O(\log n\log\log n) \rangle algorithm for the ilcp problem. This is also one of the results given in Substring Compression Problems.

I mentioned above that ilcp([i:j],k) is often called with k=j+1. Suppose you want to LZSS compress a substring S[i:j]. (in LZSS, the longest prefix of the unprocessed input which exists earlier in the processed input is referenced.) What we need to do is letting j_0 = i and j_{h+1}=j_{h}+\max\big\{\mathit{ilcp}([i:j_h],j_h+1),1\big\} for h = 0,1,2,… The smallest h for which j_h \geq j is the LZSS compressed length of S[i:j].

So far everything seems good except that I somehow convinced myself that rilcp can be solved much more efficiently and this problem is bugging me. There are only n^2 queries and queries that end at the same position are higly correlated. Also, the array we work on is not only an integer size array but also a permutation. These somehow suggest me that a o(\log n) time solution should be possible. I will be happy to hear about this problem.

Update (April 2009): At Mathematics of String Spaces and Algorithmic Applications, Moshe Lewenstein pointed out that there are 2-D orthogonal range searching algorithms that find the smallest element in a specified dimension in a given rectangle in O(\log\log n) time. Therefore ilcp([i,j],k) queries can be answered in doubly logarithmic time. Also, Mihai Patrascu pointed that range queries on permutations are no easier than the general case, since we can reduce general case to permutations. Since \log\log n time is tight for orthogonal range searching (by the predecessor lowerbound), the new question is whether we can answer the restricted queries, i.e., rilcp(i,j) queries in less than doubly logarithmic time.

Leave a Comment

Filed under Uncategorized