
Interactive Phrase Browsing Within Compressed Text

Raymond Wan and Alistair Moffat

University of Melbourne
http://www.cs.mu.oz.au/~{rwan,alistair}

Introduction

We describe, as an alternative to index-based retrieval, a system for phrase browsing with the following features:

•No explicit index is created.

•Minimal decompression of the document is required.

Our system, which we call Re-Store, builds on an existing compression algorithm called Re-Pair [1].

The Re-Pair algorithm

Re-Pair is an off-line dictionary-based compression algorithm that repeatedly identifies the most frequently occurring pair of adjacent symbols and replaces all occurrences of it with a new symbol. Each replacement reduces the message by one symbol. The new symbol and what it expands to are added to the dictionary.

A sample application of Re-Pair (see footnote 1) is shown:

    sequence            phrase hierarchy
    zenzizenzizenzic
    z1ziz1ziz1zic       1 -> e n
    z12z12z12c          2 -> z i
    z3z3z3c             3 -> 1 2
    444c                4 -> z 3

Two outputs are produced: the reduced sequence and the phrase hierarchy.
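The replacement loop can be illustrated with a short Python sketch. It is a naive version for illustration only, not the efficient implementation of [1]; the names `count_pairs`, `replace_pair`, `re_pair` and `expand` are hypothetical. Ties between equally frequent pairs are broken by first occurrence, which is an assumption, so the intermediate phrases may differ from the sample above.

```python
from collections import Counter

def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent pair of symbols."""
    counts, last = Counter(), {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last.get(pair) == i - 1:
            continue  # overlaps the occurrence just counted, e.g. in "aaa"
        counts[pair] += 1
        last[pair] = i
    return counts

def replace_pair(seq, pair, new_symbol):
    """Replace every occurrence of pair, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def re_pair(text):
    """Return (reduced sequence, phrase hierarchy) for a string."""
    seq = list(text)   # primitive symbols are 1-character strings
    rules = {}         # phrase number (int) -> (left symbol, right symbol)
    next_symbol = 1
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break      # no pair occurs twice, so nothing more to gain
        rules[next_symbol] = pair
        seq = replace_pair(seq, pair, next_symbol)
        next_symbol += 1
    return seq, rules

def expand(symbol, rules):
    """Recursively expand a symbol back into the text it stands for."""
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)

sequence, hierarchy = re_pair("zenzizenzizenzic")
print(sequence, hierarchy)
```

For "zenzizenzizenzic" this sketch also ends with a four-symbol sequence (three copies of one phrase followed by "c") and four dictionary entries, and expanding the sequence reproduces the input.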

The Re-Store system

The Re-Store system is made up of the subsystems shown in Figure 1. Each of these subsystems contributes to at least one of the three phases of Re-Store: compression, block merging, or phrase browsing.

Word-aligned Re-Pair

The original Re-Pair used only the frequencies of pairs of symbols to determine replacements. This resulted in phrases that began in the middle of one word and ended in the middle of the next, which makes browsing difficult. Re-Store uses a word-aligned version of Re-Pair that prevents this problem.

The difference in the phrase hierarchies between the unrestricted Re-Pair and the word-aligned Re-Pair using the example “peter●piper●peter●picker” is shown:

    Unrestricted Re-Pair      Word-aligned Re-Pair
    6  -> er                  6  -> er
    7  -> ●p                  7  -> et
    8  -> er●p                8  -> pi
    9  -> et                  9  -> pet
    10 -> er●pi               10 -> peter
    11 -> eter●pi             11 -> peter●

In the phrase hierarchy on the left, phrase #8 starts in the middle of one word and ends in the middle of the next word. Using word-aligned Re-Pair, all phrases are either substrings of words (or non-words) or entire words (or non-words).
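The distinction can be checked mechanically. The Python sketch below is not the parsing rule used by word-aligned Re-Pair, which is not spelled out here. It assumes that a word is a maximal run of letters, that a non-word is a maximal run of anything else, and that a phrase is acceptable when it lies inside a single word or non-word or is built only from complete ones. The names `tokens` and `is_word_aligned` are hypothetical.

```python
import re

# Assumption: words are maximal runs of ASCII letters, non-words are
# maximal runs of any other characters (including the "●" separator).
TOKEN = re.compile(r"[A-Za-z]+|[^A-Za-z]+")

def tokens(text):
    """Split text into alternating words and non-words."""
    return TOKEN.findall(text)

def is_word_aligned(phrase, source_text):
    """True if phrase never starts or ends part-way through a word or
    non-word of source_text while also spanning a boundary.
    Assumes phrase actually occurs in source_text."""
    vocabulary = set(tokens(source_text))
    parts = tokens(phrase)
    if len(parts) == 1:
        return True  # lies within a single word or non-word
    return all(part in vocabulary for part in parts)

text = "peter●piper●peter●picker"
for phrase in ["er", "●p", "er●p", "et", "er●pi", "eter●pi"]:
    print(phrase, is_word_aligned(phrase, text))
```

Under these assumptions, four of the six unrestricted phrases (#7, #8, #10 and #11) are rejected, while all six word-aligned phrases are accepted.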

Figure 1: The overall Re-Store system and how subsystems interact with each other. Each subsystem is represented as a box and coloured based on its primary function: compression, block merging, or phrase browsing.

Footnote 1: “zenzizenzizenzic” is an obsolete word that means “eighth power”. [Source: The Oxford English Dictionary, second edition]

The Re-View coder

The sequence is encoded using Re-View, which gives fast search time while sacrificing some compression. An encoded sequence contains multiple blocks with each block containing a prelude and codes. The prelude provides a description of the symbols that occur in that block by using nibble-aligned codes on differences. By restricting the maximum number of distinct symbol numbers in a block to 65,536, the codes can be encoded using 16-bit double-byte units.

This approach allows efficient searching, with the cost depending on the phrase number being searched for. Because the frequencies of the symbols in the sequence are relatively uniform, using a simple coder instead of a minimum-redundancy coder loses only a small amount of compression effectiveness while yielding fast search times.

In our experiments, we compressed 509 MB of the Wall Street Journal (WSJ) in blocks of 10 MB on a 933 MHz Pentium III with 1 GB RAM and 256 kB on-die cache. For this test file, the difference in compression between Re-View and a minimum-redundancy entropy coder was about 10%.

Figure 2 shows how Re-View blocks can be skipped when searching for a phrase number. Searches conducted on the merged sequence (produced by Re-Merge) with various alternative methods are summarised in Table 1. Re-View offers good compression and reasonable search times. Even a search for 100 phrase numbers using Re-View is faster than completely decoding the sequence.

Figure 2: By decoding the prelude, the existence of a phrase in the block can be found. If it does not exist, the rest of the block can be skipped.
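The block-skipping idea can be sketched in Python as follows. This is a simplified model, not the Re-View format itself: the prelude is kept as a plain list of differences rather than nibble-aligned codes, the codes are plain integers rather than packed 16-bit units, and the names `encode_block`, `decode_prelude` and `search` are hypothetical.

```python
from bisect import bisect_left

MAX_DISTINCT = 65536  # from the text: codes then fit in 16-bit units

def encode_block(symbols):
    """Return (prelude gaps, codes) for one block of phrase numbers."""
    prelude = sorted(set(symbols))
    if len(prelude) > MAX_DISTINCT:
        raise ValueError("too many distinct symbols for 16-bit codes")
    # Differences between consecutive distinct symbol numbers.
    gaps = [b - a for a, b in zip([0] + prelude, prelude)]
    rank = {s: i for i, s in enumerate(prelude)}
    codes = [rank[s] for s in symbols]  # each code indexes the prelude
    return gaps, codes

def decode_prelude(gaps):
    """Rebuild the sorted list of symbol numbers from its differences."""
    prelude, total = [], 0
    for gap in gaps:
        total += gap
        prelude.append(total)
    return prelude

def search(blocks, phrase):
    """Return (hits, blocks scanned); hits are (block, offset) pairs."""
    hits, scanned = [], 0
    for b, (gaps, codes) in enumerate(blocks):
        prelude = decode_prelude(gaps)
        pos = bisect_left(prelude, phrase)
        if pos == len(prelude) or prelude[pos] != phrase:
            continue  # phrase absent from the prelude: skip the codes
        scanned += 1
        hits.extend((b, i) for i, code in enumerate(codes) if code == pos)
    return hits, scanned

blocks = [encode_block(b) for b in ([5, 9, 5, 70000], [1, 2, 3], [9, 9, 4])]
print(search(blocks, 9))
```

In this toy example the second block is skipped entirely because phrase number 9 does not appear in its prelude.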

Table 1: Comparison of various search methods on the compressed and uncompressed sequences.


Block merging

To compress a document of n bytes, Re-Pair requires approximately 5n words of memory [1], which becomes a problem for large documents. One solution to this problem is to process the document in blocks.
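A rough calculation shows the scale of the problem for the WSJ experiment described earlier (509 MB of text, 10 MB blocks, 1 GB of RAM). The Python sketch below assumes 4-byte words, which the text does not specify, and the function name `repair_memory_mb` is hypothetical.

```python
WORD_BYTES = 4  # assumption: 32-bit machine words; the text only says "words"

def repair_memory_mb(input_mb, words_per_byte=5, word_bytes=WORD_BYTES):
    """Approximate Re-Pair working memory, in MB, for input_mb of text."""
    return input_mb * words_per_byte * word_bytes

whole_file_mb = repair_memory_mb(509)  # the entire WSJ file at once
one_block_mb = repair_memory_mb(10)    # a single 10 MB block

print(whole_file_mb, one_block_mb)
```

Under this assumption, compressing the whole file at once would need about 10,180 MB, roughly ten times the 1 GB of RAM available, while a 10 MB block needs about 200 MB.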

However, a compressed document with more than one block makes phrase browsing difficult. Ideally, we would like to browse a set of phrases drawn from the entire document. This problem is solved by merging blocks using Re-Merge.

Re-Merge merges two blocks at a time in passes, with each pass halving the number of blocks. Passes continue until all of the blocks have been merged into a single block. A pass may perform merging up to any one of three levels, numbered 1 to 3.

Phrase browsing with Re-Phine

The most visible subsystem of Re-Store is the one that allows a user to browse phrases. This task is given to Re-Phine.

One advantage of Re-Pair being an off-line dictionary-based algorithm is that the phrase hierarchy is separate from the sequence. Phrase browsing can be performed on the phrase hierarchy alone, without any processing of the sequence. As Table 2 shows, the phrase hierarchy for WSJ is around 4% of the entire compressed document. The sequence file is partially decoded only when the contexts of a phrase are required.

During the decoding of the phrase hierarchy, each phrase is placed in a node with six pointers, shown in Figure 4.

By following a phrase’s structural pointers, the two components of the phrase can be found. Also, by following parent-navigational pointers, phrases that use the current phrase as a left or right child can be found (also called right and left extending, respectively). Finally, sibling-navigational pointers can be used to locate phrases that start or end with the same left or right child as the current phrase. Figure 4 uses the phrase “bc” as the current phrase and shows samples of phrases that can be reached by following the navigational or structural pointers.

Phrase browsing using Re-Phine is equivalent to walking from one phrase to another using the edges in this structure. Figure 5 illustrates an example browsing session.
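The six-pointer node can be sketched in Python as follows. This is one plausible reading of the description above, not the Re-Phine data structure itself: it assumes the parent-navigational pointers lead to the head of a chain of parents and that the sibling-navigational pointers link that chain together. All class, attribute and function names are hypothetical.

```python
class Node:
    """One phrase (or primitive symbol) with six pointers."""
    def __init__(self, text, left=None, right=None):
        self.text = text
        self.left = left              # structural: left component
        self.right = right            # structural: right component
        self.first_as_left = None     # parent-navigational: a phrase using this as left child
        self.first_as_right = None    # parent-navigational: a phrase using this as right child
        self.next_same_left = None    # sibling-navigational: next phrase with the same left child
        self.next_same_right = None   # sibling-navigational: next phrase with the same right child

def build(rules):
    """Build nodes from {phrase number: (left, right)}.
    Assumes every phrase is defined before it is used as a child."""
    nodes = {}

    def node_for(symbol):
        if symbol not in nodes:
            nodes[symbol] = Node(symbol)  # primitive symbol
        return nodes[symbol]

    for number, (left_sym, right_sym) in rules.items():
        left, right = node_for(left_sym), node_for(right_sym)
        node = Node(left.text + right.text, left, right)
        # Push the new phrase onto the front of both sibling chains.
        node.next_same_left, left.first_as_left = left.first_as_left, node
        node.next_same_right, right.first_as_right = right.first_as_right, node
        nodes[number] = node
    return nodes

def right_extensions(node):
    """Phrases that use node as their left child."""
    out, parent = [], node.first_as_left
    while parent is not None:
        out.append(parent.text)
        parent = parent.next_same_left
    return out

def left_extensions(node):
    """Phrases that use node as their right child."""
    out, parent = [], node.first_as_right
    while parent is not None:
        out.append(parent.text)
        parent = parent.next_same_right
    return out

nodes = build({1: ("b", "c"), 2: (1, "d"), 3: (1, "e"), 4: ("a", 1)})
print(right_extensions(nodes[1]), left_extensions(nodes[1]))
```

With "bc" as the current phrase in this made-up hierarchy, the right extensions are "bcd" and "bce" and the only left extension is "abc".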

References

[1] N. J. Larsson and A. Moffat. Offline dictionary-based compression. Proc. IEEE, 88(11):1722-1732, Nov. 2000.

[2] A. Moffat and R. Wan. Re-Store: A system for compressing, browsing, and searching large documents. In Proc. SPIRE’01, Nov. 2001. To appear.

The three levels that a Re-Merge pass may perform are:

1. Take the union of two phrase hierarchies by removing duplicate phrases.

2. Use the merged phrase hierarchy to locate phrases in the sequence from left to right.

3. Locate new phrases and append them to the phrase hierarchy.
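Level 1 can be illustrated with a small Python sketch. It is not the Re-Merge algorithm: it assumes that two phrases are duplicates when they expand to the same text, and it represents each hierarchy as a list in which a child is either a 1-character string or the index of an earlier phrase. The names `expand` and `merge_hierarchies` are hypothetical.

```python
def expand(symbol, rules):
    """Expand a symbol: strings are primitives, ints index into rules."""
    if isinstance(symbol, str):
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)

def merge_hierarchies(rules_a, rules_b):
    """Union of two phrase hierarchies with duplicate phrases removed.
    Returns (merged rules, [old-to-new number map for each input])."""
    merged = []    # (left, right) pairs in the merged numbering
    by_text = {}   # expanded text -> merged phrase number
    maps = []
    for rules in (rules_a, rules_b):
        mapping = {}

        def remap(symbol):
            return symbol if isinstance(symbol, str) else mapping[symbol]

        for old, (left, right) in enumerate(rules):
            text = expand(old, rules)
            if text not in by_text:  # first time this phrase is seen
                merged.append((remap(left), remap(right)))
                by_text[text] = len(merged) - 1
            mapping[old] = by_text[text]
        maps.append(mapping)
    return merged, maps

block_a = [("e", "r"), ("p", "e"), (1, "t")]  # er, pe, pet
block_b = [("p", "i"), ("e", "r"), (0, "p")]  # pi, er, pip
merged, maps = merge_hierarchies(block_a, block_b)
print(merged, maps)
```

The returned maps record how the phrase numbers of each block change, which is what would be needed to rewrite the two sequences against the merged hierarchy.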

The effectiveness of block merging on WSJ is shown in Figure 3 and Table 2. In the graph, block merging has been performed up to level 2 on the output produced by word-aligned Re-Pair. The input file was compressed using 10 MB blocks, creating 51 blocks and requiring 6 iterations to merge into a single block.
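The block and iteration counts follow from the pairwise merging scheme, as the Python sketch below shows. It assumes that when a pass starts with an odd number of blocks, the unpaired block is carried into the next pass unchanged; the function name `merge_passes` is hypothetical.

```python
import math

def merge_passes(num_blocks):
    """Return (number of passes, block count after each pass)."""
    passes, history = 0, [num_blocks]
    while num_blocks > 1:
        # Blocks are merged in pairs; an unpaired block carries over.
        num_blocks = math.ceil(num_blocks / 2)
        passes += 1
        history.append(num_blocks)
    return passes, history

num_blocks = math.ceil(509 / 10)  # 509 MB of WSJ in 10 MB blocks
print(num_blocks, merge_passes(num_blocks))
```

For 51 blocks the counts go 51, 26, 13, 7, 4, 2, 1, which is six passes.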

Block merging achieves overall savings in compression by decreasing the size of the phrase hierarchy and the Re-View codes, at the cost of increasing the size of the Re-View prelude. As Table 2 shows, compression improves from 2.01 bits per character to 1.96.

While the compression ratio is important, recall that the motivation for performing block merging was to reduce the amount of memory used by Re-Pair while being able to browse phrases effectively.

Figure 4: A sample node in the structure constructed by the phrase browser.

Conclusion

Re-Store is a system for browsing compressed documents without building an explicit index and by performing minimal decompression. Re-Store builds on the Re-Pair compression algorithm, and includes subsystems to perform block merging (Re-Merge) and phrase browsing (Re-Phine).

A detailed description of the Re-Store system as well as an overview of related work will appear as Moffat and Wan, 2001 [2].

Much work remains to be done on Re-Store. The parsing rules for word-aligned Re-Pair break phrases between words, but some languages are written without whitespace characters. Re-Pair’s parsing rules may be improved by using punctuation marks to separate phrases.

Level 2 of Re-Merge currently searches for phrases in a greedy manner. It is expected that compression can be improved if level 2 can be performed more elegantly. Level 3 has yet to be implemented and is expected to improve compression further.

Finally, subfigure 5b contains a list of phrases containing the word “American” followed by varying combinations of whitespace. This is confusing and we are currently looking into ways of improving usability.

Subfigures: 5a. Initial phrase; 5b. Right extension; 5c. Left extension; 5d. Right extension; 5e. One of the contexts of the phrase.

Figure 5: A sample browsing session starting with the initial phrase “American” in subfigure 5a and concluding with the phrase “Latin American art” in subfigure 5d. In subfigure 5e, one of the contexts in which this phrase occurs is shown. The context used is ten symbols before and ten symbols after the phrase. In subfigure 5b, the phrases in that window are all siblings of each other. This is also true for subfigures 5c and 5d.

Figure 3: Compression ratio as a function of Re-Merge iteration.

Table 2: Compression ratio and encoding/decoding time using Re-Pair and Re-Merge.

Acknowledgements

This work was supported by the Australian Research Council. The Re-Phine graphical user interface was coded using a development version of the GTK+ library (version 1.3.5), available at http://www.gtk.org/.