Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a...

50
Grammar-Based Compression Slides by Carl Kingsford

Transcript of Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a...

Page 1: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Grammar-Based CompressionSlides by Carl Kingsford

Page 2: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Context-free Grammars

Def. A context-free grammar is a collection of rules of the form: A → x1x2x3…xk

where x1x2x3…xk are either terminal symbols (letters in the alphabet ∑) or symbols that appear on the left-hand side of some rule.

S ! ABC!A ! xxDz!B ! ww!C ! aAa!D ! zB

S ! xxDzwwaAa

S ! xxzBzwwaxxDzaS ! xxzwwzwwaxxzBzaS ! xxzwwzwwaxxzwwza

Page 3: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Ziv-Lempel as a Grammar

ilongest match to the string

starting at i

Page 4: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Ziv-Lempel as a Grammar

ilongest match to the string

starting at i

A

A →

Page 5: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Ziv-Lempel as a Grammar

ilongest match to the string

starting at i

A

A → BB →

Page 6: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Accessing S[i] in Compressed Grammar

SS → ABC

A → abBcde

B → ww

C → aaDaa

D → xBy A B C

ab B cde

ww

ww aa D aa

x B y

ww

177284

Store length of expansion of each symbol

Page 7: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Accessing S[i] in Compressed Grammar

SS → ABC

A → abBcde

B → ww

C → aaDaa

D → xBy A B C

ab B cde

ww

ww aa D aa

x B y

ww

177284

Store length of expansion of each symbol S[13]

Page 8: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Accessing S[i] in Compressed Grammar

SS → ABC

A → abBcde

B → ww

C → aaDaa

D → xBy A B C

ab B cde

ww

ww aa D aa

x B y

ww

177284

Store length of expansion of each symbol S[13]

7

Page 9: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Accessing S[i] in Compressed Grammar

SS → ABC

A → abBcde

B → ww

C → aaDaa

D → xBy A B C

ab B cde

ww

ww aa D aa

x B y

ww

177284

Store length of expansion of each symbol S[13]

7 9

Page 10: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Accessing S[i] in Compressed Grammar

SS → ABC

A → abBcde

B → ww

C → aaDaa

D → xBy A B C

ab B cde

ww

ww aa D aa

x B y

ww

177284

Store length of expansion of each symbol S[13]

7 9 17

Page 11: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Accessing S[i] in Compressed Grammar

SS → ABC

A → abBcde

B → ww

C → aaDaa

D → xBy A B C

ab B cde

ww

ww aa D aa

x B y

ww

177284

Store length of expansion of each symbol S[13]

7 9

11

17

Page 12: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Accessing S[i] in Compressed Grammar

SS → ABC

A → abBcde

B → ww

C → aaDaa

D → xBy A B C

ab B cde

ww

ww aa D aa

x B y

ww

177284

Store length of expansion of each symbol S[13]

7 9

11

17

15

Page 13: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Accessing S[i] in Compressed Grammar

SS → ABC

A → abBcde

B → ww

C → aaDaa

D → xBy A B C

ab B cde

ww

ww aa D aa

x B y

ww

177284

Store length of expansion of each symbol S[13]

7 9

11

17

15

12

Page 14: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Accessing S[i] in Compressed Grammar

SS → ABC

A → abBcde

B → ww

C → aaDaa

D → xBy A B C

ab B cde

ww

ww aa D aa

x B y

ww

177284

Store length of expansion of each symbol S[13]

7 9

11

17

15

12 14

Page 15: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Accessing S[i] in Compressed Grammar

SS → ABC

A → abBcde

B → ww

C → aaDaa

D → xBy A B C

ab B cde

ww

ww aa D aa

x B y

ww

177284

Store length of expansion of each symbol S[13]

7 9

11

17

15

12 14

13

Page 16: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Re-Pair Off-line Compression Algorithm

Larsson and Moffat, Off-Line Dictionary-Based Compression, Proceedings of the IEEE, 88(11):1722-1732 (2000).

Page 17: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Re-Pair Algorithm Schema

1. Find the pair ab that occurs most frequently in the current message.

2. Replace all occurrences of ab with a new symbol A

3. Add the rule A → ab to the grammar. 4. Repeat until no pair occurs > 1 time.

A → ab: ababcab → AAcA A → aa: aaaacaa → AAcA

5. Zero-order compress (e.g. Huffman) the resulting string 6. Encode and transmit the grammar

Page 18: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Example

Page 19: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Implementation Details

Replace(ab,A): 1. Find next occurrence of xaby (using hash and linked list of symbols) 2. Replace ab with A 3. Decrement counts of xa and by (moving entry lower in queue) 4. Increment counts of xA and Ay (moving entry higher in queue, creating them the first time)

(Larsson & Moffat)

This bin includes pairs

that occur > √n times.

Page 20: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Running Time

• Finding the most frequent pair: !• walk down the last list in the priority queue in time O(√n) and find the

most frequent pair. (Why is it O(√n) time to read the last list?) • that pair will result in at least O(√n) replacements. Why?

• Each operation of Replace(ab,A) takes O(1) time, so each replace happens in constant time.

• For a sequence of length n there can be at most O(n) replacements. Why?

• Total time to build the grammar = O(n).

Page 21: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Running Time

• Finding the most frequent pair: !• walk down the last list in the priority queue in time O(√n) and find the

most frequent pair. (Why is it O(√n) time to read the last list?) • that pair will result in at least O(√n) replacements. Why?

• Each operation of Replace(ab,A) takes O(1) time, so each replace happens in constant time.

• For a sequence of length n there can be at most O(n) replacements. Why?

• Total time to build the grammar = O(n).

Every item on the list occurs ≥ √n times, so

there can be at most √n such times.

Page 22: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Running Time

• Finding the most frequent pair: !• walk down the last list in the priority queue in time O(√n) and find the

most frequent pair. (Why is it O(√n) time to read the last list?) • that pair will result in at least O(√n) replacements. Why?

• Each operation of Replace(ab,A) takes O(1) time, so each replace happens in constant time.

• For a sequence of length n there can be at most O(n) replacements. Why?

• Total time to build the grammar = O(n).

Every item on the list occurs ≥ √n times, so

there can be at most √n such times.

Each replacement reduces the length of the sequence by 1.

Page 23: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Encoding the Grammar (Dictionary)Divide symbols into generations:

0:

1:

2:

3:

(input alphabet)0 k0 - 1

k0 k1 - 1

k1 k2 - 1

k3 - 1k2

A symbol A → XY is in the lowest generation such that X and Y are in previous generations

ki := # of symbols in generation ≤ i

Can equate a symbol in generation i with a number between ki -1 and ki-1

Page 24: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Encoding Dictionary, Idea

Need to output a sequence of pairs (a1,b1), (a2,b2),…

Note: 1. In generation j, the maximum value of any ai or bi is ≤ kj. 2. In generation j, if ai ≤ kj - 2 then bj ≥ kj - 2. Why? 3. We can order the pairs in a generation in lexicographical order

Encoding idea:m a m b m g n m n n o c

Changes slowly, so use Δ-like encoding

Use rules 1 and 2 above to figure out the range of the second coordinate and use the minimum # of bits for that range.

Page 25: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Re-Pair Timing

Decompression time very fast: not quite as good as gzip, but much better than a context-based encoder.

(Larsson & Moffat)

Page 26: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Re-Pair Compression Performance

dictionary string

in bits per character

(Larsson & Moffat)

Page 27: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Re-Pair Compression Performance

dictionary string

in bits per character

(Larsson & Moffat)

Bigger than the simple 2-bit encoding!

Page 28: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

SequiturNevill-Manning & Witten, The Computer Journal 40 (1997),

103-116.

Page 29: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Sequitur Invariants

• Online algorithm: reads string from left to right, constructing a grammar, maintaining the following invariants at each step:

• Digram uniqueness: no adjacent pair of symbols appears > 1 time in the grammar.

• Rule utility: Every rule must be used at least twice.

Page 30: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Sequitur Algorithm

for i = 1…|S|:!! append S[i] to rule S!! !! Repeatedly replace digrams that ! occur > once with their symbol!!! Repeatedly remove rules that occur ! only once.

Page 31: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Sequitur Example(From Cherniavsky & Ladner, 2004)

S = acgtcgacgt

S → acgtcg S → aAtA!A → cg

S → aAtAacg!A → cg

S → aAtAaA!A → cg

S → aAtAaA!A → cg

S → BtAB!A → cg!B → aA

S → BtABt!A → cg!B → aA

S → CAC!A → cg!B → aA!C → Bt

S → CAC!A → cg!C → aAt

Page 32: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Sequitur Example(From Cherniavsky & Ladner, 2004)

S = acgtcgacgt

S → acgtcg S → aAtA!A → cg

S → aAtAacg!A → cg

S → aAtAaA!A → cg

S → aAtAaA!A → cg

S → BtAB!A → cg!B → aA

S → BtABt!A → cg!B → aA

S → CAC!A → cg!B → aA!C → Bt

S → CAC!A → cg!C → aAt

“cg” appears twice

Page 33: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Sequitur Example(From Cherniavsky & Ladner, 2004)

S = acgtcgacgt

S → acgtcg S → aAtA!A → cg

S → aAtAacg!A → cg

S → aAtAaA!A → cg

S → aAtAaA!A → cg

S → BtAB!A → cg!B → aA

S → BtABt!A → cg!B → aA

S → CAC!A → cg!B → aA!C → Bt

S → CAC!A → cg!C → aAt

“cg” appears twice

“aA” appears twice

Page 34: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Sequitur Example(From Cherniavsky & Ladner, 2004)

S = acgtcgacgt

S → acgtcg S → aAtA!A → cg

S → aAtAacg!A → cg

S → aAtAaA!A → cg

S → aAtAaA!A → cg

S → BtAB!A → cg!B → aA

S → BtABt!A → cg!B → aA

S → CAC!A → cg!B → aA!C → Bt

S → CAC!A → cg!C → aAt

“cg” appears twice

“aA” appears twice

B appears only once on right hand sides

Page 35: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Encoding the GrammarRule S is transmitted left to right, with the following rules to handle non-terminals (NT):

• The first time a NT is encountered, it’s right-hand side is transmitted.

• The second time a NT is encountered, the pair (i, len) is transmitted that gives an index into S and length that form the righthand side of the NT. • A this point, the decoder stores j → S[i…i+len] as a rule • j is the next NT number.

• The third time a NT is encountered, a single number (j) is transmitted referring to the rule created before.

Its RHS is transmitted using these same rules

Page 36: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Compression Performance

(Nevill-Manning & Witten)

Page 37: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Grammars Useful for More Than Compression

(Nevill-Manning & Witten)

Page 38: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

DNASequiturCherniavsky and Ladner, Grammar-based Compression of DNA

Sequences, UW CSE Tech Report, 2004

Page 39: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Applying Sequitur to DNA• Reverse complements accounted for: when xy seen, RC(xy) is implicitly seen.

• Several other ideas implemented as well.

DNA Sequitur

• Baseline = 2 bits / symbol

• Grammar-based methods do not compress the file in these tests.

Page 40: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

“Compressive Genomics”Po-Ru Loh, Michael Baym, Bonnie Berger. Compressive genomics

allows computational analysis to keep pace with genomic data. Nature Biotechnology 30(7):627-630, 2012.

Page 41: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

CaBLAST

“unique” sequences

sequences similar to the unique

sequence

represented as edit scripts

Page 42: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

CaBLAST

10-mer Posn

10-mer table to seed alignments

“unique” sequences

sequences similar to the unique

sequence

represented as edit scripts

Page 43: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

CaBLAST

10-mer Posn

10-mer table to seed alignments

“unique” sequences

sequences similar to the unique

sequence

represented as edit scripts

Search: use BLAST to search query against unique sequences (use a liberal cutoff for a “match”)

for every hit of sufficient quality, expand the sequences contained it its bin and search them.

Page 44: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Building the CaBLAST databaseend of lastfragment

currentpointer

Use 10-mer at current position to find unique sequences to search If any unique sequence contains a match of ≥ 300, add the sequence between the two pointers to the database as follows:

end of match

100bp overlap into

last fragment

new unique sequence

added to matching bin If 10,000 bases go by with no

match, create a new 10,000 base unique sequence bin.

Page 45: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Building the CaBLAST databaseend of lastfragmentcurrentpointer

Use 10-mer at current position to find unique sequences to search If any unique sequence contains a match of ≥ 300, add the sequence between the two pointers to the database as follows:

end of match

100bp overlap into

last fragment

new unique sequence

added to matching bin If 10,000 bases go by with no

match, create a new 10,000 base unique sequence bin.

Page 46: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

Storing Edit Scripts

s = “substitute” i = “insert”

CnnnSSSSS

distance from last edit (in octal)

sequence to insert / substitute

Deletions are substitutions with “-“. There are 16 possible characters: s,i,A,C,G,T,N,-,0-7 → 4-bit encoding

Page 47: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

CaBLAST Compression

(Loh et al, 2012)

Page 48: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

CaBLAST Search Time

(Loh et al, 2012)

Page 49: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

CaBLAST Accuracy

Varied identity threshold required for a match during compression

(Loh et al, 2012)

Page 50: Grammar-Based Compression · 2013-11-12 · Context-free Grammars Def.A context-free grammar is a collection of rules of the form: A → x1x2x3…xk where x1x2x3…xk are either terminal

More CaBLAST Performance

Database of 36 yeast genomes.(Loh et al, 2012)