Grammar-Based Compression · 2013-11-12
Slides by Carl Kingsford
Context-free Grammars
Def. A context-free grammar is a collection of rules of the form: A → x1x2x3…xk
where each of x1, x2, x3, …, xk is either a terminal symbol (a letter in the alphabet Σ) or a symbol that appears on the left-hand side of some rule.
S → ABC
A → xxDz
B → ww
C → aAa
D → zB
Expanding S step by step:
S → xxDzwwaAa
S → xxzBzwwaxxDza
S → xxzwwzwwaxxzBza
S → xxzwwzwwaxxzwwza
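As a minimal sketch of how such a grammar defines a string, the Python below expands the example grammar recursively. It assumes every left-hand symbol has exactly one rule and no rule refers back to itself, so the grammar derives a single string. The dictionary layout and the name `expand` are illustrative choices, not part of the slides.

```python
def expand(grammar, symbol="S"):
    """Expand a symbol into the terminal string it derives."""
    if symbol not in grammar:
        return symbol  # terminal: not the left-hand side of any rule
    return "".join(expand(grammar, s) for s in grammar[symbol])

# The example grammar from the slide; keys are nonterminals.
grammar = {"S": "ABC", "A": "xxDz", "B": "ww", "C": "aAa", "D": "zB"}
print(expand(grammar))  # xxzwwzwwaxxzwwza
```

Five short rules stand in for a 16-character string because B and A are each reused.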
Ziv-Lempel as a Grammar
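[Figure: at position i, the longest match to the string starting at i is labeled with a nonterminal A; the slide then shows rules A → … and B → …, whose right-hand sides are not recoverable from the transcript.]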
Accessing S[i] in Compressed Grammar
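S → ABC
A → abBcde
B → ww
C → aaDaa
D → xBy
Parse tree: S has children A B C; A expands to ab B cde; C expands to aa D aa; D expands to x B y; each B expands to ww.
Store length of expansion of each symbol: |S| = 17, |A| = 7, |B| = 2, |C| = 8, |D| = 4.
Example: S[13]
• The children of S end at positions 7 (A), 9 (B) and 17 (C), so position 13 falls in C.
• Within C, aa ends at 11 and D ends at 15, so position 13 falls in D.
• Within D, x is position 12 and B covers positions 13 to 14, so position 13 falls in B.
• B → ww, so S[13] = w.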
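The descent for S[13] can be sketched in Python as follows. This is an illustration under stated assumptions: positions are 1-indexed as on the slide, nonterminals are the dictionary keys, and the helper names `expansion_lengths` and `access` are hypothetical.

```python
def expansion_lengths(grammar):
    """Compute the length of the expansion of every nonterminal."""
    size = {}
    def length(sym):
        if sym not in grammar:
            return 1  # a terminal expands to itself
        if sym not in size:
            size[sym] = sum(length(s) for s in grammar[sym])
        return size[sym]
    for sym in grammar:
        length(sym)
    return size

def access(grammar, size, i, sym="S"):
    """Return the i-th character (1-indexed) of the expansion of sym."""
    if not 1 <= i <= size[sym]:
        raise IndexError(i)
    while sym in grammar:
        for child in grammar[sym]:
            n = size.get(child, 1)
            if i <= n:       # position i lies inside this child
                sym = child
                break
            i -= n           # skip past this child
    return sym

grammar = {"S": "ABC", "A": "abBcde", "B": "ww", "C": "aaDaa", "D": "xBy"}
size = expansion_lengths(grammar)
print(size["S"], access(grammar, size, 13))  # 17 w
```

Each lookup walks one root-to-leaf path of the parse tree, so nothing is decompressed beyond the symbols on that path.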
Re-Pair Off-line Compression Algorithm
Larsson and Moffat, Off-Line Dictionary-Based Compression, Proceedings of the IEEE, 88(11):1722-1732 (2000).
Re-Pair Algorithm Schema
1. Find the pair ab that occurs most frequently in the current message.
2. Replace all occurrences of ab with a new symbol A.
3. Add the rule A → ab to the grammar.
4. Repeat until no pair occurs more than once.
5. Zero-order compress (e.g. Huffman) the resulting string.
6. Encode and transmit the grammar.
Examples of a single replacement:
A → ab: ababcab → AAcA
A → aa: aaaacaa → AAcA
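Steps 1 to 4 of the schema can be sketched in Python as below. This is a deliberately naive version that recounts all pairs on every round, so it does not have the linear running time of Larsson and Moffat's implementation. Representing nonterminals as integers and the function names are assumptions made for the illustration, and steps 5 and 6 (entropy coding and grammar encoding) are left out.

```python
from collections import Counter

def count_pairs(seq):
    """Count adjacent pairs, skipping overlapping repeats such as the middle of 'aaa'."""
    counts, last_start = Counter(), {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last_start.get(pair) == i - 1:
            continue  # overlaps the occurrence counted just before
        counts[pair] += 1
        last_start[pair] = i
    return counts

def replace_pair(seq, pair, new_symbol):
    """Replace occurrences of pair, scanning left to right, with new_symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def repair(text):
    """Return (final sequence, rules); nonterminals are ints, terminals are chars."""
    seq, rules = list(text), {}
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break  # no pair occurs more than once
        new_symbol = len(rules)
        rules[new_symbol] = pair
        seq = replace_pair(seq, pair, new_symbol)
    return seq, rules

def expand(symbol, rules):
    """Undo the replacements for one symbol."""
    if isinstance(symbol, str):
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)
```

For example, `replace_pair(list("aaaacaa"), ("a", "a"), "A")` yields the symbols of AAcA, matching the second example on the slide.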
Example
Implementation Details
Replace(ab, A):
1. Find the next occurrence of xaby (using the hash table and the linked list of symbols).
2. Replace ab with A.
3. Decrement the counts of xa and by (moving each entry lower in the queue).
4. Increment the counts of xA and Ay (moving each entry higher in the queue, creating it the first time).
[Figure (Larsson & Moffat): the bins of the priority queue; the last bin includes pairs that occur > √n times.]
Running Time
• Finding the most frequent pair:
  • Walk down the last list in the priority queue in time O(√n) and find the most frequent pair. (Why is it O(√n) time to read the last list? Every item on the list occurs ≥ √n times, and a sequence of length n has fewer than n pair positions, so there can be at most √n such items.)
  • That pair will result in at least on the order of √n replacements. Why? It occurs ≥ √n times.
• Each operation of Replace(ab, A) takes O(1) time.
• For a sequence of length n there can be at most O(n) replacements. Why? Each replacement reduces the length of the sequence by 1.
• Total time to build the grammar = O(n).
Encoding the Grammar (Dictionary)
Divide symbols into generations:
Generation 0 (input alphabet): symbols 0 … k_0 − 1
Generation 1: symbols k_0 … k_1 − 1
Generation 2: symbols k_1 … k_2 − 1
Generation 3: symbols k_2 … k_3 − 1
A symbol A → XY is in the lowest generation such that X and Y are in previous generations.
k_i := number of symbols in generations ≤ i.
Can equate a symbol in generation i with a number between k_{i−1} and k_i − 1.
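The generation rule amounts to gen(A) = 1 + max(gen(X), gen(Y)), with input symbols in generation 0. The Python sketch below computes generations, the counts k_i, and one possible numbering. Ordering symbols by name within a generation is an assumption made only to keep the example deterministic, and the sample rules are hypothetical.

```python
def generations(rules, alphabet):
    """Map every symbol to its generation; rules maps A -> (X, Y)."""
    gen = {a: 0 for a in alphabet}
    def g(sym):
        if sym not in gen:
            x, y = rules[sym]
            gen[sym] = 1 + max(g(x), g(y))
        return gen[sym]
    for sym in rules:
        g(sym)
    return gen

def number_symbols(gen):
    """Return (symbol -> number, k) where k[i] counts symbols in generations <= i."""
    order = sorted(gen, key=lambda s: (gen[s], s))
    number = {s: n for n, s in enumerate(order)}
    k = [sum(1 for v in gen.values() if v <= i) for i in range(max(gen.values()) + 1)]
    return number, k

# Hypothetical rules over the alphabet {a, c, g, t}.
rules = {"A": ("c", "g"), "B": ("a", "A"), "C": ("B", "t")}
gen = generations(rules, "acgt")
number, k = number_symbols(gen)
print(k)  # [4, 5, 6, 7]
```

With this numbering every symbol of generation i ≥ 1 receives a number in the range k_{i−1} … k_i − 1.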
Encoding Dictionary, Idea
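Need to output a sequence of pairs (a_1, b_1), (a_2, b_2), …
Note:
1. In generation j, the maximum value of any a_i or b_i is ≤ k_j.
2. In generation j, if a_i < k_{j−2} then b_i ≥ k_{j−2}. Why? Otherwise both symbols would come from generations before j − 1, and the rule would belong to an earlier generation.
3. We can order the pairs in a generation in lexicographical order.
Encoding idea, for the sorted pairs (m, a), (m, b), (m, g), (n, m), (n, n), (o, c):
• First coordinate (m, m, m, n, n, o): changes slowly, so use Δ-like encoding.
• Second coordinate (a, b, g, m, n, c): use rules 1 and 2 above to figure out its range and use the minimum number of bits for that range.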
Re-Pair Timing
Decompression time very fast: not quite as good as gzip, but much better than a context-based encoder.
(Larsson & Moffat)
Re-Pair Compression Performance
[Table (Larsson & Moffat): dictionary and string sizes, in bits per character.]
Bigger than the simple 2-bit encoding!
Sequitur
Nevill-Manning & Witten, The Computer Journal 40 (1997), 103-116.
Sequitur Invariants
• Online algorithm: reads string from left to right, constructing a grammar, maintaining the following invariants at each step:
• Digram uniqueness: no adjacent pair of symbols appears > 1 time in the grammar.
• Rule utility: Every rule must be used at least twice.
Sequitur Algorithm
for i = 1…|S|:
    append S[i] to rule S
    Repeatedly replace digrams that occur > once with their symbol
    Repeatedly remove rules that occur only once
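The loop above can be sketched in Python as follows. This is a brute-force illustration that rescans the whole grammar after every change, so it lacks the efficiency of the published algorithm, and on inputs with several simultaneous repeats it may pick digrams in a different order. Representing nonterminals as integers (rule 0 standing for S) and all function names are assumptions.

```python
from collections import Counter

def find_repeat(rules):
    """Return two positions of a digram that occurs twice without overlap, or None."""
    seen = {}
    for rid, body in rules.items():
        for i in range(len(body) - 1):
            digram = (body[i], body[i + 1])
            if digram not in seen:
                seen[digram] = (rid, i)
            elif seen[digram] != (rid, i - 1):  # ignore overlaps such as "aaa"
                return seen[digram], (rid, i)
    return None

def sequitur(text):
    """Return {rule id: body}; rule 0 is S, ints are nonterminals, chars are terminals."""
    rules, next_id = {0: []}, 1
    for ch in text:
        rules[0].append(ch)
        while True:
            hit = find_repeat(rules)
            if hit:  # enforce digram uniqueness
                (r1, i1), (r2, i2) = hit
                digram = rules[r1][i1:i1 + 2]
                # Reuse a rule whose whole body is this digram, else create one.
                owner = next((r for r in (r1, r2) if r != 0 and len(rules[r]) == 2), None)
                if owner is None:
                    owner, next_id = next_id, next_id + 1
                    rules[owner] = digram
                    targets = [(r2, i2), (r1, i1)]  # later position first
                else:
                    targets = [(r1, i1) if owner == r2 else (r2, i2)]
                for rid, i in targets:
                    rules[rid][i:i + 2] = [owner]
                continue
            # Enforce rule utility: inline any rule that is used only once.
            uses = Counter(s for body in rules.values() for s in body if isinstance(s, int))
            lonely = next((r for r in rules if r != 0 and uses[r] == 1), None)
            if lonely is None:
                break
            body = rules.pop(lonely)
            for other in rules.values():
                if lonely in other:
                    k = other.index(lonely)
                    other[k:k + 1] = body
                    break
    return rules

def expand(rules, sym=0):
    """Expand a symbol back into text."""
    if isinstance(sym, str):
        return sym
    return "".join(expand(rules, s) for s in rules[sym])

print(sequitur("acgtcgacgt"))  # {0: [3, 1, 3], 1: ['c', 'g'], 3: ['a', 1, 't']}
```

Reading rule 1 as A and rule 3 as C, the output for acgtcgacgt is S → CAC, A → cg, C → aAt.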
Sequitur Example (From Cherniavsky & Ladner, 2004)
S = acgtcgacgt
S → acgtcg (“cg” appears twice)
S → aAtA, A → cg
S → aAtAacg, A → cg
S → aAtAaA, A → cg (“aA” appears twice)
S → BtAB, A → cg, B → aA
S → BtABt, A → cg, B → aA
S → CAC, A → cg, B → aA, C → Bt (B appears only once on right-hand sides)
S → CAC, A → cg, C → aAt
Encoding the Grammar
Rule S is transmitted left to right, with the following rules to handle non-terminals (NT):
• The first time an NT is encountered, its right-hand side is transmitted. (Its RHS is transmitted using these same rules.)
• The second time an NT is encountered, a pair (i, len) is transmitted, giving an index into S and a length that together form the right-hand side of the NT.
  • At this point, the decoder stores j → S[i…i+len] as a rule.
  • j is the next NT number.
• The third time an NT is encountered, a single number (j) is transmitted, referring to the rule created before.
Compression Performance
(Nevill-Manning & Witten)
Grammars Useful for More Than Compression
(Nevill-Manning & Witten)
DNASequitur
Cherniavsky and Ladner, Grammar-based Compression of DNA Sequences, UW CSE Tech Report, 2004
Applying Sequitur to DNA
• Reverse complements accounted for: when xy is seen, RC(xy) is implicitly seen.
• Several other ideas implemented as well.
DNA Sequitur
• Baseline = 2 bits / symbol
• Grammar-based methods do not compress the file in these tests.
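To make the reverse-complement rule concrete, the sketch below computes RC of a terminal digram and maps a digram and its reverse complement to one shared key. The `canonical` helper is only one hypothetical way a digram table could treat the two as the same entry; the slides do not say how DNASequitur implements this, and nonterminal symbols are not handled here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def canonical(digram):
    """Shared lookup key for a digram and its reverse complement (an assumption)."""
    return min(digram, reverse_complement(digram))

print(reverse_complement("AC"))          # GT
print(canonical("AC"), canonical("GT"))  # AC AC
```

Under this keying, seeing AC also records GT, which is the sense in which RC(xy) is implicitly seen.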
“Compressive Genomics”
Po-Ru Loh, Michael Baym, Bonnie Berger. Compressive genomics allows computational analysis to keep pace with genomic data. Nature Biotechnology 30(7):627-630, 2012.
CaBLAST
[Figure: a bin holds a “unique” sequence together with the sequences similar to it, which are represented as edit scripts; a 10-mer table (10-mer → position) is used to seed alignments.]
Search: use BLAST to search the query against the unique sequences (use a liberal cutoff for a “match”). For every hit of sufficient quality, expand the sequences contained in its bin and search them.
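The 10-mer table can be pictured as a dictionary from each 10-mer to the places where it occurs in the unique sequences. The Python sketch below is an assumption about its shape rather than CaBLAST's actual data structure; the names `kmer_table` and `seeds` are hypothetical, and the example uses k = 3 to stay readable.

```python
from collections import defaultdict

def kmer_table(sequences, k=10):
    """Map each k-mer to the (sequence index, position) pairs where it starts."""
    table = defaultdict(list)
    for sid, seq in enumerate(sequences):
        for pos in range(len(seq) - k + 1):
            table[seq[pos:pos + k]].append((sid, pos))
    return table

def seeds(table, query, k=10):
    """List (query position, sequence index, position) for every shared k-mer."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for sid, pos in table.get(query[qpos:qpos + k], []):
            hits.append((qpos, sid, pos))
    return hits

table = kmer_table(["ACGTACG"], k=3)
print(seeds(table, "TACG", k=3))  # [(0, 0, 3), (1, 0, 0), (1, 0, 4)]
```

Each seed marks a place where an alignment between the query and a unique sequence could be started.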
Building the CaBLAST database
Use the 10-mer at the current position to find unique sequences to search.
If any unique sequence contains a match of ≥ 300, add the sequence between the two pointers (“end of last fragment” and “current pointer”) to the database as follows:
[Figure labels: end of match; 100bp overlap into last fragment; new unique sequence; added to matching bin.]
If 10,000 bases go by with no match, create a new 10,000 base unique sequence bin.
Storing Edit Scripts
Each edit has the form CnnnSSSSS:
C: the operation, s = “substitute” or i = “insert”
nnn: distance from last edit (in octal)
SSSSS: sequence to insert / substitute
Deletions are substitutions with “-”. There are 16 possible characters (s, i, A, C, G, T, N, -, 0-7), so each character fits in a 4-bit encoding.
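Since the script alphabet has exactly 16 characters, two characters fit in one byte. The sketch below formats edits and packs them at 4 bits per character. The specific code assigned to each character, the zero padding of an odd-length script, and the function names are assumptions for illustration; the slides only state that a 4-bit encoding is used.

```python
ALPHABET = "siACGTN-01234567"  # the 16 characters; this code order is an assumption
CODE = {ch: n for n, ch in enumerate(ALPHABET)}

def make_edit(op, distance, seq):
    """Format one edit: operation, octal distance from the last edit, sequence."""
    if op not in ("s", "i"):
        raise ValueError("op must be 's' or 'i'")
    return f"{op}{distance:o}{seq}"

def pack(script):
    """Pack two characters per byte, zero-padding the final half byte if needed."""
    nibbles = [CODE[ch] for ch in script]
    if len(nibbles) % 2:
        nibbles.append(0)
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))

def unpack(data, n_chars):
    """Recover the first n_chars characters of a packed script."""
    chars = []
    for byte in data:
        chars.append(ALPHABET[byte >> 4])
        chars.append(ALPHABET[byte & 0x0F])
    return "".join(chars[:n_chars])

script = make_edit("s", 10, "AC") + make_edit("i", 3, "G-")
print(script, len(pack(script)))  # s12ACi3G- 5
```

The nine-character script takes five bytes instead of nine; the character count has to be stored separately because of the padding.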
CaBLAST Compression
(Loh et al, 2012)
CaBLAST Search Time
(Loh et al, 2012)
CaBLAST Accuracy
Varied identity threshold required for a match during compression
(Loh et al, 2012)
More CaBLAST Performance
Database of 36 yeast genomes. (Loh et al, 2012)