Lecture 4 Run Length Encoding
-
Upload
nidhal-el-abbadi -
Category
Science
-
view
657 -
download
1
Transcript of Lecture 4 Run Length Encoding
November 1, 2015 1 [email protected]
November 1, 2015 2 [email protected]
Run-Length Encoding
RLE Text Compression
RLE Image Compression
Golomb Coding
Example 1
Decoding
Example 2
Contents
November 1, 2015 3
Data often contains sequences of identical bytes. By replacing these
repeated byte sequences with the number of occurrences, a substantial
reduction of data can be achieved. This is known as run-length coding.
Run-Length Encoding
Example: the character C occurs eight times in a row and is compressed
to the three characters C!8:
Uncompressed data: ABCCCCCCCCDEFGGG
Run-length coded: ABC!8DEFGGG
The technique can be described precisely as follows: if a byte occurs at
least four times in a row, then the number of occurrences is counted.
The compressed data contain the byte, followed by the M-byte and the
number of occurrences of the byte.
November 1, 2015 4 [email protected]
To get an idea of the compression ratios produced by RLE, we assume a
string of N characters that needs to be compressed. We assume that the
string contains M repetitions of average length L each. Each of the M
repetitions is replaced by 3 characters (escape, count, and data), so the
size of the compressed string is N − M × L +M ×3 = N −M(L − 3) and the
compression factor is
Examples: N = 1000,M = 10, L = 4 yield
a compression factor of 1000 / [1000 − 10(4 − 3)] = 1.01.
A better result is obtained in the case N = 1000,M = 50, L = 10, where the
factor is 1000 / [1000 − 50(10 − 3)] = 1.538.
RLE Text Compression
November 1, 2015 [email protected] 5
Compressing an image using RLE is based on the observation that if we
select a pixel in the image at random, there is a good chance that its
neighbors will have the same color.
RLE Image Compression
The compressor therefore scans the bitmap row by row, looking for runs
of pixels of the same color.
The size of the compressed stream depends on the complexity of the
image. The more detail, the worse the compression.
November 1, 2015 [email protected] 6
the run length cannot be 0, it makes sense to write the [run length minus
one] on the output stream. Thus the pair (3, 87) denotes a run of four
pixels with intensity 87. This way, a run can be up to 256 pixels long.
In color images it is common to have each pixel stored as three bytes,
representing the intensities of the red, green, and blue components of the
pixel.
binary (black-and-white) images, usually consist of runs of 0’s or 1’s. As an
example, if a segment of a binary image is represented as
d =
000000001111111111100000000000000011100000000000001001111111111
and it can be compactly represented as c(d) = (9, 11, 15, 3, 13, 1, 2, 10) by
simply listing the lengths of alternate runs of 0’s and 1’s.
November 1, 2015 7 [email protected]
Run-length encoding is a simple approach to source coding when there
exists a long run of the same data, in a consecutive manner, in a data
set. As an example, the data
d = 5 5 5 5 5 5 5 19 19 19 19 19 19 19 19 19 19 19 19 0 0 0 0 0 0 0 0 8
23 23 23 23 23 23
contains long runs of 5‘s, 19’s, 0’s, 23’s, etc. Rather than coding each
sample in the run individually, the data can be represented compactly by
simply indicating the value of the sample and the length of its run when
it appears. In this manner, the data d can be run-length encoded as
(5 7) (19 12) (0 8) (8 1) (23 6).
For ease of understanding, we have shown a pair in each parentheses.
Here the first value represents the pixel, while the second indicates the
length of its run.
November 1, 2015 [email protected] 8
In some cases, the appearance of runs of symbols may not be very
apparent. But the data can possibly be preprocessed in order to aid run-
length coding.
Consider the data d = 26 29 32 35 38 41 44 50 56 62 68 78 88 98 108 118
116 114 112 110 108 106 104 102 100 98 96. We can simply preprocess
this data, by taking the sample difference e( i ) = d ( i ) - d ( i - l), to produce
the processed data E= 26 3 3 3 3 3 3 6 6 6 6 10 10 10 10 10 -2 -2 -2 -2 -2 -
2 -2 -2 -2 -2 -2. This preprocessed data can now be easily run-length
encoded as (26 1) (3 6) (6 4) (10 5) (-2 11).
November 1, 2015 9 [email protected]
Golomb Coding
Golomb coding is a lossless data compression method invented by
Solomon W. Golomb.
Symbols following a geometric distribution will have a Golomb code as an
optimal prefix code, making Golomb coding highly suitable for situations in
which the occurrence of small values in the input stream is significantly
more likely than large values.
November 1, 2015 [email protected] 10
The Golomb-Rice codes belong to a family of codes designed to encode
integers with the assumption that the larger an integer, the lower its
probability of occurrence.
The simplest code for this situation is the unary code. The unary code for
a positive integer n is simply n 1’s followed by a 0.
the code for 4 is 11110, and the code for 7 is 11111110
The unary code is the same as the Huffman code for the semi-infinite
alphabet {1, 2, 3, . . .} with probability model
Because the Huffman code is optimal, the unary code is also optimal
for this probability model.
November 1, 2015 [email protected] 11
The Golomb code is actually a family of codes parameterized by an integer m > 0.
In the Golomb code with parameter m, we represent an integer n > 0 using two
numbers q and r ,where
q is the quotient.
r is the remainder when n is divided by m.
• Let k be the smallest positive integer such that 2k ≥ 2𝑚.
• Then the corresponding code dictionary contains exactly m words of
every word length ≥ 𝑘. • as well as 2k-1 – m words of length k-1.
November 1, 2015 [email protected] 12
Example 1
We will consider the cases m=14 and m=16.
In the case of m=14 we find k=5 and 2k-1 - m = 2
((2k≥ 2𝑚) … 25≥ 2*14 )
So that there are two codewares of length 4, followed by 14 codewares of
lengths 5,6,7 etc.
November 1, 2015 [email protected] 13
In general:
q is the quotient and r is the reminder when n is divided by m.
The quotient can take on values 0,1, 2, … and is represented by the
unary code of q.
The reminder r can take on the values 0, 1, 2, … m-1.
If m is a power of two, we use the log2 m-bit binary representation of r.
If m is not a power of two, we could still use log2 m bits.
We can reduce the number of bits required if we use:
the log2 m bit binary representation of r for the first 2 𝑙𝑜𝑔2 𝑚 − 𝑚
values.
and the log2 m bit binary representation of r + 2 𝑙𝑜𝑔2 𝑚 − 𝑚 for the
rest of the values.
• x is integer part of x.
November 1, 2015 [email protected] 14
Log2 14 = 3.807 m = 14
log2 14 = 4 bits
log2 14 = 3
log2 m bits binary representation of r for the first 2 𝑙𝑜𝑔2 𝑚 − 𝑚 values.
log2 14 = 3 bits
2 𝑙𝑜𝑔2 𝑚 −𝑚 = 24 – 14 = 16 – 14 = 2
log2 m bit binary representation of r + 2 𝑙𝑜𝑔2 𝑚 − 𝑚 for the rest of the
values.
log2 14 = 4 bits
r + 2 𝑙𝑜𝑔2 𝑚 − 𝑚 = r + 24 – 14 = r + 16 – 14 = r + 2
November 1, 2015 [email protected] 15
since m=16 is a power of 2, the corresponding dictionary contains
exactly 16 words of every word length starting with length 5.
If m is a power of two, we use the log2 m-bit binary representation of r.
m=16
Log2 16 = 4 bits binary representation of r.
2k≥ 2𝑚) … 25≥ 2*16
November 1, 2015 [email protected] 16
November 1, 2015 [email protected] 17
Decoding
The dictionaries in the previous table exhibit striking patterns which
suggest a rather simple decoding procedure might be employed.
For the case m=16, when m is power of 2, the following rule for decoding
is adequate.
Start at the beginning (left end) of the word, and counts the number of
1’s preceding the first 0. lets this number be A ≥ 0.
The word consist of A+4 bits by regarded as the ordinary binary
representation of the integer R, 0 ≤ R ≤ 15.
Then the correct decoding of the word is 16 A + R.
If m is a power of two, we use the log2 m-bit binary representation of r.
November 1, 2015 [email protected] 18
For the rest of values, let the last 4 bits be the binary representation of the
integer R.
Then the correct decoding of the codeword is 14 A + R - 2.
we regard the codeword as consisting of the total of A + 4 bits.
Letting the 4 bits be the binary representation of the first integer R’ , the
correct decoding in this case is 14 A + R’ (A equal zero).
If m is not a power of two, we could still use log2 m bits.
We can reduce the number of bits required if we use:
the log2 m bit binary representation of r for the first 2 𝑙𝑜𝑔2 𝑚 − 𝑚
values.
and the log2 m bit binary representation of r + 2 𝑙𝑜𝑔2 𝑚 − 𝑚 for the
rest of the values.
The case m=14, is only slightly more complicated.
November 1, 2015 [email protected] 19
Example 2
Lets design a Golomb code for m=5
Log2 5 = 2.321 m = 5
log2 5 = 3 bits
log2 5 = 2 bits
log2 m bits binary representation of r for the first 2 𝑙𝑜𝑔2 𝑚 − 𝑚 values.
log2 5 = 2 bits
2 𝑙𝑜𝑔2 𝑚 −𝑚 = 23 – 5 = 8 – 5 = 3 values of r (that is, r = 0, 1, 2) will be
represented by the 2-bit binary representation of r.
log2 m bit binary representation of r + 2 𝑙𝑜𝑔2 𝑚 − 𝑚 for the rest of the
values that is, r = 3, 4).
log2 5 = 3 bits
r + 2 𝑙𝑜𝑔2 𝑚 − 𝑚 = r + 23 – 5 = r + 8 – 5 = r + 3
November 1, 2015 [email protected] 20
November 1, 2015 21 [email protected]