Basics of Information Theory


Source: http://www.cs.cf.ac.uk/Dave/Multimedia/node237.html

According to Shannon, the entropy of an information source S is defined as:

$H(S) = \sum_{i} p_i \log_2 \frac{1}{p_i}$

where $p_i$ is the probability that symbol $S_i$ in S will occur.

$\log_2 \frac{1}{p_i}$ indicates the amount of information contained in $S_i$, i.e., the number of bits needed to code $S_i$.

For example, in an image with a uniform distribution of gray-level intensities, i.e. $p_i = 1/256$ for all i, the number of bits needed to code each gray level is $\log_2 256 = 8$ bits. The entropy of this image is 8.

Q: How about an image in which half of the pixels are white (I = 220) and half are black (I = 10)?
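To make the definition concrete, here is a minimal Python sketch (the entropy function and the two examples are our own illustration, not part of the original notes; the second print answers the question above):

    import math

    def entropy(probabilities):
        """Shannon entropy in bits: H = sum(p * log2(1/p)) over non-zero p."""
        return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

    # Uniform 256-level image: 8 bits per gray level.
    print(entropy([1.0 / 256] * 256))   # -> 8.0

    # Half the pixels white (I = 220), half black (I = 10): two symbols, p = 1/2 each.
    print(entropy([0.5, 0.5]))          # -> 1.0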

Lossless Compression Algorithms (Pattern Substitution)

This is a simple form of statistical encoding.

Here we substitute a frequently repeating pattern with a code. The code is shorter than the pattern, giving us compression.

A simple Pattern Substitution scheme could employ a predefined code (for example, replace all occurrences of `The' with the code '&').

More typically, tokens are assigned according to the frequency of occurrence of patterns:

Count occurrence of tokens

Sort in Descending order

Assign some symbols to highest count tokens

A predefined symbol table may be used, i.e. assign code i to token i. (A sketch of the frequency-based scheme follows below.)
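A minimal Python sketch of the frequency-based assignment (the whitespace tokenisation and the single-character code symbols are our own assumptions for illustration):

    from collections import Counter

    def build_code_table(text, symbols):
        """Count tokens, sort by descending frequency, assign code i to token i."""
        counts = Counter(text.split())
        ranked = [tok for tok, _ in counts.most_common(len(symbols))]
        return {tok: sym for sym, tok in zip(symbols, ranked)}

    text = "the cat and the dog and the bird"
    table = build_code_table(text, "&#@")
    print(table)    # {'the': '&', 'and': '#', 'cat': '@'}
    encoded = " ".join(table.get(tok, tok) for tok in text.split())
    print(encoded)  # & @ # & dog # & bird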

Huffman Coding

Huffman coding is based on the frequency of occurrence of a data item (pixel in images). The principle is to use a lower number of bits to encode the data that occurs more frequently. Codes are stored in a Code Book which may be constructed for each image or a set of images. In all cases the code book plus encoded data must be transmitted to enable decoding.

The Huffman algorithm is now briefly summarised:

A bottom-up approach

1. Initialization: Put all nodes in an OPEN list, keep it sorted at all times (e.g., ABCDE).

2. Repeat until the OPEN list has only one node left:

(a) From OPEN pick two nodes having the lowest frequencies/probabilities, and create a parent node for them.

(b) Assign the sum of the children's frequencies/probabilities to the parent node and insert it into OPEN.

(c) Assign code 0, 1 to the two branches of the tree, and delete the children from OPEN.
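A minimal Python sketch of this bottom-up procedure, using a heap to keep the OPEN list sorted (the five example frequencies are our own illustrative choice):

    import heapq
    import itertools

    def huffman_code(freqs):
        """Build a Huffman code from {symbol: frequency} using the OPEN-list
        procedure above; the counter breaks ties between equal frequencies."""
        tiebreak = itertools.count()
        # Each heap entry: (frequency, tiebreaker, tree); a tree is a symbol or a pair.
        heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)    # two lowest-frequency nodes
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
        codes = {}
        def walk(tree, prefix):
            if isinstance(tree, tuple):          # internal node: branch 0 / branch 1
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:
                codes[tree] = prefix or "0"      # single-symbol edge case
            return codes
        _, _, root = heap[0]
        return walk(root, "")

    print(huffman_code({"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}))
    # A gets a 1-bit code; the rarer symbols get 3-bit codes.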

Adaptive Huffman Coding

The basic Huffman algorithm has been extended, for the following reasons:

(a) The previous algorithms require statistical knowledge which is often not available (e.g., live audio, video).

(b) Even when it is available, it could be a heavy overhead, especially when many tables have to be sent because a non-order-0 model is used, i.e. one taking into account the impact of the previous symbol on the probability of the current symbol (e.g., 'q' and 'u' often come together, ...).

The solution is to use adaptive algorithms. As an example, the Adaptive Huffman Coding is examined below. The idea is however applicable to other adaptive compression algorithms.

ENCODER
-------

    Initialize_model();
    while ((c = getc(input)) != eof)
    {
        encode(c, output);
        update_model(c);
    }

DECODER
-------

    Initialize_model();
    while ((c = decode(input)) != eof)
    {
        putc(c, output);
        update_model(c);
    }

The key is to have both encoder and decoder use exactly the same initialization and update_model routines.
update_model does two things: (a) increment the count, (b) update the Huffman tree (Fig7.2).

During the updates, the Huffman tree maintains its sibling property, i.e. the nodes (internal and leaf) are arranged in order of increasing weights (see figure).

When swapping is necessary, the farthest node with weight W is swapped with the node whose weight has just been increased to W+1. Note: if the node with weight W has a subtree beneath it, then the subtree goes with it.

The Huffman tree could look very different after node swapping (Fig7.2), e.g., in the third tree, node A is again swapped and becomes the #5 node. It is now encoded using only 2 bits.

Note: the code for a particular symbol changes during the adaptive coding process.

Arithmetic Coding

Huffman coding and the like use an integer number (k) of bits for each symbol, hence k is never less than 1. Sometimes, e.g., when sending a 1-bit image, compression becomes impossible.

Idea: Suppose the alphabet is {X, Y}, with prob(X) = 2/3 and prob(Y) = 1/3.

If we are only concerned with encoding length 2 messages, then we can map all possible messages to intervals in the range [0..1]:

To encode message, just send enough bits of a binary fraction that uniquely specifies the interval.

Similarly, we can map all possible length 3 messages to intervals in the range [0..1]:

Q: How to encode X Y X X Y X ?

Q: What about an alphabet with 26 symbols, or 256 symbols, ...?

In general, number of bits is determined by the size of the interval.

Examples:

the first interval (XXX) is 8/27, needs 2 bits -> 2/3 bit per symbol

the last interval (YYY) is 1/27, needs 5 bits

In general, we need $\lceil \log_2 \frac{1}{p} \rceil$ bits to represent an interval of size p. This approaches optimal encoding as the message length goes to infinity.
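A minimal Python sketch of the interval narrowing for the two-symbol alphabet above (the function is our own illustration; a real arithmetic coder emits bits incrementally using integer arithmetic rather than computing the whole interval in floating point):

    import math

    def interval_for(message, probs, order="XY"):
        """Narrow [low, high) once per symbol; the final width is the
        product of the symbol probabilities."""
        low, high = 0.0, 1.0
        for sym in message:
            width = high - low
            cum = 0.0
            for s in order:                     # cumulative probability below sym
                if s == sym:
                    break
                cum += probs[s]
            low, high = low + cum * width, low + (cum + probs[sym]) * width
        return low, high

    probs = {"X": 2/3, "Y": 1/3}
    low, high = interval_for("XXX", probs)
    print(high - low, math.ceil(math.log2(1 / (high - low))))  # width 8/27 -> 2 bits
    low, high = interval_for("YYY", probs)
    print(high - low, math.ceil(math.log2(1 / (high - low))))  # width 1/27 -> 5 bits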

Problem: how to determine probabilities?

Simple idea is to use adaptive model: Start with guess of symbol frequencies. Update frequency with each new symbol.

Another idea is to take account of intersymbol probabilities, e.g., Prediction by Partial Matching.

Implementation Notes: Can be CPU and memory intensive; patented.

Lempel-Ziv-Welch (LZW) Algorithm

The LZW algorithm is a very common compression technique.

Suppose we want to encode the Oxford Concise English dictionary which contains about 159,000 entries. Why not just transmit each word as an 18 bit number?

Problems:

Too many bits,

everyone needs a dictionary,

only works for English text.

Solution: Find a way to build the dictionary adaptively.

Original methods due to Ziv and Lempel in 1977 and 1978. Terry Welch improved the scheme in 1984 (called LZW compression).

It is used in UNIX compress -- 1D token stream (similar to below)

It is used in GIF compression -- 2D window tokens (treat the image as with Huffman Coding above).

Reference: Terry A. Welch, "A Technique for High Performance Data Compression", IEEE Computer, Vol. 17, No. 6, 1984, pp. 8-19.

The LZW Compression Algorithm can be summarised as follows:

    w = NIL;
    while ( read a character k )
    {
        if wk exists in the dictionary
            w = wk;
        else
        {
            add wk to the dictionary;
            output the code for w;
            w = k;
        }
    }
    output the code for w;   /* flush the final string (the T at EOF in the example) */

Original LZW used a dictionary with 4K entries; the first 256 (0-255) are the ASCII codes.
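A runnable Python sketch of the compressor (the function name is ours; the dictionary is initialised with all 256 single-byte strings as in the notes, and it is run on the example string traced below):

    def lzw_compress(data):
        """LZW: grow the longest-match string w; emit a code when w+k is new."""
        dictionary = {chr(i): i for i in range(256)}   # codes 0-255
        next_code = 256
        w = ""
        out = []
        for k in data:
            wk = w + k
            if wk in dictionary:
                w = wk
            else:
                out.append(dictionary[w])
                dictionary[wk] = next_code
                next_code += 1
                w = k
        if w:
            out.append(dictionary[w])                  # flush the final string
        return out

    codes = lzw_compress("^WED^WE^WEE^WEB^WET")
    print(codes)  # [94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]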

Example:

Input string is "^WED^WE^WEE^WEB^WET".

    w      k      output   index   symbol
    -------------------------------------
    NIL    ^
    ^      W      ^        256     ^W
    W      E      W        257     WE
    E      D      E        258     ED
    D      ^      D        259     D^
    ^      W
    ^W     E      256      260     ^WE
    E      ^      E        261     E^
    ^      W
    ^W     E
    ^WE    E      260      262     ^WEE
    E      ^
    E^     W      261      263     E^W
    W      E
    WE     B      257      264     WEB
    B      ^      B        265     B^
    ^      W
    ^W     E
    ^WE    T      260      266     ^WET
    T      EOF    T

A 19-symbol input has been reduced to 7-symbol plus 5-code output. Each code/symbol will need more than 8 bits, say 9 bits.

Usually, compression doesn't start until a large number of bytes (e.g., > 100) are read in.

The LZW Decompression Algorithm is as follows:

    read a character k;
    output k;
    w = k;
    while ( read a character k )
    /* k could be a character or a code. */
    {
        entry = dictionary entry for k;
        output entry;
        add w + entry[0] to dictionary;
        w = entry;
    }
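A runnable Python sketch of the decoder. Note that the pseudocode above glosses over one corner case: a code can arrive before its entry has been added to the dictionary (inputs containing a pattern of the form kwkwk). The sketch handles it, although the example below never exercises it:

    def lzw_decompress(codes):
        """Rebuild the dictionary while decoding; mirrors the compressor."""
        dictionary = {i: chr(i) for i in range(256)}
        next_code = 256
        w = dictionary[codes[0]]
        out = [w]
        for k in codes[1:]:
            if k in dictionary:
                entry = dictionary[k]
            else:
                entry = w + w[0]    # corner case: code not yet in the dictionary
            out.append(entry)
            dictionary[next_code] = w + entry[0]
            next_code += 1
            w = entry
        return "".join(out)

    print(lzw_decompress([94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]))
    # -> ^WED^WE^WEE^WEB^WET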

Example (continued):

The input (the compressed stream produced above) is "^WED<256>E<260><261><257>B<260>T".

    w      k      output   index   symbol
    -------------------------------------
           ^      ^
    ^      W      W        256     ^W
    W      E      E        257     WE
    E      D      D        258     ED
    D      256    ^W       259     D^
    ^W     E      E        260     ^WE
    E      260    ^WE      261     E^
    ^WE    261    E^       262     ^WEE
    E^     257    WE       263     E^W
    WE     B      B        264     WEB
    B      260    ^WE      265     B^
    ^WE    T      T        266     ^WET

Problem: What if we run out of dictionary space?

Solution 1: Keep track of unused entries and use LRU

Solution 2: Monitor compression performance and flush dictionary when performance is poor.

Implementation Note: LZW can be made really fast; it grabs a fixed number of bits from input stream, so bit parsing is very easy. Table lookup is automatic.

The Discrete Cosine Transform (DCT)

The discrete cosine transform (DCT) helps separate the image into parts (or spectral sub-bands) of differing importance (with respect to the image's visual quality). The DCT is similar to the discrete Fourier transform: it transforms a signal or image from the spatial domain to the frequency domain (Fig7.8).

DCT Encoding

The general equation for a 1D (N data items) DCT is defined by the following equation:

$F(u) = \sqrt{\tfrac{2}{N}} \,\Lambda(u) \sum_{i=0}^{N-1} f(i) \cos\left[\frac{(2i+1)u\pi}{2N}\right]$

and the corresponding inverse 1D DCT transform is simply $F^{-1}(u)$, i.e.:

$f(i) = \sqrt{\tfrac{2}{N}} \sum_{u=0}^{N-1} \Lambda(u) F(u) \cos\left[\frac{(2i+1)u\pi}{2N}\right]$

where

$\Lambda(u) = \begin{cases} \frac{1}{\sqrt{2}} & u = 0 \\ 1 & \text{otherwise} \end{cases}$

The general equation for a 2D (N by M image) DCT is defined by the following equation:

$F(u,v) = \sqrt{\tfrac{2}{N}}\sqrt{\tfrac{2}{M}} \,\Lambda(u)\Lambda(v) \sum_{i=0}^{N-1}\sum_{j=0}^{M-1} f(i,j) \cos\left[\frac{(2i+1)u\pi}{2N}\right] \cos\left[\frac{(2j+1)v\pi}{2M}\right]$

and the corresponding inverse 2D DCT transform is simply $F^{-1}(u,v)$, i.e.:

$f(i,j) = \sqrt{\tfrac{2}{N}}\sqrt{\tfrac{2}{M}} \sum_{u=0}^{N-1}\sum_{v=0}^{M-1} \Lambda(u)\Lambda(v) F(u,v) \cos\left[\frac{(2i+1)u\pi}{2N}\right] \cos\left[\frac{(2j+1)v\pi}{2M}\right]$

where $\Lambda$ is as above.

The basic operation of the DCT is as follows:

The input image is N by M;

f(i,j) is the intensity of the pixel in row i and column j;

F(u,v) is the DCT coefficient in row u and column v of the DCT matrix.

For most images, much of the signal energy lies at low frequencies; these appear in the upper left corner of the DCT.

Compression is achieved since the lower right values represent higher frequencies, and are often small - small enough to be neglected with little visible distortion.

The DCT input is an 8 by 8 array of integers. This array contains each pixel's gray scale level;

8 bit pixels have levels from 0 to 255.

Therefore an 8 point (N = M = 8) DCT would be:

$F(u,v) = \frac{1}{4} \,\Lambda(u)\Lambda(v) \sum_{i=0}^{7}\sum_{j=0}^{7} f(i,j) \cos\left[\frac{(2i+1)u\pi}{16}\right] \cos\left[\frac{(2j+1)v\pi}{16}\right]$

where $\Lambda(u) = \frac{1}{\sqrt{2}}$ for $u = 0$, and $1$ otherwise.

Question: What is F[0,0]?

Answer: F[0,0] is the DC component; the other coefficients are the AC components.

The output array of DCT coefficients contains integers; these can range from -1024 to 1023.

It is computationally easier to implement and more efficient to regard the DCT as a set of basis functions which, given a known input array size (8 x 8), can be precomputed and stored. This involves simply computing the values of a convolution mask (8 x 8 window) that gets applied to the image (sum the products of mask values and the pixels the window overlaps, applying the window across all rows/columns of the image). The values are simply calculated from the DCT formula. The 64 (8 x 8) DCT basis functions are illustrated in Fig7.9.

DCT basis functions

Why DCT not FFT?

DCT is similar to the Fast Fourier Transform (FFT), but can approximate lines well with fewer coefficients (Fig7.10)

DCT/FFT Comparison

Computing the 2D DCT

Factoring reduces problem to a series of 1D DCTs (Fig7.11):

apply 1D DCT (Vertically) to Columns

apply 1D DCT (Horizontally) to resultant Vertical DCT above.

or alternatively Horizontal to Vertical.

The equations are given by:

$G(u,j) = \sqrt{\tfrac{2}{N}} \,\Lambda(u) \sum_{i=0}^{N-1} f(i,j) \cos\left[\frac{(2i+1)u\pi}{2N}\right]$   (1D DCT applied to each column j)

$F(u,v) = \sqrt{\tfrac{2}{M}} \,\Lambda(v) \sum_{j=0}^{M-1} G(u,j) \cos\left[\frac{(2j+1)v\pi}{2M}\right]$   (1D DCT applied to each row u of the result)
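A minimal pure-Python sketch of this factored 2D DCT (our own illustration; real implementations precompute the cosine terms as basis functions, as described above, and use the fast fixed-point factorizations mentioned below):

    import math

    def dct_1d(vec):
        """Orthonormal 1-D DCT, straight from the formula above."""
        n = len(vec)
        lam = lambda u: 1 / math.sqrt(2) if u == 0 else 1.0
        return [math.sqrt(2.0 / n) * lam(u) *
                sum(vec[i] * math.cos((2 * i + 1) * u * math.pi / (2 * n))
                    for i in range(n))
                for u in range(n)]

    def dct_2d(block):
        """Factored 2-D DCT: 1-D DCT down the columns, then along the rows."""
        cols = list(zip(*[dct_1d(col) for col in zip(*block)]))  # columns first
        return [dct_1d(list(row)) for row in cols]

    # Flat 8x8 block: all the energy ends up in the DC coefficient F[0][0].
    block = [[128] * 8 for _ in range(8)]
    F = dct_2d(block)
    print(round(F[0][0], 3))   # 1024.0 = 8 * 128; all other coefficients ~0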

Most software implementations use fixed point arithmetic. Some fast implementations approximate coefficients so all multiplies are shifts and adds.

World record is 11 multiplies and 29 adds. (C. Loeffler, A. Ligtenberg and G. Moschytz, "Practical Fast 1-D DCT Algorithms with 11 Multiplications", Proc. Int'l. Conf. on Acoustics, Speech, and Signal Processing 1989 (ICASSP `89), pp. 988-991)

Differential Encoding

This is a simple example of the transform coding mentioned earlier, and an instance of this approach.

Here:

The difference between the actual value of a sample and a prediction of that value is encoded.

Also known as predictive encoding.

Examples of the technique include: differential pulse code modulation, delta modulation and adaptive pulse code modulation -- they differ in the prediction part.

Suitable where successive signal samples do not differ much, but are not zero. E.g. Video -- difference between frames, some audio signals.

Differential pulse code modulation (DPCM) uses a simple prediction:

$f_{predict}(t_i) = f_{actual}(t_{i-1})$

i.e. a simple Markov model where the current value is used as the prediction of the next value.

So we simply need to encode the difference:

$\Delta f(t_i) = f_{actual}(t_i) - f_{predict}(t_i)$

If successive samples are close to each other, we only need to encode the first sample with a large number of bits:

    Actual Data:     9  10   7   6
    Predicted Data:  0   9  10   7
    Differences:    +9  +1  -3  -1
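A minimal Python sketch of this DPCM scheme, reproducing the worked example above (the function names are ours):

    def dpcm_encode(samples):
        """Encode the difference between each sample and the previous one,
        with 0 as the initial prediction."""
        prediction = 0
        diffs = []
        for s in samples:
            diffs.append(s - prediction)
            prediction = s              # f_predict(t_i) = f_actual(t_i-1)
        return diffs

    def dpcm_decode(diffs):
        prediction = 0
        samples = []
        for d in diffs:
            prediction += d
            samples.append(prediction)
        return samples

    print(dpcm_encode([9, 10, 7, 6]))   # [9, 1, -3, -1]
    print(dpcm_decode([9, 1, -3, -1]))  # [9, 10, 7, 6]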

Delta modulation is a special case of DPCM: same predictor function; the coding error is a single bit or digit that indicates whether the current sample should be increased or decreased by a step.

Not suitable for rapidly changing signals.

Adaptive pulse code modulation -- a fuller Markov model: the prediction is a function of a series of previous values, e.g. the average of the last n samples. The characteristics of the sample are better preserved.

Vector Quantisation

The basic outline of this approach is:

Data stream divided into (1D or 2D square) blocks -- vectors

A table or code book is used to find a pattern for each block.

Code book can be dynamically constructed or predefined.

Each block is encoded as a lookup value (index) into the table.

Compression is achieved as the data is effectively subsampled and coded at this level. (A toy sketch follows below.)
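A toy Python sketch of the approach (the codebook values, the 1D block length of 4, and the squared-error distance are our own illustrative choices):

    def quantise(block, codebook):
        """Return the index of the codebook vector closest to the block
        (squared-error distance)."""
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        return min(range(len(codebook)), key=lambda i: dist(block, codebook[i]))

    # Predefined codebook of 1D blocks (vectors) of length 4 -- toy values.
    codebook = [(0, 0, 0, 0), (100, 100, 100, 100), (0, 100, 0, 100)]
    data = [3, 1, 2, 0, 98, 99, 101, 97, 2, 97, 1, 103]
    blocks = [tuple(data[i:i + 4]) for i in range(0, len(data), 4)]
    indices = [quantise(b, codebook) for b in blocks]
    print(indices)   # [0, 1, 2] -- each block coded as one small table index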

Video Compression

We have studied the theory of encoding; now let us see how this is applied in practice.

We need to compress video (and audio) in practice since:

1. Uncompressed video (and audio) data are huge. In HDTV, the bit rate easily exceeds 1 Gbps -- big problems for storage and network communications. For example:

One of the formats defined for HDTV broadcasting within the United States is 1920 pixels horizontally by 1080 lines vertically, at 30 frames per second. If these numbers are all multiplied together, along with 8 bits for each of the three primary colors, the total data rate required would be approximately 1.5 Gb/sec. Because of the 6 MHz channel bandwidth allocated, each channel will only support a data rate of 19.2 Mb/sec, which is further reduced to 18 Mb/sec by the fact that the channel must also support audio, transport, and ancillary data information. As can be seen, this restriction in data rate means that the original signal must be compressed by a figure of approximately 83:1. This number seems all the more impressive when it is realized that the intent is to deliver very high quality video to the end user, with as few visible artifacts as possible.

2. Lossy methods have to be employed since the compression ratio of lossless methods (e.g., Huffman, Arithmetic, LZW) is not high enough for image and video compression, especially when the distribution of pixel values is relatively flat.

The following compression types are commonly used in Video compression:

Spatial Redundancy Removal - Intraframe coding (JPEG)

Spatial and Temporal Redundancy Removal - Intraframe and Interframe coding (H.261, MPEG)

These are discussed in the following sections.

Intra Frame Coding

The term intra frame coding refers to the fact that the various lossless and lossy compression techniques are performed relative to information that is contained only within the current frame, and not relative to any other frame in the video sequence. In other words, no temporal processing is performed outside of the current picture or frame. This mode will be described first because it is simpler, and because non-intra coding techniques are extensions to these basics. Figure 1 shows a block diagram of a basic video encoder for intra frames only. It turns out that this block diagram is very similar to that of a JPEG still image encoder, with only slight implementation detail differences.

The potential ramifications of this similarity will be discussed later. The basic processing blocks shown are the video filter, discrete cosine transform, DCT coefficient quantizer, and run-length amplitude/variable length coder. These blocks are described individually in the sections below or have already been described in JPEG Compression.

The basic Intra Frame Coding Scheme is as follows:

Macroblocks are 16x16 pixel areas on Y plane of original image.

A macroblock usually consists of 4 Y blocks, 1 Cr block, and 1 Cb block.

In the example HDTV data rate calculation shown previously, the pixels were represented as 8-bit values for each of the primary colors red, green, and blue. It turns out that while this may be good for high performance computer generated graphics, it is wasteful in most video compression applications. Research into the Human Visual System (HVS) has shown that the eye is most sensitive to changes in luminance, and less sensitive to variations in chrominance. Since absolute compression is the name of the game, it makes sense that MPEG should operate on a color space that can effectively take advantage of the eye's different sensitivity to luminance and chrominance information. As such, H.261 (and MPEG) uses the YCbCr color space to represent the data values instead of RGB, where Y is the luminance signal, Cb is the blue color difference signal, and Cr is the red color difference signal.

A macroblock can be represented in several different manners when referring to the YCbCr color space. Figure7.13 below shows 3 formats known as 4:4:4, 4:2:2, and 4:2:0 video. 4:4:4 is full bandwidth YCbCr video, and each macroblock consists of 4 Y blocks, 4 Cb blocks, and 4 Cr blocks. Being full bandwidth, this format contains as much information as the data would if it were in the RGB color space. 4:2:2 contains half as much chrominance information as 4:4:4, and 4:2:0 contains one quarter of the chrominance information. Although MPEG-2 has provisions to handle the higher chrominance formats for professional applications, most consumer level products will use the normal 4:2:0 mode.

Macroblock Video Formats

Because of the efficient manner of luminance and chrominance representation, the 4:2:0 representation allows an immediate data reduction from 12 blocks/macroblock to 6 blocks/macroblock, or 2:1 compared to full bandwidth representations such as 4:4:4 or RGB. To generate this format without generating color aliases or artifacts requires that the chrominance signals be filtered.
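A minimal Python sketch of the colour conversion and 4:2:0 chrominance subsampling (the conversion constants are the standard ITU-R BT.601 ones, not taken from these notes, and the simple 2x2 averaging stands in for the proper chrominance filtering mentioned above):

    def rgb_to_ycbcr(r, g, b):
        """ITU-R BT.601 RGB -> YCbCr conversion (rounded standard constants)."""
        y  =  0.299 * r + 0.587 * g + 0.114 * b
        cb = -0.169 * r - 0.331 * g + 0.500 * b + 128
        cr =  0.500 * r - 0.419 * g - 0.081 * b + 128
        return y, cb, cr

    def subsample_420(plane):
        """Average each 2x2 block: one chroma sample per four luma samples."""
        return [[(plane[i][j] + plane[i][j + 1] +
                  plane[i + 1][j] + plane[i + 1][j + 1]) / 4
                 for j in range(0, len(plane[0]), 2)]
                for i in range(0, len(plane), 2)]

    print(rgb_to_ycbcr(255, 0, 0))      # pure red -> low-ish Y, high Cr
    cb_plane = [[130, 132], [128, 134]]
    print(subsample_420(cb_plane))      # [[131.0]]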

The Macroblock is coded as follows:

Many macroblocks will be exact matches (or close enough). So send address of each block in image -> Addr

Sometimes no good match can be found, so send INTRA block -> Type

Will want to vary the quantization to fine tune compression, so send quantization value -> Quant

Motion vector -> vector

Some blocks in macroblock will match well, others match poorly. So send bitmask indicating which blocks are present (Coded Block Pattern, or CBP).

Send the blocks (4 Y, 1 Cr, 1 Cb) as in JPEG.

Quantization is by constant value for all DCT coefficients (i.e., no quantization table as in JPEG).

Inter-frame (P-frame) Coding

The previously discussed intra frame coding techniques were limited to processing the video signal on a spatial basis, relative only to information within the current video frame. Considerably more compression efficiency can be obtained however, if the inherent temporal, or time-based redundancies, are exploited as well. Anyone who has ever taken a reel of the old-style super-8 movie film and held it up to a light can certainly remember seeing that most consecutive frames within a sequence are very similar to the frames both before and after the frame of interest. Temporal processing to exploit this redundancy uses a technique known as block-based motion compensated prediction, using motion estimation. A block diagram of the basic encoder with extensions for non-intra frame coding techniques is given in Figure7.14. Of course, this encoder can also support intra frame coding as a subset.

P-Frame Coding

Starting with an intra, or I frame, the encoder can forward predict a future frame. This is commonly referred to as a P frame, and it may also be predicted from other P frames, although only in a forward time manner. As an example, consider a group of pictures that lasts for 6 frames. In this case, the frame ordering is given as I,P,P,P,P,P,I,P,P,P,P,P,...

Each P frame in this sequence is predicted from the frame immediately preceding it, whether it is an I frame or a P frame. As a reminder, I frames are coded spatially with no reference to any other frame in the sequence.

P-coding can be summarised as follows:

A Coding Example (P-frame)

Previous image is called reference image.

Image to code is called target image.

Actually, the difference is encoded.

Subtle points:

1. Need to use the decoded image as the reference image, not the original. Why?

2. We're using "Mean Absolute Difference" (MAD) to decide the best block. Can also use "Mean Squared Error" (MSE) = sum(E*E). (A block-matching sketch follows below.)
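A minimal Python sketch of exhaustive block matching with MAD (the frame contents, block size, and search range are our own toy values):

    def mad(block_a, block_b):
        """Mean Absolute Difference between two equal-sized blocks."""
        n = len(block_a) * len(block_a[0])
        return sum(abs(a - b) for ra, rb in zip(block_a, block_b)
                   for a, b in zip(ra, rb)) / n

    def best_match(target, reference, tx, ty, size, search):
        """Exhaustively search +/- search pixels around (tx, ty) in the
        reference frame; return the motion vector with the lowest MAD."""
        tb = [row[tx:tx + size] for row in target[ty:ty + size]]
        best = None
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                rx, ry = tx + dx, ty + dy
                if 0 <= rx <= len(reference[0]) - size and 0 <= ry <= len(reference) - size:
                    rb = [row[rx:rx + size] for row in reference[ry:ry + size]]
                    cost = mad(tb, rb)
                    if best is None or cost < best[0]:
                        best = (cost, dx, dy)
        return best   # (MAD, dx, dy)

    # Toy frames: the target is the reference with each row shifted left by 1,
    # so the block at (2, 2) should match at motion vector (dx, dy) = (1, 0).
    reference = [[(7 * x + 13 * y) % 31 for x in range(8)] for y in range(8)]
    target = [row[1:] + row[:1] for row in reference]
    print(best_match(target, reference, 2, 2, 2, 2))   # -> (0.0, 1, 0)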

MPEG Video

MPEG compression essentially attempts to overcome some shortcomings of H.261 and JPEG:

Recall H.261 dependencies:

The problem here is that many macroblocks need information that is not in the reference frame.

For example:

The MPEG solution is to add a third frame type, the bidirectional frame, or B-frame.

B-frames search for matching macroblocks in both past and future frames.

Typical pattern is IBBPBBPBB IBBPBBPBB IBBPBBPBB

Actual pattern is up to encoder, and need not be regular.

MPEG Video Layers

MPEG video is broken up into a hierarchy of layers to help with error handling, random search and editing, and synchronization, for example with an audio bitstream. From the top level, the first layer is known as the video sequence layer, and is any self-contained bitstream, for example a coded movie or advertisement. The second layer down is the group of pictures, which is composed of 1 or more groups of intra (I) frames and/or non-intra (P and/or B) pictures that will be defined later. Of course the third layer down is the picture layer itself, and the next layer beneath it is called the slice layer. Each slice is a contiguous sequence of raster ordered macroblocks, most often on a row basis in typical video applications, but not limited to this by the specification. Each slice consists of macroblocks, which are 16x16 arrays of luminance pixels, or picture data elements, with 2 8x8 arrays of associated chrominance pixels. The macroblocks can be further divided into distinct 8x8 blocks, for further processing such as transform coding. Each of these layers has its own unique 32 bit start code defined in the syntax to consist of 23 zero bits followed by a one, then followed by 8 bits for the actual start code. These start codes may have as many zero bits as desired preceding them.
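A minimal Python sketch that scans a byte-aligned bitstream for these start codes (the scanner is our own illustration; the three example code values -- sequence header 0xB3, group of pictures 0xB8, picture 0x00 -- are the standard MPEG video start code values):

    def find_start_codes(bitstream):
        """Scan for the byte-aligned pattern 00 00 01 xx: 23 zero bits,
        a one, then the 8-bit start code value xx."""
        found = []
        i = 0
        while i + 3 < len(bitstream):
            if bitstream[i] == 0 and bitstream[i + 1] == 0 and bitstream[i + 2] == 1:
                found.append((i, bitstream[i + 3]))   # (offset, start code value)
                i += 4
            else:
                i += 1
        return found

    # Toy stream: sequence header (0xB3), group of pictures (0xB8), picture (0x00).
    data = bytes([0, 0, 1, 0xB3, 5, 6, 0, 0, 1, 0xB8, 9, 0, 0, 1, 0x00])
    print(find_start_codes(data))   # [(0, 179), (6, 184), (11, 0)]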

The MPEG Video Bitstream

The MPEG Video Bitstream is summarised as follows:

The public domain tools mpeg_stat and mpeg_bits will analyze a bitstream.

Sequence Information

1. Video Params include width, height, aspect ratio of pixels, picture rate.

2. Bitstream Params are bit rate, buffer size, and constrained parameters flag (means the bitstream can be decoded by most hardware).

3. Two types of quantization tables (QTs): one for intra-coded blocks (I-frames) and one for inter-coded blocks (P-frames).

Group of Pictures (GOP) information

1. Time code: bit field with SMPTE time code (hours, minutes, seconds, frame).

2. GOP Params are bits describing the structure of the GOP. Is the GOP closed? Does it have a dangling pointer (broken link)?

Picture Information

1. Type: I, P, or B-frame?

2. Buffer Params indicate how full the decoder's buffer should be before starting decode.

3. Encode Params indicate whether half-pixel motion vectors are used.

Slice information

1. Vert Pos: what line does this slice start on?

2. QScale: how is the quantization table scaled in this slice?

Macroblock information

1. Addr Incr: number of MBs to skip.

2. Type: does this MB use a motion vector? What type?

3. QScale: how is the quantization table scaled in this MB?

4. Coded Block Pattern (CBP): bitmap indicating which blocks are coded.

Simple Audio Compression Methods

Traditional lossless compression methods (Huffman, LZW, etc.) usually don't work well on audio compression (the same reason as in image compression).

The following are some of the Lossy methods applied to audio compression:

Silence Compression - detect the "silence", similar to run-length coding

Adaptive Differential Pulse Code Modulation (ADPCM)

e.g., in CCITT G.721 - 16 or 32 Kbits/sec.

(a) encodes the difference between two consecutive signals;

(b) adapts the quantization so that fewer bits are used when the value is smaller.

It is necessary to predict where the waveform is headed -> difficult. (A toy sketch follows below.)
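A toy Python sketch in the spirit of ADPCM (the five-level quantizer and the step adaptation rule are our own simplifications; real G.721 is considerably more elaborate). The step size adapts using only the transmitted code, so the decoder stays in sync:

    def adpcm_encode(samples, step=2):
        """Quantise the prediction error to the five levels -2..2 and adapt
        the step size from information the decoder also has (the code)."""
        prediction, out = 0, []
        for s in samples:
            code = max(-2, min(2, round((s - prediction) / step)))
            out.append(code)
            prediction += code * step          # decoder reconstructs the same value
            if abs(code) == 2:
                step *= 2                      # error near clipping: grow the step
            elif code == 0:
                step = max(1, step // 2)       # signal nearly flat: shrink the step
        return out

    def adpcm_decode(codes, step=2):
        prediction, out = 0, []
        for code in codes:
            prediction += code * step
            out.append(prediction)
            if abs(code) == 2:
                step *= 2
            elif code == 0:
                step = max(1, step // 2)
        return out

    signal = [0, 3, 8, 14, 13, 10, 6, 2]
    codes = adpcm_encode(signal)
    print(codes)                 # [0, 2, 2, 2, 0, -1, -1, -1]
    print(adpcm_decode(codes))   # [0, 2, 6, 14, 14, 10, 6, 2] -- tracks the signal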

Apple has a proprietary scheme called ACE/MACE: a lossy scheme that tries to predict where the wave will go in the next sample. About 2:1 compression.

Linear Predictive Coding (LPC) fits the signal to a speech model and then transmits the parameters of the model. Sounds like a computer talking, 2.4 kbits/sec.

Code Excited Linear Predictor (CELP) does LPC, but also transmits the error term -- audio conferencing quality at 4.8 kbits/sec.