VIDEO AND AUDIO COMPRESSION
THE MPEGs STANDARDS
T. Alejandro Mendoza
Lesley Jacques Rigoberto Fernandez
Florida International University School of Computer Science
CIS 6931 Advanced Topics of Information Processing Dr. Shu-Ching Chen
Spring, 2001
Table of Contents

The MPEG-1 Standard
Introduction
MPEG Video Compression Techniques
Stream Structure
Video Stream Data Hierarchy
Group of Pictures (GOP)
Picture
Slice
Macroblock
Block
Audio Stream Data Hierarchy
Inter-Picture Coding
Picture Types
Intra Pictures
Predicted Pictures
Bidirectional Pictures
Video Stream Composition
Motion Compensation
Intra-picture (Transform) Coding
Synchronization
System Clock Reference
MPEG-1 Audio Standard
Analog-to-Digital
Audio Encoding
MPEG-1 Audio Encoding Diagram
Operating Mode
References
The MPEG-2 Standard
Introduction
MPEG Standards
The following MPEG standards exist
Overview of MPEG-2
MPEG-2 Video Compression
Video Stream Data Hierarchy
Picture Types
Profiles and Levels
Details of profiles
MPEG-2 Video profiles
Details of levels
Interlaced Video and Picture Structures
MPEG-2 standard features
MPEG-2 Advanced Audio Coding
The MPEG-2 audio standard in comparison to MPEG-1
Differences between MPEG-2 AAC and its predecessor ISO/MPEG Audio Layer-3
References
The MPEG-4 Standard
Brief Overview of MPEG Standards Progression and Naming
Overview of concept and need for MPEG-4
Features of MPEG-4
Object Based System
Transportation/Communications
Security Features
Low Bit-rate Applications
Image Mapping
Structured Audio and Sound Capability
Summary
References
The MPEG-1 Standard
Introduction
The major goal of video compression is to represent a video source with as few bits as possible
while preserving the level of quality required for the given video application. The bit-rate
reduction can only be possible by removing redundant information from the video during the
coding process and reinserting it during the decoding process. In video signals there is a large
amount of redundancy, which can be classified as statistical and psycho-visual redundancy:
any given pixel is likely to be similar to the pixels surrounding it, and any given frame is likely
to be similar to the frames surrounding it. The statistical redundancy
results from the fact that pixel values are correlated with their neighbors in spatial and temporal
directions. The psycho-visual redundancy is a consequence of the human visual system (HVS)
sensitivity. The human vision has a limited response to fine spatial or temporal detail. Bit-rate
reduction is possible by allowing distortions that should not be visible to human eyes. The MPEG
standard is a generic standard, independent of any particular application or delivery medium,
that is optimized for the quality and size of the compressed video. MPEG does not specify any
particular encoding techniques; it only defines the format for encoding video information (for
example, in B- and P-frames) and the algorithm for reconstructing the pixels during
decompression. The compressed stream should look
as good as possible when decompressed; the optimal goal is to be indistinguishable from the raw
video. On the other hand, the encoded stream should be as small as possible. The size of the
encoded stream is important if the multimedia application must fit within a certain fixed space, as
an example the limited space provided by a typical CD-ROM. The MPEG standard was
developed in response to the growing need for a common format for representing compressed
video on various digital storage media such as CDs, DATs, hard disks, and optical drives.
Applications using compressed video on digital storage media need to be able to perform a
number of operations in addition to normal forward playback of the video sequence.
MPEG Video Compression Techniques
MPEG-1 uses several algorithms and techniques to accomplish compression: subsampling of
the chrominance information to match the sensitivity of the human visual system (HVS),
quantization, motion compensation to exploit temporal redundancy, the discrete cosine
transform (DCT) to exploit spatial redundancy, variable-length coding (VLC), and picture
interpolation. These techniques are described in the sections that follow.
Stream Structure
In its most general form, an MPEG system stream is made up of two layers:
- The system layer, which contains timing and other information needed to demultiplex the
audio and video streams during playback
- The compression layer, which includes the audio and video streams
The decoding process involves extracting the timing information from the MPEG system stream
and sending it to the other system components. The system decoder also demultiplexes the
video and audio streams from the system stream then sends each to the appropriate decoder.
The video decoder decompresses the video stream. The audio decoder decompresses the audio
stream.
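As a toy illustration of this demultiplexing step, the sketch below routes packets into per-type queues by stream id. The packet representation (a list of (stream_id, payload) tuples) is invented for the example; only the id ranges loosely follow the MPEG-1 system-stream convention.

```python
def demultiplex(packets):
    """Split (stream_id, payload) packets into audio and video queues.

    The id ranges loosely follow the MPEG-1 system-stream convention:
    0xC0-0xDF are audio streams and 0xE0-0xEF are video streams;
    anything else (padding, reserved ids) is ignored here.
    """
    audio, video = [], []
    for stream_id, payload in packets:
        if 0xC0 <= stream_id <= 0xDF:
            audio.append(payload)
        elif 0xE0 <= stream_id <= 0xEF:
            video.append(payload)
    return audio, video

audio, video = demultiplex([(0xC0, b"aud0"), (0xE0, b"vid0"), (0xC0, b"aud1")])
```

In a real decoder the two queues would feed the audio and video decoders, while the timing information extracted alongside them drives the system clock.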
Video Stream Data Hierarchy
The MPEG standard defines a hierarchy of data structures in the video stream. A video
sequence begins with a sequence header (and may contain additional sequence headers),
includes one or more groups of pictures, and ends with an end-of-sequence code.
Group of Pictures (GOP)
A header and a series of one or more pictures intended to allow random access into the
sequence.
Picture
A picture is the primary coding unit of a video sequence. A picture consists of three rectangular
matrices representing luminance/brightness (Y) and two chrominance/color (Cb and Cr) values.
The Y matrix has an even number of rows and columns. The Cb and Cr matrices are one-half
the size of the Y matrix in each direction, horizontal and vertical. For every four luminance values,
there are two associated chrominance values: one Cb value and one Cr value.
Slice
A slice is one or more contiguous macroblocks. The order of the macroblock within a slice is from
left-to-right and from top-to-bottom. Slices are important in the handling of errors. If the bit-stream
contains an error, the decoder can skip to the start of the next slice. Having more slices in the bit-
stream allows better error concealment, but uses bits that could otherwise be used to improve
picture quality.
Macroblock
A macroblock is a 16-pixel by 16-line section of the luminance component and the
corresponding 8-pixel by 8-line sections of the two chrominance components. A macroblock
thus contains four Y blocks, one Cb block, and one Cr block.
Block
A block is an 8-pixel by 8-line set of values of a luminance or a chrominance component. Note
that a luminance block corresponds to one fourth as large a portion of the displayed image as
does a chrominance block.
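The block and macroblock relationships above can be checked with a little arithmetic. The sketch below (an illustrative helper, not part of any MPEG API) computes how many macroblocks and blocks a 4:2:0 picture contains, assuming the frame dimensions are multiples of 16.

```python
def macroblock_layout(width, height):
    """For a 4:2:0 picture whose dimensions are multiples of 16, return
    (macroblocks, luma_blocks, chroma_blocks): each 16x16 macroblock
    carries four 8x8 Y blocks plus one 8x8 Cb and one 8x8 Cr block."""
    macroblocks = (width // 16) * (height // 16)
    return macroblocks, 4 * macroblocks, 2 * macroblocks

# A 352x240 picture (a common MPEG-1 SIF size): 22 x 15 macroblocks.
layout = macroblock_layout(352, 240)   # (330, 1320, 660)
```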
Audio Stream Data Hierarchy
The MPEG standard defines a hierarchy of data structures that accept, decode and produce
digital audio output. The MPEG audio stream, like the MPEG video stream, consists of a series
of packets. Each audio packet contains an audio packet header and one or more audio frames.
In the audio stream structure each packet header contains the following information:
- Packet start code: Identifies the packet as being an audio packet
- Packet length: Indicates the number of bytes in the audio packet
An audio frame contains the following information:
- Audio frame header: Contains synchronization, ID, bit rate, and sampling frequency
information
- Error-checking code: Contains error-checking information
- Audio data: Contains information used to reconstruct the sampled audio data
- Ancillary data: Contains user-defined data
Inter-Picture Coding
Much of the information in a picture within a video sequence is similar to information in a previous
or subsequent picture. The MPEG standard takes advantage of this temporal redundancy by
representing some pictures in terms of their differences from other (reference) pictures, which is
known as inter-picture coding.
Picture Types
The MPEG standard specifically defines three types of pictures: intra (I), predicted (P), and
bidirectional (B).
Intra Pictures
Intra pictures, or I-pictures, are coded using only information present in the picture itself.
I-pictures provide potential random-access points into the compressed video data. I-pictures use
only transform coding and provide moderate compression, typically using about two bits per
coded pixel.
Predicted Pictures
Predicted pictures, or P-pictures, are coded with respect to the nearest previous I-picture or P-
picture. This technique is called forward prediction. Like I-pictures, P-pictures serve as a
prediction reference for B-pictures and future P-pictures. However, P-pictures use motion
compensation to provide more compression than is possible with I-pictures. Unlike I-pictures,
P-pictures can propagate coding errors, because P-pictures are predicted from previous
reference I-pictures or P-pictures.
Bidirectional Pictures
Bidirectional pictures, or B-pictures, are pictures that use both a past and future picture as a
reference. B-pictures provide the most compression and do not propagate errors because they
are never used as a reference. Bidirectional prediction also decreases the effect of noise by
averaging two pictures.
Video Stream Composition
The MPEG algorithm allows the encoder to choose the frequency and location of I-pictures. This
choice is based on the application’s need for random accessibility and the location of scene cuts
in the video sequence. In applications where random access is important, I-pictures are typically
used two times a second. The encoder also chooses the number of B-pictures between any pair
of reference I-pictures or P-pictures. This choice is based on factors such as the amount of
memory in the encoder and the characteristics of the material being coded. For example, a large
class of scenes has two bidirectional pictures separating successive reference pictures. The
MPEG encoder reorders pictures in the video stream to present the pictures in the most efficient
sequence. In particular, the reference pictures needed to reconstruct B-pictures are sent before
the associated B-pictures.
Motion Compensation
Motion compensation is a technique for enhancing the compression of P-pictures and B-pictures
by eliminating temporal redundancy. Motion compensation typically improves compression by
about a factor of three compared to intra-picture coding. The motion compensation algorithm
works at the macroblock level.
When a macroblock is compressed by motion compensation, the compressed file contains this
information:
- The spatial vector between the reference macroblock(s) and the macroblock being coded
(motion vectors).
- The content difference between the reference macroblock(s) and the macroblock being
coded (error terms)
Not all information in a picture can be predicted from a previous picture. Consider a scene in
which a door opens: The visual details of the room behind the door cannot be predicted from a
previous frame in which the door was closed. In such cases, when a macroblock in a P-picture
cannot be efficiently represented by motion compensation, it is coded in the same way as a
macroblock in an I-picture, using transform coding techniques.
Four coding types are therefore possible for each macroblock in a B-picture:
- Intra coding: No motion compensation
- Forward prediction: The previous reference picture is used as a reference
- Backward prediction: The next picture is used as a reference
- Bidirectional prediction: Two reference pictures are used, the previous reference picture
and the next reference picture
Backward prediction can be used to predict uncovered areas that do not appear in previous
pictures.
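A minimal sketch of how an encoder might find a motion vector is shown below. It performs an exhaustive sum-of-absolute-differences (SAD) search over a small window; real encoders use far faster search strategies, and the frame and block sizes here are toy values chosen only for illustration.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized pixel blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def best_motion_vector(ref, cur_block, cx, cy, search=2, size=4):
    """Exhaustive block-matching search: try every displacement within
    +/- search pixels of (cx, cy) in the reference frame and return the
    (dx, dy) displacement and SAD cost of the best-matching block."""
    best_mv, best_cost = None, None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = cx + dx, cy + dy
            if not (0 <= x and x + size <= len(ref[0]) and
                    0 <= y and y + size <= len(ref)):
                continue  # candidate block falls outside the frame
            candidate = [row[x:x + size] for row in ref[y:y + size]]
            cost = sad(candidate, cur_block)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost

# Toy frames: the current block at (2, 2) matches reference content at (3, 3).
ref = [[8 * y + x for x in range(8)] for y in range(8)]
cur = [row[3:7] for row in ref[3:7]]
mv, cost = best_motion_vector(ref, cur, 2, 2)
```

With a perfect match, the residual (error terms) is zero and only the motion vector needs to be transmitted; otherwise the residual block is transform-coded as described below.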
Intra-picture (Transform) Coding
The MPEG transform-coding algorithm includes these steps:
- Discrete cosine transform (DCT)
- Quantization
- Run-length encoding
Both image blocks and prediction-error blocks have high spatial redundancy. To reduce this
redundancy, the MPEG algorithm transforms 8 X 8 blocks of pixels or 8 X 8 blocks of error terms
from the spatial domain to the frequency domain with the Discrete Cosine Transform (DCT).
Next, the algorithm quantizes the frequency coefficients. Quantization is the process of
approximating each frequency coefficient as one of a limited number of allowed values. The
encoder chooses a quantization matrix that determines how each frequency coefficient in the 8 X
8 block is quantized. Human perception of quantization error is lower for high spatial frequencies,
so high frequencies are quantized more coarsely, with fewer allowed values, than low frequencies.
The combination of DCT and quantization results in many of the frequency coefficients being
zero, especially the coefficients for high spatial frequencies. To take maximum advantage of this,
the coefficients are organized in a zigzag order to produce long runs of zeroes. The coefficients
are then converted to a series of run-amplitude pairs, each pair indicating a number of zero
coefficients and the amplitude of a non-zero coefficient. These run-amplitude pairs are then
coded with a variable-length code, which uses shorter codes for commonly occurring pairs and
longer codes for less common pairs. Some blocks of pixels need to be coded more accurately
than others. For example, blocks with smooth intensity gradients need accurate coding to avoid
visible block boundaries. To deal with this inequality between blocks, the MPEG algorithm allows
the amount of quantization to be modified for each macroblock of pixels. This mechanism can
also be used to provide smooth adaptation to a particular bit rate.
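The zigzag scan and run-amplitude conversion described above can be sketched as follows. This is an illustrative implementation only, not the normative MPEG procedure (which, among other things, treats DC and AC coefficients differently).

```python
def zigzag_order(n=8):
    """(row, col) visiting order of the classic zigzag scan: walk the
    anti-diagonals of the block, alternating direction on each one."""
    def key(rc):
        r, c = rc
        s = r + c
        # odd diagonals run top-right to bottom-left, even ones the reverse
        return (s, r if s % 2 else -r)
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)

def run_amplitude_pairs(block):
    """Scan a square coefficient block in zigzag order and emit one
    (zero_run, amplitude) pair per non-zero coefficient."""
    run, pairs = 0, []
    for r, c in zigzag_order(len(block)):
        value = block[r][c]
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return pairs

# A quantized block that is mostly zeros, as is typical after DCT + quantization:
coeffs = [[0] * 8 for _ in range(8)]
coeffs[0][0], coeffs[0][1], coeffs[2][0] = 12, 5, -3
pairs = run_amplitude_pairs(coeffs)   # [(0, 12), (0, 5), (1, -3)]
```

The resulting pairs are what the variable-length coder then maps to short codes for common pairs and longer codes for rare ones.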
Synchronization
The MPEG standard provides a timing mechanism that ensures synchronization of audio and
video. The standard includes two parameters: the system clock reference (SCR) and the
presentation timestamp (PTS). The MPEG-specified “system clock” runs at 90 kHz. System
clock reference and presentation timestamp values are coded in the MPEG bitstream using 33
bits, which can represent any clock value over a period of more than 24 hours.
System Clock Reference
A System Clock Reference (SCR) is a snapshot of the encoder system clock that is placed into
the system layer of the bitstream. During decoding, these values are used to update the system
clock counter. Presentation timestamps are samples of the encoder system clock that are
associated with video or audio presentation units. A presentation unit is a decoded video picture
or a decoded audio time sequence. The presentation timestamp (PTS) represents the time
at which the video picture is to be displayed or the starting playback time for the audio time
sequence. The decoder either skips or repeats picture display to ensure that the PTS is within
one picture's worth of 90 kHz clock ticks of the system clock reference (SCR) when a picture is
displayed. If the PTS is earlier (has a smaller value) than the current SCR, the decoder discards
the picture. If the PTS is later (has a larger value) than the current SCR, the decoder repeats the
display of the picture.
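The skip/repeat rule can be sketched as a small decision function. The 3600-tick picture period corresponds to an assumed 25 pictures per second at the 90 kHz clock; both the function and its exact threshold handling are illustrative, not normative.

```python
def presentation_action(pts, scr, picture_period=3600):
    """Decide what the decoder does with a decoded picture.

    Times are in 90 kHz ticks; 3600 ticks is one picture period at an
    assumed 25 pictures/s. A picture whose PTS lags the SCR by more
    than one period is skipped; one whose PTS leads by more than one
    period causes the previous picture to be repeated instead.
    """
    if pts < scr - picture_period:
        return "skip"
    if pts > scr + picture_period:
        return "repeat"
    return "display"
```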
MPEG-1 Audio Standard
The basic task of the MPEG audio standard is to compress the digital audio data so that the
compressed file is as small as possible and the reconstructed (decoded) audio sounds identical, or
as close as possible, to the original audio before compression. Other requirements of the MPEG
audio standard include low complexity (to enable software decoders or inexpensive hardware
decoders with low power consumption) and flexibility for different application scenarios.
MPEG-1 audio standardizes three different coding schemes for digitized sound waves called
Layers I, II, and III. It does not standardize the encoder, but rather the type of information that an
encoder has to produce and write to an MPEG-1-conformant bit stream, as well as the way in
which the decoder has to parse, decompress, and re-synthesize this information in order to
regain the encoded sound. Layer I is the simplest; it uses only the basic psycho-acoustic model
(psycho-acoustics deals with the way the human brain perceives sound). Layer II adds more
advanced bit-allocation techniques and greater accuracy. Layer III adds a hybrid filterbank and
non-uniform quantization, plus advanced features such as Huffman coding, 18 times higher
frequency resolution, and a bit-reservoir technique. Layers I, II, and III give increasing
quality/compression ratios with increasing complexity and demands on processing power. The
MPEG audio specification requires that a valid Layer III decoder be able to decode any Layer I,
II, or III MPEG audio stream, and that a Layer II decoder be able to decode Layer I and Layer II
streams.
Analog-to-Digital
Sound is stored in a computer by converting analog waves to digital samples. Sound consists of
pressure differences in air; when picked up by a microphone and fed through an amplifier, these
become voltage levels. The voltage is sampled by the computer a number of times per second.
For CD-audio quality you need to sample 44,100 times per second, with each sample having a
resolution of 16 bits. In stereo this gives 1.4 Mbit per second, which is why compression is
needed. To compress audio, MPEG tries to remove the irrelevant and redundant parts of the
signal: parts of the sound we do not hear can be thrown away. To do this, MPEG audio uses
psycho-acoustic principles.
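The CD-audio figures quoted above follow from simple arithmetic:

```python
# CD audio: 44,100 samples/s x 16 bits/sample x 2 channels.
sample_rate = 44_100      # samples per second
bits_per_sample = 16
channels = 2

bit_rate = sample_rate * bits_per_sample * channels
print(bit_rate)           # 1411200 bits/s, i.e. about 1.4 Mbit/s

# One minute of uncompressed stereo CD audio:
bytes_per_minute = bit_rate * 60 // 8
print(bytes_per_minute)   # 10584000 bytes, roughly 10 MB per minute
```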
Audio Encoding
The primary psycho-acoustic effect that the MPEG-1 audio coder exploits is called “auditory
masking”, where parts of a signal are not audible due to the function of the human auditory
system. For example, if a sound consists mainly of one frequency, other sounds at nearby
frequencies that are much quieter will not be heard. The parts of the signal that are masked are
commonly called “irrelevant”, as opposed to parts of the signal that can be removed by a
lossless coding operation, which are termed “redundant”. In order to remove this irrelevancy, the
encoder contains a psycho-acoustic model. This psycho-acoustic model analyzes the input
signals within consecutive time blocks and determines for each block the spectral components of
the input audio signal by applying a frequency transform. Then it models the masking properties
of the human auditory system, and estimates the just noticeable noise level for each frequency
band, sometimes called the threshold of masking. In parallel, the input signal is fed through a
time-to-frequency mapping, resulting in spectrum components for subsequent coding. In its
quantisation and coding stage, the encoder tries to allocate the available number of data bits in a
way that meets both the bit-rate and masking requirements taking into account the calculated
threshold of masking. The information on how the bits are distributed over the spectrum is
contained in the bit-stream as side information. The decoder is much less complex because it
does not require a psycho-acoustic model or bit-allocation procedure. Its only task is to
reconstruct an audio signal from the encoded spectral components and associated side
information.
MPEG-1 Audio Encoding Diagram
The audio encoding diagram consists of the following blocks:
- Frequency Transform:
A filter bank is used to decompose the input signal into subsampled spectral components
(time/frequency domain). Together with the corresponding filter bank in the decoder, it
forms an analysis/synthesis system.
- Psycho-acoustic model:
Using the time-domain input signal and/or the output of the analysis filter bank, an
estimate of the actual (time and frequency dependent) masking threshold is computed
using rules known from psycho-acoustics.
- Quantization and coding:
The spectral components are quantized and coded with the aim of keeping the noise,
which is introduced by quantizing, below the masking threshold. Depending on the
algorithm, this step is done in very different ways, from simple block companding to
analysis-by-synthesis systems using additional noiseless compression.
- Encoding of bitstream:
A bitstream formatter is used to assemble the bitstream, which typically consists of the
quantized and coded spectral coefficients and some side information such as bit
allocation information.
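The quantization-and-coding step above can be illustrated with a toy bit-allocation loop: repeatedly give the next bit to the band whose quantization noise is furthest above its masking threshold, using the rule of thumb that each extra bit lowers noise by about 6 dB. This greedy sketch is purely illustrative and does not correspond to the normative procedure of any MPEG layer.

```python
def allocate_bits(signal_db, mask_db, total_bits, step_db=6.0):
    """Greedily hand out total_bits across bands: each bit lowers a
    band's quantization noise by ~step_db dB, and every bit goes to the
    band whose noise currently exceeds its mask by the largest margin."""
    bits = [0] * len(signal_db)
    for _ in range(total_bits):
        nmr = [signal_db[i] - step_db * bits[i] - mask_db[i]
               for i in range(len(bits))]          # noise-to-mask ratios
        worst = max(range(len(bits)), key=lambda i: nmr[i])
        bits[worst] += 1
    return bits

# Two bands: a loud, poorly masked band and a quieter, well-masked one.
allocation = allocate_bits([60, 40], [20, 30], total_bits=8)
```

The loud band with the weak mask absorbs most of the bits, mirroring how the real encoder spends bits where quantization noise would be audible.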
In order to be applicable to a number of very different application scenarios, MPEG defined a
data representation that includes a number of options.
Operating Mode
MPEG-1 audio works for both mono and stereo signals. A technique called joint stereo coding
can be used to do more efficient combined coding of the left and right channels of a stereophonic
audio signal. Layer III allows both mid/side stereo coding and, for lower bit-rates, intensity stereo
coding. Intensity stereo coding allows for lower bit-rates but brings the danger of changing the
sound image. The operating modes are:
- Single channel
- Dual channel (two independent channels, for example containing different language
versions of the audio)
- Stereo (no joint stereo coding) and Joint stereo
References
• D. Yen Pan, Digital Audio Compression, Digital Technical Journal, Vol. 5, No. 2, 1997
• L. Chiariglione, The development of an integrated audiovisual coding standard: MPEG,
Proceedings of the IEEE, Vol. 83, No. 2, February 1995.
• J. L. Mitchell, et al., MPEG Video: Compression Standard, Chapman and Hall, New York,
NY, 1997
• T. Sikora, "MPEG Digital Video--Coding Standards," IEEE Signal Processing Magazine,
Vol. 14, No. 5, September 1997, pp. 82--100.
• D. Le Gall. MPEG: A video compression standard for multimedia applications.
Communications of the ACM, 34(4): 46 - 58, April 1991.
• H.C. Liu and G.L. Zick. Scene decomposition of MPEG compressed video. In Proc. of the
SPIE, volume 2419, pages 26 - 37, 1995.
• D. Pan. A tutorial on MPEG audio compression. IEEE Multimedia Journal, Summer 1995.
• Peter Noll, "MPEG digital audio coding," IEEE Signal Processing Magazine, vol. 14, no.
5, pages 59 - 81, September 1997.
• K. Brandenburg and Marina Bosi. Overview of MPEG audio: Current and future
standards for low bit-rate audio coding. J. Audio Eng. Soc., January 1997
The MPEG-2 Standard
Introduction
Advances in network technology offer new opportunities for multimedia communications systems
that were not possible years ago, with video compression being the core technology. In fact,
compression is part of almost any video or audio storage or transmission process. Uncompressed
video is very large: its data rate is too high for user-level applications and burdens both the CPU
and the communication channel. The ultimate goal of
video source coding is bit-rate reduction. The performance of video compression techniques
depends on the amount of redundancy contained in the image data. Compression consists of
removing redundancies and encoding the true information signal in a form appropriate to
application requirements. Depending on the application's requirements, two different methods
can be used: lossless and lossy coding of the video data. The aim of lossless coding is to
reduce image or video data for storage and transmission while retaining the quality of the original
images; the decoded image quality is required to be identical to the image quality prior to
encoding. In contrast, the aim of lossy coding techniques is to meet a given target bit rate
for storage and transmission; it does not guarantee that the data received is exactly the same
as the data sent. The efficient utilization of compression systems will be an essential element of
delivered quality.
MPEG Standards
MPEG (Moving Pictures Experts Group) is a group of experts that meets under ISO (the
International Organization for Standardization) to develop standards for digital video (sequences of
images in time) and audio compression. In particular, they define a compressed bit stream, which
implicitly defines a decompressor. However, the compression methods are up to the individual
manufacturers, and that is where proprietary advantage is obtained within the scope of a publicly
available international standard.
The following MPEG standards exist:
MPEG-1, a standard for storage and retrieval of moving pictures and audio on storage media.
MPEG-2, a standard for digital television.
MPEG-4, a standard for multimedia applications.
MPEG-7, a content representation standard for information searches.
Overview of MPEG-2
MPEG-2 was finalized in 1994 and resulted in the International Standard ISO/IEC 13818, which
was published in 1995. MPEG-2 AAC was finalized in 1997 and published as the International
Standard ISO/IEC 13818-7. MPEG-2 has been very successful in defining a specification to serve
a range of applications, bit rates, qualities and services. The picture quality through an MPEG-2
codec depends on the complexity and predictability of the source pictures. Real-time coders and
decoders have demonstrated generally good quality standard-definition pictures at bit rates
around 6 Mbit/s. The MPEG-2 concept is similar to MPEG-1, but includes extensions to cover a
wider range of applications. The primary application targeted during the MPEG-2 definition
process was the all-digital transmission of broadcast TV quality video at coded bit rates between
4 and 9 Mbit/sec. However, the MPEG-2 syntax has been found to be efficient for other
applications such as those at higher bit rates and sample rates (e.g. HDTV). The most significant
enhancement over MPEG-1 is the addition of syntax for efficient coding of interlaced video (e.g.
16x8 block size motion compensation). Several other, subtler enhancements are included that
noticeably improve coding efficiency, even for progressive video. Other key features of
MPEG-2 are the scalable extensions which permit the division of a continuous video signal into
two or more coded bit streams representing the video at different resolutions, picture quality, or
picture rates.
MPEG-2 Video Compression
The MPEG-2 video compression algorithm achieves very high rates of compression by exploiting
the redundancy in video information. MPEG-2 removes both the temporal redundancy and spatial
redundancy, which are present in motion video.
Temporal redundancy arises when successive frames of video display images of the same
scene. It is common for the content of the scene to remain fixed or to change only slightly
between successive frames.
Spatial redundancy occurs because picture elements (pels) are often replicated (with
minor changes) within a single frame of video.
Some parts of a video frame may have low spatial redundancy (e.g. complex picture content),
while other parts may have low temporal redundancy (e.g. fast moving sequences). The
compressed video stream is therefore naturally of variable bit rate, whereas transmission links
frequently require fixed transmission rates. The key to controlling the transmission rate is to order
the compressed data in a buffer in order of decreasing detail. Compression may be performed by
selectively discarding some of the information. A minimal impact on overall picture quality can be
achieved by throwing away the most detailed information, while preserving the less detailed
picture content. This will ensure the overall bit rate is limited while suffering minimal impairment of
picture quality.
Video Stream Data Hierarchy
The video bitstream is organized in the following layers:
group of pictures (GOP), picture, slice, macroblock, and block.
A video sequence begins with a sequence header (and may contain additional sequence
headers), includes one or more groups of pictures, and ends with an end-of-sequence code.
Group of Pictures: A header and a series of one or more pictures intended to allow random
access into the sequence. In an MPEG signal the GOP is a group of pictures or frames between
successive I-frames, the others being P and/or B-frames.
Picture: It is the primary coding unit of a video sequence. A picture consists of three rectangular
matrices representing luminance (Y) and two chrominance (Cb and Cr) values. The Y matrix has
an even number of rows and columns. The Cb and Cr matrices are one half the size of the Y
matrix in each direction (horizontal and vertical).
Slice: One or more contiguous macroblocks. The order of the macroblocks within a slice is from
left-to-right and top-to-bottom. Slices are important in the handling of errors. If the bitstream
contains an error, the decoder can skip to the start of the next slice. Having more slices in the
bitstream allows better error concealment, but uses bits that could otherwise be used to improve
picture quality.
Macroblock: A group of picture blocks, usually four (16 x 16 pixels overall), which are analyzed
during MPEG coding to give an estimate of the movement of particular elements of the picture
between frames. This generates the motion vectors, which are then used to place the
macroblocks in decoded pictures. Since each chrominance component has one-half the vertical
and horizontal resolution of the luminance component, a macroblock consists of four Y, one Cr,
and one Cb block.
Block: A rectangular area of the picture, the smallest coding unit in the MPEG algorithm. It is
usually 8 x 8 pixels in size and is individually subjected to DCT coding as part of a digital
picture compression process. It can be one of these three types: luminance (Y), red chrominance
(Cr), or blue chrominance (Cb).
Picture Types
The MPEG standard specifically defines three types of frames. The picture type defines which
prediction modes may be used to code each block. These three types of pictures are defined as
follows:
Intra Pictures (I-Frames)
Predicted Pictures (P-Frames)
Bidirectional Predicted Pictures (B- Frames)
Intra Pictures, or I-Frames, are coded without reference to other pictures, using only information
present in the picture itself. Moderate compression is achieved by reducing spatial redundancy,
but not temporal redundancy. They can be used periodically to provide access points in the
bitstream where decoding can begin.
Predicted Pictures, or P-Frames, can use the previous I-frame or P-frame for motion
compensation and may be used as a reference for further prediction. P-Frames contain only
predictive information (not a whole picture) generated by looking at the difference between the
present frame and the previous one. Each block in a P-frame can either be predicted or intra-
coded. By reducing spatial and temporal redundancy, P-frames offer increased compression
compared to I-frames. A P-frame can be decoded only if its preceding reference frame has been
located and decoded.
Bidirectional Predicted Pictures, or B-Frames, can use the previous and next I-frames or P-
frames, for motion-compensation, and offer the highest degree of compression. Each block in a
B-frame can be forward, backward or bi-directionally predicted or intra-coded. To enable
backward prediction from a future frame, the coder reorders the pictures from natural display
order to bitstream order so that the B-frame is transmitted after the previous and next frames it
references. This introduces a reordering delay that grows with the number of consecutive
B-frames, and B-frames are also the most computationally expensive to code.
The different picture types typically occur in a repeating sequence, termed a Group of Pictures
(GOP). A typical GOP in display order is:
B1 B2 I3 B4 B5 P6 B7 B8 P9 B10 B11 P12
The corresponding bitstream order is:
I3 B1 B2 P6 B4 B5 P9 B7 B8 P12 B10 B11
A regular GOP structure can be described with two parameters: N, which is the number of
pictures in the GOP, and M, which is the spacing of P-frames. The GOP given here is described
as N=12 and M=3. MPEG-2 does not insist on a regular GOP structure. For example, a P-frame
following a shot-change may be badly predicted since the reference picture for prediction is
completely different from the picture being predicted. Thus, it may be beneficial to code it as an I-
picture instead.
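The display-to-bitstream reordering described above can be sketched in a few lines: each B-frame is held back until the reference frame that follows it in display order has been emitted. A minimal illustration, using the frame names from the example:

```python
def bitstream_order(display):
    """Reorder a display-order GOP into bitstream (transmission) order.

    Each reference frame (I or P) is moved ahead of the B-frames
    that depend on it. `display` is a list like ["B1", "B2", "I3", ...].
    """
    out, pending_b = [], []
    for frame in display:
        if frame[0] in ("I", "P"):      # reference frame
            out.append(frame)           # transmit it first...
            out.extend(pending_b)       # ...then the B-frames it enables
            pending_b = []
        else:                           # B-frame: wait for next reference
            pending_b.append(frame)
    out.extend(pending_b)               # trailing B-frames, if any
    return out

display = ["B1", "B2", "I3", "B4", "B5", "P6",
           "B7", "B8", "P9", "B10", "B11", "P12"]
print(bitstream_order(display))
# -> ['I3', 'B1', 'B2', 'P6', 'B4', 'B5', 'P9', 'B7', 'B8', 'P12', 'B10', 'B11']
```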
For a given decoded picture quality, each picture type produces a different number of bits. In a
typical sequence, a coded I-frame is about three times larger than a coded P-frame, which is
itself about 50% larger than a coded B-frame.
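These ratios can be turned into a rough bit-budget estimate. A minimal sketch, taking the example ratios above as given and applying them to the N=12, M=3 GOP described earlier (one I-frame, three P-frames, eight B-frames):

```python
# Hypothetical relative sizes from the text: I = 3 x P, P = 1.5 x B.
b = 1.0           # take a coded B-frame as the unit
p = 1.5 * b
i = 3.0 * p       # = 4.5 B-frame units

# For the N=12, M=3 GOP: 1 I-frame, 3 P-frames, 8 B-frames.
gop_bits = 1 * i + 3 * p + 8 * b     # = 4.5 + 4.5 + 8 = 17 units
avg = gop_bits / 12                  # average units per picture
print(gop_bits, round(avg, 2))       # 17.0 1.42
```

So under these illustrative ratios, the average picture costs about 1.4 B-frame units, well below the cost of an all-I sequence (4.5 units per picture).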
Profile and Levels
MPEG-2 video is an extension of MPEG-1 video. MPEG-1 was targeted at coding progressively
scanned video at bit rates up to about 1.5 Mbit/s. MPEG-2 provides extra algorithmic tools for
efficiently coding interlaced video and supports a wide range of bit rates. It also provides tools for
scalable coding where useful video can be reconstructed from pieces of the total bitstream. The
total bitstream may be structured in layers, starting with a base and adding refinement layers to
reduce quantization distortion or improve resolution.
A small number of subsets of the complete MPEG-2 tool kit have been defined, known as profiles
and levels. A profile is a subset of algorithmic tools and a level identifies a set of constraints on
parameter values (such as picture size or bit rate). The profiles and levels defined to date fit
together such that a higher profile or level is a superset of a lower one. A decoder that supports
a particular profile and level is only required to support the corresponding subset of algorithmic
tools and set of parameter constraints.
Details of profiles
MPEG-2 builds on the powerful video compression capabilities of the MPEG-1 standard to offer a
wide range of coding tools. These have been grouped in profiles to offer different functionalities.
Only the combinations marked with an "X" are recognized by the standard.
MPEG-2 Video profiles

Level       Simple   Main   SNR scalable   Spatial scalable   High   Multiview   4:2:2
High                 X                                        X
High-1440            X                     X                  X
Main        X        X      X                                 X      X           X
Low                  X      X
· Simple profile uses no B-frames, and hence no backward or interpolated prediction;
consequently, no picture reordering is required, which makes this profile suitable for low-delay
applications such as video conferencing.
· Main profile adds support for B-pictures, which improves the picture quality for a given bit-rate
but increases delay. Currently, most MPEG-2 video decoder chip-sets support main profile.
· SNR profile adds support for enhancement layers of DCT coefficient refinement using Signal-to-
Noise Ratio (SNR) scalability.
· Spatial profile adds support for enhancement layers carrying the image at different resolutions
using the spatial scalability tool.
· High profile adds support for 4:2:2 sampled video.
· The Multiview profile (MVP) is an additional profile that uses existing MPEG-2 Video coding
tools to efficiently encode two video sequences from two cameras shooting the same scene with
a small angle between them.
Since the final approval of MPEG-2 Video in November 1994, one additional profile has been
developed. It uses existing MPEG-2 Video coding tools but is capable of dealing with pictures
having a colour resolution of 4:2:2 and a higher bit rate. Even though MPEG-2 Video was not
developed with studio applications in mind, a set of comparison tests carried out by MPEG
confirmed that MPEG-2 Video was at least as good as, and in many cases better than, standards
or specifications developed for high-bitrate or studio applications. The 4:2:2 profile was approved
in January 1996 and is now an integral part of MPEG-2 Video.
Details of levels
MPEG-2 defines four levels of coding parameter constraints. This table shows the constraints on
picture size, frame rate, bit rate and buffer size for each of the defined levels. MPEG-2 levels:
Picture size, frame-rate and bit rate constraints.
Level       Max. width   Max. height   Max. frame   Max. bit rate   Buffer size
            (pixels)     (lines)       rate (Hz)    (Mbit/s)        (bits)
Low         352          288           30           4               475136
Main        720          576           30           15              1835008
High-1440   1440         1152          60           60              7340032
High        1920         1152          60           80              9781248
In broadcasting terms, standard-definition TV requires main level and high-definition TV requires
high-1440 level. The bit rate required to achieve a particular level of picture quality approximately
scales with resolution.
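As an illustration of how a decoder or encoder might apply these constraints, the following sketch picks the lowest level whose limits a stream satisfies. It checks only the tabulated picture-size, frame-rate and bit-rate maxima; a real conformance check would also verify buffer occupancy and profile-specific rules:

```python
# Level constraints from the table above: (max width in pixels, max height
# in lines, max frame rate in Hz, max bit rate in Mbit/s, buffer size in bits).
LEVELS = {
    "Low":       (352,  288,  30, 4,  475136),
    "Main":      (720,  576,  30, 15, 1835008),
    "High-1440": (1440, 1152, 60, 60, 7340032),
    "High":      (1920, 1152, 60, 80, 9781248),
}

def minimum_level(width, height, fps, mbit_s):
    """Return the lowest level whose constraints the stream satisfies."""
    for name in ("Low", "Main", "High-1440", "High"):
        w, h, f, r, _ = LEVELS[name]
        if width <= w and height <= h and fps <= f and mbit_s <= r:
            return name
    return None   # exceeds even High level

print(minimum_level(720, 576, 25, 6))     # standard-definition PAL -> Main
print(minimum_level(1920, 1080, 30, 20))  # HD picture -> High
```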
Interlaced Video and Picture Structures
MPEG-2 supports two scanning methods: progressive scanning and interlaced scanning.
Interlaced scanning scans the odd lines of a frame as one field (the odd field) and the even lines
as another field (the even field). Progressive scanning scans consecutive lines in sequential
order.
An interlaced video sequence uses one of two picture structures: frame structure and field
structure. In the frame structure, the lines of the two fields alternate and the two fields are coded
together as a frame; one picture header is used for both fields. In the field structure, the two
fields of a frame may be coded independently of each other, and the even field follows the odd
field. Each of the two fields has its own picture header.
An interlaced video sequence can switch between frame structure and field structure on a
picture-by-picture basis. In contrast, every picture in a progressive video sequence is a frame
picture.
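The relationship between a frame and its two fields is simple line interleaving, which can be illustrated directly:

```python
# A frame-structure picture interleaves the two fields line by line; a
# field-structure picture codes them separately. Splitting a frame into
# its two fields is just taking alternate lines (toy 8-line frame):
frame = [f"line{n}" for n in range(8)]

odd_field  = frame[0::2]   # lines 0, 2, 4, 6  (first/odd field)
even_field = frame[1::2]   # lines 1, 3, 5, 7  (second/even field)

# Re-interleaving the two fields recovers the original frame:
rebuilt = [line for pair in zip(odd_field, even_field) for line in pair]
assert rebuilt == frame
print(odd_field)   # ['line0', 'line2', 'line4', 'line6']
```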
MPEG-2 standard features
Because the MPEG-2 standard provides good compression using standard algorithms, it has
become the standard for digital TV. It has the following features:
Video compression which is backwards compatible with MPEG-1
Full-screen interlaced and/or progressive video (for TV and Computer displays)
Enhanced audio coding (high quality, mono, stereo, and other audio features)
Transport multiplexing (combining different MPEG streams in a single transmission stream)
Other services (GUI, interaction, encryption, data transmission, etc)
The list of systems which now (or will soon) use MPEG-2 is extensive and continuously growing:
digital TV (cable, satellite and terrestrial broadcast), Video on Demand, Digital Versatile Disc
(DVD), personal computing, card payment, test and measurement, etc.
MPEG-2 Audio
The major advantages of data compression are the reduced memory and transmission bandwidth
required by the compressed signal. Compression is useful whenever resources are scarce or
expensive; digital broadcasting and audio transmission over the Internet are two major application
domains of audio coding, and it is even gaining ground in modern cinema sound systems.
International cooperation between the Fraunhofer Institute and companies such as AT&T, Sony
and Dolby produced the most efficient MPEG method for audio data compression to date. It is
called MPEG-2 Advanced Audio Coding (AAC) and was declared an international standard by
MPEG at the end of April 1997.
MPEG-2 Advanced Audio Coding
Advanced Audio Coding (AAC) is one of the formats defined by the MPEG-2 standard. It provides
low-bitrate coding for multichannel audio, generally five full-bandwidth channels (left, right,
center and two surround channels) as used in cinemas today. Like all perceptual coding
schemes, MPEG-2 AAC exploits the signal-masking properties of the human ear to reduce the
amount of data: quantization noise is distributed across frequency bands in such a way that it is
masked by the total signal. The MPEG-2 Audio Standard also extends the stereo and mono
coding of the MPEG-1 Audio Standard (ISO/IEC IS 11172-3) to half sampling rates (16 kHz,
22.05 kHz and 24 kHz) for improved quality at bit rates at or below 64 kbit/s per channel.
MPEG-2 audio standardization in comparison to MPEG-1
The first phase, MPEG-1, dealt with mono and two-channel stereo sound coding at the sampling
frequencies commonly used for high-quality audio (48, 44.1 and 32 kHz).
The second phase, MPEG-2, contains different work items.
· The extension of MPEG-1 to lower sampling frequencies (16 kHz, 22.05 kHz and 24 kHz),
providing better sound quality at very low bit rates (below 64 kbit/s for a mono channel). This
extension is easily added to an MPEG-1 audio decoder because it mainly implies the inclusion
of some additional tables.
· The backward-compatible extension of MPEG-1 to multichannel sound. MPEG-2 BC supports
up to five full-bandwidth channels plus one low-frequency enhancement channel. This
multichannel extension is both forward and backward compatible with MPEG-1: an MPEG-2 BC
stream adheres to the structure of an MPEG-1 bitstream, so it can be read and interpreted by an
MPEG-1 audio decoder.
· A new coding scheme called Advanced Audio Coding. An AAC bitstream is not backward
compatible, i.e. it cannot be read and interpreted by an MPEG-1 audio decoder.
Both MPEG-1 and the first two work items of MPEG-2 share the three-layer structure. The
original MPEG-2 audio standard contained only the first two work items and was finalized in
1994. In order to improve coding efficiency for the 5-channel case, a non-backward-compatible
audio coding scheme (AAC) was defined and finalized in 1997.
Differences between MPEG-2 AAC and its predecessor ISO/MPEG Audio
Layer-3
Filter bank: in contrast to the hybrid filter bank of ISO/MPEG Audio Layer-3, chosen for reasons
of compatibility but displaying certain structural weaknesses, MPEG-2 AAC uses a plain Modified
Discrete Cosine Transform (MDCT). Together with the increased window length (2048 samples
instead of 1152 per transform), the MDCT outperforms the filter banks of previous coding
methods.
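For illustration, the plain MDCT can be written directly from its definition. This toy version omits the windowing and 50% block overlap a real AAC filter bank applies:

```python
import math

def mdct(x):
    """Plain MDCT of a block of 2N samples -> N spectral lines.

    A sketch only: a real encoder also applies a window function and
    overlaps successive blocks by 50%.
    """
    N = len(x) // 2
    return [
        sum(x[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
            for n in range(2 * N))
        for k in range(N)
    ]

# AAC long blocks transform 2N = 2048 input samples into 1024 spectral
# lines; here a toy block of 8 samples yields 4 lines.
coeffs = mdct([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
print(len(coeffs))   # 4
```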
Temporal Noise Shaping TNS: A true novelty in the area of time/frequency coding schemes. It
shapes the distribution of quantization noise in time by prediction in the frequency domain. Voice
signals in particular experience considerable improvement through TNS.
Prediction: A technique commonly established in the area of speech coding systems. It benefits
from the fact that a certain type of audio signal is easy to predict.
Quantization: by allowing finer control of quantization resolution, the given bit rate can be used
more efficiently.
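As an illustration of this kind of non-uniform quantization, the following hedged sketch uses the 3/4-power companding and quarter-power-of-two scalefactor steps commonly described for AAC-style coders; the exact constants and rounding are illustrative, not taken from the standard text:

```python
# Hedged sketch: spectral values are companded with a 3/4 power law before
# rounding, and the scalefactor (in steps of 2^(1/4), i.e. 1.5 dB) gives
# the fine-grained control over quantization resolution mentioned above.
def quantize(value, scalefactor):
    step = 2 ** (scalefactor / 4)            # 2^(1/4) per scalefactor step
    return round(abs(value / step) ** 0.75)  # 3/4-power companding

def dequantize(q, scalefactor):
    step = 2 ** (scalefactor / 4)
    return (q ** (4 / 3)) * step             # inverse 4/3-power expansion

q = quantize(100.0, 8)                       # larger scalefactor = coarser
print(q, round(dequantize(q, 8), 1))         # 11 97.9
```

The power law spends quantization accuracy where the ear needs it: small spectral values are represented relatively finely, large ones more coarsely.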
Bit-stream format: The information to be transmitted undergoes entropy coding in order to keep
redundancy as low as possible. The optimization of these coding methods together with a flexible
bit-stream structure has made further improvement of the coding efficiency possible.
References
[1] MPEG Software Simulation Group, MPEG-2 Video Codec, 1996. http://www.mpeg.org/index.html/MSSG/#source
[2] IEE J Langham Thompson Prize. Electronics & Communication Engineering Journal, December 1995.
[3] B. Timsari and C. Shahabi, MPEG Coding for Video Compression. University of Southern California, Computer Science Department, Los Angeles, CA.
[4] T. Sikora, MPEG-1 and MPEG-2 Digital Video Coding Standards. Heinrich-Hertz-Institute Berlin, Image Processing Department.
[5] S. Aign, "A Temporal Error Concealment Technique for I-Pictures in an MPEG-2 Video-Decoder," in SPIE Visual Communication and Image Processing '98, vol. 3309, pp. 405-416, San Jose, Jan. 1998.
[6] B.G. Haskell, A. Puri and A.N. Netravali, Digital Video: An Introduction to MPEG-2. Digital Multimedia Standards Series, Chapman & Hall, 1997.
[7] M. Tanaka, "Digital Broadcasting Boosts Demodulator ICs," Nikkei Electronics Asia, vol. 4, no. 12, pp. 42-47, December 1995.
[8] ISO/IEC 13818, Generic Coding of Moving Pictures and Associated Audio Information, 1996.
[9] P.N. Tudor and O.H. Werner, "Real-Time Transcoding of MPEG-2 Video Bit Streams," International Broadcasting Convention, September 1997, pp. 226-301.
[10] MPEG-2, ISO/IEC 13818, Generic Coding of Moving Pictures and Associated Audio Information, Nov. 1994.
[11] International Telecommunication Union, Radio Communication Study Groups, A Guide to Digital Terrestrial Television Broadcasting in the VHF/UHF Bands.
The MPEG-4 Standard
Brief Overview of MPEG Standards Progression and Naming
The MPEG-1 (1992) and MPEG-2 (1995) standards were defined to standardize, respectively,
the storage of audiovisual information on CD-ROMs and the coding of digital TV and HDTV. In
mid-1993 work began on MPEG-3, the goal being a standard for HDTV. Shortly afterwards the
standard was abandoned because of its similarity to MPEG-2, and the HDTV tools were instead
included in MPEG-2.
The growing field of multimedia began to encompass a more diverse range of applications and
was taken up by the telecommunications and consumer electronics industries. Application
demands became more complex, requiring point-to-point and multipoint-to-multipoint
communications, object manipulation, online editing and so on. It became clear that the existing
standards were not flexible enough to handle this range of applications. Consequently, in 1999
MPEG-4 became the next member of the well-known family of ISO/MPEG coding standards.
After the release of the MPEG-4 Version 1 standard, the Version 2 standards addendum was
published in 2000.
Since October 1996 MPEG has been working on its fourth standard, MPEG-7. The MPEG-7 [i]
standard was originally called "Multimedia Content Description Interface". MPEG-7 will provide a
rich set of standardized tools to describe multimedia content. Both human users and automatic
systems that process audiovisual information are within its scope. The MPEG-7 work plan calls
for conclusion of conformance testing in September 2002.
On the horizon is MPEG-21 [ii], "Multimedia Framework", begun in June 2000. The aim of
MPEG-21 is to fit together the various existing elements that build an infrastructure for the
delivery and consumption of multimedia content; where gaps exist, MPEG-21 will recommend
which new standards are required. The schedule for MPEG-21 gives a finishing date of
February 2009.
Fig 1. Anatomy of an MPEG-4 Scene [iii]
Overview of concept and need for MPEG-4
The aim of the MPEG-4 standard was to provide standardized core technologies allowing efficient
storage, transmission and manipulation of video and audio data in multimedia environments in areas
that were not perceived to be well covered by the MPEG-1 and MPEG-2 standards. These are
described below [iv].
The main advance of MPEG-4 over the previous standards is its ability to provide content-based
interaction between the user and the system, which can be viewed in terms of: a) multimedia
data access tools, b) manipulation and bitstream editing, c) hybrid natural and synthetic data
coding and d) improved temporal random access [v]. The user can explore different scenarios
and change object properties interactively. The requirements of multimedia applications have
become increasingly diverse, necessitating the handling of data such as still images, video,
stereo images, natural and synthetic content, text, medical imagery and so on. The MPEG-4
standard supports a diversity of objects to meet these requirements.
In MPEG-4, compression has been addressed in the form of improved coding efficiency and
coding of multiple concurrent data streams. Modern devices are not only those associated with
high bit-rate applications; there are many applications at the other end of the spectrum, such as
video telephony, operating at very low bit rates. The MPEG-4 standard allows replay of a file at
different bit rates selected according to the hardware, yet the file is encoded only once; this can
be viewed as content-based scalability.
In today’s online environment downloading and copying of files is very difficult to control. For this
reason, establishing ownership of the file content is an important issue. The MPEG-4 standard
includes methods for protection of intellectual property.
Features of MPEG-4.
Object Based System
The MPEG-1 and MPEG-2 standards were mainly concerned with data compression and treat
the data as a single entity; for example, video data is encoded as a series of rectangular frames
and non-contextual rectangular macroblocks. The MPEG-4 standard is object based: audio-visual
scenes are made up of audio-visual objects composed together according to a scene description.
The objects may exist singly, but a set of related objects can be used to form an MPEG-4 scene.
The objects may be semantically significant and non-rectangular according to the application.
This object-based scheme allows interaction with individual elements within the audio-visual
scene. Many different coding schemes exist for individual objects, which facilitates easy re-use
of audio-visual content in the different environments in which it is used.
Audio visual objects can vary widely in nature as shown in the list below:
a) Audio (single or multi-channel);
b) Video (arbitrary shape or rectangular);
c) Natural (natural audio or video);
d) Synthetic (text and graphics, animated faces, synthetic music);
e) 2D (Web like pages);
f) 3D (Spatialized sound, 3D virtual world);
g) Streamed (Video movie);
h) Download (audio jingle).
Interaction is accomplished by describing the video objects in a two-dimensional or even
three-dimensional space, and the audio objects in a sound space. In most applications each
video object represents a semantically meaningful object in the scene. Each object is
represented by a three-component (Y, U and V) colour system plus information about its shape.
This information is stored frame after frame at predefined temporal intervals. Objects are defined
once and changes made by the user are calculated locally, so reliance on transmission speed is
reduced and a fast response can be more easily achieved.
An interaction language for the video objects is used to send commands from the user [vi]. It is
called BIFS (Binary Format for Scenes). BIFS can be used to add and delete objects from a
scene; it can also be used to animate objects and to change object properties such as texture or
colour. The roots of the BIFS language can be found in VRML (Virtual Reality Modeling
Language), widely used for internet applications. BIFS, however, has made significant advances
and improvements: BIFS code is as much as 10-15 times shorter than equivalent VRML code,
making the system that much more efficient. BIFS is also used for real-time streaming; that is,
a scene can be played before it is fully downloaded, being built on the fly by BIFS. VRML does
not make provision for this type of application.
Transportation /Communications
The transportation and storage of MPEG-4 data is handled using multiple streams rather than a
single stream containing all information. Each object may have its own stream; audio and visual
streams are separate, and streams containing BIFS are separated from the stream carrying the
object itself. Streams exist solely for the purposes of time-stamping and synchronization. For
example, a scaleable object may have a single stream defining basic-quality information and
separate streams carrying information for enhancement layers.
The stream containing BIFS commands may carry instructions for object placement, scaling and
motion. An additional stream called the "Object Descriptor" (OD) is the master stream describing
the contents of the other "Elementary Streams" (ESs); it informs the system which decoders to
use to decode each object. The concept of dividing the information into separate streams is
fundamental to providing the interactivity that allows addition, deletion and separate handling of
objects.
In the transportation of the data, a separate layer is dedicated to the synchronization of the
elementary streams. Each stream is divided into packets and then passed to the network
transport layer. Actual data transportation is left to existing networking technologies such as
ATM, RTP, etc.
The MPEG-4 standard does contain a device called "FlexMux" [vii] that handles the interface
between the multiple streams and the network transport layer. This entails the synchronized
delivery of streaming information from source to destination. Different QoS levels are exploited
as they are available from the network. FlexMux is specified in terms of the synchronization
layer and a delivery layer containing a two-layer multiplexer.
DMIF stands for Delivery Multimedia Integration Framework. DMIF is found in part 6 of the
MPEG-4 standard [3] and manages the first multiplexing layer. This multiplex may be
embodied by the MPEG-defined FlexMux tool, which allows grouping of Elementary Streams
(ESs) with a low multiplexing overhead.
The second layer, or "TransMux" (Transport Multiplexing) layer, models the layer that offers
transport services matching the requested QoS. Only the interface to this layer is specified by
MPEG-4; other signaling, packet handling, etc. must be done in cooperation with a suitable
existing transport protocol stack such as (RTP)/UDP/IP or (AAL5)/ATM. The choice is left to the
end user/service provider, which allows MPEG-4 to be used in a wide variety of operating
environments.
Two types of timestamps are used for incoming streams: one gives the time at which the
information must be decoded, the other the time at which it must be presented. The timing
scheme is relative and is referenced against a master clock set at the time of encoding.
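These two timestamp types correspond to what is commonly called a decoding timestamp and a presentation timestamp. The sketch below, reusing the GOP example from the MPEG-2 section, shows how the two diverge when B-frames reorder the stream. Times are in picture periods and are illustrative; a real system adds a constant decoder delay so that presentation never precedes decoding:

```python
# With B-frames, bitstream (transmission) order differs from display
# order, so decode and presentation times diverge per frame.
bitstream = ["I3", "B1", "B2", "P6", "B4", "B5"]   # transmission order
display   = ["B1", "B2", "I3", "B4", "B5", "P6"]   # natural display order

for frame in bitstream:
    dts = bitstream.index(frame)   # decoded in arrival order
    pts = display.index(frame)     # presented in natural order
    print(f"{frame}: decode at t={dts}, present at t={pts}")
```

Note that I3 is decoded first (t=0) but presented third (t=2): the decoder must hold it while the B-frames that reference it are decoded and displayed.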
Security Features
Theft of intellectual property is an increasingly difficult problem to combat. MPEG-4 plays a
twofold role in protection. First, the elementary stream used to transport the data contains its
identification and ownership. Secondly, the standard provides an interface for conditional
access, which enables the use of proprietary systems that handle intellectual property
protection; an example of such a system is a pay-television service.
An example of digital watermarking using MPEG-4 for facial animation data sets has been
presented by Hartung, Eisert and Girod [viii]. Watermarking is the process of embedding
information about copyright and data ownership within a video stream. It was shown that the
watermark could be reliably embedded in and retrieved from the data stream, even with different
choices of watermark data rate.
Low Bit-rate Applications
To be feasible in the world of mobile devices [ix], the MPEG-4 standard must extend itself to
cover low bit-rate applications, which may need to provide streaming at rates as low as 10 kb/s.
In addition, the lack of redundancy in the compressed data makes the mobile environment more
error prone.
The feature of MPEG-4 that accommodates the need for such low transmission rates, is the
use of scaleable objects. A base layer sends the basic quality data, with additional layers that
can be added to enhance the presentation if the bit-rate is available. The fact that the data is
presented as separate objects allows the encoder to send the most important objects in a
scene first, and provide these objects with a superior level of error protection. This means
that when encoding, the provider does not need to ask the receiver what bit rate they can
handle and then make an individual encoding for each of the bit-rates. The data can be
encoded a single time, with scaleable parameters. The bit-rate can be sensed automatically
before playout or even adjusted during playout.
Spatial scalability implies that a frame or object can be decoded with a varying level of quality,
quantified by the peak signal-to-noise ratio. Temporal scalability implies that each individual
object can be decoded at different frame rates, and object scalability implies that any object can
be accessed in the sequence. Both spatial and temporal scalability are implemented using
multiple video object layers: a base layer and an enhancement layer. For spatial scalability the
enhancement layer improves the spatial resolution and thus the visual quality; for temporal
scalability the enhancement layer can provide increased frame rates and improve the
smoothness of motion.
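The peak signal-to-noise ratio used to quantify this has a simple closed form, PSNR = 10 log10(MAX^2 / MSE), with MAX = 255 for 8-bit samples. A minimal sketch on toy sample values (the numbers are illustrative only):

```python
import math

def psnr(original, decoded, peak=255):
    """Peak signal-to-noise ratio in dB between two sample sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return float("inf")          # identical pictures
    return 10 * math.log10(peak ** 2 / mse)

truth = [100, 100, 100, 100]         # source samples
base  = [100, 102, 97, 101]          # base-layer reconstruction
enh   = [100, 100, 100, 100]         # after adding the enhancement layer
print(round(psnr(truth, base), 1))   # finite PSNR for the base layer
print(psnr(truth, enh))              # inf: perfect reconstruction
```

Adding an enhancement layer reduces the mean squared error of the reconstruction and therefore raises the PSNR.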
Sprites can be used to encode an unchanging background and can be useful for reducing the
volume of bits transmitted. For example, a background sprite defining the image of the
background need be sent only once. Subsequently, new views are created by simply sending
the new positions of four pre-defined points. The use of sprites in this manner enhances the
efficiency of compression. In addition, sprite-based coding is very suitable for synthetic objects,
but can also be used for objects in natural scenes that undergo rigid motion. The MPEG-4 video
model considers two kinds of sprites: static sprites, which are generated off-line and are better
suited to synthetic objects, and dynamic sprites, which are constructed online during the
encoding process and are more suitable for representing natural objects [4].
Image Mapping
MPEG-4 provides tools for describing both the classical “rectangular video” and arbitrary shapes.
Two methods of including arbitrary shapes are provided. The first method deals with so-called
binary shapes: a predefined pixel either is, or is not, part of an object. The pixel may be defined
in terms of its colour, brightness or another parameter. The results can be somewhat primitive,
presenting jagged or fuzzy outlines, but the advantage of this technique lies in its simplicity.
The second method is known as an "alpha" shape, or "gray-scale" shape: the pixel definition
also contains a transparency value, providing a tool for smooth blending of the shape with the
background or with other shapes.
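At the level of a single pixel the difference between the two shape types can be illustrated with ordinary alpha compositing; a binary shape is simply the two extreme transparency values:

```python
# A binary shape gives a hard in/out decision per pixel, while a gray-scale
# ("alpha") shape carries a transparency value (0 = transparent,
# 255 = opaque) that blends the object smoothly with the background.
def composite(fg, bg, alpha):
    """Blend a foreground object pixel over the background pixel."""
    a = alpha / 255
    return round(a * fg + (1 - a) * bg)

fg, bg = 200, 50
print(composite(fg, bg, 255))   # binary "inside":  200 (object pixel)
print(composite(fg, bg, 0))     # binary "outside": 50  (background)
print(composite(fg, bg, 128))   # alpha edge pixel: 125 (smooth mix)
```

The intermediate alpha values are what remove the jagged outlines that a purely binary shape produces at object edges.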
In addition to standards for the description of shapes, MPEG-4 allows the mapping of images
onto computer-generated shapes, thus generating synthetic objects. An image can be mapped
onto a two- or three-dimensional mesh, and commands can then be used to deform the mesh
and produce the effect of motion. For example, faces can be mapped onto a mesh to form
avatars that can be used as an online stand-in for a human or a synthesized being [x, xi].
The video encoding procedure has several steps. First, the objects are formed from the incoming
video bitstream. The resulting individual object bitstreams are then coded. The individual object
properties are encoded by a coding control unit, which decides which objects are to be
transmitted, the number of layers each has and the level of scalability. The individual video
object bitstreams are then sent through a multiplexer that merges them into a single output
bitstream.
There are particular characteristics that are associated with the MPEG-4 video object structure.
The texture is represented in YUV colour coordinates, and up to 12 bits may be used to represent
a pixel component value. Shape and texture are available for each "snapshot" of the video
object; a "snapshot" is called a Video Object Plane (VOP). A Video Object Plane is implemented
in blocks of 16x16 pixels, thus providing backwards compatibility with previous standards. The
shape information for a video object plane is defined explicitly, since the object is expected to
be non-rectangular. This is called the alpha-plane, and it is specified using two components: one
is a binary array that defines the bounding box of the video object plane and specifies whether
or not an input pixel belongs to the object; the second is a transparency value, defined with 0
being transparent and 255 opaque, called the gray-scale shape.
Other terminology used in the structure includes the "Video Session", which is a group of video
objects that form the scene; the "Video Object Layer", which contains information to support the
temporal and spatial scalability of a video object; and the "Video Object Plane", which is a
time-sampled version of any one video object (as mentioned previously, a "snapshot").
Motion detection is achieved using a technique similar to earlier standards, except that each
object is again encoded individually. There are three types of planes: the I-Video Object Plane,
the P-Video Object Plane and the B-Video Object Plane. As with frames, the I-VOP is encoded
independently, the P-VOP is predicted from previously decoded planes, and the B-VOP is a
bidirectionally interpolated plane based on I- and P-planes. Motion estimation is performed using
16x16 or 8x8 macroblocks in the same way as in previous MPEG standards [xii].
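The block-matching idea behind this motion estimation can be sketched in one dimension: for a block of the current picture, search the previous decoded picture around the co-located position for the offset with the lowest sum of absolute differences (SAD). The data below is a toy example, illustrative only:

```python
def best_match(block, reference, pos, search_range):
    """Full-search SAD block matching around co-located position `pos`.

    Returns (offset, sad): the motion-vector offset with the lowest sum
    of absolute differences, as a P-VOP predictor would compute it.
    """
    best = (0, float("inf"))
    for off in range(-search_range, search_range + 1):
        start = pos + off
        if start < 0 or start + len(block) > len(reference):
            continue                         # candidate outside the picture
        cand = reference[start:start + len(block)]
        sad = sum(abs(a - b) for a, b in zip(block, cand))
        if sad < best[1]:
            best = (off, sad)
    return best

reference = [0, 0, 0, 10, 20, 30, 40, 0]     # previous decoded picture row
current   = [0, 0, 10, 20, 30, 40, 0, 0]     # object has shifted by one
block = current[2:6]                          # "macroblock" at position 2
print(best_match(block, reference, pos=2, search_range=2))   # (1, 0)
```

A real encoder runs this search over 16x16 (or 8x8) pixel blocks in two dimensions, then codes only the motion vector and the prediction residual.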
Structured Audio and Sound Capability
Csound is a popular synthesis language and was used by the Massachusetts Institute of
Technology to develop NetSound. NetSound provides the foundation of the MPEG-4 provisions
for "structured audio". Structured audio is a format for describing methods of synthesis [xiii].
Descriptors for elements such as oscillators and digital filters are specified as signal-processing
elements, and small networks of them are chosen to create specific sounds. Ultimately each
network can be used to define what is termed an "instrument", regardless of whether it
synthesizes a traditional instrument such as a violin, or a fire alarm. SAOL (Structured Audio
Orchestra Language) and SASL (Structured Audio Scoring Language) are used to combine
sound objects and "score" the production.
Popular audio options such as MIDI and other mass-market synthesizers are included, but their
repertoire of instruments is severely limited by comparison.
MPEG-4 uses a similar concept with sound as it does with video, in that individual sound objects
are placed in a sound space. Thus different mixes can be defined for different listening conditions
and hardware. It also means that if a user moves a video object in a scene, the audio soundtrack
can be moved along with it since the speech and the background soundtrack can be included as
separate audio objects.
“Environmental Spatialization” in MPEG-4 Version 2, is so called because it can change how a
sound object is heard depending on the room definition sent to the decoder. This works locally at
the terminal, thus saving on bit transmission.
Summary
Although work began with the goal of standardizing low bit-rate encoding, the MPEG-4 standard
has made the step into object-based multimedia and has standardized the concept of audio,
visual and interactive objects combining to form a multimedia presentation. In comparison
testing between MPEG-1, MPEG-2, H.263 and MPEG-4 at various bitrates, the MPEG-4
standard gave the highest image-quality test results for video encoding [4].
References
1. Overview of the MPEG-7 Standard (version 4.0), ISO/IEC JTC1/SC29/WG11, La Baule, October 2000.
2. ISO/IEC TR 18034-1:2001(E) Part 1: Vision, Technologies and Strategy, MPEG, Document ISO/IEC JTC1/SC29/WG11 N3939.
3. R. Koenen, "MPEG-4: Multimedia for our time." IEEE Spectrum, 36(2):26-33, February 1999.
4. J. Liang, "New Trends in Multimedia Standards: MPEG-4 and JPEG2000." Informing Science, Special Issue on Multimedia Informing Technologies, Part 1, Vol. 2, No. 4, 1999.
5. T. Ebrahimi, "MPEG video verification model: a video encoding/decoding algorithm based on content representation." Image Communication, Special Issue on MPEG-4, October 1997.
6. H. Kalva, "MPEG-4 Systems and Applications." Columbia University, 1312 S.W. Mudd Building, New York, NY 10027...
7. ISO MPEG-4 Standard Version 1 (ISO/IEC 14496).
8. F. Hartung, P. Eisert and B. Girod, "Digital Watermarking of MPEG-4 Facial Animation Parameters." Computers and Graphics, Vol. 22, No. 3.
9. A. Puri and A. Eleftheriadis, "MPEG-4: A Multimedia Coding Standard Supporting Mobile Applications." ACM Mobile Networks and Applications Journal, Special Issue on Mobile Multimedia Communications, Vol. 3, No. 1, June 1998, pp. 5-32 (invited paper).
10. J. Ostermann, M. Beutnagel, A. Fischer and Y. Wang, "Integration of Talking Heads and Text-to-Speech Synthesizers for Visual TTS." AT&T Labs Research, Institut Eurecom/EPFL, Polytechnic University.
11. J. Ostermann and E. Haratsch, "An animation definition interface: Rapid design of MPEG-4 compliant animated faces and bodies." International Workshop on Synthetic-Natural Hybrid Coding and Three-Dimensional Imaging, pp. 216-219, Rhodes, Greece, September 5-9, 1997.
12. P. Fleury, S. Bhattacharjee, L. Piron, T. Ebrahimi and K. Murat, "MPEG-4 Video Verification Model: A Solution for Interactive Multimedia Applications." SPIE Journal of Electronic Imaging, Vol. 7, No. 3, pp. 502-515, July 1998.
13. E.D. Scheirer and B.L. Vercoe, "SAOL: The MPEG-4 Structured Audio Orchestra Language." Computer Music Journal, 23:2, pp. 31-51, Summer 1999.