VIDEO AND AUDIO COMPRESSION
THE MPEGs STANDARDS
T. Alejandro Mendoza
Lesley Jacques Rigoberto Fernandez
Florida International University School of Computer Science
CIS 6931 Advanced Topics of Information Processing Dr. Shu-Ching Chen
Spring, 2001
Table of Contents

The MPEG-1 Standard
Introduction
MPEG Video Compression Techniques
Stream Structure
Video Stream Data Hierarchy
Group of Pictures (GOP)
Picture
Slice
Macroblock
Block
Audio Stream Data Hierarchy
Inter-Picture Coding
Picture Types
Intra Pictures
Predicted Pictures
Bidirectional Pictures
Video Stream Composition
Motion Compensation
Intra-picture (Transform) Coding
Synchronization
System Clock Reference
MPEG-1 Audio Standard
Analog-to-Digital
Audio Encoding
MPEG-1 Audio Encoding Diagram
Operating Mode
References
The MPEG-2 Standard
Introduction
MPEG Standards
The following MPEG standards exist
Overview of MPEG-2
MPEG-2 Video Compression
Video Stream Data Hierarchy
Picture Types
Profiles and Levels
Details of profiles
MPEG-2 Video profiles
Details of levels
Interlaced Video and Picture Structures
MPEG-2 standard features
MPEG-2 Advanced Audio Coding
The MPEG-2 audio standard in comparison to MPEG-1
Differences between MPEG-2 AAC and its predecessor ISO/MPEG Audio Layer-3
References
The MPEG-4 Standard
Brief Overview of MPEG Standards Progression and Naming
Overview of concept and need for MPEG-4
Features of MPEG-4
Object Based System
Transportation/Communications
Security Features
Low Bit-rate Applications
Image Mapping
Structured Audio and Sound Capability
Summary
References
The MPEG-1 Standard
Introduction
The major goal of video compression is to represent a video source with as few bits as possible
while preserving the level of quality required for the given video application. The bit-rate
reduction can only be possible by removing redundant information from the video during the
coding process and reinserting it during the decoding process. In video signals there is a large
amount of redundancy, which can be classified as statistical and psycho-visual redundancy:
any given pixel is likely to be similar to the pixels surrounding it, and any given frame is likely
to be similar to the frames surrounding it. The statistical redundancy
results from the fact that pixel values are correlated with their neighbors in spatial and temporal
directions. The psycho-visual redundancy is a consequence of the human visual system (HVS)
sensitivity. The human vision has a limited response to fine spatial or temporal detail. Bit-rate
reduction is possible by allowing distortions that should not be visible to human eyes. The MPEG
standard is a generic standard, independent of any particular application or delivery medium,
that is optimized for the quality and size of the compressed video. MPEG does not specify any
particular encoding techniques; it only defines the format for encoding video information (for
example, in B- and P-frames) and the algorithm for reconstructing the pixels during
decompression. The compressed stream should look
as good as possible when decompressed; the optimal goal is to be indistinguishable from the raw
video. On the other hand, the encoded stream should be as small as possible. The size of the
encoded stream is important if the multimedia application must fit within a certain fixed space, as
an example the limited space provided by a typical CD-ROM. The MPEG standard was
developed in response to the growing need for a common format for representing compressed
video on various digital storage media such as CDs, DATs, hard disks, and optical drives.
Applications using compressed video on digital storage media need to be able to perform a
number of operations in addition to normal forward playback of the video sequence.
MPEG Video Compression Techniques
MPEG-1 uses several algorithms and techniques to accomplish compression: subsampling of
the chrominance information to match the sensitivity of the human visual system (HVS),
quantization, motion compensation to exploit temporal redundancy, the discrete cosine
transform (DCT) to exploit spatial redundancy, variable-length coding (VLC), and picture
interpolation. These techniques are described in the sections that follow.
Stream Structure
In its most general form, an MPEG system stream is made up of two layers:
- The system layer, which contains timing and other information needed to demultiplex the
audio and video streams during playback
- The compression layer, which includes the audio and video streams
The decoding process involves extracting the timing information from the MPEG system stream
and sending it to the other system components. The system decoder also demultiplexes the
video and audio streams from the system stream then sends each to the appropriate decoder.
The video decoder decompresses the video stream. The audio decoder decompresses the audio
stream.
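As a toy illustration of this demultiplexing step, the sketch below routes packets into per-type queues by stream id. The packet representation (a list of (stream_id, payload) tuples) is invented for the example; only the id ranges loosely follow the MPEG-1 system-stream convention.

```python
def demultiplex(packets):
    """Split (stream_id, payload) packets into audio and video queues.

    The id ranges loosely follow the MPEG-1 system-stream convention:
    0xC0-0xDF are audio streams and 0xE0-0xEF are video streams;
    anything else (padding, reserved ids) is ignored here.
    """
    audio, video = [], []
    for stream_id, payload in packets:
        if 0xC0 <= stream_id <= 0xDF:
            audio.append(payload)
        elif 0xE0 <= stream_id <= 0xEF:
            video.append(payload)
    return audio, video

audio, video = demultiplex([(0xC0, b"aud0"), (0xE0, b"vid0"), (0xC0, b"aud1")])
```

In a real decoder the two queues would feed the audio and video decoders, while the timing information extracted alongside them drives the system clock.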
Video Stream Data Hierarchy
The MPEG standard defines a hierarchy of data structures in the video stream. A video
sequence begins with a sequence header (and may contain additional sequence headers),
includes one or more groups of pictures, and ends with an end-of-sequence code.
Group of Pictures (GOP)
A header and a series of one or more pictures intended to allow random access into the
sequence.
Picture
A picture is the primary coding unit of a video sequence. A picture consists of three rectangular
matrices representing luminance/brightness (Y) and two chrominance/color (Cb and Cr) values.
The Y matrix has an even number of rows and columns. The Cb and Cr matrices are one-half
the size of the Y matrix in each direction, horizontal and vertical. For every four luminance values,
there are two associated chrominance values: one Cb value and one Cr value.
Slice
A slice is one or more contiguous macroblocks. The order of the macroblock within a slice is from
left-to-right and from top-to-bottom. Slices are important in the handling of errors. If the bit-stream
contains an error, the decoder can skip to the start of the next slice. Having more slices in the bit-
stream allows better error concealment, but uses bits that could otherwise be used to improve
picture quality.
Macroblock
A macroblock is a 16-pixel by 16-line section of the luminance component and the
corresponding 8-pixel by 8-line sections of the two chrominance components. A macroblock
thus contains four Y blocks, one Cb block, and one Cr block.
Block
A block is an 8-pixel by 8-line set of values of a luminance or a chrominance component. Note
that a luminance block corresponds to one fourth as large a portion of the displayed image as
does a chrominance block.
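The block and macroblock relationships above can be checked with a little arithmetic. The sketch below (an illustrative helper, not part of any MPEG API) computes how many macroblocks and blocks a 4:2:0 picture contains, assuming the frame dimensions are multiples of 16.

```python
def macroblock_layout(width, height):
    """For a 4:2:0 picture whose dimensions are multiples of 16, return
    (macroblocks, luma_blocks, chroma_blocks): each 16x16 macroblock
    carries four 8x8 Y blocks plus one 8x8 Cb and one 8x8 Cr block."""
    macroblocks = (width // 16) * (height // 16)
    return macroblocks, 4 * macroblocks, 2 * macroblocks

# A 352x240 picture (a common MPEG-1 SIF size): 22 x 15 macroblocks.
layout = macroblock_layout(352, 240)   # (330, 1320, 660)
```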
Audio Stream Data Hierarchy
The MPEG standard defines a hierarchy of data structures that accept, decode and produce
digital audio output. The MPEG audio stream, like the MPEG video stream, consists of a series
of packets. Each audio packet contains an audio packet header and one or more audio frames.
In the audio stream structure each packet header contains the following information:
- Packet start code: Identifies the packet as being an audio packet
- Packet length: Indicates the number of bytes in the audio packet
An audio frame contains the following information:
- Audio frame header: Contains synchronization, ID, bit rate, and sampling frequency
information
- Error-checking code: Contains error-checking information
- Audio data: Contains information used to reconstruct the sampled audio data
- Ancillary data: Contains user-defined data
Inter-Picture Coding
Much of the information in a picture within a video sequence is similar to information in a previous
or subsequent picture. The MPEG standard takes advantage of this temporal redundancy by
representing some pictures in terms of their differences from other (reference) pictures, which is
known as inter-picture coding.
Picture Types
The MPEG standard specifically defines three types of pictures: intra (I), predicted (P), and
bidirectional (B).
Intra Pictures
Intra pictures, or I-pictures, are coded using only information present in the picture itself.
I-pictures provide potential random-access points into the compressed video data. I-pictures use
only transform coding and provide moderate compression, typically using about two bits per
coded pixel.
Predicted Pictures
Predicted pictures, or P-pictures, are coded with respect to the nearest previous I-picture or P-
picture. This technique is called forward prediction. Like I-pictures, P-pictures serve as a
prediction reference for B-pictures and future P-pictures. However, P-pictures use motion
compensation to provide more compression than is possible with I-pictures. Unlike I-pictures,
P-pictures can propagate coding errors, because P-pictures are predicted from previous
reference I-pictures or P-pictures.
Bidirectional Pictures
Bidirectional pictures, or B-pictures, are pictures that use both a past and future picture as a
reference. B-pictures provide the most compression and do not propagate errors because they
are never used as a reference. Bidirectional prediction also decreases the effect of noise by
averaging two pictures.
Video Stream Composition
The MPEG algorithm allows the encoder to choose the frequency and location of I-pictures. This
choice is based on the application’s need for random accessibility and the location of scene cuts
in the video sequence. In applications where random access is important, I-pictures are typically
used two times a second. The encoder also chooses the number of B-pictures between any pair
of reference I-pictures or P-pictures. This choice is based on factors such as the amount of
memory in the encoder and the characteristics of the material being coded. For example, a large
class of scenes has two bidirectional pictures separating successive reference pictures. The
MPEG encoder reorders pictures in the video stream to present the pictures in the most efficient
sequence. In particular, the reference pictures needed to reconstruct B-pictures are sent before
the associated B-pictures.
Motion Compensation
Motion compensation is a technique for enhancing the compression of P-pictures and B-pictures
by eliminating temporal redundancy. Motion compensation typically improves compression by
about a factor of three compared to intra-picture coding. The motion compensation algorithm
works at the macroblock level.
When a macroblock is compressed by motion compensation, the compressed file contains this
information:
- The spatial vector between the reference macroblock(s) and the macroblock being coded
(motion vectors).
- The content difference between the reference macroblock(s) and the macroblock being
coded (error terms)
Not all information in a picture can be predicted from a previous picture. Consider a scene in
which a door opens: The visual details of the room behind the door cannot be predicted from a
previous frame in which the door was closed. In such cases, when a macroblock in a P-picture
cannot be efficiently represented by motion compensation, it is coded in the same way as a
macroblock in an I-picture, using transform coding techniques.
Four coding types are therefore possible for each macroblock in a B-picture:
- Intra coding: No motion compensation
- Forward prediction: The previous reference picture is used as a reference
- Backward prediction: The next picture is used as a reference
- Bidirectional prediction: Two reference pictures are used, the previous reference picture
and the next reference picture
Backward prediction can be used to predict uncovered areas that do not appear in previous
pictures.
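A minimal sketch of how an encoder might find a motion vector is shown below. It performs an exhaustive sum-of-absolute-differences (SAD) search over a small window; real encoders use far faster search strategies, and the frame and block sizes here are toy values chosen only for illustration.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized pixel blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def best_motion_vector(ref, cur_block, cx, cy, search=2, size=4):
    """Exhaustive block-matching search: try every displacement within
    +/- search pixels of (cx, cy) in the reference frame and return the
    (dx, dy) displacement and SAD cost of the best-matching block."""
    best_mv, best_cost = None, None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = cx + dx, cy + dy
            if not (0 <= x and x + size <= len(ref[0]) and
                    0 <= y and y + size <= len(ref)):
                continue  # candidate block falls outside the frame
            candidate = [row[x:x + size] for row in ref[y:y + size]]
            cost = sad(candidate, cur_block)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost

# Toy frames: the current block at (2, 2) matches reference content at (3, 3).
ref = [[8 * y + x for x in range(8)] for y in range(8)]
cur = [row[3:7] for row in ref[3:7]]
mv, cost = best_motion_vector(ref, cur, 2, 2)
```

With a perfect match, the residual (error terms) is zero and only the motion vector needs to be transmitted; otherwise the residual block is transform-coded as described below.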
Intra-picture (Transform) Coding
The MPEG transform-coding algorithm includes these steps:
- Discrete cosine transform (DCT)
- Quantization
- Run-length encoding
Both image blocks and prediction-error blocks have high spatial redundancy. To reduce this
redundancy, the MPEG algorithm transforms 8 X 8 blocks of pixels or 8 X 8 blocks of error terms
from the spatial domain to the frequency domain with the Discrete Cosine Transform (DCT).
Next, the algorithm quantizes the frequency coefficients. Quantization is the process of
approximating each frequency coefficient as one of a limited number of allowed values. The
encoder chooses a quantization matrix that determines how each frequency coefficient in the 8 X
8 block is quantized. Human perception of quantization error is lower for high spatial frequencies,
so high frequencies are quantized more coarsely, with fewer allowed values, than low frequencies.
The combination of DCT and quantization results in many of the frequency coefficients being
zero, especially the coefficients for high spatial frequencies. To take maximum advantage of this,
the coefficients are organized in a zigzag order to produce long runs of zeroes. The coefficients
are then converted to a series of run-amplitude pairs, each pair indicating a number of zero
coefficients and the amplitude of a non-zero coefficient. These run-amplitude pairs are then
coded with a variable-length code, which uses shorter codes for commonly occurring pairs and
longer codes for less common pairs. Some blocks of pixels need to be coded more accurately
than others. For example, blocks with smooth intensity gradients need accurate coding to avoid
visible block boundaries. To deal with this inequality between blocks, the MPEG algorithm allows
the amount of quantization to be modified for each macroblock of pixels. This mechanism can
also be used to provide smooth adaptation to a particular bit rate.
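The zigzag scan and run-amplitude conversion described above can be sketched as follows. This is an illustrative implementation only, not the normative MPEG procedure (which, among other things, treats DC and AC coefficients differently).

```python
def zigzag_order(n=8):
    """(row, col) visiting order of the classic zigzag scan: walk the
    anti-diagonals of the block, alternating direction on each one."""
    def key(rc):
        r, c = rc
        s = r + c
        # odd diagonals run top-right to bottom-left, even ones the reverse
        return (s, r if s % 2 else -r)
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)

def run_amplitude_pairs(block):
    """Scan a square coefficient block in zigzag order and emit one
    (zero_run, amplitude) pair per non-zero coefficient."""
    run, pairs = 0, []
    for r, c in zigzag_order(len(block)):
        value = block[r][c]
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return pairs

# A quantized block that is mostly zeros, as is typical after DCT + quantization:
coeffs = [[0] * 8 for _ in range(8)]
coeffs[0][0], coeffs[0][1], coeffs[2][0] = 12, 5, -3
pairs = run_amplitude_pairs(coeffs)   # [(0, 12), (0, 5), (1, -3)]
```

The resulting pairs are what the variable-length coder then maps to short codes for common pairs and longer codes for rare ones.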
Synchronization
The MPEG standard provides a timing mechanism that ensures synchronization of audio and
video. The standard includes two parameters: the system clock reference (SCR) and the
presentation timestamp (PTS). The MPEG-specified “system clock” runs at 90 kHz. System
clock reference and presentation timestamp values are coded in the MPEG bitstream using 33
bits, which can represent any clock value over a period of more than 24 hours.
System Clock Reference
A System Clock Reference (SCR) is a snapshot of the encoder system clock that is placed into
the system layer of the bitstream. During decoding, these values are used to update the system
clock counter. Presentation timestamps are samples of the encoder system clock that are
associated with video or audio presentation units. A presentation unit is a decoded video picture
or a decoded audio time sequence. The presentation timestamp (PTS) represents the time
at which the video picture is to be displayed or the starting playback time for the audio time
sequence. The decoder either skips or repeats picture display to ensure that the PTS is within
one picture's worth of 90 kHz clock ticks of the system clock reference (SCR) when a picture is
displayed. If the PTS is earlier (has a smaller value) than the current SCR, the decoder discards
the picture. If the PTS is later (has a larger value) than the current SCR, the decoder repeats the
display of the picture.
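The skip/repeat rule can be sketched as a small decision function. The 3600-tick picture period corresponds to an assumed 25 pictures per second at the 90 kHz clock; both the function and its exact threshold handling are illustrative, not normative.

```python
def presentation_action(pts, scr, picture_period=3600):
    """Decide what the decoder does with a decoded picture.

    Times are in 90 kHz ticks; 3600 ticks is one picture period at an
    assumed 25 pictures/s. A picture whose PTS lags the SCR by more
    than one period is skipped; one whose PTS leads by more than one
    period causes the previous picture to be repeated instead.
    """
    if pts < scr - picture_period:
        return "skip"
    if pts > scr + picture_period:
        return "repeat"
    return "display"
```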
MPEG-1 Audio Standard
The basic task of the MPEG audio standard is to compress the digital audio data so that the
compressed file is as small as possible and the reconstructed (decoded) audio sounds identical, or
as close as possible, to the original audio before compression. Other requirements of the MPEG
audio standard include low complexity (to enable software decoders or inexpensive hardware
decoders with low power consumption) and flexibility for different application scenarios.
MPEG-1 audio standardizes three different coding schemes for digitized sound waves called
Layers I, II, and III. It does not standardize the encoder, but rather the type of information that an
encoder has to produce and write to an MPEG-1-conformant bit stream, as well as the way in
which the decoder has to parse, decompress, and re-synthesize this information in order to
regain the encoded sound. Layer I is the simplest; it uses only the basic psycho-acoustic model
(psycho-acoustics deals with the way the human brain perceives sound). Layer II adds more
advanced bit-allocation techniques and greater accuracy. Layer III adds a hybrid filterbank and
non-uniform quantization, plus advanced features such as Huffman coding, 18 times higher
frequency resolution, and a bit-reservoir technique. Layers I, II, and III give increasing
quality/compression ratios with increasing complexity and demands on processing power. The
MPEG audio specification requires that a valid Layer III decoder be able to decode any Layer I,
II, or III MPEG audio stream, and that a Layer II decoder be able to decode Layer I and Layer II
streams.
Analog-to-Digital
Sound is stored in a computer by converting analog waves to digital samples. Sound consists of
pressure differences in air; when picked up by a microphone and fed through an amplifier, these
become voltage levels. The voltage is sampled by the computer a number of times per second.
For CD-audio quality you need to sample 44,100 times per second, with each sample having a
resolution of 16 bits. In stereo this gives 1.4 Mbit per second, which is why compression is
needed. To compress audio, MPEG tries to remove the irrelevant and redundant parts of the
signal: parts of the sound we do not hear can be thrown away. To do this, MPEG audio uses
psycho-acoustic principles.
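The CD-audio figures quoted above follow from simple arithmetic:

```python
# CD audio: 44,100 samples/s x 16 bits/sample x 2 channels.
sample_rate = 44_100      # samples per second
bits_per_sample = 16
channels = 2

bit_rate = sample_rate * bits_per_sample * channels
print(bit_rate)           # 1411200 bits/s, i.e. about 1.4 Mbit/s

# One minute of uncompressed stereo CD audio:
bytes_per_minute = bit_rate * 60 // 8
print(bytes_per_minute)   # 10584000 bytes, roughly 10 MB per minute
```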
Audio Encoding
The primary psycho-acoustic effect that the MPEG-1 audio coder exploits is called “auditory
masking”, where parts of a signal are not audible due to the function of the human auditory
system. For example, if a sound consists mainly of one frequency, other sounds at nearby
frequencies that are much quieter will not be heard. The parts of the signal that are masked are
commonly called “irrelevant”, as opposed to parts of the signal that can be removed by a
lossless coding operation, which are termed “redundant”. In order to remove this irrelevancy, the
encoder contains a psycho-acoustic model. This psycho-acoustic model analyzes the input
signals within consecutive time blocks and determines for each block the spectral components of
the input audio signal by applying a frequency transform. Then it models the masking properties
of the human auditory system, and estimates the just noticeable noise level for each frequency
band, sometimes called the threshold of masking. In parallel, the input signal is fed through a
time-to-frequency mapping, resulting in spectrum components for subsequent coding. In its
quantisation and coding stage, the encoder tries to allocate the available number of data bits in a
way that meets both the bit-rate and masking requirements taking into account the calculated
threshold of masking. The information on how the bits are distributed over the spectrum is
contained in the bit-stream as side information. The decoder is much less complex because it
does not require a psycho-acoustic model or bit-allocation procedure. Its only task is to
reconstruct an audio signal from the encoded spectral components and associated side
information.
MPEG-1 Audio Encoding Diagram
The audio encoding diagram consists of the following blocks:
- Frequency Transform:
A filter bank is used to decompose the input signal into subsampled spectral components
(time/frequency domain). Together with the corresponding filter bank in the decoder, it
forms an analysis/synthesis system.
- Psycho-acoustic model:
Using the time-domain input signal and/or the output of the analysis filter bank, an
estimate of the actual (time and frequency dependent) masking threshold is computed
using rules known from psycho-acoustics.
- Quantization and coding:
The spectral components are quantized and coded with the aim of keeping the noise,
which is introduced by quantizing, below the masking threshold. Depending on the
algorithm, this step is done in very different ways, from simple block companding to
analysis-by-synthesis systems using additional noiseless compression.
- Encoding of bitstream:
A bitstream formatter is used to assemble the bitstream, which typically consists of the
quantized and coded spectral coefficients and some side information such as bit
allocation information.
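The quantization-and-coding step above can be illustrated with a toy bit-allocation loop: repeatedly give the next bit to the band whose quantization noise is furthest above its masking threshold, using the rule of thumb that each extra bit lowers noise by about 6 dB. This greedy sketch is purely illustrative and does not correspond to the normative procedure of any MPEG layer.

```python
def allocate_bits(signal_db, mask_db, total_bits, step_db=6.0):
    """Greedily hand out total_bits across bands: each bit lowers a
    band's quantization noise by ~step_db dB, and every bit goes to the
    band whose noise currently exceeds its mask by the largest margin."""
    bits = [0] * len(signal_db)
    for _ in range(total_bits):
        nmr = [signal_db[i] - step_db * bits[i] - mask_db[i]
               for i in range(len(bits))]          # noise-to-mask ratios
        worst = max(range(len(bits)), key=lambda i: nmr[i])
        bits[worst] += 1
    return bits

# Two bands: a loud, poorly masked band and a quieter, well-masked one.
allocation = allocate_bits([60, 40], [20, 30], total_bits=8)
```

The loud band with the weak mask absorbs most of the bits, mirroring how the real encoder spends bits where quantization noise would be audible.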
In order to be applicable to a number of very different application scenarios, MPEG defined a
data representation that includes a number of options.
Operating Mode
MPEG-1 audio works for both mono and stereo signals. A technique called joint stereo coding
can be used to do more efficient combined coding of the left and right channels of a stereophonic
audio signal. Layer III allows both mid/side stereo coding and, for lower bit-rates, intensity stereo
coding. Intensity stereo coding allows for lower bit-rates but brings the danger of changing the
sound image. The operating modes are:
- Single channel
- Dual channel (two independent channels, for example containing different language
versions of the audio)
- Stereo (no joint stereo coding) and Joint stereo
References
• D. Yen Pan, Digital Audio Compression, Digital Technical Journal, Vol. 5, No. 2, 1997
• L. Chiariglione, The development of an integrated audiovisual coding standard: MPEG,
Proceedings of the IEEE, Vol. 83, No. 2, February 1995.
• J. L. Mitchell, et al., MPEG Video: Compression Standard, Chapman and Hall, New York,
NY, 1997
• T. Sikora, "MPEG Digital Video--Coding Standards," IEEE Signal Processing Magazine,
Vol. 14, No. 5, September 1997, pp. 82--100.
• D. Le Gall. MPEG: A video compression standard for multimedia applications.
Communications of the ACM, 34(4): 46 - 58, April 1991.
• H.C. Liu and G.L. Zick. Scene decomposition of MPEG compressed video. In Proc. of the
SPIE, volume 2419, pages 26 - 37, 1995.
• D. Pan. A tutorial on MPEG audio compression. IEEE Multimedia Journal, Summer 1995.
• Peter Noll, "MPEG digital audio coding," IEEE Signal Processing Magazine, vol. 14, no.
5, pages 59 - 81, September 1997.
• K. Brandenburg and Marina Bosi. Overview of MPEG audio: Current and future
standards for low bit-rate audio coding. J. Audio Eng. Soc., January 1997
The MPEG-2 Standard
Introduction
Advances in network technology offer new opportunities for multimedia communications systems
that were not possible years ago, with video compression being the core technology. In fact,
compression is part of almost any video or audio storage or transmission process. Uncompressed
video is very large: its data rate is too high for user-level applications and burdens both the CPU
and the communication channel. The ultimate goal of
video source coding is bit-rate reduction. The performance of video compression techniques
depends on the amount of redundancy contained in the image data. Compression consists of
removing redundancies and encoding the true information signal in a form appropriate to
application requirements. Depending on the application's requirements, two different methods
can be used: lossless and lossy coding of the video data. The aim of lossless coding is to
reduce image or video data for storage and transmission while retaining the quality of the original
images; the decoded image quality is required to be identical to the image quality prior to
encoding. In contrast, the aim of lossy coding techniques is to meet a given target bit rate
for storage and transmission; it does not guarantee that the data received is exactly the same
as the data sent. The efficient utilization of compression systems will be an essential element of
delivered quality.
MPEG Standards
MPEG (Moving Pictures Experts Group) is a group of experts that meets under ISO (the
International Organization for Standardization) to develop standards for digital video (sequences of
images in time) and audio compression. In particular, they define a compressed bit stream, which
implicitly defines a decompressor. However, the compression methods are up to the individual
manufacturers, and that is where proprietary advantage is obtained within the scope of a publicly
available international standard.
The following MPEG standards exist:
MPEG-1, a standard for storage and retrieval of moving pictures and audio on storage media.
MPEG-2, a standard for digital television.
MPEG-4, a standard for multimedia applications.
MPEG-7, a content representation standard for information searches.
Overview of MPEG-2
MPEG-2 was finalized in 1994 and resulted in the International Standard ISO/IEC 13818, which
was published in 1995. MPEG-2 AAC was finalized in 1997 and published as the International
Standard ISO/IEC 13818-7. MPEG-2 has been very successful in defining a specification to serve
a range of applications, bit rates, qualities and services. The picture quality through an MPEG-2
codec depends on the complexity and predictability of the source pictures. Real-time coders and
decoders have demonstrated generally good quality standard-definition pictures at bit rates
around 6 Mbit/s. The MPEG-2 concept is similar to MPEG-1, but includes extensions to cover a
wider range of applications. The primary application targeted during the MPEG-2 definition
process was the all-digital transmission of broadcast TV quality video at coded bit rates between
4 and 9 Mbit/sec. However, the MPEG-2 syntax has been found to be efficient for other
applications such as those at higher bit rates and sample rates (e.g. HDTV). The most significant
enhancement over MPEG-1 is the addition of syntax for efficient coding of interlaced video (e.g.
16x8 block size motion compensation). Several other, subtler enhancements are included that
noticeably improve coding efficiency, even for progressive video. Other key features of
MPEG-2 are the scalable extensions which permit the division of a continuous video signal into
two or more coded bit streams representing the video at different resolutions, picture quality, or
picture rates.
MPEG-2 Video Compression
The MPEG-2 video compression algorithm achieves very high rates of compression by exploiting
the redundancy in video information. MPEG-2 removes both the temporal redundancy and spatial
redundancy, which are present in motion video.
Temporal redundancy arises when successive frames of video display images of the same
scene. It is common for the content of the scene to remain fixed or to change only slightly
between successive frames.
Spatial redundancy occurs because picture elements (pels) are often replicated (with
minor changes) within a single frame of video.
Some parts of a video frame may have low spatial redundancy (e.g. complex picture content),
while other parts may have low temporal redundancy (e.g. fast moving sequences). The
compressed video stream is therefore naturally of variable bit rate, whereas transmission links
frequently require fixed transmission rates. The key to controlling the transmission rate is to order
the compressed data in a buffer in order of decreasing detail. Compression may be performed by
selectively discarding some of the information. A minimal impact on overall picture quality can be
achieved by throwing away the most detailed information, while preserving the less detailed
picture content. This will ensure the overall bit rate is limited while suffering minimal impairment of
picture quality.
Video Stream Data Hierarchy
The video bitstream is organized in the following layers:
group of pictures (GOP), picture, slice, macroblock, and block.
A video sequence begins with a sequence header (and may contain additional sequence
headers), includes one or more groups of pictures, and ends with an end-of-sequence code.
Group of Pictures: A header and a series of one or more pictures intended to allow random
access into the sequence. In an MPEG signal the GOP is a group of pictures or frames between
successive I-frames, the others being P and/or B-frames.
Picture: It is the primary coding unit of a video sequence. A picture consists of three rectangular
matrices representing luminance (Y) and two chrominance (Cb and Cr) values. The Y matrix has
an even number of rows and columns. The Cb and Cr matrices are one half the size of the Y
matrix in each direction (horizontal and vertical).
Slice: One or more contiguous macroblocks. The order of the macroblocks within a slice is from
left-to-right and top-to-bottom. Slices are important in the handling of errors. If the bitstream
contains an error, the decoder can skip to the start of the next slice. Having more slices in the
bitstream allows better error concealment, but uses bits that could otherwise be used to improve
picture quality.
Macroblock: A group of picture blocks, usually four (16 x 16 pixels overall), which are analyzed
during MPEG coding to give an estimate of the movement of particular elements of the picture
between frames. This generates the motion vectors, which are then used to place the
macroblocks in decoded pictures. Since each chrominance component has one-half the vertical
and horizontal resolution of the luminance component, a macroblock consists of four Y, one Cr,
and one Cb block.
Block: A rectangular area of the picture, the smallest coding unit in the MPEG algorithm. It is
usually 8 x 8 pixels in size and is individually subjected to DCT coding as part of a digital
picture compression process. It can be one of these three types: luminance (Y), red chrominance
(Cr), or blue chrominance (Cb).
Picture Types
The MPEG standard specifically defines three types of frames. The picture type defines which
prediction modes may be used to code each block. These three types of pictures are defined as
follows:
Intra Pictures (I-Frames)
Predicted Pictures (P-Frames)
Bidirectional Predicted Pictures (B- Frames)
Intra Pictures, or I-Frames, are coded without reference to other pictures, using only information
present in the picture itself. Moderate compression is achieved by reducing spatial redundancy,
but not temporal redundancy. They can be used periodically to provide access points in the
bitstream where decoding can begin.
Predicted Pictures, or P-Frames, can use the previous I-frame or P-frame for motion
compensation and may be used as a reference for further prediction. P-Frames contain only
predictive information (not a whole picture) generated by looking at the difference between the
present frame and the previous one. Each block in a P-frame can either be predicted or intra-
coded. By reducing spatial and temporal redundancy, P-frames offer increased compression
compared to I-frames. A P-frame can be decoded only if its preceding reference frame has been
located and decoded.
Bidirectional Predicted Pictures, or B-Frames, can use the previous and next I-frames or P-
frames, for motion-compensation, and offer the highest degree of compression. Each block in a
B-frame can be forward, backward or bi-directionally predicted or intra-coded. To enable
backward prediction from a future frame, the coder reorders the pictures from natural display
order to bitstream order so that the B-frame is transmitted after the previous and next frames it
references. This introduces a reordering delay that grows with the number of consecutive
B-frames, and B-frames are also the most computationally expensive to code.
The different picture types typically occur in a repeating sequence, termed a Group of Pictures
(GOP). A typical GOP in display order is:
B1 B2 I3 B4 B5 P6 B7 B8 P9 B10 B11 P12
The corresponding bitstream order is:
I3 B1 B2 P6 B4 B5 P9 B7 B8 P12 B10 B11
A regular GOP structure can be described with two parameters: N, which is the number of
pictures in the GOP, and M, which is the spacing of P-frames. The GOP given here is described
as N=12 and M=3. MPEG-2 does not insist on a regular GOP structure. For example, a P-frame
following a shot-change may be badly predicted since the reference picture for prediction is
completely different from the picture being predicted. Thus, it may be beneficial to code it as an I-
picture instead.
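The display-to-bitstream reordering described above can be sketched in a few lines: each B-frame is held back until the reference frame that follows it in display order has been emitted. A minimal illustration, using the frame names from the example:

```python
def bitstream_order(display):
    """Reorder a display-order GOP into bitstream (transmission) order.

    Each reference frame (I or P) is moved ahead of the B-frames
    that depend on it. `display` is a list like ["B1", "B2", "I3", ...].
    """
    out, pending_b = [], []
    for frame in display:
        if frame[0] in ("I", "P"):      # reference frame
            out.append(frame)           # transmit it first...
            out.extend(pending_b)       # ...then the B-frames it enables
            pending_b = []
        else:                           # B-frame: wait for next reference
            pending_b.append(frame)
    out.extend(pending_b)               # trailing B-frames, if any
    return out

display = ["B1", "B2", "I3", "B4", "B5", "P6",
           "B7", "B8", "P9", "B10", "B11", "P12"]
print(bitstream_order(display))
# -> ['I3', 'B1', 'B2', 'P6', 'B4', 'B5', 'P9', 'B7', 'B8', 'P12', 'B10', 'B11']
```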
For a given decoded picture quality, each picture type produces a different number of bits. In a
typical sequence, a coded I-frame is about three times larger than a coded P-frame, which is
itself about 50% larger than a coded B-frame.
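These ratios can be turned into a rough bit-budget estimate. A minimal sketch, taking the example ratios above as given and applying them to the N=12, M=3 GOP described earlier (one I-frame, three P-frames, eight B-frames):

```python
# Hypothetical relative sizes from the text: I = 3 x P, P = 1.5 x B.
b = 1.0           # take a coded B-frame as the unit
p = 1.5 * b
i = 3.0 * p       # = 4.5 B-frame units

# For the N=12, M=3 GOP: 1 I-frame, 3 P-frames, 8 B-frames.
gop_bits = 1 * i + 3 * p + 8 * b     # = 4.5 + 4.5 + 8 = 17 units
avg = gop_bits / 12                  # average units per picture
print(gop_bits, round(avg, 2))       # 17.0 1.42
```

So under these illustrative ratios, the average picture costs about 1.4 B-frame units, well below the cost of an all-I sequence (4.5 units per picture).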
Profile and Levels
MPEG-2 video is an extension of MPEG-1 video. MPEG-1 was targeted at coding progressively
scanned video at bit rates up to about 1.5 Mbit/s. MPEG-2 provides extra algorithmic tools for
efficiently coding interlaced video and supports a wide range of bit rates. It also provides tools for
scalable coding where useful video can be reconstructed from pieces of the total bitstream. The
total bitstream may be structured in layers, starting with a base and adding refinement layers to
reduce quantization distortion or improve resolution.
A small number of subsets of the complete MPEG-2 tool kit have been defined, known as profiles
and levels. A profile is a subset of algorithmic tools and a level identifies a set of constraints on
parameter values (such as picture size or bit rate). The profiles and levels defined to date fit
together such that a higher profile or level is a superset of a lower one. A decoder that supports
a particular profile and level is only required to support the corresponding subset of algorithmic
tools and set of parameter constraints.
Details of profiles
MPEG-2 builds on the powerful video compression capabilities of the MPEG-1 standard to offer a
wide range of coding tools. These have been grouped in profiles to offer different functionalities.
Only the combinations marked with an "X" are recognized by the standard.
MPEG-2 Video profiles

Level       Simple   Main   SNR scalable   Spatial scalable   High   Multiview   4:2:2
High                 X                                        X
High-1440            X                     X                  X
Main        X        X      X                                 X      X           X
Low                  X      X
· Simple profile uses no B-frames, and hence no backward or interpolated prediction;
consequently, no picture reordering is required, which makes this profile suitable for low-delay
applications such as video conferencing.
· Main profile adds support for B-pictures, which improves the picture quality for a given bit-rate
but increases delay. Currently, most MPEG-2 video decoder chip-sets support main profile.
· SNR profile adds support for enhancement layers of DCT coefficient refinement using Signal-to-
Noise Ratio (SNR) scalability.
· Spatial profile adds support for enhancement layers carrying the image at different resolutions
using the spatial scalability tool.
· High profile adds support for 4:2:2 sampled video.
· The Multiview profile (MVP) is an additional profile that uses existing MPEG-2 Video coding
tools to efficiently encode two video sequences from two cameras shooting the same scene with
a small angle between them.
Since the final approval of MPEG-2 Video in November 1994, one additional profile has been
developed. It uses existing MPEG-2 Video coding tools but is capable of dealing with pictures
having a colour resolution of 4:2:2 and a higher bit rate. Even though MPEG-2 Video was not
developed with studio applications in mind, a set of comparison tests carried out by MPEG
confirmed that MPEG-2 Video was at least as good as, and in many cases better than, standards
or specifications developed for high-bitrate or studio applications. The 4:2:2 profile was approved
in January 1996 and is now an integral part of MPEG-2 Video.
Details of levels
MPEG-2 defines four levels of coding parameter constraints. This table shows the constraints on
picture size, frame rate, bit rate and buffer size for each of the defined levels. MPEG-2 levels:
Picture size, frame-rate and bit rate constraints.
Level       Max. width   Max. height   Max. frame   Max. bit rate   Buffer size
            (pixels)     (lines)       rate (Hz)    (Mbit/s)        (bits)
Low         352          288           30           4               475136
Main        720          576           30           15              1835008
High-1440   1440         1152          60           60              7340032
High        1920         1152          60           80              9781248
In broadcasting terms, standard-definition TV requires main level and high-definition TV requires
high-1440 level. The bit rate required to achieve a particular level of picture quality approximately
scales with resolution.
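As an illustration of how a decoder or encoder might apply these constraints, the following sketch picks the lowest level whose limits a stream satisfies. It checks only the tabulated picture-size, frame-rate and bit-rate maxima; a real conformance check would also verify buffer occupancy and profile-specific rules:

```python
# Level constraints from the table above: (max width in pixels, max height
# in lines, max frame rate in Hz, max bit rate in Mbit/s, buffer size in bits).
LEVELS = {
    "Low":       (352,  288,  30, 4,  475136),
    "Main":      (720,  576,  30, 15, 1835008),
    "High-1440": (1440, 1152, 60, 60, 7340032),
    "High":      (1920, 1152, 60, 80, 9781248),
}

def minimum_level(width, height, fps, mbit_s):
    """Return the lowest level whose constraints the stream satisfies."""
    for name in ("Low", "Main", "High-1440", "High"):
        w, h, f, r, _ = LEVELS[name]
        if width <= w and height <= h and fps <= f and mbit_s <= r:
            return name
    return None   # exceeds even High level

print(minimum_level(720, 576, 25, 6))     # standard-definition PAL -> Main
print(minimum_level(1920, 1080, 30, 20))  # HD picture -> High
```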
Interlaced Video and Picture Structures
MPEG-2 supports two scanning methods: progressive scanning and interlaced scanning.
Interlaced scanning scans the odd lines of a frame as one field (the odd field) and the even lines
as another field (the even field). Progressive scanning scans consecutive lines in sequential
order.
An interlaced video sequence uses one of two picture structures: frame structure and field
structure. In the frame structure, the lines of the two fields alternate and the two fields are coded
together as a frame; one picture header is used for both fields. In the field structure, the two
fields of a frame may be coded independently of each other, and the even field follows the odd
field. Each of the two fields has its own picture header.
An interlaced video sequence can switch between frame structure and field structure on a
picture-by-picture basis. In contrast, every picture in a progressive video sequence is a frame
picture.
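The relationship between a frame and its two fields is simple line interleaving, which can be illustrated directly:

```python
# A frame-structure picture interleaves the two fields line by line; a
# field-structure picture codes them separately. Splitting a frame into
# its two fields is just taking alternate lines (toy 8-line frame):
frame = [f"line{n}" for n in range(8)]

odd_field  = frame[0::2]   # lines 0, 2, 4, 6  (first/odd field)
even_field = frame[1::2]   # lines 1, 3, 5, 7  (second/even field)

# Re-interleaving the two fields recovers the original frame:
rebuilt = [line for pair in zip(odd_field, even_field) for line in pair]
assert rebuilt == frame
print(odd_field)   # ['line0', 'line2', 'line4', 'line6']
```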
MPEG-2 standard features
Because the MPEG-2 standard provides good compression using standard algorithms, it has
become the standard for digital TV. It has the following features:
Video compression which is backwards compatible with MPEG-1
Full-screen interlaced and/or progressive video (for TV and Computer displays)
Enhanced audio coding (high quality, mono, stereo, and other audio features)
Transport multiplexing (combining different MPEG streams in a single transmission stream)
Other services (GUI, interaction, encryption, data transmission, etc)
The list of systems which now (or will soon) use MPEG-2 is extensive and continuously growing:
digital TV (cable, satellite and terrestrial broadcast), Video on Demand, Digital Versatile Disc
(DVD), personal computing, card payment, test and measurement, etc.
MPEG-2 Audio
The major advantages of data compression are the reduced memory and transmission bandwidth
required by the compressed signal. Compression is useful whenever resources are scarce or
expensive; digital broadcasting and audio transmission over the Internet are two major application
domains of audio coding, and it is even gaining ground in modern cinema sound systems.
International cooperation between the Fraunhofer Institute and companies such as AT&T, Sony
and Dolby produced the most efficient MPEG method for audio data compression to date. It is
called MPEG-2 Advanced Audio Coding (AAC) and was declared an international standard by
MPEG at the end of April 1997.
MPEG-2 Advanced Audio Coding
Advanced Audio Coding (AAC) is one of the formats defined by the MPEG-2 standard. It provides
low-bitrate coding for multichannel audio, generally five full-bandwidth channels (left, right,
center and two surround channels) as used in cinemas today. Like all perceptual coding
schemes, MPEG-2 AAC exploits the signal-masking properties of the human ear to reduce the
amount of data: quantization noise is distributed across frequency bands in such a way that it is
masked by the total signal. The MPEG-2 Audio Standard also extends the stereo and mono
coding of the MPEG-1 Audio Standard (ISO/IEC IS 11172-3) to half sampling rates (16 kHz,
22.05 kHz and 24 kHz) for improved quality at bit rates at or below 64 kbit/s per channel.
MPEG-2 audio standardization in comparison to MPEG-1
The first phase, MPEG-1, dealt with mono and two-channel stereo sound coding at the sampling
frequencies commonly used for high-quality audio (48, 44.1 and 32 kHz).
The second phase, MPEG-2, contains different work items.
· The extension of MPEG-1 to lower sampling frequencies (16 kHz, 22.05 kHz and 24 kHz),
providing better sound quality at very low bit rates (below 64 kbit/s for a mono channel). This
extension is easily added to an MPEG-1 audio decoder because it mainly implies the inclusion
of some additional tables.
· The backward-compatible extension of MPEG-1 to multichannel sound. MPEG-2 BC supports
up to five full-bandwidth channels plus one low-frequency enhancement channel. This
multichannel extension is both forward and backward compatible with MPEG-1: an MPEG-2 BC
stream adheres to the structure of an MPEG-1 bitstream, so it can be read and interpreted by an
MPEG-1 audio decoder.
· A new coding scheme called Advanced Audio Coding. An AAC bitstream is not backward
compatible, i.e. it cannot be read and interpreted by an MPEG-1 audio decoder.
Both MPEG-1 and the first two work items of MPEG-2 share the three-layer structure. The
original MPEG-2 audio standard contained only the first two work items and was finalized in
1994. In order to improve coding efficiency for the 5-channel case, a non-backward-compatible
audio coding scheme (AAC) was defined and finalized in 1997.
Differences between MPEG-2 AAC and its predecessor ISO/MPEG Audio
Layer-3
Filter bank: in contrast to the hybrid filter bank of ISO/MPEG Audio Layer-3, chosen for reasons
of compatibility but displaying certain structural weaknesses, MPEG-2 AAC uses a plain Modified
Discrete Cosine Transform (MDCT). Together with the increased window length (2048 samples
instead of 1152 per transform), the MDCT outperforms the filter banks of previous coding
methods.
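For illustration, the plain MDCT can be written directly from its definition. This toy version omits the windowing and 50% block overlap a real AAC filter bank applies:

```python
import math

def mdct(x):
    """Plain MDCT of a block of 2N samples -> N spectral lines.

    A sketch only: a real encoder also applies a window function and
    overlaps successive blocks by 50%.
    """
    N = len(x) // 2
    return [
        sum(x[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
            for n in range(2 * N))
        for k in range(N)
    ]

# AAC long blocks transform 2N = 2048 input samples into 1024 spectral
# lines; here a toy block of 8 samples yields 4 lines.
coeffs = mdct([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
print(len(coeffs))   # 4
```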
Temporal Noise Shaping TNS: A true novelty in the area of time/frequency coding schemes. It
shapes the distribution of quantization noise in time by prediction in the frequency domain. Voice
signals in particular experience considerable improvement through TNS.
Prediction: A technique commonly established in the area of speech coding systems. It benefits
from the fact that a certain type of audio signal is easy to predict.
Quantization: by allowing finer control of quantization resolution, the given bit rate can be used
more efficiently.
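As an illustration of this kind of non-uniform quantization, the following hedged sketch uses the 3/4-power companding and quarter-power-of-two scalefactor steps commonly described for AAC-style coders; the exact constants and rounding are illustrative, not taken from the standard text:

```python
# Hedged sketch: spectral values are companded with a 3/4 power law before
# rounding, and the scalefactor (in steps of 2^(1/4), i.e. 1.5 dB) gives
# the fine-grained control over quantization resolution mentioned above.
def quantize(value, scalefactor):
    step = 2 ** (scalefactor / 4)            # 2^(1/4) per scalefactor step
    return round(abs(value / step) ** 0.75)  # 3/4-power companding

def dequantize(q, scalefactor):
    step = 2 ** (scalefactor / 4)
    return (q ** (4 / 3)) * step             # inverse 4/3-power expansion

q = quantize(100.0, 8)                       # larger scalefactor = coarser
print(q, round(dequantize(q, 8), 1))         # 11 97.9
```

The power law spends quantization accuracy where the ear needs it: small spectral values are represented relatively finely, large ones more coarsely.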
Bit-stream format: The information to be transmitted undergoes entropy coding in order to keep
redundancy as low as possible. The optimization of these coding methods together with a flexible
bit-stream structure has made further improvement of the coding efficiency possible.
References
[1] MPEG Software Simulation Group, MPEG-2 Video Codec, 1996. http://www.mpeg.org/index.html/MSSG/#source
[2] IEE J Langham Thompson Prize. Electronics & Communication Engineering Journal, December 1995.
[3] B. Timsari and C. Shahabi, MPEG Coding for Video Compression. University of Southern California, Computer Science Department, Los Angeles, CA.
[4] T. Sikora, MPEG-1 and MPEG-2 Digital Video Coding Standards. Heinrich-Hertz-Institute Berlin, Image Processing Department.
[5] S. Aign, "A Temporal Error Concealment Technique for I-Pictures in an MPEG-2 Video-Decoder," in SPIE Visual Communication and Image Processing '98, vol. 3309, pp. 405-416, San Jose, Jan. 1998.
[6] B.G. Haskell, A. Puri and A.N. Netravali, Digital Video: An Introduction to MPEG-2. Digital Multimedia Standards Series, Chapman & Hall, 1997.
[7] M. Tanaka, "Digital Broadcasting Boosts Demodulator ICs," Nikkei Electronics Asia, vol. 4, no. 12, pp. 42-47, December 1995.
[8] ISO/IEC 13818, Generic Coding of Moving Pictures and Associated Audio Information, 1996.
[9] P.N. Tudor and O.H. Werner, "Real-Time Transcoding of MPEG-2 Video Bit Streams," International Broadcasting Convention, September 1997, pp. 226-301.
[10] MPEG-2, ISO/IEC 13818, Generic Coding of Moving Pictures and Associated Audio Information, Nov. 1994.
[11] International Telecommunication Union, Radio Communication Study Groups, A Guide to Digital Terrestrial Television Broadcasting in the VHF/UHF Bands.
The MPEG-4 Standard
Brief Overview of MPEG Standards Progression and Naming
The MPEG-1 (1992) and MPEG-2 (1995) standards were defined to standardize, respectively,
the storage of audiovisual information on CD-ROMs and the coding of digital TV and HDTV. In
mid-1993 work began on MPEG-3, the goal being a standard for HDTV. Shortly afterwards the
standard was abandoned because of its similarity to MPEG-2, and the HDTV tools were instead
included in MPEG-2.
The growing field of multimedia began to encompass a more diverse range of applications and
was taken up by the telecommunications and consumer electronics industries. Application
demands became more complex, requiring point-to-point and multipoint-to-multipoint
communications, object manipulation, online editing and so on. It became clear that the existing
standards were not flexible enough to handle this range of applications. Consequently, in 1999
MPEG-4 became the next member of the well-known family of ISO/MPEG coding standards.
After the release of the MPEG-4 Version 1 standard, the Version 2 standards addendum was
published in 2000.
Since October 1996 MPEG has been working on its fourth standard, MPEG-7. The MPEG-7 [i]
standard was originally called "Multimedia Content Description Interface". MPEG-7 will provide a
rich set of standardized tools to describe multimedia content. Both human users and automatic
systems that process audiovisual information are within its scope. The MPEG-7 work plan calls
for conclusion of conformance testing in September 2002.
On the horizon is MPEG-21 [ii], "Multimedia Framework", begun in June 2000. The aim of
MPEG-21 is to fit together the various existing elements that build an infrastructure for the
delivery and consumption of multimedia content; where gaps exist, MPEG-21 will recommend
which new standards are required. The schedule for MPEG-21 gives a finishing date of
February 2009.
Fig 1. Anatomy of an MPEG-4 Scene [iii]
Overview of concept and need for MPEG-4
The aim of the MPEG-4 standard was to provide standardized core technologies allowing efficient
storage, transmission and manipulation of video and audio data in multimedia environments in areas
that were not perceived to be well covered by the MPEG-1 and MPEG-2 standards. These are
described below [iv].
The main advance of MPEG-4 over the previous standards is its ability to provide content-based
interaction between the user and the system, which can be viewed in terms of: a) multimedia
data access tools, b) manipulation and bitstream editing, c) hybrid natural and synthetic data
coding and d) improved temporal random access [v]. The user can explore different scenarios
and change object properties interactively. The requirements of multimedia applications have
become increasingly diverse, necessitating the handling of data such as still images, video,
stereo images, natural and synthetic content, text, medical imagery and so on. The MPEG-4
standard supports a diversity of objects to meet these requirements.
In MPEG-4, compression has been addressed in the form of improved coding efficiency and
coding of multiple concurrent data streams. Modern devices are not only those associated with
high bit-rate applications; there are many applications at the other end of the spectrum, such as
video telephony, operating at very low bit rates. The MPEG-4 standard allows replay of a file at
different bit rates selected according to the hardware, yet the file is encoded only once; this can
be viewed as content-based scalability.
In today’s online environment downloading and copying of files is very difficult to control. For this
reason, establishing ownership of the file content is an important issue. The MPEG-4 standard
includes methods for protection of intellectual property.
Features of MPEG-4.
Object Based System
The MPEG-1 and MPEG-2 standards were mainly concerned with data compression and treat
the data as a single entity; for example, video data is encoded as a series of rectangular frames
and non-contextual rectangular macroblocks. The MPEG-4 standard is object based: audio-visual
scenes are made up of audio-visual objects composed together according to a scene description.
The objects may exist singly, but a set of related objects can be used to form an MPEG-4 scene.
The objects may be semantically significant and non-rectangular according to the application.
This object-based scheme allows interaction with individual elements within the audio-visual
scene. Many different coding schemes exist for individual objects, which facilitates easy re-use
of audio-visual content in the different environments in which it is used.
Audio visual objects can vary widely in nature as shown in the list below:
a) Audio (single or multi-channel);
b) Video (arbitrary shape or rectangular);
c) Natural (natural audio or video);
d) Synthetic (text and graphics, animated faces, synthetic music);
e) 2D (Web like pages);
f) 3D (Spatialized sound, 3D virtual world);
g) Streamed (Video movie);
h) Download (audio jingle).
Interaction is accomplished by describing the video objects in a two-dimensional or even
three-dimensional space, and the audio objects in a sound space. In most applications each
video object represents a semantically meaningful object in the scene. Each object is
represented by a three-component (Y, U and V) colour system plus information about its shape.
This information is stored frame after frame at predefined temporal intervals. Objects are defined
once and changes made by the user are calculated locally, so reliance on transmission speed is
reduced and a fast response can be more easily achieved.
An interaction language for the video objects is used to send commands from the user [vi]. It is
called BIFS (Binary Format for Scenes). BIFS can be used to add and delete objects from a
scene; it can also be used to animate objects and to change object properties such as texture or
colour. The roots of the BIFS language can be found in VRML (Virtual Reality Modeling
Language), widely used for internet applications. BIFS, however, has made significant advances
and improvements: BIFS code is as much as 10-15 times shorter than equivalent VRML code,
making the system that much more efficient. BIFS is also used for real-time streaming; that is,
a scene can be played before it is fully downloaded, being built on the fly by BIFS. VRML does
not make provision for this type of application.
Transportation /Communications
The transportation and storage of MPEG-4 data is handled using multiple streams rather than a
single stream containing all information. Each object may have its own stream; audio and visual
streams are separate, and streams containing BIFS are separated from the stream carrying the
object itself. Streams exist solely for the purposes of time-stamping and synchronization. For
example, a scaleable object may have a single stream defining basic-quality information and
separate streams carrying information for enhancement layers.
The stream containing BIFS commands may carry instructions for object placement, scaling and
motion. An additional stream called the "Object Descriptor" (OD) is the master stream describing
the contents of the other "Elementary Streams" (ESs); it informs the system which decoders to
use to decode each object. The concept of dividing the information into separate streams is
fundamental to providing the interactivity that allows addition, deletion and separate handling of
objects.
In the transportation of the data, a separate layer is dedicated to the synchronization of the
elementary streams. Each stream is divided into packets and then passed to the network
transport layer. Actual data transportation is left to existing networking technologies such as
ATM, RTP, etc.
The MPEG-4 standard does contain a device called "FlexMux" [vii] that handles the interface
between the multiple streams and the network transport layer. This entails the synchronized
delivery of streaming information from source to destination. Different QoS levels are exploited
as they are available from the network. FlexMux is specified in terms of the synchronization
layer and a delivery layer containing a two-layer multiplexer.
DMIF stands for Delivery Multimedia Integration Framework. DMIF is found in part 6 of the
MPEG-4 standard [3] and manages the first multiplexing layer. This multiplex may be
embodied by the MPEG-defined FlexMux tool, which allows grouping of Elementary Streams
(ESs) with a low multiplexing overhead.
The second layer, or "TransMux" (Transport Multiplexing) layer, models the layer that offers
transport services matching the requested QoS. Only the interface to this layer is specified by
MPEG-4; other signaling, packet handling, etc. must be done in cooperation with a suitable
existing transport protocol stack such as (RTP)/UDP/IP or (AAL5)/ATM. The choice is left to the
end user/service provider, which allows MPEG-4 to be used in a wide variety of operating
environments.
Two types of timestamps are used for incoming streams: one gives the time at which the
information must be decoded, the other the time at which it must be presented. The timing
scheme is relative and is referenced against a master clock set at the time of encoding.
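These two timestamp types correspond to what is commonly called a decoding timestamp and a presentation timestamp. The sketch below, reusing the GOP example from the MPEG-2 section, shows how the two diverge when B-frames reorder the stream. Times are in picture periods and are illustrative; a real system adds a constant decoder delay so that presentation never precedes decoding:

```python
# With B-frames, bitstream (transmission) order differs from display
# order, so decode and presentation times diverge per frame.
bitstream = ["I3", "B1", "B2", "P6", "B4", "B5"]   # transmission order
display   = ["B1", "B2", "I3", "B4", "B5", "P6"]   # natural display order

for frame in bitstream:
    dts = bitstream.index(frame)   # decoded in arrival order
    pts = display.index(frame)     # presented in natural order
    print(f"{frame}: decode at t={dts}, present at t={pts}")
```

Note that I3 is decoded first (t=0) but presented third (t=2): the decoder must hold it while the B-frames that reference it are decoded and displayed.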
Security Features
Theft of intellectual property is an increasingly difficult problem to combat. MPEG-4 plays a
twofold role in protection. First, the elementary stream used to transport the data contains its
identification and ownership. Secondly, the standard provides an interface for conditional
access, which enables the use of proprietary systems that handle intellectual property
protection; an example of such a system is a pay-television service.
An example of digital watermarking using MPEG-4 for facial animation data sets has been
presented by Hartung, Eisert and Girod [viii]. Watermarking is the process of embedding
information about copyright and data ownership within a video stream. It was shown that the
watermark could be reliably embedded in and retrieved from the data stream, even with different
choices of watermark data rate.
Low Bit-rate Applications
To be feasible in the world of mobile devices [ix], the MPEG-4 standard must extend itself to
cover low bit-rate applications, which may need to provide streaming at rates as low as 10 kb/s.
In addition, the lack of redundancy in the compressed data makes the mobile environment more
error prone.
The feature of MPEG-4 that accommodates the need for such low transmission rates, is the
use of scaleable objects. A base layer sends the basic quality data, with additional layers that
can be added to enhance the presentation if the bit-rate is available. The fact that the data is
presented as separate objects allows the encoder to send the most important objects in a
scene first, and provide these objects with a superior level of error protection. This means
that when encoding, the provider does not need to ask the receiver what bit rate they can
handle and then make an individual encoding for each of the bit-rates. The data can be
encoded a single time, with scaleable parameters. The bit-rate can be sensed automatically
before playout or even adjusted during playout.
Spatial scalability implies that a frame or object can be decoded with a varying level of quality,
quantified by the peak signal-to-noise ratio. Temporal scalability implies that each individual
object can be decoded at different frame rates, and object scalability implies that any object can
be accessed in the sequence. Both spatial and temporal scalability are implemented using
multiple video object layers: a base layer and an enhancement layer. For spatial scalability the
enhancement layer improves the spatial resolution and thus the visual quality; for temporal
scalability the enhancement layer can provide increased frame rates and improve the
smoothness of motion.
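The peak signal-to-noise ratio used to quantify this has a simple closed form, PSNR = 10 log10(MAX^2 / MSE), with MAX = 255 for 8-bit samples. A minimal sketch on toy sample values (the numbers are illustrative only):

```python
import math

def psnr(original, decoded, peak=255):
    """Peak signal-to-noise ratio in dB between two sample sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return float("inf")          # identical pictures
    return 10 * math.log10(peak ** 2 / mse)

truth = [100, 100, 100, 100]         # source samples
base  = [100, 102, 97, 101]          # base-layer reconstruction
enh   = [100, 100, 100, 100]         # after adding the enhancement layer
print(round(psnr(truth, base), 1))   # finite PSNR for the base layer
print(psnr(truth, enh))              # inf: perfect reconstruction
```

Adding an enhancement layer reduces the mean squared error of the reconstruction and therefore raises the PSNR.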
Sprites can be used to encode an unchanging background and can be useful for reducing the
volume of bits transmitted. For example, a background sprite defining the image of the
background need be sent only once. Subsequently, new views are created by simply sending
the new positions of four pre-defined points. The use of sprites in this manner enhances the
efficiency of compression. In addition, sprite-based coding is very suitable for synthetic objects,
but can also be used for objects in natural scenes that undergo rigid motion. The MPEG-4 video
model considers two kinds of sprites: static sprites, which are generated off-line and are better
suited to synthetic objects, and dynamic sprites, which are constructed online during the
encoding process and are more suitable for representing natural objects [4].
Image Mapping
MPEG-4 provides tools for describing both the classical “rectangular video” and arbitrary shapes.
Two methods of including arbitrary shapes are provided. The first method deals with so-called
binary shapes: a predefined pixel either is, or is not, part of an object. The pixel may be defined
in terms of its colour, brightness or another parameter. The results can be somewhat primitive,
presenting jagged or fuzzy outlines, but the advantage of this technique lies in its simplicity.
The second method is known as an "alpha" shape, or "gray-scale" shape: the pixel definition
also contains a transparency value, providing a tool for smooth blending of the shape with the
background or with other shapes.
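At the level of a single pixel the difference between the two shape types can be illustrated with ordinary alpha compositing; a binary shape is simply the two extreme transparency values:

```python
# A binary shape gives a hard in/out decision per pixel, while a gray-scale
# ("alpha") shape carries a transparency value (0 = transparent,
# 255 = opaque) that blends the object smoothly with the background.
def composite(fg, bg, alpha):
    """Blend a foreground object pixel over the background pixel."""
    a = alpha / 255
    return round(a * fg + (1 - a) * bg)

fg, bg = 200, 50
print(composite(fg, bg, 255))   # binary "inside":  200 (object pixel)
print(composite(fg, bg, 0))     # binary "outside": 50  (background)
print(composite(fg, bg, 128))   # alpha edge pixel: 125 (smooth mix)
```

The intermediate alpha values are what remove the jagged outlines that a purely binary shape produces at object edges.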
In addition to standards for the description of shapes, MPEG-4 allows the mapping of images
onto computer-generated shapes, thus generating synthetic objects. An image can be mapped
onto a two- or three-dimensional mesh, and commands can then be used to deform the mesh
and produce the effect of motion. For example, faces can be mapped onto a mesh to form
avatars that can be used as an online stand-in for a human or a synthesized being [x, xi].
The video encoding procedure has several steps. First, the objects are formed from the incoming
video bitstream. The resulting individual object bitstreams are then coded. The individual object
properties are encoded by a coding control unit, which decides which objects are to be
transmitted, the number of layers each has and the level of scalability. The individual video
object bitstreams are then sent through a multiplexer that merges them into a single output
bitstream.
There are particular characteristics that are associated with the MPEG-4 video object structure.
The texture is represented in YUV colour coordinates, and up to 12 bits may be used to represent
a pixel component value. Shape and texture are available for each "snapshot" of the video
object; a "snapshot" is called a Video Object Plane (VOP). A Video Object Plane is implemented
in blocks of 16x16 pixels, thus providing backwards compatibility with previous standards. The
shape information for a video object plane is defined explicitly, since the object is expected to
be non-rectangular. This is called the alpha-plane, and it is specified using two components: one
is a binary array that defines the bounding box of the video object plane and specifies whether
or not an input pixel belongs to the object; the second is a transparency value, defined with 0
being transparent and 255 opaque, called the gray-scale shape.
Other terminology used in the structure includes the "Video Session", which is a group of video
objects that form the scene; the "Video Object Layer", which contains information to support the
temporal and spatial scalability of a video object; and the "Video Object Plane", which is a
time-sampled version of any one video object (as mentioned previously, a "snapshot").
Motion detection is achieved using a technique similar to earlier standards, except that each
object is again encoded individually. There are three types of planes: the I-Video Object Plane,
the P-Video Object Plane and the B-Video Object Plane. As with frames, the I-VOP is encoded
independently, the P-VOP is predicted from previously decoded planes, and the B-VOP is a
bidirectionally interpolated plane based on I- and P-planes. Motion estimation is performed using
16x16 or 8x8 macroblocks in the same way as in previous MPEG standards [xii].
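The block-matching idea behind this motion estimation can be sketched in one dimension: for a block of the current picture, search the previous decoded picture around the co-located position for the offset with the lowest sum of absolute differences (SAD). The data below is a toy example, illustrative only:

```python
def best_match(block, reference, pos, search_range):
    """Full-search SAD block matching around co-located position `pos`.

    Returns (offset, sad): the motion-vector offset with the lowest sum
    of absolute differences, as a P-VOP predictor would compute it.
    """
    best = (0, float("inf"))
    for off in range(-search_range, search_range + 1):
        start = pos + off
        if start < 0 or start + len(block) > len(reference):
            continue                         # candidate outside the picture
        cand = reference[start:start + len(block)]
        sad = sum(abs(a - b) for a, b in zip(block, cand))
        if sad < best[1]:
            best = (off, sad)
    return best

reference = [0, 0, 0, 10, 20, 30, 40, 0]     # previous decoded picture row
current   = [0, 0, 10, 20, 30, 40, 0, 0]     # object has shifted by one
block = current[2:6]                          # "macroblock" at position 2
print(best_match(block, reference, pos=2, search_range=2))   # (1, 0)
```

A real encoder runs this search over 16x16 (or 8x8) pixel blocks in two dimensions, then codes only the motion vector and the prediction residual.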
Structured Audio and Sound Capability
Csound is a popular synthesis language and was used by the Massachusetts Institute of
Technology to develop NetSound. NetSound provides the foundation of the MPEG-4 provisions
for "structured audio". Structured audio is a format for describing methods of synthesis [xiii].
Descriptors for elements such as oscillators and digital filters are specified as signal-processing
elements, and small networks of them are chosen to create specific sounds. Ultimately each
network can be used to define what is termed an "instrument", regardless of whether it
synthesizes a traditional instrument such as a violin, or a fire alarm. SAOL (Structured Audio
Orchestra Language) and SASL (Structured Audio Scoring Language) are used to combine
sound objects and "score" the production.
Popular audio options such as MIDI and other mass-market synthesizers are included, but their
repertoire of instruments is severely limited by comparison.
MPEG-4 uses a similar concept with sound as it does with video, in that individual sound objects
are placed in a sound space. Thus different mixes can be defined for different listening conditions
and hardware. It also means that if a user moves a video object in a scene, the audio soundtrack
can be moved along with it since the speech and the background soundtrack can be included as
separate audio objects.
“Environmental Spatialization” in MPEG-4 Version 2, is so called because it can change how a
sound object is heard depending on the room definition sent to the decoder. This works locally at
the terminal, thus saving on bit transmission.
Summary
Although work began with the goal of standardizing low bit-rate encoding, the MPEG-4 standard
has made the step into object-based multimedia and has standardized the concept of audio,
visual and interactive objects combining to form a multimedia presentation. In comparison
testing between MPEG-1, MPEG-2, H.263 and MPEG-4 at various bitrates, the MPEG-4
standard gave the highest image-quality test results for video encoding [4].
References
1. Overview of the MPEG-7 Standard (version 4.0), ISO/IEC JTC1/SC29/WG11, La Baule, October 2000.
2. ISO/IEC TR 18034-1:2001(E) Part 1: Vision, Technologies and Strategy, MPEG, Document ISO/IEC JTC1/SC29/WG11 N3939.
3. R. Koenen, "MPEG-4: Multimedia for our time." IEEE Spectrum, 36(2):26-33, February 1999.
4. J. Liang, "New Trends in Multimedia Standards: MPEG-4 and JPEG2000." Informing Science, Special Issue on Multimedia Informing Technologies, Part 1, Vol. 2, No. 4, 1999.
5. T. Ebrahimi, "MPEG video verification model: a video encoding/decoding algorithm based on content representation." Image Communication, Special Issue on MPEG-4, October 1997.
6. H. Kalva, "MPEG-4 Systems and Applications." Columbia University, 1312 S.W. Mudd Building, New York, NY 10027...
7. ISO MPEG-4 Standard Version 1 (ISO/IEC 14496).
8. F. Hartung, P. Eisert and B. Girod, "Digital Watermarking of MPEG-4 Facial Animation Parameters." Computers and Graphics, Vol. 22, No. 3.
9. A. Puri and A. Eleftheriadis, "MPEG-4: A Multimedia Coding Standard Supporting Mobile Applications." ACM Mobile Networks and Applications Journal, Special Issue on Mobile Multimedia Communications, Vol. 3, No. 1, June 1998, pp. 5-32 (invited paper).
10. J. Ostermann, M. Beutnagel, A. Fischer and Y. Wang, "Integration of Talking Heads and Text-to-Speech Synthesizers for Visual TTS." AT&T Labs Research, Institut Eurecom/EPFL, Polytechnic University.
11. J. Ostermann and E. Haratsch, "An animation definition interface: Rapid design of MPEG-4 compliant animated faces and bodies." International Workshop on Synthetic-Natural Hybrid Coding and Three-Dimensional Imaging, pp. 216-219, Rhodes, Greece, September 5-9, 1997.
12. P. Fleury, S. Bhattacharjee, L. Piron, T. Ebrahimi and K. Murat, "MPEG-4 Video Verification Model: A Solution for Interactive Multimedia Applications." SPIE Journal of Electronic Imaging, Vol. 7, No. 3, pp. 502-515, July 1998.
13. E.D. Scheirer and B.L. Vercoe, "SAOL: The MPEG-4 Structured Audio Orchestra Language." Computer Music Journal, 23:2, pp. 31-51, Summer 1999.