
A NEW ALGORITHM FOR VOICE SIGNAL COMPRESSION (VSC) & ANALYSIS SUITABLE FOR LIMITED STORAGE DEVICES USING MATLAB

PROF. VIJAY K CHAUDHARI (Head of Information Technology)

Technocrats Institute of Technology, Bhopal (M.P.)-462021, India E-mail: [email protected]

Mobile: +91-9893181833

DR. MANISH SRIVASTAVA (Head Computer Science & Engineering)

Institute of Technology, Guru Ghasidas University, Bilaspur (C.G.), India Email: [email protected]

Mobile: +91-9827116390

DR. R. K. SINGH Director, MATS University, Raipur (C.G.), India

Email: [email protected] Mobile: +91-9893623490

SHIV KUMAR

M. Tech. (IT), Technocrats Institute of Technology, Bhopal (M.P.)-462021, India E-mail: [email protected]

Mobile: +91-9827318266

ABSTRACT

In this paper, Voice Signal Compression (VSC) is a technique used to convert a voice signal into an encoded form which, when decompression is required, can be decoded to the closest approximation of the original signal. This work presents a new algorithm to compress voice signals by using an "Adaptive Wavelet Packet Decomposition and Psychoacoustic Model". The main goals of this paper are:

i) Transparent compression (proposed 48% to 50%) of a high-quality voice signal at about 45 kbps with the same extension (i.e., .wav to .wav).

ii) To compare the compressed voice signal with the original voice signal with the help of distortion analysis and the frequency spectrum.

iii) To reduce noise in the compressed file as much as possible and to calculate the SNR (Signal-to-Noise Ratio).

To do this, a filter bank is used according to psychoacoustic model criteria and the computational complexity of the decoder. A bit allocation method is used that also takes its input from the psychoacoustic model. The filter bank structure yields a quality-of-performance measure in the form of a per-subband perceptual rate, computed as perceptual entropy (PE). The decoder obtains the best reconstruction possible given the size of the output produced at the encoder. The result is a variable-rate compression scheme for high-quality voice signals.

This work is well suited to high-quality voice signal transfer for Internet and storage applications.

KEYWORDS

Matlab 6.5, Wavelet Toolbox, Psychoacoustic Model, Algorithm

1. INTRODUCTION

Humans can hear frequencies in the range from 20 Hz to 20 kHz. However, this does not mean that all frequencies are heard in the same way. One could assume that a human hears the frequencies that make up speech better than others; this is a good guess. Furthermore, one could also hypothesize that hearing a tone becomes more difficult as its frequency nears either of the extremes. The human auditory system has a perceptual property called auditory masking: a strong signal can render a weaker signal in its temporal or spectral neighborhood imperceptible, so the masking effect can be observed in both the time and frequency domains. If two sounds occur at the same time and one sound is masked by the other, this is called simultaneous masking or frequency masking. A weak sound emitted after the end of a louder sound is masked by the louder sound; similarly, a weak sound just before a louder sound can be masked by it. These two effects are called forward and backward temporal masking, respectively. Temporal masking attenuates exponentially from the masker's onset (about 10 ms) and offset (about 50 ms).



The masking technique is used to compute a masking threshold that can be used to compress a digital signal. The masking threshold allows the SNR (Signal-to-Noise Ratio), and therefore the number of bits, to be reduced. The masking threshold is computed using simultaneous masking, temporal masking, and the frequency response of the ear. These masking models are called Psychoacoustic Models (PAM). The original PAM is shown in Figure 1.

The PAM in Figure 1 is taken from: Tsung-Han Tsai, Yi-Wen Wang, Shih-Way Hung, "An MDCT-based psychoacoustic model co-processor design for MPEG-2/4 AAC audio encoder", Proceedings of the 7th International Conference on Digital Audio Effects (DAFx'04), Naples, Italy, October 5-8, 2004. http://www.mp3-tech.org/programmer/docs/P_335.pdf

The PAM determines the sound quality of a given encoder and strongly influences its computational complexity. The PAM calculates a masking threshold, which is the maximum distortion energy masked by the signal energy for each coding portion. The psychoacoustic model is based on many studies of human perception. These studies have shown that the average human does not hear all frequencies the same way. Effects due to different sounds in the environment and limitations of the human sensory system lead to facts that can be used to cut out unnecessary data in a voice signal. The two main properties of the human auditory system that make up the psychoacoustic model are:

1. Absolute threshold of hearing
2. Auditory masking

Each provides a way of determining which portions of a signal are inaudible and indiscernible to the average human, and can thus be removed from a signal.

1.1 Absolute Threshold of Hearing

Because humans hear lower frequencies, like those making up speech, better than others, like high frequencies around 20 kHz, the ear is better at detecting differences in pitch at low frequencies than at high ones. After many studies, scientists found that the frequency range from 20 Hz to 20,000 Hz can be broken up into critical bandwidths, which are non-uniform, non-linear, and dependent on the heard sound. Signals within one critical bandwidth are hard for a human observer to separate. A more uniform measure of frequency based on critical bandwidths is the Bark. From the observations above, one would expect a Bark bandwidth to be smaller at low frequencies (in Hz) and larger at high ones. The Bark frequency scale can be approximated by the following equation, where f is the frequency in Hz:

Bark(f) = 13*arctan(0.00076*f) + 3.5*arctan((f/7500)^2)
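As a quick illustration, the conversion can be written as a small MATLAB function; this is a minimal sketch, and the name hz2bark is ours rather than the paper's:

function z = hz2bark(f)
    % Convert frequency in Hz to the Bark scale using the approximation
    % above. Vectorized, so f may be a scalar or a vector.
    z = 13 .* atan(0.00076 .* f) + 3.5 .* atan((f ./ 7500).^2);
end

For example, hz2bark([100 1000 10000]) returns roughly 1.0, 8.5, and 22.4 Bark, consistent with narrow critical bands at low frequencies and wide ones at high frequencies.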

To determine the effect of frequency on hearing ability, scientists played a sinusoidal tone at a very low power. The power was slowly raised until the subject could hear the tone. This level was the threshold at which the tone could be heard. The process was repeated for many frequencies in the human auditory range and with many subjects. The result is the well-known absolute threshold of hearing curve.


This experimental data can be modeled by the following equation, where f is the frequency in Hertz:

ATH(f) = 3.64*(f/1000)^(-0.8) - 6.5*e^(-0.6*((f/1000) - 3.3)^2) + 10^(-3)*(f/1000)^4   (dB SPL)

Thus, we can draw a conclusion for the purposes of compression: if a signal has any frequency components with power levels that fall below the absolute threshold of hearing, then these components can be discarded, as the average listener will be unable to hear those frequencies of the signal anyway.
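The curve is easy to evaluate directly; the following minimal MATLAB sketch implements the equation above (the function name ath_db is ours, and f must be positive):

function T = ath_db(f)
    % Absolute threshold of hearing in dB SPL for frequency f in Hz
    % (vectorized), following the equation above.
    fk = f ./ 1000;
    T = 3.64 .* fk.^(-0.8) - 6.5 .* exp(-0.6 .* (fk - 3.3).^2) + 1e-3 .* fk.^4;
end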

1.2 Auditory Masking

Humans do not have the ability to hear minute differences in frequency. For example, it is very difficult to discern a 1,000 Hz signal from one that is 1,001 Hz. This becomes even more difficult if the two signals are playing at the same time. Furthermore, the 1,000 Hz signal would also affect a human's ability to hear a signal at 1,010 Hz, 1,100 Hz, or 990 Hz. This concept is known as masking. If the 1,000 Hz signal is strong, it will mask signals at nearby frequencies, making them inaudible to the listener. For a masked signal to be heard, its power needs to be increased to a level greater than a threshold determined by the frequency of the masker tone and its strength. It turns out that noise can be a masker as well: if noise is strong enough, it can mask a tone that would otherwise be clearly audible. For example, a jet engine, which is very noisy, can drown out music easily. In a compression algorithm, therefore, one must determine:

1. Tone maskers
2. Noise maskers
3. Masking effects

Any frequency components around these maskers that fall below the masking threshold can be discarded.

1.2.1. Tone Maskers

Determining whether a frequency component is a tone (masker) requires knowing whether it has been held constant for a period of time, as well as whether it is a sharp peak in the frequency spectrum, which indicates that it is above the ambient noise of the signal. Following the usual MPEG-1 psychoacoustic model criterion, a frequency f (with FFT index k) is a tone if its power P[k] is a local maximum (P[k] > P[k-1] and P[k] > P[k+1]) and exceeds the power of every bin in a frequency-dependent neighborhood (excluding k-1, k, and k+1) by at least 7 dB:

• If 0.17 kHz < f < 5.5 kHz, the neighborhood is [k-2 ... k+2]

• If 5.5 kHz <= f < 11 kHz, the neighborhood is [k-3 ... k+3]

• If 11 kHz <= f < 20 kHz, the neighborhood is [k-6 ... k+6]
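This test can be sketched in MATLAB as follows; it is a minimal sketch, the function name is ours, and the 7 dB margin follows the standard MPEG-1 model rather than anything stated explicitly in the paper:

function tonal = is_tone_masker(P, k, fs, N)
    % P: power spectral density in dB (first N/2 bins), k: candidate bin
    % (1-based), fs: sampling rate in Hz, N: FFT length.
    f = (k - 1) * fs / N;                  % bin centre frequency in Hz
    if f < 170 || f >= 20000
        tonal = false; return
    elseif f < 5500
        j = 2;                             % neighborhood [k-2 ... k+2]
    elseif f < 11000
        j = 3;                             % neighborhood [k-3 ... k+3]
    else
        j = 6;                             % neighborhood [k-6 ... k+6]
    end
    nb = [k-j:k-2, k+2:k+j];               % exclude k-1, k, k+1 themselves
    nb = nb(nb >= 1 & nb <= numel(P));     % clip at the spectrum edges
    tonal = P(k) > P(k-1) && P(k) > P(k+1) && all(P(k) >= P(nb) + 7);
end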

1.2.2. Noise Maskers

If a signal is not a tone, it must be noise. Thus, one can take all frequency components that are not part of a tone's neighborhood and treat them as noise. Combining such components into maskers, though, takes a little more thought.

1.2.3. Masking Effects

The maskers which have been determined affect not only the frequencies within their own critical band, but also those in surrounding bands. Studies show that the spreading of this masking has an approximate slope of +25 dB/Bark before and -10 dB/Bark after the masker. The spreading can be described as a function SF(i, j) that depends on the maskee location i, the masker location j, the power spectrum at j, and the difference between the masker and maskee locations in Barks, deltaz = z(i) - z(j). A simple two-slope form consistent with the quoted slopes is SF = 25*deltaz for deltaz < 0 and SF = -10*deltaz for deltaz >= 0.

There is a slight difference in the resulting mask depending on whether the masker is a tone or noise. As a result, the masks can be modeled by the following equations (the standard tonal and noise masking threshold forms), with the same variables as described above:

T_TM(i, j) = P_TM(j) - 0.275*z(j) + SF(i, j) - 6.025   (dB)
T_NM(i, j) = P_NM(j) - 0.175*z(j) + SF(i, j) - 2.025   (dB)
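Combining the two-slope spreading function with these thresholds gives the following minimal MATLAB sketch; the function name and constants follow the standard model forms quoted above, not code from the paper:

function T = mask_threshold(Pj, zj, zi, tonal)
    % Masking threshold in dB at maskee Bark location zi, produced by a
    % masker of power Pj (dB) at Bark location zj; tonal selects the form.
    dz = zi - zj;                          % maskee minus masker, in Bark
    if dz < 0
        SF = 25 * dz;                      % +25 dB/Bark slope below the masker
    else
        SF = -10 * dz;                     % -10 dB/Bark slope above the masker
    end
    if tonal
        T = Pj - 0.275 * zj + SF - 6.025;  % tone masker threshold
    else
        T = Pj - 0.175 * zj + SF - 2.025;  % noise masker threshold
    end
end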

2. PROBLEM IDENTIFICATION

A lot of work has been done in the field of data compression and signal analysis, and it has generated many results over the past few decades. When data is compressed, the file extension is usually changed according to the algorithm used, and the format family of the original data changes with it. For example, a .wav file is compressed to MP3 by using the MPEG algorithm. But what happens if we want to compress the file without changing the extension? That is the problem addressed in this paper. Here, work is proposed to compress a .wav file to 48% to 50% of the source file with the same extension. The distortion rate and the frequency spectrum are used to examine the difference between the compressed file and the original file.


In order to do this, the Wavelet Toolbox is used instead of the traditional Modified Discrete Cosine Transform (MDCT), and the following steps are performed:

1. Design a subband structure for wavelet representation of voice signals. This design also determines the computational complexity of the algorithms for each frame.

2. Design a scheme for the psychoacoustic model.

3. Design a scheme for efficient bit allocation, which depends on the temporal resolution of the decomposition.

The proposed basic block diagram is shown in Figure 2.

2.1. Wavelet Representation

Given a wavelet packet structure, a complete tree-structured filter bank is considered. Filter banks divide up a signal to help code the subbands differently. A bank is made up of an array of band-pass filters that span the voice spectrum. They use information obtained from the psychoacoustic model to quantize the signals, leading to a compressed bit-stream representation of the signal. Implementation of the filter banks was performed in MATLAB as follows (a sketch of the per-frame loop follows the list):

1. Encode a given input signal: window the signal into 2048-sample frames, find the global masking threshold for each window, and send it to the filter banks.

2. Use the psychoacoustic model to build a masking threshold for the windowed signal.

3. Break up the signal into subbands, down-sample, perform the coding, up-sample, and re-synthesize the signal.

4. Use the psychoacoustic model to determine the number of bits with which to quantize the signal over a given frequency range, then quantize the signal.
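The loop below is a minimal sketch of these steps under our own assumptions: the wavelet ('db8'), the tree depth (5), and the file name are not fixed by the paper, and psychoacoustic_model is a hypothetical helper standing in for the model of Section 2.2:

[x, fs] = audioread('speech.wav');       % audioread supersedes the older wavread
frameLen = 2048;                         % step 1: 2048-sample frames
nFrames = floor(numel(x) / frameLen);
for n = 1:nFrames
    frame = x((n-1)*frameLen + 1 : n*frameLen);
    % thr = psychoacoustic_model(frame, fs);   % step 2 (hypothetical helper)
    wpt = wpdec(frame, 5, 'db8');        % step 3: depth-5 wavelet packet tree
    % ... quantize each terminal-node subband using thr (step 4) ...
    frame_hat = wprec(wpt);              % step 3: re-synthesize the frame
end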

2.2. Psychoacoustic Model

In order to compress data by a large factor, an algorithm must be lossy, i.e., it must throw out some of the information. In the case of a voice signal, one would assume that throwing out portions would result in a noticeable degradation in sound quality. When done blindly, this is true. However, using what is known as the psychoacoustic model helps to minimize the audible effects of lossy compression. Implementation of the psychoacoustic model in MATLAB was performed with the following steps (a sketch of steps 1-4 follows the list):

1. Normalize the power spectrum of the signal to a 0 dB maximum. This is done with the following equation, where N is the frame length and b is the number of bits per sample:

X[n] = s[n] / (N * 2^(b-1))

2. Break the signal into frames of 512 or 2048 samples (512 samples is roughly 12 ms at a 44.1 kHz sampling rate).

3. Window the signal using a Hamming window with 1/16 overlap, so each frame has 10.9 ms of new data.

4. Calculate the power spectral density (PSD) using a length-512 or length-2048 FFT. A power normalization term of 90.302 dB is necessary for proper computation.

5. Find the tone maskers. Once found, take the power one index before [k-1] and one index after [k+1] and combine them with the power at [k] to create a tone masker approximation, since the tone may actually lie between the frequency samples.

6. Find noise maskers and their locations within each critical band.

7. If a masker is below the absolute threshold of hearing, it may be discarded. If two maskers are within a critical bandwidth of each other, the weaker of the two may be thrown out as well.

8. Calculate the masking threshold of each mask.

9. Sum the masking thresholds to get the overall masking threshold for all frequencies in this signal frame.
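Steps 1-4 can be sketched in MATLAB as follows, under our own assumptions of 16-bit samples and a 512-point frame held in a column vector named frame (the overlap bookkeeping of step 3 is omitted):

N = 512; b = 16;                         % assumed frame length and bit depth
x = frame(:) / (N * 2^(b-1));            % step 1: normalize to 0 dB maximum
w = hamming(N);                          % step 3: Hamming window
X = fft(x .* w, N);                      % step 4: length-N FFT
PSD = 90.302 + 10*log10(abs(X(1:N/2)).^2 + eps);   % PSD in dB with offset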

With this threshold, one can now move to quantization and bit allocation.

2.3. Bit Allocation

The bit allocation proceeds with a fixed number of iterations of a zero-tree algorithm before a perceptual evaluation is done. This algorithm organizes the coefficients in a tree structure that is temporally aligned from coarse to fine. The zero-tree algorithm tries to exploit the remnants (bits and pieces) of temporal structure present in the wavelet packet coefficients. It has been used in other wavelet applications, where its aim has been mainly to exploit the structure of wavelet coefficients to transfer images progressively from coarse to fine resolutions.


2.4. Quantization

The harmonic analysis for the tonal signals generates the frequencies, amplitudes, and phases of the harmonics present in the sinusoidal portion of the signal. The frequencies are quantized using 7 or 8 bits based on just-noticeable differences in frequency (JNDF). The amplitudes are quantized by exploiting the masking properties of the human auditory system; harmonics that fall below the masking threshold are removed. Because the ear is not completely insensitive to phase, 6 bits are used to encode the phase from -pi to pi. Basically, there are two types of quantization:

1. Full range quantization
2. Narrow range quantization

2.4.1. Full Range Quantization

The full range quantization method quantizes over the full range of allowable input values, [-1, 1], regardless of the range of the current input. This method achieves a better compression ratio because it does not need to store the extra information required by the narrow range method. However, it has less accuracy. This method has the same inputs as the other quantization method.

2.4.2. Narrow Range Quantization

The narrow range quantization method allows for greater accuracy at the expense of a poorer compression ratio. This quantization scheme uses the current set of inputs to determine the range of values to quantize over, as well as the delta. For example, if the current input only has a range of [-0.4, 0.4], then we quantize over this range instead of the full range of [-1, 1]. The quantization values will then be much closer to the true values of the input. However, this method requires storing two extra numbers for each frame of data: the delta used and the lowest quantized value. Narrow range quantization requires two inputs: the input values and the number of bits used for the quantization. The number of distinct values that can be stored is equal to 2^bits. For example, 4 bits allow the storing of 16 different numbers between the maximum and the minimum value of the current input. The quantization function returns two vectors. One of them contains the quantization levels of the current input; each number is an integer between 1 and the maximum number of levels. The second vector contains the information needed for reconstruction: the delta value (the numerical difference between two adjacent quantization values) and the base-line value (the lowest value in the input). With the lowest input value and the delta value, we are able to reconstruct the original input. We use dynamic quantization (narrow range quantization), which determines the range of quantization and the delta based on the current set of input data. The inputs to be quantized can range over [-1, 1], and they are quantized with 16 bits (the input then has 65,536 distinct values between -1 and 1).
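A minimal sketch of the narrow range quantizer and its inverse follows; the function names are ours, and edge cases (such as a constant frame, where the delta would be zero) are ignored:

function [levels, info] = narrow_quantize(x, nbits)
    % Quantize x over its own [min, max] range using 2^nbits levels.
    lo = min(x);
    delta = (max(x) - lo) / (2^nbits - 1);   % step between adjacent levels
    levels = round((x - lo) / delta) + 1;    % integers in 1 .. 2^nbits
    info = [delta, lo];                      % side information stored per frame
end

function x_hat = narrow_dequantize(levels, info)
    % Reconstruct from the levels plus the per-frame delta and base-line value.
    x_hat = (levels - 1) * info(1) + info(2);
end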

3. CONCLUSION

This paper is a basic extension of: Prof. Vijay K. Chaudhari, Dr. Manish Srivastava, Dr. R. K. Singh, Shiv Kumar, "A New Approach for Voice Spectrum Analysis (VSA) Suitable for Pervasive Computing by Using Matlab", proposed and submitted to the International Conference ICPCG-08, India, December 2008. In this paper, work is proposed for high-quality voice signal compression to roughly 45% to 60% of the source file at about 45 kbps with the same extension (i.e., .wav to .wav). Coding is transparent at a low bit rate with a high-quality voice signal. This work is well suited to high-quality voice signal transfer for Internet and limited-storage applications.

4. FUTURE WORK

From the collected data, we found that we have a long way to go in order to compete with MP3. So there is a need to search for other, better algorithms that can compete with MP3.

REFERENCES

[1] Jalal Karam, "Various Speech Processing Techniques for Speech Compression and Recognition", Proceedings of World Academy of Science, Engineering and Technology, Volume 26, December 2007, ISSN 1307-6884, © 2007 waset.org. http://www.waset.org/pwaset/v26/v26-133.pdf

[2] Tsung-Han Tsai, Yi-Wen Wang, Shih-Way Hung, "An MDCT-based psychoacoustic model co-processor design for MPEG-2/4 AAC audio encoder", Proceedings of the 7th International Conference on Digital Audio Effects (DAFx'04), Naples, Italy, October 5-8, 2004. http://www.mp3-tech.org/programmer/docs/P_335.pdf

[3] Sarantos Psycharis, "The Didactic Use of Digital Image Lossy Compression Methods for the Vocational Training Sector", University of the Aegean, Proceedings of Current Developments in Technology-Assisted Education 2006 (FORMATEX 2006). http://www.formatex.org/micte2006/pdf/2065-2069.pdf
