
ELEN 6820
Speech and Audio Processing

Prof. D. Ellis
Columbia University

Midterm Presentation

High Quality Music Metacompression Using Repeated-Segment Residuals

Asheesh Kashyap
Spring 2005

Music Compression

Music contains much self-similarity and repetition at various levels of detail (many repeated segments).

Remove redundancy by storing a single copy of a repeated segment, and then referencing it every time it is used.

Can be used in conjunction with other audio compression techniques, such as MP3 (hence, metacompression).

Concept has already been explored by Joseph Hazboun, “Detection of Audio Similarity for Redundancy Removal”, ELEN 6820, Spring 2004.

Previous Work

Hazboun’s Method

Phase I: Divide song into 1 sec segments, and correlate each segment with every other segment. Keep values with corr > 0.78.

Phase II: Group successive 1 sec segments together.

Phase III: Find similarity of 256 ms pairs with corr > 0.82 (fine tuning).

Phase IV: Perform alignment of segments using a 2 ms STFT correlation.

Phase V: Compare segments based on sum of spectral energy over each frequency, and discard segments with similarities < 0.995*

Phase VI: Remove overlapping segments, and define new, longer similar segment (new start and end points).

Phase VII: Encode audio stream by removing redundant segments.

* In Hazboun’s example, identical tune with different lyrics has correlation of 0.968.
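
Phase I can be sketched as follows. This is a minimal illustration, not Hazboun's code: it assumes that "correlate" means the zero-lag Pearson correlation coefficient between time-domain segments, and the function name `similar_segment_pairs` and its parameters are hypothetical.

```python
import numpy as np

def similar_segment_pairs(signal, sample_rate, seg_seconds=1.0, threshold=0.78):
    """Return (i, j, corr) for every pair of fixed-length segments whose
    correlation exceeds the threshold (assumed: zero-lag Pearson correlation)."""
    seg_len = int(round(seg_seconds * sample_rate))
    if seg_len < 1:
        raise ValueError("segment length must be at least one sample")
    n_segs = len(signal) // seg_len
    if n_segs < 2:
        return []
    # One row per segment; trailing samples that do not fill a segment are dropped.
    segs = np.asarray(signal[:n_segs * seg_len], dtype=float).reshape(n_segs, seg_len)
    corr = np.corrcoef(segs)  # corr[i, j] = correlation of segment i with segment j
    return [(i, j, float(corr[i, j]))
            for i in range(n_segs)
            for j in range(i + 1, n_segs)
            if corr[i, j] > threshold]
```

Each returned pair is only a candidate; the grouping, fine tuning, and alignment of Phases II to IV are not shown here.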

Extensions to Previous Work

Current methods, such as Hazboun’s method, apply a simple replacement scheme for repeated segments.

This imposes high standards for audio similarity (corr > 0.995): audibly dissimilar segments are removed from consideration (conservative).

Extension A: can relax the similarity constraint by storing residuals (the error difference between reference and repeated segments).

Extension B: separate music and voice components (music has more self-similarity).

Validate performance using two samples from contemporary, techno and classical music.

Extension A: Residuals

Residuals: error difference between reference and repeated segments.

[Figure: difference between the reference and repeated segments = residual]

Transmitting residuals allows more precise reconstruction of the original waveform (higher quality), and relaxes the audio similarity constraint.

Residuals should compress well, as they contain much less information than the original signal (lower amplitude and / or fewer components). Basis of video compression.
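
A minimal sketch of the residual idea, assuming the two segments are already aligned and of equal length. The sign convention (residual = repeated - reference) and the function names are assumptions chosen for illustration; the slides do not specify them.

```python
import numpy as np

def make_residual(reference, repeated):
    """Residual stored in place of the repeated segment (assumed sign convention)."""
    return np.asarray(repeated, dtype=np.int32) - np.asarray(reference, dtype=np.int32)

def reconstruct(reference, residual):
    """Recover the repeated segment from the stored reference and its residual."""
    return np.asarray(reference, dtype=np.int32) + residual

# Toy example: the repeated segment is the reference plus a small deviation.
t = np.arange(1000)
reference = np.round(10000 * np.sin(2 * np.pi * t / 50)).astype(np.int32)
repeated = reference + np.round(100 * np.sin(2 * np.pi * t / 7)).astype(np.int32)

residual = make_residual(reference, repeated)
restored = reconstruct(reference, residual)
```

Reconstruction is exact here because nothing in the sketch is lossy; once the residual itself passes through a lossy coder such as MP3, the restored segment would only approximate the original.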

Extension A: Modification to Hazboun’s Code

Change Phase V to relax the similarity requirement from 0.995 down to 0.945 in 0.010 increments.

This should allow us to compress segments with similar music, but different vocals.

Modify Phase VII to generate residuals for repeated segments instead of removing the segment.

Convert wave to MP3 and compare compression with baseline (i.e., converting the unmodified song from wave to MP3).

Convert MP3 back to wave files, decode and compare SNR with original decoded song.
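
The threshold sweep and the SNR comparison can be sketched as below. The SNR definition (signal power over error power, in dB) is the usual one, but which signals the evaluation actually compares is an assumption here, and `snr_db` is a hypothetical helper.

```python
import numpy as np

# Phase V similarity thresholds: 0.995 down to 0.945 in steps of 0.010.
thresholds = [(995 - 10 * k) / 1000 for k in range(6)]

def snr_db(original, decoded):
    """Signal-to-noise ratio in dB, treating (original - decoded) as the noise."""
    original = np.asarray(original, dtype=float)
    noise = original - np.asarray(decoded, dtype=float)
    noise_power = np.sum(noise ** 2)
    if noise_power == 0:
        return float("inf")  # decoded signal is identical to the original
    return 10.0 * np.log10(np.sum(original ** 2) / noise_power)
```

In an experiment, `snr_db` would be evaluated once per threshold, alongside the compressed file size, to show how quality trades off against compression.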

Extension B: Separating Voice from Music

Separating voice from music may result in improved compression.

Changed lyrics produce different formants, which can hamper our correlation/alignment.

Challenging part:

Separation of music and voice is an extremely difficult problem.

Compressing voice and music components separately requires two streams or files (compression needs to be much better).

Perfect separation is not required for our purposes (our goal is compression).

Correlation and alignment are performed on segments with voice removed, but encoding uses the original segments (music component will be maximally compressed).

Extension B: Simple Algorithm

[Figure: spectrogram, Clip from U2: "Real Thing". Horizontal axis: Time (0 to 3.5); vertical axis: Frequency (0 to 5000).]

Formants still visible in presence of music.

Use cepstral analysis to find max. peak in range 70-255 Hz (voice excitation pitch) for each timeslice.

Build a filter bank that attenuates frequencies at pitch harmonics.

Take derivative across spectrogram to minimize horizontal bands (musical notes).
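
The first two steps can be sketched for a single timeslice as follows. This is an illustration under stated assumptions, not the presenter's implementation: it uses the real cepstrum with a Hamming window, and the notch half-width (10 Hz), the attenuation factor (0.1), and both function names are hypothetical choices. The derivative step is not shown.

```python
import numpy as np

def cepstral_pitch(frame, sample_rate, fmin=70.0, fmax=255.0):
    """Estimate the pitch (Hz) of one frame from the largest real-cepstrum
    peak whose quefrency corresponds to the range fmin..fmax."""
    windowed = frame * np.hamming(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + 1e-10)  # offset avoids log(0)
    cepstrum = np.fft.irfft(log_mag, n=len(frame))
    # Quefrency index q (in samples) corresponds to a pitch of sample_rate / q.
    q_min = int(np.ceil(sample_rate / fmax))
    q_max = int(np.floor(sample_rate / fmin))
    q_peak = q_min + int(np.argmax(cepstrum[q_min:q_max + 1]))
    return sample_rate / q_peak

def harmonic_notch_gains(freqs, f0, half_width_hz=10.0, attenuation=0.1):
    """Per-frequency gains that attenuate bins close to a harmonic of f0."""
    harmonic = np.maximum(1, np.round(freqs / f0))  # nearest harmonic number, >= 1
    near = np.abs(freqs - harmonic * f0) <= half_width_hz
    return np.where(near, attenuation, 1.0)

# Toy frame: an impulse train with a 64-sample period (125 Hz at 8 kHz).
sample_rate = 8000
frame = np.zeros(1024)
frame[::64] = 1.0

f0 = cepstral_pitch(frame, sample_rate)
freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
gains = harmonic_notch_gains(freqs, f0)
```

Multiplying each spectrogram column by its `gains` vector would suppress the estimated voice harmonics in that timeslice while leaving the bins between them unchanged.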
