ELEN E4896 MUSIC SIGNAL PROCESSING Lecture 9: Time & …dpwe/e4896/lectures/E4896-L09.pdfE4896 Music...
Transcript of ELEN E4896 MUSIC SIGNAL PROCESSING Lecture 9: Time & …dpwe/e4896/lectures/E4896-L09.pdfE4896 Music...
E4896 Music Signal Processing (Dan Ellis) 2013-03-25 - /16
Lecture 9:Time & Pitch Scaling
Dan EllisDept. Electrical Engineering, Columbia University
[email protected] http://www.ee.columbia.edu/~dpwe/e4896/1
1. Time Scale Modification (TSM)2. Time-Domain Approaches3. The Phase Vocoder4. Sinusoidal Approach
ELEN E4896 MUSIC SIGNAL PROCESSING
E4896 Music Signal Processing (Dan Ellis) 2013-03-25 - /16
1. Time Scale Modification (TSM)
• The basic problem in time scaling:Modify a sound to make it “quicker/slower”?to examine “detail” in speech/performanceto synchronize tracks
2
xs(t) = x(t/r)
• Why not just change the sampling rate?“slowing the tape”e.g. for r times longer (slower),
2.35 2.4 2.45 2.5 2.55 2.6-0.1
-0.05
0
0.05
0.1
2.35 2.4 2.45 2.5 2.55 2.6-0.1
-0.05
0
0.05
0.1
time/s
Original
2x slower
E4896 Music Signal Processing (Dan Ellis) 2013-03-25 - /16
Time & Pitch• Changing speed alters time AND pitch
how do we change just one?
• Preserve pitch, change timepreserve pitch = keep local time structurebut changing global time course?
• Preserve timing, change pitchanalogous problemchange time scale, then change sampling rate
3
E4896 Music Signal Processing (Dan Ellis) 2013-03-25 - /16
2. Time-Domain TSM• Pre-digital time/pitch scaling:
Revolving tape heads
4
Fairb
anks
et a
l., 19
54
E4896 Music Signal Processing (Dan Ellis) 2013-03-25 - /16
Simple OLA TSM• The digital equivalent of revolving tape heads
5
2.35 2.4 2.45 2.5 2.55 2.6-0.1
0
0.1
4.7 4.75 4.8 4.85 4.9 4.95-0.1
0
0.1
1
1
1 1 2 2 3 3 4 4 5 5 6
2
2
3
3
4
4
5 6
65
time / s
time / s
ym[mL + n] = ym�1[mL + n] + w[n] · x��m
r
�L + n
�
E4896 Music Signal Processing (Dan Ellis) 2013-03-25 - /16
SOLA / SOLAFS• Simple OLA leads to
random phase interactions during overlapso.. search for optimal alignment Km via small offset
find best Km with cross-correlation:
6
12
Km maximizes alignment of 1 and 2
ym[mL + n] = ym�1[mL + n] + w[n] · x��m
r
�L + n + Km
�
Km = arg max0�K�Ku
�Nov
n=0 ym�1[mL + n] · x��
mr
�L + n + K
���
(ym�1[mL + n])2�
(x��
mr
�L + n + K
�)2
Roucos & Wilgus 1985Hejna & Musicus 1991
E4896 Music Signal Processing (Dan Ellis) 2013-03-25 - /16
The Importance of Time Window• Preserved “local structure”
= structure within each windowshorter windows → less structure preserved
7
2.4 2.45 2.5 2.55 2.6 2.65 2.7 2.75 2.8-0.1
0
0.1
4.8 4.85 4.9 4.95 5 5.05 5.1 5.15-0.1
0
0.1
2.4 2.45 2.5 2.55 2.6 2.65 2.7 2.75 2.8-0.1
0
0.1
4.8 4.85 4.9 4.95 5 5.05 5.1 5.15-0.1
0
0.1
tw = 25ms tw = 5ms
E4896 Music Signal Processing (Dan Ellis) 2013-03-25 - /16
Source-Filter TSM• Decompose into excitation + filter
before TSMfilter (e.g. LPC) easy to scale - change frame rateexcitation scaled e.g. by SOLAFS.. then reapplying filter can reduce artifacts
8
f
Filteranalysis
Resamplein time
Time scalemodification
Applyfilter
Filter coefficients
Residual t
Kawahara 1997
E4896 Music Signal Processing (Dan Ellis) 2013-03-25 - /16
3. The Phase Vocoder• TSM based on the spectrogram
just stretching it out it time gives “what we want”:but: STFT magnitude isn’t quite enough
9
Time
Frequency
0 0.2 0.4 0.6 0.8 1 1.2 1.40
1000
2000
3000
4000
Time
Frequency
0 0.2 0.4 0.6 0.8 1 1.2 1.40
1000
2000
3000
4000
Flanagan & Golden 1966Dolson 1986
E4896 Music Signal Processing (Dan Ellis) 2013-03-25 - /16
Magnitude-only reconstruction• How to recover ?
need ...iterate
converges to a solution, but SLOWLY
10
STFT�1{|Y (ej�, t)|}�{Y (ej�, t)}
yn(t) = STFT�1{|Y (ej�, t)|�n(�, t)}�n+1(�, t) = �{STFT{yn(t)}}
0 1 2 3 4 5 6 7 83
4
5
6
7
8
Iteration number
SNR
of r
econ
stru
ctio
n (d
B)
Iterative phase estimation
Starting from quantized phaseMagnitude quantization only
Starting from random phase
7.6 dB
4.7 dB
3.4 dB
Griffin & Lim 1984
E4896 Music Signal Processing (Dan Ellis) 2013-03-25 - /16
Phase Correction• Essential idea of the phase vocoder
phase advance will stretch by time scaling factor... preserve rate of phase change between slices
= instantaneous frequency
(from differencing adjacent frames, or directly...)
• Makes sense for a single sinusoid...
11
�̇(�, t) =d
dt�{Y (ej�, t)}
time
θ = ΔθΔT
.
Δθ' = θ·2ΔT .
ΔT
E4896 Music Signal Processing (Dan Ellis) 2013-03-25 - /16
Phase Vocoder Results• Reconstructed phase
gives smooth evolution
“metallic” extended noise...
12
freq
/ kHz
0
2
4
6
8
freq
/ kHz
0
2
4
6
8
time / s0.5 1 1.5 2 2.5 3
E4896 Music Signal Processing (Dan Ellis) 2013-03-25 - /16
Time Window (again)• STFT time-frequency tradeoff
determines what “scaling” means
is pitch structure within DFT window?
13
time / s0.5 1 1.5 2 2.5 3
freq
/ kHz
0
2
4
6
8
freq
/ kHz
0
2
4
6
8
E4896 Music Signal Processing (Dan Ellis) 2013-03-25 - /16
4. Sinusoidal TSM• Phase Vocoder essentially
treats each STFT bin as a sinusoidwith magnitude and frequency time scaling simply projects that sinusoid forward
• But: bins around peak don’t behave that wayneed phase alignment to achieve interference
• Model as sinusoids explicitlypeak-picking, magnitude & frequency estimationseparate noise signal?
14
�̇(�, t)|Y (ej�, t)|
Time
Frequency
0.5 1 1.5 2 2.5 30
2000
4000
6000
8000
E4896 Music Signal Processing (Dan Ellis) 2013-03-25 - /16
Summary• Time / Pitch scale modification
Intuitive but difficult to define
• Time domain methodsPreserve local structure
• Spectral methodsElongate features in spectrogram
15
E4896 Music Signal Processing (Dan Ellis) 2013-03-25 - /16
References• M. Covell, M.Withgott, M., Slaney, “Mach1: Nonuniform Time-Scale Modification
of Speech,” Proc. ICASSP, vol 1, pp. 349-352,1998.
• Mark Dolson, “The phase vocoder: A tutorial,” Computer Music Journal, vol. 10, no. 4, pp. 14-27, 1986.
• G. Fairbanks, W. Everitt, and R. Jaeger, “Method for time of frequency compression-expansion of speech,” Transactions of the IRE Professional Group on Audio, vol.2, no.1, pp. 7-12, Jan 1954.
• J. L. Flanagan, R. M. Golden, “Phase Vocoder,” Bell System Technical Journal, pp. 1493-1509, November 1966.
• D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Tr. Acous,, Speech, and Sig. Proc., vol. 32, pp. 236–242, 1984.
• Don Hejna and Bruce R. Musicus, “The SOLAFS Time-Scale Modification Algorithm,” BBN Technical Report, July 1991.
• Hideki Kawahara, “Speech Representation and Transformation using Adaptive Interpolation of Weighted Spectrum: VOCODER Revisited,'' Proc. ICASSP, vol.2, pp.1303-1306, 1997.
• Jean Laroche, “Time and Pitch Scale Modification of Audio Signals.” chapter 7 in Applications of Digitial Signal Processing to Audio and Acoustics, ed. M. Kahrs & K. Brandenburg, Kluwer, 1998.
16