Lecture 6: Nonspeech and Music

30
EE E6820: Speech & Audio Processing & Recognition Lecture 6: Nonspeech and Music Dan Ellis <[email protected]> Michael Mandel <[email protected]> Columbia University Dept. of Electrical Engineering http://www.ee.columbia.edu/dpwe/e6820 February 26, 2009 1 Music & nonspeech 2 Environmental Sounds 3 Music Synthesis Techniques 4 Sinewave Synthesis E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 1 / 30

Transcript of Lecture 6: Nonspeech and Music

EE E6820: Speech & Audio Processing & Recognition

Lecture 6:Nonspeech and Music

Dan Ellis <[email protected]>Michael Mandel <[email protected]>

Columbia University Dept. of Electrical Engineeringhttp://www.ee.columbia.edu/∼dpwe/e6820

February 26, 2009

1 Music & nonspeech

2 Environmental Sounds

3 Music Synthesis Techniques

4 Sinewave Synthesis

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 1 / 30

Outline

1 Music & nonspeech

2 Environmental Sounds

3 Music Synthesis Techniques

4 Sinewave Synthesis

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 2 / 30

Music & nonspeech

What is ‘nonspeech’?I according to research effort: a little musicI in the world: most everything

Originnatural man-made

Info

rmat

ion

cont

ent

low

high

wind &water

animalsounds

speech music

machines & enginescontact/

collision

attributes?

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 3 / 30

Sound attributes

Attributes suggest model parameters

What do we notice about ‘general’ sound?I psychophysics: pitch, loudness, ‘timbre’I bright/dull; sharp/soft; grating/soothingI sound is not ‘abstract’:

tendency is to describe by source-events

Ecological perspectiveI what matters about sound is ‘what happened’→ our percepts express this more-or-less directly

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 4 / 30

Motivations for modeling

Describe/classifyI cast sound into model because want to use the resulting

parameters

Store/transmitI model implicitly exploits limited structure of signal

Resynthesize/modifyI model separates out interesting parameters

Sound Model parameter�space

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 5 / 30

Analysis and synthesis

Analysis is the converse of synthesis:

Sound

AnalysisSynthesis

Model / �representation

Can exist apart:I analysis for classificationI synthesis of artificial sounds

Often used together:I encoding/decoding of compressed formatsI resynthesis based on analysesI analysis-by-synthesis

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 6 / 30

Outline

1 Music & nonspeech

2 Environmental Sounds

3 Music Synthesis Techniques

4 Sinewave Synthesis

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 7 / 30

Environmental Sounds

Where sound comes from:mechanical interactions

I contact / collisionsI rubbing / scrapingI ringing / vibrating

Interest in environmental soundsI carry information about events around us

.. including indirect hintsI need to create them in virtual environments

.. including soundtracks

Approaches to synthesisI recording / samplingI synthesis algorithms

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 8 / 30

Collision soundsFactors influencing:

I colliding bodies: size, material, dampingI local properties at contact point (hardness)I energy of collision

Source-filter modelI ”source” = excitation of collision event

(energy, local properties at contact)I ”filter” = resonance and radiation of energy

(body properties)Variety of strike/scraping sounds

I resonant freqs ∼ size/shapeI damping ∼ materialI HF content in excitation/strike ∼ mallet, force

I (from Gaver, 1993)E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 9 / 30

Sound textures

What do we hear in:I a city streetI a symphony orchestra

How do we distinguish:I waterfallI rainfallI applauseI static

time / s

freq

/ Hz

Applause04

0 1 2 3 40

1000

2000

3000

4000

5000

time / s

freq

/ Hz

Rain01

0 1 2 3 40

1000

2000

3000

4000

5000

Levels of ecological description...

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 10 / 30

Sound texture modeling (Athineos)

Model broad spectral structure with LPCI could just resynthesize with noise

Model fine temporal structure in residual withlinear prediction in time domain

TD-LPy[n] = iaiy[n-i] DCT FD-LP

E[k] = ΣΣ ibiE[k-i]Whitenedresidual

Per-framespectral

parameters

Per-frametemporal envelope

parameters

ResidualspectrumSound

e[n]y[n] E[k]

I precise dual of LPC in frequencyI ‘poles’ model temporal events

-0.02

0

0.02

Temporal envelopes (40 poles, 256ms)

0.05 0.1 0.15 0.2 0.25time / sec

ampl

itude

0.010.020.03

Allows modification / synthesis?

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 11 / 30

Outline

1 Music & nonspeech

2 Environmental Sounds

3 Music Synthesis Techniques

4 Sinewave Synthesis

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 12 / 30

Music synthesis techniques

What is music?I could be anything → flexible synthesis needed!

Key elements of conventional musicI instruments→ note-events (time, pitch, accent level)→ melody, harmony, rhythm

I patterns of repetition & variation

Synthesis framework:

instruments: common framework for many notesscore: sequence of (time, pitch, level) note events

&&V?

########

cccc

Soprano

Alto

Tenor

Bass

Allegro „ ∑3

„ ∑3

„ ∑3

„ ∑3

.œ jœ Jœ jœ ŒHal - le - lu - jah!

.œ jœ jœ jœ ŒHal - le - lu - jah!.œ Jœ Jœ Jœ ŒHal - le - lu - jah!

.œ Jœ Jœ Jœ ŒHal - le - lu - jah!

.œ jœ Jœ jœ ‰ Rœ RœHal - le - lu - jah! Hal - le -

.œ jœ jœ jœ ‰ rœ rœHal - le - lu - jah! Hal - le -.œ Jœ Jœ Jœ ‰ Rœ RœHal - le - lu - jah! Hal - le -

.œ Jœ Jœ Jœ ‰ Rœ RœHal - le - lu - jah! Hal - le -

Jœ Jœ ‰ Rœ Rœ Jœ Jœ ‰ Jœlu - jah! Hal - le - lu - jah! Hal -

Jœ jœ ‰ rœ rœ Jœ jœ ‰ jœlu - jah! Hal - le - lu - jah! Hal -

Jœ Jœ ‰ Rœ Rœ Jœ Jœ ‰ Jœlu - jah! Hal - le - lu - jah! Hal -

Jœ Jœ ‰ Rœ Rœ Jœ Jœ ‰ Jœlu - jah! Hal - le - lu - jah! Hal -

&&V?

########

S

A

T

B

7

Jœ œ Jœ œ Œle - lu - jah,

œ œ œ jœ œ Œle - lu - jah,

Jœ œ jœ œ Œle - lu - jah,œ œ œ Jœ œ Œle - lu - jah,

.œ jœ Jœ Jœ ŒHal - le - lu - jah,

.œ jœ jœ jœ ŒHal - le - lu - jah,.œ Jœ Jœ Jœ ŒHal - le - lu - jah,.œ Jœ Jœ Jœ ŒHal - le - lu - jah,

.œ jœ Jœ Jœ ‰ Rœ RœHal - le - lu - jah, Hal - le -

.œ jœ jœ jœ ‰ rœ rœHal - le - lu - jah, Hal - le -.œ Jœ Jœ Jœ ‰ Rœ RœHal - le - lu - jah, Hal - le -.œ Jœ Jœ Jœ ‰ Rœ RœHal - le - lu - jah, Hal - le -

Jœ Jœ ‰ Rœ Rœ Jœ Jœ ‰ Jœlu - jah, Hal - le - lu - jah, Hal -jœ jœ ‰ rœ rœ jœ jœ ‰ jœlu - jah, Hal - le - lu - jah, Hal -

Jœ Jœ ‰ Rœ Rœ Jœ Jœ ‰ Jœlu - jah, Hal - le - lu - jah, Hal -Jœ Jœ ‰ Rœ Rœ Jœ Jœ ‰ Jœlu - jah, Hal - le - lu - jah, Hal -

&&V?

########

S

A

T

B

11 œ œ œ œ Œle - lu - jah,

.œ jœ# œ Œle - lu - jah,œ œ œ Jœ œ Œle - lu - jah,œ œ œ œ Œle - lu - jah,

˙ œ œfor the Lord

˙ œ œfor the Lord

˙ œ œfor the Lord˙ œ œfor the Lord

Jœ jœ .œ Jœ œGod Om - ni - po - tent

jœ jœ .œ jœ œGod Om - ni - po - tent

Jœ jœ .œ Jœ œGod Om - ni - po - tentJœ Jœ

.œ Jœ œGod Om - ni - po - tent

˙ œ ‰ Rœ Rœreign - eth, Hal - le -

˙ œ ‰rœ rœ

reign - eth, Hal - le -˙ œ ‰ rœ rœreign - eth, Hal - le -˙ œ ‰ Rœ Rœreign - eth, Hal - le -

HAN044 #1/8

M E S S I A H44. chorus

G. F. Handel

© 2000 by Score on Line S.A.

HALLELUJAH!

http://www.score-on-line .com

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 13 / 30

The nature of musical instrument notesCharacterized by instrument (register),note, loudness/emphasis, articulation...

Time

Frequency

Piano

01000200030004000

0 1 2 3 4 Time

Violin

01000200030004000

0 1 2 3 4

Time

Frequency

Clarinet

01000200030004000

0 1 2 3 4 Time

Trumpet

01000200030004000

0 1 2 3 4

Distinguish how?

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 14 / 30

Development of music synthesis

Goals of music synthesis:I generate realistic / pleasant new notesI control / explore timbre (quality)

Earliest computer systems in 1960s(voice synthesis, algorithmic)

Pure synthesis approaches:I 1970s: Analog synthsI 1980s: FM (Stanford/Yamaha)I 1990s: Physical modeling, hybrids

Analysis-synthesis methods:I sampling / wavetablesI sinusoid modelingI harmonics + noise (+ transients)

others?

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 15 / 30

Analog synthesis

The minimum to make an ‘interesting’ sound

Oscillator Filter

Envelope

PitchTrigger

Vibrato

t

t

f

+

Cutoff�freq

GainSound

++

Elements:I harmonics-rich oscillatorsI time-varying filtersI time-varying envelopeI modulation: low frequency + envelope-based

Result:I time-varying spectrum, independent pitch

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 16 / 30

FM synthesisFast frequency modulation → sidebands:cos(ωct + β sin(ωmt)) =

∑∞n=−∞ Jn(β) cos((ωc + nωm)t)

I a harmonic series if ωc = rωm

Jn(β) is a Bessel function:

0 1 2 3 4 5 6 7 8 9-0.5

0

0.5

1J0 J1 J2 J3 J4 Jn(β) ≈ 0 �

for β < n - 2

modulation index β

→ Complex harmonic spectra by varying β

time / s

freq

/ Hz

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.80

1000

2000

3000

4000

I ωc = 2000 Hz, ωm = 200 HzI what use?

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 17 / 30

Sampling synthesisResynthesis from real notes→ vary pitch, duration, level

Pitch: stretch (resample) waveform

0 0.002 0.004 0.006 0.008 time / s-0.2

-0.1

0

0.1

0.2

0 0.002 0.004 0.006 0.008 time / s

596 Hz 894 Hz

-0.2

-0.1

0

0.1

0.2

Duration: loop a ‘sustain’ section

-0.2

-0.1

0

0.1

0.2

-0.2

-0.1

0

0.1

0.2

0 0.1 0.2 0.3 0 0.1 0.2 0.3time / s time / s

0.204 0.2060.174 0.176

Level: cross-fade different examples

-0.2

-0.1

0

0.1

0.2

-0.2

-0.1

0

0.1

0.2

0 0.05 0.1 0.15 0 0.05 0.1 0.15time / s time / s

Soft Loud

veloc

mix

I need to ‘line up’ source samplesI good & bad?

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 18 / 30

Outline

1 Music & nonspeech

2 Environmental Sounds

3 Music Synthesis Techniques

4 Sinewave Synthesis

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 19 / 30

Sinewave synthesis

If patterns of harmonics are what matter,why not generate them all explicitly:s[n] =

∑k Ak [n] cos(k · ω0[n] · n)

I particularly powerful model for pitched signals

Analysis (as with speech):I find peaks in STFT |S [ω, n]| & trackI or track fundamental ω0 (harmonics / autocorrelation)

& sample STFT at k · ω0

→ set of Ak [n] to duplicate tone:

0 0.05 0.1 0.15 0.2 time / s time / s0

2000

4000

6000

8000

freq

/ Hz

freq / Hz

mag

00.1

0.2

05000

0

1

2

Synthesis via bank of oscillators

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 20 / 30

Steps to sinewave modeling - 1

The underlying STFT:

X [k , n0] =N−1∑n=0

x [n + n0] · w [n] · exp−j

(2πkn

N

)

I what value for N (FFT length & window size)?I what value for H (hop size: n0 = r · H, r = 0, 1, 2 . . . )?

STFT window length determines freq. resolution:Xw (e jω) = X (e jω) ∗W (e jω)

Choose N long enough to resolve harmonics

→ 2-3x longest (lowest) fundamental periodI e.g. 30-60 ms = 480-960 samples @ 16 kHzI choose H ≤ N/2

N too long → lost time resolutionI limits sinusoid amplitude rate of change

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 21 / 30

Steps to sinewave modeling - 2Choose candidate sinusoids at each timeby picking peaks in each STFT frame:

time / sfre

q / H

zle

vel /

dB

0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.1802000400060008000

0 1000 2000 3000 4000 5000 6000 7000 freq / Hz-60

-40-20

020

Quadratic fit for peak:

400 600 800 freq / Hz

-20-10

01020

400 600 800 freq / Hz

-10

-5

0

leve

l / d

B

phas

e / r

ad

y

xy = ax(x-b)

b/2

ab2/4

+ linear interpolation of unwrapped phaseE6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 22 / 30

Steps to sinewave modeling - 3

Which peaks to pick?Want ‘true’ sinusoids, not noise fluctuations

I ‘prominence’ threshold above smoothed spectrum

0 1000 2000 3000 4000 5000 6000 7000 freq / Hz-60

-40

-20

0

20

leve

l / d

B

Sinusoids exhibit stability...I of amplitude in timeI of phase derivative in time→ compare with adjacent time frames to test?

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 23 / 30

Steps to sinewave modeling - 4

‘Grow’ tracks by appending newly-found peaksto existing tracks:

timefreq

death

birthexisting�tracks

new�peaks

I ambiguous assignments possible

Unclaimed new peakI ‘birth’ of new trackI backtrack to find earliest trace?

No continuation peak for existing trackI ‘death’ of trackI or: reduce peak threshold for hysteresis

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 24 / 30

Resynthesis of sinewave modelsAfter analysis, each track defines contours infrequency, amplitude fk [n], Ak [n] (+ phase?)

I use to drive a sinewave oscillators & sum up

0 0.05 0.1 0.15 0.2500

600

7000

1

2

3

0 0.05 0.1 0.15 0.2time / s

time / sfreq

/ Hz

leve

l

-3-2-10123

Ak[n]·cos(2πfk[n]·t)

fk[n]

Ak[n]

n

‘Regularize’ to exactly harmonic fk [n] = k · f0[n]

0 0.05 0.1 0.15 0.20

2000

4000

6000

0 0.05 0.1 0.15 0.2550

600

650

700

time / s time / s

freq

/ Hz

freq

/ Hz

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 25 / 30

Modification in sinewave resynthesis

Change duration by warping timebaseI may want to keep onset unwarped

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.50

1000

2000

3000

4000

5000

time / s

freq

/ Hz

Change pitch by scaling frequenciesI either stretching or resampling envelope

0 1000 2000 3000 4000010203040

freq / Hz

leve

l / d

B

0 1000 2000 3000 4000010203040

freq / Hz

leve

l / d

B

Change timbre by interpolating parameters

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 26 / 30

Sinusoids + residual

Only ‘prominent peaks’ became tracksI remainder of spectral energy was noisy?→ model residual energy with noise

How to obtain ‘non-harmonic’ spectrum?I zero-out spectrum near extracted peaks?I or: resynthesize (exactly) & subtract waveforms

es [n] = s[n]−∑

k

Ak [n] cos(2πn · fk [n])

.. must preserve phase!

0 1000 2000 3000 4000 5000 6000 7000 freq / Hz

mag

/ dB

-80

-60

-40

-20

0

20

original

sinusoids

residualLPC

Can model residual signal with LPC→ flexible representation of noisy residual

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 27 / 30

Sinusoids + noise + transients

Sound represented as sinusoids and noise:

s[n] =∑k

Ak [n] cos(2πn · fk [n]) + hn[n] ∗ b[n]

Sinusoids Residual es [n]

Parameters are Ak [n], fk [n], hn[n]

time / s

freq

/ Hz

0 0.2 0.4 0.602000400060008000

2000400060008000

0 0.2 0.4 0.602000400060008000

Separate out abrupt transients in residual?es [n] =

∑k tk [n] + hn[n] ∗ b′[n]

I more specific → more flexible

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 28 / 30

Summary

‘Nonspeech audio’I i.e. sound in generalI characteristics: ecological

Music synthesisI control of pitch, duration, loudness, articulationI evolution of techniquesI sinusoids + noise + transients

Music analysis...

and beyond?

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 29 / 30

References

W.W. Gaver. Synthesizing auditory icons. In Proc. Conference on Human factors incomputing systems INTERCHI-93, pages 228–235. Addison-Wesley, 1993.

M. Athineos and D. P. W. Ellis. Autoregressive modeling of temporal envelopes. IEEETr. Signal Proc., 15(11):5237–5245, 2007. URLhttp://www.ee.columbia.edu/~dpwe/pubs/AthinE07-fdlp.pdf.

X. Serra and J. Smith III. Spectral Modeling Synthesis: A Sound Analysis/SynthesisSystem Based on a Deterministic Plus Stochastic Decomposition. Computer MusicJournal, 14(4):12–24, 1990.

T. S. Verma and T. H. Y. Meng. An analysis/synthesis tool for transient signals thatallows aflexible sines+ transients+ noise model for audio. In Proc. ICASSP, pagesVI–3573–3576, Seattle, 1998.

E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 30 / 30