Audio Feature Extraction - mac.kaist.ac.kr

16
Juhan Nam GCT634/AI613: Musical Applications of Machine Learning (Fall 2021) Audio Feature Extraction

Transcript of Audio Feature Extraction - mac.kaist.ac.kr

Page 1: Audio Feature Extraction - mac.kaist.ac.kr

Juhan Nam

GCT634/AI613: Musical Applications of Machine Learning (Fall 2021)

Audio Feature Extraction

Page 2: Audio Feature Extraction - mac.kaist.ac.kr

Introduction

● A variety of audio features capture different aspects of music in different levels○ Acoustic level: loudness, pitch, timbre○ Musical level: key, chord, tempo, and more

AudioFeatures

Time-FrequencyRepresentation ClassifierFeature

SummaryOutput Class

Page 3: Audio Feature Extraction - mac.kaist.ac.kr

Frame-Level Audio Features

● Summarize a single frame of time-frequency representations into a single value or a small vector

● RMS: loudness

● Spectral statistics: timbre

● MFCC: timbre

● Chroma: tonality (pitch, key, chord)

Page 4: Audio Feature Extraction - mac.kaist.ac.kr

Root-Mean-Square (RMS)

● Computes the amplitude envelope of waveforms○ Frame-by-frame computation:

○ Can be computed from STFT:

RMS 𝑙 = !"∑#$%"&!(𝑥 𝑚 − 𝑙 + 𝑅 𝑤[𝑛])'

𝑁: window size𝑅: hop sizeRMS 𝑙 = !

"∑($%"&! |𝑋 𝑘, 𝑙 |'

𝑤[𝑛]: window

Piano (single note) Flute (single note)

Page 5: Audio Feature Extraction - mac.kaist.ac.kr

Spectral Statistics: Centroid

● “Center of mass” of the spectrum○ Associated with the brightness of sounds

SC(𝑙) =∑($%"&!𝑓( + |𝑋 𝑘, 𝑙 |∑($%"&! |𝑋 𝑘, 𝑙 |

Classical Music Jazz Music

𝑓( =𝑘𝑁𝑓)

Page 6: Audio Feature Extraction - mac.kaist.ac.kr

Spectral Bandwidth

● A measure of the bandwidth of the spectrum

SB 𝑙 =∑($%"&!(𝑓( − SC 𝑙 )'+ |𝑋 𝑘, 𝑙 |

∑($%"&! |𝑋 𝑘, 𝑙 |

Classical Music Jazz Music

Page 7: Audio Feature Extraction - mac.kaist.ac.kr

Spectral Flatness

● A measure of the noisiness (or tonality) of the spectrum○ The ratio between the geometric and arithmetic means

SF 𝑙 =

! ∏!"#$%& |𝑋 𝑘, 𝑙 |

1𝑁∑!"#

$%& |𝑋 𝑘, 𝑙 |

Close to 1 for white noise (maximum flatness)

Classical Music Jazz Music

Page 8: Audio Feature Extraction - mac.kaist.ac.kr

Mel-Frequency Cepstral Coefficient (MFCC)

● Most popularly used audio feature to extract “timbre” ○ Pitch-invariant: extract the spectrum envelop from an audio frame○ Standard audio feature in the legacy speech/speaker recognition systems:

captures phonetic information and speaker information○ Low-dimensional: typically 13 to 20-dim vectors

Page 9: Audio Feature Extraction - mac.kaist.ac.kr

Mel-Frequency Cepstral Coefficient (MFCC)

● Computation steps○ Mel-spectrum: reduce the dimensionality (40 - 128 bins)○ Log compression○ Discrete Cosine Transform (DCT):

■ A small set of cosine kernels with low frequencies to reduce the dimensionality again (typically, 13 – 20 bins)

■ Captures a slowly varying trend of mel-spectrum over frequency which corresponds to the spectrum envelope

DFT MelFilterbank DCT

Mel-spectrum

abs(magnitude)

logcompression

Magnitude Spectrum

MFCC

Page 10: Audio Feature Extraction - mac.kaist.ac.kr

Mel-Frequency Cepstral Coefficient (MFCC)

Magnitude spectrum(512 bins)

Frequency spectrum (mel-scaled, 60 bins)

MFCC(13 dim)

Reconstructed Mel spectrum

Reconstructed Magnitude spectrum

DCT

InverseDCT

Inversemel

filterbank

Melfilterbank

Page 11: Audio Feature Extraction - mac.kaist.ac.kr

Mel-Frequency Cepstral Coefficient (MFCC)

Spectrogram

Mel-frequencySpectrogram

MFCC

ReconstructedSpectrogramfrom MFCC

Page 12: Audio Feature Extraction - mac.kaist.ac.kr

Neural Network View of MFCC

● We can replace the hand-designed step with trainable modules ○ STFT, Mel-filterbank and DCT is a linear transform (Conv, FC)○ Abs and log compression is a non-linear function○ The linear transforms are designed by hands in MFCC but they can be

optimized using the trainable modules

STFT (DFT)

MelFilterbank DCT

Abs(magnitude)

Logcompression

ConvLayer

FCLayer

FCLayer

Non-linearLayer

Non-linearLayer

Neural Network

MFCC

Page 13: Audio Feature Extraction - mac.kaist.ac.kr

Chroma

● Musical notes are denoted with a pitch class and an octave number○ Pitch class: C, C#, D, D#, E, F, F#, G, G#, A, A#, B○ Octave number: 0, 1, 2, 3, 4, 5, …○ Example: C4 (middle C), E3, G5

● The octave difference is the most consonant pitch interval○ Therefore, they belong to the same pitch class

● This can be represented with “pitch helix”○ Chroma: inherent circularity of pitch organization○ Height: naturally increase and have one octave above

for one rotation Pitch Helix and Chroma(Shepard, 2001)

Page 14: Audio Feature Extraction - mac.kaist.ac.kr

Chroma

● Compute the energy distribution of an audio frame on 12 pitch classes

○ Convert the frequency to a musical note (=12log!"##$

+ 69) and take the pitch class from the musical note (e.g. 69à A4 à A)

○ Extract tonal characteristics, ideally removing timbre information○ Useful in music synchronization, chord recognition, music structure

analysis, music genre classification

● Computation Steps○ Projecting the DFT or Constant-Q transform onto 12 pitch classes

DFT orConstant-Q Transform

ChromaMapping

abs(magnitude) Chroma

Page 15: Audio Feature Extraction - mac.kaist.ac.kr

Chroma

Spectrogram Chroma

Chroma mapping

(Reconstructed Chroma: Shepard tone)

Page 16: Audio Feature Extraction - mac.kaist.ac.kr

Neural Network View of Chroma

● We can replace the hand-designed modules with the trainable modules ○ STFT, constant-Q transform, and chroma mapping are a linear transform○ Abs corresponds to a non-linear function○ The linear transforms are designed by hands in chroma but they can be

optimized further using the trainable modules

STFT orConstant-Q

ChromaMapping

Abs(magnitude)

ConvLayer

FClayer

Non-linearfunction

Deep Neural Network

Chroma