GCT731 Fall 2014 Topics in Music Technology - Music Information Retrieval Overview of MIR Systems...

  • Slide 1
  • Slide 2
  • GCT731 Fall 2014, Topics in Music Technology: Music Information Retrieval. Overview of MIR Systems; Audio and Music Representations (Part 1)
  • Slide 3
  • Outline
      • Overview of MIR systems
          • Human listening
          • Machine listening
      • Audio and music representations
          • Time-domain representation: waveform
          • Time-frequency domain representations: sinusoids, DFT, STFT and spectrogram
  • Slide 4
  • Human Listening
      • Ears: auditory transduction (video: http://www.youtube.com/watch?v=PeTriGTENoc)
      • Source: L. Watts, Visualizing Complexity in the Brain, 2003
  • Slide 5
  • Machine Listening
      • Should we emulate the human auditory system? It may be better to understand its functions at a high level and implement them in ways that are efficient for machines.
      • Basic functionalities
          • Capture sounds and convert the air vibration into a form the machine can access
          • Transform the input to get a better view of the sound
          • Extract only the necessary parts
          • Obtain the desired information from the extracted parts
  • Slide 6
  • (Content-based) MIR System
      • Block diagram of an MIR system: Sound Capture → Representation Transform → Feature Extraction → Algorithms
  • Slide 7
  • Sound Capture
      • Microphone: converts mechanical vibration to electrical signals
      • A-D converter: sampling and quantization; produces digital waveforms
      • The waveforms are often stored as audio files; if necessary, the files are compressed (mp3, wma, ...)
  • Slide 8
  • Representation Transform
      • Transform waveforms to get a better view of the sound, mostly using sinusoidal basis functions
      • Types
          • Short-time Fourier transform (STFT): spectrogram
          • Constant-Q transform
          • Auditory filter banks
          • Remapped spectrogram (frequency or amplitude)
          • Auto-correlation
  • Slide 9
  • Feature Extraction and Algorithms
      • Feature extraction: extract only the necessary variations in the data representation
      • Algorithms: determine categories or specific values through training
      • Two approaches to feature extraction and algorithms
          • Heuristic approach: make computational rules based on domain knowledge and trial and error
          • Learning-based approach: train the system using labeled (or unlabeled) data
      • The rest of this course is all about this
  • Slide 10
  • Sound Capture
      • Microphone
          • Converts mechanical vibration to electrical signals, followed by pre-amplifiers
          • Microphones and pre-amps have characteristic frequency responses that color the input sound
      • A-D converter
          • Sampling: continuous-time to discrete-time signals
          • Quantization: a finite number of amplitude steps
          • Produces digital waveforms
      • Often, multiple input channels are used
          • Stereo (2-ch) is standard in music recordings
          • Microphone arrays: good for sound localization and spatial filtering (e.g. beamforming)
  • Slide 11
  • Sampling
      • Convert continuous signals to a series of discrete numbers by picking signal values at uniform time intervals
      • Sampling theorem: the sampling rate must be at least twice the highest frequency the continuous signal contains
      • A lowpass filter is applied before sampling to avoid aliasing
      • Humans can hear up to about 20 kHz, so a sampling rate of 40 kHz or above is needed
      • Examples of sampling rates
          • Speech: 8 kHz, 16 kHz
          • Music: 22.05 kHz, 44.1 kHz, 48 kHz
          • Professional audio gear: 48 kHz, 96 kHz
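The aliasing that the sampling theorem warns about can be sketched numerically. The following is a minimal illustration, not part of the original slides; the helper names `sample_sine` and `alias_frequency` and the 8 kHz example rate are assumptions chosen for the demonstration.

```python
import math

def sample_sine(freq_hz, fs_hz, num_samples):
    """Sample sin(2*pi*f*t) at times t = n / fs for n = 0 .. num_samples-1."""
    return [math.sin(2 * math.pi * freq_hz * n / fs_hz) for n in range(num_samples)]

def alias_frequency(freq_hz, fs_hz):
    """Frequency in [0, fs/2] at which a real sinusoid of freq_hz appears after sampling."""
    f = freq_hz % fs_hz
    return min(f, fs_hz - f)

fs = 8000  # hypothetical speech sampling rate, so fs/2 = 4000 Hz

# A 5 kHz tone is above fs/2 and folds down to 3 kHz.
print(alias_frequency(5000, fs))

# Sample for sample, the 5 kHz sine is identical to a sign-flipped 3 kHz sine,
# so the two cannot be told apart once sampled.
high = sample_sine(5000, fs, 16)
low = sample_sine(3000, fs, 16)
print(max(abs(h + l) for h, l in zip(high, low)))
```

This is why the lowpass (anti-aliasing) filter has to come before the sampler: after sampling, the folded component is indistinguishable from a genuine in-band tone.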
  • Slide 12
  • Quantization
      • Convert a continuous range of values to a finite set of amplitude steps
      • This creates quantization error, which can be regarded as additive noise
      • A sufficient number of quantization steps is necessary to keep the noise inaudible
      • Examples of quantization steps (dynamic range, about 6 dB per bit)
          • 8 bit: 48 dB
          • 16 bit: 96 dB
          • 24 bit: 144 dB
      • Human ears: about 110 dB (depending on frequency)
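The dynamic-range figures follow from the ratio between the full scale and one quantization step, 20·log10(2^bits), which is roughly 6.02 dB per bit. A minimal sketch, assuming a uniform quantizer over [-1, 1); the function names are hypothetical and chosen for illustration only.

```python
import math

def dynamic_range_db(bits):
    """Ratio of full scale to one quantization step, in dB."""
    return 20 * math.log10(2 ** bits)

def quantize(x, bits):
    """Uniform quantizer: round x to the nearest of 2**bits steps covering [-1, 1)."""
    step = 2.0 / (2 ** bits)
    q = round(x / step) * step
    # Clip to the representable range [-1, 1 - step].
    return max(-1.0, min(1.0 - step, q))

for bits in (8, 16, 24):
    print(bits, "bit:", round(dynamic_range_db(bits), 2), "dB")

# For in-range inputs the quantization error never exceeds half a step.
bits = 8
step = 2.0 / (2 ** bits)
samples = [0.9 * math.sin(2 * math.pi * n / 50) for n in range(50)]
worst = max(abs(quantize(s, bits) - s) for s in samples)
print(worst <= step / 2)
```

The bounded error is what allows quantization to be modelled as a small additive noise on top of the signal.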
  • Slide 13
  • (Digital) Waveform
      • The most basic audio representation that computers can take: x(n) = [a1, a2, a3, ...]
      • Good for viewing energy changes
          • Overall dynamic range when zoomed out
          • Fine timing of note onsets when zoomed in
      • But not very intuitive
  • Slide 14
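  • Another View of Waveform
      • A waveform can be seen as representing a signal with shifted unit impulses as basis functions
      • For example, the signal x(n) = [a1, a2, a3, ...] is like a1 [1, 0, 0, ...] + a2 [0, 1, 0, ...] + a3 [0, 0, 1, ...] + ...
      • Can we find better basis functions?
      • New basis functions: sinusoids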
  • Slide 15
  • Sinusoids
      • A periodic waveform drawn from a circle: $x(t) = A \sin(\omega t + \phi)$
          • $A$: amplitude
          • $\omega$: angular frequency
          • $\phi$: initial phase
      • Why sinusoids are important
          • Fundamental in physics
          • Eigenfunctions of linear systems
          • The human ear is a kind of spectrum analyzer
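The "drawn from a circle" picture can be made concrete: a point rotating on a circle of radius equal to the amplitude traces a sinusoid in its vertical coordinate. A minimal sketch, assuming the sine form with amplitude, angular frequency 2π·f and initial phase; the function names and the 440 Hz example are illustrative assumptions.

```python
import cmath
import math

def sinusoid(amplitude, freq_hz, phase, t):
    """Real sinusoid A*sin(2*pi*f*t + phase) evaluated at time t (seconds)."""
    return amplitude * math.sin(2 * math.pi * freq_hz * t + phase)

def rotating_point(amplitude, freq_hz, phase, t):
    """Point on a circle of radius A, rotating at f revolutions per second."""
    return amplitude * cmath.exp(1j * (2 * math.pi * freq_hz * t + phase))

A, f, phi = 0.8, 440.0, math.pi / 4
for n in range(4):
    t = n / 8000.0
    p = rotating_point(A, f, phi, t)
    # The vertical (imaginary) coordinate of the rotating point is the sinusoid;
    # the radius stays constant at A.
    print(round(p.imag, 6), round(sinusoid(A, f, phi, t), 6), round(abs(p), 6))
```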
  • Slide 16
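  • Discrete Fourier Transform (DFT)
      • Complex sinusoid: $s_k(n) = e^{j 2\pi k n / N}$
      • By Euler's identity: $e^{j\theta} = \cos\theta + j \sin\theta$
      • Discrete Fourier transform (inner product with the complex sinusoid): $X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}$, for $k = 0, 1, \ldots, N-1$
      • Inverse discrete Fourier transform: $x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{j 2\pi k n / N}$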
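The DFT as an inner product with complex sinusoids can be written out directly. This is a minimal, deliberately slow sketch using the standard textbook definition (forward transform with a negative exponent, inverse scaled by 1/N); the function names are illustrative and not taken from the slides.

```python
import cmath

def dft(x):
    """Direct DFT: X(k) = sum_n x(n) * exp(-j*2*pi*k*n/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse DFT: x(n) = (1/N) * sum_k X(k) * exp(+j*2*pi*k*n/N)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

x = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 3.0, -2.0]
X = dft(x)
x_back = idft(X)

# The inverse transform recovers the waveform up to rounding error.
print(max(abs(a - b) for a, b in zip(x, x_back)))
```

Each output bin is one inner product over all N samples, so the direct form costs on the order of N² operations.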
  • Slide 17
  • Basis Function View
      • DFT
      • Inverse DFT
      • Practical form of DFT
  • Slide 18
  • Matrix Multiplication View of DFT
      • In fact, we don't compute this matrix product directly. There is a more efficient way, called the Fast Fourier Transform (FFT)
      • Complexity reduction by FFT: $O(N^2) \rightarrow O(N \log_2 N)$
      • Practical form of DFT
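One common way to obtain the FFT speed-up is the recursive radix-2 decimation-in-time scheme, sketched below and checked against the direct sum. The slides do not say which FFT variant is meant, so this choice is an assumption; it also assumes the input length is a power of two.

```python
import cmath

def dft_direct(x):
    """Direct O(N^2) DFT, used here only as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 0 or N & (N - 1):
        raise ValueError("length must be a power of two")
    if N == 1:
        return [complex(x[0])]
    even = fft(x[0::2])  # DFT of the even-indexed samples
    odd = fft(x[1::2])   # DFT of the odd-indexed samples
    out = [0j] * N
    for k in range(N // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / N) * odd[k]
        out[k] = even[k] + twiddle
        out[k + N // 2] = even[k] - twiddle
    return out

x = [0.0, 1.0, 0.5, -0.5, 2.0, -1.0, 0.25, 0.75]
fast = fft(x)
slow = dft_direct(x)
print(max(abs(a - b) for a, b in zip(fast, slow)))
```

Each level of recursion does O(N) work and there are log2 N levels, which is where the O(N log2 N) cost comes from.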
  • Slide 19
  • Practical Form of DFT
      • DFT produces complex numbers!
      • Magnitude: corresponds to the amplitude at frequency bin k (its square corresponds to the energy)
      • Phase: corresponds to the phase at frequency bin k
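Reading magnitude and phase off a complex DFT value can be sketched with a cosine that completes a whole number of cycles in the frame. The helper `dft_bin` and the example values are assumptions for illustration; for a cosine of amplitude A centred exactly on bin k0, the bin magnitude is A·N/2 and the bin phase is the cosine's initial phase.

```python
import cmath
import math

def dft_bin(x, k):
    """Single DFT bin: X(k) = sum_n x(n) * exp(-j*2*pi*k*n/N)."""
    N = len(x)
    return sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))

N, k0 = 32, 4          # frame length and the bin the cosine sits on
A, phi = 0.5, 0.3      # amplitude and initial phase (radians)
x = [A * math.cos(2 * math.pi * k0 * n / N + phi) for n in range(N)]

X = dft_bin(x, k0)
magnitude = abs(X)          # equals A * N / 2 for this signal
phase = cmath.phase(X)      # equals phi for this signal
print(magnitude, phase)
```

When the frequency does not fall exactly on a bin, the energy leaks into neighbouring bins and these simple relations hold only approximately.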
  • Slide 20
  • Examples of DFT: sine waveform, drum, flute
  • Slide 21
  • Short-Time Fourier Transform (STFT)
      • DFT assumes that the signal is stationary, so it is not a good idea to apply DFT to long and dynamically changing signals like music
      • Instead, we segment the signal and apply DFT to each segment separately
      • Short-time Fourier transform
          1. Segment a frame using a window function
          2. Zero-pad if necessary
          3. Apply DFT to the zero-padded windowed waveform
          4. Progress by the hop size
          5. Repeat steps 1-4
      • This produces a 2-D time-frequency representation; the spectrogram is obtained from the magnitude
      • Parameters: window size, window type, FFT size, hop size
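The STFT steps can be sketched in a few lines. This is a minimal illustration under stated assumptions: a periodic Hann window, a direct (slow) DFT instead of an FFT, and no padding at the signal edges; the function names and parameter values are hypothetical.

```python
import cmath
import math

def hann(M):
    """Periodic Hann window of length M."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / M) for n in range(M)]

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def stft(x, window_size, hop_size, fft_size):
    """Return a list of complex spectra, one per frame."""
    if fft_size < window_size:
        raise ValueError("fft_size must be at least window_size")
    w = hann(window_size)
    frames = []
    for start in range(0, len(x) - window_size + 1, hop_size):       # step 4: hop
        frame = [x[start + n] * w[n] for n in range(window_size)]     # step 1: window
        frame += [0.0] * (fft_size - window_size)                     # step 2: zero-pad
        frames.append(dft(frame))                                     # step 3: DFT
    return frames

def spectrogram(frames):
    """Magnitudes of the non-negative frequency bins 0 .. fft_size/2."""
    return [[abs(v) for v in f[:len(f) // 2 + 1]] for f in frames]

# Test tone with a period of 8 samples, i.e. 1/8 cycle per sample.
x = [math.sin(2 * math.pi * n / 8) for n in range(256)]
S = spectrogram(stft(x, window_size=64, hop_size=32, fft_size=128))
peak_bins = [row.index(max(row)) for row in S]
print(len(S), len(S[0]), peak_bins)
```

With an FFT size of 128, a frequency of 1/8 cycle per sample falls on bin 128/8 = 16, which is where every frame of this stationary tone peaks.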
  • Slide 22
  • Windowing
      • Types of window functions: rectangular, triangular, Hann, Hamming, Blackman-Harris
      • Trade-off between the width of the main lobe and the level of the side lobes
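The main-lobe versus side-lobe trade-off can be measured numerically from a zero-padded DFT of each window. A rough sketch, assuming symmetric window definitions, a 32-sample window and a 512-point transform; `lobe_stats` is a hypothetical helper that reports the bin of the first spectral minimum (a proxy for main-lobe width) and the highest side-lobe level.

```python
import cmath
import math

def rectangular(M):
    return [1.0] * M

def hann(M):
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (M - 1)) for n in range(M)]

def hamming(M):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (M - 1)) for n in range(M)]

def spectrum_db(window, fft_size):
    """Magnitude of the zero-padded DFT in dB relative to the peak at bin 0."""
    mags = []
    for k in range(fft_size // 2):
        acc = sum(w * cmath.exp(-2j * cmath.pi * k * n / fft_size)
                  for n, w in enumerate(window))
        mags.append(abs(acc))
    peak = mags[0]
    return [20 * math.log10(max(m / peak, 1e-12)) for m in mags]

def lobe_stats(window, fft_size=512):
    """Return (bin of first local minimum, highest side-lobe level in dB)."""
    db = spectrum_db(window, fft_size)
    k = 1
    while k < len(db) - 1 and not (db[k] <= db[k - 1] and db[k] <= db[k + 1]):
        k += 1
    return k, max(db[k:])

for name, make in (("rectangular", rectangular), ("hann", hann), ("hamming", hamming)):
    end_bin, side_db = lobe_stats(make(32))
    print(name, end_bin, round(side_db, 1))
```

The rectangular window has the narrowest main lobe but the highest side lobes (roughly 13 dB below the peak); the tapered windows lower the side lobes at the cost of a wider main lobe.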
  • Slide 23
  • Zero-padding
      • Adding zeros to a windowed frame in the time domain
      • Corresponds to ideal interpolation in the frequency domain
      • In practice, the FFT size increases by the amount of zero-padding
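The interpolation property can be checked directly: padding a frame to twice its length leaves the original DFT values in place at the even bins and fills the odd bins with new in-between samples of the same underlying spectrum. A minimal sketch with a direct DFT; the helper names and the example frame are illustrative assumptions.

```python
import cmath

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def zero_pad(frame, fft_size):
    """Append zeros so that the frame has fft_size samples."""
    return list(frame) + [0.0] * (fft_size - len(frame))

frame = [0.0, 0.7, 1.0, 0.7, 0.0, -0.7, -1.0, -0.7]
X = dft(frame)                                    # 8 bins
X_padded = dft(zero_pad(frame, 2 * len(frame)))   # 16 bins

# Every second bin of the padded spectrum matches the unpadded spectrum.
print(max(abs(X_padded[2 * k] - X[k]) for k in range(len(frame))))
```

Zero-padding therefore gives a denser sampling of the spectrum but does not add new information or sharpen the main lobe.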
  • Slide 24
  • Example: Music
  • Slide 25
  • Example: Deep Note
  • Slide 26
  • Time-Frequency Resolutions in STFT
      • Trade-off between time resolution and frequency resolution
          • Long window: high frequency resolution / low time resolution
          • Short window: low frequency resolution / high time resolution
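The trade-off can be put into rough numbers. The sketch below assumes the FFT size equals the window size and uses the bin spacing fs/N as a simple proxy for frequency resolution (the actual resolution also depends on the window's main-lobe width); the function name and the 44.1 kHz example are assumptions.

```python
def stft_resolution(fs_hz, window_size):
    """Return (frame duration in ms, bin spacing in Hz) for a given window size."""
    time_span_ms = 1000.0 * window_size / fs_hz
    bin_spacing_hz = fs_hz / window_size
    return time_span_ms, bin_spacing_hz

fs = 44100
for window_size in (256, 1024, 4096):
    time_ms, freq_hz = stft_resolution(fs, window_size)
    print(window_size, round(time_ms, 2), "ms", round(freq_hz, 2), "Hz")
```

The product of frame duration (in seconds) and bin spacing (in Hz) is always 1, so improving one necessarily degrades the other.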
  • Slide 27
  • References
      • JOS DSP books
          • Mathematics of DFT: https://ccrma.stanford.edu/~jos/mdft/
          • Spectral Audio Signal Processing: https://ccrma.stanford.edu/~jos/sasp/
      • The Scientist and Engineer's Guide to Digital Signal Processing (see chapters 8-12): http://www.dspguide.com/pdfbook.htm