Gaussian Processes for Audio Feature Extraction
Dr. Richard E. Turner ([email protected]), Computational and Biological Learning Lab, Department of Engineering, University of Cambridge

Transcript of "Gaussian Processes for Audio Feature Extraction"

  • Gaussian Processes for Audio Feature Extraction

    Dr. Richard E. Turner ([email protected])

    Computational and Biological Learning Lab, Department of Engineering

    University of Cambridge

  • Machine hearing pipeline

    [pipeline diagram: signal (T samples) → time-frequency (TF) analysis (non-linear): short-time Fourier transform / spectrogram, wavelet, filter bank → TF representation (T′ × D > T samples, frequency vs. time) → probabilistic model: HMM (speech recognition), NMF (source separation, denoising), ICA (source separation, denoising)]

  • Problems with conventional pipeline

    [same pipeline diagram: signal (T samples) → non-linear time-frequency (TF) analysis → TF representation (T′ × D > T samples) → probabilistic model]

    noise (source mixtures) hard to model in the TF domain
    (hard to propagate uncertainty, e.g. noise/missing data, from the signal to the TF domain)

    learning based on the time-frequency representation ignores the Jacobian

    hard to enforce/learn the dependencies intrinsic to the TF analysis
    (image of the mapping, which is injective)

    hard to adapt both the top and bottom layers

  • Goal of this talk

    [same pipeline diagram: signal (T samples) → non-linear time-frequency (TF) analysis → TF representation (T′ × D > T samples) → probabilistic model; the TF analysis is classical signal processing, the probabilistic model is machine learning]

    probabilise time-frequency analysis (construct a generative model in which inference corresponds to classical time-frequency analysis)

    build a hierarchical model that incorporates the downstream processing module

  • A typical audio pipeline

    [figure: signal y(t) over time /s → window → short-time Fourier transform → magnitude → spectrogram (frequency /kHz vs. time /s) → NMF]
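For concreteness, here is a bare-bones Python sketch of the classical pipeline on this slide. The window length, hop, NMF rank and the test signal are assumed toy choices rather than the talk's settings: window the signal, take short-time Fourier transform magnitudes to form a spectrogram, then factorise it with multiplicative-update NMF.

```python
import numpy as np

def spectrogram(y, win_len=256, hop=128):
    """Classical pipeline: window, short-time Fourier transform, magnitude."""
    window = np.hanning(win_len)
    frames = [y[i:i + win_len] * window
              for i in range(0, len(y) - win_len + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T   # (freq, time)

def nmf(V, rank=4, iters=200, eps=1e-9):
    """Plain multiplicative-update NMF (Lee & Seung), V ≈ W @ H."""
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# assumed toy test signal: two tones, the second switching on after 1 s
fs = 8000
t = np.arange(2 * fs) / fs
y = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t) * (t >= 1)
V = spectrogram(y)          # spectrogram (frequency x time)
W, H = nmf(V)               # W: spectral templates, H: time-varying activations
```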

  • What form of generative model corresponds to the STFT?

    desire: expected value of latent time-frequency coefficients s_{d,1:T} = STFT

    • assume y formed by a (weighted) superposition of band-limited signals s_{d,1:T}

    • linearity of inference can be assured by setting the distributions of each s_{d,1:T} and the noise to be Gaussian

    • time-invariance =⇒ generative model is statistically stationary

    =⇒ GP prior over STFT coefficients, p(s_{d,1:T}) = G(s_{d,1:T}; 0, Γ), stationary, with

    Γ_{t,t′} ≈ ∑_{k=1}^{T} FT⁻¹_{t,k} γ_k FT_{k,t′},   where FT_{k,t} = e^{−2πi(k−1)(t−1)/T}
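To make the stationary-covariance construction above concrete, here is a minimal numpy sketch. The example spectrum γ_k (two symmetric Gaussian bumps) is an assumption for illustration, not a spectrum from the talk; the code forms Γ = FT⁻¹ diag(γ) FT, checks that it is circulant (i.e. stationary), and draws one band-limited sample s_{d,1:T} from the GP prior.

```python
import numpy as np

T = 256
n = np.arange(T)
FT = np.exp(-2j * np.pi * np.outer(n, n) / T)     # FT_{k,t} = e^{-2πi(k-1)(t-1)/T}
FTinv = np.conj(FT).T / T                         # inverse DFT matrix

# assumed example spectrum γ_k: a bump and its mirror image (γ_k = γ_{T-k}, so Γ is real)
k = np.arange(T)
gamma = np.exp(-0.5 * ((k - 40) / 5.0) ** 2) + np.exp(-0.5 * ((k - (T - 40)) / 5.0) ** 2)

Gamma = (FTinv @ np.diag(gamma) @ FT).real        # Γ = FT⁻¹ diag(γ) FT

# stationarity check: Γ is circulant, so row 1 is row 0 shifted by one sample
assert np.allclose(Gamma[1, 1:], Gamma[0, :-1], atol=1e-10)

# draw one band-limited signal s_{d,1:T} ~ G(0, Γ) from the GP prior
L = np.linalg.cholesky(Gamma + 1e-8 * np.eye(T))
s = L @ np.random.default_rng(0).standard_normal(T)
```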

  • Time-frequency analysis as inference

    [diagram, shown as a sequence of builds: generation vs. inference]

    generation: complex sinusoids multiplied by time-varying (complex) coefficients

    inference: the most probable coefficients given the signal are the STFT

    [labels: signal / noise; depends on / independent of; Wiener filter]

    STFT window = prior covariance; frequency-shifted inverse signal covariance

    probabilistic filter bank, probabilistic STFT

  • Time-frequency analysis as inference

    • probabilistic models in which inference recovers STFT, filter bank, and wavelet analysis

    – unifies a number of existing probabilistic time-series models and connects them to traditional signal processing
    – can learn the window of the STFT and the frequencies (equivalently, the filter properties)
    – the frequency-shift relationship mimics the classical relationship between these time-frequency representations

    • hops/down-sampling and the finite window used correspond to FITC (uniformly spaced pseudo-points) and sparse-covariance approximations

    – rediscover Nyquist in the context of approximate GPs
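As a sanity check on the "probabilistic filter bank" reading, here is a small numpy sketch. Everything in it (two band spectra built from Gaussian bumps, the noise variance, the signal length) is an assumed toy setting: because all distributions are Gaussian, the posterior mean of each band-limited component given the observed signal is a Wiener filter, and these posterior means play the role of the subband/STFT coefficients.

```python
import numpy as np

def stationary_cov(gamma):
    """Γ = FT⁻¹ diag(γ) FT for a symmetric spectrum γ (circulant, real)."""
    T = len(gamma)
    n = np.arange(T)
    FT = np.exp(-2j * np.pi * np.outer(n, n) / T)
    return ((np.conj(FT).T / T) @ np.diag(gamma) @ FT).real

T = 256
k = np.arange(T)
bump = lambda c, w: (np.exp(-0.5 * ((k - c) / w) ** 2)
                     + np.exp(-0.5 * ((k - (T - c)) / w) ** 2))
band_covs = [stationary_cov(bump(20, 4)), stationary_cov(bump(60, 8))]   # assumed 2 bands
sigma2 = 0.01                                                            # assumed noise var

# sample a toy signal from the model: sum of band-limited components plus noise
rng = np.random.default_rng(0)
parts = [np.linalg.cholesky(G + 1e-8 * np.eye(T)) @ rng.standard_normal(T) for G in band_covs]
y = sum(parts) + np.sqrt(sigma2) * rng.standard_normal(T)

# probabilistic filter bank: Wiener-filter posterior means of the components,
#   E[s_d | y] = Γ_d (Σ_d' Γ_d' + σ² I)⁻¹ y
Ky = sum(band_covs) + sigma2 * np.eye(T)
alpha = np.linalg.solve(Ky, y)
subband_means = [G @ alpha for G in band_covs]
```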

  • Probabilistic audio processing pipeline

    [figure: a signal (time /ms, roughly 20–180 ms) is decomposed into carriers and envelopes, shown with spectrograms (frequency 0.1–2.6 kHz); carriers = bandpass Gaussian noise with a mean spectrum, envelopes = slow Gaussian processes, with envelope patterns and their mean spectrum shown across bands]
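The figure's generative model can be prototyped in a few lines. The following sketch is a loose illustration with assumed numbers (sample rate, band centres, bandwidth, envelope lengthscale) rather than the talk's actual model: each band multiplies a bandpass-Gaussian-noise carrier by a positive envelope obtained by exponentiating a slowly varying Gaussian process, and the bands are summed with a little observation noise.

```python
import numpy as np

rng = np.random.default_rng(1)
fs, dur = 8000, 1.0                      # assumed sample rate /Hz and duration /s
T = int(fs * dur)
centres = [300.0, 900.0, 2000.0]         # assumed carrier centre frequencies /Hz
bw = 100.0                               # assumed carrier bandwidth /Hz

def bandpass_noise(fc):
    """Carrier: Gaussian white noise restricted to a band around fc (FFT masking)."""
    white = rng.standard_normal(T)
    freqs = np.fft.rfftfreq(T, 1 / fs)
    mask = (np.abs(freqs - fc) < bw / 2).astype(float)
    x = np.fft.irfft(np.fft.rfft(white) * mask, n=T)
    return x / (x.std() + 1e-12)

def slow_envelope(lengthscale=0.05):
    """Envelope: exp of a slowly varying Gaussian process (z = log h, as on later slides)."""
    z = rng.standard_normal(T)
    tt = np.arange(-3 * lengthscale, 3 * lengthscale, 1 / fs)
    kern = np.exp(-0.5 * (tt / lengthscale) ** 2)
    z = np.convolve(z, kern / kern.sum(), mode="same")   # crude smoothed-noise GP sample
    return np.exp(z / (z.std() + 1e-12))                 # positive, slowly modulating

carriers = [bandpass_noise(fc) for fc in centres]
envelopes = [slow_envelope() for _ in range(len(centres))]
y = sum(h * x for h, x in zip(envelopes, carriers)) + 0.01 * rng.standard_normal(T)
```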

  • Inference and Learning

    • Key observation – fix the envelopes:

    – posterior over the carriers is Gaussian
    – posterior mean given by an (adaptive) filter

    • Leads to MAP estimation of the envelopes (or Hamiltonian MCMC); let z_{l,t} = log h_{l,t}

    Z_MAP = arg max_Z p(Z|Y)

    p(Z|Y) = (1/Z) p(Z,Y) = (1/Z) ∫ dX p(Z,Y,X) = (1/Z) p(Z) ∫ dX p(Y|A,X) p(X)

    • Compute the integral efficiently using a chain-structured approximation and Kalman smoothing

    • Leads to gradient-based optimisation of the transformed amplitudes

    • Learning: approximate maximum likelihood, θ = arg max_θ p(Y|θ)

    • NMF: zero-temperature EM, one E-step, initialise with constant envelopes
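A minimal numpy sketch of the key observation above (the function and its toy check are illustrative assumptions, not the talk's code): with the envelopes H held fixed, the carriers X enter the model y = Σ_d diag(h_d) x_d + ε linearly, so their posterior is Gaussian and the posterior mean is an envelope-dependent, i.e. adaptive, filter.

```python
import numpy as np

def carrier_posterior_mean(y, envelopes, band_covs, sigma2):
    """Posterior mean of each carrier with the envelopes fixed (adaptive Wiener filter):
       E[x_d | y, H] = Γ_d H_d (Σ_d' H_d' Γ_d' H_d' + σ² I)⁻¹ y,  with H_d = diag(h_d).
    """
    T = len(y)
    Ky = sigma2 * np.eye(T)
    for h, G in zip(envelopes, band_covs):
        H = np.diag(h)
        Ky += H @ G @ H
    alpha = np.linalg.solve(Ky, y)
    return [G @ (h * alpha) for h, G in zip(envelopes, band_covs)]

# toy check (assumed sizes): one band, unit covariance, constant unit envelope;
# this reduces to an ordinary Wiener filter that shrinks y by a factor of 1/2
T = 64
y = np.random.default_rng(0).standard_normal(T)
m, = carrier_posterior_mean(y, [np.ones(T)], [np.eye(T)], sigma2=1.0)
assert np.allclose(m, 0.5 * y)
```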


  • Audio modelling

    [figure: spectrograms (frequency /kHz vs. time /s) of natural sound textures: fire, stream, wind, rain, tent-zip, footstep (Turner, 2010)]

  • Statistical texture synthesis

    • Old approach: build detailed physical models (e.g. rain drops)

    • New approach

    – train the model on your favourite texture
    – sample from the prior, and then from the likelihood

    • Waveform is unique, but statistically matched to the original

    • Often perceptually indistinguishable

  • Audio denoising

    [results figure: SNR improvement /dB vs. SNR before /dB, PESQ improvement vs. PESQ before, and log-spectral SNR improvement /dB vs. log-spectral SNR before /dB, alongside example waveforms y_t over time /ms; methods compared: NMF, tNMF, GTF, GTFtNMF, adapted filters, unadapted filters, Wiener, spectral subtraction, block thresholding]

  • Audio missing data imputation

    [results figure: SNR /dB, PESQ, and log-spectral SNR /dB as a function of missing-region length /ms, alongside example reconstructions y_t over time /ms; methods compared: tNMF, GTF, GTFtNMF, unadapted filters, adapted filters]

  • Unifying classical and probabilistic audio signal processing

    [diagram: Probabilistic signal processing ↔ Classical signal processing; labels: robustness, adaptation; fast methods, important variables]

  • Spectrogram

    [diagram linking classical and probabilistic signal processing: Spectrogram, Filter Bank & Hilbert, STFT and Amplitudes, related by frequency-shift and estimation arrows; probabilistic counterparts include Cemgil & Godsill and Qi & Minka]

  • Additional slides