Transcript of "Deep learning architectures for music audio classification: a personal (re)view" (jordipons.me/media/UPC-2018.pdf, retrieved 2019-05-27)
Deep learning architectures for music audio classification: a personal (re)view
Jordi Pons
jordipons.me – @jordiponsdotme
Music Technology Group, Universitat Pompeu Fabra, Barcelona
Acronyms
MLP: multi-layer perceptron ≡ feed-forward neural network
RNN: recurrent neural network
LSTM: long short-term memory
CNN: convolutional neural network
BN: batch normalization
…the following slides assume you know these concepts!
Outline
Chronology: the big picture
Audio classification: state-of-the-art review
Music audio tagging as a study case
“Deep learning & music” papers: milestones
[Chart: number of “deep learning & music” papers per year (y-axis: 0–80 papers), annotated with the milestones below, revealed one per slide:]
- RNN from symbolic data for automatic music composition (Todd, 1988)
- MLP from symbolic data for automatic music composition (Lewis, 1988)
- LSTM from symbolic data for automatic music composition (Eck and Schmidhuber, 2002)
- MLP learns from spectrograms for note onset detection (Marolt et al., 2002)
- CNN learns from spectrograms for music audio classification (Lee et al., 2009)
- End-to-end learning for music audio classification (Dieleman et al., 2014)
“Deep learning & music” papers: data trends
[Chart: number of papers per year, split by input data type: symbolic data, spectrograms, raw audio]
“Deep learning & music” papers: some references
Dieleman et al., 2014 – End-to-end learning for music audio, in International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Lee et al., 2009 – Unsupervised feature learning for audio classification using convolutional deep belief networks, in Advances in Neural Information Processing Systems (NIPS)
Marolt et al., 2002 – Neural networks for note onset detection in piano music, in Proceedings of the International Computer Music Conference (ICMC)
Eck and Schmidhuber, 2002 – Finding temporal structure in music: Blues improvisation with LSTM recurrent networks, in Proceedings of the Workshop on Neural Networks for Signal Processing
Todd, 1988 – A sequential network design for musical applications, in Proceedings of the Connectionist Models Summer School
Lewis, 1988 – Creation by Refinement: A creativity paradigm for gradient descent learning networks, in International Conference on Neural Networks
Outline
Chronology: the big picture
Audio classification: state-of-the-art review
Music audio tagging as a study case
What is our goal / task?
input → machine learning (deep learning model) → output
input: waveform, or any audio representation!
output: phonetic transcription / describe music with tags / event detection
The deep learning pipeline
input → front-end → back-end → output
input: waveform, or any audio representation!
output: phonetic transcription / describe music with tags / event detection
The deep learning pipeline: input?
input: ? → front-end → back-end → output
How to format the input (audio) data?
- waveform: end-to-end learning
- pre-processed waveform, e.g. spectrogram
The deep learning pipeline: front-end?
input (waveform / spectrogram) → front-end: ?
[Decision tree, built up over the following slides: input signal? (waveform vs. pre-processed waveform) × filters config based on domain knowledge?]
CNN front-ends for audio classification
- waveform (end-to-end learning): sample-level, stacks of small 3x1 filters
- pre-processed waveform (e.g. spectrogram): small-rectangular filters, stacks of 3x3 filters
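To make the contrast concrete, here is a minimal, hedged sketch (pure Python, toy values; not the architectures from the cited papers): a sample-level front-end is just a stack of tiny 1-D convolutions applied directly to raw samples, whereas a small-rectangular front-end applies 3x3 filters to the 2-D spectrogram.

```python
def conv1d(signal, kernel):
    """Valid-mode 1-D convolution: slide a tiny filter along raw samples."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A length-3 "sample-level" filter (a simple difference operator,
# purely illustrative) applied to a toy waveform:
waveform = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
out = conv1d(waveform, [-1.0, 0.0, 1.0])
# -> [0.0, -2.0, 0.0, 2.0]
```

Real front-ends stack many such layers, with learned weights, nonlinearities and pooling in between, rather than a single hand-written filter.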
Domain knowledge to design CNN front-ends
- waveform (end-to-end learning): frame-level filters. Filter length: 512 ≈ window length? Stride: 256 ≈ hop size?
- pre-processed waveform (e.g. spectrogram): vertical or horizontal filters
Explicitly tailoring the CNN towards learning temporal or timbral cues
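The window/hop analogy can be sketched directly: a frame-level convolution with filter length 512 and stride 256 sees exactly the same overlapping frames as an STFT with a 512-sample window and a 256-sample hop (illustrative numbers from the slide; pure Python):

```python
def frame(signal, length=512, hop=256):
    """Split a waveform into overlapping frames, as a strided
    frame-level convolution (or an STFT) would see it."""
    return [signal[i:i + length]
            for i in range(0, len(signal) - length + 1, hop)]

frames = frame(list(range(1024)), length=512, hop=256)
# 1024 samples -> 3 frames starting at samples 0, 256 and 512;
# each learned filter of length 512 produces one output per frame.
```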
DSP wisdom to design CNN front-ends
- waveform (end-to-end learning): frame-level, many shapes!
- pre-processed waveform (e.g. spectrogram): vertical and/or horizontal filters
Explicitly tailoring the CNN towards learning temporal and timbral cues
An efficient way to represent 4 periods!
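On a (time × frequency) spectrogram, filter shape encodes a prior: vertical filters (tall in frequency, narrow in time) capture timbral cues, while horizontal filters (wide in time, narrow in frequency) capture temporal cues. A hedged toy example of the two orientations; the shapes are illustrative, not those of any cited model:

```python
def conv2d(x, kernel):
    """Valid-mode 2-D convolution over a (time x frequency) map."""
    kt, kf = len(kernel), len(kernel[0])
    T, F = len(x), len(x[0])
    return [[sum(x[t + i][f + j] * kernel[i][j]
                 for i in range(kt) for j in range(kf))
             for f in range(F - kf + 1)]
            for t in range(T - kt + 1)]

spec = [[1.0] * 8 for _ in range(16)]   # toy 16 (time) x 8 (freq) spectrogram
vertical = [[1.0] * 4]                  # 1 (time) x 4 (freq): timbral filter
horizontal = [[1.0]] * 8                # 8 (time) x 1 (freq): temporal filter
out_v = conv2d(spec, vertical)          # 16 x 5 feature map
out_h = conv2d(spec, horizontal)        # 9 x 8 feature map
```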
The front-end decision tree so far:
input signal?
- waveform:
  - filters config not based on domain knowledge (minimal filter expression): sample-level (stacks of 3x1 filters)
  - domain knowledge, single filter shape in 1st CNN layer: frame-level
  - domain knowledge, many filter shapes in 1st CNN layer: frame-level (many shapes)
- pre-processed waveform:
  - filters config not based on domain knowledge (minimal filter expression): small-rectangular filters (stacks of 3x3 filters)
  - domain knowledge, single filter shape in 1st CNN layer: vertical OR horizontal
  - domain knowledge, many filter shapes in 1st CNN layer: vertical AND/OR horizontal
CNN front-ends for audio classification: references
Sample-level: Lee et al., 2017 – Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms, in Sound and Music Computing Conference (SMC)
Small-rectangular filters: Choi et al., 2016 – Automatic tagging using deep convolutional neural networks, in Proceedings of the ISMIR (International Society for Music Information Retrieval) Conference
Frame-level (single shape): Dieleman et al., 2014 – End-to-end learning for music audio, in International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Vertical: Lee et al., 2009 – Unsupervised feature learning for audio classification using convolutional deep belief networks, in Advances in Neural Information Processing Systems (NIPS)
Horizontal: Schlüter & Böck, 2014 – Improved musical onset detection with convolutional neural networks, in International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Frame-level (many shapes): Zhu et al., 2016 – Learning multiscale features directly from waveforms, arXiv:1603.09509
Vertical and horizontal (many shapes): Pons et al., 2016 – Experimenting with musically motivated convolutional neural networks, in 14th International Workshop on Content-Based Multimedia Indexing (CBMI)
The deep learning pipeline: back-end?
input (waveform / spectrogram) → front-end (several CNN architectures) → back-end: ?
What is the back-end doing?
The back-end adapts a variable-length feature map to a fixed output size.
[Figure: two inputs of different lengths pass through the same front-end, giving latent feature maps of different lengths; the back-end maps both to same-length outputs.]
Back-ends for variable-length inputs
…music is generally of variable length!
● Temporal pooling: max-pool or average-pool over the temporal axis
Pons et al., 2017 – End-to-end learning for music audio tagging at scale, in Proceedings of the ML4Audio Workshop at NIPS
● Attention: weighting latent representations according to what is important
C. Raffel, 2016 – Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching, PhD thesis
● RNN: summarization through a deep temporal model
Vogl et al., 2018 – Drum transcription via joint beat and drum modeling using convolutional recurrent neural networks, in Proceedings of the ISMIR Conference
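Temporal pooling, the simplest of the three, fits in a few lines; here mean- and max-pooling over time are concatenated (one common choice; the cited back-ends differ in detail):

```python
def temporal_pool(feature_map):
    """Collapse a variable-length (T x F) latent feature map into a
    fixed-size 2*F vector by mean- and max-pooling over the time axis."""
    T = len(feature_map)
    F = len(feature_map[0])
    mean = [sum(row[f] for row in feature_map) / T for f in range(F)]
    mx = [max(row[f] for row in feature_map) for f in range(F)]
    return mean + mx

v_short = temporal_pool([[1.0, 2.0]] * 10)    # T = 10
v_long = temporal_pool([[1.0, 2.0]] * 500)    # T = 500
# Both outputs have length 4, regardless of the input duration.
```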
Back-ends for fixed-length inputs
Common trick: assume a fixed-length input
● Fully convolutional stacks: adapt the input to the output with a stack of CNN & pooling layers
Choi et al., 2016 – Automatic tagging using deep convolutional neural networks, in Proceedings of the ISMIR Conference
● MLP: map a fixed-length feature map to a fixed-length output
Schlüter & Böck, 2014 – Improved musical onset detection with convolutional neural networks, in Proceedings of ICASSP
…this trick works very well!
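One common way to reconcile the fixed-length assumption with variable-length songs at inference time is to slice the audio into fixed-length patches and average the patch-wise predictions (a widespread practice; the exact procedure in the cited papers may differ). A sketch with a hypothetical `toy_model` stand-in:

```python
def predict_song(signal, model, patch_len):
    """Run a fixed-length model over a variable-length song: slice into
    non-overlapping patches and average the per-patch predictions."""
    patches = [signal[i:i + patch_len]
               for i in range(0, len(signal) - patch_len + 1, patch_len)]
    preds = [model(p) for p in patches]
    n = len(preds)
    return [sum(p[k] for p in preds) / n for k in range(len(preds[0]))]

def toy_model(patch):
    """Hypothetical stand-in: two 'tag scores' derived from the patch."""
    return [max(patch), sum(patch) / len(patch)]

song = [0.0, 1.0, 0.0, 1.0, 2.0, 1.0]
tags = predict_song(song, toy_model, patch_len=3)
# tags ≈ [1.5, 0.833]: the average over the two patch-wise predictions.
```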
The deep learning pipeline: output
input (waveform / spectrogram) → front-end (several CNN architectures) → back-end (MLP / RNN / attention) → output (phonetic transcription / describe music with tags / event detection)
Outline
Chronology: the big picture
Audio classification: state-of-the-art review
Music audio tagging as a study case
Pons et al., 2017 – End-to-end learning for music audio tagging at scale, in ML4Audio Workshop at NIPS (summer internship @ Pandora)
The deep learning pipeline: input?
input: ? → front-end → back-end → output: describe music with tags
How to format the input (audio) data?
- waveform: NO pre-processing! (already zero-mean & unit-variance)
- log-mel spectrogram:
  – STFT & mel mapping: reduces the size of the input by removing perceptually irrelevant information
  – logarithmic compression: reduces the dynamic range of the input
  – zero-mean & unit-variance normalization
The deep learning pipeline: input
input (waveform / log-mel spectrogram) → front-end → back-end → output: describe music with tags
The deep learning pipeline: front-end?
input (waveform / log-mel spectrogram) → front-end: ? → back-end → output: describe music with tags
Studied front-ends: waveform model
sample-level (Lee et al., 2017)
Studied front-ends: spectrogram model
vertical and horizontal musically motivated CNNs (Pons et al., 2016–2017)
The deep learning pipeline: front-end
input (waveform / log-mel spectrogram) → front-end (sample-level / vertical and horizontal) → back-end → output: describe music with tags
The deep learning pipeline: back-end?
input (waveform / log-mel spectrogram) → front-end (sample-level / vertical and horizontal) → back-end: ? → output: describe music with tags
Studied back-end: music is of variable length!
Temporal pooling (Dieleman et al., 2014)
The deep learning pipeline: back-end
input (waveform / log-mel spectrogram) → front-end (sample-level / vertical and horizontal) → back-end (temporal pooling) → output: describe music with tags
Results on datasets of increasing size (MagnaTagATune: 25k songs; Million Song Dataset: 250k songs; 1M songs):
- at smaller scale: spectrograms > waveforms
- at the largest scale (1M songs): waveforms > spectrograms
Let’s listen to some music: our model in action
acoustic
string ensemble
classical music
period baroque
compositional dominance of lead vocals
major
Deep learning architectures for music audio classification: a personal (re)view
Jordi Pons
jordipons.me – @jordiponsdotme
Music Technology Group, Universitat Pompeu Fabra, Barcelona