Audio Chord Recognition Using Deep Neural Networks
Bohumír Zámečník (@bzamecnik), (A Farewell) Data Science Seminar – 2016-05-25
Agenda
● what are chords & why recognize them?
● task formulation
● data set
● pre-processing
● model
● evaluation
● future work
The dream – Beatles: Penny Lane
"multiple tonesbeing playedat the same time"
~ pitch class sets
group Z12
212 = 4096 possibilities
What are chords?
Motivation – why recognize chords?
● provide rich high-level musical structure
○ → visualization
● difficult to pick out by ear
○ lyrics & melody – easy
○ chords – harder
Representation
● symbolic names
● pitch class sets (unique tones)
[1, 3, 5] [1, 4, 6] [2, 5, 7]
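A pitch class set is a subset of the 12 pitch classes (numbered 0–11 from C), or equivalently a 12-bit binary vector, which is where the 2¹² = 4096 combinations come from. A minimal sketch of this representation (the specific chords are just examples):

# Pitch classes: C=0, C#=1, ..., B=11. A chord is a set of pitch classes,
# i.e. a subset of Z_12, or equivalently a 12-dimensional binary vector.
c_major = frozenset({0, 4, 7})   # C, E, G
a_minor = frozenset({9, 0, 4})   # A, C, E

def to_binary_vector(pitch_classes):
    """Encode a pitch class set as a 12-element 0/1 list."""
    return [1 if pc in pitch_classes else 0 for pc in range(12)]

print(to_binary_vector(c_major))  # [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]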
Task formulation – end-to-end task
● segmentation & classification
○ input data: a sampled audio recording
○ output: time segments with symbolic chord labels
start end chord
0.440395 1.689818 B
1.689818 2.209188 B/7
2.209188 2.746326 B/6
2.746326 3.280385 B/5
3.280385 3.849274 E:maj6
3.849274 4.406553 C#:min7
4.406553 4.940612 F#:sus4
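The annotation files in the reference data set are plain text with one segment per line, as in the example above. A minimal sketch of reading such a file (the file name is hypothetical):

def read_chord_segments(path):
    """Parse lines of 'start end chord' into (float, float, str) tuples."""
    segments = []
    with open(path) as f:
        for line in f:
            start, end, chord = line.split()
            segments.append((float(start), float(end), chord))
    return segments

# e.g. read_chord_segments('penny_lane.lab')
# -> [(0.440395, 1.689818, 'B'), (1.689818, 2.209188, 'B/7'), ...]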
Task formulation – intermediate task
● multi-label classification of frames
○ input: chromagram
○ output: pitch class labels for each frame
0 0 0 1 0 0 1 0 0 0 1 1
0 0 0 1 0 0 1 0 1 0 0 1
0 0 0 1 0 0 1 0 0 0 0 1
0 1 0 0 1 0 0 0 1 0 0 1
0 1 0 0 1 0 0 0 1 0 0 1
0 1 0 0 0 0 1 0 0 0 0 1
Data set – The Beatles: Reference Annotations (Isophonics)
● 180 songs
● ~8 hours
● human-annotated chord labels
● raw audio possible but hard to obtain – due to copyrights :(
○ torrent to help
Pre-processing
● hard part – cleaning the input data :)
● need to synchronize audio & features
● chromagram features
○ like a log-spectrogram
○ bins aligned to musical tones
○ linear translation
○ time-frequency reassignment
■ using phase to "focus" the content position
Pre-processing – audio
● stereo to mono (mean)
● cut into (overlapping) frames
● apply a window (Hann)
● FFT – time domain to frequency domain → spectrogram
● reassignment – derivative of phase w.r.t. time & frequency
○ better localization
● log scaling of frequency
● requantization
● dynamic range compression of values (log)
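A minimal numpy sketch of the framing/windowing/FFT/log-compression part of this pipeline (the frame and hop sizes are example values; time-frequency reassignment and the tone-aligned requantization are omitted here):

import numpy as np

def log_spectrogram_frames(samples, frame_size=4096, hop_size=2048):
    """Mono signal -> overlapping Hann-windowed frames -> magnitude spectra -> log compression."""
    window = np.hanning(frame_size)
    starts = range(0, len(samples) - frame_size, hop_size)
    frames = np.stack([samples[s:s + frame_size] * window for s in starts])
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(magnitudes)  # dynamic range compression

# stereo to mono beforehand: samples = stereo.mean(axis=1)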
(figure: linear spectrogram, log spectrogram, reassigned log spectrogram)
Pre-processing – labels
● symbolic labels to binary pitch class vectors
○ chord-labels parser
● sample the labels at frame times (to match the audio features)
B 0 0 0 1 0 0 1 0 0 0 1 1
B/7 0 0 0 1 0 0 1 0 1 0 0 1
B/6 0 0 0 1 0 0 1 0 1 0 0 1
B/5 0 0 0 1 0 0 1 0 0 0 0 1
E:maj6 0 1 0 0 1 0 0 0 1 0 0 1
C#:min7 0 1 0 0 1 0 0 0 1 0 0 1
F#:sus4 0 1 0 0 0 0 1 0 0 0 0 1
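The actual parsing is handled by the chord-labels project; a simplified sketch of the idea (only a few chord qualities are hard-coded here and slash-bass notes are ignored):

ROOTS = {'C': 0, 'C#': 1, 'Db': 1, 'D': 2, 'D#': 3, 'Eb': 3, 'E': 4, 'F': 5,
         'F#': 6, 'Gb': 6, 'G': 7, 'G#': 8, 'Ab': 8, 'A': 9, 'A#': 10, 'Bb': 10, 'B': 11}
QUALITIES = {'maj': [0, 4, 7], 'min': [0, 3, 7], 'maj6': [0, 4, 7, 9],
             'min7': [0, 3, 7, 10], 'sus4': [0, 5, 7]}

def chord_to_pitch_classes(label):
    """'C#:min7' -> binary 12-vector of its pitch classes."""
    root_name, _, quality = label.partition(':')
    root = ROOTS[root_name]
    intervals = QUALITIES[quality or 'maj']
    vector = [0] * 12
    for interval in intervals:
        vector[(root + interval) % 12] = 1
    return vector

print(chord_to_pitch_classes('C#:min7'))  # [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1]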
Pre-processing – tensor reshaping for the model
● input shape: (data points, features)
● cut the sequences to fixed length
○ e.g. 100 frames
○ → (sequence count, sequence length, features)
● reshape for convolution
○ → (sequence count, sequence length, features, channels)
● final shape: (3756, 100, 115, 1)
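A numpy sketch of this reshaping (the shapes follow the slide; the random array is just a stand-in for the real chromagram frames):

import numpy as np

frames = np.random.rand(375600, 115)       # (data points, features)
seq_length = 100
seq_count = len(frames) // seq_length
# cut into fixed-length sequences: (sequence count, sequence length, features)
sequences = frames[:seq_count * seq_length].reshape(seq_count, seq_length, 115)
# add a channel axis for the convolutions: (sequence count, sequence length, features, channels)
model_input = sequences[..., np.newaxis]
print(model_input.shape)                   # (3756, 100, 115, 1)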
Dataset size
● ~630k frames
● 115 features
● ~4 GB of raw audio
● ~300 MB of features as a compressed numpy array
● splits
○ training 60%, validation 20%, test 20%
○ over whole songs to prevent leakage!
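Splitting over whole songs (rather than individual frames) might look like this minimal sketch; the 60/20/20 proportions follow the slide and song_ids is a hypothetical list of per-song identifiers:

import random

def split_songs(song_ids, seed=42):
    """Assign whole songs to train/validation/test so frames of one song never leak across splits."""
    songs = list(song_ids)
    random.Random(seed).shuffle(songs)
    n = len(songs)
    train = songs[:int(0.6 * n)]
    valid = songs[int(0.6 * n):int(0.8 * n)]
    test = songs[int(0.8 * n):]
    return train, valid, test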
Model – using deep neural networks
● the current architecture is inspired by what's used in the wild
● convolutions (+ pooling) at the beginning to extract local features
● recurrent layers to propagate context in time
● sigmoids at the end for multi-label classification
● dropout & batch normalization for regularization
● ADAM optimizer
from keras.models import Sequential
from keras.layers import (Convolution1D, MaxPooling1D, Dropout, Flatten, Dense,
                          LSTM, TimeDistributed, BatchNormalization)

# max_seq_size, feature_count: e.g. 100 frames per sequence, 115 chromagram bins
model = Sequential()
model.add(TimeDistributed(Convolution1D(32, 3, activation='relu'), input_shape=(max_seq_size, feature_count, 1)))
model.add(TimeDistributed(Convolution1D(32, 3, activation='relu')))
model.add(TimeDistributed(MaxPooling1D(2, 2)))
model.add(Dropout(0.25))
model.add(TimeDistributed(Convolution1D(64, 3, activation='relu')))
model.add(TimeDistributed(Convolution1D(64, 3, activation='relu')))
model.add(TimeDistributed(MaxPooling1D(2, 2)))
model.add(Dropout(0.25))
model.add(TimeDistributed(Convolution1D(64, 3, activation='relu')))
model.add(TimeDistributed(Convolution1D(64, 3, activation='relu')))
model.add(TimeDistributed(MaxPooling1D(2, 2)))
model.add(Dropout(0.25))
model.add(TimeDistributed(Flatten()))
model.add(BatchNormalization())
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(64, return_sequences=True))
model.add(Dropout(0.25))
model.add(TimeDistributed(Dense(12, activation='sigmoid')))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, validation_data=(X_valid, Y_valid), nb_epoch=10, batch_size=32)
implemented in Python using Keras on top of Theano/TensorFlow
6x convolutions
2x recurrent
1x classifier
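Since the final layer outputs per-frame sigmoid probabilities over the 12 pitch classes, predictions can be turned into binary pitch class labels by thresholding; a minimal sketch (the 0.5 threshold is an assumption):

probabilities = model.predict(X_valid, batch_size=32)            # shape: (sequence count, sequence length, 12)
predicted_pitch_classes = (probabilities >= 0.5).astype('int8')   # binary pitch class vectors per frame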
Training
● trained on an NVIDIA GTX 980 Ti GPU
● model: ~260k parameters
● batch size: 32
● 6 GB GPU RAM
● ~60 s per epoch
● a few epochs to overfit
● 46 °C :)
Evaluation
● classification metrics
○ accuracy
○ hamming distance – for binary vectors
○ AUC
● segmentation metrics
○ WAOR (weighted average overlap ratio)
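A sketch of computing the frame-wise classification metrics with scikit-learn, under one reasonable reading of them (exact-match accuracy, hamming score as the fraction of correct bits, micro-averaged AUC); Y_true and Y_pred_proba are hypothetical arrays and WAOR is not covered here:

import numpy as np
from sklearn.metrics import roc_auc_score

# Y_true, Y_pred_proba: arrays of shape (frames, 12)
def frame_metrics(Y_true, Y_pred_proba, threshold=0.5):
    Y_pred = (Y_pred_proba >= threshold).astype(int)
    exact_accuracy = np.mean(np.all(Y_true == Y_pred, axis=1))   # whole 12-bit vector correct
    hamming_score = np.mean(Y_true == Y_pred)                    # fraction of correct bits
    auc = roc_auc_score(Y_true.ravel(), Y_pred_proba.ravel())    # micro-averaged AUC
    return exact_accuracy, hamming_score, auc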
Evaluation (validation set)
model         accuracy   hamming score   AUC
CNN + dense   0.402      0.873           0.910
CNN + LSTM    0.512      0.899           0.935
(figure: predicted probability, predicted labels, true labels, probability error, label error)
"And I Love Her"
predicted
ground-truth
Future work
● prepare for MIREX 2016
● clean up the project
● write it all up on the blog
● make interactive demos / a production app
● examine new approaches
○ better frame -> segment post-processing
○ 2D/nD convolutions – using locality in time/octaves
○ bi-directional RNN
○ beat-aligned features
○ language models
○ unsupervised pre-training
○ segmental RNN for direct segmentation
Open-source @ GitHub
● bzamecnik/audio-ml – latest ML models & experiments
● bzamecnik/music-processing-experiments – chromagram features
● bzamecnik/chord-labels – labels -> pitch class vectors
● bzamecnik/harmoneye
○ real-time chromagram feature visualization
○ chord timeline visualization (from the Penny Lane video)
● bzamecnik/harmoneye-android
● visualmusictheory.com – blog
● bzamecnik/ideas – more ideas :)
Thank you!