LEARNING ENVIRONMENTAL SOUNDS WITH END-TO ......conv1 40 8 24000 input (a) Raw feature extraction...

1
4 6 8 16 32 64 128 59 60 61 62 63 64 Filter size Accuracy [%] 1 convlayer 2 convlayers 3 convlayers 1 2 3 4 5 40 45 50 55 60 65 Input length T [s] Accuracy [%] Settings Dataset: ESC-50 [Piczak, 2015] Total: 50 classes, 2,000 samples Each sample: monaural, 5 seconds, 44.1 kHz Evaluation: 5-fold cross-validation 1,200 samples for training, 400 for validation, 400 for testing Main results The accuracy of EnvNet is higher than static logmel-CNN by 5.1 % We achieve a state-of-the-art accuracy by combining EnvNet and logmel-CNN (averaging) LEARNING ENVIRONMENTAL SOUNDS WITH END-TO-END CONVOLUTIONAL NEURAL NETWORK Yuji Tokozume and Tatsuya Harada, The University of Tokyo Background & Goal ESC is usually conducted based on spectral features such as the log-mel feature These features are designed by humans separately from other parts of the system There could be other effective features of ESC If environmental sounds could be directly learned from the raw waveform, We would be able to extract a new feature representing information different from the log-mel feature This new feature could contribute to the improvement of classification performance Goal: End-to-end ESC system Related work Log-mel feature + CNN [Piczak, 2015] State-of-the-art method of ESC End-to-end speech recognition [Sainath et al., 2015] Performance matches the static log-mel feature EnvNet learns a frequency response which is quite similar to human perception, but the order of the filters is optimized to maximize the classification performance We conjecture that is why our EnvNet feature is effective and has the ability to complement the log-mel feature We proposed an end-to-end environmental sound classification (ESC) system with a CNN We achieved a 6.5% improvement in classification accuracy over the state-of-the-art logmel-CNN, simply by combining our system and logmel-CNN We analyzed the feature learned with our system, and showed that our end-to-end system is capable of extracting a discriminative feature that complements the log-mel feature T-s window (T = 1.5) raw waveform data (16 kHz) 8 maxpool conv2 pool2 40 40 150 11 46 50 160 maxpool 3×3 8 13 conv1 40 8 24000 input (a) Raw feature extraction Processing on feature-map (b) pool4 fc5 fc6 output softmax 14 50 5 1 pool3 11 4096 4096 50 1×3 maxpool EnvNet: End-to-end convolutional neural network for environmental sound classification Overview Input: fixed -s raw waveform 16 kHz, range from -1 to 1 Output: class probabilities Data augmentation Training: random cropping (max amplitude > 0.2) Test: probability voting (create a sliding window and take the average of all the softmax outputs) Network architecture Raw feature extraction (a) 1-D convolutional and pooling layers Pool2: 40 types of frequency features per 10 ms Processing on feature-map (b) 2-D convolutional and pooling layers Finally, classify sounds with fully connected layers Initial experiments Input length : 1.5 2017 IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2017 March 5-9, 2017, New Orleans, USA Summary Introduction End-to-end ESC system Experiments Analysis on learned feature logmel-CNN EnvNet (ours) Accuracy [%] static delta 58.9 ± 2.6 66.5 ± 2.8 64.0 ± 2.4 69.3 ± 2.2 71.0 ± 3.1 Piczak logmel-CNN 64.5 Human 81.3 This work was funded by ImPACT Program of Council for Science, Technology and Innovation (Cabinet Office, Government of Japan) and supported by CREST, JST. Acknowledgement Conv-layers for raw feature extraction: 2 layers with size 8 Dog Cat Feature extraction Dog Cat Classification sort Each of the 40 filters responds to a particular frequency area Neighboring filters have a similar frequency response If we sort the filters based on their center frequency, the curve of the center frequency almost matches the mel-scale, i.e., how humans perceive the sound Frequency response of pool2

Transcript of LEARNING ENVIRONMENTAL SOUNDS WITH END-TO ......conv1 40 8 24000 input (a) Raw feature extraction...

Page 1: LEARNING ENVIRONMENTAL SOUNDS WITH END-TO ......conv1 40 8 24000 input (a) Raw feature extraction Processing on feature-map (b) pool4 fc5 fc6 output softmax 14 50 1 5 pool3 11 4096

4 6 8 16 32 64 12859

60

61

62

63

64

Filter size

Acc

urac

y [%

]

1 conv−layer2 conv−layers3 conv−layers

1 2 3 4 540

45

50

55

60

65

Input length T [s]

Acc

urac

y [%

]

Settingsp Dataset: ESC-50 [Piczak, 2015]• Total: 50 classes, 2,000 samples• Each sample: monaural, 5 seconds, 44.1 kHz

p Evaluation: 5-fold cross-validation• 1,200 samples for training, 400 for validation, 400 for testing

Main results

p The accuracy of EnvNet is higher than static logmel-CNN by 5.1 %

p We achieve a state-of-the-art accuracy by combining EnvNet and logmel-CNN (averaging)

LEARNING ENVIRONMENTAL SOUNDS WITH END-TO-END CONVOLUTIONAL NEURAL NETWORK

Yuji Tokozume and Tatsuya Harada, The University of Tokyo

Background & Goalp ESC is usually conducted based on spectral features

such as the log-mel featurep These features are designed by humans separately

from other parts of the system→ There could be other effective features of ESC

p If environmental sounds could be directly learned from the raw waveform,• We would be able to extract a new feature representing

information different from the log-mel feature• This new feature could contribute to the improvement of

classification performance

Ø Goal: End-to-end ESC system

Related workp Log-mel feature + CNN [Piczak, 2015]• State-of-the-art method of ESC

p End-to-end speech recognition [Sainath et al., 2015]• Performance matches the static log-mel feature

p EnvNet learns a frequency response which is quite similar to human perception, but the order of the filters is optimized to maximize the classification performance

Ø We conjecture that is why our EnvNet feature is effective and has the ability to complement the log-mel feature

p We proposed an end-to-end environmental sound classification (ESC) system with a CNN

p We achieved a 6.5% improvement in classification accuracy over the state-of-the-art logmel-CNN,simply by combining our system and logmel-CNN

p We analyzed the feature learned with our system, and showed that our end-to-end system is capable of extracting a discriminative feature that complements the log-mel feature

T-s window(T = 1.5)

raw waveform data (16 kHz)

8

maxpool

conv2 pool2

4040

150

1146

50

160

maxpool3×38 13

conv1

40

824000

input

(a) Raw feature extraction

Processing on feature-map

(b)

pool4

fc5 fc6output

softmax

14

50

51pool3 11

4096

4096 50

1×3maxpool

EnvNet: End-to-end convolutional neural network for environmental sound classification

Overviewp Input: fixed 𝑇-s raw waveform• 16 kHz, range from -1 to 1

p Output: class probabilitiesp Data augmentation• Training: random cropping (max amplitude > 0.2)• Test: probability voting (create a sliding window

and take the average of all the softmax outputs)

Network architecturep Raw feature extraction (a)• 1-D convolutional and pooling layers• Pool2: 40 types of frequency features per 10 ms

p Processing on feature-map (b)• 2-D convolutional and pooling layers• Finally, classify sounds with fully connected layers

Initial experimentsp Input length 𝑇: 1.5

2017 IEEE International Conference on Acoustics, Speech and Signal Processing

ICASSP 2017March 5-9, 2017, New Orleans, USA

Summary

Introduction

End-to-end ESC system

Experiments Analysis on learned feature

logmel-CNN EnvNet (ours) Accuracy [%]static delta

✔ 58.9 ± 2.6✔ ✔ 66.5 ± 2.8

✔ 64.0 ± 2.4✔ ✔ 69.3 ± 2.2✔ ✔ ✔ 71.0 ± 3.1Piczak logmel-CNN 64.5

Human 81.3

This work was funded by ImPACT Program of Council for Science, Technology and Innovation (Cabinet Office, Government of Japan) and supported by CREST, JST.

Acknowledgement

p Conv-layers for raw feature extraction:2 layers with size 8

Dog

Cat

① Feature extractionDog

Cat

② Classification

sort

Each of the 40 filtersresponds to a particular frequency areaNeighboring filters have a similar frequency response

If we sort the filters based on their center frequency, the curve of the center frequency almost matches the mel-scale, i.e., how humans perceive the sound

p Frequency response of pool2