Assisting Aphasia Diagnosis Employing Deep Learning


Master Thesis

Assisting Aphasia Diagnosis Employing

Deep Learning

Julius Haring

Master Thesis DKE-21-07

Thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science of Data Science for Decision Making at the Department of Data Science and Knowledge Engineering

of Maastricht University

Thesis Committee:

Jan Niehues
Pietro Bonizzi

Maastricht University
Faculty of Science and Engineering

Department of Data Science and Knowledge Engineering

January 24, 2021

Preface

I would like to thank my supervisors and their team, namely Jan, Pietro and Danni, who constantly provided support and helpful insight during my thesis with great commitment. The whole project could also not have been done without my great colleagues at HotSprings GmbH, who gave me the opportunity to find a place in the world of work to which I return with joy every day. Especially Christian shaped the way of my career in the last five years, for which I am truly grateful. Lastly, the unconditional support and love of my parents gave me the strength to pursue this program in the first place. Thank you for helping me through minor and major personal catastrophes. This thesis is dedicated to you.


Abstract

Aphasia is an acquired deficit in different parts of the brain that perform language processing. Patients are not capable of uttering, understanding and writing speech to the same extent as non-impaired people are. However, the symptoms of aphasia are diverse and displayed in unique ways for each patient. Patients in the German language area are diagnosed using the so-called Aachener Aphasie Test (AAT) [1], which provides a standardized way of classifying the manifestations of said disease. This thesis investigates the automation of the methods applied in said test, which relies heavily on free speech interviews between patients and therapists, whose audio files serve as the data for this work.

One of the six dimensions of speech that are evaluated by the therapists is automatically classified. This dimension, prosody, describes how a speaker articulates words, conveying rhythm, stress, tone and the like. Anomalies occur, among others, in the form of stuttering and soft pronunciation of hard consonants. Firstly, an embedding that is capable of containing the information necessary to discriminate between classes of prosody on aphasic speech is investigated. It is shown that log-Mel filterbanks are the top performing embedding on an utterance level. Furthermore, the baseline models used to perform said evaluation are fine-tuned in order to yield the best possible classification score. The final model, a single-layer regularized Convolutional Neural Network (CNN), is able to diagnose impaired prosody with an F1-score of 86% on a patient level. It exhibits an average recall of 85%, which is an improvement of 11.5% with regards to comparable works [2], with the caveat that the task at hand describes a different set of impairments: aphasia as opposed to impairments caused by cancer, ALS and cerebral palsy.

Furthermore, Automatic Speech Recognition (ASR) is relevant for transcribing impaired speech, as various examinations performed within the AAT rely on written text, such as the interviews mentioned above. Therefore, it is evaluated whether ASR algorithms are capable of transcribing aphasic speech when trained on non-aphasic corpora, due to the lack of data in the domain. Transformers are proposed to solve said problem, using the implementation provided by Pham et al. [3].


Contents

Preface

Abstract

List Of Tables

List Of Figures

Abbreviations

1 Introduction
  1.1 Motivation
  1.2 Aphasic Language
  1.3 Aachener Aphasie Test
    1.3.1 Standardized Approaches To Aphasia Diagnosis
    1.3.2 Shortcomings Of The AAT
  1.4 Problem Statement
    1.4.1 Classification Of Prosody
    1.4.2 Automatic Speech Recognition
  1.5 Research Questions
  1.6 Outline

2 Background And Literature Review
  2.1 Audio Embeddings
    2.1.1 Conventional Methods Of Signal Processing
    2.1.2 Mel-Scale Methods
    2.1.3 Voice-Specific Methods
    2.1.4 Embeddings Of Pathological Speech
  2.2 Supervised Learning In The Audio Domain
    2.2.1 Audio Classification
    2.2.2 Automatic Speech Recognition

3 Proposed Approach
  3.1 Classification Of Prosody
    3.1.1 Evaluation Problem
    3.1.2 Feature Extraction
    3.1.3 Deep Learning Models
    3.1.4 Model Tuning
    3.1.5 Multi-Class Problem
  3.2 Automatic Speech Recognition
    3.2.1 Audio Embedding
    3.2.2 Text Processing
    3.2.3 Transformer Model
    3.2.4 Evaluation Procedure

4 Experiments: Classification Of Prosody For Aphasic Speech
  4.1 Data
  4.2 Experimental Setup
  4.3 Evaluation Metrics
  4.4 Embedding Evaluation
    4.4.1 Performance Of Embeddings
    4.4.2 Performance Of Baseline Models
  4.5 Model Fine-Tuning Results
    4.5.1 Learning Adjustments
    4.5.2 Impact Of Voting Schemes
    4.5.3 Impact Of Regularization
    4.5.4 Impact Of Initialization
    4.5.5 Data-Driven Tuning
    4.5.6 Impact Of Model Size
  4.6 Multi-Class Problem

5 Experiments: Automatic Speech Recognition Of Aphasic Speech
  5.1 Speech Corpora
    5.1.1 Mozilla Common Voice
    5.1.2 LibriVox
    5.1.3 Aachener Aphasie Test
  5.2 Evaluation Scheme
    5.2.1 Word Error Rate
    5.2.2 Character Error Rate
  5.3 Model Training
  5.4 Translation
  5.5 Model Evaluation
  5.6 Evaluation On Aphasic Speech

6 Conclusion
  6.1 Research Questions
  6.2 Further Work
    6.2.1 Classification Of Prosody For Aphasic Speech
    6.2.2 Automatic Speech Recognition Of Aphasic Speech

Glossary

Bibliography

A Feature Evaluation Results

B LibriVox Partition
  B.1 Validation Set
  B.2 Holdout Set

List of Tables

3.1 Summary of feature extraction methods
3.2 Multi-Layer Perceptron architectures
3.3 Long Short-Term Memory architectures
3.4 Convolutional Neural Network architectures
3.5 Initialization techniques

4.1 Patient count per aphasia type
4.2 Recording metadata for patients
4.3 Metrics of Functional ComParE features
4.4 Metrics of log-Mel filterbanks with CNNs
4.5 Metrics of log-Mel filterbanks with shifted cutoff
4.6 Metrics of baseline models
4.7 Comparison of voting scheme performance
4.8 Metrics of regularization techniques
4.9 Metrics of initialization techniques
4.10 Metrics of excluded boundary cases
4.11 Metrics of multi-layer architectures
4.12 Metrics of three class problem

5.1 Count of samples per aphasia type and severity
5.2 Corpus level metrics of ASR models per evaluation corpus
5.3 Utterance level ASR metrics per type and severity of aphasia

A.1 Metrics of feature evaluation (Part 1)
A.2 Metrics of feature evaluation (Part 2)
A.3 Metrics of feature evaluation (Part 3)

List of Figures

2.1 Simple sinusoidal and waveplot
2.2 Frequency representations of signals
2.3 Mel scale
2.4 Sliding window over log-Mel filterbanks
2.5 Visualization of positional encodings

3.1 Count of patients per prosody rating
3.2 Voting schemes visualized
3.3 Architecture with residual connections
3.4 Example of the text processing
3.5 Diagram of a Transformer

4.1 Train/Test split
4.2 Results of the feature evaluation
4.3 Loss function over Training- (blue) and Validation set (orange) for the best performing algorithm of Section 4.4
4.4 Validation loss before and after changing the batch size and learning rate. Dark blue: Batch size 20, learning rate 0.001, Cyan: Batch size 256, learning rate 0.0005
4.5 Predictions for soft- and hard voting per prosody rating
4.6 Difference between validation- and training loss for the CNN model with and without regularization
4.7 Predictions in- and excluding boundary cases per prosody rating
4.8 Predictions in- and excluding boundary cases per aphasia type
4.9 Predictions in- and excluding boundary cases per aphasia severity
4.10 Predictions in- and excluding boundary cases per segment length on a logarithmic scale
4.11 Predictions per prosody rating
4.12 Predictions per aphasia type and severity

5.1 Examples of ASR output of the Common Voice model

Abbreviations

AAT Aachener Aphasie Test

AM Acoustic Model

ASR Automatic Speech Recognition

CER Character Error Rate

CNN Convolutional Neural Network

dbs e.V. German federal association for academic speech therapy

DFT Discrete Fourier Transform

DNN Deep Neural Network

GMM Gaussian Mixture Model

GRU Gated Recurrent Unit

HMM Hidden Markov Model

HNR Harmonic-to-Noise ratio

LDA Linear Discriminant Analysis

LLD Low-Level Descriptor

LM Language Model

LSTM Long Short-Term Memory

MFCC Mel Frequency Cepstral Coefficient

MLP Multi-Layer Perceptron

RNN Recurrent Neural Network

RNN-T Recurrent Neural Network Transducer

STFT Short Time Fourier Transform

SVM Support Vector Machine

TDNN Time Delay Neural Network

WER Word Error Rate

Chapter 1

Introduction

In order to assist the diagnosis of aphasia, a neurological deficit illustrated in the following sections, this thesis proposes two deep learning based methods that are capable of classifying a certain speech dimension of aphasic speech, as well as transcribing it, both applied to audio recordings. The need for such methods is introduced in Section 1.1, which is followed by a stricter description of the impairment and current methods of diagnosis and their shortcomings in Sections 1.2 and 1.3. Section 1.4 addresses the open problems this work approaches and is followed by the proposed research questions in Section 1.5. Finally, the approach to solving said questions is outlined in Section 1.6.

1.1 Motivation

Aphasia is an acquired impairment of brain regions responsible for human language processing. Patients suffer from a reduced capability to comprehend, formulate, read and write language. Most commonly, aphasia is acquired after a stroke. According to the German federal association for academic speech therapy (dbs e.V.) [4], it is estimated that 100,000 stroke patients suffer from aphasia in Germany alone, with 25,000 new aphasia cases each year out of 270,000 stroke incidences. But strokes are not the only circumstance under which people acquire said disease. Others include head trauma, brain tumors and intoxication. Among the patients, there is a vast spectrum of how the impairment manifests itself, complicating the diagnosis needed for a proper treatment. Therefore, it is of utmost importance that treatments are tailored to patients, demanding a certain and fast diagnosis. However, existing methodology, such as described in Section 1.3, claims significant amounts of resources, such as speech therapists and physicians, transport and monetary compensation. It is the task of this thesis to accelerate existing diagnostic methodology employing machine learning, hence potentially freeing up resources and ensuring proper treatment for aphasia patients.


1.2 Aphasic Language

Aphasia is always unique per patient. Patients mostly do not share a pathology, as the symptoms come in different varieties and severities. Hence, even patients of the same subtype of aphasia might exhibit vastly different impairments. Examples of such impairments are stuttering, trouble finding words, impaired listening comprehension and the inability to repeat after other people, among many others. People with deficits in designating objects might have trouble finding words in general, or selectively, such as only for colors, modes of transportation or people. In the German language area, which is the focus of this research, aphasia is classified into four main types, called Broca, Wernicke, amnestic and global aphasia, according to the so-called 'Aachener Aphasie Test' [1] (English: Aachen aphasia test), or in short: AAT. The following descriptions adhere to the AAT.

Broca Aphasia The speech production of Broca patients is arduous. They are not capable of finding adequate words and do not employ proper grammar. Their speech is slowed down and can, to outsiders, seem artificial, as they fail to stress words correctly or to employ melodious speech. The impaired are aware of their inabilities and comprehend speech from other sources correctly.

Wernicke Aphasia Patients suffering from Wernicke aphasia often do not recognize the severity of their impairment and misinterpret other people's speech. Additionally, the sentences they form are on average longer and more interlaced than the ones uttered by non-impaired speakers, while they struggle to build up coherence regarding the content of what they say. Often, they state the same thing multiple times.

Amnestic Aphasia Amnestic aphasia can be mistaken for Broca aphasia. Patients tend to have trouble finding words, so they use empty phrases and paraphrasings, describing what they mean using the vocabulary they have access to instead. Their speech often contains long pauses. Besides the ones above, no impairments are present.

Global Aphasia The most severe form of aphasia is global aphasia. Both the production and comprehension of language are impaired. Speech does not convey any coherent content, and reading and writing are mostly not possible. People suffering from global aphasia tend to overemploy facial expressions and gestures, as well as automatisms such as 'yeahyeahyeah'.

Not Classifiable Besides the main types, patients can be designated as non-classifiable if they do not meet any criteria, or meet the criteria for multiple types of aphasia at once.


1.3 Aachener Aphasie Test

This Section encompasses a description of the AAT, which in the German language area is the gold standard for aphasia diagnosis.

1.3.1 Standardized Approaches To Aphasia Diagnosis

The AAT as a system for the diagnosis of aphasia owes its usage to the merits of its standardized procedure. Before its invention, aphasia was diagnosed in multiple non-standardized ways, drawing upon the individual experience of the therapists and differing in each therapy location. The AAT improves validity and reliability by creating a process that is highly regulated. Initially executed during the first six weeks after the acquisition of aphasia, a follow-up exam is done after six to twelve months, and then in a yearly interval [1]. It has four main goals:

1. Determining whether a patient suffers from aphasia, especially with regards to patients with brain damage

2. Identifying speech impairments such as alexia

3. Classifying the patient's impairments into the main subtypes

4. Identifying the severity of the symptoms and the syndrome

In order to accomplish these, the test uses several subtests.

AAT Tests

Token Test The token test evaluates the general ability of the patient to concentrate, as well as their listening comprehension and cognitive capacity. The therapist prepares so-called tokens, such as shapes of different colors. The patient is then assigned the task of pointing at a shape with a specific color, such as an orange triangle. While being able to differentiate between aphasic and non-aphasic patients, this test does not provide clues with regards to the specific syndrome the patient suffers from.

Repetition Of Speech The patient is asked to repeat after the therapist. The therapist utters raw sounds, German words, foreign words and full sentences.

Literary Language This test consists of multiple tasks. Patients read from a board, use letter cards to assemble words uttered by the therapist and write down what the therapist says.

Designating Items The patient is tasked to name objects, colors, compoundnouns, situations and activities.


Speech Comprehension For each word or sentence uttered by the therapist, the patient has to choose among four simple drawings, identifying the one related to the utterance. This way, the auditory and reading comprehension of the patient is tested.

Spontaneous Speech Test The therapist interviews the patient in a semi-structured interview. They ask broadly formulated questions regarding the patient's family life, current situation, disease and work. The function of this test is to assess six dimensions of spontaneous speech, as described in the following Section. This allows the therapists to classify the syndrome of aphasia, if any is present. This test is from here on referred to as the interview. The audio recordings provided as data in Sections 4.1 and 5.1.3 stem from this test, executed by multiple combinations of patients and therapists.

Dimensions Of Spontaneous Speech

This Section describes the six dimensions that are evaluated during the spontaneous speech test.

Communication Behaviour This dimension captures the patient's ability to participate in a conversation of verbal and non-verbal form. Therefore, it describes whether they are capable of comprehending questions, answering them and expressing themselves.

Prosody A main focus of this thesis is the spontaneous speech dimension of prosody. Prosody is the study of so-called suprasegmentals [5]. These are elements of speech beyond phonetic segments such as vowels; therefore, they operate on the level of syllables and beyond. There is no consensus on the exact variables attributed to prosody, but major variables are identified, such as duration, intensity, timbre and intonation. Together, but not exclusively, suprasegmentals convey the emotion of the speaker, as well as intentions, emphasis and whether the speech was sarcastic or ironic [5]. In the context of the AAT it is rated within the spontaneous speech test.

Automated Speech This dimension encompasses the use of automatisms and empty phrases.

Semantic Structure The dimension of semantic structure deals with the ability to utter and combine words with regards to sentence structure.

Phonematic Structure Regarding the phonology of the patient, the correct arrangement and utterance of sounds is evaluated.

Syntactical Structure It is evaluated how well the patient is able to form sentences that adhere to the structure dictated by grammar.


1.3.2 Shortcomings Of The AAT

In order to arrive at a definitive diagnosis, the AAT has to be undertaken multiple times. This results in the consumption of manifold resources. First of all, patients from all over the German language area, including remote rural areas, have to be transported to a testing center, such as the University Hospital of Aachen [1]. While testing, highly qualified speech therapists and physicians, as well as testing facilities, are occupied for the duration of the tests. This is due to the nature of the AAT, which is labour intensive due to presence constraints, the necessity of protocols and their post-processing, as well as organizational effort. Additionally, the results of the spontaneous speech interviews are prone to the experience and conduct of the therapist, as there are no objective biomarkers such as are available for diseases like cancer.

1.4 Problem Statement

The interviews introduced in Section 1.3.1 are manually evaluated with regards to the six dimensions of spontaneous speech. This enables potential automation, cutting time costs while upholding quality standards, and potentially removing biases introduced by the personality of therapists.

1.4.1 Classification Of Prosody

A first point of automation lies within the classification of prosody on the audio data of said interviews. The therapists rate prosody on a discrete ordinal scale ranging from zero, for total impairment of prosody, to five, for no impairment. By automating this process, the time it costs to analyze the interviews is reduced, and therapists and physicians are potentially freed up for other tasks within the process of diagnosis. A system solving said problem should be able to take raw audio recordings and output a score regarding the quality of the speaker's prosody.

1.4.2 Automatic Speech Recognition

Various examinations within the AAT require written transcripts. Automatic Speech Recognition, which is capable of returning said representation from audio data, is usually performed on non-impaired speakers. Current methods of transcribing aphasic speech [6] using ASR tools rely on conventional machine learning models, such as Hidden Markov Models. Hence, the performance of state of the art methods such as Transformers [3] on the impaired speech of aphasia patients is not known. Accomplishing efficient and reliable ASR potentially opens up several possibilities with regards to aphasic speech, such as automated diagnosis of spontaneous speech dimensions or assistance in patients' daily lives due to eased communication.


1.5 Research Questions

Addressing the problems formulated in Section 1.4, three research questions are created with the task of assisting the AAT. The first two regard the classification of the prosody dimension. The problem is reduced to a two class problem, between impaired and non-impaired speakers, designated by I and NI respectively.

Research question 1: Which features extracted from recordings can be used to correctly discriminate classes of prosody on a dataset of impaired speakers?

As further covered in Section 2.1, there are multiple possibilities to embed sounds, and specifically speech, into a computer readable format. The vast amount of research, however, focuses on non-impaired speech. Hence, a representation of audio data stemming from aphasia-impaired speakers is to be found with the goal of easing the classification of prosodic features.

Research question 2: How do deep learning approaches perform on the task of classifying prosody of impaired speakers?

Existing methods for the task of prosody classification, described in Chapter 2, employ conventional methods of machine learning, such as Support Vector Machines (SVMs) and rule-based systems. However, in the recent past, deep learning has made an impact on the performance of audio machine learning, as can be seen in the recent high-performing algorithms of contests such as Interspeech's ComParE [7]. Hence, an effort to investigate the performance of deep learning methods on the classification of aphasic speech data is undertaken.

Research question 3: Can pathological speakers' audio be transcribed using Transformers?

While Automatic Speech Recognition for aphasic speech is an ongoing research topic, current approaches (see [8] and [6]) use conventional methods of machine learning, such as Hidden Markov Models. Recently, Transformers [9] have been applied to audio data by Pham et al. [3], performing ASR. The performance of their approach is evaluated on non-impaired data. Therefore, in this thesis, said Transformer architecture is trained on a corpus of non-impaired speakers and its performance evaluated on aphasic speakers, in an effort to judge the usability of said approach on impaired speakers.


1.6 Outline

The following chapters examine the research questions formulated in Section 1.5. Therefore, Chapter 2 provides background on the methods used for the embedding of audio data, borrowing from the domain of signal processing. Furthermore, it supplies insight into the usage of deep learning in the audio domain and situates this project within the related work. Chapter 3 then summarizes the approach proposed in this work, followed by the actual experiments and their results for the problems of prosody classification and ASR in Chapters 4 and 5 respectively. Lastly, Chapter 6 summarizes the results of the approaches taken and attempts to answer the research questions.


Chapter 2

Background And Literature Review

This chapter illustrates how machine learning can borrow from the domain of signal processing in order to generate audio embeddings that possess informative value for machine learning tasks such as supervised learning. Furthermore, it sheds light on how these embeddings are then used for such tasks, specifically in the context of audio classification and ASR. This work is situated in the relevant literature within the respective Sections.

2.1 Audio Embeddings

2.1.1 Conventional Methods Of Signal Processing

Figure 2.1: Simple sinusoidal and waveplot. (a) Sine wave, (b) Envelope of a voice recording.

The domain of signal processing focuses on extracting and manipulating information out of a signal source, such as audio or images. Therefore, continuous analog signals - such as the exact audio that one hears when engaging in a person to person conversation - need to be represented in a digital manner, which is of discrete nature: digital signals. A very simple signal is the sine wave. As seen in Figure 2.1 (a), it can be described by its amplitude, period and phase offset.


There are multiple usages of the term amplitude, but in the case of sinusoidals it refers to the peak amplitude, or maximum difference from the mean of the signal. Its period describes the time interval that passes for a full cycle, and its phase offset how the sinusoidal is shifted with respect to the time axis. This is summed up in Equation 2.1. The frequency is the inverse of the period, f = 1/T, the amplitude is described by A and the phase offset by \varphi.

y(t) := A \sin(2\pi f t + \varphi) \quad (2.1)
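As a minimal illustration of Equation 2.1, the following NumPy snippet synthesizes one second of a discrete sine wave; the sampling rate and parameter values are arbitrary example choices, not taken from the thesis.

```python
import numpy as np

sr = 16000                      # sampling rate in Hz (arbitrary choice)
t = np.arange(sr) / sr          # one second of discrete time steps
A, f, phi = 0.5, 440.0, 0.0     # amplitude, frequency in Hz, phase offset

# Discrete version of y(t) = A * sin(2*pi*f*t + phi)
y = A * np.sin(2 * np.pi * f * t + phi)
```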

More complex signals, such as audio in the case of Figure 2.1 (b), can be represented using the envelope of their amplitude peaks, as is often visualized in common audio players and audio workstations. The information this representation bears is of little to no use for the analysis of speech, as humans form sentences using more than only variations in signal intensity. Therefore, signals are represented in the frequency domain too. This is done by employing the so-called Fourier transform, which decomposes a signal from the time domain into the frequency domain. This is seen in Figure 2.2 (a).

Figure 2.2: Frequency representations of signals. (a) Fourier transform, (b) Spectrogram.

This representation neglects an important factor: time. In order to incorporate it, the so-called Short Time Fourier Transform (STFT) is calculated by sliding a window function, which is nonzero only for a short time, over the time axis and calculating the Fourier transform of the results. The acquired embedding is a frequency representation over time, visualized in Figure 2.2 (b). Software that extracts audio features, such as Praat [10] and openSMILE [11], often relies on these spectral representations by, for example, fitting polynomials of the frequencies over time and saving their coefficients, or storing information about quantiles of pitch.

2.1.2 Mel-Scale Methods

In order to represent audio signals closer to what humans perceive, the scale of frequencies depicted in the aforementioned representations can be mapped to reflect the human auditory system. Humans do not perceive frequencies on a linear scale. Audio signals are thus scaled to the so-called Mel scale [12], which is an empirically designed scale that resembles how humans perceive frequency, as seen in Figure 2.3.


Figure 2.3: Mel scale

This opens up the realm of log-Mel filterbanks and MFCCs. Log-Mel filterbanks are calculated by mapping the spectrogram of a signal to the Mel scale and applying the logarithm, which replaces any necessary multiplication with an addition. Then, the filterbanks are reduced in count. This representation can further be decorrelated by taking the Discrete Fourier Transform (DFT) of said filterbanks. The amplitudes of the resulting spectra are the so-called Mel Frequency Cepstral Coefficients (MFCCs), which are a commonly used representation for audio analysis. The drawback of said method is their proneness to additive noise. As a result, their values are commonly normalized, reducing its impact.
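The chain from spectrogram to log-Mel filterbanks to MFCCs can be sketched with librosa, which this thesis also uses for feature extraction; the file path, FFT size and band counts below are illustrative. Note that librosa derives the cepstral coefficients from the log-Mel energies with a discrete cosine transform.

```python
import librosa

# Load a recording; the path is a placeholder.
y, sr = librosa.load("interview.wav", sr=None)

# Mel spectrogram: STFT with a 512-sample FFT, mapped onto 40 Mel bands.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, n_mels=40)

# Log compression turns multiplicative spectral effects into additive ones.
log_mel = librosa.power_to_db(mel)

# Cepstral coefficients, computed from the log-Mel representation.
mfcc = librosa.feature.mfcc(S=log_mel, sr=sr, n_mfcc=13)

print(log_mel.shape, mfcc.shape)  # (n_mels, frames), (n_mfcc, frames)
```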

2.1.3 Voice-Specific Methods

Further methods have proven to be successful representations in the domain of computational linguistics. These focus on spoken language.

Fundamental frequency Especially useful in word boundary detection, the fundamental frequency is denoted by f0 and describes the lowest frequency of a sound [13].

Formants The human voice is generated by releasing air through the vocal cords and filtering it using the vocal tract. Due to this design, each phoneme in the human repertoire has peaks at certain frequencies. Typically, the first two formants are used to classify a phoneme [13]. Formants by definition only exist in human voices.

Timbre Humans are able to distinguish sounds even though they exhibit the same pitches. The properties responsible for this are inherent in the spectral distribution of a given signal, most notably the attack, or time till peak amplitude of a given envelope [13]. Jitter and shimmer also provide insight into the timbre by describing fluctuations in periodicity and amplitude [11]. Timbre serves to convey emotions and other clues about the condition of the speaker.

Rhythm Lastly, repetitive events in an audio file can hint at an inherent rhythm [13]. This might not be as trivial as extracting signal peaks, as humans are capable of forming complex rhythms, for example due to changes in metre while reciting a poem.

2.1.4 Embeddings Of Pathological Speech

Combining the methods mentioned in Sections 2.1.1 to 2.1.3, Kim et al. [2] attempt to classify the intelligibility of pathological speakers, namely patients suffering from advanced head and neck cancer, cerebral palsy and amyotrophic lateral sclerosis. The patients of the study suffer from diseases that affect only the speech production, not its comprehension and processing, due to the infliction of nerve damage or muscle failure, and hence are not directly comparable to aphasic speakers. This is called dysarthric speech. In order to solve the binary problem of detecting whether a patient is impaired or not, Kim et al. design three subsystems.

The first subsystem relates to prosody. Firstly, on the phone and utterance level, markers such as pitch, duration and pitch stylization parameters are collected. Then, summary statistics, such as quantile ranges, are taken and polynomials fitted in order to gain coefficients. For the subsystem regarding voice quality, jitter, shimmer and the Harmonic-to-Noise ratio (HNR) [14] are extracted and again statistical measures taken over the temporal dimension. Lastly, the subsystem for pronunciation uses 39-dimensional MFCCs, combined with formants and durations of syllables and pauses, and their summary statistics are saved. The modeling of said approach is described in Section 2.2.1. Kohlschein [8] extends the methods of Kim et al. [2] for the sake of classifying prosody in aphasic speech. An additional subsystem for the rating of speech velocity uses a method proposed by de Jong [15]. It uses a four step procedure:

1. Extract signal intensity over time

2. Mark signal peaks above the median intensity of the signal as potential syllables

3. Discard potential syllables that do not occur after a decline in intensity

4. Extract pitch contour followed by discarding unvoiced syllables

The speech velocity is then extracted from the timing of the syllables and compared to a standard value defined by a speech therapist. Lastly, a subsystem for the measurement of speech fluency uses features extracted with regards to the length of the signal, signal energy [16] and the frequency domain, drawing on the works of Lopez-de-Ipina et al. [17] on the automatic diagnosis of Alzheimer's disease. The energy of a signal x(t) is defined as the area under the squared magnitude of the signal, as seen in Equations 2.2 and 2.3 for the discrete and continuous case respectively. How these features are then modeled follows in Section 2.2.1.

E_s = \sum_{n=-\infty}^{\infty} |x(n)|^2 \quad (2.2)

E_s = \int_{-\infty}^{\infty} |x(t)|^2 \, dt \quad (2.3)
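The following sketch illustrates the discrete energy of Equation 2.2 and a strongly simplified variant of the intensity-peak syllable counting outlined above; it is not the implementation of de Jong [15] or Kohlschein [8], and the frame length, hop size and threshold are assumed values.

```python
import numpy as np
import librosa
from scipy.signal import find_peaks

def signal_energy(x):
    """Discrete signal energy, Equation 2.2: sum of squared magnitudes."""
    return float(np.sum(np.abs(x) ** 2))

def rough_speech_rate(y, sr, frame_length=1024, hop_length=256):
    """Very rough syllables-per-second estimate from intensity peaks.

    Simplified sketch of the de Jong-style procedure: compute an intensity
    contour, keep only peaks above the median intensity, and treat them as
    syllable nuclei. Voicing checks and other refinements are omitted.
    """
    intensity = librosa.feature.rms(y=y, frame_length=frame_length,
                                    hop_length=hop_length)[0]
    peaks, _ = find_peaks(intensity, height=np.median(intensity))
    duration = len(y) / sr
    return len(peaks) / duration

y, sr = librosa.load("interview.wav", sr=None)  # placeholder path
print(signal_energy(y), rough_speech_rate(y, sr))
```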

MFCCs are used in the work of Mayle et al. [18] for diagnosing dysarthric speech in Mandarin syllable pronunciations. They solely extract 19 Mel Frequency Cepstral Coefficients and reach a speaker-level performance of 92.3% AUROC. The modeling Mayle et al. administer is further described in Section 2.2.1.

2.2 Supervised Learning In The Audio Domain

Independent of the set of features used, machine learning in the audio domain comes with its own inherent problems and characteristics. How models in this domain are trained is explained in this Section.

2.2.1 Audio Classification

The possible problems within audio classification occupy a wide range of tasks. Problems range from genre classification in music, to speaker identification in speech, to source identification in artificial sounds such as vehicle noise, to bird detection in ornithology and beyond. An inherent attribute of audio classification is its temporal nature. Almost always, audio data has a variable length, posing the question of how to train machine learning models with a variable length of input data.

Non-Sequential Approach

The first option for classifying audio data is employing fixed-size feature sets through summary statistics. These, as described in Section 2.1.4, are then fed into classical data mining algorithms and combined by ensemble methods.

In the work of Kim et al. [2], patients utter multiple sentences. Features are calculated on an utterance level and then classified employing Support Vector Machines (SVMs). Afterwards, the results are combined using posterior smoothing, an approach designed by the authors in order to sample from the posteriors of the incoming utterances. The result is a recall value of 73.5% on a patient level. The downside of this approach is the intense usage of summary statistics. Local anomalies, for example in MFCCs, are of high explanatory power when it comes to impaired speech, and their information is lost due to the usage of such a method. On the other hand, the memory consumption as well as the computations needed during model training are greatly reduced, as the computation of the features is done upfront. Kohlschein [8] extracts the features as in Kim et al. [2], but combines the features using solely SVMs. The additional subsystems mentioned in Section 2.1.4 are also combined using Support Vector Machines.


Sliding Window Approach

Figure 2.4: Sliding window over log-Mel filterbanks

In order not to rely on summary statistics, a sliding window approach is feasible. This way, features such as log-Mel filterbanks are extracted for windows across the temporal axis, as illustrated in Figure 2.4. The stride between windows and their size can be tuned. This has the advantage that no information is lost due to summarizing. The resulting features are then fed into a classification algorithm. If the window size is too small, information might not fit into the window, due to, for example, long words. If the window size is too large, samples have to be discarded as they are too small to fit into a window. Additionally, a tradeoff between represented information and feature set size is present due to the size of the strides. The results of all windows have to be combined in order to arrive at a decision for a single sample. This can be achieved by employing hard or soft voting, among others, as described in the methods of Section 3.1.4.
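A minimal NumPy sketch of such a windowing step over a framewise feature matrix; the window length and stride are illustrative values, and samples shorter than one window are dropped, as discussed above.

```python
import numpy as np

def sliding_windows(features, window=100, stride=50):
    """Cut a (frames, feature_dim) matrix into fixed-size overlapping windows.

    Returns an array of shape (num_windows, window, feature_dim). Frames at
    the end that do not fill a complete window are dropped.
    """
    frames = features.shape[0]
    if frames < window:
        return np.empty((0, window, features.shape[1]))
    starts = range(0, frames - window + 1, stride)
    return np.stack([features[s:s + window] for s in starts])

# Example: 7 seconds of 80-dimensional log-Mel frames at 100 frames per second.
log_mel = np.random.randn(700, 80)
windows = sliding_windows(log_mel)
print(windows.shape)  # (13, 100, 80) with the assumed window and stride
```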

Sequential Deep Learning

Recurrent Neural Networks (RNNs) are an approach to model temporal data in an Encoder-Decoder fashion. As in a Multi-Layer Perceptron (MLP), a representation of the inputs is constructed and the target variable predicted from said vector. A particularity of RNNs is that each hidden state's output is incorporated as an input to itself when training the next timestep. The problem with this approach is that the gradient decays exponentially through time, hence prohibiting the learning of long-term relations. Especially Long Short-Term Memorys (LSTMs) [19] and Gated Recurrent Units (GRUs) [20] have proven to be successful at combatting this so-called vanishing gradient problem by imposing constraints on which temporal information is kept during the learning process.

Mayle et al. [18] employ LSTMs in their research mentioned in Section 2.1.4. They compare Bi-Directional LSTMs, a form where the hidden units have access to data from the past and future in relation to the current timestep, with single- and multi-layer LSTMs. This architecture enables Mayle et al. to make use of the full set of MFCC vectors for a given signal.

A typical shortcoming of any RNN is that training is not parallelizable through time. The hidden units do not scale with time, making RNNs storage efficient but computationally cumbersome. In contrast, Convolutional Neural Networks (CNNs) are highly parallelizable. Their advantage lies within the usage of independent filter kernels that are convolved over the signal.

Alhlffee [21] uses CNNs to classify emotions from combinations of speech and text signals. Peak performance for their four class emotion detection problem is reached when using 40-dimensional Mel Frequency Cepstral Coefficients combined with Word2Vec text embeddings [22]. While this method produces a classification accuracy of 76.1%, a pure audio-driven feature set reaches 73.6% accuracy, proving the feasibility of classification tasks employing spectral and cepstral feature embeddings in combination with Convolutional Neural Networks.

Attention Based Methods

As stated in Section 1.4, Transformers are capable of performing Automatic Speech Recognition [3], proving their capability of learning audio features. This is further clarified in Section 2.2.2. Boes et al. [23] combine audio and video features into a single Transformer in order to classify and synchronize audio events. Therefore, the encoder is fed with audio features while the decoder receives video data along with the encoder's output. The work reaches a micro-averaged F1-score of 70.1% on a 17 class problem. Discarding the decoder, Zhuang et al. [24] perform music genre classification. They propose using the Transformer's encoder layer on top of a feed forward network. The results of said paper are not available to the public, hence no scores are provided.

Transfer Learning

Another approach to audio classification is provided by Lech et al. [25]. Unlike the aforementioned works, they employ images created from spectrograms in order to classify emotions on the EMO-DB [26] dataset. This enables the authors to use pre-trained neural networks, namely AlexNet [27] trained on 1.2 million images of the ImageNet [28] dataset, for the purpose of audio classification. They further compare the impact of different image representations, namely RGB color images, single channel images from the red (R), green (G) and blue (B) RGB channels, and grey-scale images. They reach a weighted F1-score of 79.6% on the EMO-DB [26] dataset using RGB images.

Transfer learning is not restricted to the image domain. Schneider et al. [29] propose Wav2Vec. Their model is trained to solve a next time step prediction task using two stacked Convolutional Neural Networks, with the goal of improving ASR performance in environments with little available data by creating a pre-trained embedding. Pre-trained models, created using 960 hours of Librispeech [30] data, are available to download for the public. Testing the approach on only eight hours of available labeled data reduces the Word Error Rate (WER) by up to 36%. Applications on classification tasks are yet to be tested.


2.2.2 Automatic Speech Recognition

Hidden Markov Models

Early approaches to Automatic Speech Recognition range back to 1952. Audrey, designed by Davis et al. [31], is able to recognize spoken digits by detecting formants. In the 1980s, the focus of research shifts from simple pattern recognition algorithms to statistical modeling [32]. Initial experiments, such as the voice-activated typewriters by Jelinek et al. [33], feature Hidden Markov Models (HMMs). HMMs work by providing the audio signal as input states and storing phonetic information in hidden states. Given a sequence of audio features Y_{1:T} = y_1, ..., y_T, the model tries to find the most likely sequence of words w_{1:L} = w_1, ..., w_L matching the input, as seen in Equation 2.4 [34].

\hat{w} = \arg\max_{w} \{ P(w \mid Y) \} \quad (2.4)

In order to reduce the complexity of directly modeling the words from audio input, Hidden Markov Models instead solve the equivalent problem of Equation 2.5.

\hat{w} = \arg\max_{w} \{ P(Y \mid w) \, P(w) \} \quad (2.5)

An important further step is the usage of Deep Neural Networks (DNNs) in order to estimate the emission probabilities of the Acoustic Model (AM), P(Y | w) in Equation 2.5. These so-called DNN-HMMs [35] still rely on complex prior knowledge of the intrinsic semantic information inside the text samples, requiring a Language Model (LM) for P(w), and require time-aligned audio in order to be trained.

End-to-End Models

As mentioned in Section 2.2.1, several algorithms are natively able to handle sequential input, including Recurrent Neural Networks and Convolutional Neural Networks. Graves et al. [36] propose using an encoder-decoder RNN architecture, called the Recurrent Neural Network Transducer (RNN-T), for the purpose of sequence transduction. This has the advantage of combining the Acoustic Model and Language Model in a single statistical process, without having to train them separately beforehand. Time Delay Neural Networks (TDNNs) [37] are employed in the widely used ASR toolkit Kaldi [38]. These are different from conventional CNNs in terms of connections to past time steps, as they employ so-called delay taps that connect past time steps to the current output via fixed weights.

Attention Models

Existing Encoder-Decoder models decline in performance with growing input sequence length. Especially in Recurrent Neural Networks, where the gradient vanishes through time, this poses a problem. Bahdanau et al. [39] introduce attention mechanisms, which enable the decoder to decide which part of the input representations is relevant to the current output position by creating a so-called context vector, as seen in Equation 2.6, where a and h denote the attention score and hidden state respectively.

c_i = \sum_{j=1}^{T_x} a_{ij} h_j \quad (2.6)

In order to further improve said approach in terms of training time and parallelizability, Vaswani et al. [9] propose the Transformer. It relies on so-called self-attention. Self-attention is calculated on pairs of positions of the inputs themselves instead of the outputs of convolutions or recurrence. This way, arbitrary-length input can be processed without a decline in performance. The process is repeated multiple times in parallel, employing so-called multi-head attention. As attention is not position-aware itself, the authors propose positional encodings of the form

PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right) \quad (2.7)

PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right). \quad (2.8)

The positional encoding is visualized in Figure 2.5.

Figure 2.5: Visualization of positional encodings
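A small NumPy sketch of Equations 2.7 and 2.8; the sequence length and model dimension below are arbitrary example values.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings as in Equations 2.7 and 2.8."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

pe = positional_encoding(seq_len=100, d_model=64)
print(pe.shape)  # (100, 64)
```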

Originally designed for work on text corpora, this architecture is modified by Pham et al. [3] in order to support audio data, performing an ASR task. Employing ensemble techniques, Niehues et al. are able to reach a WER of 9.9% on the Switchboard dataset [40], which is comparable to the state of the art performance of Povey et al. [41], who reach 9.6%.

Automatic Speech Recognition For Aphasic Speech

In the context of aphasic speech, Abad et al. [42] employ an MLP-HMM on Portuguese aphasia patients, solving a word naming task. The resulting model achieves a WER of 21% employing spectral features such as spectrograms. On a continuous speech task, Le et al. [6] reach a WER of 39.7% using GMM-HMMs combined with MFCC features processed using Linear Discriminant Analysis (LDA). Their experiment is evaluated on English speakers from the AphasiaBank dataset [43].


Chapter 3

Proposed Approach

This chapter regards the approaches taken in order to accomplish the research goals formulated in Section 1.5. Research questions one and two are approached by solving a binary classification task. The task itself is derived by refining the problem of Section 1.4.1. Then, eight feature sets are extracted as proposed in Section 3.1.2. These are then evaluated on a multitude of models introduced in Section 3.1.3. Finally, a candidate feature set and model are chosen, which are then fine-tuned as proposed in Section 3.1.4. Evaluations follow in the experiments Sections.

Additionally, a methodology for the training and evaluation of an ASR model with regards to aphasic speech is established in Section 3.2. Therefore, in Sections 3.2.1 and 3.2.2, it is outlined how the input data is processed. Finally, an evaluation strategy is designed in Section 3.2.4.

3.1 Classification Of Prosody

This Section highlights the methods used to answer research questions one and two. The general approach taken is that first all proposed feature sets are calculated and trained on a multitude of models. These are then evaluated, and the top-scoring combination of feature set and model is taken as a baseline for further tuning. Then, a final model is proposed after going through multiple tuning steps.

3.1.1 Evaluation Problem

Taking the problem outlined in Section 1.4.1, a deep learning task is created. The task is to create an end-to-end model that is capable of classifying whether a person's prosody is impaired or not. Hence, the classes NI, for not impaired, and I, for impaired, are created. A sample is thus assigned to the class NI if its prosody rating equals the maximum rating of five; otherwise it is assigned to class I. The classification is to be performed on raw audio data, as its goal is to input an interview recording and output a prosody diagnosis, without any text transcripts. Additionally, the final algorithm is to be trained and evaluated on the three-class problem proposed by Kohlschein [8], which consolidates every two prosody ratings into one class, resulting in the classes of prosody ratings zero and one, two and three, and finally four and five.
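The label assignment described above can be expressed as a small helper; a sketch with hypothetical function names, assuming integer prosody ratings from zero to five.

```python
def binary_label(rating: int) -> str:
    """Two class problem: only the maximum rating of five counts as not impaired."""
    return "NI" if rating == 5 else "I"

def three_class_label(rating: int) -> int:
    """Three class problem: ratings {0,1} -> 0, {2,3} -> 1, {4,5} -> 2."""
    return rating // 2

assert binary_label(5) == "NI" and binary_label(3) == "I"
assert [three_class_label(r) for r in range(6)] == [0, 0, 1, 1, 2, 2]
```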

Figure 3.1: Count of patients per prosody rating

As demonstrated in Figure 3.1, the prosody ratings below three are scarcely present, resulting in a distribution of ratings skewed towards non-impaired patients, which is the main factor in choosing a two and three class problem over the classification of all five classes. Additionally, comparable research, such as that performed by Kim et al. [2] and Mayle et al. [18], performs classification using binary tasks, making the results comparable.

3.1.2 Feature Extraction

For this work, a multitude of feature sets is calculated, ranging from summary statistics to pre-trained embeddings.

Spectral and Cepstral Features

The spectral and cepstral features calculated in this Section are all of temporal nature, hence necessitating sequence models.

Melspectrogram Using librosa [44], Melspectrograms are extracted. The power spectrum is computed using Fourier transform windows of size 512, using Hann windows [45].


Log-Mel Filterbanks For this feature set, 40 Mel filterbanks are calculated using a 25 millisecond Povey window [38]. Additionally, deltas are appended. Again, the size of the Fourier transform is 512 and librosa [44] is used.

Mel Frequency Cepstral Coefficients In addition to log-Mel filterbanks, their MFCCs are extracted. Again, deltas are appended.

Interspeech Features

The Interspeech ComParE [7] is an annual competition that concerns research in the domains of signal processing, natural language processing, machine learning, audio, speech and more. Eyben et al. [11] provide a software called openSMILE, which is capable of extracting the features as provided to the participants of said challenge. Additionally, it is able to supply the eGeMAPS [46] feature extraction, which is explicitly recommended for usage within paralinguistics and clinical speech analysis.

ComParE Two feature sets are provided to the ComParE [7] participants, namely Low-Level Descriptors (LLDs) and Functionals. Regarding the LLDs, four energy related sets are supplied, as well as 55 spectral and six voice related features. Examples of the three categories are the sum of the auditory spectrum, or loudness, the MFCCs, as well as the formants. The functional features describe summary statistics and regression coefficients over said descriptors, hence non-sequential embeddings. 65 Low-Level Descriptors and 6373 Functionals are extracted. A summary of all features can be found in [47]. The features stem from the tasks of deception, sincerity and native language classification.

eGeMAPS Using eGeMAPS [46], 18 Low-Level Descriptors are available. They are categorized as frequency related parameters, such as pitch and formants, energy related parameters, such as the Harmonic-to-Noise ratio, and spectral parameters, such as the Alpha ratio, defined as the ratio of the summed energy from 50-1000 Hz and 1-5 kHz. Additional summary statistics are supplied in the form of, among others, the number of loudness peaks per second and the mean length of unvoiced regions. The final set of functionals contains 88 parameters. The features are designed in order to solve several emotion recognition tasks, as defined in [46].

Pre-trained Features

As mentioned in Section 2.2.1, Wav2Vec [29] is a pre-trained embedding for audio data. As the AAT data supplied for this work is small in size, as further explored in Section 4.1, this approach is used in an effort to enrich the AAT data with the phonetic information of the pre-trained models. Hence, the data is resampled to 16,000 Hz. Then, 30 ms windows with a stride of 10 ms are embedded into a feature space with 512 dimensions. The extraction is done using fairseq [48], a software for sequence modeling supplied by Ott et al.
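The resampling step to 16,000 Hz can be done with librosa as sketched below; the file path is a placeholder, and loading the pre-trained fairseq checkpoint itself is omitted, as the exact loading code depends on the fairseq version.

```python
import librosa

# Placeholder path; load at the original sampling rate, then resample.
y, sr = librosa.load("interview.wav", sr=None)
y_16k = librosa.resample(y, orig_sr=sr, target_sr=16000)
```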


Summary Of Feature Extraction Methods

The resulting feature extraction methods are summed up in Table 3.1. The variables n and t describe the sample and timestep respectively.

Name                  Sequential  Extraction  Type         Size
Melspectrogram        Yes         librosa     Spectral     [n, t, 40]
Log-Mel filterbanks   Yes         librosa     Spectral     [n, t, 80]
MFCCs                 Yes         openSmile   Cepstral     [n, t, 26]
ComParE LLD           Yes         openSmile   Mixed        [n, t, 65]
ComParE Functionals   No          openSmile   Statistical  [n, 6373]
eGeMAPS LLD           Yes         openSmile   Mixed        [n, t, 13]
eGeMAPS Functionals   No          openSmile   Statistical  [n, 88]
Wav2Vec               Yes         fairseq     Pre-Trained  [n, t, 512]

Table 3.1: Summary of feature extraction methods

3.1.3 Deep Learning Models

After extracting the features from the dataset, they have to be evaluated. Therefore, 13 baseline models are created, ranging from Multi-Layer Perceptrons (MLPs) to Transformers, depending on the extracted features. As these baseline models serve the comparison of feature extraction, no further regularization is applied. The models are evaluated on an utterance level instead of a patient level, minimizing the influence of voting mechanics with regards to feature evaluation.

Multi-Layer Perceptrons

The functional features [7] [46] created on the Interspeech data are not of a sequential nature, hence they do not require sequential models. Therefore, MLPs are created to handle said features. The proposed models consist of one or four layers, with 15 or 40 neurons. Inner layers are activated by sigmoid functions, output layers entail softmax functions. The architectures are summarized in Table 3.2.

Architecture #   1        2        3        4
Layer Type       MLP      MLP      MLP      MLP
Layer Count      4x       4x       1x       1x
Neuron Count     15       40       15       40
Inner Layers     Sigmoid  Sigmoid  -        -
Output           Softmax  Softmax  Softmax  Softmax

Table 3.2: Multi-Layer Perceptron architectures


Long Short-Term Memorys

In accordance with the work of Mayle et al. [18], LSTMs are among the evaluated algorithms for sequential feature embeddings. The architectures tested, described in Table 3.3, contain four or one layer with 15 or 40 LSTM units. All units are activated using the tanh function in the sample space, and the sigmoid function across the recurrent (temporal) connections. The output is again a softmax layer.

Architecture #   1        2        3        4
Layer Type       LSTM     LSTM     LSTM     LSTM
Layer Count      4x       4x       1x       1x
Unit Count       15       40       15       40
Output           Softmax  Softmax  Softmax  Softmax

Table 3.3: Long Short-Term Memory architectures

Convolutional Neural Networks

As noted in Section 2.2.1, CNNs provide a highly parallelizable alternative to LSTMs. Table 3.4 illustrates how this work employs four baseline models designed using CNNs. The architecture leans on the LSTM implementations mentioned before. Convolutional Neural Networks do not rely on backpropagation through time, but their outputs remain sequential, hence it is not possible to directly feed them into feed forward architectures. For this purpose, Global Max Pooling is employed in order to vectorize the output and feed it into a softmax layer. The convolutional layers themselves are activated by ReLU functions.

Architecture#

1 2 3 4

Layer Type Convolutional Convolutional Convolutional ConvolutionalLayerCount

4x 4x 1x 1x

FilterCount

15 40 15 40

Reshaping Global MaxPooling

Global MaxPooling

Global MaxPooling

Global MaxPooling

Output Softmax Softmax Softmax Softmax

Table 3.4: Convolutional Neural Network architectures
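As an illustration, a minimal Keras sketch of architecture 4 (a single convolutional layer with 40 filters, Global Max Pooling and a softmax output) could look as follows. The kernel size and the number of input features are assumptions for demonstration purposes, since the baselines are only defined in terms of layer and filter counts.

import tensorflow as tf

def build_cnn_baseline(num_features=80, num_filters=40, kernel_size=3, num_classes=2):
    """Single-layer convolutional baseline: Conv1D -> Global Max Pooling -> Softmax."""
    inputs = tf.keras.Input(shape=(None, num_features))          # variable-length sequences
    x = tf.keras.layers.Conv1D(num_filters, kernel_size, activation="relu")(inputs)
    x = tf.keras.layers.GlobalMaxPooling1D()(x)                  # vectorize the temporal output
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_cnn_baseline()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()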

Transformers

From the field of attention-based algorithms, the model of Pham et al. [3] is altered to no longer perform an ASR task, but a classification task instead. Therefore, the outputs of its encoder are vectorized by Global Max Pooling. Four layers with two heads each are used, with 64 feed-forward units. Finally, the output is fed into a softmax layer.

3.1.4 Model Tuning

After evaluating the feature extraction algorithms, the method yielding the highest results, a combination of feature extraction and model, is used for further fine-tuning.

Voting Algorithm

The results taken into consideration up to this point are processed on an utterance level. However, the diagnosis in the AAT relies on interviews of variable length, employing a multitude of utterances per patient. Two voting schemes are compared in order to predict a final diagnosis.

Soft Voting  The outputs of each model are computed by a softmax layer, hence they are restricted to lie within (0, 1), where each output predicts the softmax probability of its class being the true class using Equation 3.1. Using soft voting, the class with the highest sum of softmax probabilities over all utterances contained within a single patient's interview is assigned. At this point, a weighting of importance for different classifiers would be possible in an ensemble setting, which does not apply to the usage of a single Deep Neural Network. Additionally, samples can be weighted, which is also not of use for the task at hand, as there is no heuristic for the assessment of importance among samples: a short utterance can yield the same predictive information as a long speech with slight anomalies. The formula for unweighted soft voting over N samples is shown in Equation 3.2.

\sigma(\vec{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad (3.1)

y = \arg\max_i \sum_{j=1}^{N} \sigma_j(\vec{z})_i \quad (3.2)

Hard Voting  Hard voting, also labeled majority voting, relies on the majority decision of the independent predictions. Hence, the mode of the predicted classes over a patient's utterances is assigned to that patient.

The final proposed voting schemes are visualized in Figure 3.2. Depending on the certainty of the classifier's predictions, both methods can yield vastly different results.
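A minimal sketch of both schemes, operating on the per-utterance softmax outputs of a single patient (an array of shape [num_utterances, num_classes]), could look like this; the example probabilities are made up for illustration only.

import numpy as np

def soft_vote(utterance_probs):
    """Assign the class with the highest sum of softmax probabilities (Equation 3.2)."""
    return int(np.argmax(utterance_probs.sum(axis=0)))

def hard_vote(utterance_probs):
    """Assign the mode of the per-utterance predictions (majority voting)."""
    predictions = np.argmax(utterance_probs, axis=1)
    counts = np.bincount(predictions, minlength=utterance_probs.shape[1])
    return int(np.argmax(counts))

# Example: three utterances of one patient, two classes (I, NI).
probs = np.array([[0.90, 0.10], [0.40, 0.60], [0.45, 0.55]])
print(soft_vote(probs))   # 0 -> class I  (summed probabilities 1.75 vs. 1.25)
print(hard_vote(probs))   # 1 -> class NI (two of three utterances vote NI)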


Figure 3.2: Voting schemes visualized

Regularization

The models trained and evaluated up to this point are prone to overfitting, as no regularization is applied. In accordance with these results, regularization methods are compared, with the goal of ensuring that global minima are found in the optimization space. The proposed methods are dropout, weight regularization and batch normalization. Additionally, combinations of said regularization techniques are taken into consideration. The parameters are obtained by performing a grid search over the first 10 epochs of training the baseline model combined with the desired regularization. Then, using the determined parameters, a full training as defined in Section 4.2 is executed for the evaluation of said method.

Weight Regularization  For weight regularization, the squared sum of weights is added to the loss function, multiplied by the regularization parameter λ. This parameter is determined to be λ = 0.05.

Dropout  For dropout, 40% of the units in a DNN are randomly excluded from training during each learning step.

Batch Normalization  Batch normalization is applied after the activation function.
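A hedged Keras sketch of how these three techniques could be attached to the single-layer convolutional baseline is shown below. Only λ = 0.05 and the dropout rate of 40% are taken from the text; the kernel size and the combination of all techniques in one function are assumptions for illustration.

import tensorflow as tf

def build_regularized_cnn(num_features=80, num_filters=40, kernel_size=3, num_classes=2,
                          l2_lambda=0.05, dropout_rate=0.4, use_batch_norm=False):
    inputs = tf.keras.Input(shape=(None, num_features))
    x = tf.keras.layers.Conv1D(
        num_filters, kernel_size, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(l2_lambda))(inputs)  # adds lambda * sum(w^2) to the loss
    if use_batch_norm:
        x = tf.keras.layers.BatchNormalization()(x)                      # applied after the activation
    x = tf.keras.layers.Dropout(dropout_rate)(x)                         # 40% of units dropped per step
    x = tf.keras.layers.GlobalMaxPooling1D()(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)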

Initialization

Especially in a setting with small amounts of training data, which applies here as further described in Section 4.1, the training of DNNs is sensitive to initial conditions. These are impacted by several factors, such as optimizer choice, initial learning rate, data distribution and weight initialization. In order to find the optimal configuration for the task at hand, different initialization techniques are compared. The candidates are Glorot [49], He [50] and LeCun [51] initialization. For all three, weights are drawn from either a normal or a uniform distribution. The parameters of said distribution are calculated as functions of the number of input units in the weight tensor. For Glorot initialization [49], the output dimension is also taken into account. The techniques are summed up in Table 3.5. The symbol u denotes the unit size of the in- and output respectively.

              Normal                                       Uniform
Glorot [49]   \mathcal{N}(0, \sqrt{2/(u_{in}+u_{out})})    \mathcal{U}(-l, l) \wedge l = \sqrt{6/(u_{in}+u_{out})}
He [50]       \mathcal{N}(0, \sqrt{2/u_{in}})              \mathcal{U}(-l, l) \wedge l = \sqrt{6/u_{in}}
LeCun [51]    \mathcal{N}(0, \sqrt{1/u_{in}})              \mathcal{U}(-l, l) \wedge l = \sqrt{3/u_{in}}

Table 3.5: Initialization techniques
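In Keras, the corresponding initializers could be configured as in the following sketch, shown here for the convolutional layer of the baseline model; the layer shape parameters are assumptions, and the constant bias of 0.01 anticipates the choice described later in Section 4.5.4.

import tensorflow as tf

# The three candidate schemes, each in a normal and a uniform variant.
initializers = {
    "glorot_normal": tf.keras.initializers.GlorotNormal(),
    "glorot_uniform": tf.keras.initializers.GlorotUniform(),
    "he_normal": tf.keras.initializers.HeNormal(),
    "he_uniform": tf.keras.initializers.HeUniform(),
    "lecun_normal": tf.keras.initializers.LecunNormal(),
    "lecun_uniform": tf.keras.initializers.LecunUniform(),
}

def conv_layer_with_init(init_name, num_filters=40, kernel_size=3):
    # Bias is set to a small positive constant so that ReLU units are active from the first epoch.
    return tf.keras.layers.Conv1D(
        num_filters, kernel_size, activation="relu",
        kernel_initializer=initializers[init_name],
        bias_initializer=tf.keras.initializers.Constant(0.01))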

Model Size

After optimizing the baseline model as far as possible, it is researched whether architectures derived from said model with increased layer and unit counts can produce more reliable predictions. Building on the results of Section 4.5.5, four models are proposed.

Increased filter count  This model equals the one from Section 4.5.5, except that its filter count is increased to 256, while the kernels are enlarged to size 15.

Increased layer count  Further experiments are done using the five-layer architecture listed below; a sketch of this variant follows the list:

1. Layer, Convolutional, 40 filters, kernel size 20

2. Layer, Convolutional, 60 filters, kernel size 15

3. Layer, Convolutional, 80 filters, kernel size 7

4. Layer, Global Max Pooling

5. Layer, Dense, Softmax output
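A minimal Keras sketch of this five-layer variant could look as follows, assuming 80 log-Mel input features and the binary softmax output used throughout; activations follow the earlier baselines.

import tensorflow as tf

def build_deep_cnn(num_features=80, num_classes=2):
    return tf.keras.Sequential([
        tf.keras.Input(shape=(None, num_features)),
        tf.keras.layers.Conv1D(40, 20, activation="relu"),         # 1st layer: 40 filters, kernel size 20
        tf.keras.layers.Conv1D(60, 15, activation="relu"),         # 2nd layer: 60 filters, kernel size 15
        tf.keras.layers.Conv1D(80, 7, activation="relu"),          # 3rd layer: 80 filters, kernel size 7
        tf.keras.layers.GlobalMaxPooling1D(),                      # 4th layer: vectorization
        tf.keras.layers.Dense(num_classes, activation="softmax"),  # 5th layer: softmax output
    ])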

Residual Network  First proposed by He et al. [52], residual connections enable parts of neural networks to learn the residuals of in- and outputs instead of the desired output mapping directly. Considering a subset of layers F(x) - the convolutional layers in this case - and a target mapping H(x), the layers are forced to approximate F(x) := H(x) − x instead of F(x) := H(x). Rearranging said residual formula, it becomes clear that the original mapping is recovered as F(x) + x, hence adding the input of the block to its output, forming a so-called residual connection. He et al. [52] propose that this architecture eases the learning process for the neural network. Residual connections are thus employed as a form of further regularization, as depicted in Figure 3.3.

Figure 3.3: Architecture with residual connections
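A sketch of such a residual block around a convolutional layer is given below; the 'same' padding and the 1x1 projection of the shortcut to the filter dimension are assumptions needed to make the addition shape-compatible.

import tensorflow as tf

def residual_conv_block(x, num_filters=40, kernel_size=3):
    """Compute F(x) with a convolutional layer and return F(x) + x (residual connection)."""
    shortcut = x
    if shortcut.shape[-1] != num_filters:
        # 1x1 convolution so that the shortcut matches the filter dimension of F(x).
        shortcut = tf.keras.layers.Conv1D(num_filters, 1, padding="same")(shortcut)
    fx = tf.keras.layers.Conv1D(num_filters, kernel_size, padding="same", activation="relu")(x)
    return tf.keras.layers.Add()([fx, shortcut])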

ResNet50

In addition to implementing a custom architecture using residual connections, the so-called ResNet architecture by He et al. [52] is applied to the problem of this section. Due to the small size of the aphasia corpus as described in Section 4.1 and resource constraints, the smallest available model with pre-trained weights, ResNet50, is selected. It is pre-trained on the ImageNet corpus [53], on which it achieves a top-1 and top-5 accuracy of 74.9% and 92.1% respectively, employing 25,636,712 parameters.

As the aphasia data is sound data but the ResNet architecture is trained on image data, further preprocessing is required. Firstly, the features of the training data are stacked in order to obtain a grayscale image. Hence, the data is reshaped from the dimensions [n, t, 80] to [n, t, 80, 3], with each of the values of the last dimension being a copy. Furthermore, the color channels are zero-centered with respect to the ImageNet dataset, and no scaling is applied.

The output of the ResNet architecture is fed into three dense layers with twenty, ten and two neurons respectively. The first two layers are activated using ReLU functions while the output layer uses softmax. Due to the small amount of training data, ResNet is not fine-tuned and only the dense layers are trained.
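A hedged Keras sketch of this transfer-learning setup is shown below. It assumes a fixed input length t and average pooling after the ResNet backbone, neither of which is specified in the text; the Keras resnet50.preprocess_input call performs the ImageNet zero-centering without scaling.

import numpy as np
import tensorflow as tf

def build_resnet50_classifier(t=128, num_mels=80, num_classes=2):
    base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                          input_shape=(t, num_mels, 3), pooling="avg")
    base.trainable = False                                    # only the dense head is trained
    inputs = tf.keras.Input(shape=(t, num_mels, 3))
    x = base(inputs, training=False)
    x = tf.keras.layers.Dense(20, activation="relu")(x)
    x = tf.keras.layers.Dense(10, activation="relu")(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

# Stack the grayscale features into three identical channels and zero-center w.r.t. ImageNet.
features = np.random.rand(4, 128, 80).astype("float32")       # placeholder batch of log-Mel features
images = np.repeat(features[..., np.newaxis], 3, axis=-1)     # [n, t, 80, 3]
images = tf.keras.applications.resnet50.preprocess_input(images)

model = build_resnet50_classifier()
predictions = model(images)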

The motivation behind the usage of pre-trained image models is that recent research by Palanisamy et al. [54] has shown that deep learning models pre-trained on image problems can increase audio classification results by up to 20% in performance, measured in terms of accuracy.


3.1.5 Multi-Class Problem

As stated in Section 3.1.1, the algorithm is to be evaluated on a three-class problem, as proposed by Kohlschein [8], in addition to the binary problem of prosody-impairment diagnosis. In order to do so, the top-performing model of the steps above is modified with regards to its output layer. Hence, the softmax layer outputs three probability scores instead of two. The parametrization is chosen from the two-class problem. It is then evaluated on a speaker level.

3.2 Automatic Speech Recognition

With regards to the third research question, an Automatic Speech Recognition model is trained. This chapter illustrates the methods approaching that task. First, the features are calculated: a methodology is established for the extraction of both the audio features and their text transcripts in Sections 3.2.1 and 3.2.2. Afterwards, Section 3.2.3 outlines the model trained from said features, which is evaluated as described in Section 3.2.4.

3.2.1 Audio Embedding

The audio used in this work, further described in Section 5.1, consists of text-aligned sentences. The processing of these is done as in Pham et al. [3]. Hence, log-Mel filterbanks are extracted without any deltas and finally normalized per utterance. These are then down-sampled by stacking four consecutive feature vectors in order to reduce the sequence length. It is to be noted that, due to space restrictions on the available hardware, the number of filterbanks is reduced from 40 to 23. No augmentation as in Pham et al. [3] is applied.
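A minimal librosa sketch of this pipeline is given below. Only the number of filterbanks (23) and the stacking factor (4) are taken from the text; the framing parameters (25 ms windows with a 10 ms shift) and the sampling rate are common ASR assumptions.

import numpy as np
import librosa

def logmel_stacked(path, n_mels=23, stack=4, sr=16000):
    wav, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels,
                                         n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    logmel = np.log(mel + 1e-8).T                                           # [t, n_mels]
    logmel = (logmel - logmel.mean(axis=0)) / (logmel.std(axis=0) + 1e-8)   # per-utterance normalization
    t = (logmel.shape[0] // stack) * stack                                  # drop frames that do not fill a group
    stacked = logmel[:t].reshape(t // stack, stack * n_mels)                # stack 4 consecutive frames
    return stacked                                                          # [t/4, 92]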

3.2.2 Text Processing

The text that is aligned to said audio features is initially processed using regular expressions. Malformed punctuation, such as a question mark followed by a period, is replaced by the correct punctuation. Additionally, arbitrary runs of ellipsis dots are replaced by the standard three points. Then the texts are tokenized using Moses [55] and afterwards true-cased employing the same tool. Finally, using the implementation of Sennrich et al. [56], Byte-Pair Encoding is applied. Byte-Pair Encoding originates from compression algorithms: in its basic form, recurring pairs of bytes, or characters in this work, are replaced by bytes that are not contained in the data. A German example of this processing technique can be found in Figure 3.4.
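The regular-expression clean-up step could, for instance, be sketched as follows; the concrete patterns are assumptions that mirror the two cases described above, while tokenization, truecasing and Byte-Pair Encoding are then applied with the Moses [55] and Sennrich et al. [56] tooling.

import re

def clean_transcript(text):
    # A question mark followed by a stray period is collapsed to the question mark.
    text = re.sub(r"\?\s*\.", "?", text)
    # Arbitrary runs of (possibly spaced) dots are normalized to the standard three points.
    text = re.sub(r"\.(?:\s*\.)+", "...", text)
    # Collapse repeated whitespace introduced by the replacements.
    return re.sub(r"\s+", " ", text).strip()

print(clean_transcript("Ein Picknick . .. in Hampton Court ."))   # Ein Picknick ... in Hampton Court .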


Source:    Ein Picknick . .. in Hampton Court .
Processed: Ein Pi@@ ck@@ ni@@ ck ... in H@@ am@@ p@@ t@@ on@@ Co@@ ur@@ t .

Figure 3.4: Example of the text processing

Aphasic Data Extraction

The AAT data, as described in Section 5.1, contains text transcripts on an interview level. Within any given interview, two people are present, namely the patient and a therapist. While timestamps for the segmentation of the audio data are available from the research of Kohlschein et al. [8], there is no such information for the text data. The texts are marked with patient and therapist speech in an alternating manner. However, the segmentation of said text is non-trivial, as the speech of patients is often interrupted by the therapists with affirming sounds such as 'hmm', which in turn opens up a new audio segment without being denoted in the transcript. Additionally, when a patient is asked to answer and does not utter anything, which happens due to their aphasia, it is marked within the audio timestamps, but not within the text. Hence, Forced Alignment is taken into consideration for the task at hand. Multiple software tools are tested in order to accomplish an alignment. Aeneas [57] does not output any usable results; apparently due to the nature of aphasic speech, it is not capable of producing any alignment beyond random assignment. Mozilla's DeepSpeech [58] Align relies on Language Models, and there is currently no such LM available that supports the alignment task on German speakers. The same applies to the Montreal Forced Aligner [59].

Hence, a subset of the dataset is manually transcribed. For each aphasia type and severity, ten samples are extracted, with the exception of heavy-severity amnestic aphasia, for which no data is available.

3.2.3 Transformer Model

As stated in Section 1.5, Transformers are used to perform Automatic Speech Recognition in this work. Therefore, the implementation provided by Pham et al. [3] is used to work on the provided data. Further details on how this is achieved follow in Section 5.3.

As described in Section 2.2.2, Transformers rely on self-attention parallelized by so-called multi-head attention. The output of an attention layer is added to its input via a residual connection [52] and normalized. Afterwards, the results are passed into a feed-forward layer, again adding the residuals and normalizing. The resulting block can be stacked multiple times, making up an encoder. A decoder is constructed in the same manner, feeding the results of the encoder into an intermediate attention layer. To make the inputs of the encoder and decoder position-aware, positional encoding as shown in Figure 2.5 is applied. The whole model is visualized in Figure 3.5.


Figure 3.5: Diagram of a Transformer

3.2.4 Evaluation Procedure

The general approach to evaluate the performance of ASR algorithms on aphasic speech is to first train models on non-impaired speakers. Multiple speech corpora are compared with regards to their performance on disjunct test sets, and a final training corpus is chosen. Up to this point, no aphasic speech is evaluated yet. Due to restrictions of hard drive size and possible training duration on the available hardware, only a single corpus can be held in storage at any time, ruling out the possibility of training the algorithm on multiple corpora at once. Finally, the model is evaluated on the aphasic speech of the AAT data.


Chapter 4

Experiments: Classification Of Prosody For Aphasic Speech

In this chapter, first the data used for the classification task proposed in 3.1.1 is explored in Section 4.1. Afterwards, the learning setup used to train comparable models is explained in Section 4.2, which is followed by insight into the evaluation methodology in Section 4.3. Finally, Sections 4.4 and 4.5 describe the evaluation of the feature embedding algorithms and how the final models perform under the tuning methods imposed by Section 3.1.4.

4.1 Data

The dataset used in this thesis is supplied by HotSprings GmbH and University Hospital Aachen. It consists of 442 recordings from 343 patients from the Aachener Aphasie Test. After removing patients without a diagnosis, 240 patients remain, as seen in Table 4.1.

Amnestic   38
Broca      72
Global     83
Wernicke   47

Table 4.1: Patient count per aphasia type

Each audio file describes a full interview with a patient. As a result of previous work by Kohlschein et al. [8], timestamps for every interview are available for diarization and segmentation. Each of these segments describes an utterance by a speaker. For the purpose of prosody classification, the samples by therapists are discarded. Firstly, the therapists are not rated with regards to their prosody. One could assume a perfect prosody rating, but the therapists often speak more slowly or with more emphasis than natural, as they need to communicate with aphasia patients. Additionally, the therapist segments only include four minutes of speech in total, as the interview is focused on the patient speaking. Statistics on the lengths of the patient segments are shown in Table 4.2.

Total length (hours)      44
Median length (seconds)   1.11
Longest (seconds)         247.55

Table 4.2: Recording metadata for patients

Each segment, from now on referred to as a sample, consists of a date of recording, the name of the therapist, an anonymized identifier for each patient, their aphasia type and severity, as well as ratings for each of the six speech dimensions declared in Section 1.3. These are the result of manual rating by the speech therapists of the AAT.

A holdout set is created by reserving the interviews of 24 patients for testing purposes. Thereby, 60,487 training and 6,448 test/holdout samples are created. The holdout set employs stratification with regards to the prosody rating, in order to be able to evaluate the impact each rating has on the binary classification problem in retrospect. The distribution of prosody ratings is visualized in Figure 4.1. The red line indicates the split into the classes I and NI.

Figure 4.1: Train/Test split

4.2 Experimental Setup

In order to train the algorithms proposed in Section 3, a validation set is extracted from the training set in addition to the holdout set. Hence, 3,129 samples serve for the validation of the Deep Neural Networks during training. The cross-entropy loss function is minimized using Nadam [60] with a learning rate of η0 = 0.001. The training is stopped after ten consecutive epochs without an improvement of the validation loss. The maximum number of epochs is set to 100. The data is fed in batches of size 20. As the data is imbalanced - the class NI makes up 63% of the data - a stratified sample is taken as training data. Due to the skewing of the training data distribution - there are, for example, only two speakers with a prosody rating of one in the dataset, compared to 91 with rating five - the training data is again stratified with regards to the class distribution for the three-class problem proposed in Section 3.1.5. Experiments with weighting the loss function in accordance with the amount of data per class yield unstable gradients and do not enhance the performance, making downsampling the preferable choice for handling the class imbalance. For every feature type except Wav2Vec [29], the features are normalized per utterance.

4.3 Evaluation Metrics

The models are evaluated using macro F1 scores. This means that the score for each class is calculated using Equations 4.1 to 4.3 and then averaged without taking into account the label distribution.

F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall} \quad (4.1)

precision = \frac{tp}{tp + fp} \quad (4.2)

recall = \frac{tp}{tp + fn} \quad (4.3)

In general, false positives are preferred over false negatives, as the cost of missing an aphasia diagnosis is higher than the cost of screening a non-impaired speaker. However, the model is not optimized purely for the recall of the class I, as this would allow the model to always assign said class, rendering it useless. Therefore, with equal F1-scores, a model that has a higher recall for the class I is to be preferred. Lastly, simpler models with regards to parameter count are preferred, as the inference time is part of the user experience, which is crucial for the acceptance of the model within the medical domain.
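These macro-averaged scores can be computed, for example, with scikit-learn, as in the following minimal example; the labels are placeholders.

from sklearn.metrics import precision_score, recall_score, f1_score

# Placeholder predictions for the two classes I (impaired) and NI (not impaired).
y_true = ["I", "NI", "NI", "I", "NI"]
y_pred = ["I", "NI", "I",  "I", "NI"]

# Macro averaging: compute each metric per class, then take the unweighted mean
# over the classes, ignoring the label distribution (Equations 4.1 to 4.3).
print(precision_score(y_true, y_pred, labels=["I", "NI"], average="macro"))
print(recall_score(y_true, y_pred, labels=["I", "NI"], average="macro"))
print(f1_score(y_true, y_pred, labels=["I", "NI"], average="macro"))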

4.4 Embedding Evaluation

After training all feature sets on their respective baseline models, the results are compiled and evaluated with respect to the top-performing embeddings and baseline models in this section.


4.4.1 Performance Of Embeddings

The performance of all embeddings with their respective baseline models is shown in Figure 4.2. The red line indicates the average F1-score over all models for a specific embedding.

Figure 4.2: Results of the feature evaluation

Of the features proposed in Section 2.1, log-Mel filterbanks produce the highest F1-scores with a mean of 48% and a maximum of 58%. The lowest scores are reached for the functional ComParE feature set, always reaching an F1-score of 25%. A look at Table 4.3 reveals that the algorithm always predicts the class I.

Class     Precision  Recall  F1    Support
I         33%        100%    50%   1527
NI        0%         0%      0%    3084
Average   17%        50%     25%   4611

Table 4.3: Metrics of Functional ComParE features

Further investigation reveals that the outputs of the softmax layer in said experiment always yield values in the range [0.45, 0.55). The validation loss fluctuates in (0.66, 0.72). As NI accounts for 63% of the data, the initial cross-entropy without regularization is expected to be −0.63 · log(0.5) − 0.37 · log(0.5) ≈ 0.69 for random guesses. As the predictions do not generalize and the validation loss fluctuates around the value for random guessing, the algorithm has no learning success.

In contrast, the outputs of log-Mel filterbanks paired with CNNs are distributed within (0, 1), with high precision values for NI and high recall for I, as seen in Table 4.4.

Class     Precision  Recall  F1    Support
I         43%        85%     57%   1527
NI        86%        44%     58%   3084
Average   64%        65%     58%   4611

Table 4.4: Metrics of log-Mel filterbanks with CNNs

By shifting the cutoff threshold at which the model assigns the class NI, the results of Table 4.5 can be achieved, yielding an F1-score of 63% and improving the existing baseline evaluation by 5%.

Class     Precision  Recall  F1    Support
I         47%        66%     55%   1527
NI        79%        63%     70%   3084
Average   63%        65%     63%   4611

Table 4.5: Metrics of log-Mel filterbanks with shifted cutoff

This renders log-Mel filterbanks the feature embedding of choice for the task of aphasic speech classification.

4.4.2 Performance Of Baseline Models

Architecture   Precision  Recall  F1
1x CNN (40)    64%        65%     58%
CNN            57%        55%     41%
LSTM           44%        52%     37%
Transformer    43%        51%     35%
MLP            32%        51%     29%

Table 4.6: Metrics of baseline models

Table 4.6 shows the performance of the algorithms averaged over all feature sets per baseline model, including the top-performing model. Sequence models outperform Multi-Layer Perceptrons, which can be attributed to the feature sets the MLPs use and their non-temporal nature. The sequential models also exhibit differences in performance, with Convolutional Neural Networks performing best with an average F1-score of 41%, compared to 35% for the worst performer in said model class, the Transformer.

The full results of the feature evaluation are shown in Appendix A.

4.5 Model Fine-Tuning Results

4.5.1 Learning Adjustments

Figure 4.3: Loss function over training set (blue) and validation set (orange) for the best performing algorithm of Section 4.4

Due to the fluctuating validation loss in the experiments above, in combination with heavy overfitting on the training data, as shown in Figure 4.3, measures are taken to prohibit said behaviour for future models. A small batch size, in this case 20, suffers from variations in the distribution of the training data, as the means and variances of the batches fluctuate. To compensate for this effect, the batch size is increased to 100. Additionally, the initial learning rate is reduced to η0 = 0.0005, prohibiting the model from overshooting local and global minima. Figure 4.4 shows the effect these changes have on the validation loss, enabling a better estimation of when to stop the learning process.


Figure 4.4: Validation loss before and after changing the batch size and learning rate. Dark blue: batch size 20, learning rate 0.001; cyan: batch size 256, learning rate 0.0005

4.5.2 Impact Of Voting Schemes

With a minimal model at hand, the voting schemes of Section 3.1.4 are compared.

Voting Scheme   Precision  Recall  F1
Soft Voting     72%        70%     66%
Hard Voting     78%        71%     72%

Table 4.7: Comparison of voting scheme performance

Table 4.7 shows that hard voting outperforms soft voting by a margin of 6%. Figure 4.5 highlights within which prosody ratings errors occur using the voting algorithms. While soft voting produces false predictions over nearly the complete spectrum of prosody ratings, hard voting produces correct predictions except close to the class boundary.


(a) Soft Voting (b) Hard Voting

Figure 4.5: Predictions for soft- and hard voting per prosody rating

4.5.3 Impact Of Regularization

After applying the methods of Section 4.5.1 in an effort to reduce gradient fluctuations, the problem of overfitting remains. This reduces the capability of the network to learn. The regularization techniques introduced in Section 3.1.4 are thus applied. As, in initial experiments using these techniques, the training starts to fluctuate after 10 epochs, a learning rate decay according to Equation 4.4 is henceforth applied.

\eta_t = \eta_0 \cdot e^{-0.01 \cdot t} \quad (4.4)
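One way to realize this decay schedule in Keras is a LearningRateScheduler callback, sketched below; the initial learning rate of 0.0005 follows Section 4.5.1.

import math
import tensorflow as tf

ETA_0 = 0.0005  # initial learning rate from Section 4.5.1

# Equation 4.4: eta_t = eta_0 * exp(-0.01 * t), with t being the epoch index.
lr_schedule = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch, lr: ETA_0 * math.exp(-0.01 * epoch))

# Passed to model.fit(..., callbacks=[lr_schedule]) alongside the early-stopping callback.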

Indeed, weight regularization and dropout are capable of improving the learning results when applied in a mutually exclusive manner, as seen in Table 4.8. Batch normalization, as well as combinations of the regularization techniques, do not improve the learning results.

Technique                  Precision  Recall  F1
Weight regularization      80%        80%     78%
Dropout                    78%        78%     78%
Batch normalization        55%        55%     54%
Weight reg. & Dropout      55%        55%     54%
Weight reg. & Batch norm.  82%        65%     62%
Batch norm. & Dropout      28%        50%     36%

Table 4.8: Metrics of regularization techniques


Figure 4.6 further highlights the improvements gained by applying the best-performing regularization technique, namely weight regularization. The plotted quantity is calculated by subtracting the training loss from the validation loss; the red area indicates overfitting. The overfitting is delayed and reduced. Additionally, the network settles faster on a minimum, which in turn gives results that are 6% better than the existing models.

Figure 4.6: Difference between validation and training loss for the CNN model with and without regularization

4.5.4 Impact Of Initialization

As a last step towards optimizing the baseline model without enlarging it, the initialization techniques of Table 3.5 are incorporated. The results are visualized in Table 4.9. Precision and recall improve by 3% and 2% respectively with regards to Table 4.8, while the F1-score is improved by 1%, by using LeCun normal initialization [51]. Additionally, the bias is initialized using a constant value of 0.01, as proposed by Li et al. [61]. This ensures that the ReLU activation functions are active from the first epoch on, therefore propagating gradients immediately.

Class     Precision  Recall  F1    Support
I         100%       64%     78%   14
NI        67%        100%    80%   10
Average   83%        82%     79%   24

Table 4.9: Metrics of initialization techniques


4.5.5 Data-Driven Tuning

(a) Including boundary cases (b) Excluding boundary cases

Figure 4.7: Predictions in- and excluding boundary cases per prosody rating

The model resulting from the steps above is as optimized as a single-layer architecture allows. Hence, a look at its predictions is taken with regards to various metadata, namely the prosody ratings, aphasia type, aphasia severity and recording length. As one can see in Figure 4.7 (a), the model struggles to classify patients assigned to the class I, especially for the boundary cases. The label NI is assigned when a recording has a prosody rating of four or lower, but there is no insight available into how severe the difference to a prosody rating of five is. Additionally, there is no research available that hints at a large inter-rater reliability of the interview process, suggesting the possibility of subjectivity in the assignment of the ratings. Hence, a training session is launched which does not include these boundary cases.

Class     Precision  Recall  F1    Support
I         82%        100%    90%   14
NI        100%       70%     82%   10
Average   91%        85%     86%   24

Table 4.10: Metrics of excluded boundary cases

As seen in Table 4.10, the exclusion of boundary cases leads the model to perform better at predicting the class I. No impairment is missed by the model. Further investigation shows an improvement of the performance on Wernicke patients, who are no longer misclassified. This is visualized in Figure 4.8.


(a) Including boundary cases (b) Excluding boundary cases

Figure 4.8: Predictions in- and excluding boundary cases per aphasia type

The same development is shown in Figure 4.9. The model is capable of correctly classifying all light and heavy aphasia cases, in contrast to missing half of the heavy cases when including the boundary cases during training. It also misses fewer moderate aphasia cases. It has to be noted that two of the three misclassified patients are not rated with regards to severity.

(a) Including boundary cases (b) Excluding boundary cases

Figure 4.9: Predictions in- and excluding boundary cases per aphasia severity

Lastly, a look at the length of the segments is taken on an utterance level. By excluding the boundary cases, the model classifies utterances of all lengths more correctly, as seen in Figure 4.10.


(a) Including boundary cases (b) Excluding boundary cases

Figure 4.10: Predictions in- and excluding boundary cases per segment length on a logarithmic scale

This algorithm, which is the best-performing one of this work, surpasses the performance of the method by Kim et al. [2] by 11.5% with regards to its recall. However, due to the nature of the impairments presented by Kim et al. [2], the results are not fully comparable. As stated in Section 2.1.4, the patients of said work suffer from impairments of speech production, not speech processing as a whole.

4.5.6 Impact Of Model Size

After optimizing the one-layer baseline model to an F1-score of 86%, multi-layer versions of said architecture are implemented, as proposed in Section 3.1.4.

Architecture                      Precision  Recall  F1
Existing model                    91%        85%     86%
256 filters, kernel size 15       74%        74%     71%
4 layers                          84%        81%     82%
4 layers + residual connections   91%        85%     86%
ResNet50                          81%        76%     77%

Table 4.11: Metrics of multi-layer architectures

Comparing the performance of the multi-layer architectures, there is no performance benefit in the usage of more complex models, which can be attributed to the small amount of training data available. The residual architecture with four layers, which is the top-performing one among the multi-layer architectures, exhibits the same metrics as the existing model. Regarding the evaluation strategy of Section 4.3, the existing model is hence preferred due to its smaller number of parameters - it has 16,222 parameters compared to 133,942 in the residual architecture.

4.6 Multi-Class Problem

The current top-performing model, shown in Table 4.10, is capable of performing the binary classification task designated in Section 3.1.1, and is yet to be tested on the multi-class problem proposed in Section 3.1.5. Hence, the single-layer CNN is modified to output three softmax probability scores and is trained and evaluated on three classes.

Class     Precision  Recall  F1    Support
0-1       67%        40%     50%   5
2-3       38%        50%     43%   6
4-5       85%        85%     85%   13
Average   63%        58%     59%   24

Table 4.12: Metrics of three class problem

Table 4.12 shows the results of said model. It performs with an F1-score of 59%, peaking in performance for the class 4-5 with 85%. When analyzing the model's predictions with regards to the prosody ratings, shown in Figure 4.11, the algorithm misclassifies all patients with a prosody rating of one. Due to the stratification procedure, no samples of said rating are in the training set, resulting in said behaviour.

Figure 4.11: Predictions per prosody rating

Difficulties mostly arise when trying to classify global aphasia patients: less than half of them are classified correctly. This is visualized in Figure 4.12 (a). In Figure 4.12 (b), one can see that most false predictions lie within moderate-severity patients, who make up the majority of cases.

(a) Type (b) Severity

Figure 4.12: Predictions per aphasia type and severity


Chapter 5

Experiments: Automatic Speech Recognition Of Aphasic Speech

This chapter describes the results of the methods proposed in Section 3.2. Therefore, Section 5.1 first outlines which datasets are used for the training of an ASR model on non-impaired data and for the evaluation on aphasic speech. Section 5.2 highlights how the models are evaluated. The model parameters as well as the training and translation procedures are explained in Sections 5.3 and 5.4, which is followed by the evaluation procedure in Section 5.5. Finally, it is discussed how the aphasic data is evaluated on a model that is trained on non-impaired speakers.

5.1 Speech Corpora

5.1.1 Mozilla Common Voice

The Common Voice corpus [62] is the first corpus for the ASR task. In the German version, it consists of 836 hours from 12,659 speakers, resulting in 246,525 training and 15,588 holdout samples. A validation set of 5,000 samples is additionally extracted. Due to size restrictions on the available hardware, only 117,981 samples from the training set are used, containing 400 hours of speech. The corpus consists of voluntary participants speaking pre-defined sentences from public-domain movie scripts and contributors.

5.1.2 LibriVox

The LibriVox corpus [63] consists of over 547 hours of German speech from 86 audio books. The audio is segmented on a sentence level. Additionally, there are 110 hours of aligned and filtered sentences for machine translation tasks between German and English, which do not concern this work. The data is partitioned into a training, validation and holdout set, reserving three books for validation and testing respectively. The partitioning can be found in Appendix B.

5.1.3 Aachener Aphasie Test

As stated in Section 3.2.2, a subset of ten transcripts per aphasia type and severity is drawn from the corpus provided by Kohlschein [8], as shown in Table 5.1. The data amounts to a total of 9 minutes. The class of heavy amnestic aphasia is missing due to lack of data.

          Amnestic  Broca  Global  Wernicke
Light     10        10     10      10
Medium    10        10     10      10
Heavy     -         10     10      10

Table 5.1: Count of samples per aphasia type and severity

5.2 Evaluation Scheme

For this work, two metrics are collected in an effort to make the predictions of the ASR models comparable.

5.2.1 Word Error Rate

The Word Error Rate (WER) is derived from the Levenshtein distance. It serves the purpose of comparing two sentences or corpora of sentences with regards to the amount of editing necessary to transform the ground truth into the hypothesis. Therefore, as seen in Equation 5.1, the WER is calculated by summing up the substitutions S_w, deletions D_w and insertions I_w and normalizing them by the number of words N_w in the ground truth, or reference.

WER = \frac{S_w + D_w + I_w}{N_w} \quad (5.1)

Because the normalization uses the number of words N_w in the reference, the WER can exceed 1; hence the so-called ratio can be above 100%.

5.2.2 Character Error Rate

Alongside the WER, the Character Error Rate (CER) is calculated. It is derived from the same formula as the WER, but on a character level.

CER = \frac{S_c + D_c + I_c}{N_c} \quad (5.2)

Again, this formula can exceed 1. The usage of the CER is necessary, for example, because of the occurrence of compound words in the German language. For instance, the word 'Fahrgast', which denotes a passenger, is composed of the words for the verb 'drive' and the noun 'guest', namely 'fahr(en)' and 'Gast'. This could be predicted as two separate words or as a compound noun.
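One possible self-contained implementation of these two metrics via the Levenshtein distance is sketched below; the CER simply calls the same routine on character lists instead of word lists.

def edit_distance(reference, hypothesis):
    """Levenshtein distance counting substitutions, deletions and insertions."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[-1][-1]

def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def character_error_rate(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(word_error_rate("es regnet", "es regnet heute sehr stark"))   # 1.5 -> the WER can exceed 1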

5.3 Model Training

As mentioned in Section 3.2.3, the implementation by Pham et al. [3] for Transformer models performing ASR is used in this section. For this work, a stochastic Transformer with 20 encoder and 10 decoder layers employs eight heads with a size of 512. The stochasticity stems from the usage of stochastic residual connections, where a residual block is dropped with a chance of 50%. The size of the feed-forward layers is 2048, and they employ ReLU activation functions. Dropout is applied to various components, such as the multi-head attention, the feed-forward layers and the embedding indices, using the parameters from Pham et al. [3] due to resource constraints. The model training is evaluated by monitoring perplexity. The model is trained independently on the Common Voice corpus [62] and the LibriVox corpus [63] mentioned in Section 5.1. The hardware restrictions impose a maximum training duration of 16 hours, which is chosen as the total training duration. The learning is capped at 256 epochs, which is not reached by any model.

5.4 Translation

For the translation, batches of eight samples are processed, which is the maximum that fits into the available GPU memory for the largest samples. A beam search with beam width 10 is performed in order to enhance predictions on long utterances. The predicted texts contain Byte-Pair encoded sequences, hence the encoded tokens are replaced by the respective byte pairs, producing a coherent text.

5.5 Model Evaluation

Training Corpus       LibriVox              Common Voice
Evaluation Corpus     WER        CER        WER        CER
LibriVox              359.670    390.251    288.351    239.011
Common Voice          219.539    248.386    167.119    155.729
AAT                   238.760    180.899    191.036    121.782

Table 5.2: Corpus level metrics of ASR models per evaluation corpus

In Table 5.2, the results of the training on all corpora are shown per evaluation set, on a corpus level. The evaluation metrics are in a range where they are not comparable to state-of-the-art models (see [3] and [6]). Examples of the actual predictions, displayed in Figure 5.1 along with their respective sentence-level metrics, show that the trained models are of no use, as the predicted sentences bear no resemblance to the supplied ground truth. This applies to all models and evaluation corpora. The LibriVox and Common Voice models are only capable of producing 30 and 37 unique sentences respectively, often generating variations of sentences in terms of length, for 20,939 unique samples. Hence, no evaluation is possible using the obtained results.

Truth: Zieht euch bitte draußen die Schuhe aus .
Prediction: Das ist nicht möglich
WER: 200.000, CER: 147.826

Truth: Es kommt zum Showdown in Gstaad .
Prediction: Ich möchte mich für den Bericht gestimmt werden
WER: 100.000, CER: 76.000

Truth: Ihre Fotostrecken erschienen in Modemagazinen wie der Vogue , Harper ' s Bazaar und Marie Claire .
Prediction: Das ist nicht möglich .
WER: 320.000, CER: 348.000

Truth: Felipe hat eine auch für Monarchen ungewöhnlich lange Titelliste .
Prediction: Ich möchte mich für den Bericht
WER: 150.000, CER: 144.118

Figure 5.1: Examples of ASR output of the Common Voice model

The quality of the results is explained by the training circumstances. Due to the restrictions in hard drive space and training time, the parameter search for feature extraction and model training is cut short, resulting in suboptimal Automatic Speech Recognition results.

5.6 Evaluation On Aphasic Speech

For the sake of completeness, the metrics for the ASR on aphasic speech are supplied in Table 5.3. It is again not possible to generate any insight using these results, as formulated in Section 5.5.


Severity   Metric   Global     Amnestic   Broca      Wernicke
Light      WER      179.333    220.619    150.833    213.167
           CER      112.821    151.587    92.670     140.335
Medium     WER      183.833    195.417    138.667    217.357
           CER      110.848    135.299    92.767     143.779
Heavy      WER      209.500    -          194.500    205.440
           CER      134.935    -          114.433    122.648
Average    WER      190.889    208.018    161.333    211.988
           CER      119.535    143.443    99.957     135.587

Table 5.3: Utterance level ASR metrics per type and severity of aphasia


Chapter 6

Conclusion

The last chapter of this work summarizes the results. First, Section 6.1 addresses the research questions of Section 1.5. This is followed by suggestions for future work in Section 6.2.

6.1 Research Questions

Using the results from the chapters above, the research questions are addressed in this section.

Research question 1: Which features extracted from recordings can be used to correctly discriminate classes of prosody on a dataset of impaired speakers?

Aphasic speech is inherently different from non-impaired speech due to anomalies in speech processing. Therefore, a multitude of embeddings, from straightforward signal processing techniques to extraction methods derived from speech classification competitions, are compared on a multitude of baseline models, with the task of representing aphasic speech in order to classify prosody. Especially spectral and cepstral methods perform well on the task of prosody classification. Non-sequential methods based on summary statistics have been shown to trail in performance on said task when used in combination with deep learning techniques, namely Multi-Layer Perceptrons. Additionally, the unsupervised pre-trained embeddings of Wav2Vec [29] have been demonstrated to be capable of conveying the information necessary to classify prosody, but do not outperform spectral and cepstral methods. Especially when used in combination with single-layer Convolutional Neural Networks, log-Mel filterbanks are shown to exhibit peak performance, closely followed by Mel spectrograms. They are capable of being used for audio classification on aphasic speech.

Research question 2: How do deep learning approaches perform on the task of classifying prosody of impaired speakers?


With the results from research question 1 at hand, the top-scoring baseline model is fine-tuned in order to reach the best possible classification score. It is shown that a single-layer CNN is capable of performing said task. To this end, the impact of voting schemes for the evaluation on a speaker level is analyzed; here, hard voting outperforms soft voting. Additionally, the performance of differing regularization and initialization techniques is evaluated, with weight regularization combined with LeCun initialization [51] outperforming the alternatives. The biggest jump in performance is achieved by neglecting boundary cases during training, which results in a performance gain of 7%. Finally, different model sizes, in addition to transfer learning on image datasets using ResNet50 [52], are evaluated, at best matching the performance of the models trained before and rendering them less of a fit due to their increased complexity. The final model is capable of classifying prosody with an F1-score of 86% on the binary problem of impairment classification. It reaches a perfect recall on the impaired class, which is of high utility when comparing the cost of a false negative to the cost of a false positive in the medical domain. This proves the feasibility of the classification of prosody of aphasic speech. Comparing the recall of said approach with the work of Kim et al. [2], which employs conventional methods of machine learning, a performance gain in recall of 11.5% (85% compared to 73.5%) is reached, with the caveat of not being completely comparable due to the nature of the impairments considered by Kim et al., which concern impairments of motoric speech production exclusively, not brain-related speech processing.

As a final experiment, the algorithm proposed for the binary classification problem is shown to work on the three-class problem proposed by Kohlschein [8], emphasizing the usability of the model.

Research question 3: Can pathological speakers' audio be transcribed using Transformers?

Employing two speech corpora of non-impaired speakers, namely Common Voice [62] and LibriVox [63], two Automatic Speech Recognition systems are trained. As stated in Section 5.5, no evaluation is possible with the results obtained, which yield Word Error Rates and Character Error Rates above 120%. As this is attributed to the parametrization and training circumstances, no answer can be given to the research question; further work can possibly yield positive results for said question.

6.2 Further Work

Building on this work, several paths can be taken in order to create gains in performance. These are presented in this section.


6.2.1 Classification Of Prosody For Aphasic Speech

The embeddings created using Wav2Vec [29] are pre-trained on a speech corpus of non-impaired speakers; however, they are not fine-tuned on aphasic speech. A performance improvement is expected when repeating the evaluation of said embedding after fine-tuning it on the aphasia corpus provided in this work. This also applies to the ResNet50 [52] architecture, with the caveat that this would necessitate a large amount of data, which is currently not available within the AAT. In general, larger speech corpora are expected to increase the performance of the presented algorithms. The Transformer architecture, as well as the models with increased size proposed in Sections 3.1.3 and 3.1.4, can be improved by performing a full parameter search instead of relying on insight from previous evaluations. The same applies to the three-class problem of Section 3.1.5, where a change in data stratification can lead to improvements of the performance for underrepresented classes.

6.2.2 Automatic Speech Recognition Of Aphasic Speech

The models of this work perform worse than state-of-the-art algorithms for both aphasic [6] and non-aphasic [3] speech. This is largely attributed to resource constraints, such as lacking storage space and computation time. Further work with more computational resources is expected to produce better-performing models, also making the transcriptions of aphasic speech more comparable to those of non-aphasic speech, as each output becomes more informative. It is recommended to use the pre-processing approach and full parametrization supplied by Pham et al. [3], which has been proven to produce state-of-the-art results on the SwitchBoard dataset [40]. This work also lacks fine-tuning on aphasic data due to the availability of only a small corpus of text-aligned speech. This can be overcome by investigating the usage of further forced alignment tools, or by manually transcribing more audio files. Additionally, other corpora with impaired speech can be combined with the AAT corpus, such as AphasiaBank [64], which requires authorized access. This opens up the possibility to also evaluate Automatic Speech Recognition on data for heavy cases of amnestic aphasia.


Glossary

additive noise Noise that is added to a given signal. 10

alexia Inability to read words as a whole while being able to identify single letters. 3

attack Time till the first peak in amplitude for a given envelope of a signal. 10

automatism Spontaneous utterance, such as ’yeahyeahyeah’. 2, 4

Byte-Pair Encoding Compression algorithm. The most frequent pair of bytes is replaced by a byte that did not yet occur. 27, 46

delta Differences of the features over time, hence: \Delta_k = f_k - f_{k-1}. Calculation of delta-deltas (\Delta\Delta_k = \Delta_k - \Delta_{k-1}) is also possible. 20, 27

formant Frequencies unique to certain phonemes. 10, 11, 15, 20

jitter Deviation from true periodicity. 10, 11

phonology The study of the function of specific sounds within language. 4, 10, 11, 15, 20

prosody The study of suprasegmentals. II–V, VII, 4–7, 11, 18, 19, 27, 30–32, 36, 37, 39, 42, 49–51

semantic The study of meaning of language signs and references. 4, 15

shimmer Fluctuations in amplitude of a signal. 10, 11

spectrogram Visual representation of signal strength over frequency and time. 9, 10, 14, 17, 19, 21, 49, 61

suprasegmental A speech unit as a reification of phonetic segments. 4


Bibliography

[1] Walter Huber. Aachener Aphasie-Test. Verlag für Psychologie Hogrefe, 1983.

[2] Jangwon Kim, Naveen Kumar, Andreas Tsiartas, Ming Li, and Shrikanth S. Narayanan. Automatic intelligibility classification of sentence-level pathological speech. Computer Speech & Language, 29(1):132–144, January 2015.

[3] Quan Pham, Thai-Son Nguyen, Jan Niehues, Markus Müller, and Alex Waibel. Very deep self-attention networks for end-to-end speech recognition. pages 66–70, 09 2019.

[4] dbs e.V. Aphasie - Informationen für Betroffene und Angehörige. https://www.dbs-ev.de/fileadmin/dokumente/Publikationen/dbs-Information_Aphasie.pdf, 2016. Accessed: 20.12.2020.

[5] Anthony Fox. Prosodic Features and Prosodic Structure: The Phonology of 'Suprasegmentals'. Filologia y linguistica. OUP Oxford, 2002.

[6] Duc Le and Emily Mower Provost. Improving automatic recognition of aphasic speech with AphasiaBank. pages 2681–2685, 09 2016.

[7] Björn Schuller, Stefan Steidl, Anton Batliner, Julia Hirschberg, Judee Burgoon, Alice Baird, Aaron Elkins, Yue Zhang, Eduardo Coutinho, and Keelan Evanini. The Interspeech 2016 computational paralinguistics challenge: Deception, sincerity and native language. pages 2001–2005, 09 2016.

[8] Christian Peter Kohlschein. Automatische Verarbeitung von Spontansprachinterviews des Aachener Aphasie Tests mittels Verfahren des maschinellen Lernens. Shaker, 2019.

[9] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.

[10] Paul Boersma and David Weenink. Praat, a system for doing phonetics by computer. Glot International, 5:341–345, 01 2001.


[11] Florian Eyben, Martin Wöllmer, and Björn Schuller. opensmile – the Munich versatile and fast open-source audio feature extractor. pages 1459–1462, 01 2010.

[12] Stanley Stevens, John Volkmann, and Edwin Newman. A scale for the measurement of the psychological magnitude pitch. Journal of the Acoustical Society of America, 8:185–190, 1937.

[13] David Gerhard. Audio signal classification: History and current techniques. 2003.

[14] Carole Ferrand. Harmonics-to-noise ratio: An index of vocal aging. Journal of Voice: Official Journal of the Voice Foundation, 16:480–7, 01 2003.

[15] Nivja de Jong and Ton Wempe. Praat script to detect syllable nuclei and measure speech rate automatically. Behavior Research Methods, 41:385–90, 05 2009.

[16] Madiha Jalil, Faran Butt, and Ahmed Malik. Short-time energy, magnitude, zero crossing rate and autocorrelation measurement for discriminating voiced and unvoiced segments of speech signals. pages 208–212, 05 2013.

[17] Karmele Lopez-de Ipina, Jesus Alonso, Carlos Travieso, Jordi Sole-Casals, Harkaitz Eguiraun Martinez, Marcos Faundez-Zanuy, Aitzol Ezeiza, Nora Barroso, Miriam Ecay-Torres, Pablo Martinez-Lage, and Unai Lizardui. On the selection of non-invasive methods based on speech analysis oriented to automatic Alzheimer disease diagnosis. Sensors, 5:6730–6745, 05 2013.

[18] Alex Mayle, Zhiwei Mou, Razvan Bunescu, Sadegh Mirshekarian, Li Xu, and Chang Liu. Diagnosing dysarthria with long short-term memory networks. pages 4514–4518, 09 2019.

[19] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9:1735–80, 12 1997.

[20] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.

[21] Mahmood Alhlffee. MFCC-based feature extraction model for long time period emotion speech using CNN. Revue d'Intelligence Artificielle, 34:117–123, 05 2020.

[22] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546, 2013.


[23] Wim Boes and Hugo Van hamme. Audiovisual transformer architectures for large-scale classification and synchronization of weakly labeled audio events. Proceedings of the 27th ACM International Conference on Multimedia, Oct 2019.

[24] Yingying Zhuang, Yuezhang Chen, and Jie Zheng. Music genre classification with transformer classifier. In Proceedings of the 2020 4th International Conference on Digital Signal Processing, ICDSP 2020, pages 155–159, New York, NY, USA, 2020. Association for Computing Machinery.

[25] Margaret Lech, Melissa Stolar, Robert Bolia, and Michael Skinner. Amplitude-frequency analysis of emotional speech using transfer learning and classification of spectrogram images. Advances in Science, Technology and Engineering Systems Journal, 3:363–371, 08 2018.

[26] Felix Burkhardt, Astrid Paeschke, M. Rolfes, Walter Sendlmeier, and Benjamin Weiss. A database of German emotional speech. volume 5, pages 1517–1520, 01 2005.

[27] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. Neural Information Processing Systems, 25, 01 2012.

[28] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. ImageNet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014.

[29] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised pre-training for speech recognition. CoRR, abs/1904.05862, 2019.

[30] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. pages 5206–5210, 04 2015.

[31] K. H. Davis, R. Biddulph, and S. Balashek. Automatic recognition of spoken digits. The Journal of the Acoustical Society of America, 24(6):637–642, 1952.

[32] B. Juang and Lawrence Rabiner. Automatic speech recognition - a brief history of the technology development. 01 2005.

[33] F. Jelinek. Continuous speech recognition by statistical methods. Proceedings of the IEEE, 64:532–556, 1976.

[34] M.J.F. Gales and Steve Young. The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing, 1:195–304, 01 2007.


[35] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Phuongtrang Nguyen, Tara Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29:82–97, 11 2012.

[36] Alex Graves. Sequence transduction with recurrent neural networks. CoRR, abs/1211.3711, 2012.

[37] Alexander Waibel, Toshiyuki Hanazawa, G. Hinton, Kiyohiro Shikano, and K.J. Lang. Phoneme recognition using time-delay neural networks. Acoustics, Speech and Signal Processing, IEEE Transactions on, 37:328–339, 04 1989.

[38] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, December 2011. IEEE Catalog No.: CFP11SRW-USB.

[39] Dzmitry Bahdanau, Kyunghyun Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. ArXiv, 1409, 09 2014.

[40] J. J. Godfrey, E. C. Holliman, and J. McDaniel. Switchboard: telephone speech corpus for research and development. In [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 517–520 vol.1, 1992.

[41] Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur. Purely sequence-trained neural networks for ASR based on lattice-free MMI. pages 2751–2755, 09 2016.

[42] Alberto Abad, Anna Pompili, Angela Costa, and Isabel Trancoso. Automatic word naming recognition for treatment and assessment of aphasia. 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012, 2, 01 2012.

[43] Margaret Forbes, Davida Fromm, and Brian MacWhinney. AphasiaBank: A resource for clinicians. Seminars in Speech and Language, 33:217–22, 08 2012.

[44] Brian McFee, Colin Raffel, Dawen Liang, Daniel P.W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, volume 8, 2015.


[45] Julius Von Hann and Robert DeCourcy Ward. Handbook of Climatology. MacMillan, 1903.

[46] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. Andre, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan, and K. P. Truong. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2):190–202, 2016.

[47] Felix Weninger, Florian Eyben, Björn Schuller, Marcello Mortillaro, and Klaus Scherer. On the acoustics of emotion in audio: What speech, music, and sound have in common. Frontiers in Psychology, 4:292, 05 2013.

[48] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.

[49] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. JMLR Workshop and Conference Proceedings.

[50] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.

[51] Yann LeCun, Leon Bottou, Genevieve Orr, and Klaus-Robert Müller. Efficient backprop. 08 2000.

[52] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[53] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.

[54] Kamalesh Palanisamy, Dipika Singhania, and Angela Yao. Rethinking CNN models for audio classification, 2020.

[55] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alex Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. 06 2007.


[56] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, August 2016. Association for Computational Linguistics.

[57] ReadBeyond. Aeneas. https://www.readbeyond.it/aeneas/, 2015. Accessed: 2020-12-27.

[58] Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. Deep speech: Scaling up end-to-end speech recognition. CoRR, abs/1412.5567, 2014.

[59] Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger. Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. pages 498–502, 08 2017.

[60] Timothy Dozat. Incorporating Nesterov momentum into Adam. 2016.

[61] Fei-Fei Li, Ranjay Krishna, and Danfei Xu. Lecture notes on CS231n: Convolutional Neural Networks for Visual Recognition, April 2020. Accessed: 2020-12-27.

[62] Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. Common Voice: A massively-multilingual speech corpus. CoRR, abs/1912.06670, 2019.

[63] Benjamin Beilharz, Xin Sun, Sariya Karimova, and Stefan Riezler. LibriVoxDeEn: A corpus for German-to-English speech translation and speech recognition. CoRR, abs/1910.07924, 2019.

[64] Elizabeth Bates. AphasiaBank German nonprotocol CAP corpus, 2004. Accessed: 2021-01-02.


Appendix A

Feature Evaluation Results

Features              Config         Accuracy  Precision  Recall  F1

Functionals eGeMAPS   4x DNN (15)       33%       17%       50%   25%
                      4x DNN (40)       41%       56%       53%   38%
                      1x DNN (15)       43%       55%       54%   41%
                      1x DNN (40)       33%       61%       50%   25%
                      Average           37%       47%       52%   32%

Functionals ComParE   4x DNN (15)       33%       17%       50%   25%
                      4x DNN (40)       33%       17%       50%   25%
                      1x DNN (15)       33%       17%       50%   25%
                      1x DNN (40)       33%       17%       50%   25%
                      Average           33%       17%       50%   25%

Table A.1: Metrics of feature evaluation (Part 1)
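The eGeMAPS and ComParE descriptor sets evaluated in Tables A.1 and A.2 are defined by the openSMILE project [46, 47]. As orientation only, the following minimal sketch extracts both the utterance-level functionals and the frame-level low-level descriptors (LLDs) with the opensmile Python wrapper; the choice of this wrapper, the feature-set versions (eGeMAPSv02, ComParE_2016) and the file name are illustrative assumptions, not the configuration used in the experiments.

    import opensmile

    # Utterance-level functionals of the eGeMAPS parameter set.
    smile_func = opensmile.Smile(
        feature_set=opensmile.FeatureSet.eGeMAPSv02,
        feature_level=opensmile.FeatureLevel.Functionals,
    )

    # Frame-level low-level descriptors (LLDs) of the ComParE set.
    smile_lld = opensmile.Smile(
        feature_set=opensmile.FeatureSet.ComParE_2016,
        feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
    )

    # Both calls return pandas DataFrames: one row per utterance for the
    # functionals, one row per frame for the LLDs.
    functionals = smile_func.process_file("utterance.wav")
    llds = smile_lld.process_file("utterance.wav")

Functionals yield one fixed-length vector per utterance, which is why Table A.1 pairs them with DNN configurations, whereas the time-resolved LLDs feed the CNN, LSTM and Transformer configurations of Table A.2.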


Features      Config         Accuracy  Precision  Recall  F1

LLD eGeMAPS   4x CNN (15)       49%       58%       57%   49%
              4x CNN (40)       47%       58%       56%   47%
              1x CNN (15)       46%       58%       56%   45%
              1x CNN (40)       43%       58%       55%   42%
              4x LSTM (15)      33%       17%       50%   25%
              4x LSTM (40)      33%       17%       50%   25%
              1x LSTM (15)      34%       46%       49%   27%
              1x LSTM (40)      33%       17%       50%   25%
              Transformer       33%       11%       33%   16%
              Average           39%       38%       51%   33%

LLD ComParE   4x CNN (15)       34%       53%       50%   27%
              4x CNN (40)       36%       57%       51%   31%
              1x CNN (15)       37%       55%       52%   33%
              1x CNN (40)       36%       56%       51%   30%
              4x LSTM (15)      33%       17%       50%   25%
              4x LSTM (40)      33%       17%       50%   25%
              1x LSTM (15)      39%       48%       49%   38%
              1x LSTM (40)      47%       50%       50%   47%
              Transformer       41%       55%       53%   40%
              Average           37%       45%       51%   33%

Table A.2: Metrics of feature evaluation (Part 2)


Features              Config         Accuracy  Precision  Recall  F1

log-Mel filterbanks   4x CNN (15)       41%       61%       55%   38%
                      4x CNN (40)       46%       60%       57%   45%
                      1x CNN (15)       56%       65%       64%   56%
                      1x CNN (40)       58%       64%       65%   58%
                      4x LSTM (15)      49%       55%       55%   49%
                      4x LSTM (40)      42%       57%       54%   40%
                      1x LSTM (15)      47%       56%       55%   47%
                      1x LSTM (40)      54%       54%       55%   53%
                      Transformer       50%       62%       60%   50%
                      Average           49%       59%       58%   48%

Melspectrograms       4x CNN (15)       51%       60%       59%   51%
                      4x CNN (40)       51%       63%       61%   51%
                      1x CNN (15)       49%       59%       58%   49%
                      1x CNN (40)       50%       60%       59%   49%
                      4x LSTM (15)      41%       53%       52%   39%
                      4x LSTM (40)      37%       51%       50%   33%
                      1x LSTM (15)      50%       58%       57%   50%
                      1x LSTM (40)      39%       55%       52%   37%
                      Transformer       53%       58%       58%   53%
                      Average           47%       57%       56%   46%

MFCCs                 4x CNN (15)       35%       52%       50%   28%
                      4x CNN (40)       39%       51%       50%   37%
                      1x CNN (15)       36%       52%       51%   32%
                      1x CNN (40)       33%       42%       50%   25%
                      4x LSTM (15)      46%       52%       52%   46%
                      4x LSTM (40)      37%       51%       50%   33%
                      1x LSTM (15)      49%       52%       52%   48%
                      1x LSTM (40)      47%       50%       50%   46%
                      Transformer       35%       57%       51%   28%
                      Average           39%       51%       51%   36%

Wav2Vec               4x CNN (15)       35%       52%       50%   30%
                      4x CNN (40)       38%       54%       52%   34%
                      1x CNN (15)       41%       60%       55%   39%
                      1x CNN (40)       49%       62%       59%   48%
                      4x LSTM (15)      37%       50%       50%   34%
                      4x LSTM (40)      33%       17%       50%   25%
                      1x LSTM (15)      45%       54%       53%   45%
                      1x LSTM (40)      37%       55%       51%   32%
                      Transformer       33%       17%       50%   25%
                      Average           39%       47%       52%   35%

Table A.3: Metrics of feature evaluation (Part 3)
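For reference, the Mel-spectrogram, log-Mel filterbank and MFCC features compared in Table A.3 form a processing chain over the short-time spectrum, and all three can be extracted with librosa [44]. The sketch below is illustrative only: the file name, sampling rate, window length, hop length and filter counts are assumed values, not the settings used in the experiments.

    import librosa

    # Load an utterance; 16 kHz is an assumed sampling rate.
    y, sr = librosa.load("utterance.wav", sr=16000)

    # Mel spectrogram: 25 ms windows, 10 ms hop, 40 Mel bands (assumed).
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40
    )

    # log-Mel filterbanks: the same energies on a logarithmic (dB) scale.
    log_mel = librosa.power_to_db(mel)

    # MFCCs: a discrete cosine transform applied to the log-Mel energies.
    mfcc = librosa.feature.mfcc(S=log_mel, sr=sr, n_mfcc=13)

Each result is a (bands × frames) matrix, i.e. the two-dimensional input format consumed by the CNN, LSTM and Transformer configurations above. Wav2Vec embeddings, in contrast, come from a pretrained self-supervised model rather than from fixed signal processing.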


Appendix B

LibriVox Partition

B.1 Validation Set

- 98.lustige geschichten 1709 librivox

- 16.stoertebeker 1001 librivox 64kb mp3

- 2.aquis submersus 0902 librivox 64kb mp3

B.2 Holdout Set

- 78.sommer in london 0711 librivox 64kb mp3

- 34.peterchens mondfahrt librivox 64kb mp3

- 104.jonathan frock 1312 librivox
