New features for on-line aphasia therapy
Anna Maria Pompili
Thesis to obtain the Master of Science Degree in
Information Systems and Computer Engineering
Examination Committee
Chairperson: Prof. Pedro Manuel Moreira Vaz Antunes de Sousa
Supervisor: Prof. Isabel Maria Martins Trancoso
Supervisor: Dr. Alberto Abad Gareta
Member of the Committee: Prof. Alfredo Manuel dos Santos Ferreira Júnior
November 2013
To Giuseppina and Francesco.
Acknowledgments

My deepest gratitude goes to Professor Alberto Abad. He always guided and supported me in the most comprehensive and constructive way, providing brilliant ideas, showing me the right approach to address complicated problems, and readily helping me to overcome the many difficulties that I had to tackle while pursuing the objectives of this thesis. His guidance, constant incentives and endless availability have been fundamental for the achievement of this result.
I wish to express my gratitude to Professor Isabel Trancoso, not only for her valuable guidance, but also for having welcomed me into the L2F group. During the time spent here, she always motivated me with inspiring discussions, and provided me with her full support and availability. She never missed a chance to demonstrate her trust in me, and constantly accompanied my work, enlightening my way with her unique ability to identify innovative research directions and enticing applications for the results achieved during this work.
I owe a very special acknowledgment to Isabel Pavão Martins, José Fonseca, Gabriela Leal, Luísa Farrajota, and Sofia Clérigo, from the Language Research Laboratory group (LEL - Laboratório de Estudos de Linguagem) of the Lisbon Faculty of Medicine. Their cooperation has been fundamental in making the VITHEA project a reality.
I also want to thank Professor Nuno Mamede and Professor Sara Candeias from the L2F group, for having kindly provided important resources that constituted the baseline for some of the results achieved in this work. Without this initial groundwork, those results would not have been possible.
Thank you also to all the colleagues and room-mates that I have had the pleasure to know during these years. They supported this experience not only with their kindness and friendship, but also by actively participating in the data collection and user evaluation experiments.
Finally, my special thanks go to Paolo, my companion. His advice, care, and support have been invaluable in overcoming the hardest difficulties.
Resumo

Afasia é um tipo particular de distúrbio da comunicação causado por lesões de uma ou mais áreas do cérebro que afectam diferentes funcionalidades da linguagem e da fala. Os acidentes vasculares cerebrais são uma das causas mais comuns desta doença.

VITHEA (Terapeuta Virtual para o tratamento da afasia) é uma plataforma on-line desenvolvida para o tratamento de doentes afásicos, incorporando os recentes avanços das tecnologias de fala para proporcionar exercícios de nomeação a pessoas com uma reduzida capacidade de nomear objetos. O sistema, disponível ao público desde Julho de 2011, recebeu já vários prémios nacionais e internacionais e é atualmente distribuído a cerca de 160 utilizadores entre profissionais de saúde e doentes.

O foco deste trabalho é investigar a viabilidade da incorporação de funcionalidades adicionais que podem potenciar o sistema VITHEA. Essas funcionalidades visam tanto estender a usabilidade do sistema quanto reforçar o seu desempenho, considerando assim várias áreas heterogéneas do projeto. Entre estas funcionalidades destacam-se: uma nova versão do aplicativo cliente para estender a portabilidade da plataforma a dispositivos móveis, uma interface hands-free para facilitar os doentes portadores de deficiências físicas, e uma funcionalidade de pesquisa avançada para melhorar a gestão dos dados da aplicação. Foi também estudada a viabilidade de um novo tipo de exercícios e avaliado o desempenho de um novo léxico de pronúncia com o objectivo de melhorar os resultados de reconhecimento. Em geral, os resultados de questionários de satisfação dos utilizadores e as avaliações automáticas têm proporcionado feedback encorajador sobre as melhorias desenvolvidas.

Palavras-chave: Afasia, recuperação da linguagem, terapia virtual, distúrbio da fala, nomeação oral, reconhecimento de fala
Abstract

Aphasia is a particular type of communication disorder caused by damage to one or more language areas of the brain, affecting various speech and language functionalities. Cerebral vascular accidents are one of its most common causes.

VITHEA (Virtual Therapist for Aphasia Treatment) is an on-line platform developed for the treatment of aphasic patients, incorporating recent advances in speech and language technologies to provide word naming exercises to individuals with lost or reduced word naming ability. The system, publicly available since July 2011, has received several national and international awards and is currently distributed to almost 160 users among health-care professionals and patients.
The focus of this thesis is to investigate the feasibility of incorporating additional functionalities that may enhance the VITHEA system. These features aim at both extending the usability of the system and strengthening its performance, and thus involve several heterogeneous areas of the project. The main new features were: a new version of the client application to extend the portability of the platform to mobile devices, an ad-hoc hands-free interface to facilitate patients with physical disabilities, and an advanced search capability to improve the management of the application data. This study also included the assessment of the feasibility of a new type of exercise, and the evaluation of a new pronunciation lexicon aimed at improving recognition results. Overall, the results of user interaction satisfaction questionnaires and the automatic evaluations have provided encouraging feedback on the outcome of the developed improvements.
Keywords: Aphasia, language recovery, virtual therapy, speech disorder, word naming, speech
recognition
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
List of abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Structure of this Document . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Related Work 5
2.1 Aphasia language disorder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Aphasia symptoms classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Aphasia treatment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Automatic speech recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1 Brief introduction to automatic speech recognition . . . . . . . . . . . . . . . . . . 7
2.2.2 AUDIMUS speech recognizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.3 Automatic word verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3.0.1 Word verification based on keyword spotting . . . . . . . . . . . 10
2.2.3.0.2 Keyword spotting with AUDIMUS . . . . . . . . . . . . . . . . . . 11
2.3 Platform for speech therapy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 VITHEA: An on-line system for virtual treatment of aphasia . . . . . . . . . . . . . 12
2.3.1.1 The patient and the clinician applications . . . . . . . . . . . . . . . . . . 13
2.3.1.1.1 Patient application module . . . . . . . . . . . . . . . . . . . . . 13
2.3.1.1.2 Virtual character animation and speech synthesis . . . . . . . . 13
2.3.1.1.3 Speech synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1.1.4 Clinician application module . . . . . . . . . . . . . . . . . . . . 14
2.3.1.2 Platform architecture overview . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 New features for aphasia therapy: State of the art . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.1 Content adaptation for mobile devices . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.2 Hands-free speech . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.3 Exploiting IR for improved search functionality . . . . . . . . . . . . . . . . . . . . . 18
2.4.4 New automatic evocation exercises for therapy treatment . . . . . . . . . . . . . . 19
2.4.5 Exploiting syllable information in word naming recognition of aphasic speech . . . 20
3 Content adaptation for mobile devices 21
3.1 Service Oriented Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.1 Representational State Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 Data representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.3 Android Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Architectural overview of the implemented prototype . . . . . . . . . . . . . . . . . . . . . 24
3.2.1 REST authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.2 Implemented architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.2.0.1 Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2.0.2 Data representation . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.3 Client application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 User experience evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4 Hands-free speech recording 31
4.1 Voice activity detection task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2.1 Speech corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5 Exploiting IR for improved search functionality 41
5.1 Extended search functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.1.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1.1.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1.1.2 Metadata generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1.1.3 Indexes generation and management . . . . . . . . . . . . . . . . . . . . 44
5.2 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6 New automatic evocation exercises for therapy treatment 49
6.1 Automatic animal naming recognition task . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.1.1 Keyword spotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.1.2 Keyword model generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.1.3 Background penalty for keyword spotting tuning . . . . . . . . . . . . . . . . . . . . 51
6.2 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.2.1 Speech corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7 Exploiting syllable information in word naming recognition of aphasic speech 57
7.1 Syllabification task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.1.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.2 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.2.1 Speech corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
7.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
8 Conclusions 63
8.1 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Bibliography 72
List of Tables

4.1 Baseline configuration established through exhaustive search. . . . . . . . . . . . . . . . 39
4.2 Results obtained on the development test set with the baseline configuration. . . . . . . . 39
4.3 Results obtained on the evaluation set with the baseline configuration. . . . . . . . . . . . 40
5.1 Coverage of the additional metadata generated. . . . . . . . . . . . . . . . . . . . . . . . 43
5.2 Precision and recall for each of the indexes generated. . . . . . . . . . . . . . . . . . . . . 45
5.3 Number of results returned by the system using the extended search feature and using a
standard search functionality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.1 Speech corpus data, including gender, total number of words and the total number of valid
words uttered. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.2 Experiments data and resulting average WER, including file size information. . . . . . . . 53
6.3 Experiments data and resulting average WER, including file size information. . . . . . . . 54
6.4 Experiments data and resulting average WER. . . . . . . . . . . . . . . . . . . . . . . . . 55
6.5 Automatic and manual WER with the configuration 2 of the last set of experiments. . . . 56
7.1 Average WVR for the APS-I and APS-II corpus with different pronunciation models. . . . . 60
7.2 WVR for APS-I and APS-II data sets and average WVR, using automatically calibrated
background penalty term. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
List of Figures

2.1 Block diagram of AUDIMUS speech recognition system. . . . . . . . . . . . . . . . . . . . 10
2.2 Comprehensive overview of the VITHEA system. . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Screen-shots of the VITHEA patient application. . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Interface for the creation of new stimulus. . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Interface for the management of multimedia resources. . . . . . . . . . . . . . . . . . . . . 16
3.1 Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Screen-shots of the VITHEA mobile patient application. . . . . . . . . . . . . . . . . . . . 28
3.3 Results of the evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Distribution of the user grades for the questions of the third group. . . . . . . . . . . . . . 30
4.1 Architectural implementation of the VAD algorithm. . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Process of generation of the speech corpus. . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.1 Structure of the objects of the VITHEA system that are of interest for the search functionality. 43
5.2 Results provided for the search query “seco” (dry) on the field answer of a Question. . . . 46
5.3 Results provided for the search query “alimento” (food) on the field answer of a Question. 46
5.4 Results provided for the search query <“harpia” (harpy), “animais” (animals)> on the
fields title and category of a document. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.1 First set of experiments using the keyword model with different values for the threshold. . 53
6.2 Second set of experiments using the keyword model filtered with Onto.PT and different
values for the threshold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.3 Third set of experiments, including phonetic transcription correction and filled pause models. 55
7.1 Results for the APS-I comparing the two pronunciation lexicons, the standard and the
augmented version provided with syllable boundaries. . . . . . . . . . . . . . . . . . . . . 59
7.2 Results for the APS-II comparing the two pronunciation models, the standard and the
augmented version provided with syllable boundaries. . . . . . . . . . . . . . . . . . . . . 60
List of abbreviations

ANN Artificial Neural Network
ASR Automatic Speech Recognition
CSR Continuous Speech Recognition
CVA Cerebral Vascular Accident
JSON JavaScript Object Notation
IWR Isolated Word Recognition
LVCSR Large Vocabulary Continuous Speech Recognition
MLP Multilayer Perceptron
REST Representational State Transfer
RPC Remote Procedure Call
WFST Weighted Finite State Transducer
TTS Text-To-Speech
SOA Service-Oriented Architecture
SOAP Simple Object Access Protocol
URI Uniform Resource Identifier
XML Extensible Markup Language
WVR Word Verification Rate
WER Word Error Rate
VAD Voice Activity Detection
1 Introduction

1.1 Motivation

Aphasia is a particular type of communication disorder caused by damage to one or more language areas of the brain, affecting various speech and language functionalities. Cerebral vascular accidents are one of its most common causes. A frequent symptom is the difficulty in recalling names or words. Typically, such problems can be treated through word naming therapeutic exercises. In fact, the frequency and intensity of speech therapy are a key factor in recovery, thus motivating the development of automatic therapy methods that may be used remotely.
VITHEA (Virtual Therapist for Aphasia Treatment) is an on-line platform developed for the treatment
of aphasic patients, incorporating recent advances of speech and language technologies to provide
word naming exercises to individuals with lost or reduced word naming ability. The project started in
June 2010 and saw the release of the first public prototype in July 2011. Since then, the system has
continuously evolved with improvements both on the speech recognition techniques used and on the
functionalities provided to patients and therapists. After three years of active development, the project is now used daily by patients and speech therapists, and has received awards from both the speech and the health-care communities.
The success of the system motivated research on additional features which could extend its functionality and robustness. These new features cover a heterogeneous set of enhancements that includes, among others, the evaluation of new approaches to improve the recognition quality and the recording process, the development of new interfaces to improve the user experience, and the experimental implementation of new types of exercises.
The focus of the present work is, as the title states, to investigate the feasibility of incorporating additional functionalities that may improve the VITHEA system. These features aim at both extending the usability of the system and strengthening its performance. The former will be achieved by providing a new version of the client application for mobile devices, a hands-free interface for an easier recording experience, an advanced search functionality for improved management of platform data, and a new type of exercise. Regarding system performance, a new approach that considers the syllabic division of words will be studied and tested within the current speech recognition process.
The VITHEA system comprises two specific modules, dedicated respectively to the patients for car-
rying out the therapy sessions and to the clinicians for the administration of the functionalities related to
them.
Since smartphones, tablets and cellphones have become mainstream over the last few years, mobile services are increasingly integrated into everyday life. In some cases, a smartphone may be cheaper than a computer, more practical, and even easier to use, as it does not require an external input device. Thus, the adaptation of the VITHEA platform to mobile devices has been considered highly important for the diffusion of the system. However, this extension is currently limited by the recording module of the application. Here, an architecture compliant with the new requirements and a client version for mobile devices have been designed and implemented in order to verify the performance and the level of user appreciation of the mobile version.
Related also to the client module, it is worth noticing that most of the time aphasia is the consequence of a Cerebral Vascular Accident (CVA) and, in those cases, affected patients may also experience some sort of physical disability in arm mobility. In such situations, support for a hands-free interface notably improves the usability of the system. However, the typical extension of these interfaces for human-computer interaction consists of voice commands as an alternative input modality. In the particular case of the VITHEA project, since the users of the system are affected by a language disorder, hands-free computing cannot be interpreted as an alternative way of interaction; instead, it is selectively applied to automate the process of recording the users' answers and thus provide additional benefits to people experiencing disabilities. The moment when the recording process should start can be efficiently determined automatically, by taking as reference the end of the description of the stimulus spoken by the virtual therapist, or the end of the subsequent reproduction of the audio/video file in the case of a multimedia stimulus. Detecting the end of the speech is a more challenging issue. Common solutions rely on Voice Activity Detection (VAD) approaches, which automatically try to determine the presence or absence of voice based on some features of the input signal. Depending on the implementation, the features used may vary. In this work, the energy of the speech signal has been used as a baseline to develop an algorithm that automatically detects the end of the speech.
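As an illustration of this energy-based idea, the following Python sketch decides that an answer has ended once enough consecutive low-energy frames follow some detected speech. It is a minimal sketch under assumed parameters (the frame size, `energy_threshold` and `max_silent_frames` values are hypothetical and would need calibration on real recordings), not the actual VITHEA implementation.

```python
import numpy as np

def detect_end_of_speech(frames, energy_threshold=0.01, max_silent_frames=50):
    """Return the index of the frame where speech is judged to have ended,
    or None if no end of speech was detected.

    frames: iterable of 1-D numpy arrays (fixed-length audio frames).
    energy_threshold, max_silent_frames: illustrative values that would
    need calibration against real background noise conditions.
    """
    silent_run = 0          # consecutive low-energy frames seen so far
    speech_started = False  # only stop after speech has actually begun
    for i, frame in enumerate(frames):
        # short-term energy: mean of squared samples in this frame
        energy = float(np.mean(frame.astype(np.float64) ** 2))
        if energy >= energy_threshold:
            speech_started = True
            silent_run = 0
        elif speech_started:
            silent_run += 1
            if silent_run >= max_silent_frames:
                return i  # enough trailing silence: the answer has ended
    return None
```

In a real deployment the frames would arrive from the microphone in a streaming fashion, and the threshold would typically be adapted to the background noise level estimated at the start of the recording.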
On the other hand, the objective of the clinician module is to allow the management of patient data, as well as of the collection of exercises and the resources associated with them. Over the last years, several improvements were introduced to allow the incorporation of new exercises and the creation and management of groups of speech-language therapists and patients. However, many important functionalities that affect the overall usability of the clinician module were still missing. The management of the exercise data, which now exceeds one thousand stimuli, only provided a listing functionality, missing the option to search for a given stimulus. Considering the amount of data stored in the system, the lack of a search feature strongly affects the daily usage of the module. Besides, it should be noted that the data constituting the exercises and the stimuli is somewhat peculiar in its format. In fact, most of the time, it is represented by a single keyword (i.e., the title of a document). This means that if the therapist does not remember the exact term he/she is looking for, the search will probably fail. For these reasons, it is important that the search functionality takes these constraints into account and thus provides extended search capabilities. Techniques from the area of Information Retrieval, such as Query Expansion, will be exploited to achieve this purpose.
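To make the idea concrete, the following Python sketch shows query expansion in its simplest form: each query term is expanded with related terms before matching, so a search can succeed even when the stored keyword differs from the one the therapist remembers. The synonym table and the whitespace-based matching are purely illustrative stand-ins for the ontological resources and the full-text engine used in the actual system.

```python
# Hypothetical synonym map; in VITHEA the expansion terms would come from
# ontological resources rather than a hard-coded dictionary like this one.
SYNONYMS = {
    "seco": ["árido", "enxuto"],
    "alimento": ["comida", "refeição"],
}

def expand_query(terms):
    """Return the query terms together with their known related terms."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term.lower(), []))
    return expanded

def search(documents, terms):
    """Return documents matching ANY expanded term (OR semantics)."""
    expanded = {t.lower() for t in expand_query(terms)}
    return [doc for doc in documents
            if expanded & {w.lower() for w in doc.split()}]
```

With OR semantics over the expanded set, a query for "alimento" also retrieves stimuli titled "comida", at the cost of lower precision; this is the trade-off that the precision and recall evaluation in Chapter 5 quantifies.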
Regarding the therapeutic exercises, there are several naming tasks for assessing the patient’s ability to provide the verbal label of objects, actions, events, attributes, and relationships. There
are different types of naming tasks, such as category naming, confrontation naming, automatic closure naming, automatic serial naming, recognition naming, repetition naming, and responsive naming [Campbell 05, Murray 01]. Currently, the VITHEA system supports exercises based on visual confrontation, automatic closure naming, and responsive naming. The integration of automatic serial naming or semantic category naming exercises would be of valuable help for patients recovering from aphasia.
Finally, during preliminary experiments evaluating the performance of the word naming recognition task within the VITHEA system, an analysis of word detection errors was performed [Abad 13]. These results showed that among the characteristics of aphasic speech that sometimes cause keywords to be missed are pauses between syllables and mispronounced phonemes. Recordings have confirmed that some patients tend to speak with a slow rhythm, almost as if they were dividing the word into syllables. This phenomenon, in an even more pronounced form, was also directly observed in different experimentation sessions with the system, performed either by a patient or by a healthy subject. In these contexts, when the system failed to recognize the user’s answer, the user typically started to syllabify the word. These reasons have motivated the idea of investigating the integration of an external speech tool that performs the syllabification of words. The syllabified version will constitute an augmented grammar for the recognizer that will hopefully improve its performance.
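As a sketch of how such syllable information might be exploited, the following Python fragment augments a toy pronunciation lexicon with variants that allow an optional pause between syllables, so that slow, syllabified productions can still be matched. The phone symbols, the `sil` marker, and the `#syl` variant naming are all illustrative assumptions, not the actual AUDIMUS lexicon format.

```python
def augment_lexicon(lexicon, syllabify):
    """Add, for each word, a variant pronunciation with a silence marker
    ("sil") inserted between syllables.

    lexicon:   dict mapping word -> list of phones,
               e.g. {"gato": ["g", "a", "t", "u"]}
    syllabify: function mapping word -> list of syllable phone lists.
    """
    augmented = dict(lexicon)  # keep the standard pronunciations
    for word in lexicon:
        syllables = syllabify(word)
        variant = []
        for i, syl in enumerate(syllables):
            variant.extend(syl)
            if i < len(syllables) - 1:
                variant.append("sil")  # allow a pause between syllables
        augmented[word + "#syl"] = variant  # illustrative variant naming
    return augmented
```

In practice the syllabification function would be provided by the external tool mentioned above, and the recognizer would treat the two entries as alternative pronunciations of the same keyword.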
1.2 Objectives

All the objectives that were identified in the thesis proposal [Pompili 13] were implemented, with the exception of the “awareness and profiling” functionality. This feature has been replaced by an advanced search capability that has been integrated into the clinician module. In fact, during the evolution of the thesis work, this feature appeared more interesting and useful for the improvement of the project, to the point of justifying the introduction of this amendment. Thus, the main goals of this work are:
• content adaptation for mobile devices,
• hands-free speech interface,
• advanced search capability,
• new naming exercises,
• syllabification tool.
1.3 Structure of this Document

The present thesis consists of 8 chapters, structured as follows:
• Chapter 2 starts by reporting on background concepts and on the state of the art of on-line platforms dedicated to speech disorders, describing the VITHEA system in further detail. Then, it focuses on the specification of the new features that represent the target of this work, reporting the current state of the art, where applicable.
• Chapter 3 reports on the architecture, the design, and the security pattern that have been followed to develop a new version of the system supported by mobile devices. The constraints that have guided the ultimate prototype, and the choices that have been made, are here justified, motivated and explained. Then, the Chapter concludes with the description of the results of a user experience evaluation.
• Chapter 4 describes in detail both the options chosen for the implementation of a VAD approach carried out at the same time as the recording process, and the architectural updates involved in its integration within the VITHEA system. The Chapter ends with the evaluation of the algorithm through automated tests carried out with the recordings of daily users of the system.
• In Chapter 5 the focus is on improving the management of the data of the system by providing an advanced search feature. This is achieved through the generation of metadata provided by ontological resources. These data are then exploited by a query expansion process and a full-text search engine to provide an extended set of results. Precision and recall measures for a given test set of queries are reported at the end of the Chapter.
• Chapter 6 explains the concept of evocation exercises and how a specific subclass, the animal
naming, has been implemented through an iterative process of enhancements. The construction
of the baseline list of admissible animals, constituting a key component for the recognition process,
is detailed together with the automated evaluation carried out through the collection of a speech
corpus.
• Chapter 7 introduces the issues that surround the task of syllabic division of words and describes how an external software tool that provides orthographic syllabification has been adapted to the architecture of the speech recognition engine AUDIMUS. Then, the results of the automated tests, carried out with the corpus of aphasic patients collected during the VITHEA project, are described.
• Finally, Chapter 8 presents the conclusions and future work.
2 Related Work

This chapter aims at providing both important background knowledge that will be referred to in the rest of this document and the relevant state of the art for the new targeted features. It is divided into four main sections: first, a short background on aphasia and common therapeutic approaches is given (Section 2.1); then, an overview of an Automatic Speech Recognition (ASR) system is provided in Section 2.2, focusing on AUDIMUS, the in-house speech recognition engine used. Section 2.3 describes currently known platforms providing on-line tools for voice disorders, with particular focus on the VITHEA system. Finally, Section 2.4 is devoted to describing the state of the art relevant to each of the new features addressed in this work.
2.1 Aphasia language disorder

Aphasia is a speech disorder which comprises difficulties in both production and comprehension of spoken or written language. It is caused by damage to one or more of the language areas of the brain, and typically occurs after brain injuries. There are several causes of brain injuries affecting communication skills, such as brain tumours, brain infections, severe head injuries, and, most commonly, cerebral vascular accidents (CVA). Among the effects of aphasia, the difficulty in recalling words or names is the most common disorder presented by aphasic individuals. In fact, it has been reported in some cases as the only residual deficit after rehabilitation [Wilshire 00]. Several studies about aphasia have demonstrated the positive effect of speech-language therapy activities on the improvement of social communication abilities [Basso 92]. Moreover, it has been shown that the intensity of therapy positively influences speech and language recovery in aphasic patients [Bhogal 03].
2.1.1 Aphasia symptoms classification
We can classify the various aphasia syndromes by characterizing the speech output in two broad categories: fluent and non-fluent aphasia [Goodglass 93]. Fluent aphasia preserves normal articulation and rhythm of speech, but is deficient in meaning. Typically, there are word-finding problems that most affect nouns and picturable action words. Non-fluent aphasic speech is slow and laboured, with short utterance length. The flow of speech is more or less impaired at the levels of speech initiation, the finding and sequencing of articulatory movements, and the production of grammatical sequences. Following the above classification, we list the major types of aphasia and their properties:
1. Fluent
(a) Wernicke’s aphasia, caused by damage to the temporal lobe of the brain, is one of the most
common syndromes in fluent aphasia. People with Wernicke’s aphasia may speak in long
sentences that have no meaning, adding unnecessary or made-up words. Individuals with
Wernicke’s aphasia usually have great difficulty understanding the speech of both themselves
and others and are therefore often unaware of their mistakes.
(b) Transcortical aphasia presents deficits similar to those of Wernicke’s aphasia, but repetition ability
remains intact.
(c) Conduction aphasia is caused by deficits in the connections between the speech-
comprehension and speech-production areas. Auditory comprehension is near normal, and
oral expression is fluent with occasional paraphasic errors. Repetition ability is poor.
(d) Anomic aphasia is characterized by difficulties naming certain words, linked by their
grammatical type (e.g. difficulty naming verbs but not nouns) or by their semantic category
(e.g. difficulty naming words relating to photography but nothing else), or by a more general
naming difficulty.
2. Non-fluent
(a) Broca’s aphasia is caused by damage to the frontal lobe of the brain. People with Broca’s
aphasia may speak in short phrases that make sense but are produced with great effort.
People with Broca’s aphasia typically understand the speech of others fairly well. Because of
this, they are often aware of their difficulties and can become easily frustrated.
(b) Global aphasia presents severe communication difficulties; individuals with global aphasia
are extremely limited in their ability to speak or comprehend language. They may be totally
non-verbal, and/or only use facial expressions and gestures to communicate.
(c) Transcortical Motor aphasia presents deficits similar to those of Broca’s aphasia, except that
repetition ability remains intact. Auditory comprehension is generally fine for simple conversations, but
declines rapidly for more complex conversations.
2.1.2 Aphasia treatment
In some cases, a person will completely recover from aphasia without treatment. This type of sponta-
neous recovery usually occurs following a type of stroke in which blood flow to the brain is temporarily
interrupted, but quickly restored, called a transient ischemic attack. In these circumstances, language
abilities may return in a few hours or a few days. In most cases, however, language recovery is not
as quick or as complete. While many people with aphasia experience partial spontaneous recovery, in
which some language abilities return a few days to a month after the brain injury, some residual disor-
ders typically remain. In these instances, most clinicians would recommend speech-language therapy.
The recovery process usually continues over a two-year period, although clinicians believe that the most
effective treatment begins early in the recovery process.
There are multiple modalities of speech therapy [Albert 98]. The most commonly used techniques
are focused on improving expressive output, such as the stimulation-response method and Melodic
Intonation Therapy (MIT). MIT is a formal, hierarchically structured treatment program based on
the assumption that the stress, intonation, and melodic patterns of language output are controlled pri-
marily by the right hemisphere and, thus, remain available to individuals with aphasia caused by
left-hemisphere damage [Albert 94]. Other methods are linguistic-oriented learning approaches, such as
the lexical-semantic therapy or the mapping technique for the treatment of agrammatism. Still, other
techniques, such as Promoting Aphasics’ Communicative Effectiveness (PACE), focus on enhancing
communicative ability, non-verbal as well as verbal, in pragmatically realistic settings [Davis 85]. Several
non-verbal methods for the treatment of severe global aphasics rely on computer-aided therapy:
visual analogue communication, iconic communication, visual action and drawing therapies are all
currently in use [Sarno 81]. An example is Computerized Visual Communication (or C-VIC), designed as
an alternative communication system for patients with severe aphasia and based on the notion that
people with severe aphasia can learn an alternative symbol system and can use this alternative system
to communicate [Weinrich 91].
Furthermore, although there exists such an extended list of treatments, each specifically conceived to
address a different disorder caused by aphasia, one especially important class of treatment is the one
devoted to improving word retrieval, since, as noted above, word-finding difficulty is one of the most
common residual disorders across all aphasia syndromes. Naming problems are typically treated with
semantic exercises like naming objects or naming common actions, in which the patient is commonly
asked to name a subject represented in a picture [Adlam 06].
2.2 Automatic speech recognition

Speech recognition is the translation, performed by a machine, of spoken words into text. It is a difficult
task whose automation involves many areas of computer science, from signal processing to statistical
frameworks and machine learning techniques. In the following, in order to describe the components of
the ASR module that are relevant for the project, a brief introduction to speech recognition topics
is provided.
2.2.1 Brief introduction to automatic speech recognition
Speech recognition systems do not actually perform the recognition or decoding step directly on the
speech signal. Rather, the speech waveform is divided into short frames of samples, which are con-
verted to a meaningful set of features. The duration of the frames is selected so that the speech wave-
form can be regarded as being stationary. In addition to this transformation, some pre-processing tech-
niques are applied to the waveform signal in order to enhance it and better prepare it for speech
recognition.
In the feature extraction step, the sampled speech signal is parametrized. The goal is to extract
a number of parameters (‘features’) from each frame of the signal that capture the relevant speech
information while being robust to acoustic variations and sensitive to linguistic context. In more detail,
features should be robust against noise and against factors that are irrelevant for the recognition
process; at the same time, features must be discriminant, allowing different linguistic units (e.g.,
phones) to be distinguished.
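The framing step described above can be illustrated with a short sketch. The 25 ms window, 10 ms shift and Hamming window below are common textbook choices, not necessarily the exact AUDIMUS front-end settings:

```python
import numpy as np

def frame_signal(signal, sample_rate=8000, frame_ms=25, shift_ms=10):
    """Split a waveform into short, overlapping frames so that each
    frame can be treated as quasi-stationary for feature extraction."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    shift = int(sample_rate * shift_ms / 1000)       # hop size in samples
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    frames = np.stack([signal[i * shift : i * shift + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)  # taper frame edges before analysis

frames = frame_signal(np.random.randn(8000))  # one second of audio at 8 kHz
print(frames.shape)  # (98, 200): 98 overlapping frames of 200 samples each
```

Each row of the resulting matrix is then converted into a feature vector (e.g. PLP or RASTA coefficients) by the front-end.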
Then, the next stage in the recognition process is to map the speech vectors obtained in the previous
step to the desired underlying sequence of acoustic classes modelling concrete symbols (phonemes,
letters, words...). Acoustic modelling is arguably the central part of any speech recognition system; it
plays a critical role in improving ASR performance. The practical challenge is how to build accurate
acoustic models that can truly reflect the spoken language to be recognized. Typically, sub-word models
like phonemes, diphones or triphones are used as the acoustic modelling unit more often than word
models. A widespread and successful statistical parametric approach to speech recognition is the
Hidden Markov Model (HMM) paradigm [Rabiner 89, Rabiner 93] that supports both acoustic
and temporal modelling. HMMs model the sequence of feature vectors as a piecewise stationary pro-
cess. An utterance X = x1, . . . , xn, . . . , xN is modelled as a succession of discrete stationary states
Q = q1, . . . , qk, . . . , qK , K < N , with instantaneous transitions between these states. An HMM is typi-
cally defined as a stochastic finite state automaton, usually with a left-to-right topology. It is called a
“hidden” Markov model because the underlying stochastic process (the sequence of states) is not directly
observable, but it still affects the observed sequence of acoustic features. Alternatively, Artificial Neural
Networks (ANNs) have been proposed as an efficient approach to acoustic modelling [Tebelskis 95].
Although ANNs have been used for difficult pattern recognition problems for the past thirty years,
more recently many researchers have shown that these networks can be used to estimate probabilities
that are useful for speech recognition. Multilayer Perceptrons (MLPs) are the most common ANNs used for
speech recognition. Typically, MLPs have a layered feedforward architecture with an input layer, zero
or more hidden layers, and one output layer. ANN-HMM hybrid systems have been the focus of research
aiming to combine the strengths of the two approaches [Morgan 95]. Systems based on this connectionist
approach have performed very well on Large Vocabulary Continuous Speech Recognition (LVCSR) tasks.
Knowledge of the rules of a language, the way in which words are connected together into phrases,
is expressed by the language model. It is an important building block in the recognition process, as it is
used to guide the search for an interpretation of the acoustic input. There are two types of models
that describe a language: grammar-based and statistical language models. When the range of
sentences to be recognized is very small, it can be captured by a deterministic grammar that describes
the set of allowed phrases. In large vocabulary applications, on the other hand, it is too difficult to write a
grammar with sufficient coverage of the language, therefore a stochastic grammar, typically an n-gram
model is often used. An n-gram grammar is a representation of an (n−1)-th order Markov language model
in which the probability of occurrence of a symbol is conditioned upon the prior occurrence of n−1 other
symbols. When sub-word models are used, the word model is then obtained by concatenating the sub-
word models according to the pronunciation transcription of the words in a dictionary or lexical model. Its
purpose is to map the orthography of the words in the search vocabulary to the units that model the
actual acoustic realization of the vocabulary entries. Lexicon generation may rely on manually built
dictionaries or on automatic grapheme-to-phoneme modules, which may follow rule-based, data-driven
or hybrid approaches.
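A minimal maximum-likelihood bigram model makes the n-gram idea concrete (toy corpus, no smoothing, purely illustrative):

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate bigram probabilities P(w_n | w_{n-1}) by maximum
    likelihood from a toy corpus; no smoothing is applied."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(words[:-1])                  # history counts
        bigrams.update(zip(words[:-1], words[1:]))   # word-pair counts
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev]

p = train_bigram(["the cat sat", "the dog sat", "the cat ran"])
print(p("the", "cat"))  # 2/3: "cat" follows "the" in two of three sentences
```

Real LVCSR language models add smoothing (e.g. back-off) so that unseen word pairs do not receive zero probability.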
The last step in the recognition process is the decoding phase, whose objective is to find a sequence
of words whose corresponding acoustic and language models best match the input signal. Therefore,
such a decoding process with trained acoustic and language models is often referred to as a search pro-
cess. Its complexity varies according to the recognition strategy and to the size of the vocabulary. In
Isolated Word Recognition (IWR), word boundaries are known and the word with the highest forward
probability is chosen as the recognized word, so the search problem becomes a simple pattern recognition
problem. Search in Continuous Speech Recognition (CSR), on the other hand, is more complicated, since
the search algorithm has to consider the possibility of each word starting at any arbitrary time frame. For
small vocabulary tasks, it is possible to expand the whole search network defined by the language and
lexical restrictions and to directly apply conventional time-synchronous Viterbi search. In LVCSR
systems, however, different strategies must be adopted. These range from graph compaction techniques
and on-the-fly expansion of the search space [Ortmanns 00] to heuristic methods.
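As an illustration of time-synchronous Viterbi search, the toy sketch below decodes a two-state left-to-right model in the log domain. It is a didactic example, far simpler than a WFST-based LVCSR decoder:

```python
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Find the most likely state sequence for a T x K matrix of
    per-frame log emission scores (time-synchronous search)."""
    T, K = log_emit.shape
    delta = log_init + log_emit[0]           # best score ending in each state
    back = np.zeros((T, K), dtype=int)       # best predecessor per state/frame
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # K x K predecessor scores
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]             # trace back the best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

NEG = -1e9  # stand-in for log(0)
# Two-state left-to-right model: frames 0-1 favour state 0, frames 2-3 state 1.
log_emit = np.array([[0., -5.], [0., -5.], [-5., 0.], [-5., 0.]])
log_trans = np.array([[np.log(.5), np.log(.5)], [NEG, 0.]])
print(viterbi(log_emit, log_trans, np.array([0., NEG])))  # [0, 0, 1, 1]
```

The same recursion underlies LVCSR decoding, where the state graph is the composed lexical and language-model network rather than a hand-built toy.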
2.2.2 AUDIMUS speech recognizer
AUDIMUS is the ASR system developed by the Spoken Language Processing Lab of INESC-ID (L2F)
group and integrated into the VITHEA system. It is the result of several years of research efforts
dedicated to the development of ASR systems. AUDIMUS is a hybrid recognizer that follows the
above-mentioned connectionist approach [Morgan 95]. It combines the temporal modelling capacity of HMMs
with the pattern discriminative classification of MLP. A Markov process is used to model the basic
temporal nature of the speech signal, while an ANN is used to estimate posterior phone probabilities
given the acoustic data at each frame. As shown in Fig. 2.1, the baseline system combines three MLP
outputs trained with different feature sets: Perceptual Linear Predictive (PLP, 13 static + first deriva-
tive) [Hermansky 90], log-RelAtive SpecTrAl (RASTA, 13 static + first derivative) [Hermansky 92], and
Modulation SpectroGram (MSG, 28 static) [Kingsbury 98]. This merged approach has proved to be
more efficient and robust than using any of the feature sets individually [Meinedo 00]. This is explained
by the integration of the complementary advantages of the three feature sets: PLP incorporates
attributes of the psychological processes of human hearing into the analysis, making speech perception
more human-like [Jamaati 08]; RASTA compensates for linear channel distortions; and MSG provides
improved stability in the presence of acoustic interferences such as high levels of background noise and
reverberation [Koller 10]. The AUDIMUS decoder
is based on a Weighted Finite State Transducer (WFST) approach to large vocabulary speech recogni-
tion [Mohri 02, Caseiro 06]. AUDIMUS integrates a rule-based grapheme-to-phone conversion module
based on WFSTs for European Portuguese [Caseiro 02]. The acoustic model integrated in VITHEA was
trained with 57 hours of downsampled Broadcast News data and 58 hours of mixed fixed-telephone and
mobile-telephone data in European Portuguese [Abad 08].
Figure 2.1: Block diagram of AUDIMUS speech recognition system.
2.2.3 Automatic word verification
The task that evaluates the utterances spoken by the patients, similarly to the role of the therapist
in a rehabilitation session, is referred to as word verification. This task consists of deciding whether
a claimed word W is uttered in a given speech segment S or not. In the simplest case, a true/false
answer is provided, but a verification score might also be generated. It should be noted that
the task has been called word verification, although it actually refers to term verification, since a keyword
may in fact consist of more than one word (e.g. rocking chair).
2.2.3.0.1 Word verification based on keyword spotting Several approaches based on speech
recognition technology exist to tackle the word verification problem. Given that the word W is known, forced
alignment with an ASR system could be one of the most straightforward possibilities. However, speech
from aphasic patients contains a considerable amount of hesitations, doubts, repetitions, descriptions
and other speech disturbing factors that are known to degrade ASR performance, and consequently, this
will further affect the alignment process. These issues led us to consider the forced alignment approach
unsuitable for the word verification task. Alternatively, keyword spotting methods can better deal with
unexpected speech effects. The objective of keyword spotting is to detect a certain set of words of interest
in the continuous audio stream. In fact, word verification can be considered a particular case of keyword
spotting (with a single search term) and similar approaches can be used.
Keyword spotting approaches can be broadly classified into two categories [Szoke 05]: based on
LVCSR or based on acoustic matching of speech with keyword models in contrast to a background
model. Methods based on LVCSR search for the target keywords in the recognition results, usually
in lattices, confusion networks or n-best hypothesis results since they allow improved performances
compared to searching in the 1-best raw output result. The training process of an LVCSR system
requires large amounts of audio and text data, which may be a limitation in some cases. Additionally,
LVCSR systems make use of fixed large vocabularies (>100K words), but when a specific keyword is
not included in the dictionary, it is never detected. Acoustic approaches are very closely related to IWR.
They basically extend the IWR framework by incorporating an alternative competing model to the list of
keywords generally known as background, garbage or filler speech model. A robust background speech
model must be able to provide low recognition likelihoods for the keywords and high likelihoods for
out-of-vocabulary words, in order to minimize false alarms and false rejections when CSR is performed. Like
in the IWR framework, keyword models can be word-based or phonetic-based (or sub-phonetic). The
latter allows simple modification of the target keywords since they are described by their sequence of
phonetic units.
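The competition between keyword and background models can be reduced to a likelihood-ratio decision. The rule below is a generic sketch of this idea, not the exact scoring used in AUDIMUS:

```python
def keyword_detected(kw_loglik, bg_loglik, threshold=0.0):
    """Detect the keyword when its model beats the background/filler
    model by more than a tunable threshold (log-likelihood ratio test).
    Raising the threshold trades false alarms for false rejections."""
    return (kw_loglik - bg_loglik) > threshold

print(keyword_detected(-120.0, -131.5))  # True: the keyword model wins
print(keyword_detected(-140.0, -131.5))  # False: the background model wins
```

Tuning the threshold sets the operating point between missing valid answers and accepting wrong ones.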
In order to choose the best approach for this task, preliminary experiments were conducted on a
telephone speech corpus considering both LVCSR and acoustic matching approaches [Abad 13]. According
to the results obtained, acoustic-based approaches were considered more adequate for the type of
problem addressed in the on-line therapy system.
2.2.3.0.2 Keyword spotting with AUDIMUS To implement the technique described in the previous
section and successfully integrate it into the VITHEA system, the baseline ASR system was modified
to incorporate a competing background speech model that is estimated without the need for acoustic
model re-training.
While keyword models are described by their sequence of phonetic units provided by an automatic
grapheme-to-phoneme module, the problem of background speech modelling must be specifically ad-
dressed. The most common approach consists of building a new phoneme classification network that in
addition to the conventional phoneme set, also models the posterior probability of a background speech
unit representing “general speech”. This is usually done by using all the training speech as positive
examples for background modelling and requires re-training the acoustic networks. Alternatively, the
posterior probability of the background unit can be estimated based on the posterior probabilities of
the other phones [Pinto 07]. The second approach has been followed, estimating the likelihoods of a
background speech unit as the mean of the top-6 most likely outputs of the phonetic network at each
time frame. In this way, there is no need for acoustic network re-training. The minimum duration for the
background speech word is fixed to 250 msec.
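The estimation of the background unit from the phonetic network outputs can be sketched as follows. The top-6 averaging follows the description above; the posterior values themselves are invented for illustration:

```python
import numpy as np

def background_posterior(phone_posteriors, top_n=6):
    """Score a 'general speech' background unit as the mean of the
    top-N phone posteriors at each frame, so that no acoustic network
    re-training is required (top_n=6 as described in the text)."""
    top = np.sort(phone_posteriors, axis=1)[:, -top_n:]  # frames x top_n
    return top.mean(axis=1)                              # one score per frame

# Toy posteriors: 2 frames over a 10-phone output layer (made-up values).
post = np.array([[.30, .20, .10, .10, .10, .05, .05, .04, .03, .03],
                 [.50, .30, .05, .05, .03, .02, .02, .01, .01, .01]])
print(background_posterior(post))  # mean of the 6 largest values per frame
```

Frames dominated by a few confident phones yield a high background score, which lets the filler model absorb out-of-vocabulary speech.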
2.3 On-line tools for voice disorders

To the best of our knowledge, there are only a few therapeutic tools that support automatic evaluation through
speech recognition. Two of the most outstanding are PEAKS (Program for Evaluation and Analysis of all
Kinds of Speech disorders) and VITHEA (Virtual Therapist for Aphasia Treatment). PEAKS [Maier 09]
Kinds of Speech disorders) and VITHEA (Virtual Therapist for Aphasia Treatment). PEAKS [Maier 09]
is an on-line recording and analysis environment for the automatic or manual evaluation of voice and
speech disorders. Once connected to the system, a patient may perform a standardized test which
is then analysed by automatic speech recognition and prosodic analysis. The result is presented to
the user, and can be compared to previous recordings of the same patient or to recordings from other
patients.
VITHEA [Abad 13] is an on-line platform designed to act as a “virtual therapist” for the treatment of
Portuguese-speaking aphasic patients. The system allows word naming exercises, wherein the patient
is asked to recall the content presented in a photo or picture. By means of automatic speech
recognition, the system processes what is said by the patient and decides whether it is correct or wrong.
The program provides feedback both as a written solution and as a spoken message produced by an
animated agent using text-to-speech synthesis.

Figure 2.2: Comprehensive overview of the VITHEA system.
The VITHEA system, the target of this work, will be described in depth in the following sections.
2.3.1 VITHEA: An on-line system for virtual treatment of aphasia
The on-line system described in [Pompili 11] is the first prototype for aphasia treatment resulting from the
collaboration of the Spoken Language Processing Lab of INESC-ID (L2F) and the Language Research
Laboratory of the Lisbon Faculty of Medicine (LEL), which has been developed in the context of the
activities of the Portuguese national project VITHEA1. It consists of a web-based platform that permits
speech-language therapists to easily create therapy exercises that can be later accessed by aphasia
patients using a web-browser. During the training sessions, the role of the therapist is taken by a “virtual
therapist” that presents the exercises and that is able to validate the patients’ answers. The overall flow
of the system can be described as follows: when a therapy session starts, the virtual therapist shows to
the patient, one at a time, a series of visual or auditory stimuli. The patient is then required to respond
verbally to these stimuli by naming the contents of the object or action that is represented. The utterance
produced is recorded, encoded and sent via network to the server side. Here, a web application server
receives the audio file and processes it with an ASR module, which generates a textual representation.
This result is then compared with a set of predetermined textual answers (for the given question) in order
to verify the correctness of the patient’s input. Finally, feedback is sent back to the patient. Figure 2.2
shows a comprehensive view of this process. In practice, the platform is intended not only to serve as an
alternative, but most importantly, as a complement to conventional speech-language therapy sessions,
permitting intensive and inexpensive therapy for patients, besides providing therapists with a tool to
assess and track the evolution of their patients.
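The server-side verification step, comparing the ASR output with the set of predetermined answers, can be pictured with a simplified sketch. The accent- and case-insensitive substring matching below is an illustrative assumption, not VITHEA's actual matching policy:

```python
import unicodedata

def _norm(s):
    """Lowercase and strip diacritics so that 'Gato' matches 'gato'."""
    decomposed = unicodedata.normalize("NFD", s.lower().strip())
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def validate_answer(recognized, accepted_answers):
    """Accept the utterance if any accepted term (canonical answer,
    synonym or diminutive) occurs in the recognized text."""
    hyp = _norm(recognized)
    return any(_norm(a) in hyp for a in accepted_answers)

print(validate_answer("é um gato", ["gato", "gatinho"]))  # True
print(validate_answer("não sei", ["gato", "gatinho"]))    # False
```

Matching against a list of accepted variants rather than a single canonical word mirrors the extended word list maintained by the clinicians for each stimulus.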
The various approaches for aphasia rehabilitation introduced in Section 2.1.2 serve different purposes.
Most of them are focused on restoring language abilities; others are intended to compensate for
language problems and to teach other methods of communicating. The approach followed by the VITHEA
system falls in the first category, aiming at restoring linguistic processing by means of linguistic
exercises. In particular, the focus of the system is on the recovery of the word naming ability of
Portuguese-speaking aphasic patients.

1http://www.vithea.org
2.3.1.1 The patient and the clinician applications
The system comprises two specific modules, dedicated respectively to the patients for carrying out the
therapy sessions and to the clinicians for the administration of the functionalities related to them. The
two modules adhere to different requirements that have been defined for the particular class of user for
which they have been developed. Nonetheless, they share the set of training exercises, which are built by
the clinicians and performed by the patients.
2.3.1.1.1 Patient application module The patient module is meant to be used by aphasic individuals
to perform the therapeutic exercises. Figure 2.3 illustrates some screen-shots of the patient module.
Exercise protocol Following the common therapeutic approach for treatment of word finding difficul-
ties, a training exercise is composed of several semantic stimuli items. Stimuli may be of several
different types (text, audio, image and video) and they are classified according to themes, in order
to immerse the individual in a pragmatic, familiar environment. Like in ordinary speech-language
therapy sessions, once the patient is logged into the system, the virtual therapist guides him/her
in carrying out the training sessions, providing a list of possible exercises to be performed. When
the patient chooses to start a training exercise, the system presents the target stimuli one at a time in
random order, and he/she is asked to respond to each stimulus verbally. After the evaluation of the
patient’s answer by the system, the patient can listen again to his/her previous answer, record an
utterance in case of invalid answer or skip to the next exercise.
Exercise interface The exercise interface has been designed to cope with the functionalities needed
for automatic word recalling therapy exercises, which include, among others, the integration of
an animated virtual character (the virtual therapist), Text-To-Speech (TTS) synthesized voice, im-
age and video displaying, speech recording and play-back functionalities, automatic word naming
recognition and exercise validation and feed-back prompting, besides conventional exercise navi-
gation options. Additionally, the exercise interface has also been designed to maximize simplicity
and accessibility. First, because most of the users for whom this application is intended suffered a
CVA and they may also have some sort of physical disability. Second, because aphasia is a pre-
dominant disorder among elderly people, who are more prone to suffer from visual impairments.
Thus, the graphic elements were carefully chosen, using big icons throughout the interface.
2.3.1.1.2 Virtual character animation and speech synthesis The virtual therapist’s representation
to the user is achieved through a three-dimensional (3D) game environment with speech synthesis capa-
bilities. Within the context of the VITHEA application, the game environment is essentially dedicated to
graphical computations, which are performed locally on the user’s computer. Speech synthesis genera-
tion occurs on a remote server, thus ensuring adequate hardware performance. The game environment is
Figure 2.3: Screen-shots of the VITHEA patient application.
based on the Unity2 game engine; it contains a low-poly 3D model of a cartoon character with visemes
and facial emotions, which receives and forwards text (dynamically generated according to the sys-
tem’s flow) to the TTS server. Upon server reply, the character’s lips are synchronized with synthesized
speech.
2.3.1.1.3 Speech synthesis DIXI [Paulo 08] is the TTS engine developed by the Spoken Language
Processing Lab of INESC-ID (L2F) group and integrated into the game environment. It has been con-
figured for unit selection synthesis with an open domain cluster voice for European Portuguese. DIXI is
used to gather SAMPA phonemes [Trancoso 03], their timings and raw audio signal information, which
is lossily encoded for usage in the client game. The phoneme timings are essential for the visual output
of the synthesized speech, since the difference between consecutive phoneme timings determines the
amount of time a viseme should be animated.
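The timing arithmetic described above amounts to taking differences of consecutive phoneme start times. A minimal sketch with hypothetical timings:

```python
def viseme_durations(timings):
    """Given phoneme start times plus the utterance end time (seconds),
    each viseme is shown for the gap between consecutive timings.
    Rounding only suppresses floating-point noise in the output."""
    return [round(t2 - t1, 3) for t1, t2 in zip(timings, timings[1:])]

# Hypothetical timings for four phonemes followed by the utterance end.
print(viseme_durations([0.00, 0.08, 0.20, 0.27, 0.40]))
# [0.08, 0.12, 0.07, 0.13]
```

The animation loop holds each viseme on screen for the corresponding duration while the encoded audio plays.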
2.3.1.1.4 Clinician application module The clinician module is specifically designed to allow clini-
cians to manage patient data, to regulate the creation of new stimuli and the alteration of the existing
ones, and to monitor user performance in terms of frequency of access to the system and user progress.
The module is composed of three sub-modules:
User management This module allows the management of a knowledge base of patients that can be
edited by the therapist at any time. Besides basic information related to the user personal profile,
2http://unity3d.com/
the database also stores for each individual his/her type of aphasia, his/her aphasia severity (7-
level subjective scale) and aphasia quotient (AQ) information from the Western Aphasia Battery.
Exercise editor This module allows the clinician to create, update, preview and delete stimuli from an
exercise in an intuitive fashion similar in style to a WYSIWYG editor. In addition to the canonical
valid answer, the system accepts for each stimulus an extended word list comprising the most
frequent synonyms and diminutives.
Since the stimuli are associated with a wide assortment of multimedia files, besides their manage-
ment, the module also provides a rich Web based interface to manage the database of multimedia
resources used within the stimuli. The system is capable of handling a wide range of multimedia
encodings: audio (accepted file types: wav, mp3), video (accepted file types: wmv, avi, mov, mp4,
mpe, mpeg, mpg, swf), and images (accepted file types: jpe, jpeg, jpg, png, gif, bmp, tif, tiff). Given
the diversity of the various file types accepted by the system, a conversion to a unique file type was
needed, in order to show them all with only one external tool. Audio files are therefore converted
to the mp3 file format, while video files are converted to the flv file format. Figures 2.4 and 2.5 illustrate
some screen-shots of the clinician module.
Figure 2.4: Interface for the creation of a new stimulus.
Figure 2.5: Interface for the management of multimedia resources.

Patient tracking This module allows the clinician to monitor statistical information related to user-
system interactions and to access the utterances produced by the patient during the therapeutic
sessions. The statistical information comprises data related to the user’s progress and to the fre-
quency with which users access the system. On the one hand, all the attempts recorded by the
patients are stored in order to allow a re-evaluation by clinicians. This data can be used to identify
possible weaknesses or errors from the recognition engine. On the other hand, monitoring the
usage of the application by the patients will permit the speech-language therapist to assess the
effectiveness of the platform and its impact on the patients’ recovery progress.
2.3.1.2 Platform architecture overview
An ad-hoc multi-tier framework that adheres to the VITHEA requirements has been developed by inte-
grating different heterogeneous technologies. The back-end of the system relies on some of the most
advanced open source frameworks for the development of web applications: Apache Tiles, Apache
Struts 2, Hibernate and Spring. These frameworks follow the best practices and principles of software
engineering, thus guaranteeing the reliability of the system on critical tasks such as database access,
security, session management, etc. The back-end side also integrates the L2F speech recognition system
(AUDIMUS, [Meinedo 03, Meinedo 10]) and TTS synthesizer (DIXI, [Paulo 08]). The ASR component
is the backbone of the system and it is responsible for the validation or rejection of the answers pro-
vided by the user. TTS and facial animation technologies allow the virtual therapist to “speak” the text
associated with a stimulus and supply positive reinforcement to the user. The client side also exploits
Adobe® Flash® technology to support rich multimedia interaction, which includes audio and video stim-
uli reproduction and recording and play-back of patients’ answers. Finally, the system implements a data
architecture that allows handling groups of speech-language therapists and groups of patients. Thus,
a user may belong to a specific group of patients and this group can be assigned to a therapist or to
a group of therapists. Therapists who belong to the same group share the clinical information of the
patients, the set of therapeutic exercises, and also the set of resources used within the various stimuli.
In this way patients with the same type and/or degree of severity of aphasia can be clustered together
and take advantage of exercises and stimuli that are tailored to their specific disorder, thus improving
the benefits resulting from a therapeutic training session.
2.4 State of the art for the new features

This section is devoted to providing the relevant state of the art for each of the new features targeted
by this work.
2.4.1 Content adaptation for mobile devices
To make the VITHEA services also available from mobile devices, new client applications that adhere
to the specific device standards have to be designed and built. This means that two separate software
applications have to be built for Android- and iOS-based devices; this work only addresses
the Android platform. On the other hand, the server-side services already provided by the system
should preserve their original business logic, so that only the exposure of the services is affected. These
constraints point toward a Service-Oriented Architecture (SOA). SOA is a set of principles
and methodologies for designing and developing software in the form of interoperable services. Here,
services are well-defined business functionalities that are built as software components that can be
reused for different purposes. Web services are the typical usage scenario for implementing a SOA
architecture: they allow the functional building blocks to be accessible over standard Internet protocols,
independently of platforms and programming languages. In this scenario, the most widely used
technologies for implementing a SOA architecture rely on the Simple Object Access Protocol (SOAP),
on Remote Procedure Calls (RPC), or on Representational State Transfer (REST) approaches.
SOAP is a message transport protocol for exchanging structured information in the implementation
of web services in computer networks; it has been accepted as the default message protocol in SOA.
SOAP messages are created by wrapping application-specific XML messages within a standard XML-
based envelope structure. The result is an extensible message structure which can be transported over
most underlying network transports, such as SMTP and HTTP.
RPC is an inter-process communication mechanism that allows calling a procedure in another address
space and exchanging data by message passing. Method stubs on the client make the call appear local,
while taking care of marshalling the request and sending it to the server process. The server process
then unmarshals the request and invokes the desired method, before replying to the client through the
reverse procedure.
REST is an architectural style for distributed hypermedia systems. It describes an architecture where
each resource, such as a web service, is identified by a unique Uniform Resource Identifier (URI).
The principle of REST is to use the HTTP protocol as it was designed, accessing and modifying
resources through the standardized HTTP methods GET, POST, PUT, and DELETE.
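The correspondence between resource operations and HTTP methods can be illustrated with a small sketch. The `/stimuli` URI layout below is hypothetical and not part of the VITHEA system; it only shows how CRUD actions on a resource map to the standardized methods.

```java
import java.util.Map;

// Toy illustration of the REST convention: CRUD operations on a
// hypothetical "stimulus" resource expressed as HTTP method + URI.
public class RestRoutes {
    public enum Action { CREATE, READ, UPDATE, DELETE }

    private static final Map<Action, String> METHODS = Map.of(
            Action.CREATE, "POST",
            Action.READ,   "GET",
            Action.UPDATE, "PUT",
            Action.DELETE, "DELETE");

    // CREATE targets the collection URI; the others target one resource.
    public static String request(Action action, int id) {
        String uri = (action == Action.CREATE) ? "/stimuli" : "/stimuli/" + id;
        return METHODS.get(action) + " " + uri;
    }

    public static void main(String[] args) {
        System.out.println(request(Action.READ, 42));   // GET /stimuli/42
        System.out.println(request(Action.CREATE, 0));  // POST /stimuli
    }
}
```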
One of the main criticisms of SOAP relates to the way SOAP messages are wrapped within
an envelope. Because of the verbose XML format, SOAP can be considerably slower than competing
middleware technologies. A disadvantage of RPC is that the set of legal actions available on the
server has to be explicitly defined at build time, since these actions are wrapped by the method stubs
that are consumed by the client. In a REST scenario, on the other hand, the client and the server are
much more loosely coupled: the contract between the two parties is minimal, and in the case of HTTP’s
implementation of REST it corresponds to a single URI that can be accessed through a GET request.
Thus, in a larger context, SOAP is the de facto standard for web service message exchange;
within a mobile context, however, the REST architecture is considered more lightweight
[Richardson 07] than a SOAP-based web service architecture, since it avoids the heavy operations
that the SOAP approach requires in order to maintain a standard format [Knutsen 09].
2.4.2 Hands-free speech
One of the main challenges in the implementation of the hands-free interface is the design of a
robust VAD algorithm. VAD aims at determining the presence or absence of speech. This technique is
useful both for speech coding and for speech recognition, and has thus been the object of many studies
leading to several different approaches. In [Sangwan 02] the authors designed a customized algorithm
for real-time speech transmission based on the energy of the input signal. This work relies on the
estimation of an adaptive threshold representative of the background noise. Two refined strategies are
defined to recover from misclassification errors that may result from the energy detector. The first of
these strategies is based on a feature of the signal, the zero-crossing rate, while the second relies on the
autocorrelation function.
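The general idea of an energy-based detector with an adaptive noise threshold can be sketched as follows. This is a minimal illustration in the spirit of such algorithms, not the detector of [Sangwan 02]: the frame size, smoothing factor and margin are illustrative choices.

```java
// Minimal energy-based VAD sketch: a frame is classified as speech when
// its energy exceeds an adaptively tracked noise-floor estimate; the
// zero-crossing rate is the auxiliary feature mentioned in the text.
public class EnergyVad {
    private double noiseEnergy = -1;            // running background-noise estimate
    private static final double ALPHA = 0.95;   // noise adaptation factor (illustrative)
    private static final double MARGIN = 3.0;   // speech must exceed noise * MARGIN

    static double frameEnergy(short[] frame) {
        double e = 0;
        for (short s : frame) e += (double) s * s;
        return e / frame.length;
    }

    // Zero-crossing rate: fraction of adjacent sample pairs with a sign change.
    static double zeroCrossingRate(short[] frame) {
        int zc = 0;
        for (int i = 1; i < frame.length; i++)
            if ((frame[i - 1] >= 0) != (frame[i] >= 0)) zc++;
        return (double) zc / frame.length;
    }

    public boolean isSpeech(short[] frame) {
        double e = frameEnergy(frame);
        if (noiseEnergy < 0) noiseEnergy = e;   // bootstrap on the first frame
        boolean speech = e > noiseEnergy * MARGIN;
        if (!speech)                            // adapt the floor on non-speech frames only
            noiseEnergy = ALPHA * noiseEnergy + (1 - ALPHA) * e;
        return speech;
    }

    public static void main(String[] args) {
        EnergyVad vad = new EnergyVad();
        short[] silence = new short[160];       // 10 ms frame at 16 kHz, all zeros
        short[] voiced = new short[160];
        for (int i = 0; i < voiced.length; i++) voiced[i] = (short) (i % 2 == 0 ? 8000 : -8000);
        System.out.println(vad.isSpeech(silence));  // false
        System.out.println(vad.isSpeech(voiced));   // true
    }
}
```

A real detector would add hangover smoothing and the refinement strategies described above to recover weak fricatives and trailing speech.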
In [Chuangsuwanich 11] the authors investigate a VAD approach for real-world applications using a
two-stage approach based on two distinguishing features of speech, namely harmonicity and modulation
frequency.
In [Ramirez 04] the authors employ long-term signal processing and maximum spectral component
tracking to improve the VAD algorithm. With the introduction of a noise reduction stage before the
long-term spectral tracking, the authors are able to recover from misclassification errors even in highly
noisy environments. Experimental results appear to confirm the improvement with respect to VAD
methods based on speech/pause discrimination.
2.4.3 Exploiting IR for improved search functionality
The intrinsic ambiguity of natural language is a well-known problem in human understanding that
also affects, with far greater issues, the computational processing of data related to human-computer
interaction. Different issues influence different areas; among these, the possibility of expressing the same
concept using different synonyms has a strong impact on the recall of most information retrieval systems.
The methods for tackling this problem split into two major classes: global and local methods. The first
includes techniques for expanding or reformulating the original query terms, so as to cause the new query
to match other semantically similar terms. These techniques, known as “query expansion”, may be based
on controlled vocabularies, manually or automatically derived thesauri, or on log mining. Local methods,
on the other hand, try to adjust a query relative to the results that initially appear to match it; the
most used techniques in this context are known as “relevance feedback”.
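The global approach can be sketched with a toy example. The hand-built synonym table below is hypothetical; a real system would draw the synonyms from WordNet, the UMLS metathesaurus, or a corpus-derived thesaurus, as discussed next.

```java
import java.util.*;

// Toy sketch of global query expansion: each query term is OR-ed with
// its known synonyms, widening the set of documents the query can match.
public class QueryExpansion {
    private final Map<String, List<String>> thesaurus = new HashMap<>();

    public void addSynonyms(String term, String... synonyms) {
        thesaurus.put(term, Arrays.asList(synonyms));
    }

    public String expand(String query) {
        List<String> parts = new ArrayList<>();
        for (String term : query.toLowerCase().split("\\s+")) {
            List<String> syns = thesaurus.getOrDefault(term, List.of());
            if (syns.isEmpty()) {
                parts.add(term);                      // no synonyms: keep the term as-is
            } else {
                List<String> group = new ArrayList<>();
                group.add(term);
                group.addAll(syns);
                parts.add("(" + String.join(" OR ", group) + ")");
            }
        }
        return String.join(" AND ", parts);
    }

    public static void main(String[] args) {
        QueryExpansion qe = new QueryExpansion();
        qe.addSynonyms("picture", "image", "photo");
        System.out.println(qe.expand("dog picture"));
        // dog AND (picture OR image OR photo)
    }
}
```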
In this work the potential of query expansion will be used to provide an enhanced search experience,
which should allow a better management of the system data. To this purpose, some of the many
approaches reported in the literature to address this task are briefly described, classified on the basis
of the method used. Lexical resources like WordNet or the UMLS metathesaurus are commonly
exploited for query expansion. In [Voorhees 94], lexical-semantic relations are used to improve search
performance in large data collections. In [Aronson 97], the authors explore the MetaMap program for
associating metathesaurus concepts with the original query in order to retrieve MEDLINE citations.
Many other approaches use corpora or lexical resources to automatically develop a thesaurus. However,
most of these methods are used in domain-specific search engines or applications. In [Gong 05], the
authors used WordNet and a TSN (Term Semantic Network) developed using word co-occurrence in a
corpus. Here, the authors used the TSN as a filter and supplement for WordNet. Finally, with the
increase in usage of web search engines, it has become easy to collect and use user query logs. [Cui 02]
developed a system that extracts probabilistic correlations between query terms and document terms
using query logs.
2.4.4 New automatic evocation exercises for therapy treatment
There exist several naming exercises for the recovery of lost communication abilities. Among these we
mention category naming, confrontation naming, automatic closure naming, automatic serial naming,
recognition naming, repetition naming, and responsive naming. Some of them are already provided by
the VITHEA system, namely visual confrontation, automatic closure naming, and responsive naming.
Category naming is a task for assessing the ability to classify semantically related words and concepts
into various word-frequency categories, which are perceptual, conceptual or semantic, and functional
categories. Perceptual categories are defined on the basis of a relevant sensory quality of a stimulus,
such as shape, size or colour. Conceptual or semantic categories are defined on the basis of a generalized
idea of a class of objects. Functional categories are defined on the basis of an action or function
associated with a class of objects [Campbell 05, Murray 01].
Automatic serial naming is a task for assessing the ability to produce rote or overlearned material. A
patient may be asked to do tasks such as counting from 1 to 20, naming the days of the week, writing
out the letters of the alphabet, and/or reciting well-known prayers or nursery rhymes [Campbell 05,
Murray 01].
Recognition naming is a task for assessing the ability to recognize words. It is used when patients
are unable to name an item. The patient may be required to indicate the correct word from verbal or
written choices. For example, for the target stimulus “elephant” the patient has to indicate the correct
word from three verbal or written choices such as “giraffe”, “elephant”, “telephone”.
Repetition naming is a task for assessing the repetition or copying ability of patients who cannot
verbally name or write.
Currently, with the exception of the VITHEA system, there does not appear to exist in the literature
an automatic implementation, through speech recognition, of the above-mentioned exercises.
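The verification step behind such an automatic exercise can be sketched as a comparison between the recognizer's hypothesis and the set of valid answers for a stimulus. This is a hypothetical illustration only: in VITHEA the validation is performed server-side by the AUDIMUS recognizer, and the normalization and matching rules below are assumptions.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of word-naming answer validation: the ASR hypothesis
// is lowercased and accent-stripped (useful for Portuguese), then checked
// against the valid answers; substring matching tolerates carrier phrases.
public class NamingValidator {
    private static String normalize(String s) {
        String n = Normalizer.normalize(s.toLowerCase(Locale.ROOT), Normalizer.Form.NFD);
        return n.replaceAll("\\p{M}", "").trim();   // drop combining accent marks
    }

    public static boolean accept(String asrHypothesis, Set<String> validAnswers) {
        String hyp = normalize(asrHypothesis);
        for (String answer : validAnswers)
            if (hyp.contains(normalize(answer))) return true;
        return false;
    }

    public static void main(String[] args) {
        Set<String> valid = Set.of("elefante");
        System.out.println(accept("e um elefante", valid));  // true: carrier phrase allowed
        System.out.println(accept("telefone", valid));       // false: wrong word
    }
}
```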
2.4.5 Exploiting syllable information in word naming recognition of aphasic
speech
Syllables play an important role in speech recognition; in fact, the pronunciation of a given phoneme
tends to vary depending on its location within a syllable. There is a lot of work in the literature on the
generation of new syllable prototypes derived from several different acoustic-phonetic rules. These are
often exploited to explore complementary acoustic models for speech processing. In [Hunt 04] the authors
showed how a statistical approach to phonetics could complement and improve current speech
recognition by taking syllable boundaries into account. In [Oliveira 05] three methods for dividing
European Portuguese words into syllables are presented. Experimental results have shown a percentage
of correctly recognized syllable boundaries above 99.5%, and comparable word accuracy. Also, in
[Code 94] syllabification is examined with respect to non-lexical English and German aphasic speech
automatisms (recurring utterances).
The client side of the VITHEA platform exploits Adobe® Flash® technology to record patients’ an-
swers. This module, unfortunately, limits the extension of the system to mobile devices, such as tablets
and smart-phones. This is due to a limitation in the API provided by Adobe®. In fact, the Microphone
class, used to acquire speech input, is not supported by the Flash® player running in a mobile browser.
Therefore, an ad-hoc application specifically suited for these devices has been designed and implemented.
Even though this application theoretically clones the implementation logic of the web version, the
underlying technology is different and raised several integration issues due to the heterogeneity
of the standards used. New reusable components have been developed in order to provide the server-side
services in a standardized way, accessible by heterogeneous client devices running either iOS or Android
operating systems.
Although the services have been designed in a service-oriented architecture (SOA) fashion that allows
easy deployment of client modules for different systems, in this work we have restricted ourselves to
the development of a client application running only on Android systems. Android has been chosen as a
case study because it is available as open source software, enabling developers to distribute applications
to any Android device through the Android market.
In this Chapter, Section 3.1 introduces the main standards upon which the mobile version is based,
while the architecture of the final solution is described in Section 3.2. The results of a user experi-
ence evaluation conducted with 16 users are described in Section 3.3, followed by the discussion in
Section 3.4.
In the literature review (Section 2.4.1), the range of technologies available for the implementation of
a SOA has already been discussed. The disadvantages of these standards have been analysed: the
rigidity of RPC and the additional complexity of SOAP make REST the favourite candidate for the
implementation of the new server-side services. In the following, the principles guiding the REST
architectural style are first described, then the data representation format for the exchange of
information between client and server is explained. Finally, the Android platform, used to develop the
client application, is briefly introduced.
3.1.1 Representational State Transfer
A software architecture is an abstraction of the runtime elements of a software system during some
phase of its operation [Fielding 00]. Therefore, an architecture determines how system elements are
identified and allocated, how the elements interact to form a system, the amount and granularity of com-
munication needed for interaction, and the interface protocols used for communication. In this context,
an architectural style is a coordinated set of architectural constraints that restricts the roles and features
of architectural elements, and the allowed relationships among those elements, within any architecture
that conforms to the style.
REST is an architectural style for distributed hypermedia systems. It ignores the details of component
implementation and protocol syntax in order to focus on the roles of the components themselves. REST
was introduced and defined by Roy Fielding, one of the authors of HTTP 1.1, as a hybrid style derived
from several network-based architectural styles, among which the client-server paradigm, combined with
additional constraints that define a uniform connector interface [Fielding 02].
Thus, the software engineering principles guiding REST may be defined in terms of the set of con-
straints defining the style and guiding the interactions among architectural elements.
The first remarkable constraints of the style are those of the client-server architecture and comprise
separation of concerns and stateless communication. By separating the user interface concerns from
the data storage concerns, the portability of the user interface across multiple platforms is improved,
together with scalability by simplifying the server components. A stateless communication requires
that each request from client to server must contain all of the information necessary to understand the
request, and cannot take advantage of any stored context on the server. Session state is therefore kept
entirely on the client. Like most architectural choices, the stateless constraint reflects a design trade-off.
The disadvantage is that it may decrease network performance by increasing the repetitive data (per-
interaction overhead) sent in a series of requests, since that data cannot be left on the server in a shared
context.
However, the central feature that distinguishes the REST architectural style from other network-based
styles is its emphasis on a uniform interface between components. This is achieved through four
additional interface constraints: identification of resources, manipulation of resources through
representations, self-descriptive messages, and hypermedia as the engine of application state. A brief
description of these principles is provided in the following.
A resource can be any information that can be named: a document or image, a temporal service (e.g.,
“today’s weather in Los Angeles”), a collection of other resources, and so on. To identify the particular
resource involved in an interaction between components, REST uses a resource identifier. Thus, every
resource and interconnection of resources is uniquely identified and addressable with a URI.
REST components communicate by transferring a representation of the data in a format matching
one of an evolving set of standard data types, selected dynamically based on the capabilities or desires
of the recipient and the nature of the data. That is, RESTful implementations may support more than one
representation of the same resource at the same URI, and allow clients to indicate which representation
of a resource they wish to receive.
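In HTTP this selection is typically driven by the client's Accept header. The following is a simplified sketch of such server-driven content negotiation; quality factors (";q=") and wildcards are ignored for brevity, and the supported types are illustrative.

```java
import java.util.List;

// Simplified content-negotiation sketch: pick the first media type from
// the client's Accept header that the server supports, else a default.
public class ContentNegotiation {
    private static final List<String> SUPPORTED =
            List.of("application/json", "application/xml");

    public static String select(String acceptHeader) {
        for (String type : acceptHeader.split(",")) {
            String mediaType = type.split(";")[0].trim();  // drop ";q=..." parameters
            if (SUPPORTED.contains(mediaType)) return mediaType;
        }
        return SUPPORTED.get(0);  // fall back to the server's default representation
    }

    public static void main(String[] args) {
        System.out.println(select("text/html,application/xml"));  // application/xml
    }
}
```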
Each client request and server response is a message, and RESTful implementations expect each
message to be self-descriptive; that is, each message contains all the information necessary to complete
the task. RESTful implementations also operate on the notion of a constrained set of message
types that are fully understood by both client and server. These belong to the set of HTTP methods
defined in HTTP 1.1, among them GET, HEAD, OPTIONS, PUT, POST, and DELETE.
Finally, hypermedia as the engine of application state means that changes to the current state of the
application are performed through hypermedia links. That is, clients move from state to state via URIs.
3.1.2 Data representation
As observed in the previous section, one of the principles guiding REST is the multiple representation of
the transferred data. That is, the same information can be accessed through different views, dynamically
selected by the client at runtime. This is achieved through the specification of an HTTP header field,
which defines how client and server should communicate and exchange resources. The list of standard
media types understood by clients and servers adheres to the one defined by the IANA registry1; vendor-
specific media types may also exist. Among the standard media types we cite: text/plain, text/html,
application/xml, and application/json.
{
"employees": [
{
"name": "John Crichton",
"gender": "male"
},
{
"name": "Aeryn Sun",
"gender": "female"
}
]
}
Listing 1: JSON sample code.
Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding
documents in a format that is both human-readable and machine-readable. JavaScript Object Nota-
tion (JSON) is a text-based open standard designed for human-readable data interchange. Both XML
and JSON are designed to form a simple and standardized way of describing different hierarchical data
structures and to facilitate their transportation and consumption. XML, however, as the name states, is a
markup language, thus providing the hierarchical elements being described with the possibility of having
additional attributes. This powerful descriptive ability makes it suitable for the representation of
documents and data structures, but it has the disadvantage of increasing the amount of data being sent.
JSON, on the other hand, has a very concise syntax for defining collections of elements, which means
less data, and which makes it preferable for serializing and transmitting structured data types over a
network connection. Listing 2 and Listing 1 show the description of two employees in XML and JSON,
respectively.
In this work the JSON format has been preferred for data representation.
1http://www.iana.org/assignments/media-types
<employees>
<employee>
<name>John Crichton</name>
<gender>male</gender>
</employee>
<employee>
<name>Aeryn Sun</name>
<gender>female</gender>
</employee>
</employees>
Listing 2: XML sample code.
3.1.3 Android Platform
Android is a software stack for mobile devices that includes an operating system, middleware and key
applications. It relies on a Linux 3.x kernel for core system functionality and runs code written in the Java
programming language on a specially designed virtual machine named Dalvik. The client application has
been built in compliance with the Android platform, upon the Android software development kit (SDK).
The Android SDK consists of several tools to help Android application development, including an
Eclipse IDE plugin, an emulator, debugging tools, a visual layout builder, a log monitor, and more.
The final prototype has been built exploiting the standards described in the previous sections. RESTful
web services have been provided to expose the relevant functionalities to the mobile client, adhering to
the constraints that this architectural style requires.
In particular, the stateless constraint requires that no session information is stored on the server
side, meaning that no client context data is maintained between successive requests. Each request from
any client should contain all of the information necessary to service the request, and session state is
held in the client. This has important consequences for the application logic, among them the need to
change the authentication modality. In fact, the VITHEA system is currently only accessible after the
user has authenticated properly into the application. Afterwards, the system strongly relies on the
concept of session to maintain updated user data. The stateless constraint requires breaking this
mechanism.
3.2.1 REST authentication
There are several solutions for handling authentication in a REST context. Some possible options are
HTTP basic authentication over HTTPS and a dedicated login service.
The first option relies on the standard HTTPS protocol and on the HTTP basic authentication
implementation. Basic authentication is used by most web services since it is the simplest technique for
enforcing access controls to web resources, as it requires neither cookies nor session identifiers. Rather,
basic authentication uses static, standard HTTP headers to send user login data. In its typical usage
scenario, the user is prompted for the credentials just once; the client software then computes the Base64
encoding of the credentials and includes them in each subsequent HTTP request to the server using the
Authorization HTTP header field.
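The construction of that header is simple enough to sketch directly. This is a generic illustration of the HTTP basic authentication scheme, not VITHEA code; the user name and password are placeholders.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch of how a client builds the Authorization header for HTTP basic
// authentication: "user:password" is Base64-encoded (not encrypted!) and
// attached to every request, which is why HTTPS is needed underneath.
public class BasicAuth {
    public static String header(String user, String password) {
        String credentials = user + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        System.out.println(header("anna", "secret"));  // Basic YW5uYTpzZWNyZXQ=
    }
}
```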
This simple technique presents some known drawbacks. First, it does not provide any confidentiality
protection for the transmitted credentials: they are merely encoded with Base64 in transit, but not
encrypted or hashed in any way. For this reason, basic authentication is typically used over HTTPS.
Also, because the basic authentication header has to be sent with each HTTP request, the web browser
needs to cache the credentials for a reasonable period to avoid constantly prompting the user for the
user name and password. This mechanism also provides no way to automatically expire authenticated
credentials after a period of inactivity. Finally, the user name and password are transmitted over HTTPS
to the server, although it would be more secure to let the password stay on the client side during
keyboard entry and be stored as a secure hash on the server.
The second alternative is to use a dedicated login service that accepts user credentials and returns
a token. This token is then included, as a URL argument, in each following request. A well-known
open standard for authorization is OAuth2. OAuth is an open protocol that allows users to give
permission to a third-party application or web site to access restricted resources on another web site
or service. The third-party application receives an access token with which it can make requests to the
protected service. By using this access token strategy, the user’s login credentials are never stored within
an application, and are only required when authenticating to the service. Another important advantage of
this approach is that tokens can be created with an expiration date, which is important for some services.
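The core of the token idea can be sketched in a few lines. Everything below (names, one-minute lifetime, UUID tokens) is illustrative; OAuth specifies a full protocol around this mechanism, including token refresh and third-party delegation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Minimal sketch of token-based authentication: a login service issues an
// opaque token with an expiration instant; later requests are validated
// against it, so credentials never travel after the initial login.
public class TokenService {
    private final Map<String, Long> tokens = new HashMap<>();  // token -> expiry (ms)
    private final long lifetimeMillis;

    public TokenService(long lifetimeMillis) {
        this.lifetimeMillis = lifetimeMillis;
    }

    // The clock is passed explicitly to keep the sketch deterministic.
    public String issue(long nowMillis) {
        String token = UUID.randomUUID().toString();
        tokens.put(token, nowMillis + lifetimeMillis);
        return token;
    }

    public boolean isValid(String token, long nowMillis) {
        Long expiry = tokens.get(token);
        return expiry != null && nowMillis < expiry;
    }

    public static void main(String[] args) {
        TokenService svc = new TokenService(60_000);  // 1-minute tokens
        String t = svc.issue(0);
        System.out.println(svc.isValid(t, 30_000));   // true
        System.out.println(svc.isValid(t, 120_000));  // false: expired
    }
}
```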
However, even though OAuth is an appealing solution, in the current prototype authentication
has been implemented following a hybrid approach close to basic authentication. In fact, after analysing
the benefits and drawbacks of both solutions, the simpler basic authentication over HTTPS was considered
sufficient for the purposes of the system, at least in this first version of the application.
3.2.2 Implemented architecture
To support and adhere to the standards and requirements described in the previous sections, two
major, widely used frameworks, Spring Security and Spring Web MVC, have been used to implement
the server-side services. Spring Security is a non-intrusive framework providing a set of authentication
and access-control services, including HTTP request authorization, HTTP basic authentication, and
HTTP digest authentication. Spring MVC is a framework for building flexible and loosely coupled
web applications and REST services. The Model-View-Controller design pattern, at the core of the
framework, helps separate the business logic, presentation logic and navigation logic.
For the development of the client application, the Spring for Android framework has been used. This
is an extension of the Spring Framework that aims at simplifying the development of native Android
applications. Spring for Android includes a REST client providing higher-level functions corresponding
to the six main HTTP methods, together with several conversion functionalities for the various data
representations supported. The framework also integrates OAuth, which leaves open the option of an
easy future extension. Figure 3.1 illustrates the overall architecture of the system.
2http://oauth.net/
[Figure 3.1: Architecture. On the server side, the REST services (Spring MVC, Spring Security, REST authentication, JSON) sit alongside the standard application context, the database system, and the AUDIMUS speech recognizer; the client side uses the Spring for Android REST client with JSON.]
3.2.2.0.1 Authentication In the current implementation, the user enters their credentials when
accessing the system, and these data are then stored in the client application for the whole execution
time. In each following request, this information is hashed with the MD5 message-digest algorithm,
added to the Authorization header field, and sent to the server together with the request over
HTTPS. At the server side, the received data is verified through the support provided by Spring for basic
authentication. The access restriction for a given resource is specified directly at the configuration file
level. When the request from the client is received, the Authorization header is checked for user creden-
tials. When found, the data present there is compared with the hashed version of the same data that exists
in the server’s persistent storage. If the credentials are correct, the user is granted access to the
requested resource; otherwise access is denied.
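The server-side comparison step can be sketched as follows. This is an illustration of the mechanism just described, not the actual Spring configuration; note that MD5 is a hash rather than encryption, and is no longer recommended for password protection.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the credential check: the digest received in the Authorization
// header is compared with the digest kept in the server's persistent storage.
public class DigestCheck {
    public static String md5Hex(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);  // never on standard JVMs
        }
    }

    public static boolean credentialsMatch(String receivedDigest, String storedDigest) {
        // MessageDigest.isEqual performs a constant-time comparison.
        return MessageDigest.isEqual(receivedDigest.getBytes(StandardCharsets.UTF_8),
                                     storedDigest.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String stored = md5Hex("secret");
        System.out.println(credentialsMatch(md5Hex("secret"), stored));  // true
        System.out.println(credentialsMatch(md5Hex("wrong"), stored));   // false
    }
}
```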
3.2.2.0.2 Data representation In the most typical RESTful scenario, application data is exchanged
through HTTP. When the data belongs to a simple type, such as a string or an integer, this is handled
transparently by the protocol. However, quite often applications need to exchange complex data types
that represent information about the state of the system, e.g., a list of books for an online library. In
those cases, complex data objects, such as the Java object representing a book, need to be serialized
into a textual format, which could exploit JSON or XML representations.
In this work, JSON has been chosen as the data format for the exchange of information between
client and server. The serialization process takes place upon sending and receiving data and exploits
the support of Java Architecture for XML Binding (JAXB) for reading and writing JSON data.
3.2.3 Client application
The current VITHEA system is the result of several years of research and development. During this
period, the platform went through several phases of improvement and consolidation, until reaching the
current stage, in which it integrates various heterogeneous technologies. The business logic of the
application has been tailored to the users’ needs through iterative phases of requirements gathering,
design, and implementation.
The prototype developed in this context, on the other hand, has to be considered a proof of concept
of the feasibility of the mobile version and, thus, integrates only the features that are important for a
correct, complete interaction. In addition to authentication, these include the integration of the
recognition process and of the virtual therapist character, and the application logic that regulates the
exercise flow, including listing and navigation and video and audio reproduction.
The virtual therapist’s representation is achieved through the 3D game engine Unity. This environment
also provides the possibility to directly build plugins for Android. In this way, the native module
of the therapist is exported and then easily integrated into an Android application. Thus, similarly to
the standard version, in the mobile version the virtual character guides and interacts with the user by
providing audio-visual feedback.
The logic flow of the exercises includes the reproduction of video and audio files. The Android
platform imposes some constraints on the supported media formats that currently do not totally adhere
to the specification of the system. In fact, none of the video formats accepted by the Android platform is
produced by the clinician platform, which should thus extend its functionality to include the generation
of a new media format.
The speech recognition process is performed remotely by the in-house speech recognizer AUDIMUS.
The audio is acquired through the Android microphone; for this purpose the Android API provides the
AudioRecord class. This class is delegated the management of the audio resources needed for Java
applications to record audio from the input hardware of the platform. During acquisition, the read data
is stored in an internal buffer of the AudioRecord class. When the recording stops, the audio is sent to
the server side through a RESTful POST request. There, the in-house speech engine processes the file
and the result of the recognition is returned to the user. It is expected that the microphone of tablet
devices could be of poor quality, thus degrading the quality of the recognition. Figure 3.2 shows some
screen-shots of the client application for mobile devices during the execution of a training session.
A user experience evaluation based on collected questionnaires has been conducted with 16 users of
different ages, varying from 23 to 60 years. The users selected for the evaluation have different
backgrounds, ranging from computer science to linguistics and accounting.
Figure 3.2: Screen-shots of the VITHEA mobile patient application.

The evaluation was held at the workplace where the system was developed, under the same conditions
for all the users. The evaluation process was carried out in two phases. First, the subject was introduced
to the conventional online web-browser version of the application, its functionalities were explained, and
the user was asked to directly explore and try it. Then, after getting familiarized with the system, the
user was requested to test the new version of the application for mobile devices.
During the test evaluations, the user was accompanied and observed while executing the test. This
allowed an interactive participation that permitted collecting important feedback, besides the actual
questionnaires, gathered both by observing the user’s behaviour while using the application and from
direct suggestions of the user.
The questionnaire contained 10 questions and was divided into three sections. The first section
contains questions related to the overall satisfaction and usability of the application. The second section
contains questions related to the robustness of the system, while the third contains questions dedicated
to a comparison between the two versions of the client application: the online computer-oriented one
and the mobile one. Responses are given on a numerical Likert scale, where 1 is associated with low
satisfaction/agreement and 5 with maximum satisfaction/agreement for each question or statement.
Figure 3.3 illustrates the questionnaire and the results of the evaluation.
Overall, the evaluation provided good results, achieving an average score of 4.14 on a 5-point Likert scale (1 to 5). The items related to the usability of the system have an average score of 4.3, those related to robustness achieved 4.7, while the items that compare the appreciation of the mobile version with respect to the online browser version achieved 3.7. The lower score obtained in the last group is explained by a closer inspection of the evaluation data. Detailed charts in Figure 3.4 show, for
Figure 3.3: Results of the evaluation.
each of the four questions, the distributions of the user grades.
According to these charts, in the two questions related to the response time of the application and to automatic speech recognition, most of the users did not find any relevant differences between the two systems. This is actually a better than expected result, since we did not take any particular action to adapt the speed and the speech recognition engines for this type of device.
In fact, the average score for each of these questions is, respectively, 3.3 and 2.9. However, a detailed inspection of the automatic speech recognition chart shows that 19% of the evaluated users agreed that the recognition process was worse. Since the evaluation process was assisted, we can confirm that some users experienced greater problems than others. In these cases, two main reasons have been identified as possible sources of the recognition errors. On the one hand, we observed that the loudness of the recorded voice for some users was very low, which was partially due in some cases to inadequate use of the device's microphone, for instance, not facing the microphone directly or blocking it accidentally with a finger. On the other hand, we discovered an unintentional misuse of the recording interface. In fact, the mobile version of the recording interface is also based on a Push-To-Talk (PTT) strategy, but in contrast to the browser version, where users have to push to start recording and push again to stop it, users have to keep a button pressed while recording. It appeared that some users inadvertently tended to release the button while they were still uttering the last syllables of their answer.
The remaining two questions of this last group, that is, whether the mobile version provides a more comfortable user experience and whether the mobile version is preferred over the online browser one, obtained average scores of 4.2 and 4.4, respectively. In particular, 87% of the evaluated users strongly preferred the mobile device version.
Figure 3.4: Distribution of the user grades for the questions of the third group.
The user experience evaluation provided interesting feedback and results. Users are more comfortable with the mobile version and agree that it would be easy to learn how to use this new application. The touch-screen capabilities of these devices actually provide a different perception of the application itself. Thus, the encouraging results obtained here are an incentive to provide a more complete version of the application that may exploit the different input modalities these devices offer, in order to provide a more interactive and complete experience.
With few exceptions, the speech recognition process showed performance comparable to the online system. This is also a motivating achievement; in fact, one of the possible limitations of this version could have been the poor quality of the microphone installed on these devices. A poor microphone may have produced a poor speech signal and, thus, weak recognition performance. Overall, the results did not seem to confirm this expectation; however, direct user experience also highlighted other possible limitations that should be addressed to strengthen the recognition results.
4
An important concern that has been taken into consideration throughout the development of the project, and has guided the design of new interfaces and functionalities, is the usability of the client module of the system. Over the years, particular care has continuously been given to the choice and disposition of the graphical user interface (GUI) elements, in order to achieve an easy-to-use and understandable layout, such that user interaction could be predictable and unmistakable. Driven by the principle of accessibility, the characteristics and the needs of the intended users of the application have been identified. Two major requirements emerged from this analysis. The first is related to the fact that, although aphasia is increasing in the youngest age groups, it is a predominant disorder among elderly people. This age group is prone to suffer from visual impairments; thus, the graphical elements used within the client interface were carefully selected, considering only large icons and intuitive images. The second requirement is related to the most common cause of aphasia. In fact, as mentioned in Section 2.1, a CVA is considered one of the main sources of aphasia, and thus it is expected that these patients may have some form of physical disability, such as reduced arm mobility, and therefore may experience problems using a mouse.
In situations where arm mobility is affected, support for a hands-free interface will possibly improve the overall usability of the user experience. However, the typical extension of such interfaces for human-computer interaction consists of voice commands as an alternative input modality. In the particular case of the VITHEA project, since the user of the system is affected by a language disorder, hands-free computing cannot be interpreted as an alternative way of interaction; instead, it will be selectively applied to automate the process of recording the users' answers, and thus provide additional benefits to people experiencing disabilities.
In fact, the client interface that allows the recording of user answers is currently based on a Push-To-Talk (PTT) strategy, and requires at least two distinct interactions from the patient: the action of starting the recording and the action of stopping it. In this context, the technique that will be exploited relies on voice activity detection (VAD) to automatically determine the end of the speech. There are numerous approaches to address this task, which also vary on the basis of the underlying technology used.
In the following, Section 4.1 starts by describing the VAD task, considering the implications it may raise and their possible solutions from the perspective of the VITHEA system. Then, it details the approach that has been proposed in this work and its resulting architecture. A speech corpus derived from patients' daily recordings has been created to perform an automated evaluation; the corpus and the results of the assessment are described in Section 4.2, while Section 4.3 discusses the main conclusions resulting from this work.
At a very general level, VAD is the binary classification process that tags each speech segment as containing voice or silence. In practice, to successfully achieve the classification task, VAD approaches usually implement sophisticated signal processing techniques to improve the quality and robustness of the algorithm itself. In the literature review, the variety of different solutions that may be used to address this task has already been highlighted. The choice of a solution is typically guided by architectural or implementation constraints, or by the available technology.
As far as the VITHEA system is concerned, as mentioned above, in the current version recording an utterance requires two interactions: starting and stopping the recording process. The moment when the recording process should start could be efficiently detected automatically, by taking as reference the end of the prompt stimulus spoken by the virtual therapist, or the end of the subsequent reproduction of the audio/video file in the case of a multimedia stimulus.
The detection of the end of the speech is a more challenging issue. There are two viable options. The first exploits Adobe® Flash® technology to observe the energy of the speech signal acquired through the microphone and, on this basis, detect the end of the speech. This results in a more affordable challenge from the technological point of view, but, as a counterpart, the precision achieved may not be highly reliable.
Alternatively, the VAD task could be performed by the speech recognition module [Meinedo 03, Meinedo 10]. This would allow a more refined analysis and better performance; however, it would also require sending a continuous stream of data to the server. Thus, the main disadvantage of this approach is that the recorded audio must be transmitted to and processed by the speech recognizer before determining the end of speech, which, in the case of network congestion, may lead to non-deterministic behaviour. Moreover, the need for a continuous stream of audio and real-time processing on the server side represents an increased technological challenge.
After carefully evaluating the benefits and drawbacks of both approaches, we decided to initially experiment with a simple and lightweight approach deployed on the client side, partially motivated by the particular kind of speech input that the algorithm is expected to face. We expect that the recordings subject to VAD analysis will adhere to a well-defined format. In fact, the VITHEA system only supports naming exercises that, in the most general case, admit as a possible answer a single word or a short sentence. For these reasons, a simple approach should theoretically perform adequately. On the other hand, the technological constraints imposed by the server-side solution and the uncertainty about its behaviour in the application scenario discourage its use. In any case, as we will see in Section 4.1.2, the implemented architecture will easily allow a future extension that exploits server-side VAD.
4.1.1 Algorithm
In the most basic VAD approach, the signal is first sliced into contiguous frames, then a real-valued parameter is associated with each frame. If this parameter exceeds a certain threshold, the frame is classified as containing speech; otherwise, it is classified as containing silence. Here we follow the same methodology; however, the algorithm has been adapted in order to take into account specific application logic constraints.
The measure used to establish whether a frame possibly contains speech is the energy of that frame. If the length of a frame is k samples, and x(i) is the i-th sample, then the energy of a given frame j of the input signal is computed as:
E_j = \frac{1}{k} \sum_{i=0}^{k} x^2(i)   (4.1)
The energy of a frame is a valid parameter for VAD algorithms; however, it is not able to distinguish between loud noise and speech.
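As a minimal sketch of Eq. 4.1, the per-frame energy can be computed as follows; the function names and the frame-splitting helper are illustrative, not part of the original implementation:

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame (Eq. 4.1)."""
    k = len(frame)
    return sum(x * x for x in frame) / k

def split_frames(signal, k):
    """Slice a signal into contiguous, non-overlapping frames of k samples."""
    return [signal[i:i + k] for i in range(0, len(signal) - k + 1, k)]

# A constant-amplitude frame has energy equal to the squared amplitude.
print(frame_energy([0.5, 0.5, 0.5, 0.5]))  # 0.25
```

Note that, consistently with the remark above, a frame of loud non-speech noise yields the same kind of energy value as a frame of speech, so energy alone cannot tell the two apart.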
An adaptive approach has been used to estimate the value of the threshold. Initially, the VAD algorithm is trained on a 3-second audio sample that is assumed to contain no speech. In the application scenario, this is achieved by introducing a new module that is in charge of capturing the required amount of sound, computing the threshold, and storing the value in the user profile. The user is required to remain silent for the time needed. The module is activated when there is no previous value for the threshold in the user profile, or on demand, if, for instance, the recording conditions have changed.
The threshold is computed over the whole input recording by taking the mean of the energies of the frames, as in:

E_{th} = \frac{1}{v} \sum_{m=0}^{v} E_m   (4.2)
where E_{th} is the initial threshold estimate and v is the number of frames. However, since background disturbance is non-stationary, an adaptive threshold is more appropriate. Thus, under the assumption that the user will not start speaking exactly at the time the recording starts, the first second of each recording is used to compute another threshold that updates the previous value. Given E_{t1}, the value computed for the first second of recording, the rule to update the threshold value is:

E_s = \alpha E_{t1} + (1 - \alpha) E_{s-1}, \quad s > 0   (4.3)

where E_0 = E_{th}, \alpha is a smoothing factor with 0 < \alpha < 1, and s indicates the number of stimuli the user has performed. In this way, on the one hand, the initial value of the threshold, E_{th}, is constantly updated in case of varying conditions; on the other hand, this update is smoothed to avoid sudden changes due to, for instance, the presence of voice in the first second.
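Equations 4.2 and 4.3 can be sketched as two small functions; the concrete value of the smoothing factor used here is only illustrative, since the thesis does not fix one:

```python
def initial_threshold(frame_energies):
    """Eq. 4.2: mean frame energy over the silent calibration recording."""
    return sum(frame_energies) / len(frame_energies)

def update_threshold(prev_threshold, first_second_energy, alpha=0.3):
    """Eq. 4.3: exponential smoothing of the threshold, with 0 < alpha < 1.
    alpha=0.3 is an assumed, illustrative value."""
    return alpha * first_second_energy + (1 - alpha) * prev_threshold

E_th = initial_threshold([0.1, 0.2, 0.3])  # mean energy, approx. 0.2
E_1 = update_threshold(E_th, 0.4)          # 0.3*0.4 + 0.7*0.2, approx. 0.26
```

A small alpha keeps the threshold stable across stimuli, so a single noisy first second cannot drag the estimate far from the calibrated value.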
The classification rule for a frame, given its energy E_j, is then guided by:

SNS = E_j - E_s(1 + \delta), \quad 0 < \delta < 1   (4.4)
Thus,
IF SNS > 0
Frame is voice
ELSE
Frame is silence
In the application scenario, VAD computation should not start until voice is detected, and should end after 3 seconds of silence. The first constraint is satisfied by verifying that a minimum number of frames have been classified as voice; when this is verified, the status of the VAD algorithm can be defined as active. For the second constraint to be satisfied, the VAD status must be active and, since the last voice frame, a minimum number of frames must have been classified as silence. When this condition is met, the end of the speech is detected.
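The classification rule of Eq. 4.4 and the activation/end-of-speech logic described above can be sketched as a small state machine over a sequence of frame energies. The function name and the concrete frame-count parameters are illustrative assumptions; in the real system they correspond to min sp and max nonsp, whose values depend on the frame rate:

```python
def detect_end_of_speech(energies, threshold, delta=0.5,
                         min_voice_frames=3, max_silence_frames=10):
    """Return the index of the frame at which end-of-speech is declared,
    or None if it never is. A frame is voice when
    E_j - threshold*(1 + delta) > 0 (Eq. 4.4). The VAD becomes 'active'
    only after min_voice_frames voice frames; once active, end of speech
    is declared after max_silence_frames consecutive silence frames.
    Parameter values here are illustrative only."""
    active = False
    voice_count = 0
    silence_count = 0
    for i, e in enumerate(energies):
        is_voice = (e - threshold * (1 + delta)) > 0
        if is_voice:
            voice_count += 1
            silence_count = 0  # reset: silence must be consecutive
            if voice_count >= min_voice_frames:
                active = True
        elif active:
            silence_count += 1
            if silence_count >= max_silence_frames:
                return i
    return None
```

Because activation requires a minimum run of voice frames, isolated noise bursts before the answer do not trigger the end-of-speech countdown.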
4.1.2 Architecture
The proposed solution relies on the client-side computation of the energy of the speech signal acquired through the microphone. To this purpose, the VAD algorithm is implemented by exploiting the Adobe® Macromedia® API. The Microphone class from the Macromedia® reference provides quite refined control over the acquired signal. It allows specifying several interesting properties that may affect the quality of the input audio. Among these, we mention:
• codec: the codec used for compressing audio. Two codecs are available: Nellymoser¹ and Speex²;
• enableVAD: enables built-in voice activity detection, available only for the Speex codec;
• gain: the amount by which the microphone amplifies the signal;
• noiseSuppressionLevel: the maximum attenuation of the noise, expressed in dB, available only for the Speex codec;
• rate: the rate at which the microphone captures sound, expressed in kHz. Acceptable frequencies are: 5,512 Hz, 8,000 Hz, 11,025 Hz, 22,050 Hz, and 44,100 Hz;
• silenceLevel: the amount of sound required to activate the microphone and dispatch the sampling event. When this value is greater than zero, the sampling event will not be dispatched until the amount of sound captured by the microphone exceeds the silence level;
• silenceTimeout: the number of milliseconds between the time the microphone stops detecting sound and the time the sampling event is dispatched. When this value is greater than zero, sound input will continue to be captured for the period specified in silenceTimeout, starting from the time the microphone stops detecting any sound.
¹http://www.nellymoser.com/
²http://www.speex.org/
When capturing speech input, the Microphone class dispatches an event each time new samples are available. This is regulated by the rate parameter of the Microphone class.
Preliminary experiments performed while studying the available API exploited the Speex codec in order to enable the internal VAD algorithm provided by the API. However, the results showed very poor quality of the recorded speech, with frequent truncation of important speech segments; thus, tests in this direction were not continued further.
To enhance security, Adobe® Flash® establishes clear requirements and restrictions in terms of user-initiated actions (UIA). These consist of either keyboard or mouse events. In particular, an important constraint that would prevent the implementation of the hands-free interface is related to the HTTP POST operation. Security restrictions require that performing the equivalent of a file upload to a target server can only succeed as the result of a user-initiated action. This prevents a Flash application running in a browser from silently posting data to the server hosting the application, without the user's explicit agreement to that action. In the current architecture, the recorded speech is stored in memory and sent to the server side with a POST operation when the recording process is stopped by the user. Thus, a different strategy has to be followed.
WebSocket³ is a recently standardized protocol that enables two-way communication between a client and a server over a TCP connection. HTTP is a stateless, request-response protocol adhering to the client-server paradigm. In the typical scenario, a web browser acts as a client, submitting a request to a server, which provides the requested resources, responds to the client, and closes the connection. In this scenario, the server can only respond to client requests. WebSockets, instead, provide a bi-directional, full-duplex, persistent connection from a web browser to a server. WebSockets change the web programming model from user driven to event driven. By allowing a persistent connection to be established between a client and a server, either party can send a message to the other at any given time.
For our purposes, WebSockets have been exploited to overcome the limitations imposed by the Flash security model. When the VAD algorithm determines the end of the speech, a WebSocket connection is opened from the Flash Player and the recorded file is sent to the server through the WebSocket channel. When the file is successfully received, the WebSocket server replies to the Flash Player, which can then send the request for validating the answer to the VITHEA application. This architecture is illustrated in Figure 4.1. This implementation also creates a valid baseline for a future version exploiting the VAD performed by the in-house speech recognizer.
The performance of the VAD algorithm has been assessed through offline tests. To this purpose, a speech corpus has been derived from patients' daily recordings stored in the system. The evaluation process has been performed in the Matlab environment, simulating the same conditions as the Flash Player environment.
³http://tools.ietf.org/html/rfc6455
Figure 4.1: Architectural implementation of the VAD algorithm.
4.2.1 Speech corpus
A development corpus consisting of isolated recordings of naming exercises from real users of VITHEA has been defined to evaluate the VAD algorithm.
First, the recordings stored in the platform were automatically filtered according to two criteria: only those recordings that led to a correct answer and whose size was larger than a given minimum length were selected. This was done to guarantee that the chosen data actually contains an answer and is not just the result of an erroneous or mistaken interaction. Then, we identified potentially representative data belonging to therapy sessions of three different speech therapists with their patients, and of five additional individual patients who performed rehabilitation therapy on their own. The recordings belonging to speech therapists were made in the rehabilitation centres and include data from many different patients. These sessions are accompanied by the therapist, who helps and stimulates the patient; thus, quite often these recordings intermingle clinicians' speech with patients' answers. From these recordings, due to time constraints and the need for manually auditing and annotating the data, we finally selected the recordings from one speech therapist and one independent patient that were representative of the characteristics of typical VITHEA interactions. In particular, these do not contain overlapped speech; thus, when the clinician's speech was present in an audio segment, it was discarded. The final selected set consists of 63 recordings. Data from the speech
Figure 4.2: Process of generation of the speech corpus.
therapist apparently belonged to 5 different patients.
These selected recordings have then been used to build an ad hoc dataset that meets the standard working conditions of the designed algorithm.
First, the selected exercise recordings were cut to exactly match the start and the end of each patient's speech interaction. For recordings containing long pauses, we considered them as a single, continuous speech interaction only if the silence gap was smaller than three seconds. Then, the remaining segments from all the audio recordings that contained speech from neither the therapist nor the patient were clustered into a unique "background" file. Any noises, disfluencies, and hesitations that appeared in the source recordings were also included in this "background" noise file. The final test segments used in the evaluation are artificially synthesised by concatenating a random segment of silence of variable length extracted from the "background" file, followed by the speech segment containing the actual answer, and a final random segment of variable length of background noise. The process of construction of the evaluation set is shown visually in Figure 4.2.
In this way, we implicitly obtain the reference boundaries of the speech/non-speech regions, and we guarantee data similar to the expected working conditions of the online version of the algorithm. Since this is a random selection from non-pure silence, the selected segments may well contain disturbances such as blowing into the microphone or coughing, exactly as in real conditions. By adding a segment of background noise at the end of the answer, we are able to test the algorithm for correct detection of the end of the speech.
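The generation process of Figure 4.2 can be sketched as follows. The durations of the leading (3.5 to 5.5 s) and trailing (5 to 6.5 s) background slices come from the figure; the function name and the sample rate are illustrative assumptions:

```python
import random

def make_test_segment(background, answer, rate=8000):
    """Concatenate a random leading background slice (3.5-5.5 s), the
    patient's answer, and a random trailing background slice (5-6.5 s).
    Returns the synthetic segment together with the reference start and
    end sample of the answer, which serve as ground-truth boundaries.
    The 8 kHz sample rate is an assumed, illustrative value."""
    def rand_slice(lo_s, hi_s):
        n = random.randint(int(lo_s * rate), int(hi_s * rate))
        start = random.randint(0, len(background) - n)
        return background[start:start + n]

    lead = rand_slice(3.5, 5.5)
    trail = rand_slice(5.0, 6.5)
    segment = lead + answer + trail
    ref_start = len(lead)
    ref_end = len(lead) + len(answer)
    return segment, ref_start, ref_end
```

Because the boundary positions are known by construction, no manual annotation of the synthetic segments is needed.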
4.2.2 Results
The VAD algorithm described in the previous Section is aimed at detecting both the start and the end of speech. The former is needed to adhere to the application constraint of waiting for a segment of voice. That is, the system has to allow the patient the time he/she may need for answering before ending the speech recording, and for that purpose the system needs to detect the start of speech first. However, the accurate detection of the two boundaries is not equally important for the purposes of our application. In fact, regarding the start of speech, the exact location of this boundary is not relevant for our application purposes, since the system starts to record automatically. A failure of the algorithm would occur in this regard only if the start of speech is not detected, or if it is detected very prematurely, causing an error in the end of speech detection. In other words, the most important boundary to be detected is the end of speech. However, not all errors in the detection of the end of speech are equally important: a premature detection of this boundary has a more dramatic impact on word naming performance than a delay in its detection. For these reasons, two different metrics have been used to evaluate the VAD algorithm.
For the start of the speech, we consider as correctly detected those results for which the absolute difference between the automatically detected hypothesis and the reference is smaller than a given time threshold. Thus, we define the correct detection rate for the start of the speech (DRsos) as the number of correctly identified results divided by the total number of segments:

DR_{sos} = \frac{1}{N} \sum_{i=1}^{N} H^*[\,|diff(i)| - max\_dist\,]   (4.5)
where N is the total number of test segments, diff(i) = ref(i) − hyp(i) is the real-valued difference between the start of the speech provided by the reference, ref(i), and the one hypothesised by the VAD algorithm, hyp(i), H^*[·] = 1 − H[·], where H[·] is the unit step function, and max_dist is the maximum error distance tolerated for a detection to be considered correct. During the evaluation, this value was set to 0.2 seconds.
A different metric has been adopted for evaluating the detection of the end of the speech. Based on the previous observations, we treat the case of a premature identification of the end of the speech differently from the case of a delayed detection. In fact, the first case means that the recorded file has been truncated and is thus incomplete. A delay in the identification of the final boundary, on the other hand, only impacts network bandwidth and does not affect the recognition process. Thus, the detection rate for the end of the speech (DReos) considers two different values for the maximum error distance allowed in the cases of early and late detection of the end of speech:

DR_{eos} = \frac{1}{N} \sum_{i=1}^{N} H^*[\,|diff(i)| - max\_dist(i)\,]   (4.6)
where max_dist(i) = max_dist_early if diff(i) > 0, and max_dist(i) = max_dist_late otherwise. In
this evaluation we set max_dist_early = 0.05 and max_dist_late = 0.2.
Besides these metrics, the mean errors in the identification of the start and the end of the speech have also been computed (Esos, Eeos). These are defined, over all the elements of the test set, as the mean of the absolute differences between the automatic detection and the reference.
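The four metrics (Eqs. 4.5 and 4.6 plus the two mean errors) can be sketched as one function over lists of reference and hypothesised (start, end) boundaries in seconds; the function and variable names are illustrative:

```python
def step(x):
    """Unit step H[x]: 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0

def detection_rates(refs, hyps, max_dist=0.2,
                    max_dist_early=0.05, max_dist_late=0.2):
    """Compute (DRsos, DReos, Esos, Eeos) for paired lists of
    (start, end) reference and hypothesis boundaries, using the
    tolerances from the thesis evaluation (0.2 s for the start,
    0.05 s / 0.2 s for early / late end detection)."""
    n = len(refs)
    dr_sos = dr_eos = e_sos = e_eos = 0.0
    for (rs, re), (hs, he) in zip(refs, hyps):
        d_start = rs - hs            # diff(i) = ref(i) - hyp(i)
        d_end = re - he
        # Correct start: |diff| within max_dist, since H*[x] = 1 - H[x].
        dr_sos += 1 - step(abs(d_start) - max_dist)
        # Asymmetric tolerance: early end detection (diff > 0) truncates
        # the answer and is penalised with the tighter threshold.
        tol = max_dist_early if d_end > 0 else max_dist_late
        dr_eos += 1 - step(abs(d_end) - tol)
        e_sos += abs(d_start)
        e_eos += abs(d_end)
    return dr_sos / n, dr_eos / n, e_sos / n, e_eos / n
```

The asymmetry encodes the reasoning above: a hypothesis that cuts the answer short counts as an error already at 0.05 s, while a late boundary is tolerated up to 0.2 s.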
Several parameters may influence the performance of the algorithm. In order to determine the best configuration, a non-exhaustive search has been performed over four of the most important variables. These include the threshold δ specified in Eq. 4.4, which determines the classification rule for speech or non-speech; the minimum amount of speech required to start the VAD computation, min_sp; the length of the window used to compute the frame energy, w_len; and the amount of silence required to determine the end of the speech, max_nonsp.
Table 4.1 shows the baseline configuration that resulted from the non-exhaustive search, while Table 4.2 shows the error rates and the detection rates achieved with these parameters.

Parameter   Baseline value
δ           0.5
min_sp      0.8
max_nonsp   3
w_len       12

Table 4.1: Baseline configuration established through the non-exhaustive search.
Metric   Score
Esos     0.28
Eeos     0.60
DRsos    0.87
DReos    0.75

Table 4.2: Results obtained on the development test set with the baseline configuration.
Using this baseline configuration as a starting point, we examined the performance of the algorithm by varying the values of one parameter while keeping the others fixed. In particular, we present results for the δ parameter and for the window length w_len. We note that smaller values of the threshold δ cause a higher error rate, since background noise is detected as speech. Higher values of the threshold, on the other hand, also cause a higher error rate, since low-energy speech segments are missed.
By varying the window size w_len, a different phenomenon arises, which however leads to analogous results. In fact, this parameter defines the number of frames whose average is evaluated in the classification stage; thus, a smaller window is too sensitive to variations in the background noise, which are detected as segments of voice. For the opposite reason, a bigger window causes an earlier detection of the start of speech and, of course, a delay in the detection of the end of the speech.
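The one-parameter-at-a-time search described above can be sketched generically as follows; the scoring function shown in the demo is a toy placeholder, not the thesis's actual evaluation:

```python
def one_at_a_time_search(baseline, grids, score_fn):
    """Vary one parameter at a time over its candidate grid, keeping
    the others at their current best values; keep any configuration
    that improves the score. A non-exhaustive (greedy) search."""
    best_cfg = dict(baseline)
    best_score = score_fn(best_cfg)
    for name, values in grids.items():
        for v in values:
            cfg = dict(best_cfg)
            cfg[name] = v
            s = score_fn(cfg)
            if s > best_score:
                best_score, best_cfg = s, cfg
    return best_cfg, best_score

# Toy score that peaks at delta=0.5 and w_len=12 (illustrative only).
score = lambda c: -abs(c["delta"] - 0.5) - abs(c["w_len"] - 12) / 10
cfg, s = one_at_a_time_search(
    {"delta": 0.3, "w_len": 8},
    {"delta": [0.3, 0.5, 0.7], "w_len": [8, 12, 16]},
    score)
```

Unlike a full grid search, this explores only the sum, not the product, of the grid sizes, which is why it cannot guarantee a global optimum.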
In order to finally assess the performance of the proposed method, a new evaluation corpus has been built, following the same generation process used to build the development corpus. To this end, three random simulations were created with the same data used to build the development set, resulting in a total of 213 new segments. For each of them, random segments of silence of variable length were added at the beginning and at the end of the speech segment containing the correct answer, thus producing segments different from those used for development. Table 4.3 provides the results
for this new simulated evaluation set in terms of error and correct detection rates using the previously
found optimal parameters.
Metric   Achieved value
Esos     0.27
Eeos     0.56
DRsos    0.85
DReos    0.69

Table 4.3: Results obtained on the evaluation set with the baseline configuration.
As the data in Table 4.3 shows, the developed algorithm obtained reasonably good results. In particular, the detection rate for the start of the speech is around 0.85, which is quite a promising achievement. On the other hand, the results obtained for the detection of the end of the speech worsen slightly. An analysis of the detection errors in this case shows that they are, most of the time, due to a delay in the detection of the end of the speech, which is in fact good news, since this type of error is not expected to affect speech recognition results.
In this work, a custom hands-free interface has been specifically designed to provide a more comfortable experience to some aphasia patients. As already mentioned, it is not uncommon that patients suffering from a language disorder also suffer from a temporary physical disability which reduces their mobility. In all these cases, a hands-free interface may facilitate the performance of therapy sessions. The interface has been designed considering the context of the project for which it will be used. The user requirement specification identified, in fact, that the patient may need some time to reflect before actually answering the presented stimuli. These concerns guided the design and implementation of the algorithm for the voice activity detection task. The automated evaluation has confirmed the feasibility of the proposed solution.
5
The clinician module of the VITHEA system is an important component of the whole platform. As explained in Section 2.3.1, it is aimed at managing user profile data, accesses, and permissions of both patients and speech therapists, and at allowing the management of the stimuli and resources that constitute the therapeutic exercises. During the development of the project, the relevance of this module has increased at the same rate as the spread of the platform itself, giving rise to new requirements encompassing privacy and access policies, and improved management of its contents.
Within the business logic of the system, exercises are classified into three different categories, audio, visual, and text, according to the content of the stimuli they contain. Currently, these categories contain, respectively, 308, 742, and 352 stimuli, constituting a total of 1402 different stimuli. The video and audio categories provide a multi-modal presentation of the stimuli through the association of multimedia resources which easily illustrate their content. Thus, besides stimuli data, the system currently handles 885 multimedia files, shared among the exercises, including images, video, and audio.
In the early stages of development of the VITHEA project, this amount of data was not even imagined, and hence the system lacked an appropriate search functionality, resulting in a highly inefficient data management task. The logical data structure behind the concepts of stimuli and multimedia files suggests and encourages the exploitation of information retrieval techniques to deliver an improved search experience. In this context, an extended search relies on the support of additional metadata, extracted ad hoc from existing data, in order to retrieve the most relevant set of information with respect to the search performed.
In the following, Section 5.1 introduces and explains the concepts and techniques used to implement an improved search experience. Then, in Section 5.2, the measures of precision and recall are computed for a well-defined set of test cases. To conclude, a final discussion is reported in Section 5.3.
In information retrieval, full-text search refers to techniques for matching query terms throughout the content of each stored document in a collection. Full-text searching is the type performed by most web search engines, but it can also be extremely helpful for internal, single-site searching. It involves the operations of storing and indexing a collection of text documents to optimize speed and performance in finding relevant information. Full-text search is a concept well distinguished from searches based on metadata or on parts of the original texts, such as titles or abstracts, typically stored in a database.
Query expansion is the process of reformulating a query to improve retrieval performance. It involves evaluating the search terms and expanding them with additional, related information in order to match additional results. Typical techniques include finding synonyms of a key term and searching for those synonyms as well, or finding the various morphological forms of a key term by stemming and including them in the search.
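The synonym-based part of this process can be sketched in a few lines. The synonym table below is a hypothetical stand-in for a lexical resource such as MWN.PT; the entries are purely illustrative:

```python
# Minimal sketch of synonym-based query expansion. SYNONYMS stands in
# for a real lexical resource (e.g. MWN.PT); entries are illustrative.
SYNONYMS = {
    "seco": ["árido", "enxuto"],
    "casa": ["lar", "habitação"],
}

def expand_query(terms):
    """Return the original terms plus any known synonyms."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded
```

A term with no entry in the resource is simply passed through unchanged, so the expanded query always subsumes the original one.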
In this work, a hybrid approach was investigated: additional metadata was generated with the support of semantic resources and exploited through query expansion of the search terms. A full-text search engine then indexes and manages this data to improve the matched results.
5.1.1 Methodology
In the remainder of this section, the type of data considered in this study is first introduced; then the methodology used to achieve the extended search is described, including the process of metadata and index generation.
5.1.1.1 Data description
The VITHEA system stores information about users, such as personal, historical, and clinical data, and about exercises and the resources associated with the various stimuli that form an exercise. The latter are the target of the search feature and are further detailed here. An exercise belongs to a category, which reflects the type of the stimuli it contains, and can be in the visual, auditory, or textual domain. An exercise is composed of several questions, each described by a short text, several possible right answers, and a multimedia resource. The system accepts several formats of image, video, and audio files. A physical resource is mapped in the application to the concept of a document, which is composed of a title representative of its content, a link to its location in the file system, and other technical information. In some cases, exercises and images belong to a particular theme (“Animals”, “Food”). Figure 5.1 illustrates the details of the data associated with each of these concepts and the relations existing among them. The relation between documents and themes is not shown, as it is not governed by database constraints.
The short description of a question is quite repetitive and does not contain relevant semantic information (e.g., “Say the opposite of”); therefore, only the information associated with the right answers has been used for improving the query related to an exercise. As for documents, the title and, when applicable, the theme are the most discriminative data, and have therefore been chosen for further analysis. As Figure 5.1 shows, the additional metadata has been stored in the columns synset and part-of of the tables Question and Document.
5.1.1.2 Metadata generation
In order to enhance the retrieval process with techniques such as query expansion and the creation of extended indexes, additional metadata has been generated. For this purpose, two thesaurus-based lexical resources have been used: MWN.PT1 (MultiWordnet of Portuguese) and PAPEL2 (Palavras Associadas Porto Editora – Linguateca). With their support, the relations of synonymy, hypernymy, and part-of have been extracted.

1http://mwnpt.di.fc.ul.pt/index.html

Figure 5.1: Structure of the objects of the VITHEA system that are of interest for the search functionality.
Two different ontologies were necessary in order to compensate for the high rate of terms that were not found. In fact, preliminary tests on a reduced set of data showed poor coverage: MWN.PT alone covered only around 45% of the answers of a stimulus. This gap was partially filled by PAPEL, leading to an overall coverage of almost 76%. Table 5.1 reports the final coverage statistics for the fully loaded database. For each column, the first line specifies the total number of items existing in the database for that field, while the second line reports for how many of those items synonyms, hypernyms, or part-of relations were found.
                      Question answers   Document title   Document theme
N. of items                 1402               885              885
N. of items matched         1114               628              609
Coverage                   79.46%            70.96%           68.81%
Table 5.1: Coverage of the additional metadata generated.
Two different strategies have been followed for questions and documents. For the possible answers of a question, all three relations have been extracted as additional metadata. In fact, the answers may already contain the most common synonyms or their opposites; thus, considering only the synonyms might yield no additional information.

2http://www.linguateca.pt/PAPEL
For the title of the document, only the synonyms have been considered, since in most cases the title is a single (possibly compound) word, and thus synonyms should suffice. Also, unlike the questions, most documents have an associated semantic category. For this field, the hypernym and part-of relations have been extracted in order to extend the domain of the search, so as to consider the superclass of the data under consideration and then descend again through the hierarchy to include all the subclasses. This additional information has been used to build a virtual document, indexed with the domain of the category and composed of the extended hierarchy of the information belonging to that domain. The synonymy relation (and, for the answers of a question, also the hypernymy and part-of relations) was used for performing the query expansion.
5.1.1.3 Indexes generation and management
Apache LuceneTM is a search engine that efficiently supports full-text indexing, ranking, and searching. Its most typical usage is indexing large amounts of textual documents to provide an improved search experience. However, the flexibility of its architecture, based on the idea of a document containing fields of text, allows many different data formats to be indexed as long as textual information can be extracted. Lucene has thus been integrated into the VITHEA system to exploit its full-text search functionalities. Given the peculiarity of the information contained in the system, the indexes have been generated with the support of the additional data extracted in the
previous phase. Three indexes have been created:
• on the answers of the questions exploiting, besides the original answer, its synonyms, hypernym,
and part-of relation;
• on the title of a document, considering the synonym relation;
• on the category of a document, exploiting both the part-of relation to build a semantic hierarchy
and the information on the title to refine the search.
The fields that compose the indexes have been given weights, assigning a higher value to the original data and a smaller value to the generated metadata. Lucene offers several ways of achieving this. At indexing time, it is possible to specify that certain fields are more important than others by assigning a field boost. At search time, it is possible to specify a boost for each query, sub-query, and query term. A third way of affecting the scoring is to change the similarity factors. The first two approaches have been followed.
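The effect of such field weights can be illustrated with a toy scorer. This is not the Lucene API or the actual VITHEA configuration; the field names and boost values below are purely illustrative of the idea that a match on the original answer should outrank a match on generated metadata:

```python
# Toy illustration of field boosting: a document is a dict of fields,
# and a match in the original field counts more than a match in the
# generated metadata fields. Names and weights are illustrative only.
FIELD_BOOSTS = {"answer": 2.0, "synonyms": 1.0, "hypernym": 0.5}

def score(doc, query_term):
    """Sum the boosts of every field in which the query term appears."""
    s = 0.0
    for field, boost in FIELD_BOOSTS.items():
        if query_term in doc.get(field, []):
            s += boost
    return s
```

With this scheme, a document matched only through a generated synonym is still retrieved, but ranks below one whose original answer matches the query directly.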
Each time a new document (or question) is inserted, updated, or deleted, the indexes should be rebuilt so that the system responds coherently to user requests; if the index is not updated, the recently modified or inserted object will never be found. However, this process carries a cost in terms of system responsiveness that could not be afforded during normal activity of the application. This is due not to the index regeneration process, which is actually quite fast, but to the generation of the additional metadata: it has been estimated that searching for a synonym in MWN.PT requires at least two seconds, which is clearly unacceptable. For these reasons, the regeneration of the indexes has been scheduled nightly, when the load on the system is expected to be lower.
The strategy described in the previous section has been implemented in the VITHEA system and is available through a web interface that supports both the search features and the visualization of the results. This interface has been used to query the system and evaluate the returned results in terms of precision and recall. Recall is the number of relevant results returned divided by the total number of results that are relevant for the query; it measures how much of the relevant material a search retrieves. Precision is the number of relevant results returned divided by the total number of results returned; it measures the quality of the results. Free-text searching is likely to retrieve many documents that are not relevant to the intended search question, and query expansion is likely to suffer from the same problem: by expanding a query to search for the synonyms of the user-entered term, more documents are matched, as the alternative words are matched as well, increasing recall. This comes at the expense of a reduced precision of the returned results.
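The two measures as defined above can be computed directly from the returned and relevant result sets; a minimal sketch:

```python
def precision_recall(returned, relevant):
    """Precision and recall of a result set, as defined above:
    precision = |returned ∩ relevant| / |returned|
    recall    = |returned ∩ relevant| / |relevant|"""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```

Query expansion typically grows the `returned` set: recall can only stay equal or increase, while precision drops whenever the extra matches are not relevant.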
However, in this particular context, where the available data is limited to a well-defined domain, we consider the reduced precision a tolerable price, compensated by the value added by the extended retrieval. These measures have been computed for each of the generated indexes, namely on the questions' answers, on the document title, and on the document category. Table 5.2 reports, for each, the average precision and recall computed over a range of ten queries.
              Question answers   Document title   Document theme
Precision          0.90               0.93             0.80
Recall             0.99               0.95             0.65
Table 5.2: Precision and recall for each of the indexes generated.
Results confirmed the expectations: both for the answers of a question and for the title of a document, recall has improved at the expense of precision.
In most of the queries, the system either correctly provided an extended result set with relevant results, or returned the direct match of the query - as a relational database would - when no relevant data were found. Figure 5.2 shows an example where the result set provided for the key term “seco” (dry) on the field answers has been extended with the synonyms of the key term.
Another, more interesting example is represented in Figure 5.3, which shows the results for the key term “alimento” (food). Here the returned item set is not closely related to the intended meaning of the search term, yet the results provided are relevant in this context. In fact, without the extended search capability this query would not have returned any result, since the searched term does not even exist in the database as a possible answer. For some queries the system also provided results that were considered totally out-of-domain, as in the case of the search term “gato” (cat), whose returned items included the term “leve” (light).
Figure 5.2: Results provided for the search query “seco” (dry) on the field answer of a Question.
Figure 5.3: Results provided for the search query “alimento” (food) on the field answer of a Question.
Table 5.3 reports, for each of the queries used to compute precision and recall on the second index, the number of results returned by the system when using the extended search capability and when using a standard search.
Concerning the results achieved for the third index, we note an inverse tendency in the values of precision and recall. These results, however, become clearer after a closer inspection of the generated metadata. Sometimes the subset of items belonging to a given category is rather specific, and many metadata lookups failed. Besides, the extensiveness of the metadata generation strategy for this field, together with the limited coverage of the lexical resources used, has led to similar metadata for the reduced number of fields for which metadata were found. As a result, two typical scenarios occur. In the first case, data considered relevant to the query are not retrieved and only the exact match of the key term is found, thus reducing recall. In the second case, since some data share the same metadata without being related, irrelevant results are introduced. Figure 5.4 illustrates search results for
            Extended search   Standard search
Manjar             4                 0
Bruxa              4                 0
Regar              5                 1
Travessia          1                 0
Carimbo            1                 0
Ligar              5                 1
Abrir              4                 1
Batatas            2                 1
Meloeiro           2                 0
Caiota             1                 0
Pincel             4                 1
Table 5.3: Number of results returned by the system using the extended search feature and using a standard search functionality.
the query term “harpia” (harpy) as title within the “animais” (animals) category. It is worth highlighting that, since there is no document with the title “harpia” in the database, this query would not have returned any result in the standard search modality.
Searches performed with the Lucene search engine have shown interesting results in terms of response time. A slight delay is noticed at the presentation layer, when returning the response containing the search results. This can be explained by the internal logic of the application: once the relevant results are retrieved with the Lucene search engine, they have to be integrated with additional information, extracted on the fly from the database, that is needed by the presentation layer.
Overall, the full-text search functionalities have provided interesting results, allowing matching against the extended data within the chosen semantic category, or simply allowing, through the exploitation of the synonymy relation, the retrieval of information that would not have been found otherwise.

Figure 5.4: Results provided for the search query <“harpia” (harpy), “animais” (animals)> on the fields title and category of a document.
However, the implemented approach also returns a considerable number of false positives. This can be partly explained by the choice of integrating such an extended set of relations, but also by errors introduced by the resources themselves. A more refined generation of these relations would probably lead to fewer false positives.
In the future, it would be worthwhile to explore the integration of a stemming algorithm, so as to also consider alternative word forms of a key term. Currently, searching for the term “peixe” (fish) does not produce any results, which are instead returned when searching for “peixes” (fishes).
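To conflate the two forms in the example above, even a single plural-stripping rule would suffice; a real deployment would of course use a proper Portuguese stemmer (e.g., an RSLP-style stemmer) rather than this naive sketch:

```python
# Naive sketch of plural conflation for the "peixe"/"peixes" example.
# A single rule like this is NOT a real Portuguese stemmer; it only
# illustrates how stemming would let both forms index to one key.
def naive_stem(word):
    """Strip a trailing plural -s from sufficiently long words."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word
```

Indexing `naive_stem(term)` alongside the original term would make a query for either “peixe” or “peixes” hit the same entry.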
6

In the literature review (Section 2.4.4), several naming tasks have been described. Among these, category naming and automatic serial naming were identified as valuable additions that could
improve the user experience. Our interest is particularly focused on semantic category naming, a subclass of category naming. In fact, even though automatic serial naming and semantic category naming
differ in their domain of application and therapeutic scope, from a technological point of view, both tasks
share the same structure. Both are based on an extended list of words as possible right answers. This list will constitute the language model of the application, in contrast with the currently available confrontation naming exercises, which are based on a very reduced set of words: in those exercises, the language model is built in an ad-hoc fashion when the stimulus is created, on the basis of the possible answers provided by the therapist. Moreover, while automatic serial naming addresses a relatively closed domain, semantic category naming encompasses a much broader one, so that the generation of the list of possible valid answers becomes a challenge. In the case of semantic category naming this would imply, for instance, explicitly listing all the known species of animals or all the known professions, which is clearly infeasible.
In this work, although semantic category naming has been implemented and assessed for a single well-defined domain, the methodology introduced here could later be extended to other domains or to the serial naming task. The domain chosen is the animal world, since it is the most common category used for this type of test, which is therefore commonly referred to as the Animal Naming or Animal Fluency task.
The next Section describes some characteristics of the animal naming task that motivated the choice
of methodology, together with the components of the speech recognizer that have been involved. Then,
Section 6.2 introduces the speech corpus collected to perform an automatic evaluation and describes
the iterative process adopted to refine and improve the final results. To conclude, Section 6.3 reports the
results for a manual evaluation of the recognition errors and discusses the main sources of errors.
Animal Naming is a semantic fluency task that consists of naming as many animals as possible within a one-minute interval. The score of the task corresponds to the sum of all admissible words, where names of extinct, imaginary, or magical animals are considered admissible, while inflected forms and repetitions are not.
The automation of this task raises several challenges, namely due to the disfluencies that are present in spontaneous speech, exacerbated here by the mental effort the test requires and by its duration. Hesitations, filled pauses, and repetitions are expected to be common in the recorded speech. For this reason, the same keyword-spotting-based approach adopted in [Abad 12] has been followed, extending it to address the animal naming task.
6.1.1 Keyword spotting
Keyword spotting techniques aim at detecting a certain set of words of interest in a continuous audio stream. Possible approaches have already been described in Section 2.2.3, highlighting the option integrated into the in-house speech recognition engine AUDIMUS. This method, based on the acoustic matching of speech against keyword models in contrast to a background model, proved to be the most appropriate approach for dealing with speech disfluencies [Abad 12], and has thus also been adopted here.
In this extension, however, the list of keywords also plays a fundamental role, containing the names of admissible animals that will be accepted by the speech recognition system. The size of this list may have a significant impact on the outcome of the recognizer: if a keyword is missing from the list, it will never be detected; on the other hand, a longer list is expected to increase the perplexity of the keyword model.
Preliminary experiments using an ad-hoc, reduced subset of the key terms were performed in order to assess the viability of the automatic naming task. Two different speech recognition engines were explored: the Hidden Markov Model Toolkit1 (HTK), freely available after registration, and the in-house ASR engine AUDIMUS [Meinedo 03, Meinedo 10]. These preliminary results revealed a considerable superiority of the AUDIMUS-based system for this particular task; consequently, the experiments with the HTK tools were not continued further.
6.1.2 Keyword model generation
To automatically build an adequate keyword model for the animal naming task, an existing lexical resource has been used as a baseline, consisting of an extensive list of animal names. This resource is part of the project “STRING: An Hybrid Statistical and Rule-Based Natural Language Processing Chain for Portuguese” [Mamede 12]. It contains 6044 animal names, grouped, classified, and labelled with their semantic category. Within the context of the STRING project, this resource is used by a finite-state incremental parser to add semantic information to the output of a part-of-speech tagger. The list therefore aims to be as complete as possible, and its content is wide and detailed. It comprises very specific animal breeds, such as cobra coral sul-americana (South American coral snake). Moreover, the list contains some animal names, such as castanha, beta, corredor (fish and bird names), which in Portuguese also have another, more common meaning. Finally, the list intentionally does not contain any inflected forms (i.e., no feminine or plural forms); only the lemma of a term is considered. The characteristics mentioned above will somehow affect the generated keyword model and need to be specifically assessed. In fact, as we
1http://htk.eng.cam.ac.uk/
will see shortly, the peculiarity of the content of the list has an impact when trying to establish probability values for the key terms. Also, since most words in Portuguese have different masculine and feminine forms, we expect that the lack of this information, as well as the lack of the plural forms, will introduce errors into the recognition results. However, considering the current size of the list, it is not feasible to add this information at this stage, since it would mean, in the best case, doubling the size of the list.
In order to take into account that some names in this extended list are much more likely to be said than others, we tried to compute the likelihood of the different target terms, as is commonly done in n-gram based language modelling. For instance, the ARPA-MIT language model format stores each n-gram definition as a log10 probability value followed by a sequence of n words and a back-off weight. For this purpose, the total number of results provided by a web search engine for a particular query can be useful information, indicative of the term's popularity. However, the homonymy presented by some terms may lead to an incorrect count, related to alternative meanings of the term. Therefore, a more refined retrieval strategy has been implemented, which takes into account the semantic information associated with each key term: the search query is composed of the bigram <animal name> <category>, e.g., beta peixe. In this work the Bing Search API2 has been used to obtain the count data.
However, the distribution of the resulting counts revealed too large a support, with values that decrease drastically within a few keywords. This led to weak term likelihood estimates, confirmed by the poor results achieved in early experiments. Thus, an alternative approach has been adopted to compute the final weights for each term. It consists of building a histogram of words, where the original list is divided into C classes. The probability assigned to each term is the same for all the terms in the same class, while the probability assigned to each class decreases proportionally with the class order: the probability of the first class, the most popular one, is multiplied by C, while that of the last class is multiplied by 1. In this way, the support is regulated by C, which in this work has been set to 10.
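The class-based weighting scheme just described can be sketched as follows. The exact binning used in the thesis is not detailed, so this sketch assumes equal-size classes over the count-ranked list; the count values are illustrative:

```python
# Sketch of the class-based weighting scheme: terms are ranked by their
# web-search counts and split into C equal-size classes; the most
# popular class gets weight C, the least popular gets weight 1.
# The equal-size binning is an assumption, not the thesis's exact rule.
def class_weights(counts, C=10):
    ranked = sorted(counts, key=counts.get, reverse=True)
    class_size = max(1, len(ranked) // C)  # terms per class
    weights = {}
    for rank, term in enumerate(ranked):
        cls = min(rank // class_size, C - 1)  # class index 0..C-1
        weights[term] = C - cls               # class 0 -> C, last -> 1
    return weights
```

Compared with using the raw counts, this flattens the distribution: terms within a class are treated as equally likely, which avoids the drastic decay observed in the raw counts.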
6.1.3 Background penalty for keyword spotting tuning
In order to balance the weight of the background speech competing model against the keyword models in the decoding process, the background scale term (β) has been exploited [Abad 12]. This exponential term in the likelihood domain (multiplicative in the acoustic score/log-likelihood domain) permits adjusting the word naming detection system to penalize or favor the background speech model. In this way, it is possible to make the system more prone towards keyword detections (and possibly false alarms) or towards keyword rejections (and possibly missed detections).
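The role of the scale term can be sketched with a toy decision rule. This is only an illustration of the multiplicative effect in the log-likelihood domain, not the AUDIMUS decoder, and the score values in the test are hypothetical:

```python
# Toy illustration of the background scale term: in the log-likelihood
# domain the background score is multiplied by beta. Since
# log-likelihoods are negative, beta > 1 penalizes the background model
# and makes keyword detections (and false alarms) more likely, while
# beta < 1 favors it. This is a sketch, not the AUDIMUS implementation.
def detect_keyword(ll_keyword, ll_background, beta=1.0):
    """Return True if the keyword model wins against the scaled
    background model for this speech segment."""
    return ll_keyword > beta * ll_background
```

Sweeping β thus trades misses against insertions, which is exactly the tuning knob the text describes.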
Once an initial list fulfilling the desired requirements was obtained, several phases of experimental tests were performed in order to determine the best compromise between the length of the list and its content.
2http://datamarket.azure.com/dataset/bing/search
Before describing these tests, however, we shall characterize the corpus that was specifically collected
for this purpose.
6.2.1 Speech corpus
The corpus includes recordings from 31 healthy adults (16 females and 15 males), all native speakers of Portuguese. The recordings took place in different conditions, with two different head-set microphones. No particular constraint on background noise conditions was imposed. Each session consisted of approximately one minute of recording, in which the speaker was invited to name all the animals he/she was able to remember within the available time. Data originally captured at 16 kHz was down-sampled to 8 kHz to match the sampling frequency of the acoustic models. Orthographic transcriptions were manually produced for each session. The total duration of the corpus is approximately 32 minutes. Each subject produced on average 28 words; however, considering only the valid words, i.e., discounting inflected forms and repetitions, the average decreases to 27. Detailed data are shown in Table 6.1.
User  Gender  Tot. words  Valid words     User  Gender  Tot. words  Valid words
 1      f         23          23           17      f         33          33
 2      m         21          18           18      m         18          18
 3      m         24          24           19      f         35          35
 4      f         33          32           20      f         27          27
 5      m         26          26           21      f         22          22
 6      f         23          22           22      m         35          35
 7      m         21          21           23      f         27          27
 8      m         30          30           24      f         35          35
 9      f         35          26           25      m         30          30
10      m         34          33           26      m         23          22
11      m         35          34           27      f         24          24
12      m         34          34           28      f         33          33
13      f         19          13           29      f         33          33
14      f         20          17           30      f         24          24
15      m         34          34           31      m         36          36
16      m         25          25
avg              27.31       25.75         avg              29.00       28.93

Table 6.1: Speech corpus data, including gender, total number of words and total number of valid words uttered.
6.2.2 Results
The counts established through the Bing Search API reflected the peculiarity of the list well, placing the most exotic names in the lowest positions. Based on this information, it is easy to determine a threshold that filters out the less probable keywords and thus reduces the size of the list. Several conditions were tested; Table 6.2 reports only the most significant ones, obtained using the full list and two different thresholds that resulted in two reduced term lists. Figure 6.1 illustrates the Word Error Rate (WER) results obtained for each of these configurations.
We observed that the configuration with the shortest list caused an increase in the number of misses and substitutions for some users. This was expected, since some of the key terms are now missing from the list. On the other hand, some users whose recognition results had shown a high number of insertions with the original list benefited from the shortening: their numbers of insertions and substitutions decreased. Given these opposing trends, the average automatic WER computed over the whole corpus remains almost stable across the different experiments. The configuration with the middle-size list showed that the impact of missing keywords is not as great as with the shortest list, but the number of insertions increased, as expected.

Figure 6.1: First set of experiments using the keyword model with different values for the threshold.
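The WER figures reported throughout these experiments are the standard Levenshtein-based word error rate, i.e., (substitutions + insertions + deletions) divided by the number of reference words; as a reference, a minimal sketch of its computation:

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein distance over word sequences:
    (substitutions + insertions + deletions) / len(reference)."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[len(r)][len(h)] / len(r)
```

Note that, as discussed later, a single WER number does not separate insertions from misses, which is why the per-error-type analysis below is needed.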
A detailed analysis of the recognition results with the shortest list showed that many of the remaining insertions are due to the presence in this list of a considerable number of animal names (mostly short names such as anu and udu) that are not so common in daily life. Adjusting the background penalty term is not sufficient in this case to absorb the insertions that are generated. Hence, a different methodology is needed to filter the elements of the list, based not solely on frequency counts but also on their content. After evaluating several lexical and semantic resources, we focused on Onto.PT3, an ontology for the Portuguese language [Goncalo Oliveira 12]. Onto.PT is built automatically from other resources and is therefore not totally accurate; however, its reduced coverage may be useful for our purposes. In fact, since our baseline list already provides the semantic category for each of our key terms, and since this resource may introduce some errors, it has been used not to confirm the semantic class of an item, as one would expect, but to verify how widespread a term is. If a keyword was missing from this ontology, it has
3http://ontopt.dei.uc.pt/
                  1st set of experiments
                 Config. 1   Config. 2   Config. 3
Threshold            –          400         800
Number of terms    6044        1447         943
Average WER       21.19       21.47       21.36

Table 6.2: Experiments data and resulting average WER, including list size information.
53
been excluded from the list. The same experiments were performed with the three filtered lists, leading to a reduction of the average error of up to 2.0% relative to the previous experiments. The average WER is shown in Table 6.3, while Figure 6.2 illustrates the results of each user with the various configurations.
                  2nd set of experiments
                 Config. 1   Config. 2   Config. 3
Threshold            –          400         800
Number of terms     960         804         629
Average WER       19.63       19.48       20.02

Table 6.3: Experiments data and resulting average WER, including list size information.
Figure 6.2: Second set of experiments using the keyword model filtered with Onto.PT and different values for the threshold.
A closer analysis of the results revealed other important patterns in the recognition errors. Some keywords, such as periquito (parakeet), mosquito (mosquito), or esquilo (squirrel), were typically poorly identified. The first was used by seven different users, but was correctly recognized only once; the second was used nine times, but correctly recognized only three times; the third was used by three different users, but never recognized correctly. Upon inspection of the rule-based pronunciations used in the lexicon, we noticed an error in the rules. After correcting it, the three words were correctly recognized 100% of the time and the average WER decreased by 1.8% with respect to the second configuration of the previous experiments.
A final observation concerned the insertion errors caused by hesitations and filled pauses. By modifying the generated keyword model to also include the various forms of Portuguese filled pauses, we managed to decrease the number of insertions, reducing the average WER by an additional 3.1%. Table 6.4 summarizes the above configurations, while Figure 6.3 illustrates the results of each user with the different configurations.
Overall, the experiments have led to encouraging results, showing the feasibility of the animal naming task. It should be noted, however, that the WER does not represent a valid estimate of the animal naming score, since it does not allow discerning between the various sources of error.
              3rd set of experiments
            Config. 1           Config. 2
            Phonetic updates    Filled pauses recognition
Avg WER        17.72               14.66

Table 6.4: Experiments data and resulting average WER.
Figure 6.3: Third set of experiments, including phonetic transcription correction and filled pause models.
Twenty-three subjects out of thirty-one used keywords that were missing from the list. Not considering repetitions, as these are not allowed, the total number of missing keywords is thirty. Thirteen of the words used were simply lacking from the list, because of the filtering with Onto.PT or, mostly, because they were missing from the initial baseline list. Four words were either out-of-vocabulary words (espetada - skewered) or made-up words (perdiniz). The remaining thirteen words were missing from the list because they were inflected forms.
For these reasons, a manual evaluation was performed in which the errors due to inflected
forms were discounted. Table 6.5 reports the data of this experiment for every user, highlighting
those who used inflected forms. The average value of this customized WER is 11.64%. The
average WER computed by also discounting the four made-up words is 10.30%. In this work, repetitions have
not been discounted.
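As a rough sketch of how such a discounted score can be computed, both reference and hypothesis can be normalized to lemmas before taking the standard edit-distance WER, so that inflection mismatches stop counting as errors; the lemma mapping below is illustrative:

```python
def wer(ref, hyp):
    """Word error rate via Levenshtein distance: (S + D + I) / |ref| * 100."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return 100.0 * d[-1][-1] / max(len(ref), 1)

def wer_discounting_inflections(ref, hyp, lemma_map):
    """Map inflected forms to their lemma on both sides before scoring,
    so that an inflection mismatch is not counted as an error."""
    norm = lambda seq: [lemma_map.get(w, w) for w in seq]
    return wer(norm(ref), norm(hyp))
```

For example, with a mapping such as {"patos": "pato"}, uttering the plural of an in-list animal name no longer counts as a substitution.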
We also experimented with substituting, in the reference file, the inflected words with their lemma.
Unfortunately, this did not provide a great improvement: the automatic WER decreased by only
0.3%. However, this was to be expected for various reasons. On the one hand, Portuguese feminine forms
may be totally different from their masculine form, as in the cases cao, cadela (dog) or cavalo, egua (horse).
The same applies to the diminutive form, e.g., pintainho, pinto (chick). On the other hand, it happened that
for a key term both the lemma and the inflected form were missing from the list, as in the case of aves,
ave. However, when the lemma was present and very similar to the corresponding inflected form,
the word was correctly recognized. Examples of such cases are patos, pato (duck) and gato, gata
(cat).
Animal Naming Score
User   WER     WER -infl.   WER -made-up
1      4.35    4.35         4.35
2      38.09   27.78        23.53
3      12.50   12.50        12.50
4      18.18   15.63        9.09
5      7.69    3.85         3.85
6      17.39   13.64        13.04
7      19.05   14.29        14.29
8      0.00    0.00         0.00
9      54.29   38.46        22.86
10     17.65   15.15        12.12
11     11.43   8.82         8.82
12     11.76   11.76        11.76
13     36.84   7.69         7.69
14     35.00   23.53        18.75
15     17.65   14.71        14.71
16     0.00    0.00         0.00
17     6.06    6.06         6.06
18     5.56    5.56         5.56
19     5.71    5.71         5.71
20     14.82   14.82        14.82
21     0.00    0.00         0.00
22     11.43   11.43        11.43
23     3.70    3.70         3.70
24     11.43   11.43        11.43
25     16.67   16.67        16.67
26     13.04   9.09         9.09
27     12.50   12.50        12.50
28     15.15   15.15        12.50
29     6.06    6.06         6.06
30     22.20   22.20        22.20
31     8.33    8.33         4.35
avg    14.66   11.64        10.30

Table 6.5: Automatic and manual WER with the configuration 2 of the last set of experiments.
The test represented, for some of the participating subjects, a source of stress. Even with
healthy subjects, the idea of uttering as many animals as possible within one minute creates a state of
anxiety. At the end of the test, some people were frustrated for not having remembered more animals,
whose names came up immediately once the test had ended. This caused the interesting phenomenon
where most of the animal names were quickly uttered in the first seconds of the test; then the subject
started to think of more animal names, typically intermingling speech disfluencies with silence or
keywords. Blowing noises were a cause of insertions. Another common source of error is the concatenation
of words, or of words and filled pauses. While thinking out loud, some subjects introduced syllables
such as eeehm; then, when they suddenly remembered a new word, it was concatenated to the previous
filled pause, as in the case of eeeeepiriquito. Other concatenation errors happen between two
consecutive words, as in the cases rinocerontelefante and mosquitovaca.
7

During a preliminary set of experiments performed within the context of the VITHEA project, an analysis
of the word detection errors showed, for some patients, a remarkable tendency to slow down the rhythm
of a word in correspondence with its syllables. Preliminary studies examined in the literature review (Section
2.4.5) have also shown that taking syllable boundaries into account may actually improve speech
recognition performance. These two reasons have motivated the investigation of an approach that considers
and integrates syllable division in the speech recognition process.
In the following, Section 7.1 introduces the syllabification task and explains how it has been imple-
mented within the in-house speech recognition engine AUDIMUS. Then, the results of an experimental
evaluation are described in Section 7.2, while a final discussion is reported in Section 7.3.
Syllabification is the process of identifying and delineating the syllable boundaries in a word.
Contrary to what might be expected, syllabification is not a simple task and can be
addressed from different perspectives, namely by considering either the orthographic or the phonetic form of
the word. Sometimes syllabification deals with a concept of syllable that corresponds to the written
form, while in other situations it is correlated in some way with audibility. One can observe that, in a
speech recognition context, syllable boundaries based on phonetic parameters would be closer to
actual speech, and indeed these have been the subject of studies aiming at exploring complementary
acoustic models for speech processing. A syllable division that considers phonetic constraints could also
be suitable in the context of a speech recognition process where the lexicon is automatically
generated through a grapheme-to-phone module. However, even when approaching the problem from a
purely phonetic perspective, there is still no consensual solution to the syllabification problem. In fact,
there are several approaches to this task that differ in how sounds are grouped and thus lead to
different syllable splitting rules.
7.1.1 Methodology
The development of a tool for the automatic identification and division of syllable boundaries
is beyond the scope of this work. Thus, research was directed toward freely available open-source
solutions. A software implementation of the syllabification task was kindly provided by the Department
of Electrical and Computer Engineering of the University of Coimbra. The software follows a rule-based
approach based on the Maximal Onset Principle for European Portuguese [Candeias 08, Candeias 09].
The rules were derived from a lexicon of almost 400K words, and syllabification was performed according to the
orthographic form.
To integrate the generated syllables into the version of AUDIMUS customized for the VITHEA
system, it is necessary to alter the lexicon used by the recognizer. In practice, for each keyword entry a
new alternative phonetic transcription is generated, consisting of the original phonetic string
with short pause units inserted at the syllable boundaries. In this way, for each pronunciation
provided by the automatic grapheme-to-phoneme module, an alternative “syllabified” version of the
canonical pronunciation is generated. Unfortunately, the matching of the syllable boundaries produced
for the orthographic transcription with the corresponding phonetic transcription needs to be specifically
addressed.
In fact, depending on the stress and the duration imposed on a given phoneme, there may be different
ways of pronouncing the same word, which lead to different phonetic transcriptions. This is the case for
the Portuguese word pente (comb), whose phonetic transcriptions are p e ∼ t @ and p e ∼ t. In the latter,
the last vowel is not included due to a phenomenon known as vowel reduction, an acoustic
variation of the pronunciation of a vowel that makes it shorter, sometimes almost inaudible. Nevertheless,
the orthographic syllabification provided by the automated software for the same word is pen.te.
Exceptions of this kind have been handled according to the phonetic rule, thus leading to p e ∼ . t
@ and p e ∼ . t. This is in accordance with the results obtained by Candeias [Candeias 11], in a work
focused on exploiting acoustic-phonetic constraints to derive new syllable prototypes. In fact, in canonical
Portuguese grammar, a consonant grapheme cannot constitute a syllable, but, if an acoustic-phonetic
constraint is applied, a syllable composed of a single consonant in word-final position becomes possible.
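The lexicon augmentation described above, including the handling of the reduced-vowel variants just discussed, can be sketched as follows. The short-pause unit name ("sp") and the function interface are illustrative assumptions, not the actual AUDIMUS lexicon format:

```python
def syllabified_pronunciation(phones, syllable_lengths, pause="sp"):
    """Build the alternative 'syllabified' pronunciation by inserting a
    short-pause unit between the syllables of the canonical phone string.

    phones: canonical phone sequence, e.g. ['p', 'e~', 't', '@'] for pente
    syllable_lengths: phones per syllable as derived from the orthographic
    syllabification (pen.te -> [2, 2]); for a reduced-vowel variant such as
    ['p', 'e~', 't'], the final syllable is simply shorter ([2, 1])."""
    out, i = [], 0
    for k, n in enumerate(syllable_lengths):
        out.extend(phones[i:i + n])
        i += n
        # Insert a pause only between syllables, never after the last one.
        if k < len(syllable_lengths) - 1 and i < len(phones):
            out.append(pause)
    return out
```

Applied to pente, this yields the two syllabified variants discussed above: p e ∼ sp t @ for the canonical form and p e ∼ sp t for the vowel-reduced one, matching the single-consonant final syllable allowed under the acoustic-phonetic constraint.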
The performance of the recognition process provided with this alternative syllabified
lexicon has been assessed through automated tests. In particular, in order to measure the achieved
results in terms of overall improvements, the same set of experiments carried out during the VITHEA
project to evaluate the word naming task has been replicated here [Abad 13]. The corpus used for
the evaluation is described in the following section.
7.2.1 Speech corpus
A corpus of 16 patients, native Portuguese speakers with different types of aphasia, has been collected
in two different therapy centres in two different sessions. The first phase was carried out in February and
March of 2011 and includes speech from 8 aphasia patients. The second data collection was carried out
during May and June of 2011 and includes speech from 8 different aphasia patients. Following the original
work, these sets are referred to as APS-I and APS-II, respectively. Recordings were performed during
regular speech-language therapy sessions. Each session consisted of naming exercises with
pictures of objects presented at intervals of at most 15 seconds. The objects and the presentation order
were the same for all patients. The pictures adopted in the naming exercises were selected from a
standardized set of 260 black-and-white line drawings that extends and adapts the corpus of Snodgrass
and Vanderwart [Snodgrass 80].

Figure 7.1: Results for the APS-I corpus comparing the two pronunciation lexicons, the standard and the augmented version provided with syllable boundaries.
7.2.2 Results
In the original experiments with this corpus [Abad 13], the automatic word naming recognition module
was evaluated using two different metrics, the word naming score (WNS) and the word verification
rate (WVR). Both are computed for each speaker: the former corresponds to the number of positive word
detections divided by the total number of exercises, while the latter corresponds to the number of
coincidences between the manual and automatic results divided by the total number of exercises. The WVR
is a measure of the reliability of the automatic recognition and will be used in the following
to compare the results achieved with the alternative pronunciations against the ones mentioned above.
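Under the definitions above, the two per-speaker metrics can be sketched as follows; representing the per-exercise outcomes as booleans is an assumption made for illustration:

```python
def word_naming_score(auto_detections):
    """WNS: positive automatic word detections over the total number
    of exercises for one speaker."""
    return sum(auto_detections) / len(auto_detections)

def word_verification_rate(manual_results, auto_results):
    """WVR: fraction of exercises where the automatic decision coincides
    with the manual judgement, i.e. a per-speaker reliability measure."""
    agreements = sum(m == a for m, a in zip(manual_results, auto_results))
    return agreements / len(manual_results)
```

Note that a speaker can have a low WNS (few words named correctly) and still a high WVR, as long as the recognizer agrees with the human judgement on the failures as well.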
Surprisingly, the results obtained for the APS-I corpus have shown that the usage of the new
augmented pronunciations does not lead to any significant improvement in terms of overall speech
recognition performance. Instead, we note that for some patients the WVR worsens. On the other hand, the
results obtained for the APS-II corpus showed encouraging improvements in terms of WVR. A detailed
analysis of the original audio transcriptions confirmed, for some patients, the general tendency to slow down
the rhythm of a word in correspondence with syllable boundaries. This phenomenon was sometimes
associated with the hesitations shown by some patients when uttering a word. Figures 7.1 and 7.2 show, for the
APS-I and APS-II corpora respectively, the results achieved by including syllable boundaries in the keyword
model in comparison with the standard transcription. Data were compared with the previous experiments
for a specific operating point of the system regulated by a parameter, the background penalty term β,
already introduced in Section 6.1.3. Here, the operating point chosen is the same as in the VITHEA system
and is equal to 0.6. The average WVR achieved for both corpora with the different pronunciation models is
reported in Table 7.1.
                            Average WVR
                            APS-I    APS-II
Syllabified pronunciation   0.79     0.73
Standard pronunciation      0.80     0.60

Table 7.1: Average WVR for the APS-I and APS-II corpus with different pronunciation models.
To consolidate the results described above, a cross-validation experiment has been carried out. This
is performed in the same fashion as described in [Abad 13]: the data from every speaker was randomly
split into two halves. The first half is used to search for the best β parameter on that data subset.
Then, the selected β penalty term is used to process the second half of the data, and the WVR is computed on
this second subset. Here, the experiments were performed both with the standard pronunciation model and
with the augmented one, guaranteeing that the same random partition was used for both tests. Overall
results, shown in Table 7.2, again confirmed small improvements in the recognition
performance for some patients when the syllabic version of the word is provided. However, also in this context the WVR
worsens for some patients. Thus, the average WVR computed over all the patients shows a more stable
trend, with no meaningful variability.
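The calibration procedure just described can be sketched like this; `wvr_for_beta` stands in for running the recognizer on a data subset at a given β and returning the resulting WVR, so it is an assumed interface rather than an actual AUDIMUS call:

```python
import random

def calibrate_beta(exercises, wvr_for_beta, betas, seed=0):
    """Randomly split one speaker's exercises into two halves, pick the β
    that maximizes WVR on the first half, and report the WVR obtained with
    that β on the held-out half."""
    rng = random.Random(seed)          # fixed seed: same partition for
    shuffled = list(exercises)         # standard and augmented lexicons
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    dev, held_out = shuffled[:half], shuffled[half:]
    best_beta = max(betas, key=lambda b: wvr_for_beta(dev, b))
    return best_beta, wvr_for_beta(held_out, best_beta)
```

The fixed random seed is the detail that guarantees, as stated above, that the same partition is reused when comparing the standard and the syllabified pronunciation models.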
Figure 7.2: Results for the APS-II corpus comparing the two pronunciation models, the standard and the augmented version provided with syllable boundaries.
APS-I
Patient   WVR syllabified pron.   WVR standard pron.
1         0.85                    0.85
2         0.79                    0.78
3         0.84                    0.83
4         0.85                    0.85
5         0.72                    0.72
6         0.92                    0.91
7         0.70                    0.71
8         0.93                    0.93
avg       0.83                    0.82

APS-II
Patient   WVR syllabified pron.   WVR standard pron.
9         0.87                    0.95
10        0.71                    0.72
11        0.78                    0.75
12        0.93                    0.93
13        0.81                    0.81
14        0.78                    0.81
15        0.75                    0.78
16        0.82                    0.88
avg       0.81                    0.83

Table 7.2: WVR for APS-I and APS-II data sets and average WVR, using automatically calibrated background penalty term.
The results achieved in the last section have demonstrated that, in some conditions, the introduction of
syllable boundaries may improve the recognition results. Patients that present a
slower rhythm in their speaking style, or patients that tend to hesitate, may benefit from the introduction of
this information within the pronunciation model. It should also be noted that in this work the orthographic
syllabification has been manually adapted to the phonetic transcription. It could be worthwhile to explore the
future integration of syllable boundaries derived from a phonetic perspective.
8

This last chapter presents the final remarks of this thesis. The main achievements are
summarized in Section 8.1, while Section 8.2 concludes by presenting some ideas for future work.
This work addressed the development of new features for the VITHEA system, a platform resulting from
the work conducted in the context of a three-year national project aiming at the development of a virtual
therapist for the recovery from a language disorder named aphasia.
The project, in which the author has been actively involved since the very beginning, has been publicly
available since July 2011 and is currently distributed to almost 160 users, including speech therapists and
patients. During the last phase of the project, several speech therapists from different institutions were asked
to use and evaluate the program. The assessment was performed through on-line questionnaires and
involved almost 30 speech therapy professionals. The results of this survey were remarkably good,
achieving an average score of 4.14 on a 5-point Likert scale (1 to 5). The average score obtained for
the question “Do you think VITHEA will help you at your work?” was 4.64.
Recently, the project has collected several awards from both the speech and the health-care com-
munities:
• November 2012: The VITHEA project was presented at the ”VII Jornadas en Tecnologıa del
Habla and III Iberian SLTech Workshop”, where it received the second best demo award.

• June 2013: The VITHEA project participated in the seventeenth edition of the conference Saude
CUF, focused on Mobile Health, where it won the ”Call for Papers” contest in the category ”Provision
of services”.
The work that I have carried out in the context of the VITHEA project and of this thesis has led to
the following publications:
• July 2011: An on-line system for remote treatment of aphasia. Speech and Language Processing
for Assistive Technologies (SLPAT). Anna Pompili, Alberto Abad, Isabel Trancoso, Jose Fonseca,
Isabel P. Martins, Gabriela Leal and Luisa Farrajota.
• November 2011: Vithea. Sistema online para tratamento da afasia. Encontro dos Tecnicos de Di-
agnostico e Terapeutica (Poster presentation). Faculdade de Medicina de Lisboa. Jose Fonseca,
Alberto Abad, Gabriela Leal, Luisa Farrajota, Anna Pompili, Isabel Trancoso, Isabel P. Martins.
• February 2012: VITHEA: Sistema online para tratamento da nomeacao oral na afasia. 6o Con-
gresso Portugues do AVC da Sociedade Portuguesa de AVC (Poster presentation). Jose Fonseca,
Alberto Abad, Gabriela Leal, Luisa Farrajota, Anna Pompili, Isabel Trancoso, and Isabel P. Martins.
• April 2012: VITHEA: On-line therapy for aphasic patients exploiting automatic speech recognition.
International Conference on Computational Processing of the Portuguese Language (Propor 2012)
- Demo Session. Anna Pompili and Alberto Abad.
• September 2012: Automatic word naming recognition for treatment and assessment of aphasia.
13th Annual Conference of the International Speech Communication Association (InterSpeech
2012). Alberto Abad, Anna Pompili, Angela Costa, Isabel Trancoso.
• October 2012: Automatic word naming recognition for an on-line aphasia treatment system. Spe-
cial Issue on Speech Proc. & NLP for AT. Computer Speech and Language, Elsevier. Alberto Abad,
Anna Pompili, Angela Costa, Isabel Trancoso, Jose Fonseca, Gabriela Leal, Luisa Farrajota, Isabel
P. Martins.
• November 2012: VITHEA: On-line word naming therapy in Portuguese for aphasic patients exploit-
ing automatic speech recognition. ”VII Jornadas en Tecnologıa del Habla” and III Iberian SLTech
Workshop (IberSPEECH2012). Anna Pompili, Pedro Fialho and Alberto Abad.
• June 2013: Vithea: Virtual therapist for aphasia treatment. XVII Edition of Conferencias
SAUDECUF. Alberto Abad, Anna Pompili, Isabel Trancoso, Jose Fonseca, Isabel P. Martins.
The success of the system motivated research on additional features which could extend its
functionality and robustness. These extensions have been the objectives of the present work and
have concerned many aspects of the VITHEA platform, from its architecture to one of the main
components of the system: the speech recognition engine.
Probably the main contribution of this thesis has been the development of a mobile version of the
client module, which has shown the feasibility of this kind of system on such increasingly widespread and
popular devices. The user experience evaluation provided remarkably good results, encouraging the further
development of this version.
A custom approach for enabling a hands-free interface has been designed and implemented. This
required an important architectural update and the exploitation of a recently standardized protocol to
overcome important limitations. The algorithm has been tailored to the speech characteristics that one
would expect from people with language disorders, and the implementation was designed to
facilitate the performance of the therapy session.
The administration platform of the project has been provided with an advanced search capability that
enhances the usability of the application and improves the management of the system resources.
Techniques from Information Retrieval have made it possible to obtain high recall in the results retrieved from
the system.
Another important achievement for the project has been the implementation of a new category of
exercise. A probabilistic keyword model has been generated to support and evaluate the introduction
of a specific semantic-category naming exercise, the animal naming task. The automated evaluation showed
promising results, which will certainly lead to a future implementation of the exercise in the on-line platform
itself.
To conclude, an alternative pronunciation lexicon has been exploited in order to improve the robustness
of the speech recognizer. Rule-based software has been used to generate an alternative lexicon
according to the orthographic form. Recognition results exploiting this lexicon have shown, for some
patients, a slight improvement in the automatic word naming recognition performance.
This work allowed me to apply and deepen many of the topics learnt during the last years. I had
the possibility to exploit new techniques of Information Retrieval, to implement the concepts studied
in software engineering and system security courses, and of course to test my planning and management
skills. I also finally had the chance to take a closer look at the challenges that surround the
area of speech recognition, a topic that has particularly attracted my interest after so many years of
involvement in the project.
Some directions for future work have already been identified in the course of this document. Among
these is, of course, the idea of extending the version of the system dedicated to mobile devices with more
functionalities, strengthening it with advanced signal processing techniques, and increasing its
robustness.
Regarding the hands-free interface, a future extension aiming at improving the quality
of the speech/non-speech detection may consider exploiting the in-house AUDIMUS
speech segmentation system to perform this task on the server side. Alternatively, another possibility
would be to extend the client-side version of the speech detector by using complementary relevant
features in addition to the signal energy, such as, for instance, the zero-crossing rate.
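A toy frame-level illustration of combining the signal energy with the zero-crossing rate; the thresholds and the decision rule below are placeholders that would need calibration on real data, not the detector actually deployed in the client:

```python
def frame_energy(frame):
    """Mean squared amplitude of one signal frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs that change sign."""
    crossings = sum((frame[i - 1] < 0) != (frame[i] < 0)
                    for i in range(1, len(frame)))
    return crossings / (len(frame) - 1)

def is_speech_frame(frame, energy_thr=0.01, zcr_thr=0.5):
    """Placeholder decision: sufficient energy together with a moderate
    ZCR suggests voiced speech, while high-ZCR low-energy frames are more
    typical of noise; a real detector would also smooth decisions across
    consecutive frames."""
    return frame_energy(frame) > energy_thr and zero_crossing_rate(frame) < zcr_thr
```

The appeal of the ZCR as a complementary feature is that it is as cheap to compute on the client as the energy, yet reacts differently to broadband noise and to voiced speech.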
Regarding the extended search capability developed, we plan to refine the metadata generation
process and to exploit a stemming algorithm, which would allow considering the root of a word instead
of its derivations.
Another important area of the project which deserves further research concerns
the current set of available exercises. Besides the future integration of the animal naming task, the
set of exercises could be extended to other types of stimulation, such as automatic serial naming or
picture description.
To a greater extent, with the introduction of new types of exercises, the VITHEA system itself could
be applied to other kinds of language or even cognitive disorders. Currently, the VITHEA
system is indeed already being recommended by speech therapists to some patients that do not suffer from
aphasia but present rather similar symptoms. This is the case of a recent patient affected
by amyotrophic lateral sclerosis.
For instance, Alzheimer’s disease is a neurodegenerative process whose symptoms manifest
predominantly as a disruption of memory processing that secondarily affects other cognitive
abilities [Ashford 08]. In Alzheimer’s disease, linguistic tasks are used to evaluate and monitor the level of
cognitive dysfunction. For this purpose, a typical task commonly used is the semantic fluency task. This
consists of naming as many words as possible belonging to a specific category within a one-minute
interval. The most common category used for this test is the “animals” category; this subset is therefore
commonly referred to as Animal Naming or Animal Fluency and is part of the CERAD battery (Consortium
to Establish a Registry for Alzheimer’s Disease).
[Abad 08] A. Abad & J. Neto. Automatic classification and transcription of telephone speech
in radio broadcast data. In Proc. International Conference on Computational Pro-
cessing of Portuguese Language (PROPOR), 2008.
[Abad 12] A. Abad, A. Pompili, A. Costa & I. Trancoso. Automatic word naming recogni-
tion for treatment and assessment of aphasia. In 13th Annual Conference of the
International Speech Communication Association (InterSpeech 2012), 2012.
[Abad 13] A. Abad, A. Pompili, A. Costa, I. Trancoso, J. Fonseca, G. Leal, L. Farrajota &
I. P. Martins. Automatic word naming recognition for an on-line aphasia treatment
system. Computer Speech & Language, vol. 27, no. 6, pages 1235–1248, 2013.
Special Issue on Speech and Language Processing for Assistive Technology.
[Adlam 06] A.-L. R. Adlam, K. Patterson, T. T. Rogers, P. J. Nestor, C. H. Salmond, J. Acosta-
Cabronero & J. R. Hodges. Semantic dementia and fluent primary progressive
aphasia: two sides of the same coin? Brain, vol. 129, no. 11, pages 3066–3080,
2006.
[Albert 94] M. L. Albert, R. W. Sparks & N. A. Helm. Report of the Therapeutics and Tech-
nology Assessment Subcommittee of the American Academy of Neurology. As-
sessment: melodic intonation therapy. Neurology, vol. 44, pages 566–568, 1994.
[Albert 98] M. L. Albert. Treatment of aphasia. In Archive of Neurology, volume 55, pages
1417–1419, 1998.
[Aronson 97] A. R. Aronson & T. C. Rindflesch. Query expansion using the UMLS Metathe-
saurus. Proc AMIA Annu Fall Symp, 1997.
[Ashford 08] J. W. Ashford. Screening for Memory Disorder, Dementia, and Alzheimer’s dis-
ease. In Aging Health, volume 4, pages 399–432, 2008.
[Basso 92] A. Basso. Prognostic factors in aphasia. Aphasiology, vol. 6, no. 4, pages 337–
348, 1992.
[Bell 08] M. Bell. Introduction to Service-Oriented Modeling. Service-Oriented Modeling:
Service analysis, design, and architecture. Wiley & Sons, 2008.
[Bhogal 03] S. K. Bhogal, R. Teasell & M. Speechley. Intensity of aphasia therapy, impact on
recovery. Stroke, pages 987–993, 2003.
[Campbell 05] W. W. Campbell. Dejong’s the neurologic examination, chapter Disorders of
Speech and Language. 2005.
[Candeias 08] S. Candeias & F. Perdigao. Conversor de grafemas para fones baseado em regras
para portugues. In Proc. 10 Years of Linguateca - PROPOR 2008, 2008.
[Candeias 09] S. Candeias & F. Perdigao. Syllable Structure Prototype for Portuguese Teach-
ing/Learning. In Proc. Athens Institute for Education and Research International
Conf. on Literatures, Languages, 2009.
[Candeias 11] S. Candeias & F. Perdigao. Syllable Structure Prototype for Portuguese Teach-
ing/Learning. In The 17th International Congress of Phonetic Sciences (ICPhS
XVII), 2011.
[Caseiro 02] D. Caseiro, I. Trancoso, L. Oliveira & C. Viana. Grapheme-to-phone using finite-
state transducers. In Proceedings of 2002 IEEE Workshop on Speech Synthesis,
pages 215 – 218, 2002.
[Caseiro 06] D. Caseiro & I. Trancoso. A specialized on-the-fly algorithm for lexicon and lan-
guage model composition. IEEE Transactions on Audio, Speech & Language
Processing, vol. 14, no. 4, pages 1281–1291, 2006.
[Chuangsuwanich 11] E. Chuangsuwanich & J. R. Glass. Robust Voice Activity Detector for Real World
Applications Using Harmonicity and Modulation Frequency. In INTERSPEECH,
pages 2645–2648. ISCA, 2011.
[Code 94] C. Code & M. J. Ball. Syllabification in aphasic recurring utterances: contributions
of sonority theory. Journal of Neurolinguistics, vol. 8, no. 4, pages 257 – 265,
1994.
[Cui 02] H. Cui, J.-R. Wen, J.-Y. Nie & W.-Y Ma. Probabilistic query expansion using query
logs. In Proceedings of the 11th international conference on World Wide Web,
WWW ’02, pages 325–332, New York, NY, USA, 2002. ACM.
[Davis 85] G. A. Davis & M. L. Wilcox. Adult aphasia rehabilitation: Applied pragmatics.
College Hill Press, 1985.
[Ferro 99] J. M. Ferro, G. Mariano & S Madureira. Recovery from Aphasia and Neglect.
Cerebrovasc Dis, vol. 9, pages 6–22, 1999.
[Fielding 00] R. T. Fielding. Architectural Styles and the Design of Network-based Software
Architectures. PhD thesis, UNIVERSITY OF CALIFORNIA, IRVINE, 2000.
[Fielding 02] R. T. Fielding & R. N. Taylor. Principled design of the modern Web architecture.
ACM Trans. Internet Technol., vol. 2, no. 2, pages 115–150, May 2002.
[Goncalo Oliveira 12] H. Goncalo Oliveira, L. Anton Perez & P. Gomes. Integrating lexical-semantic
knowledge to build a public lexical ontology for portuguese. In Proceedings of the
17th international conference on Applications of Natural Language Processing
and Information Systems, NLDB’12, pages 210–215, Berlin, Heidelberg, 2012.
Springer-Verlag.
[Gong 05] Z. Gong, C. W. Cheang & L. Hou U. Web query expansion by Wordnet. In DEXA,
pages 166–175, 2005.
[Goodglass 93] H. Goodglass. Understanding aphasia: technical report. Rapport technique, Uni-
versity of California. San Diego. Academic Press, 1993.
[HBP 12] The Human Brain Project Pilot Report, 2012.
[Hermansky 90] H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. Journal of
the Acoustical Society of America, vol. 87, no. 4, pages 1738–1752, 1990.
[Hermansky 92] H. Hermansky, N. Morgan, A. Bayya & P. Kohn. RASTA-PLP speech analysis tech-
nique. In Proceedings of the 1992 IEEE international conference on Acoustics,
speech and signal processing, volume 1 of ICASSP’92, pages 121–124, 1992.
[Howard 85] D. Howard, K. Patterson, S. Franklin, V. Orchard-Lisle & J. Morton. The facilitation
of picture naming in aphasia. Cognitive Neuropsychology, vol. 2, pages 49–80,
1985.
[Hunt 04] M. Hunt. Speech recognition, syllabification and statistical phonetics. In Proc.
International Conference on Speech and Language Processing, Interspeech-04,
October 2004.
[Jamaati 08] M. Jamaati, H. Marvi & M. Lankarany. Vowels recognition using mellin transform
and plp-based feature extraction. J Acoust Soc Am, vol. 123, no. 5, page 3177,
2008.
[Kingsbury 98] B. E. D. Kingsbury, N. Morgan & S. Greenberg. Robust speech recognition using
the modulation spectrogram. Speech Communication, vol. 25, no. 1-3, pages 117
– 132, 1998.
[Knutsen 09] J. Knutsen. Web service clients on mobile android devices - a study on architec-
tural alternatives and client performance. Master’s thesis, Norwegian University
of Science and Technology, 2009.
[Koller 10] O. T. A. Koller. Automatic speech recognition and identification of African Por-
tuguese. Diploma thesis, Berlin University of Technology, June 2010.
[Maier 09] A. Maier, T. Haderlein, U. Eysholdt, F. Rosanowski, A. Batliner, M. Schuster &
E. Noth. {PEAKS} – A system for the automatic evaluation of voice and speech
disorders. Speech Communication, vol. 51, no. 5, pages 425 – 437, 2009.
[Mamede 12] N. J. Mamede, J. Baptista, C. Diniz & V. Cabarrao. STRING: An Hybrid Statistical
and Rule-Based Natural Language Processing Chain for Portuguese. In Interna-
tional Conference on Computational Processing of Portuguese, Propor, 2012.
[Meinedo 00] H. Meinedo & J. P. Neto. Combination Of Acoustic Models In Continuous Speech
Recognition Hybrid Systems. vol. 2, pages 931–934, 2000.
[Meinedo 03] H. Meinedo, D. Caseiro, J. Neto & I. Trancoso. AUDIMUS.Media: a Broadcast
News speech recognition system for the European Portuguese language. In Proc.
International Conference on Computational Processing of Portuguese Language
(PROPOR), 2003.
[Meinedo 10] H. Meinedo, A. Abad, T. Pellegrini, I. Trancoso & J. P. Neto. The L2F Broadcast
News Speech Recognition System. In Fala2010, Vigo, Spain, 2010.
[Mohri 02] M. Mohri, F. Pereira & M. Riley. Weighted Finite-State Transducers in Speech
Recognition. Computer Speech and Language, vol. 16, pages 69–88, 2002.
[Morgan 95] N. Morgan & H. Bourlard. An introduction to hybrid HMM/connectionist continuous
speech recognition. IEEE Signal Processing Magazine, vol. 12, no. 3, pages 25–
42, 1995.
[Murray 01] L. L. Murray & R. Chapey. Assessment of Language Disorders in Adults. In
R. Chapey, editor, Language Intervention Strategies in Aphasia and Related
Neurogenic Communication Disorders. Lippincott Williams & Wilkins, 4th edition,
2001.
[Neto 96] J. P. Neto, C. Martins & L. B. Almeida. An Incremental Speaker-Adaptation Tech-
nique For Hybrid Hmm-Mlp Recognizer. In Recognizer, Proceedings ICSLP 96,
pages 1289–1292, 1996.
[Oliveira 05] C. Oliveira, L. C. Moutinho & A. J. S. Teixeira. On european Portuguese automatic
syllabification. In INTERSPEECH 2005 - Eurospeech, 9th European Conference
on Speech Communication and Technology, Lisbon, Portugal, pages 2933–2936.
ISCA, 2005.
[Ortmanns 00] S. Ortmanns & H. Ney. The time-conditioned approach in dynamic programming
search for LVCSR. Speech and Audio Processing, IEEE Transactions on, vol. 8,
no. 6, pages 676 –687, 2000.
[Paulo 08] S. Paulo, L. C. Oliveira, C. Mendes, L. Figueira, R. Cassaca, C. Viana & H. Moniz.
DIXI – A Generic Text-to-Speech System for European Portuguese. In Computa-
tional Processing of the Portuguese Language, volume 5190 of Lecture Notes in
Computer Science, pages 91–100. Springer Berlin Heidelberg, 2008.
[Pedersen 95] P. M. Pedersen, H. Stig Jørgensen, H. Nakayama, H. O. Raaschou & T. S. Olsen.
Aphasia in acute stroke: Incidence, determinants, and recovery. Annals of Neu-
rology, vol. 38, no. 4, pages 659–666, 1995.
[Pinto 07] J. Pinto, A. Lovitt & H. Hermansky. Exploiting Phoneme Similarities in Hybrid
HMM-ANN Keyword Spotting. In Proc. Interspeech, pages 1610–1613, 2007.
[Pompili 11] A. Pompili, A. Abad, I. Trancoso, J. Fonseca, I. P. Martins, G. Leal & L. Farrajota.
An on-line system for remote treatment of aphasia. In Proceedings of the Sec-
ond Workshop on Speech and Language Processing for Assistive Technologies,
SLPAT ’11, pages 1–10. Association for Computational Linguistics, 2011.
[Pompili 13] A. Pompili. New features for on-line aphasia therapy. Master's thesis, Instituto
Superior Técnico, 2013.
[Rabiner 89] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE, vol. 77, no. 2, pages 257–285, 1989.
[Rabiner 93] L. R. Rabiner & B. H. Juang. Fundamentals of speech recognition. Prentice Hall,
1993.
[Ramirez 04] J. Ramirez, J. C. Segura, C. Benitez, A. de la Torre & A. Rubio. Voice activity
detection with noise reduction and long-term spectral divergence estimation. In
IEEE International Conference on Acoustics, Speech, and Signal Processing
(ICASSP '04), volume 2, pages II-1093–1096, 2004.
[Richardson 07] L. Richardson & S. Ruby. RESTful Web Services. O'Reilly Media, 2007.
[Sangwan 02] A. Sangwan, M. C. Chiranth, H. S. Jamadagni, R. Sah, R. Venkatesha Prasad
& V. Gaurav. VAD techniques for real-time speech transmission on the Internet.
In 5th IEEE International Conference on High Speed Networks and Multimedia
Communications, pages 46–50, 2002.
[Sarno 81] M. T. Sarno. Recovery and rehabilitation in aphasia. In Acquired Aphasia, pages
485–530. Academic Press, New York, 1981.
[Snodgrass 80] J. G. Snodgrass & M. Vanderwart. A standardized set of 260 pictures: Norms
for name agreement, image agreement, familiarity, and visual complexity. Journal
of Experimental Psychology: Human Learning and Memory, vol. 6, no. 2, pages
174–215, 1980.
[Szoke 05] I. Szoke, P. Schwarz, P. Matejka, L. Burget, M. Karafiat, M. Fapso & J. Cernocky.
Comparison of Keyword Spotting Approaches for Informal Continuous Speech. In
Proc. Interspeech, pages 633–636, 2005.
[Tebelskis 95] J. Tebelskis. Speech recognition using neural networks. PhD thesis, Carnegie
Mellon University, 1995.
[Trancoso 03] I. Trancoso, C. Viana, M. Barros, D. Caseiro & S. Paulo. From Portuguese to
Mirandese: fast porting of a letter-to-sound module using FSTs. In Proceedings of
the 6th international conference on Computational processing of the Portuguese
language, PROPOR’03, pages 49–56, Berlin, Heidelberg, 2003.
[Voorhees 94] E. M. Voorhees. Query Expansion using Lexical-Semantic Relations. In Proceed-
ings of the 17th Annual International ACM SIGIR conference on Research and
Development in Information Retrieval, SIGIR ’94, pages 61–69. Springer London,
1994.
[Weinrich 91] M. Weinrich. Computerized visual communication as an alternative communica-
tion system and therapeutic tool. Neurolinguist., vol. 6, pages 159–176, 1991.
[Wilshire 00] C. E. Wilshire & H. B. Coslett. Disorders of word retrieval in aphasia: theories and
potential applications. In S. E. Nadeau, L. J. G. Rothi & B. Crosson, editors,
Aphasia and Language: Theory to Practice, pages 82–107. The Guilford
Press, New York, 2000.