New features for on-line aphasia therapy
Anna Maria Pompili
Thesis to obtain the Master of Science Degree in
Information Systems and Computer Engineering
Examination Committee
Chairperson: Prof. Pedro Manuel Moreira Vaz Antunes de Sousa
Supervisor: Prof. Isabel Maria Martins Trancoso
Supervisor: Dr. Alberto Abad Gareta
Member of the Committee: Prof. Alfredo Manuel dos Santos Ferreira Júnior
November 2013
To Giuseppina and Francesco.
Acknowledgments

My deepest gratitude goes to Professor Alberto Abad. He always guided and supported me in the most comprehensive and constructive way, providing brilliant ideas, showing me the right approach to address complicated problems, and readily helping me to overcome the many difficulties that I had to tackle while pursuing the objectives of this thesis. His guidance, constant incentives and endless availability have been fundamental for the achievement of this result.
I wish to express my gratitude to Professor Isabel Trancoso, not only for her valuable guidance, but also for having welcomed me into the L2F group. During the time spent here, she always motivated me with inspiring discussions, and provided me with her full support and availability. She never missed a chance to demonstrate her trust in me, and constantly accompanied my work, enlightening my way with her unique ability to identify innovative research directions and enticing applications for the results achieved during this work.
I owe a very special acknowledgment to Isabel Pavão Martins, José Fonseca, Gabriela Leal, Luísa Farrajota, and Sofia Clérigo, from the Language Research Laboratory group (LEL - Laboratório de Estudos de Linguagem) of the Lisbon Faculty of Medicine. Their cooperation has been fundamental in making the VITHEA project a reality.
I also want to thank Professor Nuno Mamede and Professor Sara Candeias from the L2F group, for having kindly provided important resources that constituted the baseline for some of the results achieved in this work. Without this initial groundwork, those results would not have been possible.
Thank you also to all the colleagues and room-mates that I have had the pleasure to know during these years. They supported this experience not only with their kindness and friendship, but also by actively participating in the data collection and user evaluation experiments.
Finally, my special thanks go to Paolo, my companion. His advice, care, and support have been invaluable in overcoming the hardest difficulties.
Resumo

Afasia é um tipo particular de distúrbio da comunicação causado por lesões de uma ou mais áreas do cérebro que afectam diferentes funcionalidades da linguagem e da fala. Os acidentes vasculares cerebrais são uma das causas mais comuns desta doença.

VITHEA (Terapeuta Virtual para o tratamento da afasia) é uma plataforma on-line desenvolvida para o tratamento de doentes afásicos, incorporando os recentes avanços das tecnologias de fala para proporcionar exercícios de nomeação a pessoas com uma reduzida capacidade de nomear objetos. O sistema, disponível ao público desde Julho de 2011, recebeu já vários prémios nacionais e internacionais e é atualmente distribuído a cerca de 160 utilizadores entre profissionais de saúde e doentes.

O foco deste trabalho é investigar a viabilidade da incorporação de funcionalidades adicionais que podem potenciar o sistema VITHEA. Essas funcionalidades visam tanto estender a usabilidade do sistema quanto reforçar o seu desempenho, considerando assim várias áreas heterogéneas do projeto. Entre estas funcionalidades destacam-se: uma nova versão do aplicativo cliente para estender a portabilidade da plataforma a dispositivos móveis, uma interface hands-free para facilitar os doentes portadores de deficiências físicas, e uma funcionalidade de pesquisa avançada para melhorar a gestão dos dados da aplicação. Foi também estudada a viabilidade de um novo tipo de exercícios e avaliado o desempenho de um novo léxico de pronúncia com o objectivo de melhorar os resultados de reconhecimento. Em geral, os resultados de questionários de satisfação dos utilizadores e as avaliações automáticas têm proporcionado feedback encorajador sobre as melhorias desenvolvidas.

Palavras-chave: Afasia, recuperação da linguagem, terapia virtual, distúrbio da fala, nomeação oral, reconhecimento de fala
Abstract

Aphasia is a particular type of communication disorder caused by damage to one or more language areas of the brain, affecting various speech and language functionalities. Cerebral vascular accidents are one of its most common causes.

VITHEA (Virtual Therapist for Aphasia Treatment) is an on-line platform developed for the treatment of aphasic patients, incorporating recent advances in speech and language technologies to provide word naming exercises to individuals with lost or reduced word naming ability. The system, publicly available since July 2011, has received several national and international awards and is currently distributed to almost 160 users among health-care professionals and patients.
The focus of this thesis is to investigate the feasibility of incorporating additional functionalities that may enhance the VITHEA system. These features aim at both extending the usability of the system and strengthening its performance, and thus involve several heterogeneous areas of the project. The main new features were: a new version of the client application to extend the portability of the platform to mobile devices, an ad-hoc hands-free interface to facilitate patients with physical disabilities, and an advanced search capability to improve the management of the application data. This study also included the assessment of the feasibility of a new type of exercise, and the evaluation of a new pronunciation lexicon aimed at improving recognition results. Overall, the results of user interaction satisfaction questionnaires and the automatic evaluations have provided encouraging feedback on the outcome of the developed improvements.
Keywords: Aphasia, language recovery, virtual therapy, speech disorder, word naming, speech
recognition
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
List of abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Structure of this Document . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Related Work 5
2.1 Aphasia language disorder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Aphasia symptoms classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Aphasia treatment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Automatic speech recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1 Brief introduction to automatic speech recognition . . . . . . . . . . . . . . . . . . 7
2.2.2 AUDIMUS speech recognizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.3 Automatic word verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3.0.1 Word verification based on keyword spotting . . . . . . . . . . . 10
2.2.3.0.2 Keyword spotting with AUDIMUS . . . . . . . . . . . . . . . . . . 11
2.3 Platform for speech therapy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 VITHEA: An on-line system for virtual treatment of aphasia . . . . . . . . . . . . . 12
2.3.1.1 The patient and the clinician applications . . . . . . . . . . . . . . . . . . 13
2.3.1.1.1 Patient application module . . . . . . . . . . . . . . . . . . . . . 13
2.3.1.1.2 Virtual character animation and speech synthesis . . . . . . . . 13
2.3.1.1.3 Speech synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1.1.4 Clinician application module . . . . . . . . . . . . . . . . . . . . 14
2.3.1.2 Platform architecture overview . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 New features for aphasia therapy: State of the art . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.1 Content adaptation for mobile devices . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.2 Hands-free speech . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.3 Exploiting IR for improved search functionality . . . . . . . . . . . . . . . . . . . . . 18
2.4.4 New automatic evocation exercises for therapy treatment . . . . . . . . . . . . . . 19
2.4.5 Exploiting syllable information in word naming recognition of aphasic speech . . . 20
3 Content adaptation for mobile devices 21
3.1 Service Oriented Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.1 Representational State Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 Data representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.3 Android Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Architectural overview of the implemented prototype . . . . . . . . . . . . . . . . . . . . . 24
3.2.1 REST authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.2 Implemented architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.2.0.1 Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2.0.2 Data representation . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.3 Client application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 User experience evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4 Hands-free speech recording 31
4.1 Voice activity detection task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2.1 Speech corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5 Exploiting IR for improved search functionality 41
5.1 Extended search functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.1.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1.1.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1.1.2 Metadata generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1.1.3 Indexes generation and management . . . . . . . . . . . . . . . . . . . . 44
5.2 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6 New automatic evocation exercises for therapy treatment 49
6.1 Automatic animal naming recognition task . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.1.1 Keyword spotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.1.2 Keyword model generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.1.3 Background penalty for keyword spotting tuning . . . . . . . . . . . . . . . . . . . . 51
6.2 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.2.1 Speech corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7 Exploiting syllable information in word naming recognition of aphasic speech 57
7.1 Syllabification task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.1.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.2 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.2.1 Speech corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
7.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
8 Conclusions 63
8.1 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Bibliography 72
List of Tables

4.1 Baseline configuration established through exhaustive search. . . . . . . . . . . . . . . . 39
4.2 Results obtained on the development test set with the baseline configuration. . . . . . . . 39
4.3 Results obtained on the evaluation set with the baseline configuration. . . . . . . . . . . . 40
5.1 Coverage of the additional metadata generated. . . . . . . . . . . . . . . . . . . . . . . . 43
5.2 Precision and recall for each of the indexes generated. . . . . . . . . . . . . . . . . . . . . 45
5.3 Number of results returned by the system using the extended search feature and using a
standard search functionality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.1 Speech corpus data, including gender, total number of words and the total number of valid
words uttered. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.2 Experiments data and resulting average WER, including file size information. . . . . . . . 53
6.3 Experiments data and resulting average WER, including file size information. . . . . . . . 54
6.4 Experiments data and resulting average WER. . . . . . . . . . . . . . . . . . . . . . . . . 55
6.5 Automatic and manual WER with the configuration 2 of the last set of experiments. . . . 56
7.1 Average WVR for the APS-I and APS-II corpus with different pronunciation models. . . . . 60
7.2 WVR for APS-I and APS-II data sets and average WVR, using automatically calibrated
background penalty term. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
List of Figures

2.1 Block diagram of AUDIMUS speech recognition system. . . . . . . . . . . . . . . . . . . . 10
2.2 Comprehensive overview of the VITHEA system. . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Screen-shots of the VITHEA patient application. . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Interface for the creation of new stimulus. . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Interface for the management of multimedia resources. . . . . . . . . . . . . . . . . . . . . 16
3.1 Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Screen-shots of the VITHEA mobile patient application. . . . . . . . . . . . . . . . . . . . 28
3.3 Results of the evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Distribution of the user grades for the questions of the third group. . . . . . . . . . . . . . 30
4.1 Architectural implementation of the VAD algorithm. . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Process of generation of the speech corpus. . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.1 Structure of the objects of the VITHEA system that are of interest for the search functionality. 43
5.2 Results provided for the search query “seco” (dry) on the field answer of a Question. . . . 46
5.3 Results provided for the search query “alimento” (food) on the field answer of a Question. 46
5.4 Results provided for the search query <“harpia” (harpy), “animais” (animals)> on the
fields title and category of a document. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.1 First set of experiments using the keyword model with different values for the threshold. . 53
6.2 Second set of experiments using the keyword model filtered with Onto.PT and different
values for the threshold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.3 Third set of experiments, including phonetic transcription correction and filled pause models. 55
7.1 Results for the APS-I comparing the two pronunciation lexicons, the standard and the
augmented version provided with syllable boundaries. . . . . . . . . . . . . . . . . . . . . 59
7.2 Results for the APS-II comparing the two pronunciation models, the standard and the
augmented version provided with syllable boundaries. . . . . . . . . . . . . . . . . . . . . 60
List of abbreviations

ANN Artificial Neural Network
ASR Automatic Speech Recognition
CSR Continuous Speech Recognition
CVA Cerebral Vascular Accident
JSON JavaScript Object Notation
IWR Isolated Word Recognition
LVCSR Large Vocabulary Continuous Speech Recognition
MLP Multilayer Perceptron
REST Representational State Transfer
RPC Remote Procedure Call
WFST Weighted Finite State Transducer
TTS Text-To-Speech
SOA Service-Oriented Architecture
SOAP Simple Object Access Protocol
URI Uniform Resource Identifier
XML Extensible Markup Language
WVR Word Verification Rate
WER Word Error Rate
VAD Voice Activity Detection
1 Introduction

1.1 Motivation

Aphasia is a particular type of communication disorder caused by damage to one or more language areas of the brain, affecting various speech and language functionalities. Cerebral vascular accidents are one of its most common causes. A frequent symptom is the difficulty in recalling names or words. Typically, such problems can be treated through word naming therapeutic exercises. In fact, the frequency and intensity of speech therapy are a key factor in recovery, thus motivating the development of automatic therapy methods that may be used remotely.
VITHEA (Virtual Therapist for Aphasia Treatment) is an on-line platform developed for the treatment
of aphasic patients, incorporating recent advances of speech and language technologies to provide
word naming exercises to individuals with lost or reduced word naming ability. The project started in
June 2010 and saw the release of the first public prototype in July 2011. Since then, the system has
continuously evolved with improvements both on the speech recognition techniques used and on the
functionalities provided to patients and therapists. After three years of active development, the project is now used daily by patients and speech therapists, and has received awards from both the speech and the health-care communities.
The success of the system motivated research on additional features which could extend its functionality and robustness. These new features cover a heterogeneous set of enhancements that includes, among others, the evaluation of new approaches to improve the recognition quality and the recording process, the development of new interfaces to improve the user experience, and the experimental implementation of new types of exercises.
The focus of the present work is, as the title states, to investigate the feasibility of incorporating additional functionalities that may improve the VITHEA system. These features aim at both extending the usability of the system and strengthening its performance. The former will be achieved by providing a new version of the client application for mobile devices, a hands-free interface for an easier recording experience, an advanced search functionality for improved management of platform data, and a new type of exercise. Regarding system performance, a new approach that considers the syllabic division of words will be studied and tested within the current speech recognition process.
The VITHEA system comprises two specific modules, dedicated respectively to the patients for car-
rying out the therapy sessions and to the clinicians for the administration of the functionalities related to
them.
Since smartphones, tablets and cellphones have become mainstream over the last few years, mobile services are increasingly integrated into everyday life. In some cases, a smartphone may be cheaper than a computer, more practical, and even easier to use, as it does not require an external input device. Thus, the adaptation of the VITHEA platform to mobile devices has been considered highly important for the diffusion of the system. However, this extension is currently limited by the recording module of the application. Here, an architecture compliant with the new requirements and a client version for mobile devices have been designed and implemented in order to verify the performance and the level of user appreciation of the mobile version.
Related also to the client module, it is worth noticing that most of the time aphasia is the consequence of a Cerebral Vascular Accident (CVA) and, in those cases, affected patients may also experience some sort of physical disability in arm mobility. In such situations, support for a hands-free interface notably improves the usability of the system. However, the typical extension of these interfaces for human-computer interaction consists of voice commands as an alternative input modality. In the particular case of the VITHEA project, since the users of the system are affected by a language disorder, hands-free computing cannot be interpreted as an alternative way of interaction; instead, it is selectively applied to automate the process of recording the users' answers and thus provide additional benefits to people experiencing disabilities. The moment when the recording process should start can be efficiently determined automatically, by taking as reference the end of the description of the stimulus spoken by the virtual therapist, or the end of the subsequent reproduction of the audio/video file in the case of a multimedia stimulus. Detecting the end of the speech is a more challenging issue. Common solutions rely on Voice Activity Detection (VAD) approaches, which automatically try to determine the presence or absence of voice based on some features of the input signal. Depending on the implementation, the features used may vary. In this work, the energy of the speech signal has been used as a baseline to develop an algorithm that automatically detects the end of the speech.
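As an illustration of this energy-based idea, the following Python sketch decides that an answer has ended once enough consecutive low-energy frames follow some detected speech. It is a minimal sketch under assumed parameters (the frame size, `energy_threshold` and `max_silent_frames` values are hypothetical and would need calibration on real recordings), not the actual VITHEA implementation.

```python
import numpy as np

def detect_end_of_speech(frames, energy_threshold=0.01, max_silent_frames=50):
    """Return the index of the frame where speech is judged to have ended,
    or None if no end of speech was detected.

    frames: iterable of 1-D numpy arrays (fixed-length audio frames).
    energy_threshold, max_silent_frames: illustrative values that would
    need calibration against real background noise conditions.
    """
    silent_run = 0          # consecutive low-energy frames seen so far
    speech_started = False  # only stop after speech has actually begun
    for i, frame in enumerate(frames):
        # short-term energy: mean of squared samples in this frame
        energy = float(np.mean(frame.astype(np.float64) ** 2))
        if energy >= energy_threshold:
            speech_started = True
            silent_run = 0
        elif speech_started:
            silent_run += 1
            if silent_run >= max_silent_frames:
                return i  # enough trailing silence: the answer has ended
    return None
```

In a real deployment the frames would arrive from the microphone in a streaming fashion, and the threshold would typically be adapted to the background noise level estimated at the start of the recording.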
On the other hand, the objective of the clinician module is to allow the management of patient data, as well as of the collection of exercises and the resources associated with them. Over the last years, several improvements were introduced to allow the incorporation of new exercises and the creation and management of groups of speech-language therapists and patients. However, many important functionalities that affect the overall usability of the clinician module were still missing. The management of the exercise data, which now exceeds one thousand stimuli, only provided a listing functionality, missing the option to search for a given stimulus. Considering the amount of data stored in the system, the lack of a search feature strongly affects the daily usage of the module. Besides, it should be noted that the data constituting the exercises and the stimuli is somewhat peculiar in its format. In fact, most of the time, it is represented by a single keyword (i.e., the title of a document). This means that if the therapist does not remember the exact term he/she is looking for, the search will probably fail. For these reasons, it is important that the search functionality takes these constraints into account and thus provides extended search capabilities. Techniques from the area of Information Retrieval, such as Query Expansion, will be exploited to achieve this purpose.
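To make the idea concrete, the following Python sketch shows query expansion in its simplest form: each query term is expanded with related terms before matching, so a search can succeed even when the stored keyword differs from the one the therapist remembers. The synonym table and the whitespace-based matching are purely illustrative stand-ins for the ontological resources and the full-text engine used in the actual system.

```python
# Hypothetical synonym map; in VITHEA the expansion terms would come from
# ontological resources rather than a hard-coded dictionary like this one.
SYNONYMS = {
    "seco": ["árido", "enxuto"],
    "alimento": ["comida", "refeição"],
}

def expand_query(terms):
    """Return the query terms together with their known related terms."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term.lower(), []))
    return expanded

def search(documents, terms):
    """Return documents matching ANY expanded term (OR semantics)."""
    expanded = {t.lower() for t in expand_query(terms)}
    return [doc for doc in documents
            if expanded & {w.lower() for w in doc.split()}]
```

With OR semantics over the expanded set, a query for "alimento" also retrieves stimuli titled "comida", at the cost of lower precision; this is the trade-off that the precision and recall evaluation in Chapter 5 quantifies.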
Regarding the therapeutic exercises, there are several naming tasks for assessing the patient’s ability to provide the verbal label of objects, actions, events, attributes, and relationships. There
are different types of naming tasks, such as category naming, confrontation naming, automatic closure naming, automatic serial naming, recognition naming, repetition naming, and responsive naming [Campbell 05, Murray 01]. Currently, the VITHEA system supports exercises based on visual confrontation, automatic closure naming, and responsive naming. The integration of automatic serial naming or semantic category naming exercises would be of valuable help for patients recovering from aphasia.
Finally, during preliminary experiments evaluating the performance of the word naming recognition task within the VITHEA system, an analysis of word detection errors was performed [Abad 13]. These results showed that among the characteristics of aphasic speech that sometimes cause keywords to be missed are pauses between syllables and mispronounced phonemes. Recordings have confirmed that some patients tend to speak with a slow rhythm, almost as if they were dividing the word into syllables. This phenomenon, in an even more pronounced form, was also directly observed in different experimentation sessions with the system, performed either by a patient or by a healthy subject. In these contexts, when the system failed to recognize the user’s answer, the user typically started to syllabify the word. These reasons have motivated the idea of investigating the integration of an external speech tool that performs the syllabification of words. The syllabified version will constitute an augmented grammar for the recognizer that will hopefully improve its performance.
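As a sketch of how such syllable information might be exploited, the following Python fragment augments a toy pronunciation lexicon with variants that allow an optional pause between syllables, so that slow, syllabified productions can still be matched. The phone symbols, the `sil` marker, and the `#syl` variant naming are all illustrative assumptions, not the actual AUDIMUS lexicon format.

```python
def augment_lexicon(lexicon, syllabify):
    """Add, for each word, a variant pronunciation with a silence marker
    ("sil") inserted between syllables.

    lexicon:   dict mapping word -> list of phones,
               e.g. {"gato": ["g", "a", "t", "u"]}
    syllabify: function mapping word -> list of syllable phone lists.
    """
    augmented = dict(lexicon)  # keep the standard pronunciations
    for word in lexicon:
        syllables = syllabify(word)
        variant = []
        for i, syl in enumerate(syllables):
            variant.extend(syl)
            if i < len(syllables) - 1:
                variant.append("sil")  # allow a pause between syllables
        augmented[word + "#syl"] = variant  # illustrative variant naming
    return augmented
```

In practice the syllabification function would be provided by the external tool mentioned above, and the recognizer would treat the two entries as alternative pronunciations of the same keyword.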
1.2 Objectives

All the objectives that were identified in the thesis proposal [Pompili 13] were implemented, with the exception of the “awareness and profiling” functionality. This feature has been replaced by an advanced search capability that has been integrated into the clinician module. In fact, during the evolution of the thesis work, this feature appeared more interesting and useful for the improvement of the project, to the point of justifying the introduction of this amendment. Thus, the main goals of this work are:
• content adaptation for mobile devices,
• hands-free speech interface,
• advanced search capability,
• new naming exercises,
• syllabification tool.
1.3 Structure of this Document

The present thesis consists of 8 chapters, structured as follows:
• Chapter 2 starts by reporting on background concepts and on the state of the art of on-line platforms dedicated to speech disorders, describing the VITHEA system in further detail. Then, it focuses on the specification of the new features that represent the target of this work, reporting the current state of the art, where applicable.
• Chapter 3 reports on the architecture, the design, and the security pattern that have been followed to develop a new version of the system supported by mobile devices. The constraints that have guided the ultimate prototype, and the choices that have been made, are here justified, motivated and explained. Then, the Chapter concludes with the description of the results of a user experience evaluation.
• Chapter 4 describes in detail both the options chosen for the implementation of a VAD approach carried out at the same time as the recording process, and the architectural updates involved in its integration within the VITHEA system. The Chapter ends with the evaluation of the algorithm through automated tests carried out with the recordings of daily users of the system.
• In Chapter 5 the focus is on improving the management of the data of the system by providing an advanced search feature. This is achieved through the generation of metadata provided by ontological resources. These data are then exploited by a query expansion process and a full-text search engine to provide an extended set of results. Precision and recall measures for a given test set of queries are reported at the end of the Chapter.
• Chapter 6 explains the concept of evocation exercises and how a specific subclass, the animal
naming, has been implemented through an iterative process of enhancements. The construction
of the baseline list of admissible animals, constituting a key component for the recognition process,
is detailed together with the automated evaluation carried out through the collection of a speech
corpus.
• Chapter 7 introduces the issues that surround the task of syllabic division of words and describes how an external software tool that provides orthographic syllabification has been adapted to the architecture of the speech recognition engine AUDIMUS. Then, the results of the automated tests, carried out with the corpus of aphasic patients collected during the VITHEA project, are described.
• Finally, Chapter 8 presents the conclusions and future work.
2 Related Work

This chapter aims at providing both important background knowledge that will be referred to in the rest of this document and the relevant state of the art for the new targeted features. It is divided into four main sections: first, a short background on aphasia and common therapeutic approaches is given (Section 2.1); then, an overview of an Automatic Speech Recognition (ASR) system is provided in Section 2.2, focusing on AUDIMUS, the in-house speech recognition engine used. Section 2.3 describes currently known platforms providing on-line tools for voice disorders, with particular focus on the VITHEA system. Finally, Section 2.4 is devoted to describing the state of the art relevant to each of the new features addressed in this work.
2.1 Aphasia language disorder

Aphasia is a speech disorder which comprises difficulties in both production and comprehension of spoken or written language. It is caused by damage to one or more of the language areas of the brain, and typically occurs after brain injuries. There are several causes of brain injuries affecting communication skills, such as brain tumours, brain infections, severe head injuries, and, most commonly, cerebral vascular accidents (CVA). Among the effects of aphasia, the difficulty in recalling words or names is the most common disorder presented by aphasic individuals. In fact, it has been reported in some cases as the only residual deficit after rehabilitation [Wilshire 00]. Several studies about aphasia have demonstrated the positive effect of speech-language therapy activities on the improvement of social communication abilities [Basso 92]. Moreover, it has been shown that the intensity of therapy positively influences speech and language recovery in aphasic patients [Bhogal 03].
2.1.1 Aphasia symptoms classification
We can classify the various aphasia syndromes by characterizing the speech output in two broad categories: fluent and non-fluent aphasia [Goodglass 93]. Fluent aphasia preserves normal articulation and rhythm of speech, but is deficient in meaning. Typically, there are word-finding problems that most affect nouns and picturable action words. Non-fluent aphasic speech is slow and laboured, with short utterance length. The flow of speech is more or less impaired at the levels of speech initiation, the finding and sequencing of articulatory movements, and the production of grammatical sequences. Following the above classification, we list the major types of aphasia and their properties:
1. Fluent
(a) Wernicke’s aphasia, caused by damage to the temporal lobe of the brain, is one of the most
common syndromes in fluent aphasia. People with Wernicke’s aphasia may speak in long
sentences that have no meaning, adding unnecessary or made-up words. Individuals with
Wernicke’s aphasia usually have great difficulty understanding the speech of both themselves
and others and are therefore often unaware of their mistakes.
(b) Transcortical aphasia presents deficits similar to those of Wernicke’s aphasia, but repetition ability
remains intact.
(c) Conduction aphasia is caused by deficits in the connections between the speech-
comprehension and speech-production areas. Auditory comprehension is near normal, and
oral expression is fluent with occasional paraphasic errors. Repetition ability is poor.
(d) Anomic aphasia is characterized by difficulties naming certain words, linked by their
grammatical type (e.g. difficulty naming verbs but not nouns) or by their semantic category
(e.g. difficulty naming words relating to photography but nothing else), or by a more general
naming difficulty.
2. Non-fluent
(a) Broca’s aphasia is caused by damage to the frontal lobe of the brain. People with Broca’s
aphasia may speak in short phrases that make sense but are produced with great effort.
People with Broca’s aphasia typically understand the speech of others fairly well. Because of
this, they are often aware of their difficulties and can become easily frustrated.
(b) Global aphasia presents severe communication difficulties; individuals with global aphasia
are extremely limited in their ability to speak or comprehend language. They may be totally
non-verbal, and/or only use facial expressions and gestures to communicate.
(c) Transcortical Motor aphasia presents deficits similar to those of Broca’s aphasia, except that
repetition ability remains intact. Auditory comprehension is generally fine for simple conversations, but
declines rapidly for more complex conversations.
2.1.2 Aphasia treatment
In some cases, a person will completely recover from aphasia without treatment. This type of sponta-
neous recovery usually occurs following a type of stroke in which blood flow to the brain is temporarily
interrupted, but quickly restored, called a transient ischemic attack. In these circumstances, language
abilities may return in a few hours or a few days. In most cases, however, language recovery is not
as quick or as complete. While many people with aphasia experience partial spontaneous recovery, in
which some language abilities return a few days to a month after the brain injury, some residual disor-
ders typically remain. In these instances, most clinicians would recommend speech-language therapy.
The recovery process usually continues over a two-year period, although clinicians believe that the most
effective treatment begins early in the recovery process.
There are multiple modalities of speech therapy [Albert 98]. The most commonly used techniques
are focused on improving expressive output, such as the stimulation-response method and Melodic
Intonation Therapy (MIT). MIT is a formal, hierarchically structured treatment program based on
the assumption that the stress, intonation, and melodic patterns of language output are controlled pri-
marily by the right hemisphere and, thus, remain available to individuals with aphasia caused by
left-hemisphere damage [Albert 94]. Other methods are linguistic-oriented learning approaches, such as
the lexical-semantic therapy or the mapping technique for the treatment of agrammatism. Still, other
techniques, such as Promoting Aphasics’ Communicative Effectiveness (PACE), focus on enhancing
communicative ability, non-verbal as well as verbal, in pragmatically realistic settings [Davis 85]. Several
non-verbal methods for the treatment of severe global aphasics rely on computer-aided therapy:
visual analogue communication, iconic communication, visual action and drawing therapies are all
currently in use [Sarno 81]. An example is Computerized Visual Communication (or C-VIC), designed as
an alternative communication system for patients with severe aphasia and based on the notion that
people with severe aphasia can learn an alternative symbol system and can use this alternative system
to communicate [Weinrich 91].
Furthermore, although there exists such an extended list of treatments, each specifically conceived to
address a different disorder caused by aphasia, one especially important class of treatment is the one
devoted to improving word retrieval, since, as noted above, word-finding difficulty is one of the most
common residual disorders across all aphasia syndromes. Naming problems are typically treated with
semantic exercises like naming objects or naming common actions, in which the patient is commonly
asked to name a subject represented in a picture [Adlam 06].
2.2 Automatic speech recognition

Speech recognition is the translation, performed by a machine, of spoken words into text. It is a difficult
task whose automation involves many areas of computer science, from signal processing to statistical
frameworks and machine learning techniques. In the following, in order to describe the components of
the ASR module that are relevant for the project, a brief introduction to speech recognition topics
is provided.
2.2.1 Brief introduction to automatic speech recognition
Speech recognition systems do not actually perform the recognition or decoding step directly on the
speech signal. Rather, the speech waveform is divided into short frames of samples, which are con-
verted to a meaningful set of features. The duration of the frames is selected so that the speech wave-
form can be regarded as being stationary. In addition to this transformation, some pre-processing tech-
niques are applied to the waveform signal in order to enhance it and better prepare it for speech
recognition.
In the feature extraction step, the sampled speech signal is parametrized. The goal is to extract
a number of parameters (‘features’) from each frame of the signal that capture the relevant speech
information while being robust to acoustic variations and sensitive to linguistic context. In more detail,
features should be robust against noise and against factors that are irrelevant for the recognition
process; at the same time, features must be discriminant, allowing different linguistic units (e.g.,
phones) to be distinguished.
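The framing step described above can be illustrated with a short sketch. The 25 ms window, 10 ms shift and Hamming window below are common textbook choices, not necessarily the exact AUDIMUS front-end settings:

```python
import numpy as np

def frame_signal(signal, sample_rate=8000, frame_ms=25, shift_ms=10):
    """Split a waveform into short, overlapping frames so that each
    frame can be treated as quasi-stationary for feature extraction."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    shift = int(sample_rate * shift_ms / 1000)       # hop size in samples
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    frames = np.stack([signal[i * shift : i * shift + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)  # taper frame edges before analysis

frames = frame_signal(np.random.randn(8000))  # one second of audio at 8 kHz
print(frames.shape)  # (98, 200): 98 overlapping frames of 200 samples each
```

Each row of the resulting matrix is then converted into a feature vector (e.g. PLP or RASTA coefficients) by the front-end.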
Then, the next stage in the recognition process is to map the speech vectors obtained in the previous
step to the desired underlying sequence of acoustic classes modelling concrete symbols (phonemes,
letters, words...). Acoustic modelling is arguably the central part of any speech recognition system; it
plays a critical role in improving ASR performance. The practical challenge is how to build accurate
acoustic models that can truly reflect the spoken language to be recognized. Typically, sub-word models
like phonemes, diphones or triphones are used as the acoustic modelling unit more often than word
models. A widespread and successful statistical parametric approach to speech recognition is the
Hidden Markov Model (HMM) paradigm [Rabiner 89, Rabiner 93] that supports both acoustic
and temporal modelling. HMMs model the sequence of feature vectors as a piecewise stationary pro-
cess. An utterance X = x1, . . . , xn, . . . , xN is modelled as a succession of discrete stationary states
Q = q1, . . . , qk, . . . , qK , K < N , with instantaneous transitions between these states. An HMM is typi-
cally defined as a stochastic finite state automaton, usually with a left-to-right topology. It is called a
“hidden” Markov model because the underlying stochastic process (the sequence of states) is not directly
observable, but it still affects the observed sequence of acoustic features. Alternatively, Artificial Neural
Networks (ANNs) have been proposed as an efficient approach to acoustic modelling [Tebelskis 95].
Although ANNs have been used for difficult pattern recognition problems for the past thirty years,
more recently many researchers have shown that these networks can be used to estimate probabilities
that are useful for speech recognition. Multilayer Perceptrons (MLPs) are the most common ANNs used for
speech recognition. Typically, MLPs have a layered feedforward architecture with an input layer, zero
or more hidden layers, and one output layer. ANN-HMM hybrid systems have been the focus of research
aiming to combine the strengths of the two approaches [Morgan 95]. Systems based on this connectionist
approach have performed very well on Large Vocabulary Continuous Speech Recognition (LVCSR) tasks.
Knowledge of the rules of a language, the way in which words are connected together into phrases,
is expressed by the language model. It is an important building block in the recognition process, as it is
used to guide the search for an interpretation of the acoustic input. There are two types of models
that describe a language: grammar-based and statistical language models. When the range of
sentences to be recognized is very small, it can be captured by a deterministic grammar that describes
the set of allowed phrases. In large vocabulary applications, on the other hand, it is too difficult to write a
grammar with sufficient coverage of the language, therefore a stochastic grammar, typically an n-gram
model is often used. An n-gram grammar is a representation of an (n−1)-th order Markov language model
in which the probability of occurrence of a symbol is conditioned upon the prior occurrence of n−1 other
symbols. When sub-word models are used, the word model is then obtained by concatenating the sub-
word models according to the pronunciation transcription of the words in a dictionary or lexical model. Its
purpose is to map the orthography of the words in the search vocabulary to the units that model the
actual acoustic realization of the vocabulary entries. Lexicon generation may rely on manually built
dictionaries or on automatic grapheme-to-phoneme modules, which may follow rule-based, data-driven
or hybrid approaches.
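A minimal maximum-likelihood bigram model makes the n-gram idea concrete (toy corpus, no smoothing, purely illustrative):

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate bigram probabilities P(w_n | w_{n-1}) by maximum
    likelihood from a toy corpus; no smoothing is applied."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(words[:-1])                  # history counts
        bigrams.update(zip(words[:-1], words[1:]))   # word-pair counts
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev]

p = train_bigram(["the cat sat", "the dog sat", "the cat ran"])
print(p("the", "cat"))  # 2/3: "cat" follows "the" in two of three sentences
```

Real LVCSR language models add smoothing (e.g. back-off) so that unseen word pairs do not receive zero probability.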
The last step in the recognition process is the decoding phase, whose objective is to find a sequence
of words whose corresponding acoustic and language models best match the input signal. Therefore,
such a decoding process with trained acoustic and language models is often referred to as a search pro-
cess. Its complexity varies according to the recognition strategy and to the size of the vocabulary. In
Isolated Word Recognition (IWR), word boundaries are known and the word with the highest forward
probability is chosen as the recognized word, so the search problem becomes a simple pattern recognition
problem. Search in Continuous Speech Recognition (CSR), on the other hand, is more complicated, since
the search algorithm has to consider the possibility of each word starting at any arbitrary time frame. For
small vocabulary tasks, it is possible to expand the whole search network defined by the language and
lexical restrictions and to directly apply conventional time-synchronous Viterbi search. In LVCSR
systems, however, different strategies must be adopted. These range from graph compaction techniques
and on-the-fly expansion of the search space [Ortmanns 00] to heuristic methods.
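As an illustration of time-synchronous Viterbi search, the toy sketch below decodes a two-state left-to-right model in the log domain. It is a didactic example, far simpler than a WFST-based LVCSR decoder:

```python
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Find the most likely state sequence for a T x K matrix of
    per-frame log emission scores (time-synchronous search)."""
    T, K = log_emit.shape
    delta = log_init + log_emit[0]           # best score ending in each state
    back = np.zeros((T, K), dtype=int)       # best predecessor per state/frame
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # K x K predecessor scores
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]             # trace back the best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

NEG = -1e9  # stand-in for log(0)
# Two-state left-to-right model: frames 0-1 favour state 0, frames 2-3 state 1.
log_emit = np.array([[0., -5.], [0., -5.], [-5., 0.], [-5., 0.]])
log_trans = np.array([[np.log(.5), np.log(.5)], [NEG, 0.]])
print(viterbi(log_emit, log_trans, np.array([0., NEG])))  # [0, 0, 1, 1]
```

The same recursion underlies LVCSR decoding, where the state graph is the composed lexical and language-model network rather than a hand-built toy.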
2.2.2 AUDIMUS speech recognizer
AUDIMUS is the ASR system developed by the Spoken Language Processing Lab of INESC-ID (L2F)
group and integrated into the VITHEA system. It is the result of several years of research efforts
dedicated to the development of ASR systems. AUDIMUS is a hybrid recognizer that follows the
above-mentioned connectionist approach [Morgan 95]. It combines the temporal modelling capacity of HMMs
with the pattern discriminative classification of MLP. A Markov process is used to model the basic
temporal nature of the speech signal, while an ANN is used to estimate posterior phone probabilities
given the acoustic data at each frame. As shown in Fig. 2.1, the baseline system combines three MLP
outputs trained with different feature sets: Perceptual Linear Predictive (PLP, 13 static + first deriva-
tive) [Hermansky 90], log-RelAtive SpecTrAl (RASTA, 13 static + first derivative) [Hermansky 92], and
Modulation SpectroGram (MSG, 28 static) [Kingsbury 98]. This merged approach has proved to be
more efficient and robust than using any of the feature sets individually [Meinedo 00]. This is explained
by the integration of the complementary advantages of the three feature sets: PLP incorporates
attributes of the psychological processes of human hearing into the analysis, making speech perception
more human-like [Jamaati 08]; RASTA compensates for linear channel distortions; and MSG provides
improved stability in the presence of acoustic interferences such as high levels of background noise and
reverberation [Koller 10]. The AUDIMUS decoder
is based on a Weighted Finite State Transducer (WFST) approach to large vocabulary speech recogni-
tion [Mohri 02, Caseiro 06]. AUDIMUS integrates a rule-based grapheme-to-phone conversion module
based on WFSTs for European Portuguese [Caseiro 02]. The acoustic model integrated in VITHEA was
trained with 57 hours of downsampled Broadcast News data and 58 hours of mixed fixed-telephone and
mobile-telephone data in European Portuguese [Abad 08].
Figure 2.1: Block diagram of AUDIMUS speech recognition system.
2.2.3 Automatic word verification
The task that evaluates the utterances spoken by the patients, similarly to the role of the therapist
in a rehabilitation session, is referred to as word verification. This task consists of deciding whether
a claimed word W is uttered in a given speech segment S or not. In the simplest case, a true/false
answer is provided, but a verification score might also be generated. It should be noted that
the task has been called word verification, although it actually refers to term verification, since a keyword
may in fact consist of more than one word (e.g. rocking chair).
2.2.3.0.1 Word verification based on keyword spotting Several approaches based on speech
recognition technology exist to tackle the word verification problem. Given that the word W is known, forced
alignment with an ASR system could be one of the most straightforward possibilities. However, speech
from aphasic patients contains a considerable amount of hesitations, doubts, repetitions, descriptions
and other speech disturbing factors that are known to degrade ASR performance, and consequently, this
will further affect the alignment process. These issues led us to consider the forced alignment approach
unsuitable for the word verification task. Alternatively, keyword spotting methods can better deal with
unexpected speech effects. The objective of keyword spotting is to detect a certain set of words of interest
in the continuous audio stream. In fact, word verification can be considered a particular case of keyword
spotting (with a single search term) and similar approaches can be used.
Keyword spotting approaches can be broadly classified into two categories [Szoke 05]: based on
LVCSR or based on acoustic matching of speech with keyword models in contrast to a background
model. Methods based on LVCSR search for the target keywords in the recognition results, usually
in lattices, confusion networks or n-best hypothesis results since they allow improved performances
compared to searching in the 1-best raw output result. The training process of an LVCSR system
requires large amounts of audio and text data, which may be a limitation in some cases. Additionally,
LVCSR systems make use of fixed large vocabularies (>100K words), but when a specific keyword is
not included in the dictionary, it is never detected. Acoustic approaches are very closely related to IWR.
They basically extend the IWR framework by incorporating an alternative competing model to the list of
keywords generally known as background, garbage or filler speech model. A robust background speech
model must be able to provide low recognition likelihoods for the keywords and high likelihoods for
out-of-vocabulary words, in order to minimize false alarms and false rejections when CSR is performed. Like
in the IWR framework, keyword models can be word-based or phonetic-based (or sub-phonetic). The
latter allows simple modification of the target keywords since they are described by their sequence of
phonetic units.
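The competition between keyword and background models can be reduced to a likelihood-ratio decision. The rule below is a generic sketch of this idea, not the exact scoring used in AUDIMUS:

```python
def keyword_detected(kw_loglik, bg_loglik, threshold=0.0):
    """Detect the keyword when its model beats the background/filler
    model by more than a tunable threshold (log-likelihood ratio test).
    Raising the threshold trades false alarms for false rejections."""
    return (kw_loglik - bg_loglik) > threshold

print(keyword_detected(-120.0, -131.5))  # True: the keyword model wins
print(keyword_detected(-140.0, -131.5))  # False: the background model wins
```

Tuning the threshold sets the operating point between missing valid answers and accepting wrong ones.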
In order to choose the best approach for this task, preliminary experiments were conducted on a
telephone speech corpus considering both LVCSR and acoustic matching approaches [Abad 13]. According
to the results obtained, acoustic-based approaches were considered more adequate for the type of
problem addressed in the on-line therapy system.
2.2.3.0.2 Keyword spotting with AUDIMUS To implement the technique described in the previous
section and successfully integrate it into the VITHEA system, the baseline ASR system was modified
to incorporate a competing background speech model that is estimated without the need for acoustic
model re-training.
While keyword models are described by their sequence of phonetic units provided by an automatic
grapheme-to-phoneme module, the problem of background speech modelling must be specifically ad-
dressed. The most common approach consists of building a new phoneme classification network that in
addition to the conventional phoneme set, also models the posterior probability of a background speech
unit representing “general speech”. This is usually done by using all the training speech as positive
examples for background modelling and requires re-training the acoustic networks. Alternatively, the
posterior probability of the background unit can be estimated based on the posterior probabilities of
the other phones [Pinto 07]. The second approach has been followed, estimating the likelihoods of a
background speech unit as the mean of the top-6 most likely outputs of the phonetic network at each
time frame. In this way, there is no need for acoustic network re-training. The minimum duration for the
background speech word is fixed to 250 msec.
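The estimation of the background unit from the phonetic network outputs can be sketched as follows. The top-6 averaging follows the description above; the posterior values themselves are invented for illustration:

```python
import numpy as np

def background_posterior(phone_posteriors, top_n=6):
    """Score a 'general speech' background unit as the mean of the
    top-N phone posteriors at each frame, so that no acoustic network
    re-training is required (top_n=6 as described in the text)."""
    top = np.sort(phone_posteriors, axis=1)[:, -top_n:]  # frames x top_n
    return top.mean(axis=1)                              # one score per frame

# Toy posteriors: 2 frames over a 10-phone output layer (made-up values).
post = np.array([[.30, .20, .10, .10, .10, .05, .05, .04, .03, .03],
                 [.50, .30, .05, .05, .03, .02, .02, .01, .01, .01]])
print(background_posterior(post))  # mean of the 6 largest values per frame
```

Frames dominated by a few confident phones yield a high background score, which lets the filler model absorb out-of-vocabulary speech.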
2.3 On-line tools for voice disorders

To the best of our knowledge, there are only a few therapeutic tools that support automatic evaluation through
speech recognition. Two of the most outstanding are PEAKS (Program for Evaluation and Analysis of all
Kinds of Speech disorders) and VITHEA (Virtual Therapist for Aphasia Treatment). PEAKS [Maier 09]
Kinds of Speech disorders) and VITHEA (Virtual Therapist for Aphasia Treatment). PEAKS [Maier 09]
is an on-line recording and analysis environment for the automatic or manual evaluation of voice and
speech disorders. Once connected to the system, a patient may perform a standardized test which
is then analysed by automatic speech recognition and prosodic analysis. The result is presented to
the user, and can be compared to previous recordings of the same patient or to recordings from other
patients.
VITHEA [Abad 13] is an on-line platform designed to act as a “virtual therapist” for the treatment of
Portuguese-speaking aphasic patients. The system allows word naming exercises, wherein the patient
is asked to recall the content presented in a photo or picture. By means of automatic speech
recognition, the system processes what is said by the patient and decides whether it is correct or wrong.
The program provides feedback both as a written solution and as a spoken message produced by an
animated agent using text-to-speech synthesis.

Figure 2.2: Comprehensive overview of the VITHEA system.
The VITHEA system, the target of this work, will be described in depth in the following sections.
2.3.1 VITHEA: An on-line system for virtual treatment of aphasia
The on-line system described in [Pompili 11] is the first prototype for aphasia treatment resulting from the
collaboration of the Spoken Language Processing Lab of INESC-ID (L2F) and the Language Research
Laboratory of the Lisbon Faculty of Medicine (LEL), which has been developed in the context of the
activities of the Portuguese national project VITHEA1. It consists of a web-based platform that permits
speech-language therapists to easily create therapy exercises that can be later accessed by aphasia
patients using a web-browser. During the training sessions, the role of the therapist is taken by a “virtual
therapist” that presents the exercises and that is able to validate the patients’ answers. The overall flow
of the system can be described as follows: when a therapy session starts, the virtual therapist shows to
the patient, one at a time, a series of visual or auditory stimuli. The patient is then required to respond
verbally to these stimuli by naming the contents of the object or action that is represented. The utterance
produced is recorded, encoded and sent via network to the server side. Here, a web application server
receives the audio file and processes it with an ASR module, which generates a textual representation.
This result is then compared with a set of predetermined textual answers (for the given question) in order
to verify the correctness of the patient’s input. Finally, feedback is sent back to the patient. Figure 2.2
shows a comprehensive view of this process. In practice, the platform is intended not only to serve as an
alternative, but most importantly, as a complement to conventional speech-language therapy sessions,
permitting intensive and inexpensive therapy for patients, besides providing therapists with a tool to
assess and track the evolution of their patients.
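The server-side verification step, comparing the ASR output with the set of predetermined answers, can be pictured with a simplified sketch. The accent- and case-insensitive substring matching below is an illustrative assumption, not VITHEA's actual matching policy:

```python
import unicodedata

def _norm(s):
    """Lowercase and strip diacritics so that 'Gato' matches 'gato'."""
    decomposed = unicodedata.normalize("NFD", s.lower().strip())
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def validate_answer(recognized, accepted_answers):
    """Accept the utterance if any accepted term (canonical answer,
    synonym or diminutive) occurs in the recognized text."""
    hyp = _norm(recognized)
    return any(_norm(a) in hyp for a in accepted_answers)

print(validate_answer("é um gato", ["gato", "gatinho"]))  # True
print(validate_answer("não sei", ["gato", "gatinho"]))    # False
```

Matching against a list of accepted variants rather than a single canonical word mirrors the extended word list maintained by the clinicians for each stimulus.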
The various approaches for aphasia rehabilitation introduced in Section 2.1.2 serve different purposes.
Most of them are focused on restoring language abilities; others are intended to compensate for
language problems and to teach other methods of communicating. The approach followed by the VITHEA
system falls in the first category, aiming at restoring linguistic processing by means of linguistic
exercises. In particular, the focus of the system is on the recovery of the word naming ability of
Portuguese-speaking aphasic patients.

1http://www.vithea.org
2.3.1.1 The patient and the clinician applications
The system comprises two specific modules, dedicated respectively to the patients for carrying out the
therapy sessions and to the clinicians for the administration of the functionalities related to them. The
two modules adhere to different requirements that have been defined for the particular class of user for
which they have been developed. Nonetheless, they share the set of training exercises, which are built by
the clinicians and performed by the patients.
2.3.1.1.1 Patient application module The patient module is meant to be used by aphasic individuals
to perform the therapeutic exercises. Figure 2.3 illustrates some screen-shots of the patient module.
Exercise protocol Following the common therapeutic approach for treatment of word finding difficul-
ties, a training exercise is composed of several semantic stimuli items. Stimuli may be of several
different types (text, audio, image and video) and they are classified according to themes, in order
to immerse the individual in a pragmatic, familiar environment. Like in ordinary speech-language
therapy sessions, once the patient is logged into the system, the virtual therapist guides him/her
in carrying out the training sessions, providing a list of possible exercises to be performed. When
the patient chooses to start a training exercise, the system presents the target stimuli one at a time in
random order, and he/she is asked to respond to each stimulus verbally. After the evaluation of the
patient’s answer by the system, the patient can listen again to his/her previous answer, record an
utterance in case of invalid answer or skip to the next exercise.
Exercise interface The exercise interface has been designed to cope with the functionalities needed
for automatic word recalling therapy exercises, which include, among others, the integration of
an animated virtual character (the virtual therapist), Text-To-Speech (TTS) synthesized voice, im-
age and video displaying, speech recording and play-back functionalities, automatic word naming
recognition and exercise validation and feed-back prompting, besides conventional exercise navi-
gation options. Additionally, the exercise interface has also been designed to maximize simplicity
and accessibility. First, because most of the users for whom this application is intended suffered a
CVA and they may also have some sort of physical disability. Second, because aphasia is a pre-
dominant disorder among elderly people, who are more prone to suffer from visual impairments.
Thus, the graphic elements were carefully chosen, using big icons throughout the interface.
2.3.1.1.2 Virtual character animation and speech synthesis The virtual therapist’s representation
to the user is achieved through a three-dimensional (3D) game environment with speech synthesis capa-
bilities. Within the context of the VITHEA application, the game environment is essentially dedicated to
graphical computations, which are performed locally on the user’s computer. Speech synthesis genera-
tion occurs on a remote server, thus ensuring adequate hardware performance. The game environment is
Figure 2.3: Screen-shots of the VITHEA patient application.
based on the Unity2 game engine; it contains a low-poly 3D model of a cartoon character with visemes
and facial emotions, which receives and forwards text (dynamically generated according to the sys-
tem’s flow) to the TTS server. Upon server reply, the character’s lips are synchronized with synthesized
speech.
2.3.1.1.3 Speech synthesis DIXI [Paulo 08] is the TTS engine developed by the Spoken Language
Processing Lab of INESC-ID (L2F) group and integrated into the game environment. It has been con-
figured for unit selection synthesis with an open domain cluster voice for European Portuguese. DIXI is
used to gather SAMPA phonemes [Trancoso 03], their timings and raw audio signal information, which
is lossily encoded for usage in the client game. The phoneme timings are essential for the visual output
of the synthesized speech, since the difference between consecutive phoneme timings determines the
amount of time a viseme should be animated.
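The timing arithmetic described above amounts to taking differences of consecutive phoneme start times. A minimal sketch with hypothetical timings:

```python
def viseme_durations(timings):
    """Given phoneme start times plus the utterance end time (seconds),
    each viseme is shown for the gap between consecutive timings.
    Rounding only suppresses floating-point noise in the output."""
    return [round(t2 - t1, 3) for t1, t2 in zip(timings, timings[1:])]

# Hypothetical timings for four phonemes followed by the utterance end.
print(viseme_durations([0.00, 0.08, 0.20, 0.27, 0.40]))
# [0.08, 0.12, 0.07, 0.13]
```

The animation loop holds each viseme on screen for the corresponding duration while the encoded audio plays.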
2.3.1.1.4 Clinician application module The clinician module is specifically designed to allow clini-
cians to manage patient data, to regulate the creation of new stimuli and the alteration of the existing
ones, and to monitor user performance in terms of frequency of access to the system and user progress.
The module is composed of three sub-modules:
User management This module allows the management of a knowledge base of patients that can be
edited by the therapist at any time. Besides basic information related to the user personal profile,
2http://unity3d.com/
the database also stores for each individual his/her type of aphasia, his/her aphasia severity (7-
level subjective scale) and aphasia quotient (AQ) information from the Western Aphasia Battery.
Exercise editor This module allows the clinician to create, update, preview and delete stimuli from an
exercise in an intuitive fashion similar in style to a WYSIWYG editor. In addition to the canonical
valid answer, the system accepts for each stimulus an extended word list comprising the most
frequent synonyms and diminutives.
Since the stimuli are associated with a wide assortment of multimedia files, besides their manage-
ment, the module also provides a rich Web based interface to manage the database of multimedia
resources used within the stimuli. The system is capable of handling a wide range of multimedia
encodings: audio (accepted file types: wav, mp3), video (accepted file types: wmv, avi, mov, mp4,
mpe, mpeg, mpg, swf), and images (accepted file types: jpe, jpeg, jpg, png, gif, bmp, tif, tiff). Given
the diversity of the various file types accepted by the system, a conversion to a unique file type was
needed, in order to show them all with only one external tool. Audio files are therefore converted
to the mp3 file format, while video files are converted to the flv file format. Figures 2.4 and 2.5 illustrate
some screen-shots of the clinician module.
Figure 2.4: Interface for the creation of a new stimulus.
Figure 2.5: Interface for the management of multimedia resources.

Patient tracking This module allows the clinician to monitor statistical information related to user-
system interactions and to access the utterances produced by the patient during the therapeutic
sessions. The statistical information comprises data related to the user’s progress and to the fre-
quency with which users access the system. On the one hand, all the attempts recorded by the
patients are stored in order to allow a re-evaluation by clinicians. This data can be used to identify
possible weaknesses or errors from the recognition engine. On the other hand, monitoring the
usage of the application by the patients will permit the speech-language therapist to assess the
effectiveness of the platform and its impact on the patients’ recovery progress.
2.3.1.2 Platform architecture overview
An ad-hoc multi-tier framework that adheres to the VITHEA requirements has been developed by inte-
grating different heterogeneous technologies. The back-end of the system relies on some of the most
advanced open source frameworks for the development of web applications: Apache Tiles, Apache
Struts 2, Hibernate and Spring. These frameworks follow the best practices and principles of software
engineering, thus guaranteeing the reliability of the system on critical tasks such as database access,
security, session management, etc. The back-end side also integrates the L2F speech recognition system
(AUDIMUS, [Meinedo 03, Meinedo 10]) and TTS synthesizer (DIXI, [Paulo 08]). The ASR component
is the backbone of the system and it is responsible for the validation or rejection of the answers pro-
vided by the user. TTS and facial animation technologies allow the virtual therapist to “speak” the text
associated with a stimulus and supply positive reinforcement to the user. The client side also exploits
Adobe® Flash® technology to support rich multimedia interaction, which includes audio and video stim-
uli reproduction and recording and play-back of patients’ answers. Finally, the system implements a data
architecture that allows handling groups of speech-language therapists and groups of patients. Thus,
a user may belong to a specific group of patients and this group can be assigned to a therapist or to
a group of therapists. Therapists who belong to the same group share the clinical information of the
patients, the set of therapeutic exercises, and also the set of resources used within the various stimuli.
In this way patients with the same type and/or degree of severity of aphasia can be clustered together
and take advantage of exercises and stimuli that are tailored to their specific disorder, thus improving
the benefits resulting from a therapeutic training session.
2.4 State of the art for the new features

This section is devoted to providing the relevant state of the art for each of the new features targeted
by this work.
2.4.1 Content adaptation for mobile devices
To make the VITHEA services also available from mobile devices, new client applications that adhere
to the specific device standards have to be designed and built. This means that two separate software
applications have to be built for Android- and iOS-based devices; this work only addresses
the Android platform. On the other hand, the server-side services already provided by the system
should preserve their original business logic, so that only the exposure of the services is affected. These
constraints point toward a Service-Oriented Architecture (SOA). SOA is a set of principles
and methodologies for designing and developing software in the form of interoperable services. Here,
services are well-defined business functionalities that are built as software components that can be
reused for different purposes. Web services are the typical usage scenario for implementing a SOA
architecture: they allow the functional building blocks to be accessible over standard Internet protocols,
independently of platforms and programming languages. In this scenario, the most widely used
technologies for implementing a SOA architecture rely on the Simple Object Access Protocol (SOAP),
on Remote Procedure Calls (RPC), or on Representational State Transfer (REST) approaches.
SOAP is a message transport protocol for exchanging structured information in the implementation
of web services in computer networks; it has been accepted as the default message protocol in SOA.
SOAP messages are created by wrapping application-specific XML messages within a standard XML-
based envelope structure. The result is an extensible message structure which can be transported over
most underlying network transports, such as SMTP and HTTP.
RPC is an inter-process communication mechanism that allows calling a procedure in another address
space and exchanging data by message passing. Method stubs on the client make the call appear local,
while taking care of marshalling the request and sending it to the server process. The server process
then unmarshals the request and invokes the desired method, before replying to the client through the
reverse procedure.
REST is an architectural style for distributed hypermedia systems. It describes an architecture where
each resource, such as a web service, is identified by a unique Uniform Resource Identifier (URI).
The principle of REST is to use the HTTP protocol as it was designed, accessing and modifying
resources through the standardized HTTP methods GET, POST, PUT, and DELETE.
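The correspondence between resource operations and HTTP methods can be illustrated with a small sketch. The `/stimuli` URI layout below is hypothetical and not part of the VITHEA system; it only shows how CRUD actions on a resource map to the standardized methods.

```java
import java.util.Map;

// Toy illustration of the REST convention: CRUD operations on a
// hypothetical "stimulus" resource expressed as HTTP method + URI.
public class RestRoutes {
    public enum Action { CREATE, READ, UPDATE, DELETE }

    private static final Map<Action, String> METHODS = Map.of(
            Action.CREATE, "POST",
            Action.READ,   "GET",
            Action.UPDATE, "PUT",
            Action.DELETE, "DELETE");

    // CREATE targets the collection URI; the others target one resource.
    public static String request(Action action, int id) {
        String uri = (action == Action.CREATE) ? "/stimuli" : "/stimuli/" + id;
        return METHODS.get(action) + " " + uri;
    }

    public static void main(String[] args) {
        System.out.println(request(Action.READ, 42));   // GET /stimuli/42
        System.out.println(request(Action.CREATE, 0));  // POST /stimuli
    }
}
```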
One of the main criticisms of SOAP relates to the way SOAP messages are wrapped within
an envelope. Because of the verbose XML format, SOAP can be considerably slower than competing
middleware technologies. A disadvantage of RPC is that the set of legal actions available on the
server has to be explicitly defined at build time, since these actions are wrapped by the method stubs
that are consumed by the client. In a REST scenario, on the other hand, the client and the server are
much more loosely coupled: the contract between the two parties is minimal, and in the case of HTTP’s
implementation of REST it corresponds to a single URI that can be accessed through a GET request.
Thus, in a larger context, SOAP is the de facto standard for web service message exchange;
within a mobile context, however, the REST architecture is considered more lightweight
[Richardson 07] than a SOAP-based web service architecture, since it avoids the heavy operations
that the SOAP approach requires in order to maintain a standard format [Knutsen 09].
2.4.2 Hands-free speech
One of the main challenges in the implementation of the hands-free interface is the design of a
robust VAD algorithm. VAD aims at determining the presence or absence of speech. This technique is
useful both for speech coding and for speech recognition, and has thus been the object of many studies
leading to several different approaches. In [Sangwan 02] the authors designed a customized algorithm
for real-time speech transmission based on the energy of the input signal. This work relies on the
estimation of an adaptive threshold representative of the background noise. Two refined strategies are
defined to recover from misclassification errors that may result from the energy detector. The first of
these strategies is based on a feature of the signal, the zero-crossing rate, while the second relies on the
autocorrelation function.
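The general idea of an energy-based detector with an adaptive noise threshold can be sketched as follows. This is a minimal illustration in the spirit of such algorithms, not the detector of [Sangwan 02]: the frame size, smoothing factor and margin are illustrative choices.

```java
// Minimal energy-based VAD sketch: a frame is classified as speech when
// its energy exceeds an adaptively tracked noise-floor estimate; the
// zero-crossing rate is the auxiliary feature mentioned in the text.
public class EnergyVad {
    private double noiseEnergy = -1;            // running background-noise estimate
    private static final double ALPHA = 0.95;   // noise adaptation factor (illustrative)
    private static final double MARGIN = 3.0;   // speech must exceed noise * MARGIN

    static double frameEnergy(short[] frame) {
        double e = 0;
        for (short s : frame) e += (double) s * s;
        return e / frame.length;
    }

    // Zero-crossing rate: fraction of adjacent sample pairs with a sign change.
    static double zeroCrossingRate(short[] frame) {
        int zc = 0;
        for (int i = 1; i < frame.length; i++)
            if ((frame[i - 1] >= 0) != (frame[i] >= 0)) zc++;
        return (double) zc / frame.length;
    }

    public boolean isSpeech(short[] frame) {
        double e = frameEnergy(frame);
        if (noiseEnergy < 0) noiseEnergy = e;   // bootstrap on the first frame
        boolean speech = e > noiseEnergy * MARGIN;
        if (!speech)                            // adapt the floor on non-speech frames only
            noiseEnergy = ALPHA * noiseEnergy + (1 - ALPHA) * e;
        return speech;
    }

    public static void main(String[] args) {
        EnergyVad vad = new EnergyVad();
        short[] silence = new short[160];       // 10 ms frame at 16 kHz, all zeros
        short[] voiced = new short[160];
        for (int i = 0; i < voiced.length; i++) voiced[i] = (short) (i % 2 == 0 ? 8000 : -8000);
        System.out.println(vad.isSpeech(silence));  // false
        System.out.println(vad.isSpeech(voiced));   // true
    }
}
```

A real detector would add hangover smoothing and the refinement strategies described above to recover weak fricatives and trailing speech.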
In [Chuangsuwanich 11] the authors investigate a VAD approach for real-world applications using a
two-stage approach based on two distinguishing features of speech, namely harmonicity and modulation
frequency.
In [Ramirez 04] the authors employ long-term signal processing and maximum spectral component
tracking to improve the VAD algorithm. With the introduction of a noise reduction stage before the
long-term spectral tracking, the authors are able to recover from misclassification errors even in highly
noisy environments. Experimental results appear to confirm the improvement with respect to VAD
methods based on speech/pause discrimination.
2.4.3 Exploiting IR for improved search functionality
The intrinsic ambiguity of natural language is a well-known problem in human understanding that
also affects, with far greater issues, the computational processing of data related to human-computer
interaction. Different issues influence different areas; among these, the possibility of expressing the same
concept using different synonyms has a strong impact on the recall of most information retrieval systems.
The methods for tackling this problem split into two major classes: global and local methods. The first
includes techniques for expanding or reformulating the original query terms, so as to cause the new query
to match other semantically similar terms. These techniques, known as “query expansion”, may be based
on controlled vocabularies, manually or automatically derived thesauri, or on log mining. Local methods,
on the other hand, try to adjust a query relative to the results that initially appear to match it; the
most used techniques in this context are known as “relevance feedback”.
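The global approach can be sketched with a toy example. The hand-built synonym table below is hypothetical; a real system would draw the synonyms from WordNet, the UMLS metathesaurus, or a corpus-derived thesaurus, as discussed next.

```java
import java.util.*;

// Toy sketch of global query expansion: each query term is OR-ed with
// its known synonyms, widening the set of documents the query can match.
public class QueryExpansion {
    private final Map<String, List<String>> thesaurus = new HashMap<>();

    public void addSynonyms(String term, String... synonyms) {
        thesaurus.put(term, Arrays.asList(synonyms));
    }

    public String expand(String query) {
        List<String> parts = new ArrayList<>();
        for (String term : query.toLowerCase().split("\\s+")) {
            List<String> syns = thesaurus.getOrDefault(term, List.of());
            if (syns.isEmpty()) {
                parts.add(term);                      // no synonyms: keep the term as-is
            } else {
                List<String> group = new ArrayList<>();
                group.add(term);
                group.addAll(syns);
                parts.add("(" + String.join(" OR ", group) + ")");
            }
        }
        return String.join(" AND ", parts);
    }

    public static void main(String[] args) {
        QueryExpansion qe = new QueryExpansion();
        qe.addSynonyms("picture", "image", "photo");
        System.out.println(qe.expand("dog picture"));
        // dog AND (picture OR image OR photo)
    }
}
```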
In this work the potential of query expansion will be used to provide an enhanced search experience,
which should allow a better management of the system data. To this purpose, some of the many
approaches reported in the literature to address this task are briefly described, classified on the basis
of the method used. Lexical resources like WordNet or the UMLS metathesaurus are commonly
exploited for query expansion. In [Voorhees 94], lexical-semantic relations are used to improve search
performance in large data collections. In [Aronson 97], the authors explore the MetaMap program for
associating metathesaurus concepts with the original query in order to retrieve MEDLINE citations.
Many other approaches use corpora or lexical resources to automatically develop a thesaurus. However,
most of these methods are used in domain-specific search engines or applications. In [Gong 05], the
authors used WordNet and a TSN (Term Semantic Network) developed using word co-occurrence in a
corpus. Here, the authors used the TSN as a filter and supplement for WordNet. Finally, with the
increase in usage of web search engines, it has become easy to collect and use user query logs. [Cui 02]
developed a system that extracts probabilistic correlations between query terms and document terms
using query logs.
2.4.4 New automatic evocation exercises for therapy treatment
There exist several naming exercises for the recovery of lost communication abilities. Among these we
mention category naming, confrontation naming, automatic closure naming, automatic serial naming,
recognition naming, repetition naming, and responsive naming. Some of them are already provided by
the VITHEA system, namely visual confrontation, automatic closure naming, and responsive naming.
Category naming is a task for assessing the ability to classify semantically related words and concepts
into various word-frequency categories, which are perceptual, conceptual or semantic, and functional
categories. Perceptual categories are defined on the basis of a relevant sensory quality of a stimulus,
such as shape, size or colour. Conceptual or semantic categories are defined on the basis of a generalized
idea of a class of objects. Functional categories are defined on the basis of an action or function
associated with a class of objects [Campbell 05, Murray 01].
Automatic serial naming is a task for assessing the ability to produce rote or overlearned material. A
patient may be asked to do tasks such as counting from 1 to 20, naming the days of the week, writing
out the letters of the alphabet, and/or reciting well-known prayers or nursery rhymes [Campbell 05,
Murray 01].
Recognition naming is a task for assessing the ability to recognize words. It is used when patients
are unable to name an item. The patient may be required to indicate the correct word from verbal or
written choices. For example, for the target stimulus “elephant” the patient has to indicate the correct
word from three verbal or written choices such as “giraffe”, “elephant”, “telephone”.
Repetition naming is a task for assessing the repetition or copying ability of patients who cannot
verbally name or write.
Currently, with the exception of the VITHEA system, there does not appear to exist in the literature
an automatic implementation, through speech recognition, of the above-mentioned exercises.
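The verification step behind such an automatic exercise can be sketched as a comparison between the recognizer's hypothesis and the set of valid answers for a stimulus. This is a hypothetical illustration only: in VITHEA the validation is performed server-side by the AUDIMUS recognizer, and the normalization and matching rules below are assumptions.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of word-naming answer validation: the ASR hypothesis
// is lowercased and accent-stripped (useful for Portuguese), then checked
// against the valid answers; substring matching tolerates carrier phrases.
public class NamingValidator {
    private static String normalize(String s) {
        String n = Normalizer.normalize(s.toLowerCase(Locale.ROOT), Normalizer.Form.NFD);
        return n.replaceAll("\\p{M}", "").trim();   // drop combining accent marks
    }

    public static boolean accept(String asrHypothesis, Set<String> validAnswers) {
        String hyp = normalize(asrHypothesis);
        for (String answer : validAnswers)
            if (hyp.contains(normalize(answer))) return true;
        return false;
    }

    public static void main(String[] args) {
        Set<String> valid = Set.of("elefante");
        System.out.println(accept("e um elefante", valid));  // true: carrier phrase allowed
        System.out.println(accept("telefone", valid));       // false: wrong word
    }
}
```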
2.4.5 Exploiting syllable information in word naming recognition of aphasic
speech
Syllables play an important role in speech recognition; in fact, the pronunciation of a given phoneme
tends to vary depending on its location within a syllable. There is a lot of work in the literature on the
generation of new syllable prototypes derived from several different acoustic-phonetic rules. These are
often exploited to explore complementary acoustic models for speech processing. In [Hunt 04] the authors
showed how a statistical approach to phonetics could complement and improve current speech
recognition by taking syllable boundaries into account. In [Oliveira 05] three methods for dividing
European Portuguese words into syllables are presented. Experimental results have shown a percentage
of correctly recognized syllable boundaries above 99.5%, and comparable word accuracy. Also, in
[Code 94] syllabification is examined with respect to non-lexical English and German aphasic speech
automatisms (recurring utterances).
The client side of the VITHEA platform exploits Adobe® Flash® technology to record patients’ an-
swers. This module, unfortunately, limits the extension of the system to mobile devices, such as tablets
and smart-phones. This is due to a limitation in the API provided by Adobe®. In fact, the Microphone
class, used to acquire speech input, is not supported by the Flash® player running in a mobile browser.
Therefore, an ad-hoc application specifically suited for these devices has been designed and implemented.
Even though this application theoretically clones the implementation logic of the web version, the
underlying technology is different and raised several integration issues due to the heterogeneity
of the standards used. New reusable components have been developed in order to provide the server-side
services in a standardized way, accessible by heterogeneous client devices running either iOS or Android
operating systems.
Although the services have been designed in a service-oriented architecture (SOA) fashion that allows
easy deployment of client modules for different systems, in this work we have restricted ourselves to
the development of a client application running only on Android systems. Android has been chosen as a
case study because it is available as open source software, enabling developers to distribute applications
to any Android device through the Android market.
In this Chapter, Section 3.1 introduces the main standards upon which the mobile version is based,
while the architecture of the final solution is described in Section 3.2. The results of a user experi-
ence evaluation conducted with 16 users are described in Section 3.3, followed by the discussion in
Section 3.4.
In the literature review (Section 2.4.1), the range of technologies available for the implementation of
a SOA has already been discussed. The disadvantages of these standards have been analysed: the
rigidity of RPC and the additional complexity of SOAP make REST the favourite candidate for the
implementation of the new server-side services. In the following, the principles guiding the REST
architectural style are first described, then the data representation format for the exchange of
information between client and server is explained. Finally, the Android platform, used to develop the
client application, is briefly introduced.
3.1.1 Representational State Transfer
A software architecture is an abstraction of the runtime elements of a software system during some
phase of its operation [Fielding 00]. Therefore, an architecture determines how system elements are
identified and allocated, how the elements interact to form a system, the amount and granularity of com-
munication needed for interaction, and the interface protocols used for communication. In this context,
an architectural style is a coordinated set of architectural constraints that restricts the roles and features
of architectural elements, and the allowed relationships among those elements, within any architecture
that conforms to the style.
REST is an architectural style for distributed hypermedia systems. It ignores the details of component
implementation and protocol syntax in order to focus on the roles of the components themselves. REST
was introduced and defined by Roy Fielding, one of the authors of HTTP 1.1, as a hybrid style derived
from several network-based architectural styles, among which the client-server paradigm, combined with
additional constraints that define a uniform connector interface [Fielding 02].
Thus, the software engineering principles guiding REST may be defined in terms of the set of con-
straints defining the style and guiding the interactions among architectural elements.
The first remarkable constraints of the style are those of the client-server architecture and comprise
separation of concerns and stateless communication. By separating the user interface concerns from
the data storage concerns, the portability of the user interface across multiple platforms is improved,
together with scalability by simplifying the server components. A stateless communication requires
that each request from client to server must contain all of the information necessary to understand the
request, and cannot take advantage of any stored context on the server. Session state is therefore kept
entirely on the client. Like most architectural choices, the stateless constraint reflects a design trade-off.
The disadvantage is that it may decrease network performance by increasing the repetitive data (per-
interaction overhead) sent in a series of requests, since that data cannot be left on the server in a shared
context.
However, the central feature that distinguishes the REST architectural style from other network-based
styles is its emphasis on a uniform interface between components. This is achieved through four
additional interface constraints: identification of resources, manipulation of resources through
representations, self-descriptive messages, and hypermedia as the engine of application state. A brief
description of these principles is provided in the following.
A resource can be any information that can be named: a document or image, a temporal service (e.g.,
“today’s weather in Los Angeles”), a collection of other resources, and so on. To identify the particular
resource involved in an interaction between components, REST uses a resource identifier. Thus, every
resource and interconnection of resources is uniquely identified and addressable with a URI.
REST components communicate by transferring a representation of the data in a format matching
one of an evolving set of standard data types, selected dynamically based on the capabilities or desires
of the recipient and the nature of the data. That is, RESTful implementations may support more than one
representation of the same resource at the same URI, and allow clients to indicate which representation
of a resource they wish to receive.
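In HTTP this selection is typically driven by the client's Accept header. The following is a simplified sketch of such server-driven content negotiation; quality factors (";q=") and wildcards are ignored for brevity, and the supported types are illustrative.

```java
import java.util.List;

// Simplified content-negotiation sketch: pick the first media type from
// the client's Accept header that the server supports, else a default.
public class ContentNegotiation {
    private static final List<String> SUPPORTED =
            List.of("application/json", "application/xml");

    public static String select(String acceptHeader) {
        for (String type : acceptHeader.split(",")) {
            String mediaType = type.split(";")[0].trim();  // drop ";q=..." parameters
            if (SUPPORTED.contains(mediaType)) return mediaType;
        }
        return SUPPORTED.get(0);  // fall back to the server's default representation
    }

    public static void main(String[] args) {
        System.out.println(select("text/html,application/xml"));  // application/xml
    }
}
```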
Each client request and server response is a message, and RESTful implementations expect each
message to be self-descriptive; that is, each message contains all the information necessary to complete
the task. RESTful implementations also operate on the notion of a constrained set of message
types that are fully understood by both client and server. These belong to the set of HTTP methods
defined in HTTP 1.1, among them GET, HEAD, OPTIONS, PUT, POST, and DELETE.
Finally, hypermedia as the engine of application state means that changes to the current state of the
application are performed through hypermedia links. That is, clients move from state to state via URIs.
3.1.2 Data representation
As observed in the previous section, one of the principles guiding REST is the multiple representation of
the transferred data. That is, the same information can be accessed through different views, dynamically
selected by the client at runtime. This is achieved through the specification of an HTTP header field,
which defines how client and server should communicate and exchange resources. The list of standard
media types understood by clients and servers adheres to the one defined by the IANA registry1; vendor-
specific media types may also exist. Among the standard media types we cite: text/plain, text/html,
application/xml, and application/json.
{
"employees": [
{
"name": "John Crichton",
"gender": "male"
},
{
"name": "Aeryn Sun",
"gender": "female"
}
]
}
Listing 1: JSON sample code.
Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding
documents in a format that is both human-readable and machine-readable. JavaScript Object Nota-
tion (JSON) is a text-based open standard designed for human-readable data interchange. Both XML
and JSON are designed to form a simple and standardized way of describing different hierarchical data
structures and to facilitate their transportation and consumption. XML, however, as the name states, is a
markup language, thus providing the hierarchical elements being described with the possibility of having
additional attributes. This powerful descriptive ability makes it suitable for the representation of
documents and data structures, but it has the disadvantage of increasing the amount of data being sent.
JSON, on the other hand, has a very concise syntax for defining collections of elements, which means
less data, and which makes it preferable for serializing and transmitting structured data types over a
network connection. Listing 2 and Listing 1 show the description of two employees in XML and JSON,
respectively.
In this work the JSON format has been preferred for data representation.
1http://www.iana.org/assignments/media-types
<employees>
<employee>
<name>John Crichton</name>
<gender>male</gender>
</employee>
<employee>
<name>Aeryn Sun</name>
<gender>female</gender>
</employee>
</employees>
Listing 2: XML sample code.
3.1.3 Android Platform
Android is a software stack for mobile devices that includes an operating system, middleware and key
applications. It relies on a Linux 3.x kernel for core system functionality and runs code written in the Java
programming language on a specially designed virtual machine named Dalvik. The client application has
been built in compliance with the Android platform, upon the Android software development kit (SDK).
The Android SDK consists of several tools to help Android application development, including an
Eclipse IDE plugin, an emulator, debugging tools, a visual layout builder, a log monitor, and more.
The final prototype has been built exploiting the standards described in the previous sections. RESTful
web services have been provided to expose the relevant functionalities to the mobile client, adhering to
the constraints that this architectural style requires.
In particular, the stateless constraint requires that no session information is stored on the server
side, meaning that no client context data is maintained between successive requests. Each request from
any client should contain all of the information necessary to service the request, and session state is
held in the client. This has important consequences for the application logic, among them the need to
change the authentication modality. In fact, the VITHEA system is currently only accessible after the
user has authenticated properly into the application. Afterwards, the system strongly relies on the
concept of session to maintain updated user data. The stateless constraint requires breaking this
mechanism.
3.2.1 REST authentication
There are several solutions for handling authentication in a REST context. Some possible options are
HTTP basic authentication over HTTPS and a dedicated login service.
The first option relies on the standard HTTPS protocol and on the HTTP basic authentication
implementation. Basic authentication is used by most web services since it is the simplest technique for
enforcing access controls to web resources, as it requires neither cookies nor session identifiers. Rather,
basic authentication uses static, standard HTTP headers to send user login data. In its typical usage
scenario, the user is prompted for the credentials just once; the client software then computes the Base64
encoding of the credentials and includes them in each subsequent HTTP request to the server using the
Authorization HTTP header field.
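The construction of that header is simple enough to sketch directly. This is a generic illustration of the HTTP basic authentication scheme, not VITHEA code; the user name and password are placeholders.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch of how a client builds the Authorization header for HTTP basic
// authentication: "user:password" is Base64-encoded (not encrypted!) and
// attached to every request, which is why HTTPS is needed underneath.
public class BasicAuth {
    public static String header(String user, String password) {
        String credentials = user + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        System.out.println(header("anna", "secret"));  // Basic YW5uYTpzZWNyZXQ=
    }
}
```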
This simple technique presents some known drawbacks. First, it does not provide any confidentiality
protection for the transmitted credentials: they are merely encoded with Base64 in transit, but not
encrypted or hashed in any way. For this reason, basic authentication is typically used over HTTPS.
Also, because the basic authentication header has to be sent with each HTTP request, the web browser
needs to cache the credentials for a reasonable period to avoid constantly prompting the user for the
user name and password. This mechanism also provides no way to automatically expire authenticated
credentials after a period of inactivity. Finally, the user name and password are transmitted over HTTPS
to the server, although it would be more secure to let the password stay on the client side during
keyboard entry and be stored as a secure hash on the server.
The second alternative is to use a dedicated login service that accepts user credentials and returns
a token. This token is then included, as a URL argument, in each following request. A well-known
open standard for authorization is OAuth2. OAuth is an open protocol that allows users to give
permission to a third-party application or web site to access restricted resources on another web site
or service. The third-party application receives an access token with which it can make requests to the
protected service. By using this access token strategy, the user’s login credentials are never stored within
an application, and are only required when authenticating to the service. Another important advantage of
this approach is that tokens can be created with an expiration date, which is important for some services.
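The core of the token idea can be sketched in a few lines. Everything below (names, one-minute lifetime, UUID tokens) is illustrative; OAuth specifies a full protocol around this mechanism, including token refresh and third-party delegation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Minimal sketch of token-based authentication: a login service issues an
// opaque token with an expiration instant; later requests are validated
// against it, so credentials never travel after the initial login.
public class TokenService {
    private final Map<String, Long> tokens = new HashMap<>();  // token -> expiry (ms)
    private final long lifetimeMillis;

    public TokenService(long lifetimeMillis) {
        this.lifetimeMillis = lifetimeMillis;
    }

    // The clock is passed explicitly to keep the sketch deterministic.
    public String issue(long nowMillis) {
        String token = UUID.randomUUID().toString();
        tokens.put(token, nowMillis + lifetimeMillis);
        return token;
    }

    public boolean isValid(String token, long nowMillis) {
        Long expiry = tokens.get(token);
        return expiry != null && nowMillis < expiry;
    }

    public static void main(String[] args) {
        TokenService svc = new TokenService(60_000);  // 1-minute tokens
        String t = svc.issue(0);
        System.out.println(svc.isValid(t, 30_000));   // true
        System.out.println(svc.isValid(t, 120_000));  // false: expired
    }
}
```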
However, even though OAuth is an appealing solution, in the current prototype authentication
has been implemented following a hybrid approach close to basic authentication. In fact, after analysing
the benefits and drawbacks of both solutions, the simpler basic authentication over HTTPS was considered
sufficient for the purposes of the system, at least in this first version of the application.
3.2.2 Implemented architecture
To support and adhere to the standards and requirements described in the previous sections, two
major, widely used frameworks, Spring Security and Spring Web MVC, have been used to implement
the server-side services. Spring Security is a non-intrusive framework providing a set of authentication
and access-control services, including HTTP request authorization, HTTP basic authentication, and
HTTP digest authentication. Spring MVC is a framework for building flexible and loosely coupled
web applications and REST services. The Model-View-Controller design pattern, at the core of the
framework, helps separate the business logic, presentation logic and navigation logic.
For the development of the client application, the Spring for Android framework has been used. This
is an extension of the Spring Framework that aims at simplifying the development of native Android
applications. Spring for Android includes a REST client providing higher-level functions corresponding
to the six main HTTP methods, together with several conversion functionalities for the various data
representations supported. The framework also integrates OAuth, which leaves open the option of an
easy future extension. Figure 3.1 illustrates the overall architecture of the system.
2http://oauth.net/
[Figure 3.1: Architecture. On the server side, the REST services (Spring MVC, Spring Security, REST authentication, JSON) sit alongside the standard application context, the database system, and the AUDIMUS speech recognizer; the client side uses the Spring for Android REST client with JSON.]
3.2.2.0.1 Authentication In the current implementation, the user enters their credentials when
accessing the system, and these data are then stored in the client application for the whole execution
time. In each following request, this information is hashed with the MD5 message-digest algorithm,
added to the Authorization header field, and sent to the server together with the request over
HTTPS. At the server side, the received data is verified through the support provided by Spring for basic
authentication. The access restriction for a given resource is specified directly at the configuration file
level. When the request from the client is received, the Authorization header is checked for user creden-
tials. When found, the data present there is compared with the hashed version of the same data that exists
in the server’s persistent storage. If the credentials are correct, the user is granted access to the
requested resource; otherwise access is denied.
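The server-side comparison step can be sketched as follows. This is an illustration of the mechanism just described, not the actual Spring configuration; note that MD5 is a hash rather than encryption, and is no longer recommended for password protection.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the credential check: the digest received in the Authorization
// header is compared with the digest kept in the server's persistent storage.
public class DigestCheck {
    public static String md5Hex(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);  // never on standard JVMs
        }
    }

    public static boolean credentialsMatch(String receivedDigest, String storedDigest) {
        // MessageDigest.isEqual performs a constant-time comparison.
        return MessageDigest.isEqual(receivedDigest.getBytes(StandardCharsets.UTF_8),
                                     storedDigest.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String stored = md5Hex("secret");
        System.out.println(credentialsMatch(md5Hex("secret"), stored));  // true
        System.out.println(credentialsMatch(md5Hex("wrong"), stored));   // false
    }
}
```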
3.2.2.0.2 Data representation In the most typical RESTful scenario, application data is exchanged
through HTTP. When the data belongs to a simple type, such as a string or an integer, this is handled
transparently by the protocol. However, quite often applications need to exchange complex data types
that represent information about the state of the system, e.g., a list of books for an online library. In
those cases, complex data objects, such as the Java object representing a book, need to be serialized
into a textual format, which could exploit JSON or XML representations.
In this work, JSON has been chosen as the data format for the exchange of information between
client and server. The serialization process takes place upon sending and receiving data and exploits
the support of Java Architecture for XML Binding (JAXB) for reading and writing JSON data.
3.2.3 Client application
The current VITHEA system is the result of several years of research and development. During this
period, the platform went through several phases of improvement and consolidation, until reaching the
current stage, in which it integrates various heterogeneous technologies. The business logic of the
application has been tailored to the users’ needs through iterative phases of requirements gathering,
design, and implementation.
The prototype developed in this context, on the other hand, has to be considered a proof of concept
of the feasibility of the mobile version and, thus, integrates only the features that are important for a
correct, complete interaction. In addition to authentication, these include the integration of the
recognition process and of the virtual therapist character, and the application logic that regulates the
exercise flow, including listing and navigation and video and audio reproduction.
The virtual therapist’s representation is achieved through the 3D game engine Unity. This environment
also provides the possibility to directly build plugins for Android. In this way, the native module
of the therapist is exported and then easily integrated into an Android application. Thus, similarly to
the standard version, in the mobile version the virtual character guides and interacts with the user by
providing audio-visual feedback.
The logic flow of the exercises includes the reproduction of video and audio files. The Android
platform imposes some constraints on the supported media formats that currently do not totally adhere
to the specification of the system. In fact, none of the video formats accepted by the Android platform is
produced by the clinician platform, which should thus extend its functionality to include the generation
of a new media format.
The speech recognition process is performed remotely by the in-house speech recognizer AUDIMUS.
The audio is acquired through the Android microphone; for this purpose the Android API provides the
AudioRecord class. This class is delegated the management of the audio resources needed for Java
applications to record audio from the input hardware of the platform. During acquisition, the read data
is stored in an internal buffer of the AudioRecord class. When the recording stops, the audio is sent to
the server side through a RESTful POST request. There, the in-house speech engine processes the file
and the result of the recognition is returned to the user. It is expected that the microphone of tablet
devices could be of poor quality, thus degrading the quality of the recognition. Figure 3.2 shows some
screen-shots of the client application for mobile devices during the execution of a training session.
A user experience evaluation based on collected questionnaires has been conducted with 16 users of
different ages, varying from 23 to 60 years. The users selected for the evaluation have different
backgrounds, ranging from computer science to linguistics and accounting.
Figure 3.2: Screen-shots of the VITHEA mobile patient application.

The evaluation was held at the workplace where the system was developed, under the same conditions
for all the users. The evaluation process was carried out in two phases. First, the subject was introduced
to the conventional online web-browser version of the application, its functionalities were explained, and
the user was asked to directly explore and try it. Then, after getting familiarized with the system, the
user was requested to test the new version of the application for mobile devices.
During the test evaluations, the user was accompanied and observed while executing the test. This
allowed an interactive participation that permitted collecting important feedback, besides the actual
questionnaires, gathered both by observing the user’s behaviour while using the application and from
direct suggestions of the user.
The questionnaire contained 10 questions and was divided into three sections. The first section
contains questions related to the overall satisfaction and usability of the application. The second section
contains questions related to the robustness of the system, while the third contains questions dedicated
to a comparison between the two versions of the client application: the online computer-oriented one
and the mobile one. Responses are given on a numerical Likert scale, where 1 is associated with low
satisfaction/agreement and 5 with maximum satisfaction/agreement for each question or statement.
Figure 3.3 illustrates the questionnaire and the results of the evaluation.
Overall, the evaluation provided good results, achieving an average score of 4.14 on a 5-point Likert scale (1 to 5). The items related to the usability of the system have an average score of 4.3, those related to robustness achieved 4.7, while the items that compare the appreciation of the mobile version with respect to the online browser version achieved 3.7. The lower score obtained in the last group is explained by a closer inspection of the evaluation data. Detailed charts in Figure 3.4 show, for
Figure 3.3: Results of the evaluation.
each of the four questions, the distributions of the user grades.
According to these charts, in the two questions related to the response time of the application and to automatic speech recognition, most of the users did not find any relevant differences between the two systems. This is actually a better than expected result, since we did not take any particular action to adapt the speed and the speech recognition engines for this type of device.
In fact, the average score for each of these questions is, respectively, 3.3 and 2.9. However, a detailed inspection of the automatic speech recognition chart shows that 19% of the evaluated users agreed that the recognition process was worse. Since the evaluation process was assisted, we can confirm that some users experienced greater problems than others. In these cases, two main reasons have been identified as possible sources of the recognition errors. On the one hand, we observed that the loudness of the recorded voice for some users was very low, which was partially due in some cases to inadequate use of the device's microphone, for instance, not facing the microphone directly or blocking it accidentally with a finger. On the other hand, we discovered an unintentional misuse of the recording interface. In fact, the mobile version of the recording interface is also based on a Push-To-Talk (PTT) strategy, but in contrast to the browser version, where users have to push to start recording and push again to stop it, users have to keep a button pressed while recording. It appeared that some users inadvertently tended to release the button while they were still uttering the last syllables of their answer.
The remaining two questions of this last group, that is, whether the mobile version provides a more comfortable user experience and whether the mobile version is preferred over the online browser one, obtained average scores of 4.2 and 4.4, respectively. In particular, 87% of the evaluated users strongly preferred the mobile device version.
Figure 3.4: Distribution of the user grades for the questions of the third group.
The user experience evaluation provided interesting feedback and results. Users are more comfortable with the mobile version and agree that it would be easy to learn how to use this new application. The touch-screen capabilities of these devices actually provide a different perception of the application itself. Thus, the encouraging results obtained here are an incentive to provide a more complete version of the application that may exploit the different input modalities these devices offer, in order to provide a more interactive and complete experience.
With few exceptions, the speech recognition process showed performance comparable to the online system. This is also a motivating achievement; in fact, one of the possible limitations of this version could have been the poor quality of the microphone installed on these devices. A poor microphone may have produced a poor speech signal and, thus, weak recognition performance. Overall, the results did not seem to confirm this expectation; however, direct user experience also highlighted other possible limitations that should be addressed to strengthen the recognition results.
4
An important concern that has been taken into consideration throughout the development of the project, and has guided the design of new interfaces and functionalities, is the usability of the client module of the system. Over the years, particular care has continuously been given to the choice and disposition of the graphical user interface (GUI) elements, in order to achieve an easy-to-use and understandable layout, such that user interaction could be predictable and unmistakable. Driven by the principle of accessibility, the characteristics and the needs of the intended users of the application have been identified. Two major requirements emerged from this analysis. The first is related to the fact that, although aphasia is increasing in the youngest age groups, it is a predominant disorder among elderly people. This age group is prone to suffer from visual impairments; thus, the graphical elements used within the client interface were carefully selected, considering only large icons and intuitive images. The second requirement is related to the most common cause of aphasia. In fact, as mentioned in Section 2.1, a CVA is considered one of the main sources of aphasia, and thus it is expected that these patients may have some form of physical disability, such as reduced arm mobility, and therefore may experience problems using a mouse.
In situations where arm mobility is affected, support for a hands-free interface will possibly improve the overall usability of the user experience. However, the typical extension of such interfaces for human-computer interaction consists of voice commands as an alternative input modality. In the particular case of the VITHEA project, since the user of the system is affected by a language disorder, hands-free computing cannot be interpreted as an alternative way of interaction; instead, it will be selectively applied to automate the process of recording the users' answers, and thus provide additional benefits to people experiencing disabilities.
In fact, the client interface that allows the recording of user answers is currently based on a Push-To-Talk (PTT) strategy, and requires at least two distinct interactions from the patient: the action of starting the recording and the action of stopping it. In this context, the technique that will be exploited relies on voice activity detection (VAD) to automatically determine the end of the speech. There are numerous approaches to address this task, which also vary on the basis of the underlying technology used.
In the following, Section 4.1 starts by describing the VAD task, considering the implications it may raise and their possible solutions from the perspective of the VITHEA system. Then, it details the approach that has been proposed in this work and its resulting architecture. A speech corpus derived from patients' daily recordings has been created to perform an automated evaluation; the corpus and the results of the assessment are described in Section 4.2, while Section 4.3 discusses the main conclusions resulting from this work.
At a very general level, VAD is the binary classification process that tags each speech segment as containing voice or silence. In practice, to successfully achieve the classification task, VAD approaches usually implement sophisticated signal processing techniques to improve the quality and robustness of the algorithm itself. In the literature review, the variety of different solutions that may be used to address this task has already been highlighted. The choice of a solution is typically guided by architectural or implementation constraints, or by the available technology.
As far as the VITHEA system is concerned, as mentioned above, in the current version recording an utterance requires two interactions: starting and stopping the recording process. The moment when the recording process should start could be efficiently detected automatically, by taking as reference the end of the prompt stimulus spoken by the virtual therapist, or the end of the subsequent reproduction of the audio/video file in the case of a multimedia stimulus.
The detection of the end of the speech is a more challenging issue. There are two viable options. The first exploits Adobe® Flash® technology to observe the energy of the speech signal acquired through the microphone and, on this basis, detect the end of the speech. This results in a more affordable challenge from the technological point of view, but, as a counterpart, the precision achieved may not be highly reliable.
Alternatively, the VAD task could be performed by the speech recognition module [Meinedo 03, Meinedo 10]. This would allow a more refined analysis and better performance; however, it would also require sending a continuous stream of data to the server. Thus, the main disadvantage of this approach is that the recorded audio must be transmitted to and processed by the speech recognizer before determining the end of speech, which, in the case of network congestion, may lead to non-deterministic behaviour. Moreover, the need for a continuous stream of audio and real-time processing on the server side represents an increased technological challenge.
After carefully evaluating the benefits and drawbacks of both approaches, we decided to initially experiment with a simple and lightweight approach deployed on the client side, partially motivated by the particular kind of speech input that the algorithm is expected to face. We expect that the recordings subject to VAD analysis will adhere to a well-defined format. In fact, the VITHEA system only supports naming exercises that, in the most general case, admit as a possible answer a single word or a short sentence. For these reasons, a simple approach should theoretically perform adequately. On the other hand, the technological constraints imposed by the server-side solution and the uncertainty about its behaviour in the application scenario discourage its use. In any case, as we will see in Section 4.1.2, the implemented architecture will easily allow a future extension that exploits server-side VAD.
4.1.1 Algorithm
In the most basic VAD approach, the signal is first sliced into contiguous frames, then a real-valued parameter is associated with each frame. If this parameter exceeds a certain threshold, the frame is classified as containing speech; otherwise, it is classified as containing silence. Here we follow the same methodology; however, the algorithm has been adapted in order to take into account specific application logic constraints.
The measure used to establish whether a frame possibly contains speech is the energy of that frame. If the length of a frame is k samples, and x(i) is the i-th sample, then the energy of a given frame j of the input signal is computed as:
E_j = \frac{1}{k} \sum_{i=0}^{k} x^2(i)   (4.1)
The energy of a frame is a valid parameter for VAD algorithms; however, it is not able to distinguish between loud noise and speech.
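As a minimal sketch of Eq. 4.1, the per-frame energy can be computed as follows; the function names and the frame-splitting helper are illustrative, not part of the original implementation:

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame (Eq. 4.1)."""
    k = len(frame)
    return sum(x * x for x in frame) / k

def split_frames(signal, k):
    """Slice a signal into contiguous, non-overlapping frames of k samples."""
    return [signal[i:i + k] for i in range(0, len(signal) - k + 1, k)]

# A constant-amplitude frame has energy equal to the squared amplitude.
print(frame_energy([0.5, 0.5, 0.5, 0.5]))  # 0.25
```

Note that, consistently with the remark above, a frame of loud non-speech noise yields the same kind of energy value as a frame of speech, so energy alone cannot tell the two apart.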
An adaptive approach has been used to estimate the value of the threshold. Initially, the VAD algorithm is trained on a 3-second audio sample that is assumed to contain no speech. In the application scenario, this is achieved by introducing a new module that is in charge of capturing the required amount of sound, computing the threshold, and storing the value in the user profile. The user is required to remain silent for the time needed. The module is activated when there is no previous value for the threshold in the user profile, or on demand, if, for instance, the recording conditions have changed.
The threshold is computed over the whole input recording by taking the mean of the energies of the frames, as in:

E_{th} = \frac{1}{v} \sum_{m=0}^{v} E_m   (4.2)
where E_{th} is the initial threshold estimate and v is the number of frames. However, since background disturbance is non-stationary, an adaptive threshold is more appropriate. Thus, under the assumption that the user will not start speaking exactly at the time the recording starts, the first second of each recording is used to compute another threshold that updates the previous value. Given E_{t1}, the value computed for the first second of recording, the rule to update the threshold value is:

E_s = \alpha E_{t1} + (1 - \alpha) E_{s-1}, \quad s > 0   (4.3)

where E_0 = E_{th}, \alpha is a smoothing factor with 0 < \alpha < 1, and s indicates the number of stimuli the user has performed. In this way, on the one hand, the initial value of the threshold, E_{th}, is constantly updated in case of varying conditions; on the other hand, this update is smoothed to avoid sudden changes due to, for instance, the presence of voice in the first second.
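Equations 4.2 and 4.3 can be sketched as two small functions; the concrete value of the smoothing factor used here is only illustrative, since the thesis does not fix one:

```python
def initial_threshold(frame_energies):
    """Eq. 4.2: mean frame energy over the silent calibration recording."""
    return sum(frame_energies) / len(frame_energies)

def update_threshold(prev_threshold, first_second_energy, alpha=0.3):
    """Eq. 4.3: exponential smoothing of the threshold, with 0 < alpha < 1.
    alpha=0.3 is an assumed, illustrative value."""
    return alpha * first_second_energy + (1 - alpha) * prev_threshold

E_th = initial_threshold([0.1, 0.2, 0.3])  # mean energy, approx. 0.2
E_1 = update_threshold(E_th, 0.4)          # 0.3*0.4 + 0.7*0.2, approx. 0.26
```

A small alpha keeps the threshold stable across stimuli, so a single noisy first second cannot drag the estimate far from the calibrated value.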
The classification rule for a frame, given its energy E_j, is then guided by:

SNS = E_j - E_s(1 + \delta), \quad 0 < \delta < 1   (4.4)
Thus,
IF SNS > 0
Frame is voice
ELSE
Frame is silence
In the application scenario, VAD computation should not start until voice is detected, and should end after 3 seconds of silence. The first constraint is satisfied by verifying that a minimum number of frames have been classified as voice; when this is verified, the status of the VAD algorithm can be defined as active. For the second constraint to be satisfied, the VAD status must be active and, since the last voice frame, a minimum number of frames must have been classified as silence. When this condition is met, the end of the speech is detected.
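The classification rule of Eq. 4.4 and the activation/end-of-speech logic described above can be sketched as a small state machine over a sequence of frame energies. The function name and the concrete frame-count parameters are illustrative assumptions; in the real system they correspond to min sp and max nonsp, whose values depend on the frame rate:

```python
def detect_end_of_speech(energies, threshold, delta=0.5,
                         min_voice_frames=3, max_silence_frames=10):
    """Return the index of the frame at which end-of-speech is declared,
    or None if it never is. A frame is voice when
    E_j - threshold*(1 + delta) > 0 (Eq. 4.4). The VAD becomes 'active'
    only after min_voice_frames voice frames; once active, end of speech
    is declared after max_silence_frames consecutive silence frames.
    Parameter values here are illustrative only."""
    active = False
    voice_count = 0
    silence_count = 0
    for i, e in enumerate(energies):
        is_voice = (e - threshold * (1 + delta)) > 0
        if is_voice:
            voice_count += 1
            silence_count = 0  # reset: silence must be consecutive
            if voice_count >= min_voice_frames:
                active = True
        elif active:
            silence_count += 1
            if silence_count >= max_silence_frames:
                return i
    return None
```

Because activation requires a minimum run of voice frames, isolated noise bursts before the answer do not trigger the end-of-speech countdown.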
4.1.2 Architecture
The proposed solution relies on the client-side computation of the energy of the speech signal acquired through the microphone. To this purpose, the VAD algorithm is implemented by exploiting the Adobe® Macromedia® API. The Microphone class from the Macromedia® reference provides quite refined control over the acquired signal. It allows specifying several interesting properties that may affect the quality of the input audio. Among these, we mention:
• codec: the codec used for compressing audio. Two codecs are available: Nellymoser¹ and Speex²;
• enableVAD: enables built-in voice activity detection, available only for the Speex codec;
• gain: the amount by which the microphone amplifies the signal;
• noiseSuppressionLevel: the maximum attenuation of the noise, expressed in dB, available only for the Speex codec;
• rate: the rate at which the microphone captures sound, expressed in kHz. Acceptable frequencies are: 5,512 Hz, 8,000 Hz, 11,025 Hz, 22,050 Hz, and 44,100 Hz;
• silenceLevel: the amount of sound required to activate the microphone and dispatch the sampling event. When this value is greater than zero, the sampling event will not be dispatched until the amount of sound captured by the microphone exceeds the silence level;
• silenceTimeout: the number of milliseconds between the time the microphone stops detecting sound and the time the sampling event is dispatched. When this value is greater than zero, sound input will continue to be captured for the period specified in silenceTimeout, starting from the time the microphone stops detecting any sound.
¹http://www.nellymoser.com/
²http://www.speex.org/
When capturing speech input, the Microphone class dispatches an event each time new samples are available. This is regulated by the rate parameter of the Microphone class.
Preliminary experiments performed while studying the available API exploited the Speex codec in order to enable the internal VAD algorithm provided by the API. However, the results showed very poor quality of the recorded speech, with frequent truncation of important speech segments; thus, tests in this direction were not continued further.
To enhance security, Adobe® Flash® establishes clear requirements and restrictions in terms of user-initiated actions (UIA). These consist of either keyboard or mouse events. In particular, an important constraint that would prevent the implementation of the hands-free interface is related to the HTTP POST operation. Security restrictions require that performing the equivalent of a file upload to a target server can only succeed as the result of a user-initiated action. This prevents a Flash application running in a browser from silently posting data to the server hosting the application, without the user's explicit agreement to that action. In the current architecture, the recorded speech is stored in memory and sent to the server side with a POST operation when the recording process is stopped by the user. Thus, a different strategy has to be followed.
WebSocket³ is a recently standardized protocol that enables two-way communication between a client and a server over a TCP connection. HTTP is a stateless, request-response protocol adhering to the client-server paradigm. In the typical scenario, a web browser acts as a client, submitting a request to a server, which provides the requested resources, responds to the client, and closes the connection. In this scenario, the server can only respond to client requests. WebSockets, instead, provide a bi-directional, full-duplex, persistent connection from a web browser to a server. WebSockets change the web programming model from user driven to event driven. By allowing a persistent connection to be established between a client and a server, either party can send a message to the other at any given time.
For our purposes, WebSockets have been exploited to overcome the limitations imposed by the Flash security model. When the VAD algorithm determines the end of the speech, a WebSocket connection is opened from the Flash Player and the recorded file is sent to the server through the WebSocket channel. When the file is successfully received, the WebSocket server replies to the Flash Player, which can then send the request for validating the answer to the VITHEA application. This architecture is illustrated in Figure 4.1. This implementation also creates a valid baseline for a future version exploiting the VAD performed by the in-house speech recognizer.
The performance of the VAD algorithm has been assessed through offline tests. To this purpose, a speech corpus has been derived from patients' daily recordings stored in the system. The evaluation process has been performed in the Matlab environment, simulating the same conditions as the Flash Player environment.
³http://tools.ietf.org/html/rfc6455
Figure 4.1: Architectural implementation of the VAD algorithm.
4.2.1 Speech corpus
A development corpus consisting of isolated recordings of naming exercises from real users of VITHEA has been defined to evaluate the VAD algorithm.
First, the recordings stored in the platform were automatically filtered according to two criteria: only those recordings that led to a correct answer and whose size was larger than a given minimum length were selected. This was done to guarantee that the chosen data actually contains an answer and is not just the result of an erroneous or mistaken interaction. Then, we identified potentially representative data belonging to therapy sessions of three different speech therapists with their patients, and of five additional individual patients who performed rehabilitation therapy on their own. The recordings belonging to speech therapists were made in the rehabilitation centres and include data from many different patients. These sessions are accompanied by the therapist, who helps and stimulates the patient; thus, quite often these recordings intermingle clinicians' speech with patients' answers. From these recordings, due to time constraints and the need for manually auditing and annotating the data, we finally selected the recordings from one speech therapist and one independent patient that were representative of the characteristics of typical VITHEA interactions. In particular, these do not contain overlapped speech; thus, when the clinician's speech was present in an audio segment, it was discarded. The final selected set consists of 63 recordings. Data from the speech
Figure 4.2: Process of generation of the speech corpus.
therapist apparently belonged to 5 different patients.
These selected recordings have then been used to build an ad hoc dataset that meets the standard working conditions of the designed algorithm.
First, the selected exercise recordings were cut to exactly match the start and the end of each patient's speech interaction. For recordings containing long pauses, we considered them as a single, continuous speech interaction only if the silence gap was smaller than three seconds. Then, the remaining segments from all the audio recordings that contained speech from neither the therapist nor the patient were clustered into a unique "background" file. Any noises, disfluencies, and hesitations that appeared in the source recordings were also included in this "background" noise file. The final test segments used in the evaluation are artificially synthesised by concatenating a random segment of silence of variable length extracted from the "background" file, followed by the speech segment containing the actual answer, and a final random segment of variable length of background noise. The process of construction of the evaluation set is shown visually in Figure 4.2.
In this way, we implicitly obtain the reference boundaries of the speech/non-speech regions, and we guarantee data similar to the expected working conditions of the online version of the algorithm. Since this is a random selection from non-pure silence, the selected segments may well contain disturbances such as blowing into the microphone or coughing, exactly as in real conditions. By adding a segment of background noise at the end of the answer, we are able to test the algorithm for correct detection of the end of the speech.
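The generation process of Figure 4.2 can be sketched as follows. The durations of the leading (3.5 to 5.5 s) and trailing (5 to 6.5 s) background slices come from the figure; the function name and the sample rate are illustrative assumptions:

```python
import random

def make_test_segment(background, answer, rate=8000):
    """Concatenate a random leading background slice (3.5-5.5 s), the
    patient's answer, and a random trailing background slice (5-6.5 s).
    Returns the synthetic segment together with the reference start and
    end sample of the answer, which serve as ground-truth boundaries.
    The 8 kHz sample rate is an assumed, illustrative value."""
    def rand_slice(lo_s, hi_s):
        n = random.randint(int(lo_s * rate), int(hi_s * rate))
        start = random.randint(0, len(background) - n)
        return background[start:start + n]

    lead = rand_slice(3.5, 5.5)
    trail = rand_slice(5.0, 6.5)
    segment = lead + answer + trail
    ref_start = len(lead)
    ref_end = len(lead) + len(answer)
    return segment, ref_start, ref_end
```

Because the boundary positions are known by construction, no manual annotation of the synthetic segments is needed.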
4.2.2 Results
The VAD algorithm described in the previous Section is aimed at detecting both the start and the end of speech. The former is needed to adhere to the application constraint of waiting for a segment of voice. That is, the system has to allow the patient the time he/she may need for answering before ending the speech recording, and for that purpose the system needs to detect the start of speech first. However, the accurate detection of the two boundaries is not equally important for the purposes of our application. In fact, regarding the start of speech, the exact location of this boundary is not relevant for our application purposes, since the system starts to record automatically. A failure of the algorithm would occur in this regard only if the start of speech is not detected, or if it is detected very prematurely, causing an error in the end of speech detection. In other words, the most important boundary to be detected is the end of speech. However, not all errors in the detection of the end of speech are equally important: a premature detection of this boundary has a more dramatic impact on word naming performance than a delay in its detection. For these reasons, two different metrics have been used to evaluate the VAD algorithm.
For the start of the speech, we consider as correctly detected those results for which the absolute difference between the automatically detected hypothesis and the reference is smaller than a given time threshold. Thus, we define the correct detection rate for the start of the speech (DRsos) as the number of correctly identified results divided by the total number of segments:

DR_{sos} = \frac{1}{N} \sum_{i=1}^{N} H^*[\,|diff(i)| - max\_dist\,]   (4.5)
where N is the total number of test segments, diff(i) = ref(i) − hyp(i) is the real-valued difference between the start of the speech provided by the reference, ref(i), and the one hypothesised by the VAD algorithm, hyp(i), H^*[·] = 1 − H[·], where H[·] is the unit step function, and max_dist is the maximum error distance tolerated for a detection to be considered correct. During the evaluation, this value was set to 0.2 seconds.
A different metric has been adopted for evaluating the detection of the end of the speech. Based on the previous observations, we treat the case of a premature identification of the end of the speech differently from the case of a delayed detection. In fact, the first case means that the recorded file has been truncated and is thus incomplete. A delay in the identification of the final boundary, on the other hand, only impacts network bandwidth and does not affect the recognition process. Thus, the detection rate for the end of the speech (DReos) considers two different values for the maximum error distance allowed in the cases of early and late detection of the end of speech:

DR_{eos} = \frac{1}{N} \sum_{i=1}^{N} H^*[\,|diff(i)| - max\_dist(i)\,]   (4.6)
where max_dist(i) = max_dist_early if diff(i) > 0, and max_dist(i) = max_dist_late otherwise. In
this evaluation we set max_dist_early = 0.05 and max_dist_late = 0.2.
Besides these metrics, the mean errors in the identification of the start and the end of the speech have also been computed (Esos, Eeos). These are defined, over all the elements of the test set, as the mean of the absolute differences between the automatic detection and the reference.
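The four metrics (Eqs. 4.5 and 4.6 plus the two mean errors) can be sketched as one function over lists of reference and hypothesised (start, end) boundaries in seconds; the function and variable names are illustrative:

```python
def step(x):
    """Unit step H[x]: 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0

def detection_rates(refs, hyps, max_dist=0.2,
                    max_dist_early=0.05, max_dist_late=0.2):
    """Compute (DRsos, DReos, Esos, Eeos) for paired lists of
    (start, end) reference and hypothesis boundaries, using the
    tolerances from the thesis evaluation (0.2 s for the start,
    0.05 s / 0.2 s for early / late end detection)."""
    n = len(refs)
    dr_sos = dr_eos = e_sos = e_eos = 0.0
    for (rs, re), (hs, he) in zip(refs, hyps):
        d_start = rs - hs            # diff(i) = ref(i) - hyp(i)
        d_end = re - he
        # Correct start: |diff| within max_dist, since H*[x] = 1 - H[x].
        dr_sos += 1 - step(abs(d_start) - max_dist)
        # Asymmetric tolerance: early end detection (diff > 0) truncates
        # the answer and is penalised with the tighter threshold.
        tol = max_dist_early if d_end > 0 else max_dist_late
        dr_eos += 1 - step(abs(d_end) - tol)
        e_sos += abs(d_start)
        e_eos += abs(d_end)
    return dr_sos / n, dr_eos / n, e_sos / n, e_eos / n
```

The asymmetry encodes the reasoning above: a hypothesis that cuts the answer short counts as an error already at 0.05 s, while a late boundary is tolerated up to 0.2 s.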
Several parameters may influence the performance of the algorithm. In order to determine the best configuration, a non-exhaustive search has been performed over four of the most important variables. These include the threshold δ specified in Eq. 4.4, which determines the classification rule for speech or non-speech; the minimum amount of speech required to start the VAD computation, min_sp; the length of the window used to compute the frame energy, w_len; and the amount of silence required to determine the end of the speech, max_nonsp.
Table 4.1 shows the baseline configuration that resulted from the non-exhaustive search, while Table 4.2 shows the error rates and the detection rates achieved with these parameters.

Parameter   Baseline value
δ           0.5
min_sp      0.8
max_nonsp   3
w_len       12

Table 4.1: Baseline configuration established through the non-exhaustive search.
Metric   Score
Esos     0.28
Eeos     0.60
DRsos    0.87
DReos    0.75

Table 4.2: Results obtained on the development test set with the baseline configuration.
Using this baseline configuration as a starting point, we examined the performance of the algorithm by varying the values of one parameter while keeping the others fixed. In particular, we present results for the δ parameter and for the window length w_len. We note that smaller values of the threshold δ cause a higher error rate, since background noise is detected as speech. Higher values of the threshold, on the other hand, also cause a higher error rate, since low-energy speech segments are missed.
By varying the window size w_len, a different phenomenon arises, which however leads to analogous results. In fact, this parameter defines the number of frames whose average is evaluated in the classification stage; thus, a smaller window is too sensitive to variations in the background noise, which are detected as segments of voice. For the opposite reason, a bigger window causes an earlier detection of the start of speech and, of course, a delay in the detection of the end of the speech.
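The one-parameter-at-a-time search described above can be sketched generically as follows; the scoring function shown in the demo is a toy placeholder, not the thesis's actual evaluation:

```python
def one_at_a_time_search(baseline, grids, score_fn):
    """Vary one parameter at a time over its candidate grid, keeping
    the others at their current best values; keep any configuration
    that improves the score. A non-exhaustive (greedy) search."""
    best_cfg = dict(baseline)
    best_score = score_fn(best_cfg)
    for name, values in grids.items():
        for v in values:
            cfg = dict(best_cfg)
            cfg[name] = v
            s = score_fn(cfg)
            if s > best_score:
                best_score, best_cfg = s, cfg
    return best_cfg, best_score

# Toy score that peaks at delta=0.5 and w_len=12 (illustrative only).
score = lambda c: -abs(c["delta"] - 0.5) - abs(c["w_len"] - 12) / 10
cfg, s = one_at_a_time_search(
    {"delta": 0.3, "w_len": 8},
    {"delta": [0.3, 0.5, 0.7], "w_len": [8, 12, 16]},
    score)
```

Unlike a full grid search, this explores only the sum, not the product, of the grid sizes, which is why it cannot guarantee a global optimum.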
In order to finally assess the performance of the proposed method, a new evaluation corpus has been built, following the same generation process used to build the development corpus. To this end, three random simulations were created with the same data used to build the development set, resulting in a total of 213 new segments. For each of them, random segments of silence of variable length were added at the beginning and at the end of the speech segment containing the correct answer, thus producing segments different from those used for development. Table 4.3 provides the results
for this new simulated evaluation set in terms of error and correct detection rates using the previously
found optimal parameters.
Metric   Achieved value
Esos     0.27
Eeos     0.56
DRsos    0.85
DReos    0.69

Table 4.3: Results obtained on the evaluation set with the baseline configuration.
As the data in Table 4.3 shows, the developed algorithm obtained reasonably good results. In particular, the detection rate for the start of the speech is around 0.85, which is quite a promising achievement. On the other hand, the results obtained for the detection of the end of the speech worsen slightly. An analysis of the detection errors in this case shows that they are, most of the time, due to a delay in the detection of the end of the speech, which is in fact good news, since this type of error is not expected to affect speech recognition results.
In this work, a custom hands-free interface has been specifically designed to provide a more comfortable experience to some aphasia patients. As already mentioned, it is not uncommon that patients suffering from a language disorder also suffer from a temporary physical disability which reduces their mobility. In all these cases, a hands-free interface may facilitate the performance of therapy sessions. The interface has been designed considering the context of the project for which it will be used. The user requirement specification identified, in fact, that the patient may need some time to reflect before actually answering the presented stimuli. These concerns guided the design and implementation of the algorithm for the voice activity detection task. The automated evaluation has confirmed the feasibility of the proposed solution.
5
The clinician module of the VITHEA system is an important component of the whole platform. As explained in Section 2.3.1, it is aimed at managing user profile data, accesses, and permissions of both patients and speech therapists, and at allowing the management of the stimuli and resources that constitute the therapeutic exercises. During the development of the project, the relevance of this module has increased at the same rate as the spread of the platform itself, giving rise to new requirements encompassing privacy and access policies, and improved management of its contents.
Within the business logic of the system, exercises are classified into three different categories, audio, visual, and text, according to the content of the stimuli they contain. Currently, these categories contain, respectively, 308, 742, and 352 stimuli, constituting a total of 1402 different stimuli. The video and audio categories provide a multi-modal presentation of the stimuli through the association of multimedia resources which easily illustrate their content. Thus, besides stimuli data, the system currently handles 885 multimedia files, shared among the exercises, including images, video, and audio.
In the early stages of development of the VITHEA project, this amount of data was not even imagined, and hence the system lacked an appropriate search functionality, resulting in a highly inefficient data management task. The logical data structure behind the concepts of stimuli and multimedia files suggests and encourages the exploitation of information retrieval techniques to deliver an improved search experience. In this context, an extended search relies on the support of additional metadata, extracted ad hoc from existing data, in order to retrieve the most relevant set of information with respect to the search performed.
In the following, Section 5.1 introduces and explains the concepts and techniques used to implement an improved search experience. Then, in Section 5.2, the measures of precision and recall are computed for a well-defined set of test cases. To conclude, a final discussion is reported in Section 5.3.
In information retrieval, full-text search refers to techniques for matching query terms throughout the content of each stored document in a collection. Full-text searching is the type performed by most web search engines, but it can also be extremely helpful for internal, single-site searching. It involves the operations of storing and indexing a collection of text documents to optimize speed and performance in finding relevant information. Full-text search is a concept well distinguished from searches based on metadata or on parts of the original texts, such as titles or abstracts, typically stored in a database.
Query expansion is the process of reformulating a query to improve retrieval performance. It involves evaluating the search terms and expanding them with additional, related information in order to match additional results. Typical techniques include finding synonyms of a key term and searching for those synonyms as well, or finding the various morphological forms of a key term by stemming and including them in the search.
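The synonym-based part of this process can be sketched in a few lines. The synonym table below is a hypothetical stand-in for a lexical resource such as MWN.PT; the entries are purely illustrative:

```python
# Minimal sketch of synonym-based query expansion. SYNONYMS stands in
# for a real lexical resource (e.g. MWN.PT); entries are illustrative.
SYNONYMS = {
    "seco": ["árido", "enxuto"],
    "casa": ["lar", "habitação"],
}

def expand_query(terms):
    """Return the original terms plus any known synonyms."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded
```

A term with no entry in the resource is simply passed through unchanged, so the expanded query always subsumes the original one.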
In this work, a hybrid approach was investigated: additional metadata was generated with the support of semantic resources and exploited through query expansion of the search terms. A full-text search engine then indexes and manages this data to improve the matched results.
5.1.1 Methodology
In the remainder of this section, the type of data considered in this study is first introduced; then the methodology used to achieve the extended search is described, including the process of metadata and index generation.
5.1.1.1 Data description
The VITHEA system stores information about users, such as personal, historical, and clinical data, and about exercises and the resources associated with the various stimuli that form an exercise. The latter are the target of the search feature and are further detailed here. An exercise belongs to a category, which reflects the type of the stimuli it contains, and can be in the visual, auditory, or textual domain. An exercise is composed of several questions, each described by a short text, several possible right answers, and a multimedia resource. The system accepts several formats of image, video, and audio files. A physical resource is mapped in the application to the concept of a document, which is composed of a title representative of its content, a link to its location in the file system, and other technical information. In some cases, exercises and images belong to a particular theme (“Animals”, “Food”). Figure 5.1 illustrates the details of the data associated with each of these concepts and the relations existing among them. The relation between documents and themes is not shown, as it is not governed by database constraints.
The short description of a question is quite repetitive and does not contain relevant semantic information (e.g., “Say the opposite of”); therefore, only the information associated with the right answers has been used for improving the query related to an exercise. As for documents, the title and, when applicable, the theme are the most discriminative data, and have therefore been chosen for further analysis. As Figure 5.1 shows, the additional metadata has been stored in the columns synset and part-of of the tables Question and Document.
5.1.1.2 Metadata generation
In order to enhance the retrieval process with techniques such as query expansion and the creation of extended indexes, additional metadata has been generated. For this purpose, two thesaurus-based lexical resources have been used: MWN.PT1 (MultiWordnet of Portuguese) and PAPEL2 (Palavras Associadas Porto Editora – Linguateca). With their support, the relations of synonymy, hypernymy, and part-of have been extracted.

1http://mwnpt.di.fc.ul.pt/index.html

Figure 5.1: Structure of the objects of the VITHEA system that are of interest for the search functionality.
Two different ontologies were necessary in order to compensate for the high rate of terms that were not found. In fact, preliminary tests on a reduced set of data showed poor coverage: MWN.PT alone covered only around 45% of the answers of a stimulus. This gap was partially filled by PAPEL, leading to an overall coverage of almost 76%. Table 5.1 reports the final coverage statistics for the fully loaded database. For each column, the first line specifies the total number of items existing in the database for that field, while the second line reports for how many of those items synonyms, hypernyms, or part-of relations were found.
                      Question answers   Document title   Document theme
N. of items                 1402               885              885
N. of items matched         1114               628              609
Coverage                   79.46%            70.96%           68.81%
Table 5.1: Coverage of the additional metadata generated.
Two different strategies have been followed for questions and documents. For the possible answers of a question, all three relations have been extracted as additional metadata. In fact, the answers may already contain the most common synonyms or their opposites; thus, considering only the synonyms might yield no additional information.

2http://www.linguateca.pt/PAPEL
For the title of the document, only the synonyms have been considered, since in most cases the title is a single (possibly compound) word, and thus synonyms should suffice. Also, unlike the questions, most documents have an associated semantic category. For this field, the hypernym and part-of relations have been extracted in order to extend the domain of the search, so as to consider the superclass of the data under consideration and then descend again through the hierarchy to include all the subclasses. This additional information has been used to build a virtual document, indexed with the domain of the category and composed of the extended hierarchy of the information belonging to that domain. The synonymy relation (and, for the answers of a question, also the hypernymy and part-of relations) was used for performing the query expansion.
5.1.1.3 Indexes generation and management
Apache LuceneTM is a search engine that efficiently supports full-text indexing, ranking, and searching. Its most typical usage is indexing large amounts of textual documents to provide an improved search experience. However, the flexibility of its architecture, based on the idea of a document containing fields of text, allows many different data formats to be indexed as long as textual information can be extracted. Lucene has thus been integrated into the VITHEA system to exploit its full-text search functionalities. Given the peculiarity of the information contained in the system, the indexes have been generated with the support of the additional data extracted in the
previous phase. Three indexes have been created:
• on the answers of the questions exploiting, besides the original answer, its synonyms, hypernym,
and part-of relation;
• on the title of a document, considering the synonym relation;
• on the category of a document, exploiting both the part-of relation to build a semantic hierarchy
and the information on the title to refine the search.
The fields that compose the indexes have been given weights, assigning a higher value to the original data and a smaller value to the generated metadata. Lucene offers several ways of achieving this. At indexing time, it is possible to specify that certain fields are more important than others by assigning a field boost. At search time, it is possible to specify a boost for each query, sub-query, and query term. A third way of affecting the scoring is to change the similarity factors. The first two approaches have been followed.
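The effect of such field weights can be illustrated with a toy scorer. This is not the Lucene API or the actual VITHEA configuration; the field names and boost values below are purely illustrative of the idea that a match on the original answer should outrank a match on generated metadata:

```python
# Toy illustration of field boosting: a document is a dict of fields,
# and a match in the original field counts more than a match in the
# generated metadata fields. Names and weights are illustrative only.
FIELD_BOOSTS = {"answer": 2.0, "synonyms": 1.0, "hypernym": 0.5}

def score(doc, query_term):
    """Sum the boosts of every field in which the query term appears."""
    s = 0.0
    for field, boost in FIELD_BOOSTS.items():
        if query_term in doc.get(field, []):
            s += boost
    return s
```

With this scheme, a document matched only through a generated synonym is still retrieved, but ranks below one whose original answer matches the query directly.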
Each time a new document (or question) is inserted, updated, or deleted, the indexes should be rebuilt so that the system responds coherently to user requests; if the index is not updated, the recently modified or inserted object will never be found. However, this process carries a cost in terms of system responsiveness that could not be afforded during normal activity of the application. This is due not to the index regeneration process, which is actually quite fast, but to the generation of the additional metadata: it has been estimated that searching for a synonym in MWN.PT requires at least two seconds, which is clearly unacceptable. For these reasons, the regeneration of the indexes has been scheduled nightly, when the load on the system is expected to be lower.
The strategy described in the previous section has been implemented in the VITHEA system and is available through a web interface that supports both the search features and the visualization of the results. This interface has been used to query the system and evaluate the returned results in terms of precision and recall. Recall is the number of relevant results returned divided by the total number of results that are relevant for the query; it measures how much of the relevant material a search retrieves. Precision is the number of relevant results returned divided by the total number of results returned; it measures the quality of the results. Free-text searching is likely to retrieve many documents that are not relevant to the intended search question, and query expansion is likely to suffer from the same problem: by expanding a query to search for the synonyms of the user-entered term, more documents are matched, as the alternative words are matched as well, increasing recall. This comes at the expense of a reduced precision of the returned results.
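The two measures as defined above can be computed directly from the returned and relevant result sets; a minimal sketch:

```python
def precision_recall(returned, relevant):
    """Precision and recall of a result set, as defined above:
    precision = |returned ∩ relevant| / |returned|
    recall    = |returned ∩ relevant| / |relevant|"""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```

Query expansion typically grows the `returned` set: recall can only stay equal or increase, while precision drops whenever the extra matches are not relevant.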
However, in this particular context, where the available data is limited to a well-defined domain, we consider the reduced precision a tolerable price, compensated by the value added by the extended retrieval. These measures have been computed for each of the generated indexes, namely on the questions' answers, on the document title, and on the document category. Table 5.2 reports, for each, the average precision and recall computed over a range of ten queries.
              Question answers   Document title   Document theme
Precision          0.90               0.93             0.80
Recall             0.99               0.95             0.65
Table 5.2: Precision and recall for each of the indexes generated.
Results confirmed the expectations: both for the answers of a question and for the title of a document, recall has improved at the expense of precision.
In most of the queries, the system either correctly provided an extended result set with relevant results, or returned the direct match of the query - as a relational database would - when no relevant data were found. Figure 5.2 shows an example where the result set provided for the key term “seco” (dry) on the field answers has been extended with the synonyms of the key term.
Another, more interesting example is represented in Figure 5.3, which shows the results for the key term “alimento” (food). Here the returned item set is not closely related to the intended meaning of the search term, yet the results provided are relevant in this context. In fact, without the extended search capability this query would not have returned any result, since the searched term does not even exist in the database as a possible answer. For some queries the system also provided results that were considered totally out-of-domain, as in the case of the search term “gato” (cat), whose returned items included the term “leve” (light).
Figure 5.2: Results provided for the search query “seco” (dry) on the field answer of a Question.
Figure 5.3: Results provided for the search query “alimento” (food) on the field answer of a Question.
Table 5.3 reports, for each of the queries used to compute precision and recall on the second index, the number of results returned by the system when using the extended search capability and when using a standard search.
Concerning the results achieved for the third index, we note an inverse tendency in the values of precision and recall. These results, however, become clearer after a closer inspection of the generated metadata. Sometimes the subset of items belonging to a given category is rather specific, and many metadata lookups failed. Besides, the extensiveness of the metadata generation strategy for this field, together with the limited coverage of the lexical resources used, has led to similar metadata for the reduced number of fields for which metadata were found. As a result, two typical scenarios occur. In the first case, data considered relevant to the query are not retrieved and only the exact match of the key term is found, thus reducing recall. In the second case, since some data share the same metadata without being related, irrelevant results are introduced. Figure 5.4 illustrates search results for
            Extended search   Standard search
Manjar             4                 0
Bruxa              4                 0
Regar              5                 1
Travessia          1                 0
Carimbo            1                 0
Ligar              5                 1
Abrir              4                 1
Batatas            2                 1
Meloeiro           2                 0
Caiota             1                 0
Pincel             4                 1
Table 5.3: Number of results returned by the system using the extended search feature and using a standard search functionality.
the query term “harpia” (harpy) as title within the “animais” (animals) category. It is worth highlighting that, since there is no document with the title “harpia” in the database, this query would not have returned any result in the standard search modality.
Searches performed with the Lucene search engine have shown interesting results in terms of response time. A slight delay is noticed at the presentation layer, when returning the response containing the search results. This can be explained by the internal logic of the application: once the relevant results are retrieved with the Lucene search engine, they have to be integrated with additional information, extracted on the fly from the database, that is needed by the presentation layer.
Overall, the full-text search functionalities have provided interesting results, allowing matching against the extended data within the chosen semantic category, or simply allowing, through the exploitation of the synonymy relation, the retrieval of information that would not have been found otherwise.

Figure 5.4: Results provided for the search query <“harpia” (harpy), “animais” (animals)> on the fields title and category of a document.
However, the implemented approach also returns a considerable number of false positives. This can be partly explained by the choice of integrating such an extended set of relations, but also by errors introduced by the resources themselves. A more refined generation of these relations would probably lead to fewer false positives.
In the future, it would be worthwhile to explore the integration of a stemming algorithm, so as to also consider alternative word forms of a key term. Currently, searching for the term “peixe” (fish) does not produce any results, which are instead returned when searching for “peixes” (fishes).
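To conflate the two forms in the example above, even a single plural-stripping rule would suffice; a real deployment would of course use a proper Portuguese stemmer (e.g., an RSLP-style stemmer) rather than this naive sketch:

```python
# Naive sketch of plural conflation for the "peixe"/"peixes" example.
# A single rule like this is NOT a real Portuguese stemmer; it only
# illustrates how stemming would let both forms index to one key.
def naive_stem(word):
    """Strip a trailing plural -s from sufficiently long words."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word
```

Indexing `naive_stem(term)` alongside the original term would make a query for either “peixe” or “peixes” hit the same entry.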
6

In the literature review (Section 2.4.4), several naming tasks have been described. Among these, category naming and automatic serial naming were identified as valuable additions that could
improve the user experience. Our interest is particularly focused on semantic category naming, a subclass of category naming. In fact, even though automatic serial naming and semantic category naming
differ in their domain of application and therapeutic scope, from a technological point of view, both tasks
share the same structure. Both are based on an extended list of words as possible right answers. This list will constitute the language model of the application, in contrast with the currently available confrontation naming exercises, which are based on a very reduced set of words: in those exercises, the language model is built in an ad-hoc fashion when the stimulus is created, on the basis of the possible answers provided by the therapist. Moreover, while automatic serial naming addresses a relatively closed domain, semantic category naming encompasses a much broader one, so that the generation of the list of possible valid answers becomes a challenge. In the case of semantic category naming this would imply, for instance, explicitly listing all the known species of animals or all the known professions, which is clearly infeasible.
In this work, although semantic category naming has been implemented and assessed for a single well-defined domain, the methodology introduced here could later be extended to other domains or to the serial naming task. The domain chosen is the animal world, since it is the most common category used for this type of test, which is therefore commonly referred to as the Animal Naming or Animal Fluency task.
The next Section describes some characteristics of the animal naming task that motivated the choice
of methodology, together with the components of the speech recognizer that have been involved. Then,
Section 6.2 introduces the speech corpus collected to perform an automatic evaluation and describes
the iterative process adopted to refine and improve the final results. To conclude, Section 6.3 reports the
results for a manual evaluation of the recognition errors and discusses the main sources of errors.
Animal Naming is a semantic fluency task that consists of naming as many animals as possible within a one-minute interval. The score of the task corresponds to the sum of all admissible words, where names of extinct, imaginary, or magical animals are considered admissible, while inflected forms and repetitions are not.
The automation of this task raises several challenges, namely due to the disfluencies that are present in spontaneous speech, exacerbated here by the mental effort the test requires and by its duration. Hesitations, filled pauses, and repetitions are expected to be common in the recorded speech. For this reason, the same keyword-spotting-based approach adopted in [Abad 12] has been followed, extending it to address the animal naming task.
6.1.1 Keyword spotting
Keyword spotting techniques aim at detecting a certain set of words of interest in a continuous audio stream. Possible approaches have already been described in Section 2.2.3, highlighting the option integrated into the in-house speech recognition engine AUDIMUS. This method, based on the acoustic matching of speech against keyword models in contrast to a background model, proved to be the most appropriate approach for dealing with speech disfluencies [Abad 12], and has thus also been adopted here.
In this extension, however, the list of keywords also plays a fundamental role, containing the names of admissible animals that will be accepted by the speech recognition system. The size of this list may have a significant impact on the outcome of the recognizer: if a keyword is missing from the list, it will never be detected; on the other hand, a longer list is expected to increase the perplexity of the keyword model.
Preliminary experiments using an ad-hoc, reduced subset of the key terms were performed in order to assess the viability of the automatic naming task. Two different speech recognition engines were explored: the Hidden Markov Model Toolkit1 (HTK), freely available after registration, and the in-house ASR engine AUDIMUS [Meinedo 03, Meinedo 10]. These preliminary results revealed a considerable superiority of the AUDIMUS-based system for this particular task; consequently, the experiments with the HTK tools were not continued further.
6.1.2 Keyword model generation
To automatically build an adequate keyword model for the animal naming task, an existing lexical resource has been used as a baseline, consisting of an extensive list of animal names. This resource is part of the project “STRING: An Hybrid Statistical and Rule-Based Natural Language Processing Chain for Portuguese” [Mamede 12]. It contains 6044 animal names, grouped, classified, and labelled with their semantic category. Within the context of the STRING project, this resource is used by a finite-state incremental parser to add semantic information to the output of a part-of-speech tagger. The list therefore aims to be as complete as possible, and its content is wide and detailed. It comprises very specific animal breeds, such as cobra coral sul-americana (South American coral snake). Moreover, the list contains some animal names, such as castanha, beta, corredor (fish and bird names), which in Portuguese also have another, more common meaning. Finally, the list intentionally does not contain any inflected forms (i.e., no feminine or plural forms); only the lemma of a term is considered. The characteristics mentioned above will somehow affect the generated keyword model and need to be specifically assessed. In fact, as we
1http://htk.eng.cam.ac.uk/
will see shortly, the peculiarity of the content of the list has an impact when trying to establish probability values for the key terms. Also, since most words in Portuguese have different masculine and feminine forms, we expect that the lack of this information, as well as the lack of the plural forms, will introduce errors into the recognition results. However, considering the current size of the list, it is not feasible to add this information at this stage, since it would mean, in the best case, doubling the size of the list.
In order to take into account that some names in this extended list are much more likely to be said than others, we tried to compute the likelihood of the different target terms, as is commonly done in n-gram based language modelling. For instance, the ARPA-MIT language model format stores each n-gram definition as a log10 probability value followed by a sequence of n words and a back-off weight. For this purpose, the total number of results provided by a web search engine for a particular query can be useful information, indicative of the term's popularity. However, the homonymy presented by some terms may lead to an incorrect count, related to alternative meanings of the term. Therefore, a more refined retrieval strategy has been implemented, which takes into account the semantic information associated with each key term: the search query is composed of the bigram <animal name> <category>, e.g., beta peixe. In this work the Bing Search API2 has been used to obtain the count data.
However, the distribution of the resulting counts revealed too large a support, with values that decrease drastically within a few keywords. This led to weak term likelihood estimates, confirmed by the poor results achieved in early experiments. Thus, an alternative approach has been adopted to compute the final weights for each term. It consists of building a histogram of words, where the original list is divided into C classes. The probability assigned to each term is the same for all the terms in the same class, while the probability assigned to each class decreases proportionally with the class order: the probability of the first class, the most popular one, is multiplied by C, while that of the last class is multiplied by 1. In this way, the support is regulated by C, which in this work has been set to 10.
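The class-based weighting scheme just described can be sketched as follows. The exact binning used in the thesis is not detailed, so this sketch assumes equal-size classes over the count-ranked list; the count values are illustrative:

```python
# Sketch of the class-based weighting scheme: terms are ranked by their
# web-search counts and split into C equal-size classes; the most
# popular class gets weight C, the least popular gets weight 1.
# The equal-size binning is an assumption, not the thesis's exact rule.
def class_weights(counts, C=10):
    ranked = sorted(counts, key=counts.get, reverse=True)
    class_size = max(1, len(ranked) // C)  # terms per class
    weights = {}
    for rank, term in enumerate(ranked):
        cls = min(rank // class_size, C - 1)  # class index 0..C-1
        weights[term] = C - cls               # class 0 -> C, last -> 1
    return weights
```

Compared with using the raw counts, this flattens the distribution: terms within a class are treated as equally likely, which avoids the drastic decay observed in the raw counts.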
6.1.3 Background penalty for keyword spotting tuning
In order to balance the weight of the background speech competing model against the keyword models in the decoding process, the background scale term (β) has been exploited [Abad 12]. This exponential term in the likelihood domain (multiplicative in the acoustic score/log-likelihood domain) permits adjusting the word naming detection system to penalize or favor the background speech model. In this way, it is possible to make the system more prone towards keyword detections (and possibly false alarms) or towards keyword rejections (and possibly missed detections).
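The role of the scale term can be sketched with a toy decision rule. This is only an illustration of the multiplicative effect in the log-likelihood domain, not the AUDIMUS decoder, and the score values in the test are hypothetical:

```python
# Toy illustration of the background scale term: in the log-likelihood
# domain the background score is multiplied by beta. Since
# log-likelihoods are negative, beta > 1 penalizes the background model
# and makes keyword detections (and false alarms) more likely, while
# beta < 1 favors it. This is a sketch, not the AUDIMUS implementation.
def detect_keyword(ll_keyword, ll_background, beta=1.0):
    """Return True if the keyword model wins against the scaled
    background model for this speech segment."""
    return ll_keyword > beta * ll_background
```

Sweeping β thus trades misses against insertions, which is exactly the tuning knob the text describes.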
Once an initial list fulfilling the desired requirements was obtained, several phases of experimental tests were performed in order to determine the best compromise between the length of the list and its content.
2http://datamarket.azure.com/dataset/bing/search
Before describing these tests, however, we shall characterize the corpus that was specifically collected
for this purpose.
6.2.1 Speech corpus
The corpus includes recordings from 31 healthy adults (16 females and 15 males), all native speakers of Portuguese. The recordings took place in different conditions, with two different head-set microphones. No particular constraint on background noise conditions was imposed. Each session consisted of approximately one minute of recording, in which the speaker was invited to name all the animals he/she was able to remember within the available time. Data originally captured at 16 kHz was down-sampled to 8 kHz to match the sampling frequency of the acoustic models. Orthographic transcriptions were manually produced for each session. The total duration of the corpus is approximately 32 minutes. Each subject produced on average 28 words; however, considering only the valid words, i.e., discounting inflected forms and repetitions, the average decreases to 27. Detailed data are shown in Table 6.1.
User  Gender  Tot. words  Valid words     User  Gender  Tot. words  Valid words
 1      f         23          23           17      f         33          33
 2      m         21          18           18      m         18          18
 3      m         24          24           19      f         35          35
 4      f         33          32           20      f         27          27
 5      m         26          26           21      f         22          22
 6      f         23          22           22      m         35          35
 7      m         21          21           23      f         27          27
 8      m         30          30           24      f         35          35
 9      f         35          26           25      m         30          30
10      m         34          33           26      m         23          22
11      m         35          34           27      f         24          24
12      m         34          34           28      f         33          33
13      f         19          13           29      f         33          33
14      f         20          17           30      f         24          24
15      m         34          34           31      m         36          36
16      m         25          25
avg              27.31       25.75         avg              29.00       28.93

Table 6.1: Speech corpus data, including gender, total number of words and total number of valid words uttered.
6.2.2 Results
The counts established through the Bing Search API reflected the peculiarity of the list well, placing the most exotic names in the lowest positions. Based on this information, it is easy to determine a threshold that filters out the less probable keywords and thus reduces the size of the list. Several conditions were tested; Table 6.2 reports only the most significant ones, obtained using the full list and two different thresholds that resulted in two reduced term lists. Figure 6.1 illustrates the Word Error Rate (WER) results obtained for each of these configurations.
We observed that the configuration with the shortest list caused an increase in the number of misses and substitutions for some users. This was expected, since some of the key terms are now missing from the list. On the other hand, some users whose recognition results had shown a high number of insertions with the original list benefited from the shortening: their numbers of insertions and substitutions decreased. Given these opposing trends, the average automatic WER computed over the whole corpus remains almost stable across the different experiments. The configuration with the middle-size list showed that the impact of missing keywords is not as great as with the shortest list, but the number of insertions increased, as expected.

Figure 6.1: First set of experiments using the keyword model with different values for the threshold.
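The WER figures reported throughout these experiments are the standard Levenshtein-based word error rate, i.e., (substitutions + insertions + deletions) divided by the number of reference words; as a reference, a minimal sketch of its computation:

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein distance over word sequences:
    (substitutions + insertions + deletions) / len(reference)."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[len(r)][len(h)] / len(r)
```

Note that, as discussed later, a single WER number does not separate insertions from misses, which is why the per-error-type analysis below is needed.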
A detailed analysis of the recognition results with the shortest list showed that many of the remaining insertions are due to the presence in this list of a considerable number of animal names (mostly short names such as anu and udu) that are not so common in daily life. Adjusting the background penalty term is not sufficient in this case to absorb the insertions that are generated. Hence, a different methodology is needed to filter the elements of the list, based not solely on frequency counts but also on their content. After evaluating several lexical and semantic resources, we focused on Onto.PT3, an ontology for the Portuguese language [Goncalo Oliveira 12]. Onto.PT is built automatically from other resources and is therefore not totally accurate; however, its reduced coverage may be useful for our purposes. In fact, since our baseline list already provides the semantic category for each of our key terms, and since this resource may introduce some errors, it has been used not to confirm the semantic class of an item, as one would expect, but to verify how widespread a term is. If a keyword was missing from this ontology, it has
3http://ontopt.dei.uc.pt/
                  1st set of experiments
                 Config. 1   Config. 2   Config. 3
Threshold            –          400         800
Number of terms    6044        1447         943
Average WER       21.19       21.47       21.36

Table 6.2: Experiments data and resulting average WER, including list size information.
53
been excluded from the list. The same experiments were performed with the three filtered lists, leading to a reduction of the average error of up to 2.0% relative to the previous experiments. The average WER is shown in Table 6.3, while Figure 6.2 illustrates the results of each user with the various configurations.
                  2nd set of experiments
                 Config. 1   Config. 2   Config. 3
Threshold            –          400         800
Number of terms     960         804         629
Average WER       19.63       19.48       20.02

Table 6.3: Experiments data and resulting average WER, including list size information.
Figure 6.2: Second set of experiments using the keyword model filtered with Onto.PT and different values for the threshold.
A closer analysis of the results revealed other important patterns in the recognition errors. Some keywords, such as periquito (parakeet), mosquito (mosquito), or esquilo (squirrel), were typically poorly identified. The first was used by seven different users, but was correctly recognized only once; the second was used nine times, but correctly recognized only three times; the third was used by three different users, but never recognized correctly. Upon inspection of the rule-based pronunciations used in the lexicon, we noticed an error in the rules. After correcting it, the three words were correctly recognized 100% of the time and the average WER decreased by 1.8% with respect to the second configuration of the previous experiments.
A final observation concerned the insertion errors caused by hesitations and filled pauses. By modifying the generated keyword model to also include the various forms of Portuguese filled pauses, we managed to decrease the number of insertions, reducing the average WER by an additional 3.1%. Table 6.4 summarizes the above configurations, while Figure 6.3 illustrates the results of each user with the different configurations.
Overall, the experiments have led to encouraging results, showing the feasibility of the animal naming task. It should be noted, however, that the WER does not represent a valid estimate of the animal naming score, since it does not allow discerning between the various sources of error.
              3rd set of experiments
            Config. 1           Config. 2
            Phonetic updates    Filled pauses recognition
Avg WER        17.72               14.66

Table 6.4: Experiments data and resulting average WER.
Figure 6.3: Third set of experiments, including phonetic transcription correction and filled pause models.
Twenty-three subjects out of thirty-one used keywords that were missing from the list. Not considering repetitions, as these are not allowed, the total number of missing keywords is thirty. Thirteen of the words used were simply lacking from the list, because of the filtering with Onto.PT or, mostly, because they were missing from the initial baseline list. Four words were either out-of-vocabulary words (espetada - skewered) or made-up words (perdiniz). The remaining thirteen words were missing from the list because they were inflected forms.
For these reasons, a manual evaluation was performed in which the errors due to inflected
forms were discounted. Table 6.5 reports the data of this experiment for every user, highlighting
those who used inflected forms. The average value of this customized WER is 11.64%. The
average WER computed by also discounting the four made-up words is 10.30%. In this work, repetitions have
not been discounted.
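As a rough sketch of how such a discounted score can be computed, both reference and hypothesis can be normalized to lemmas before taking the standard edit-distance WER, so that inflection mismatches stop counting as errors; the lemma mapping below is illustrative:

```python
def wer(ref, hyp):
    """Word error rate via Levenshtein distance: (S + D + I) / |ref| * 100."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return 100.0 * d[-1][-1] / max(len(ref), 1)

def wer_discounting_inflections(ref, hyp, lemma_map):
    """Map inflected forms to their lemma on both sides before scoring,
    so that an inflection mismatch is not counted as an error."""
    norm = lambda seq: [lemma_map.get(w, w) for w in seq]
    return wer(norm(ref), norm(hyp))
```

For example, with a mapping such as {"patos": "pato"}, uttering the plural of an in-list animal name no longer counts as a substitution.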
We also experimented with substituting, in the reference file, the inflected words with their lemma.
Unfortunately, this did not provide a great improvement: the automatic WER decreased by only
0.3%. However, this was to be expected for various reasons. On the one hand, Portuguese feminine forms
may be totally different from their masculine form, as in the cases cao, cadela (dog) or cavalo, egua (horse).
The same applies to the diminutive form, e.g., pintainho, pinto (chick). On the other hand, it happened that
for a key term both the lemma and the inflected form were missing from the list, as in the case of aves,
ave. However, when the lemma was present and very similar to the corresponding inflected form,
the word was correctly recognized. Examples of such cases are patos, pato (duck) and gato, gata
(cat).
Animal Naming Score
User   WER     WER -infl.   WER -made-up
1      4.35    4.35         4.35
2      38.09   27.78        23.53
3      12.50   12.50        12.50
4      18.18   15.63        9.09
5      7.69    3.85         3.85
6      17.39   13.64        13.04
7      19.05   14.29        14.29
8      0.00    0.00         0.00
9      54.29   38.46        22.86
10     17.65   15.15        12.12
11     11.43   8.82         8.82
12     11.76   11.76        11.76
13     36.84   7.69         7.69
14     35.00   23.53        18.75
15     17.65   14.71        14.71
16     0.00    0.00         0.00
17     6.06    6.06         6.06
18     5.56    5.56         5.56
19     5.71    5.71         5.71
20     14.82   14.82        14.82
21     0.00    0.00         0.00
22     11.43   11.43        11.43
23     3.70    3.70         3.70
24     11.43   11.43        11.43
25     16.67   16.67        16.67
26     13.04   9.09         9.09
27     12.50   12.50        12.50
28     15.15   15.15        12.50
29     6.06    6.06         6.06
30     22.20   22.20        22.20
31     8.33    8.33         4.35
avg    14.66   11.64        10.30

Table 6.5: Automatic and manual WER with the configuration 2 of the last set of experiments.
The test represented, for some of the participating subjects, a source of stress. Even with
healthy subjects, the idea of uttering as many animals as possible within one minute creates a state of
anxiety. At the end of the test, some people were frustrated for not having remembered more animals,
whose names came up immediately once the test had ended. This caused the interesting phenomenon
where most of the animal names were quickly uttered in the first seconds of the test; then the subject
started to think of more animal names, typically intermingling speech disfluencies with silence or
keywords. Blowing noises were a cause of insertions. Another common source of error is the concatenation
of words, or of words and filled pauses. While thinking out loud, some subjects introduced syllables
such as eeehm; then, when they suddenly remembered a new word, it was concatenated to the previous
filled pause, as in the case of eeeeepiriquito. Other concatenation errors happen between two
consecutive words, as in the cases rinocerontelefante and mosquitovaca.
7

During a preliminary set of experiments performed within the context of the VITHEA project, an analysis
of the word detection errors showed, for some patients, a remarkable tendency to slow down the rhythm
of a word in correspondence with its syllables. Preliminary studies examined in the literature review (Section
2.4.5) have also shown that taking syllable boundaries into account may actually improve speech
recognition performance. These two reasons have motivated the investigation of an approach that considers
and integrates syllable division in the speech recognition process.
In the following, Section 7.1 introduces the syllabification task and explains how it has been imple-
mented within the in-house speech recognition engine AUDIMUS. Then, the results of an experimental
evaluation are described in Section 7.2, while a final discussion is reported in Section 7.3.
Syllabification is the process of identifying and delineating the syllable boundaries in a word.
Contrary to what might be expected, syllabification is not a simple task and can be
addressed from different perspectives, namely by considering either the orthographic or the phonetic form of
the word. Sometimes syllabification deals with a concept of syllable that corresponds to the written
form, while in other situations it is correlated in some way with audibility. One can observe that, in a
speech recognition context, syllable boundaries based on phonetic parameters would be closer to
actual speech, and indeed these have been the subject of studies aiming at exploring complementary
acoustic models for speech processing. A syllable division that considers phonetic constraints could also
be suitable in the context of a speech recognition process where the lexicon is automatically
generated through a grapheme-to-phone module. However, even when approaching the problem from a
purely phonetic perspective, there is still no consensual solution to the syllabification problem. In fact,
there are several approaches to this task that differ in how sounds are grouped and thus lead to
different syllable splitting rules.
7.1.1 Methodology
The development of a tool for the automatic identification and division of syllable boundaries
is beyond the scope of this work. Thus, research was directed toward freely available open-source
solutions. A software implementation of the syllabification task was kindly provided by the Department
of Electrical and Computer Engineering of the University of Coimbra. The software follows a rule-based
approach based on the Maximal Onset Principle for European Portuguese [Candeias 08, Candeias 09].
The rules were derived from a lexicon of almost 400K words, and syllabification was performed according to the
orthographic form.
To integrate the generated syllables into the version of AUDIMUS customized for the VITHEA
system, it is necessary to alter the lexicon used by the recognizer. In practice, for each keyword entry a
new alternative phonetic transcription is generated, consisting of the original phonetic string
with short pause units inserted at the syllable boundaries. In this way, for each pronunciation
provided by the automatic grapheme-to-phoneme module, an alternative “syllabified” version of the
canonical pronunciation is generated. Unfortunately, the matching of the syllable boundaries produced
for the orthographic transcription with the corresponding phonetic transcription needs to be specifically
addressed.
In fact, depending on the stress and the duration imposed on a given phoneme, there may be different
ways of pronouncing the same word, which lead to different phonetic transcriptions. This is the case for
the Portuguese word pente (comb), whose phonetic transcriptions are p e ∼ t @ and p e ∼ t. In the latter,
the last vowel is not included due to a phenomenon known as vowel reduction, an acoustic
variation of the pronunciation of a vowel that makes it shorter, sometimes almost inaudible. Nevertheless,
the orthographic syllabification provided by the automated software for the same word is pen.te.
Exceptions of this kind have been handled according to the phonetic rule, thus leading to p e ∼ . t
@ and p e ∼ . t. This is in accordance with the results obtained by Candeias [Candeias 11], in a work
focused on exploiting acoustic-phonetic constraints to derive new syllable prototypes. In fact, in canonical
Portuguese grammar, a consonant grapheme cannot constitute a syllable, but, if an acoustic-phonetic
constraint is applied, a syllable composed of a single consonant in word-final position becomes possible.
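The lexicon augmentation described above, including the handling of the reduced-vowel variants just discussed, can be sketched as follows. The short-pause unit name ("sp") and the function interface are illustrative assumptions, not the actual AUDIMUS lexicon format:

```python
def syllabified_pronunciation(phones, syllable_lengths, pause="sp"):
    """Build the alternative 'syllabified' pronunciation by inserting a
    short-pause unit between the syllables of the canonical phone string.

    phones: canonical phone sequence, e.g. ['p', 'e~', 't', '@'] for pente
    syllable_lengths: phones per syllable as derived from the orthographic
    syllabification (pen.te -> [2, 2]); for a reduced-vowel variant such as
    ['p', 'e~', 't'], the final syllable is simply shorter ([2, 1])."""
    out, i = [], 0
    for k, n in enumerate(syllable_lengths):
        out.extend(phones[i:i + n])
        i += n
        # Insert a pause only between syllables, never after the last one.
        if k < len(syllable_lengths) - 1 and i < len(phones):
            out.append(pause)
    return out
```

Applied to pente, this yields the two syllabified variants discussed above: p e ∼ sp t @ for the canonical form and p e ∼ sp t for the vowel-reduced one, matching the single-consonant final syllable allowed under the acoustic-phonetic constraint.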
The performance of the recognition process provided with this alternative syllabified
lexicon has been assessed through automated tests. In particular, in order to measure the achieved
results in terms of overall improvements, the same set of experiments carried out during the VITHEA
project to evaluate the word naming task has been replicated here [Abad 13]. The corpus used for
the evaluation is described in the following section.
7.2.1 Speech corpus
A corpus of 16 patients, native Portuguese speakers with different types of aphasia, has been collected
in two different therapy centres in two different sessions. The first phase was carried out in February and
March of 2011 and includes speech from 8 aphasia patients. The second data collection was carried out
during May and June of 2011 and includes speech from 8 different aphasia patients. Following the original
work, these sets are referred to as APS-I and APS-II, respectively. Recordings were performed during
regular speech-language therapy sessions. Each session consisted of naming exercises with
pictures of objects presented at intervals of at most 15 seconds. The objects and the presentation order
were the same for all patients. The pictures adopted in the naming exercises were selected from a
standardized set of 260 black-and-white line drawings that extends and adapts the corpus of Snodgrass
and Vanderwart [Snodgrass 80].

Figure 7.1: Results for the APS-I corpus comparing the two pronunciation lexicons, the standard and the augmented version provided with syllable boundaries.
7.2.2 Results
In the original experiments with this corpus [Abad 13], the automatic word naming recognition module
was evaluated using two different metrics, the word naming score (WNS) and the word verification
rate (WVR). Both are computed for each speaker: the former corresponds to the number of positive word
detections divided by the total number of exercises, while the latter corresponds to the number of
coincidences between the manual and automatic results divided by the total number of exercises. The WVR
is a measure of the reliability of the automatic recognition and will be used in the following
to compare the results achieved with the alternative pronunciations against the ones mentioned above.
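Under the definitions above, the two per-speaker metrics can be sketched as follows; representing the per-exercise outcomes as booleans is an assumption made for illustration:

```python
def word_naming_score(auto_detections):
    """WNS: positive automatic word detections over the total number
    of exercises for one speaker."""
    return sum(auto_detections) / len(auto_detections)

def word_verification_rate(manual_results, auto_results):
    """WVR: fraction of exercises where the automatic decision coincides
    with the manual judgement, i.e. a per-speaker reliability measure."""
    agreements = sum(m == a for m, a in zip(manual_results, auto_results))
    return agreements / len(manual_results)
```

Note that a speaker can have a low WNS (few words named correctly) and still a high WVR, as long as the recognizer agrees with the human judgement on the failures as well.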
Surprisingly, the results obtained for the APS-I corpus have shown that the usage of the new
augmented pronunciations does not lead to any significant improvement in terms of overall speech
recognition performance. Instead, we note that for some patients the WVR worsens. On the other hand, the
results obtained for the APS-II corpus showed encouraging improvements in terms of WVR. A detailed
analysis of the original audio transcriptions confirmed, for some patients, the general tendency to slow down
the rhythm of a word in correspondence with syllable boundaries. This phenomenon was sometimes
associated with the hesitations shown by some patients when uttering a word. Figures 7.1 and 7.2 show, for the
APS-I and APS-II corpora respectively, the results achieved by including syllable boundaries in the keyword
model in comparison with the standard transcription. Data were compared with the previous experiments
for a specific operating point of the system regulated by a parameter, the background penalty term β,
already introduced in Section 6.1.3. Here, the operating point chosen is the same as in the VITHEA system
and is equal to 0.6. The average WVR achieved for both corpora with the different pronunciation models is
reported in Table 7.1.
                            Average WVR
                            APS-I    APS-II
Syllabified pronunciation   0.79     0.73
Standard pronunciation      0.80     0.60

Table 7.1: Average WVR for the APS-I and APS-II corpus with different pronunciation models.
To consolidate the results described above, a cross-validation experiment has been carried out. This
is performed in the same fashion as described in [Abad 13]: the data from every speaker was randomly
split into two halves. The first half is used to search for the best β parameter on that data subset.
Then, the selected β penalty term is used to process the second half of the data, and the WVR is computed on
this second subset. Here, the experiments were performed both with the standard pronunciation model and
with the augmented one, guaranteeing that the same random partition was used for both tests. Overall
results, shown in Table 7.2, again confirmed small improvements in the recognition
performance for some patients when the syllabic version of the word is provided. However, also in this context the WVR
worsens for some patients. Thus, the average WVR computed over all the patients shows a more stable
trend, with no meaningful variability.
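The calibration procedure just described can be sketched like this; `wvr_for_beta` stands in for running the recognizer on a data subset at a given β and returning the resulting WVR, so it is an assumed interface rather than an actual AUDIMUS call:

```python
import random

def calibrate_beta(exercises, wvr_for_beta, betas, seed=0):
    """Randomly split one speaker's exercises into two halves, pick the β
    that maximizes WVR on the first half, and report the WVR obtained with
    that β on the held-out half."""
    rng = random.Random(seed)          # fixed seed: same partition for
    shuffled = list(exercises)         # standard and augmented lexicons
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    dev, held_out = shuffled[:half], shuffled[half:]
    best_beta = max(betas, key=lambda b: wvr_for_beta(dev, b))
    return best_beta, wvr_for_beta(held_out, best_beta)
```

The fixed random seed is the detail that guarantees, as stated above, that the same partition is reused when comparing the standard and the syllabified pronunciation models.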
Figure 7.2: Results for the APS-II corpus comparing the two pronunciation models, the standard and the augmented version provided with syllable boundaries.
APS-I
Patient   WVR syllabified pron.   WVR standard pron.
1         0.85                    0.85
2         0.79                    0.78
3         0.84                    0.83
4         0.85                    0.85
5         0.72                    0.72
6         0.92                    0.91
7         0.70                    0.71
8         0.93                    0.93
avg       0.83                    0.82

APS-II
Patient   WVR syllabified pron.   WVR standard pron.
9         0.87                    0.95
10        0.71                    0.72
11        0.78                    0.75
12        0.93                    0.93
13        0.81                    0.81
14        0.78                    0.81
15        0.75                    0.78
16        0.82                    0.88
avg       0.81                    0.83

Table 7.2: WVR for APS-I and APS-II data sets and average WVR, using automatically calibrated background penalty term.
The results achieved in the last section have demonstrated that, in some conditions, the introduction of
syllable boundaries may improve the recognition results. Patients that present a
slower rhythm in their speaking style, or patients that tend to hesitate, may benefit from the introduction of
this information within the pronunciation model. It should also be noted that in this work the orthographic
syllabification has been manually adapted to the phonetic transcription. It could be worthwhile to explore the
future integration of syllable boundaries derived from a phonetic perspective.
8

This last chapter presents the final remarks of this thesis. The main achievements are
summarized in Section 8.1, while Section 8.2 concludes by presenting some ideas for future work.
This work addressed the development of new features for the VITHEA system, a platform resulting from
the work conducted in the context of a three-year national project aiming at the development of a virtual
therapist for the recovery from a language disorder named aphasia.
The project, in which the author has been actively involved since the very beginning, has been publicly
available since July 2011 and is currently distributed to almost 160 users, including speech therapists and
patients. During the last phase of the project, several speech therapists from different institutions were asked
to use and evaluate the program. The assessment was performed through on-line questionnaires and
involved almost 30 speech therapy professionals. The results of this survey were remarkably good,
achieving an average score of 4.14 on a 5-point Likert scale (1 to 5). The average score obtained for
the question “Do you think VITHEA will help you at your work?” was 4.64.
Recently, the project has collected several awards from both the speech and the health-care com-
munities:
• November 2012: The VITHEA project was presented at the ”VII Jornadas en Tecnologıa del
Habla and III Iberian SLTech Workshop”, where it received the second best demo award.

• June 2013: The VITHEA project participated in the seventeenth edition of the conference Saude
CUF, focused on Mobile Health, where it won the ”Call for Papers” contest in the category ”Provision
of services”.
The work that I have carried out in the context of the VITHEA project and of this thesis has led to
the following publications:
• July 2011: An on-line system for remote treatment of aphasia. Speech and Language Processing
for Assistive Technologies (SLPAT). Anna Pompili, Alberto Abad, Isabel Trancoso, Jose Fonseca,
Isabel P. Martins, Gabriela Leal and Luisa Farrajota.
• November 2011: Vithea. Sistema online para tratamento da afasia. Encontro dos Tecnicos de Di-
agnostico e Terapeutica (Poster presentation). Faculdade de Medicina de Lisboa. Jose Fonseca,
Alberto Abad, Gabriela Leal, Luisa Farrajota, Anna Pompili, Isabel Trancoso, Isabel P. Martins.
• February 2012: VITHEA: Sistema online para tratamento da nomeacao oral na afasia. 6o Con-
gresso Portugues do AVC da Sociedade Portuguesa de AVC (Poster presentation). Jose Fonseca,
Alberto Abad, Gabriela Leal, Luisa Farrajota, Anna Pompili, Isabel Trancoso, and Isabel P. Martins.
• April 2012: VITHEA: On-line therapy for aphasic patients exploiting automatic speech recognition.
International Conference on Computational Processing of the Portuguese Language (Propor 2012)
- Demo Session. Anna Pompili and Alberto Abad.
• September 2012: Automatic word naming recognition for treatment and assessment of aphasia.
13th Annual Conference of the International Speech Communication Association (InterSpeech
2012). Alberto Abad, Anna Pompili, Angela Costa, Isabel Trancoso.
• October 2012: Automatic word naming recognition for an on-line aphasia treatment system. Spe-
cial Issue on Speech Proc. & NLP for AT. Computer Speech and Language, Elsevier. Alberto Abad,
Anna Pompili, Angela Costa, Isabel Trancoso, Jose Fonseca, Gabriela Leal, Luisa Farrajota, Isabel
P. Martins.
• November 2012: VITHEA: On-line word naming therapy in Portuguese for aphasic patients exploit-
ing automatic speech recognition. ”VII Jornadas en Tecnologıa del Habla” and III Iberian SLTech
Workshop (IberSPEECH2012). Anna Pompili, Pedro Fialho and Alberto Abad.
• June 2013: Vithea: Virtual therapist for aphasia treatment. XVII Edition of Conferencias
SAUDECUF. Alberto Abad, Anna Pompili, Isabel Trancoso, Jose Fonseca, Isabel P. Martins.
The success of the system motivated research on additional features which could extend its
functionality and robustness. These extensions have been the objectives of the present work and
have concerned many aspects of the VITHEA platform, from its architecture to one of the main
components of the system: the speech recognition engine.
Probably the main contribution of this thesis has been the development of a mobile version of the
client module, which has shown the feasibility of this kind of system on such increasingly widespread and
popular devices. The user experience evaluation provided remarkably good results, encouraging the further
development of this version.
A custom approach for enabling a hands-free interface has been designed and implemented. This
required an important architectural update and the exploitation of a recently standardized protocol to
overcome important limitations. The algorithm has been tailored to the speech characteristics that one
would expect from people with language disorders, and the implementation was designed to
facilitate the performance of the therapy session.
The administration platform of the project has been provided with an advanced search capability that
enhances the usability of the application and improves the management of the system resources.
Techniques from Information Retrieval have made it possible to obtain high recall in the results retrieved from
the system.
Another important achievement for the project has been the implementation of a new category of
exercise. A probabilistic keyword model has been generated to support and evaluate the introduction
of a specific semantic-category naming exercise, the animal naming task. The automated evaluation showed
promising results, which will certainly lead to a future implementation of the exercise in the on-line platform
itself.
To conclude, an alternative pronunciation lexicon has been exploited in order to improve the robustness
of the speech recognizer. Rule-based software has been used to generate an alternative lexicon
according to the orthographic form. Recognition results exploiting this lexicon have shown, for some
patients, a slight improvement in the automatic word naming recognition performance.
This work allowed me to apply and deepen many of the topics learnt during the last years. I had
the possibility to exploit new techniques of Information Retrieval, to implement the concepts studied
in software engineering and system security courses, and of course to test my planning and management
skills. I also finally had the chance to take a closer look at the challenges that surround the
area of speech recognition, a topic that has particularly attracted my interest after so many years of
involvement in the project.
Some directions for future work have already been identified in the course of this document. Among
these is, of course, the idea of extending the version of the system dedicated to mobile devices with more
functionalities, strengthening it with advanced signal processing techniques, and increasing its
robustness.
Regarding the hands-free interface, a future extension aiming at improving the quality
of the speech/non-speech detection may consider exploiting the in-house AUDIMUS
speech segmentation system to perform this task on the server side. Alternatively, another possibility
would be to extend the client-side version of the speech detector by using complementary relevant
features in addition to the signal energy, such as, for instance, the zero-crossing rate.
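A toy frame-level illustration of combining the signal energy with the zero-crossing rate; the thresholds and the decision rule below are placeholders that would need calibration on real data, not the detector actually deployed in the client:

```python
def frame_energy(frame):
    """Mean squared amplitude of one signal frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs that change sign."""
    crossings = sum((frame[i - 1] < 0) != (frame[i] < 0)
                    for i in range(1, len(frame)))
    return crossings / (len(frame) - 1)

def is_speech_frame(frame, energy_thr=0.01, zcr_thr=0.5):
    """Placeholder decision: sufficient energy together with a moderate
    ZCR suggests voiced speech, while high-ZCR low-energy frames are more
    typical of noise; a real detector would also smooth decisions across
    consecutive frames."""
    return frame_energy(frame) > energy_thr and zero_crossing_rate(frame) < zcr_thr
```

The appeal of the ZCR as a complementary feature is that it is as cheap to compute on the client as the energy, yet reacts differently to broadband noise and to voiced speech.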
Regarding the extended search capability developed, we plan to refine the metadata generation
process and to exploit a stemming algorithm, which would allow considering the root of a word instead
of its derivations.
Another important area of the project which deserves further research concerns
the current set of available exercises. Besides the future integration of the animal naming task, the
set of exercises could be extended to other types of stimulation, such as automatic serial naming or
picture description.
To a greater extent, with the introduction of new types of exercises, the VITHEA system itself could
be applied to other kinds of language or even cognitive disorders. Currently, the VITHEA
system is indeed already being recommended by speech therapists to some patients that do not suffer from
aphasia but present rather similar symptoms. This is the case of a recent patient affected
by amyotrophic lateral sclerosis.
For instance, Alzheimer’s disease is a neurodegenerative process whose symptoms manifest
predominantly as a disruption of memory processing that secondarily affects other cognitive
abilities [Ashford 08]. In Alzheimer’s disease, linguistic tasks are used to evaluate and monitor the level of
cognitive dysfunction. For this purpose, a typical task commonly used is the semantic fluency task. This
consists of naming as many words as possible belonging to a specific category within a one-minute
interval. The most common category used for this test is the “animals” category; this subset is therefore
commonly referred to as Animal Naming or Animal Fluency and is part of the CERAD battery (Consortium
to Establish a Registry for Alzheimer’s Disease).
[Abad 08] A. Abad & J. Neto. Automatic classification and transcription of telephone speech
in radio broadcast data. In Proc. International Conference on Computational Pro-
cessing of Portuguese Language (PROPOR), 2008.
[Abad 12] A. Abad, A. Pompili, A. Costa & I. Trancoso. Automatic word naming recogni-
tion for treatment and assessment of aphasia. In 13th Annual Conference of the
International Speech Communication Association (InterSpeech 2012), 2012.
[Abad 13] A. Abad, A. Pompili, A. Costa, I. Trancoso, J. Fonseca, G. Leal, L. Farrajota &
I. P. Martins. Automatic word naming recognition for an on-line aphasia treatment
system. Computer Speech & Language, vol. 27, no. 6, pages 1235–1248, 2013.
Special Issue on Speech and Language Processing for Assistive Technology.
[Adlam 06] A.-L. R. Adlam, K. Patterson, T. T. Rogers, P. J. Nestor, C. H. Salmond, J. Acosta-
Cabronero & J. R. Hodges. Semantic dementia and fluent primary progressive
aphasia: two sides of the same coin? Brain, vol. 129, no. 11, pages 3066–3080,
2006.
[Albert 94] M. L. Albert, R. W. Sparks & N. A. Helm. Report of the Therapeutics and Tech-
nology Assessment Subcommittee of the American Academy of Neurology. As-
sessment: melodic intonation therapy. Neurology, vol. 44, pages 566–568, 1994.
[Albert 98] M. L. Albert. Treatment of aphasia. In Archive of Neurology, volume 55, pages
1417–1419, 1998.
[Aronson 97] A. R. Aronson & T. C. Rindflesch. Query expansion using the UMLS Metathe-
saurus. Proc AMIA Annu Fall Symp, 1997.
[Ashford 08] J. W. Ashford. Screening for Memory Disorder, Dementia, and Alzheimer’s dis-
ease. In Aging Health, volume 4, pages 399–432, 2008.
[Basso 92] A. Basso. Prognostic factors in aphasia. Aphasiology, vol. 6, no. 4, pages 337–
348, 1992.
[Bell 08] M. Bell. Introduction to Service-Oriented Modeling. Service-Oriented Modeling:
Service analysis, design, and architecture. Wiley & Sons, 2008.
[Bhogal 03] S. K. Bhogal, R. Teasell & M. Speechley. Intensity of aphasia therapy, impact on
recovery. Stroke, pages 987–993, 2003.
[Campbell 05] W. W. Campbell. Dejong’s the neurologic examination, chapter Disorders of
Speech and Language. 2005.
[Candeias 08] S. Candeias & F. Perdigao. Conversor de grafemas para fones baseado em regras
para portugues. In Proc. 10 Years of Linguateca - PROPOR 2008, 2008.
[Candeias 09] S. Candeias & F. Perdigao. Syllable Structure Prototype for Portuguese Teach-
ing/Learning. In Proc. Athens Institute for Education and Research International
Conf. on Literatures, Languages, 2009.
[Candeias 11] S. Candeias & F. Perdigao. Syllable Structure Prototype for Portuguese Teach-
ing/Learning. In The 17th International Congress of Phonetic Sciences (ICPhS
XVII), 2011.
[Caseiro 02] D. Caseiro, I. Trancoso, L. Oliveira & C. Viana. Grapheme-to-phone using finite-
state transducers. In Proceedings of 2002 IEEE Workshop on Speech Synthesis,
pages 215 – 218, 2002.
[Caseiro 06] D. Caseiro & I. Trancoso. A specialized on-the-fly algorithm for lexicon and lan-
guage model composition. IEEE Transactions on Audio, Speech & Language
Processing, vol. 14, no. 4, pages 1281–1291, 2006.
[Chuangsuwanich 11] E. Chuangsuwanich & J. R. Glass. Robust Voice Activity Detector for Real World
Applications Using Harmonicity and Modulation Frequency. In INTERSPEECH,
pages 2645–2648. ISCA, 2011.
[Code 94] C. Code & M. J. Ball. Syllabification in aphasic recurring utterances: contributions
of sonority theory. Journal of Neurolinguistics, vol. 8, no. 4, pages 257 – 265,
1994.
[Cui 02] H. Cui, J.-R. Wen, J.-Y. Nie & W.-Y Ma. Probabilistic query expansion using query
logs. In Proceedings of the 11th international conference on World Wide Web,
WWW ’02, pages 325–332, New York, NY, USA, 2002. ACM.
[Davis 85] G. A. Davis & M. L. Wilcox. Adult aphasia rehabilitation: Applied pragmatics.
College Hill Press, 1985.
[Ferro 99] J. M. Ferro, G. Mariano & S Madureira. Recovery from Aphasia and Neglect.
Cerebrovasc Dis, vol. 9, pages 6–22, 1999.
[Fielding 00] R. T. Fielding. Architectural Styles and the Design of Network-based Software
Architectures. PhD thesis, UNIVERSITY OF CALIFORNIA, IRVINE, 2000.
[Fielding 02] R. T. Fielding & R. N. Taylor. Principled design of the modern Web architecture.
ACM Trans. Internet Technol., vol. 2, no. 2, pages 115–150, May 2002.
[Goncalo Oliveira 12] H. Goncalo Oliveira, L. Anton Perez & P. Gomes. Integrating lexical-semantic
knowledge to build a public lexical ontology for portuguese. In Proceedings of the
17th international conference on Applications of Natural Language Processing
and Information Systems, NLDB’12, pages 210–215, Berlin, Heidelberg, 2012.
Springer-Verlag.
[Gong 05] Z. Gong, C. W. Cheang & L. Hou U. Web query expansion by Wordnet. In DEXA,
pages 166–175, 2005.
[Goodglass 93] H. Goodglass. Understanding aphasia: technical report. Rapport technique, Uni-
versity of California. San Diego. Academic Press, 1993.
[HBP 12] The Human Brain Project Pilot Report, 2012.
[Hermansky 90] H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. Journal of
the Acoustical Society of America, vol. 87, no. 4, pages 1738–1752, 1990.
[Hermansky 92] H. Hermansky, N. Morgan, A. Bayya & P. Kohn. RASTA-PLP speech analysis tech-
nique. In Proceedings of the 1992 IEEE international conference on Acoustics,
speech and signal processing, volume 1 of ICASSP’92, pages 121–124, 1992.
[Howard 85] D. Howard, K. Patterson, S. Franklin, V. Orchard-Lisle & J. Morton. The facilitation
of picture naming in aphasia. Cognitive Neuropsychology, vol. 2, pages 49–80,
1985.
[Hunt 04] M. Hunt. Speech recognition, syllabification and statistical phonetics. In Proc.
International Conference on Speech and Language Processing, Interspeech-04,
October 2004.
[Jamaati 08] M. Jamaati, H. Marvi & M. Lankarany. Vowels recognition using mellin transform
and plp-based feature extraction. J Acoust Soc Am, vol. 123, no. 5, page 3177,
2008.
[Kingsbury 98] B. E. D. Kingsbury, N. Morgan & S. Greenberg. Robust speech recognition using
the modulation spectrogram. Speech Communication, vol. 25, no. 1-3, pages 117
– 132, 1998.
[Knutsen 09] J. Knutsen. Web service clients on mobile android devices - a study on architec-
tural alternatives and client performance. Master’s thesis, Norwegian University
of Science and Technology, 2009.
[Koller 10] O. T. A. Koller. Automatic speech recognition and identification of African Por-
tuguese. Diploma thesis, Berlin University of Technology, June 2010.
[Maier 09] A. Maier, T. Haderlein, U. Eysholdt, F. Rosanowski, A. Batliner, M. Schuster &
E. Noth. {PEAKS} – A system for the automatic evaluation of voice and speech
disorders. Speech Communication, vol. 51, no. 5, pages 425 – 437, 2009.
[Mamede 12] N. J. Mamede, J. Baptista, C. Diniz & V. Cabarrao. STRING: An Hybrid Statistical
and Rule-Based Natural Language Processing Chain for Portuguese. In Interna-
tional Conference on Computational Processing of Portuguese, Propor, 2012.
[Meinedo 00] H. Meinedo & J. P. Neto. Combination Of Acoustic Models In Continuous Speech
Recognition Hybrid Systems. vol. 2, pages 931–934, 2000.
[Meinedo 03] H. Meinedo, D. Caseiro, J. Neto & I. Trancoso. AUDIMUS.Media: a Broadcast
News speech recognition system for the European Portuguese language. In Proc.
International Conference on Computational Processing of Portuguese Language
(PROPOR), 2003.
[Meinedo 10] H. Meinedo, A. Abad, T. Pellegrini, I. Trancoso & J. P. Neto. The L2F Broadcast
News Speech Recognition System. In Fala2010, Vigo, Spain, 2010.
[Mohri 02] M. Mohri, F. Pereira & M. Riley. Weighted Finite-State Transducers in Speech
Recognition. Computer Speech and Language, vol. 16, pages 69–88, 2002.
[Morgan 95] N. Morgan & H. Bourlard. An introduction to hybrid HMM/connectionist continuous
speech recognition. IEEE Signal Processing Magazine, vol. 12, no. 3, pages 25–
42, 1995.
[Murray 01] L. L. Murray & R. Chapey. Assessment of Language Disorders in Adults. In
R. Chapey, editor, Language Intervention Strategies in Aphasia and Related
Neurogenic Communication Disorders. Lippincott Williams & Wilkins, 4th edition,
2001.
[Neto 96] J. P. Neto, C. Martins & L. B. Almeida. An Incremental Speaker-Adaptation Tech-
nique For Hybrid Hmm-Mlp Recognizer. In Recognizer, Proceedings ICSLP 96,
pages 1289–1292, 1996.
[Oliveira 05] C. Oliveira, L. C. Moutinho & A. J. S. Teixeira. On european Portuguese automatic
syllabification. In INTERSPEECH 2005 - Eurospeech, 9th European Conference
on Speech Communication and Technology, Lisbon, Portugal, pages 2933–2936.
ISCA, 2005.
[Ortmanns 00] S. Ortmanns & H. Ney. The time-conditioned approach in dynamic programming
search for LVCSR. Speech and Audio Processing, IEEE Transactions on, vol. 8,
no. 6, pages 676 –687, 2000.
[Paulo 08] S. Paulo, L. C. Oliveira, C. Mendes, L. Figueira, R. Cassaca, C. Viana & H. Moniz.
DIXI – A Generic Text-to-Speech System for European Portuguese. In Computa-
tional Processing of the Portuguese Language, volume 5190 of Lecture Notes in
Computer Science, pages 91–100. Springer Berlin Heidelberg, 2008.
[Pedersen 95] P. M. Pedersen, H. Stig Jørgensen, H. Nakayama, H. O. Raaschou & T. S. Olsen.
Aphasia in acute stroke: Incidence, determinants, and recovery. Annals of Neu-
rology, vol. 38, no. 4, pages 659–666, 1995.
[Pinto 07] J. Pinto, A. Lovitt & H. Hermansky. Exploiting Phoneme Similarities in Hybrid
HMM-ANN Keyword Spotting. In Proc. Interspeech, pages 1610–1613, 2007.
[Pompili 11] A. Pompili, A. Abad, I. Trancoso, J. Fonseca, I. P. Martins, G. Leal & L. Farrajota.
An on-line system for remote treatment of aphasia. In Proceedings of the Sec-
ond Workshop on Speech and Language Processing for Assistive Technologies,
SLPAT ’11, pages 1–10. Association for Computational Linguistics, 2011.
[Pompili 13] A. Pompili. New features for on-line aphasia therapy. Master's thesis, Instituto
Superior Técnico, 2013.
[Rabiner 89] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE, vol. 77, no. 2, pages 257–285, 1989.
[Rabiner 93] L. R. Rabiner & B. H. Juang. Fundamentals of speech recognition. Prentice Hall,
1993.
[Ramirez 04] J. Ramirez, J. C. Segura, C. Benitez, A. de la Torre & A. Rubio. Voice activity
detection with noise reduction and long-term spectral divergence estimation. In
IEEE International Conference on Acoustics, Speech, and Signal Processing
(ICASSP '04), volume 2, pages II-1093–1096, 2004.
[Richardson 07] L. Richardson & S. Ruby. RESTful Web Services. O'Reilly Media, 2007.
[Sangwan 02] A. Sangwan, M. C. Chiranth, H. S. Jamadagni, R. Sah, R. Venkatesha Prasad
& V. Gaurav. VAD techniques for real-time speech transmission on the Internet.
In 5th IEEE International Conference on High Speed Networks and Multimedia
Communications, pages 46–50, 2002.
[Sarno 81] M. T. Sarno. Recovery and rehabilitation in aphasia. In Acquired Aphasia, pages
485–530. Academic Press, New York, 1981.
[Snodgrass 80] J. G. Snodgrass & M. Vanderwart. A standardized set of 260 pictures: Norms
for name agreement, image agreement, familiarity, and visual complexity. Journal
of Experimental Psychology: Human Learning and Memory, vol. 6, no. 2, pages
174–215, 1980.
[Szoke 05] I. Szoke, P. Schwarz, P. Matejka, L. Burget, M. Karafiat, M. Fapso & J. Cernocky.
Comparison of Keyword Spotting Approaches for Informal Continuous Speech. In
Proc. Interspeech, pages 633–636, 2005.
[Tebelskis 95] J. Tebelskis. Speech recognition using neural networks. PhD thesis, Carnegie
Mellon University, 1995.
[Trancoso 03] I. Trancoso, C. Viana, M. Barros, D. Caseiro & S. Paulo. From Portuguese to
Mirandese: fast porting of a letter-to-sound module using FSTs. In Proceedings of
the 6th international conference on Computational processing of the Portuguese
language, PROPOR’03, pages 49–56, Berlin, Heidelberg, 2003.
[Voorhees 94] E. M. Voorhees. Query Expansion using Lexical-Semantic Relations. In Proceed-
ings of the 17th Annual International ACM SIGIR conference on Research and
Development in Information Retrieval, SIGIR ’94, pages 61–69. Springer London,
1994.
[Weinrich 91] M. Weinrich. Computerized visual communication as an alternative communica-
tion system and therapeutic tool. Neurolinguist., vol. 6, pages 159–176, 1991.
[Wilshire 00] C. E. Wilshire & H. B. Coslett. Disorders of word retrieval in aphasia: theories and
potential applications. In S. E. Nadeau, L. J. G. Rothi & B. Crosson, editors,
Aphasia and Language: Theory to Practice, pages 82–107. The Guilford
Press, New York, 2000.