3spandh.dcs.shef.ac.uk/projects/hoarse/project_only/hoarse_mtr.doc · Web viewAt least one young...

Mid Term Review Report

Research Training Network

Hearing, Organisation and Recognition of Speech in Europe

HOARSE

Contract N°: HPRN-CT-2002-00276 Commencement date of contract: 1/9/2002 Duration of contract (months): 48

Period covered by the report: 1/9/2002 to 31/8/2004

Coordinator:

Professor Phil Green
Department of Computer Science, University of Sheffield
Regent Court, Portobello St., Sheffield S1 4DP, UK
Phone: +44 114 222 1828; Fax: +44 114 222 1810; e-mail: [email protected]

Mid Term Review Meeting : Daimler-Chrysler Research Labs, Ulm, Germany, 29/10/04, 0930

HOARSE Partners

1. The University of Sheffield [USFD], coordinator
2. Ruhr-Universität Bochum [RUB]
3. DaimlerChrysler AG [DCAG]
4. Helsinki University of Technology [HUT]
5. Institut Dalle Molle d’Intelligence Artificielle Perceptive [IDIAP]
6. Liverpool University [UNILIV]
7. University of Patras [PATRAS]

Part A. Research Results

A.1 Scientific Highlights

At least one young researcher is now in post at each HOARSE lab. Here are the highlights of their work so far.

At Sheffield, doctoral researcher JANA EGGINK is researching Task 1.5, auditory scene analysis in music. This is an addition to the original HOARSE work programme, added in the year 1 report. Eggink is focussing on instrument recognition, which is in many ways related to speaker identification, and techniques developed for the latter problem have been successfully adapted for monophonic instrument recognition. To enable instrument recognition when more than one sound source is present, Eggink has deployed missing data techniques, which have been developed at USFD for speech or speaker recognition in the presence of noise. Eggink’s first system was intended to identify the fundamental frequencies (F0s) of all notes in pieces for small musical ensembles, and to determine the instruments on which these notes were played. The F0s were estimated first, using a frequency domain pattern matching approach related to the so-called 'harmonic sieve'. Based on the F0s, missing data masks were then constructed by declaring all frequency regions where a harmonic of an interfering tone was found to be unreliable or ‘missing’. The actual instrument classification was carried out using a Gaussian mixture model classifier, adapted to work with missing data. Results were generally good, not only for artificial mixtures of tones with known F0s, but also for examples taken from commercially available music compact discs.
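The mask construction described above can be illustrated with a minimal sketch. It is not Eggink's implementation: the function name `harmonic_mask`, the evenly spaced channel centres and the 3% relative tolerance are all assumptions made for illustration. It only shows the idea of flagging frequency regions that coincide with a harmonic of an interfering tone as 'missing'.

```python
def harmonic_mask(channel_centres_hz, interferer_f0_hz, tolerance=0.03):
    """Return one flag per channel: True = reliable, False = 'missing'.

    A channel is flagged missing when its centre frequency lies within
    `tolerance` (relative) of a harmonic of the interfering tone.
    The tolerance value is an illustrative assumption.
    """
    mask = []
    for centre in channel_centres_hz:
        # Harmonic of the interferer closest to this channel (at least the first).
        number = max(1, round(centre / interferer_f0_hz))
        harmonic = number * interferer_f0_hz
        is_masked = abs(centre - harmonic) <= tolerance * harmonic
        mask.append(not is_masked)
    return mask


# Hypothetical 20-channel analysis with centres at 100, 200, ..., 2000 Hz
# and an interfering tone whose F0 has been estimated as 400 Hz.
centres = [100.0 * i for i in range(1, 21)]
mask = harmonic_mask(centres, interferer_f0_hz=400.0)
missing = [c for c, reliable in zip(centres, mask) if not reliable]
```

In this toy setting only the channels centred on multiples of 400 Hz are marked missing; a classifier adapted to missing data would then base its decision on the remaining channels.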

A system for the recognition of the solo instrument in accompanied sonatas and concertos was subsequently developed. Compared with the previous system there was a change of focus from the background towards the foreground. Instead of identifying regions dominated by interfering sound sources, only the harmonic series belonging to the dominant instrument is identified and used for recognition. Test material is taken from commercially available classical music CDs, without placing any restrictions on the background accompaniment. The recognition accuracies achieved are comparable to those of systems developed to deal with monophonic music only. In an additional step, knowledge about the solo instrument is used to extract the F0s of the main melodic line played by this instrument. Combining different knowledge sources in a probabilistic framework led to a significant improvement in F0 estimation when compared to a baseline system using only bottom-up processing.

At Bochum, doctoral researcher JUHA MERIMAA has concentrated on HOARSE Tasks 2.1 (Researching the precedence effect), 2.2 (Reliability of auditory cues in multi-source scenarios), and 2.3 (Perceptual models of room reverberation with application to speech recognition). In year 1 a listening test was conducted to investigate the perceptual grouping of binaural cues and their relation to sound sources and the listening environment. The data showed a complex dependence on the source signal properties. Reliability of binaural auditory cues (Task 2.2) was also directly addressed in an auditory modelling study in collaboration with Agere Systems. This investigation described the behaviour of interaural time-difference (ITD) and interaural level-difference (ILD) cues in some typical multisource and reverberant scenarios. Later, a method for extracting reliable cues based on a novel way of analysing the output of an auditory cross-correlation model was proposed. A novel auditory modelling mechanism predicting localization under precedence effect, multi-source, and reverberant conditions has been proposed. The model is currently being investigated further by gathering new experimental data on the precedence effect. The perception of room reverberation has also been investigated in a study of spatial impression. The first part of this work included developing the experimental methods, as well as finding and training suitable test subjects for the listening experiments. Ongoing work concentrates on the effect of conflicting binaural cues on perception and on grouping of the cues to those related to sound sources and acoustical environments. Furthermore, a novel method for multi-channel loudspeaker reproduction of room reverberation has been developed in collaboration with HUT, leading to several joint papers.
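For readers unfamiliar with cross-correlation models, the following sketch shows how an ITD can be read off as the lag that maximises the cross-correlation between the two ear signals. It is a plain, full-band illustration under assumed names (`estimate_itd_samples`, a 48 kHz sampling rate); it is not the cue-selection method proposed in the study above, which analyses the output of an auditory model.

```python
def estimate_itd_samples(left, right, max_lag):
    """Lag in samples that maximises the cross-correlation.

    A positive result means the right-ear signal lags the left-ear signal.
    """
    best_lag, best_score = 0, float("-inf")
    n = len(left)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy signals: the right channel is the left channel delayed by 3 samples.
left = [0.0, 0.0, 1.0, 3.0, -2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
right = [0.0, 0.0, 0.0] + left[:-3]

sample_rate_hz = 48000  # assumed sampling rate
lag = estimate_itd_samples(left, right, max_lag=5)
itd_seconds = lag / sample_rate_hz
```

In reverberant or multi-source conditions the peak of such a function becomes unreliable, which is the problem the cue-selection work addresses.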

Also at Bochum, post-doctoral researcher JOHN WORLEY investigated the cues to sound source location and, subsequently, the resolution of the ‘cone of confusion’ contained within an individual listener’s head-related transfer function (HRTF). This work is related to HOARSE Tasks 2.1 (the precedence effect) and 2.2 (Reliability of auditory cues in multi-source scenarios). Individual differences in HRTF measurements relate to the differences in the size and shape of individuals’ pinnæ and heads. Typically, when the source location is synthesised from non-individualised HRTFs, listeners display a significant number of reversal errors. It has been suggested that listeners can learn to associate the cues in non-individualised HRTFs to resolve source location. Worley performed a longitudinal study to assess whether inexperienced listeners can learn to resolve front-back allocation with location synthesised from HRTFs provided by a non-individualised dummy head. Over a training period that spanned 9 days, listeners did not display a reduction in the number of reported reversal errors. However, listeners did display a difference in their propensity for visual capture of the auditory event by the visual cue. The results were analysed with reference to a difference between dynamic and passive listening with non-individualised HRTFs. Worley has since moved to Patras (see below).

At DCAG, doctoral researcher JULIEN BOURGEOIS is working on Task 4.2, informing speech recognition. His first-year aim was to study and to evaluate existing blind and semi-blind source separation methods. This theoretical and implementation effort encompassed speech-specific methods (CASA and AMDecor) that exploit the correlated modulation of the sources in different frequency bands. Another group of techniques makes use of the statistical independence between the sources. Various independence measures were tested, using either second order or higher order information. Decorrelation was shown to be the most robust separation criterion in noisy conditions. Observing that speech signals fill a small portion of the time-frequency plane, source separation methods assuming that only one source is present at each point of the spectrogram were also considered and led to the best results in terms of interference reduction. Spatial filtering (beamforming) is a classical alternative to these methods. It shows noise reduction ability but its performance depends on reliable voice activity detection. Most methods require at least as many microphones as sources. For the evaluation of the various approaches (Task 5.1) Bourgeois used a set of speech recordings from the TIMIT digit database in highway conditions at different driving speeds. These were made with artificial heads so that the acoustic response between the speaker and the microphone remains constant.

In year 2, Bourgeois concentrated on the comparison between linear blind source separation (BSS) methods and minimum-variance (or beamforming) techniques for the separation of the driver and co-driver speech in cars. He observed experimentally that BSS performs poorly when microphones are placed on the roof, as close as possible to the mouth of each speaker. Bourgeois examined this theoretically and showed that when the input signal-to-interference ratio (SIR) at the microphone is above a certain threshold, BSS is not able to bring any crosstalk reduction, whereas beamforming still improves the SIR. Another limitation of BSS methods is their slower convergence on so-called non-causal mixtures, which arise for example if the two speakers are on the same half-plane defined by the microphone position. As a consequence, it is not advantageous to incorporate spatial prior information (available in cars) as hard constraints on the separation filters. This finding is confirmed in other experimental settings. Therefore, classical beamforming methods are preferable whenever reasonable speaker activity detection can be achieved.

In Task 5.1, Speech recognition evaluation in multi-speaker conditions, DCAG made additional recordings using the commercial S-Klasse mirror beamformer and close-talk microphones. In further multi-speaker recognition evaluation on these recordings, BSS methods achieved a smaller reduction in word error rate than beamforming.

At HUT Helsinki, doctoral researcher EVA BJORKNER is working on HOARSE Tasks 3.1 (Glottal excitation estimation) and 3.2 (Voice production studies). She has been comparing habitual and throaty voice using vocal tract area functions obtained from MR imaging and acoustic recordings analysed by inverse filtering (Task 3.1, glottal excitation estimation) for the vowels /a, ae, i, u/. This experiment combines for the first time visual and acoustical analyses of the vocal tract area functions for throaty voice production. Bjorkner has gone on to study physiological differences between chest and head register in the female singing voice by inverse filtering the oral airflow recorded for a sequence of /pae/ syllables sung at constant pitch and decreasing vocal loudness in each register by seven female musical theatre singers. Ten equidistantly spaced subglottal pressure (Ps) values were selected and the relationships between Ps and several parameters were examined. The normalised amplitude quotient (NAQ) was used for measuring glottal adduction. Development and evaluation of inverse filtering has been studied using physiological modelling of voice production as well as high-speed digital imaging of vocal fold fluctuation. Thus this experiment combines several Ps values with NAQ to measure glottal adduction.
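As an aid to reading the NAQ results, the sketch below computes the quotient for one cycle of an inverse-filtered glottal flow signal, using the usual definition NAQ = f_ac / (d_peak · T): the peak-to-peak flow amplitude divided by the product of the magnitude of the negative peak of the flow derivative and the fundamental period. The function name `naq`, the first-difference derivative and the toy triangular pulse are illustrative assumptions, not the analysis pipeline used at HUT.

```python
def naq(flow, sample_rate_hz, f0_hz):
    """Normalised amplitude quotient of a single glottal-flow cycle."""
    # Peak-to-peak (AC) amplitude of the flow pulse.
    f_ac = max(flow) - min(flow)
    # First difference scaled by the sampling rate approximates the derivative.
    derivative = [(b - a) * sample_rate_hz for a, b in zip(flow, flow[1:])]
    # Magnitude of the negative peak of the flow derivative.
    d_peak = -min(derivative)
    period_s = 1.0 / f0_hz
    return f_ac / (d_peak * period_s)


# Toy cycle: slow opening phase, fast closing phase, then a closed phase.
cycle = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 0.5, 0.0, 0.0, 0.0]
value = naq(cycle, sample_rate_hz=1000, f0_hz=100)
```

Because the quotient is normalised by the period, values obtained at different pitches can be compared directly.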

At IDIAP, researcher VIKTORIA MAIER studied contextual and temporal information in speech and its use in ASR: The relevant HOARSE Task is 4.2, Informing Speech Recognition.

The classic experiment of Liberman et al (1952) was re-designed and a perceptual test was run on a group of 37 listeners. Results were broadly consistent with Liberman et al (1952). The implications for HMM-based speech recognition systems were discussed.

The importance of the number of emitting states in a model, and its relationship to phoneme duration, has been analysed.

Viktoria Maier is about to leave the HOARSE network and continue her doctoral work at Sheffield, under different funding.

Also at IDIAP, doctoral researcher PETR SVOJANOVSKY is extending the TRAP-TANDEM model proposed by IDIAP. The main effort is towards universal classifiers of frequency-localized patterns, extending (Hermansky and Jain, Eurospeech 2003). Recently an interesting and apparently effective method of training a classifier on a particular frequency band and applying it at all other frequencies has emerged from Svojanovsky’s work. The HOARSE task involved here is 4.3, Advanced ASR algorithms.

Svojanovsky was also involved in ASR experiments with nonsense syllables. This database in principle allows for evaluation of an automatic recognizer independently of any language-level constraints. This work relates to HOARSE Task 5.1, Speech recognition evaluation in multi-speaker conditions.

Doctoral researcher GUILLAUME LATHOUD is working under Task 5.2 (signal and speech detection in sound mixtures) on overlaps between speakers. Previously proposed microphone array-based speaker segmentation methods were extended into a generic short-term segmentation/tracking framework [Lathoud et al. 04] that successfully copes with an unknown number of speakers and unknown speaker locations. An audiovisual database called AV16.3 is now accessible online [Lathoud et al. 04], including a variety of multi-speaker cases, 3D location annotation and some speech/silence segmentation annotation. Recent work focused on sector-based detection and localization of multiple sources [Lathoud et al. 04].

At Liverpool, the work of doctoral researcher ELVIRA PEREZ has concentrated on Task 1.3, active/passive speech perception. The basic assumption underlying this work is that human listeners actively build models of how two simultaneous sound sources will sound and that therefore it will be easier to segregate highly predictable maskers from speech patterns than maskers with the same long-term spectro-temporal properties that are not predictable. Perez’ work in the previous year suggested that listeners do not make short term temporal predictions to aid segregation. These initial findings were confirmed by two sets of experiments which evaluated speech intelligibility in two contexts:

1. regularly spaced and randomly spaced noise bursts (to test temporal prediction);
2. using a predictable and an unpredictable frequency modulated sinewave that could be integrated into the speech percept or heard as a separate sound.

Both experiments confirm that our ability to segregate signals from maskers does not exploit (or rely on) regularity of the masker. A paper on this work is in preparation.

Perez has taken a year out of HOARSE to pursue a collaboration with Columbia University, NY, on a Fulbright fellowship. She has now returned to Liverpool.

Also at Liverpool, post-doctoral researcher PATTI ADANK has worked on Task 1.4: Envelope information and binaural processing. Successful segregation of background noise not only requires the separation of complex auditory scenes into components linked to individual sound sources, but also the formation of streams out of the separate ‘snippets’ that are segregated from each other. Adank concentrated on the use of voice characteristics to help segregation of simultaneous speakers. Previous work has shown that listeners are able to segregate spatially disparate signals much better when spoken by different speakers (Darwin and Hukin, 2000). This has led to the suggestion that a two-stage process may first segregate the signals on F0 and that in a second stage components with matching spatial cues are grouped to form a speech stream. The results also appear to be consistent with the finding that vocal tract differences aid in the segregation of simultaneous sentences provided they are spoken on different F0s (Darwin et al., 2003). Important voice characteristics are local amplitude modulation (flutter) or random F0 variation (jitter). Modelling work (Ellis 1993) has shown that jitter can be extracted by computational models and used for grouping. Our aim was to test whether human listeners use this cue either as a primary segregation cue (analogous to the F0 cue) or as a secondary segregation cue (analogous to spatial cues) that helps in stream formation. In a first experiment vowel pairs were synthesized with a range of jitter and F0 values. We show that while F0 differences lead to improved recognition, manipulation of the F0 jitter does not, and therefore conclude that jitter is not a primary grouping cue.

In a second set of experiments listeners were presented with sentences that were synthesized with pitch and jitter differences to test whether jitter might aid stream formation. Again our results show that the introduction of jitter does not aid in the segregation of sentences. This leaves the intriguing question of how speaker specific information aids stream formation. A technical report on this work is available.

Other work at Liverpool has addressed Task 2.2 Reliability of auditory cues in multi-cue scenarios. A key question for systems that have to integrate multiple cues is how to combine and weight the different cues that are available. At Liverpool a range of experiments examining combination rules for low-level auditory and visual motion signals was conducted. Three models for cue integration were formalised: independent decisions, probability summation (i.e. independent local decisions) and linear summation (i.e. direct integration of the signals before decisions are made). Results showed that human observers use probability summation for signals that are not ecologically plausible, such as motion signals that run in opposite directions, and linear summation for signals that are ecologically plausible. This work was presented at ICA2004, Kyoto. A paper on this topic has been accepted for publication. Liverpool are currently extending the work on this topic by recording EEG traces. The results show that evoked potentials in response to auditory and visual signals are faster and larger than would be predicted from the sum of the unimodal responses. This suggests that there are specific processing sites for multi-modal signals. The work was presented at IMRF2004, Barcelona, and the Liverpool team is currently preparing a set of EEG experiments to study the integration of non-speech signals into speech sounds.
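The difference between the three integration models can be made concrete with a small sketch. The exact formalisation used at Liverpool is not given in this report, so the following is an assumption: signal strengths are expressed in units of their unimodal detection thresholds, probability summation is approximated by Minkowski (Quick) pooling with an assumed exponent `beta`, and a compound stimulus counts as detectable when its combined strength reaches 1. The function name `combined_strength` is hypothetical.

```python
def combined_strength(auditory, visual, model, beta=3.0):
    """Combined strength of an audio-visual stimulus.

    `auditory` and `visual` are strengths in units of each unimodal
    threshold, so a value of 1.0 is just detectable on its own.
    """
    if model == "independent":
        # Independent decisions: only the stronger cue matters.
        return max(auditory, visual)
    if model == "probability":
        # Probability summation, approximated by Minkowski pooling
        # (the exponent is an assumed value).
        return (auditory ** beta + visual ** beta) ** (1.0 / beta)
    if model == "linear":
        # Linear summation: the signals add before a single decision.
        return auditory + visual
    raise ValueError(f"unknown model: {model}")


# Two cues, each at 60% of its own threshold.
independent = combined_strength(0.6, 0.6, "independent")
probability = combined_strength(0.6, 0.6, "probability")
linear = combined_strength(0.6, 0.6, "linear")
```

Under these assumptions only linear summation makes the compound stimulus detectable, while probability summation gives a smaller benefit, which is the kind of contrast the experiments exploit.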

Liverpool are collaborating with Bochum and Sheffield in Task 4.2: Informing speech recognition. Liverpool carried out initial studies aiming to use linear prediction of the energy in 32 channels of an auditory filterbank to predict noise spectra based on past data. The results, based on the AURORA noises, show that short term prediction should lead to much better noise estimates than measures such as the long term average. The gains are larger for non-stationary noises than for stationary noises, because for stationary noises the long term average is already a good predictor. The current aim is to record a database of typical environmental noises to evaluate the system with a reasonable sample of sounds. With help from Bochum, Liverpool built a set of in-ear microphones that can be used with a DAT recorder to record binaural environmental sounds, and are now collaborating with Sheffield to make the recordings.
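The contrast between short term prediction and the long term average can be sketched for a single filterbank channel. The report does not state the predictor order or fitting method, so the first-order least-squares predictor below, and the names `predict_next` and `long_term_average`, are assumptions chosen to keep the illustration small; in a 32-channel system the same step would be applied to each channel in turn.

```python
def fit_first_order(history):
    """Least-squares fit of x[t+1] = a * x[t] + b on past channel energies."""
    xs, ys = history[:-1], history[1:]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx > 0 else 0.0
    b = mean_y - a * mean_x
    return a, b


def predict_next(history):
    """Short term prediction of the next frame's energy."""
    a, b = fit_first_order(history)
    return a * history[-1] + b


def long_term_average(history):
    """Baseline estimate: mean energy over all past frames."""
    return sum(history) / len(history)


# Toy channel energies (arbitrary units): one rising, one constant.
rising = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
steady = [3.0, 3.0, 3.0, 3.0]

rising_prediction = predict_next(rising)
rising_average = long_term_average(rising)
steady_prediction = predict_next(steady)
```

For the rising sequence the predictor follows the trend while the average lags well behind; for the constant sequence both estimates coincide, in line with the observation that the gain is larger for non-stationary noise.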

The team at Patras is engaged on several HOARSE tasks. The post-doctoral researcher involved is JOHN WORLEY (previously at Bochum).

Task 2.3: Perceptual models of room reverberation with application to speech recognition: Work has been performed based on the use of smoothed room response measurements. The tests have illustrated some novel aspects of response measurements when employed for real-time room acoustics compensation and also the robustness of the method based on smoothed room response. This work is forming the starting point for further tests, which are described in Task 2.4.

Task 2.4: Speech enhancement for reverberant environments: John Worley has designed an experiment that tests the spatial quality and sound efficacy of a complex smoothed room response filter. The initial stage of the experiment has been completed; it involved building two graphical user interfaces to obtain subjective data on various aspects of spatial quality (source width, envelopment, and room size) and sound quality (phase clarity, spectral balance, loudness, and overall sound quality). The testing will reveal the factors that listeners consider important when assessing the reverberation characteristics of a room. Some work is also in progress on the use of beamforming arrays in speech enhancement and ASR tasks.

Pursuing Task 2.1: Researching the Precedence effect, Worley travelled to Bochum to test subjects on the Franssen illusion within different sized rooms, and with a range of onset transitions. He completed three experiments in Bochum, which show that the traditional illusion breaks down when it is performed within a large hall. The preliminary conclusion for this work is that for the precedence effect to work, the secondary signal in the Franssen illusion must not be active until the listener has received the reflections within the room. Therefore, congruent with the ‘plausibility hypothesis’ the secondary signal will be perceived as a reflection and the illusion will operate.

A.2 Joint Publications and Patents

Publications

IDIAP and USFD

Andrew C. Morris, Viktoria Maier and Phil Green, “From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition”, in International Conference on Spoken Language Processing (ICSLP), Jeju Island, Korea, 2004. In press.

HUT and USFD

Palomäki, K., Brown, G., and Barker, J., “Techniques For Handling Convolutional Distortion With ‘Missing Data’ Automatic Speech Recognition”, Speech Communication, Vol. 43, no. 1-2, pp. 123-142, 2004.

Palomäki, K., Brown, G., and Wang, D., “A Binaural Processor for Missing Data Speech Recognition in the Presence of Noise and Small-Room Reverberation”, Speech Communication, 2004. In press.

HUT and Bochum

Merimaa, J. & Pulkki, V.: Perceptually-Based Processing of Directional Room Responses for Multichannel Loudspeaker Reproduction, Proc. IEEE WASPAA, New Paltz, NY, USA, 2003, pp. 51-54.

Pulkki, V., Merimaa, J. & Lokki, T.: Multi-Channel Reproduction of Measured Room Responses, 18th International Congress on Acoustics, Kyoto, Japan, 2004, pp. II 1273-1276.

Pulkki, V., Merimaa, J. & Lokki, T.: Reproduction of Reverberation with Spatial Impulse Response Rendering, AES 116th Convention, Berlin, Germany, 2004, Preprint 6057.

Merimaa, J. & Pulkki, V.: Spatial Impulse Response Rendering, 7th International Conference on Digital Audio Effects (DAFx'04), Naples, Italy, 2004. Invited paper.

Patents

HUT and Bochum

Pulkki, V., Merimaa, J. & Lokki, T.: A Method for Reproducing Natural or Modified Spatial Impression in Multichannel Listening. International patent application, filed March 2004.

Part B - Comparison with the Joint Programme of Work

B.1 Research Objectives

The research objectives, as set down in Annex I of the contract, are still relevant and achievable. There are inevitable shifts in perspective and emphasis, to reflect scientific progress and the expertise and interests of the young researchers we have recruited.

B.2 Research Method and Work Plan

The HOARSE tasks are as follows:

| Theme | Task | Lead Partner | Other Partners |
|---|---|---|---|
| 1. Auditory Scene Analysis | 1.1 Neural Oscillators for Auditory Scene Analysis | Sheffield | |
| | 1.2 Modelling grouping integration by multisource decoding | Sheffield | Liverpool |
| | 1.3 Active/Passive speech perception | Liverpool | Sheffield, HUT |
| | 1.4 Envelope information and binaural processing | Liverpool | Bochum, Patras |
| | 1.5 Auditory Scene Analysis in Music (note 1) | Sheffield | |
| 2. Dealing with Reverberant Conditions | 2.1 Researching the precedence effect | Bochum | Patras |
| | 2.2 Reliability of auditory cues in multi-source scenarios | Bochum | Liverpool, Sheffield |
| | 2.3 Perceptual models of room reverberation with application to speech recognition | Patras | Bochum |
| | 2.4 Speech enhancement for reverberant environments | Patras | |
| 3. Speech Production Modelling | 3.1 Glottal excitation estimation | HUT | |
| | 3.2 Voice production studies | HUT | Sheffield |
| | 3.3 Voice production and cortical speech processing | HUT | IDIAP, Bochum |
| 4. Automatic Speech Recognition Methodologies | 4.1 Developments in MultiSource Decoding | Sheffield | IDIAP, HUT |
| | 4.2 Informing Speech Recognition | Liverpool | DCAG, IDIAP |
| | 4.3 Advanced ASR Algorithms | IDIAP | Sheffield |
| 5. Speech Technology Applications | 5.1 Speech recognition evaluation in multi-speaker conditions | DCAG | IDIAP, Sheffield |
| | 5.2 Signal and speech detection in sound mixtures | IDIAP | |
| | 5.3 Speech technology assessment by simulated acoustic environments | Bochum | IDIAP |

Note 1: This task was added at the end of year 1.

We can identify two additions to our research methodology:

For studies of auditory scene analysis and dealing with multiple speakers, there is increasing emphasis on the collection and analysis of ‘meetings data’. This work is supported by the FP5 EC project M4 (www.m4project.org) and will soon be augmented by the FP6 Integrated Project AMI (www.amiproject.org). IDIAP and USFD are partners in these projects and the data they are collecting is being made available to HOARSE researchers. A demonstration of the AMI meetings recording facility was part of the September 2003 HOARSE workshop.

In the speech production modelling work led by HUT, MR imaging, which was not mentioned in the contract as a research method, has now been shown to be a powerful tool for obtaining knowledge about voice production.

B.3 Schedule and milestones

Note that here we are reporting on the work of the HOARSE teams, rather than the work of the young researchers alone.

Task Lead Partner

12 Month Milestone 24 Month Milestone

Comments

1.1 Neural Oscillators for Auditory Scene Analysis

USFD Multiple F0s using harmonic cancellation.Initial implementation of binaural grouping

F0 tracking using continuity constraints

Multiple F0 work published [Wu, Wang & Brown 03]

1.2 Modelling grouping

USFD incorporation of noise estimation into

mask-level integration.

. Multisource decoding theory journal article

Page 11: 3spandh.dcs.shef.ac.uk/projects/hoarse/project_only/hoarse_mtr.doc · Web viewAt least one young researcher is now in post at each HOARSE lab. Here are the highlights of their work

integration by multisource decoding

oscillator-based grouping

published in Speech Communication

1.3 Active/Passive speech perception

Liverpool Planning experiments

Experiments conducted

Experiments conducted

1.4 Envelope information and binaural processing

Liverpool Preliminary experiments

Experiments and analysis

Experiments conducted

1.5 Auditory Scene Analysis in Music

USFD F0 estimation Development of a two-stage (lower and cognitive) precedence effect model

Second system completed

2.1 Researching the precedence effect

RUB Psychoacoustic experiments on the precedence effect in realistic scenarios.

Development of a localisation model using automatic weighting function for binaural cues

Model completed [Faller & Merimaa 04]. Further psychoacoustical experiments being conducted.Some work at Patras on the relationship of Precedence effect and the Franssen illusion in conjunction with Bochum

2.2 Reliability of auditory cues in multi-source scenarios

RUB The importance of single binaural cues in various multisource environments determined in psychoacoustic experiments

Extension to multiple sources and practical room conditions

Completed [Braasch 03], Braasch et al 03], [Braasch & Blauert 03]. Research at RUB extended to spatial impression and separation of binaural cues to source and room related.

2.3 Perceptual models of room reverberation with application to speech recognition

Patras integrated response/ signal perceptual model for single source in reverberant environments.

Extension for multiple sources

Significant part of the work completed

2.4 Speech enhancement for reverberant environments

Patras Research into auto-directive arrays, controlled from the perceptual directivity module

Development of new parameterisation techniques for the voice source

Some work completed (test interfaces ready) to be supplemented by subjective tests.Missing data techniques for handling reverb developed at Sheffield

3.1 Glottal excitation estimation

HUT Research on combining new AR (Auto Regressive)

Inverse filtering experiments on intensity regulation

On schedule

Page 12: 3spandh.dcs.shef.ac.uk/projects/hoarse/project_only/hoarse_mtr.doc · Web viewAt least one young researcher is now in post at each HOARSE lab. Here are the highlights of their work

models to inverse filtering

of speech with soft and extremely loud voices

3.2 Voice production studies

HUT Inverse filtering experiments on high-pitched voices

Research on the relationship between the main effects of the glottal flow (fundamental frequency, phonation type etc.) and brain functions using MEG.

On schedule

3.3 Voice production and cortical speech processing

HUT Development of DSP algorithms for parameterisation of the voice source, getting familiar with MEG

.Ongoing

4.1 Developments in MultiSource Decoding

USFD Probabilistic decoding contraints

Design of predictive noise estimation algorithms. Known BSS algorithms adopted as a common base for evaluation

Probabilistic decoding implemented in current software. Adaptive noise estimation implemented in multisource models

4.2 Informing Speech Recognition

Liverpool Design of predictive noise estimation algorithms. Known BSS algorithms adopted as a common base for evaluation

HMM2 & DBM adaptation

Finalising intelligibility model for different noise types. Also work at DCAG and IDIAP

4.3 Advanced ASR Algorithms

IDIAP Multistream adaptation

Assessment report 1. Targets for assessment report 2

Work reported on this task in Eurospeech 03, IEEE ASRU 03

5.1 Speech recognition evaluation in multi-speaker conditions

DCAG Database specification. Targets for assessment report 1

First recognition test in multi-speaker environment using separation algorithms (BSS and beamforming).

5.2 Signal and speech detection in sound mixtures

IDIAP Analysis of auditory cues

ASR performance for simulated deteriorated speech tested

Work reported: Ajmera et al 2003, Lathoud et al 2003

5.3 Speech technology assessment by simulated acoustic environments

RUB Simulation environment for hands-free communication developed

Completed and integrated into IKA telephone line simulation tool. ASR, speaker recognition, and speech synthesis assessment experiments carried out.

B4. Research effort of the Participants

Participant    Young researchers financed    Researchers financed from     Researchers contributing to
               by the contract               other sources                 the project
               (person-months)               (person-months)               (number of individuals)
1. USFD        24                            48                            1 YR + 5 others = 6
2. RUB         27                            24                            2 YRs + 2 others = 4
3. DCAG        24                            24                            1 YR + 2 others = 3
4. HUT         18                            12                            1 YR + 2 others = 3
5. IDIAP       23.5                          24                            1 YR + 3 others = 4
6. LIVERPOOL   19                            25                            1 YR + 2 others = 3
7. PATRAS      7                             6                             1 YR + 1 other = 2
Totals         142.5                         163                           8 YR + 17 other = 24

B.5 Organisation and Management

B5.1 Organisation and management

HOARSE is being managed in the way described in Annex 1 of the contract. The non-executive director is Dr. Jordan Cohen of VoiceSignal Inc., Boston, MA. Administration is being handled from USFD by Gillian Callaghan ([email protected]).

B5.2 Communication Strategy

Most communication within HOARSE is conducted electronically. The HOARSE web site is www.hoarsenet.org. Meeting records and so on are on password-protected pages on that site. The email address for the whole network is [email protected].

B5.3 Network Meetings

Our pattern is to hold a HOARSE workshop every 6 months. Most of the time is taken up by research updates: all young researchers make a presentation and we also have update talks from academics where appropriate. There is much discussion. The meeting begins with a report from the coordinator and ends with a session planning activities for the next 6 months. Prior to this, the non-executive director has an opportunity to provide feedback on the progress of the network. The workshops are scheduled for 2 days and the steering committee meets at some point in this time period. So far the following workshops have been held:

Kick-off workshop, hosted by RUB, 4-5 September 2002. All partners present, 9 delegates

2nd Workshop, hosted by HUT, 21-22 January 2003. All partners present except IDIAP, 13 delegates.

3rd Workshop, hosted by IDIAP, 5-6 September 2003. All partners present except Patras: 14 delegates

4th Workshop, hosted by USFD, 20-21 February 2004. All partners present except Patras and HUT, 14 delegates.

5th Workshop and Mid Term Review, to be hosted by DCAG, 28-30 Oct 2004

The non-executive director was present at the first 4 workshops. He is treated as an external expert for funding purposes. Dr. Stefan Launer of Phonak (a Swiss-based hearing aid company) attended the 3rd workshop as an external expert.

B.6 Cohesion with Less Favoured Regions and Associated States

The coordinator and 2 other partners (Liverpool and Patras) are in less-favoured regions. We have recently taken on a young researcher from a former candidate state (Czech Republic).

B.7 Connections to Industry

HOARSE has a full industrial partner in DCAG, and in future years we anticipate young researchers spending time there to learn about the priorities and strengths of industrial research.

PART C - TRAINING

C.1 Appointment of Young Researchers

Columns (a), (b), (a+b): contract deliverable of Young Researchers to be financed by the contract (person-months)
Columns (c), (d), (c+d): Young Researchers financed by the contract so far (person-months)

Participant    Pre-doc (a)   Post-doc (b)  Total (a+b)   Pre-doc (c)   Post-doc (d)  Total (c+d)
1. USFD        18            18            36            24            0             24
2. RUB         18            18            36            18            9             27
3. DCAG        18            18            36            24            0             24
4. HUT         18            18            36            18            0             18
5. IDIAP       18            18            36            23.5          0             23.5
6. LIVERPOOL   18            18            36            8             11            19
7. PATRAS      18            18            36            0             7             7
Totals         126           126           252           115.5         27            142.5

HOARSE opportunities have been publicised by means of:

Email lists such as those maintained by ELSNET (European Language and Speech Network), ISCA (International Speech Communication Association) and SALT (UK Speech and Language Technology)

The IHP network vacancies site

The HOARSE web site

Though we have not been overwhelmed with applications, there has been a steady stream of high quality. We are not recruiting at the moment, though there may be some further opportunities later.

Forecasting the distribution of doctoral and post-doctoral researchers has proved difficult, as our experience with previous training networks would predict. We have recruited more doctoral and fewer post-doctoral researchers than we anticipated, but we have not turned down any quality post-doctoral applications. It seems that RTN opportunities are more attractive for doctoral researchers in this field: post-doctoral researchers are in considerable demand.

C2. Training Programme

C2.1 Networking

HOARSE policy is that each young researcher should spend at least a week with each network partner.

Visits in the reporting period by members of the teams were as follows:

Viktoria Maier from IDIAP to Sheffield, February 04

Eva Bjorkner from HUT to Sheffield, March 04

John Worley from Patras to Bochum

Juha Merimaa from Bochum to Patras, May 04

Juha Merimaa of RUB to HUT, December 2003

The following visits are planned in the 3rd year:

Jana Eggink from Sheffield to HUT

Julien Bourgeois from DCAG to IDIAP


C2.3 Integration

We feel we have created an informal atmosphere in which young researchers can readily integrate with more experienced researchers and with their peers. It is difficult to be precise about how we have done this, but the quality of the interactions at workshops and the value of the discussions have been high. A young post-doctoral researcher comments:

"Having a mix of Ph.D and post-doctoral researchers, in addition to the more senior members of the group gives a synergistic effect. Since, the Ph.D students can learn from the direct contact with recent post-docs and the post-doc gains experience of advisement and discussion in an informal environment."

Our policy to encourage integration into the network is for each young researcher to have a supervisor in the host lab and an advisor in a different lab, usually but not necessarily in the network. These arrangements are as follows:

Table 4: Supervisors and Advisors

Young Researcher    Supervisor                  Advisor
Jana Eggink         Guy Brown, USFD             Georg Meyer, Liverpool
Juha Merimaa        Jens Blauert, RUB           Matti Karjalainen, HUT
John Worley         Jon Mourjopoulos, Patras    Jens Blauert, Bochum
Julien Bourgeois    Klaus Linhard, DCAG         Ian McCowan, IDIAP
Eva Bjorkner        Paavo Alku, HUT             Johan Sundberg, KTH, Sweden
Guillaume Lathoud   Herve Bourlard, IDIAP       Klaus Linhard, DCAG
Elvira Perez        Georg Meyer, Liverpool      Martin Cooke, USFD
Patti Adank         Georg Meyer, Liverpool      Guy Brown, USFD
Viktoria Maier      Hynek Hermansky, IDIAP      Martin Cooke, USFD
Petr Svojanovsky    Hynek Hermansky, IDIAP      Roger Moore, USFD

C2.4 Training Measures.

At the IDIAP workshop there was a training session on the ‘smart meeting room’ facility.

Many Universities provide complementary skills programmes for researchers, and HOARSE students are encouraged to take advantage of these. An example is the Research Training Programme at USFD, which Jana Eggink has completed. There is a similar programme at Liverpool. At RUB, John Worley has taken a German language course.

C2.5 Equal Opportunities


We have taken no special equal opportunities measures, but 50% of the young researchers HOARSE has recruited are female.

C2.6 Multidisciplinarity

In HOARSE, multidisciplinarity is so central to the research that young researchers receive training across discipline boundaries every day. We have recruited from a variety of backgrounds: mathematics, phonetics, linguistics and music for instance. Much of our work involves a combination of experimental work, perhaps with human listeners, and computational or mathematical modelling.

C2.7 Industrial Training

Covered in B7 above

PART D - SKETCHES OF THE YOUNG RESEARCHERS

D.1 For each of the young researchers who will present their experiences at the Mid-Term Review Meeting, provide a maximum 25 line description of the young researcher’s scientific background, of their responsibilities in the Network and of their experiences (positive and negative) to date. The young researchers should write these sketches themselves.

Eva Bjorkner

Nationality: Swedish
Age when started: 31
Contract: 1 March 2003 - 1 August 2006
PhD student, conducting research on the singing voice
Place of work: HUT, Finland
No earlier funding from Networks

‘Before I started in the HOARSE project I worked as a singing teacher, which is my profession. During 1998-2000 I also did some voice research together with Professor Johan Sundberg at the Department of Speech, Music & Hearing at the Royal Institute of Technology, Stockholm, Sweden. The HOARSE network has given me the opportunity to evolve and deepen my knowledge about voice production and parameterization, both in speech and in singing. Knowledge about voice production and high subglottal pressures is of great interest and importance today, when many professions and singing styles tend to use such pressures, which in turn is known to jeopardize vocal health. The Normalized Amplitude Quotient, developed at HUT by Professor Paavo Alku et al, has been shown to be a valuable tool in this process and I am studying its variation with subglottal pressure and several other flow glottogram parameters. My experience of HOARSE has so far been only positive. Thank you.’


Julien Bourgeois

Nationality: French
Age at time of appointment: 25
Start and likely end date of appointment: 01.09.2002 - 31.08.2005
Category of researcher: PhD student
Scientific speciality: Signal Processing
Place of work: Ulm
Country of work: Germany

‘Scientific Background: Five years' study of Electrical Engineering (majoring in Telecommunications). Parallel to that: two years' study of Mathematics at Univ. Marne-La-Vallee, including one year of Erasmus at Univ. of Ulm, Germany.

Responsibilities in the Network: Task 4.2: Investigation of relationships between Blind and Informed (Beamforming) Source Separation Methods in multi-speaker environments. Task 5.1: Evaluation of these techniques with speech recognition.

Experience: I find the Network useful for establishing contacts with external research labs and obtaining input from them for my work.’

John Worley

‘I graduated from the University of Essex in 1998 with a BSc (Hons) in Psychology. I then proceeded to the University of Sussex to study psychoacoustics under the supervision of Christopher Darwin. In 2003 I was awarded a DPhil in Experimental Psychology. My doctoral thesis investigated the role of median vertical plane location in the sequential segregation of attended speech. To achieve my investigative aims I presented speech in both an anechoic and a virtual environment, using individualised and non-individualised head-related transfer functions (HRTFs). My post-doc experience has involved working at the Ruhr University, Bochum (Germany) and presently at Patras University (Greece). Whilst at the Ruhr University I was responsible for the anechoic chamber laboratory. Being responsible for the laboratory taught me the importance of time management and gave me a wide range of experience in acoustical testing and measurement. At Patras my responsibility toward the laboratory is to advise on and instigate psychoacoustic experiments. Currently I am involved in investigating auditory perception within rooms. For me, the HOARSE network has provided the opportunity to work in laboratories different from those I would experience within a psychology department in the UK. Being a psychologist working with audio engineers is fulfilling for both parties, since the engineer learns how to make subjective tests and the psychologist can get help with the creation or analysis of the stimuli.’

Petr Svojanovsky

Name: Petr Svojanovsky
Nationality: Czech Republic
Age: 25
Start date: 1 January 2004
End date: 31 December 2004
Category of researcher: pre-doc undertaking PhD studies
Scientific speciality: Speech recognition
Place of work: IDIAP, Martigny
Country of work: Switzerland

‘EDUCATION: Master degree programme at Faculty of Electrical Engineering and Communication, Brno University of Technology (1998 - 2003)

EXPERIENCES: IDIAP Research Institute, Martigny, Switzerland (Jan. 2004 – Sep. 2004): PhD in speech recognition. Work on the TRAPs (TempoRAl Patterns) ASR system, where the first stage is frequency independent; application of different clustering techniques to TRAPs-based features; analysis of neural network outputs. ASR experiments with a new database of nonsense syllables. Supervised by Prof. Hynek Hermansky.’

Viktoria Maier

Nationality: German
Age: 25
Start date: 23 August 03
End date: 31 November 04
Category of researcher: pre-doc undertaking PhD studies
Scientific speciality: Speech recognition
Place of work: IDIAP, Martigny
Country of work: Switzerland
Previously studied at one of the network partners: University of Sheffield, Sept 99 – June 03; IDIAP, Feb 03 – June 03

‘Education

European Masters in Language and Speech, 2003; IDIAP

Master in Computer Science, University of Sheffield; 1999 – 2003

Relevant Research Experience IDIAP Research Institute Martigny, Switzerland; Sep. 2003 – Aug. 2004: PhD in speech recognition;

Experiments on sound classification: perception of synthetic consonant-vowel stimuli. Possible suggestions for ASR state-of-the-art systems. Speech Modelling: how do current state-of-the-art systems fit the speech data with focus on phoneme duration.

Supervised by Prof. Hynek Hermansky

IDIAP Research Institute Martigny, Switzerland; Feb – Jun. 2003 (European Masters in Language and Speech):

Research on the problem of combining the acoustic model with the language model for Automated Speech Recognition, i.e. deriving the weighting factor a (that is currently used in many ASR implementations) theoretically rather than experimentally, and varying from input to input. Supervised by Prof. Hervé Bourlard

University of Sheffield Oct. 2001 – May 2002: Dissertation in the field of Speech and Hearing.

Title: Evaluating RIL (Relative Information Loss) as basis for evaluating Automated Speech Recognition devices and the consequences of using probabilistic String Edit Distance as input. Supervised by Prof. Phil Green.

Daimler-Chrysler AG Summer 2002 Speech and hearing research department: Analyzing and evaluating current evaluation tools, such as NIST and SCTK. This work included evaluating AURORA tests with the tools mentioned above. Directed by Fritz Claas

Network responsibilities and experiences

Responsibilities : Knowledge of human speech perception applied to Automated Speech recognition as outlined in the current project description.

Experiences: The network is useful for extending available resources. When our perception experiment was conducted, we were able to use the resources and experience of two other network partners, Liverpool and Sheffield. Through the regular meetings, with presentations by all the members of the network, a sensitivity to other areas of research, including their problems, is created.’

Jana Eggink

Nationality: German
Age at time of appointment: 29
Start and likely end date of appointment: start 01.09.2002, end 15.01.2005
Category of researcher: pre-doc
Scientific speciality: information extraction from musical audio
Place of work: USFD
Country of work: UK
Previous work at a network partner: USFD under SPHEAR, 15.01.2001 – 31.08.2001

‘My first degree is a Master of Arts in musicology, with computer science as a second major subject. This subject combination led to an early interest in auditory perception and computational auditory scene analysis, and the differences and similarities encountered when dealing with musical audio as opposed to speech. In the field of musical audio analysis I am concentrating on F0 estimation and instrument recognition in polyphonic, multi-instrumental music. Some approaches developed in the context of speaker identification and robust speech recognition, for example missing data techniques, have proven to be useful for music analysis as well. In other cases new approaches were needed to take into account the harmonic relationship between concurrent tones in music. Within the HOARSE network my work falls within task 1.5, auditory scene analysis in music. Additionally, I am responsible for the set-up and maintenance of the HOARSE web pages. One of the most positive experiences in the network for me is the frequent meetings with the other researchers in the network who work on related but not identical tasks. This helps to get new ideas, and to become aware of research that one might otherwise miss. It also results in personal contacts and knowledge about the research situations in other European countries, which I think is essential for further career development. The generous travel funding which allows for regular conference attendance is also very helpful. Close collaborations, possibly resulting in joint publications, are in my opinion hard to achieve and maybe not really suitable within the context of a (British) PhD, which is supposed to be a coherent work on one subject. This objective might be more suitable for post-doctoral/senior researchers who can be more flexible within their research.’

Guillaume Lathoud

Nationality: French
Age at time of appointment: 26
Start: March 2003
End: March 2006
Category: pre-doc (PhD student)
Scientific speciality: microphone array research
Place of work: IDIAP Research Institute
Country of work: Switzerland

‘Guillaume Lathoud's background is an M.Sc. in Telecommunications and Computer Science at INT (France), followed by two and a half years as a guest researcher in the Digital Television team at NIST (USA). Guillaume joined the IDIAP Research Institute as a PhD student in March 2002. He has focused on microphone array-based research, in various contexts. Below is a list of topics he has investigated, most recent first:

- sector-based multiple speakers detection and localization.
- AV16.3: an audiovisual corpus for speaker localization and tracking.
- short-term speaker tracking for joint segmentation and tracking.
- audiovisual speaker tracking.
- meeting segmentation in terms of group events.
- location-based speaker segmentation.

All topics have provided Guillaume with positive experience, including new directions of research as well as handling of relatively large amounts of multichannel data. Guillaume is currently finalizing his microphone array-based study by working on multiple speakers detection and localization.’

Elvira Perez

Nationality: Spanish
Age: 27
Start and likely end date of appointment: Feb 03 – Feb 06
Category of researcher: Pre-doc undertaking PhD studies
Scientific speciality: Psychoacoustics/Speech Perception
Place of work: School of Psychology, University of Liverpool
Country of work: England

‘Researcher's scientific background: I obtained my Psychology degree at the University of Granada (Spain) in 2000. I continued with the doctorate program in the Department of Experimental Psychology and Physiology of Behavior at the same university, carrying out MPhil research on syllabic units, stress and speech errors. The final goal of this project was to develop a model of language production that could account for the cross-linguistic differences we found in speech errors. After two years of doctoral studies I moved to the School of Psychology, University of Liverpool. In this department I started studying psychoacoustics, shifting the topic of my thesis from language production to speech perception. In the network I am studying which strategies human listeners use to segregate environmental noise from the attended signal or speech. In order to do that I am conducting experiments where I compare different background noises and how they affect speech intelligibility. The final goal is to understand auditory perceptual organization.

This change of perspective has enriched my knowledge as a researcher, and also provided me with the flexibility needed to jump from one area to another. Probably this is one of the essences of a multidisciplinary network such as HOARSE, and the reason why from the very beginning I felt comfortable and enthusiastic about working with researchers coming from very different backgrounds. Until now I have been in touch with Sheffield, especially Martin Cooke (my external advisor), and in the coming months I have proposed to stay in the Helsinki lab for 3 or 4 months to conduct an experiment using their MEG technology.

The possibilities of staying in other labs are great; the only inconvenience I have noticed is that it is not always easy to fit your line of research into another partner’s lab, and once that is done, practical reasons delay the proposal.’


PART E - NETWORK FINANCING

E.1 Compare, in tabular form, the expenditure to date of each Network partner (an estimate will be sufficient) with the allowable costs foreseen in the table following the signatures in the contract. Also estimate a breakdown of the total expenditure to date by the Network into the cost categories A, B and C. Explain any substantial differences from the rates of spending originally foreseen.

Costs based on estimated expenditure (to date)

                                                                                     Distribution of expenditure by category
Participant  Principal   Type of        Rate of        Estimated      Maximum        Costs       Overhead  Personnel  Total
             Contractor  financial      financial      eligible cost  Community      linked to   Cost      Cost       Expenditure
             No          participation  participation  in euro        contribution   Networking
                                                                      in euro
USFD         1           AC             100            209,195        209,195        13,167      17,177    80,716     111,060
RUB          2           AC             100            224,400        224,400        11,973      23,508    105,566    141,047
DCAG         3           AC             100            180,000        180,000        5,746       13,029    59,400     78,175
HUT          4           AC             100            180,000        180,000        7,279       12,855    56,996     77,130
IDIAP        5           AC             0              235,600        0              8,768       9,194     45,794     63,755
UNILIV       6           AC             100            207,000        207,000        7,075       14,754    66,694     88,524
PATRAS       7           AC             100            163,900        163,900        6,978       5,862     22,380     35,219
Totals                                                                               60,985      96,378    437,547    594,909