Creating a Voice for Festival

Presentation by Matthew Hood

Supervisors: S. Bangay, A. Lobb

Voice: cmu_uk_rab_diphone

Presentation Overview

About the project

Festival

About Text to Speech: 3 layer approach; Waveform Generation; Languages, phones and diphones

Making a voice: Recording Diphones; Labelling

Results

About the Project

Text to speech programs have been around for many years without much excitement.

Many new applications have arisen, sparking new interest.

One of the factors limiting its usefulness is the limited number of voices (fewer than 10?)

Creating a voice is a long, tedious process. But a greater problem is the lack of documentation.

This project aims to give a comprehensive overview of how to make a voice in Festival, pointing out all the pitfalls ahead of time.

Festival

Festival is an open source TTS system developed at the University of Edinburgh in the late 90s.

“It offers a free, portable, language independent, run-time speech synthesis engine for various platforms under various APIs.” [Black et al]

Supported by the FestVox toolkit. Documented in “Building Synthetic Voices” [Black et al]

General Text to Speech

Text Analysis: words and utterances identified.

Linguistic Analysis: words analysed in context and pronunciation generated, e.g. 1990.

Waveform Generation: utterances turned into sound and the words “spoken”. Due to abstraction from the previous layers, this is the only layer where the voice is used.

Waveform Generation

Festival is a concatenative synthesis system. This means sound clips are joined together to generate speech, e.g. talking clocks.

Recorded Sound set

“The time is”; “past”; “o’clock”; numbers etc.

Generated Output

“The time is” – “half” – “past” – “three”.

Voice: cmu_us_kal_diphone
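The talking clock idea can be sketched in a few lines of Python. This is an illustration only, not Festival code: the clip table and the `say_time` helper are hypothetical, and plain strings stand in for the recorded audio.

```python
# Hypothetical recorded sound set: each phrase was recorded once.
# Strings stand in for the audio clips themselves.
RECORDED = {
    "the time is": "<clip: the time is>",
    "half": "<clip: half>",
    "past": "<clip: past>",
    "o'clock": "<clip: o'clock>",
    "three": "<clip: three>",
    "four": "<clip: four>",
}


def say_time(hour_word, half_past=False):
    """Return the ordered list of clips to play for a given time."""
    if half_past:
        phrases = ["the time is", "half", "past", hour_word]
    else:
        phrases = ["the time is", hour_word, "o'clock"]
    # Concatenative synthesis: look up each recorded unit and join in order.
    return [RECORDED[p] for p in phrases]


print(say_time("three", half_past=True))
```

The limitation is visible in the table: anything that was not recorded as a unit (a KeyError here) simply cannot be said.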

Waveform Generation

For a more general system it is not feasible to record everything that could be said.

Speech needs to be broken down into smaller units.

A phone is a single phonetic sound that is generated by a human when speaking.

eh - get ; feather

s - sit ; mass

zh - vision ; casual

Languages

A language is defined by its phoneme set. A phoneme set is a collection of every phonetic sound used in any word in the language (including silence).

The US English phoneset used in Festival has 44 phones.

BUT it is not enough to record every phone in the phoneset.

Diphones

We do not always pronounce a phone the same way.

Its pronunciation depends on its neighbouring phones. This is known as the co-articulatory effect.

Festival relies on the simplifying assumption that the co-articulatory effect does not extend across more than a pair of phones.

These pairs of phones are known as diphones.

Diphones

By combining recorded diphones, we can now “say” any word in the language.

E.g. Jack - jh-ae-k

__-jh ; jh-ae ; ae-k ; k-__
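The split of a word into diphones can be sketched as follows. This is a minimal illustration rather than Festival's own code; `to_diphones` is a hypothetical helper, and `__` is used for silence as on the slide.

```python
def to_diphones(phones, silence="__"):
    """Pad a phone sequence with silence and return each adjacent pair."""
    padded = [silence] + list(phones) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


# "Jack" is jh-ae-k
print(to_diphones(["jh", "ae", "k"]))
# ['__-jh', 'jh-ae', 'ae-k', 'k-__']
```

A word of n phones always needs n + 1 diphones, because the silence on either side counts as a neighbour.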

Recording Diphones

Because of the co-articulatory effect, it is nearly impossible to pronounce a diphone accurately on its own.

Using made up words is preferable to using real words.

us_006 “pau t aa k aa k aa pau” - “k-aa” “aa-k”

us_603 “pau t aa t ey ah t aa pau” - “ey-ah”
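A quick way to see why these nonsense words work is to check that each target diphone really occurs inside its prompt. The sketch below uses only the two prompts quoted above; `covers` is a hypothetical helper, not part of FestVox.

```python
def covers(prompt, targets):
    """Return True if every target diphone appears as an adjacent phone pair."""
    phones = prompt.split()
    pairs = {f"{a}-{b}" for a, b in zip(phones, phones[1:])}
    return all(t in pairs for t in targets)


print(covers("pau t aa k aa k aa pau", ["k-aa", "aa-k"]))   # True
print(covers("pau t aa t ey ah t aa pau", ["ey-ah"]))       # True
```

In both prompts the target pair sits in the middle of the word, away from the surrounding silence (`pau`).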

Recording Diphones

In theory the number of diphones needed to speak a language is the number of phones squared.

But we don’t actually use every combination.

The standard US diphone list used by Festival contains 1396 diphones.

It is often worth extending this list to take into account strong accents or common foreign words.
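As a quick check of the figures, using only the 44 phones and 1396 diphones quoted in these slides:

```python
phones = 44      # size of the US English phoneset quoted in the slides
listed = 1396    # size of the standard US diphone list quoted in the slides

upper_bound = phones ** 2           # every ordered pair of phones
fraction = listed / upper_bound     # share of pairs actually recorded

print(upper_bound)         # 1936
print(round(fraction, 2))  # 0.72
```

So the standard list covers roughly 72% of the theoretical maximum, which leaves about 540 pairs that are never recorded.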

Recording Diphones

Because pronouncing the words can be a bit tricky, especially the first few times you try, FestVox provides a prompting tool.

Recording Environment

The better the recording the better the voice.

With a decent sound card it is possible to record straight onto the PC.

Background noise must be kept to a minimum.

Takes approximately 1.5 hours to record all diphones.

Environment must be repeatable.

Labelling

Labelling is the hardest and one of the most important parts of creating a voice.

A label file consists of a series of boundary times.

Emu label is an open source program that graphically shows where in the wave file the phones are marked.

Part of the Emu Speech Tools available on SourceForge.

Hand Labelling

Displays phones, frequency and waveform.

Sound extracted from mid point of labels.

Worth moving further into the phone when recording eh-__.

Us-0603 “ey- ah”
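The midpoint rule can be sketched as below. This is an assumption-laden illustration, not the FestVox implementation: labels are taken to be (end time in seconds, phone) pairs with each phone starting where the previous one ended, the boundary times are invented, and both helper functions are hypothetical.

```python
def phone_spans(labels, start=0.0):
    """Turn (end_time, phone) labels into (phone, start, end) spans."""
    spans = []
    for end, phone in labels:
        spans.append((phone, start, end))
        start = end
    return spans


def diphone_cut(spans, first, second):
    """Return (cut_start, cut_end): midpoint of `first` to midpoint of the next phone `second`."""
    for (p1, s1, e1), (p2, s2, e2) in zip(spans, spans[1:]):
        if p1 == first and p2 == second:
            return ((s1 + e1) / 2, (s2 + e2) / 2)
    raise ValueError(f"diphone {first}-{second} not found")


# Invented boundary times for "pau t aa t ey ah t aa pau"
labels = [(0.25, "pau"), (0.5, "t"), (0.75, "aa"), (1.0, "t"), (1.5, "ey"),
          (2.0, "ah"), (2.25, "t"), (2.5, "aa"), (3.0, "pau")]

print(diphone_cut(phone_spans(labels), "ey", "ah"))  # (1.25, 1.75)
```

Because the cut points are derived from the boundaries, a misplaced label shifts the midpoint and therefore changes which part of the sound ends up in the diphone.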

Auto Labelling - results

FestVox provides an auto labeller.

1.6% failure rate.

8 to 15% error rate.

70% usable diphones (400+ hand corrections).

Auto labeller

Test, test and retest.

Created splittest.pl

Hand label any problem phones.

Remove DB markers.

Finishing voice

Once happy with labels:

Optional pitchmark extraction.

Volume levelling.

Load the voice into Festival and test with actual speech.

Build final voice database.

Create symbolic link.

What I have learnt & achieved

Learnt a lot about speech and speech synthesis.

Learnt a lot about Linux and sound editing.

Created a number of variations of ru_us_matt_diphone, used to test different labelling methods, how recordings affect results etc.

Final paper giving step by step guide and helpful hints.

There is much room for future work, including voice adaptation.

Am sick of the sound of my own voice.

Voice: ru_us_matt_diphone
