Speech Synthesis
Speech synthesis
Speech synthesis is the artificial production of human speech. The computer or instrument used for this purpose is called a speech synthesizer. Text-to-speech (TTS) synthesis is the production of speech from normal language text.
[Figure: simple text-to-speech synthesis. Input text → text and linguistic analysis → phonetic levels → prosody and speech generation → synthesized speech]
Stephen Hawking
• Had the motor neuron disease ALS and lost his ability to speak.
• His first computer-based speech system was provided by Intel®.
• The main interface program is EZ Keys, written by Words Plus Inc.
• The cursor is controlled by cheek movements, detected by an IR sensor mounted on his spectacles.
• The words formed are sent to a hardware speech synthesizer made by Speech+.
• Speech synthesizer voice output can also be stored.
• Current configuration:
– Lenovo ThinkPad X220 tablet (2 copies)
– Intel® Core™ i7-2620M CPU @ 2.7 GHz
– Intel® 150Gb Solid-State Drive 520 Series
– Windows 7
– Speech synthesizers (3 copies), manufacturer: Speech+, CA
History of speech synthesizers
• The first device considered a speech synthesizer was the VODER, introduced by Homer Dudley at the New York World’s Fair in 1939.
• The first formant synthesizer, PAT (Parametric Artificial Talker), was introduced by Walter Lawrence in 1953.
Architecture of TTS systems
[Figure: block diagram of a TTS system. Labels recovered from the diagram:
Text-to-phoneme module: text input; normalization (abbreviation lexicon); text in orthographic form; grapheme-to-phoneme conversion (exceptions lexicon, orthographic rules); phoneme string; prosodic modelling (grammar rules, prosodic model); phoneme string + prosodic annotation.
Phoneme-to-speech module: acoustic synthesis (various methods); synthetic speech output.]
Challenges in speech synthesis
• Text-to-phoneme conversion: the conversion of input text into a linguistic representation, also called grapheme-to-phoneme conversion.
• Text processing: digits, numerals, fractions, dates and abbreviations are expanded into full words.
• Pronunciation: the next task is to find the correct pronunciation. Homographs must be pronounced correctly (for example, "lead" the metal versus "lead" the verb).
• Prosody: finding the correct intonation, stress and duration for written text.
Text normalization
• Text processing: digits, numerals, fractions, dates and abbreviations are expanded into full words.
• Examples:
– 1750 would be expanded as "seventeen fifty" (if a year) or "one thousand seven hundred and fifty" (if a measure).
– 5/13 would be expanded as "five thirteenths" (if a fraction) or "May thirteenth" (if a date).
• Numbers are especially difficult, e.g. 233 4488 (possibly a telephone number, read digit by digit).
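The year-versus-measure example can be sketched in a few lines of Python. This is only an illustration: the function names (`two_digits`, `cardinal`, `year`, `expand_number`) are hypothetical, the caller is assumed to already know whether the token is a year or a measure, and only values from 0 to 9999 are handled.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def two_digits(n):
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def cardinal(n):
    """Spell out 0..9999 as a quantity, with a British-style 'and'."""
    if n < 100:
        return two_digits(n)
    thousands, rest = divmod(n, 1000)
    hundreds, low = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    text = " ".join(parts)
    if low:
        text += " and " + two_digits(low)
    return text

def year(n):
    """Spell out a four-digit year as two pairs of digits."""
    if n % 1000 == 0:
        return cardinal(n)            # e.g. 2000 is read as a quantity
    high, low = divmod(n, 100)
    if low == 0:
        return two_digits(high) + " hundred"
    if low < 10:
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)

def expand_number(token, kind):
    """Expand a digit string; kind is 'year' or 'measure' (assumed known)."""
    n = int(token)
    return year(n) if kind == "year" else cardinal(n)

print(expand_number("1750", "year"))
print(expand_number("1750", "measure"))
```

In a real system the hard part is deciding `kind` from context, which this sketch leaves to the caller.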
• Any text that has a special pronunciation is stored in a lexicon:
– Abbreviations (Mr, Dr, St)
– Acronyms (UN is spelled out, UNESCO is read as a word)
– Special symbols (&, %)
– Particular conventions (£5, $5 million, 12°C)
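A lexicon of this kind can be sketched as a dictionary look-up plus a few patterns for conventions such as currency and temperature. The entries, the `SPELLED_ACRONYMS` set and the function names below are illustrative assumptions, not part of any particular TTS product. "St" is left out on purpose because it is ambiguous (Saint or Street) and needs context.

```python
import re

# Hypothetical lexicon entries for tokens with a special pronunciation.
LEXICON = {"Mr": "mister", "Dr": "doctor", "&": "and", "%": "percent"}

# Acronyms read letter by letter; anything else (e.g. UNESCO) is read as a word.
SPELLED_ACRONYMS = {"UN"}

def expand_token(token):
    """Expand one whitespace-separated token, or return it unchanged."""
    bare = token.rstrip(".")
    if bare in LEXICON:
        return LEXICON[bare]
    if bare in SPELLED_ACRONYMS:
        return " ".join(bare)                 # "UN" -> "U N"
    match = re.fullmatch(r"£(\d+)", token)
    if match:
        unit = "pound" if match.group(1) == "1" else "pounds"
        return match.group(1) + " " + unit
    match = re.fullmatch(r"(\d+)°C", token)
    if match:
        return match.group(1) + " degrees Celsius"
    return token

def normalize(text):
    """Expand every token in a sentence."""
    return " ".join(expand_token(token) for token in text.split())

print(normalize("Dr Smith & Mr Jones paid £5"))
```

The digits left in the output would then go through number expansion as a separate step.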
Grapheme-to-phoneme conversion
• It is the conversion of input text into a linguistic (phonemic) representation.
• English spelling is complex but largely regular; other languages are more or less so.
• Gross exceptions must be in the lexicon.
• Lexicon features:
– Look-up should be quick.
– Rules are needed anyway for unknown words.
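The lexicon-plus-rules idea can be sketched as follows: look the word up in an exceptions lexicon first, and fall back to letter-to-sound rules for unknown words. The rule table is a toy, the ARPAbet-style phoneme symbols are chosen only for illustration, and `to_phonemes` is a hypothetical name; real English rules are far more context-sensitive.

```python
# Exceptions lexicon: words the rules below would get wrong.
EXCEPTIONS = {
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
}

# Toy letter-to-sound rules; multi-letter graphemes come before single letters
# so that the first matching rule is also the longest one.
RULES = [
    ("sh", ["SH"]), ("ch", ["CH"]), ("th", ["TH"]),
    ("ee", ["IY"]), ("oo", ["UW"]),
    ("a", ["AE"]), ("b", ["B"]), ("c", ["K"]), ("d", ["D"]), ("e", ["EH"]),
    ("f", ["F"]), ("g", ["G"]), ("h", ["HH"]), ("i", ["IH"]), ("j", ["JH"]),
    ("k", ["K"]), ("l", ["L"]), ("m", ["M"]), ("n", ["N"]), ("o", ["AA"]),
    ("p", ["P"]), ("q", ["K"]), ("r", ["R"]), ("s", ["S"]), ("t", ["T"]),
    ("u", ["AH"]), ("v", ["V"]), ("w", ["W"]), ("x", ["K", "S"]),
    ("y", ["Y"]), ("z", ["Z"]),
]

def to_phonemes(word):
    """Return a phoneme list: lexicon look-up first, rules as fallback."""
    word = word.lower()
    if word in EXCEPTIONS:                 # quick dictionary look-up
        return list(EXCEPTIONS[word])
    phonemes = []
    i = 0
    while i < len(word):
        for grapheme, phones in RULES:
            if word.startswith(grapheme, i):
                phonemes.extend(phones)
                i += len(grapheme)
                break
        else:
            i += 1                         # skip characters with no rule
    return phonemes

print(to_phonemes("one"), to_phonemes("sheep"))
```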
• Much easier for some languages (Spanish, Italian, Welsh, Czech, Korean)
• Much harder for others (English, French)
• Especially if the writing system is only partially alphabetic (Arabic, Urdu)
• Or not alphabetic at all (Chinese, Japanese)
Prosody modelling
The voice parameters affected by emotions are usually categorized into three main types:
• Voice quality covers characteristics that stay largely constant over the spoken utterance, such as loudness and breathiness.
• Pitch contour and its dynamic changes carry important emotional information.
• Time characteristics cover the general rhythm, the speech rate, the lengthening and shortening of stressed syllables, the length of content words, and the duration and placing of pauses.
• The secondary emotional states are:
– Anger: the voice is very breathy and has tense articulation with abrupt changes.
– Happiness or joy: the voice is breathy and light, without tension.
– Fear or anxiety: articulation is precise, the voice is irregular, and energy at lower frequencies is reduced.
– Sadness or sorrow: articulation precision and speech rate are decreased.
– Disgust or contempt: the average pitch level and the speech rate are lower than in normal speech, and the number of pauses is high.
– Whispering and shouting: whispering is produced by speaking with high breathiness and without a fundamental frequency. Shouted speech has an increased pitch range and intensity, with greater variability.
Acoustic synthesis
• Methods, techniques and algorithms:
– Articulatory synthesis
– Formant synthesis
– Concatenative synthesis: PSOLA method, microphonemic method, linear-prediction-based methods, sinusoidal models
Articulatory synthesis
• Refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there.
• Wolfgang von Kempelen and others used bellows, reeds and tubes to construct mechanical speaking machines.
• Modern versions simulate electronically the effect of articulator positions, vocal tract shape, etc.
Formant synthesis
• A formant is an acoustic resonance of the human vocal tract.
• Probably the most widely used synthesis method of recent decades.
• The synthesized speech output is created using additive synthesis and an acoustic model.
• SoftVoice synthesizers simulate the human speech production mechanism using digital oscillators, noise sources and filters (formant resonators), much like an electronic music synthesizer.
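The source-and-filter idea behind formant synthesis can be sketched with a pulse train (standing in for the glottal source) passed through a cascade of two-pole digital resonators, one per formant. This is a generic textbook-style sketch, not SoftVoice's implementation; the function names and the formant frequencies and bandwidths in the usage line are illustrative assumptions for an "ah"-like vowel.

```python
import math

def resonator(signal, freq, bandwidth, rate):
    """Two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / rate
    c = -math.exp(-2 * math.pi * bandwidth * t)
    b = 2 * math.exp(-math.pi * bandwidth * t) * math.cos(2 * math.pi * freq * t)
    a = 1 - b - c                          # unity gain at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(f0, duration, rate):
    """One unit pulse per pitch period, a crude voiced source."""
    period = int(rate / f0)
    n = round(duration * rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def synthesize_vowel(formants, f0=120.0, duration=0.25, rate=16000):
    """Filter the source through one resonator per (frequency, bandwidth)."""
    signal = impulse_train(f0, duration, rate)
    for freq, bandwidth in formants:
        signal = resonator(signal, freq, bandwidth, rate)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]      # normalize to the range -1..1

# Illustrative formant values (Hz) and bandwidths (Hz), not measured data.
samples = synthesize_vowel([(700, 90), (1200, 110), (2600, 160)])
print(len(samples))
```

Changing the formant frequencies changes the vowel quality, while changing `f0` changes the pitch, which is what makes this method easy to control.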
Formant synthesis demo: Microsoft Windows
• In Control Panel, select the “Speech” icon.
• Type in your text and preview the voice.
• You may have a choice of voices.
Concatenative synthesis
• Concatenates segments of pre-recorded natural human speech.
• Requires a database or lexicon of previously recorded human speech covering all the segments to be synthesized.
• A segment might be a phoneme, syllable, word, phrase, or any combination.
• Diphone segments can be digitally manipulated for length, pitch and loudness.
• Segment boundaries need to be smoothed to avoid distortion.
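One simple way to smooth a segment boundary is a linear crossfade, where the end of one segment fades out while the start of the next fades in. The sketch below assumes segments are plain lists of samples; `crossfade_concat` and the overlap length are hypothetical choices for illustration.

```python
def crossfade_concat(a, b, overlap):
    """Join segments a and b, blending the last `overlap` samples of a
    with the first `overlap` samples of b using linear weights."""
    if overlap <= 0 or overlap > min(len(a), len(b)):
        raise ValueError("overlap must be between 1 and the shorter segment length")
    out = list(a[:-overlap])
    start = len(a) - overlap
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)        # weight of b rises from 0 towards 1
        out.append((1 - w) * a[start + i] + w * b[i])
    out.extend(b[overlap:])
    return out

joined = crossfade_concat([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], 2)
print(joined)
```

The result is shorter than the two segments laid end to end by `overlap` samples, because the overlapping region is shared.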
Concatenative synthesis methods
• PSOLA (Pitch-Synchronous Overlap-Add)
– This algorithm concatenates segments smoothly and gives good control over pitch and duration.
– It is used in commercial synthesis systems.
– Time-domain PSOLA (TD-PSOLA) is the most commonly used version because of its computational efficiency.
• Micro-phoneme method
– Concatenation is done by linear amplitude-based interpolation between the prototypes.
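The core of time-domain PSOLA can be sketched under a strong simplifying assumption: the pitch period is constant and already known, so the pitch marks are just multiples of the period. Real systems have to detect pitch marks in the recording. The function name `td_psola` and its parameters are hypothetical; the sketch changes pitch while keeping duration.

```python
import math

def td_psola(signal, period, pitch_factor):
    """Change pitch by pitch_factor, keeping duration.
    Assumes a constant pitch period (in samples) with marks at its multiples."""
    n = len(signal)
    marks = list(range(period, n - period, period))   # analysis pitch marks
    if not marks:
        raise ValueError("signal must be longer than two pitch periods")
    size = 2 * period
    # Hann window spanning two periods, centred on a pitch mark.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * k / size) for k in range(size)]
    new_period = max(1, round(period / pitch_factor))
    out = [0.0] * n
    t = period                                        # synthesis pitch mark
    while t < n - period:
        m = min(marks, key=lambda mark: abs(mark - t))  # nearest analysis mark
        for k in range(size):
            # Overlap-add the windowed grain from mark m at position t.
            out[t - period + k] += window[k] * signal[m - period + k]
        t += new_period
    return out

# A pulse every 80 samples becomes a pulse every 40 samples (pitch doubled).
pulses = [1.0 if i % 80 == 0 else 0.0 for i in range(1600)]
raised = td_psola(pulses, 80, 2.0)
print(len(raised))
```

With `pitch_factor` equal to 1 the overlapping Hann windows sum to one, so the interior of the signal is reproduced unchanged, which is a handy sanity check.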
• Linear-prediction-based methods
– Originally designed for speech coding systems, but also used for speech synthesis.
– The covariance and autocorrelation methods are used to estimate the predictor coefficients.
• Sinusoidal models
– Based on the assumption that the voice signal can be represented as a sum of sine waves with time-varying amplitudes and frequencies.
– Sinusoidal models have been used successfully in singing voice synthesis driven through a MIDI interface.
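The sinusoidal model can be written as s(t) = Σ A_k(t) · sin(φ_k(t)), where each phase φ_k is the running integral of that partial's frequency. A minimal sketch, assuming each partial is given as a pair of Python functions of time; the name `sinusoidal_synthesis` and the example partials are illustrative only.

```python
import math

def sinusoidal_synthesis(partials, n_samples, rate):
    """Sum of sinusoids with time-varying amplitude and frequency.
    partials: list of (amp_fn, freq_fn), each a function of time in seconds."""
    out = [0.0] * n_samples
    for amp_fn, freq_fn in partials:
        phase = 0.0
        for n in range(n_samples):
            t = n / rate
            out[n] += amp_fn(t) * math.sin(phase)
            # Accumulate phase so that a changing frequency stays click-free.
            phase += 2 * math.pi * freq_fn(t) / rate
    return out

# One steady 100 Hz partial plus a quieter partial gliding up from 200 Hz.
tone = sinusoidal_synthesis(
    [(lambda t: 1.0, lambda t: 100.0),
     (lambda t: 0.3 * (1 - t), lambda t: 200.0 + 50.0 * t)],
    n_samples=8000, rate=8000)
print(len(tone))
```

Accumulating phase instead of computing sin(2πft) directly is what lets the frequency vary over time without discontinuities.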
APPLICATIONS
• Applications for the blind
– Used as a reading and communication aid for blind people.
– Current systems are mostly software based, so they can be combined with a scanner and an OCR (optical character recognition) system.
• Applications for the deafened and vocally handicapped
– Provide an opportunity to communicate with people who do not understand sign language.
– HAMLET helps users to express their feelings.
– The HAMLET system is used with a high-quality TTS system such as DECtalk.
• Educational applications
– Can be programmed for special tasks such as teaching spelling and pronunciation in different languages.
– A speech synthesizer connected to a word processor is helpful for proofreading.
• Applications for telecommunication and multimedia
– Synthesized speech is used in all kinds of telephone enquiry systems.
– VoiceXML: Internet surfing using voice.
PRODUCTS
• INFOVOX: the INFOVOX speech synthesizer is perhaps one of the best-known multilingual TTS products. The latest full commercial version available is INFOVOX IVOX.
• DECtalk: available for American English, Spanish and German, with nine different voice personalities: four female, four male and one child.
• Bell Labs Text-to-Speech: available in English, French, Spanish, Italian, German, Russian, Romanian, Chinese and Japanese.
• SoftVoice: best known for the SAM (Software Automatic Mouth) synthesizer for Apple (MacinTalk), Amiga and Atari computers. Fifth-generation SoftVoice is also available for Windows in 20 different languages.
• CNET PSOLA: one of the most promising methods for concatenative synthesis, developed by France Telecom's CNET (Centre National d’Études des Télécommunications).
• Apple PlainTalk: Apple developed three different speech synthesis systems for Macintosh computers.
• Microsoft Whistler: Whistler (Whisper Highly Intelligent Stochastic Talker) is a trainable speech synthesis system under development at Microsoft Research, Redmond, USA. The system is designed to produce synthetic speech that sounds natural and resembles the acoustic and prosodic characteristics of the original speaker.
THANK YOU