Text-to-Speech Lin Cheng Yuan 2014,6,18. Agenda TTS Introduction HTS Q & A

Text-to-Speech

Lin Cheng Yuan

2014,6,18

Agenda

• TTS Introduction
• HTS
• Q & A


Stephen Hawking’s Voice

Professor Stephen Hawking selects NeoSpeech Text-to-Speech as his new voice. (Mar. 15, 2004)


http://www.neospeech.com/NewsDetail.aspx?id=50

http://www.youtube.com/watch?v=T2muY-lTzt0&feature=related

TTS Software Comparison

http://en.wikipedia.org/wiki/Comparison_of_speech_synthesizers


Unit-selection synthesis (USS) (1/2)

[Figure: block diagram of a unit-selection system, split into a training part and a synthesis part. Blocks shown: SPEECH DATABASE, speech signal, transcriptions, spectral parameter extraction, prosodic parameter extraction, spectral parameters, prosodic parameters, build USS system, unit database & cost functions, TEXT, text analysis, prosody prediction, unit-selection algorithm, signal processing, SYNTHESIZED SPEECH.]

Unit-selection synthesis (USS) (2/2)

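Unit-selection algorithm [Hunt; ’96]

Major techniques
– General selection technique [Hunt; ’96]
– Clustering-based technique [Donovan; ’95]

Target cost between a target unit \(t_i\) and a candidate unit \(u_i\):

\[ C^t(t_i, u_i) = \sum_{j=1}^{p} w_j^t \, C_j^t(t_i, u_i) \]

Concatenation cost between consecutive candidate units \(u_{i-1}\) and \(u_i\):

\[ C^c(u_{i-1}, u_i) = \sum_{k=1}^{q} w_k^c \, C_k^c(u_{i-1}, u_i) \]

Total cost of a candidate sequence \(u_1^n\) for the target sequence \(t_1^n\):

\[ C(t_1^n, u_1^n) = \sum_{i=1}^{n} C^t(t_i, u_i) + \sum_{i=2}^{n} C^c(u_{i-1}, u_i) \]

Selected unit sequence:

\[ \hat{u}_1^n = \arg\min_{u_1^n} C(t_1^n, u_1^n) \]

[Figure: target units \(t_{i-1}, t_i, t_{i+1}\) aligned with candidate units \(u_{i-1}, u_i, u_{i+1}\); the target cost links each target unit to its candidate unit, and the concatenation cost links neighbouring candidate units.]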
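The minimization above is usually carried out with a Viterbi-style dynamic program over the candidate lattice. The following is a minimal sketch under stated assumptions: units are plain numbers, and `target_cost` and `concat_cost` are hypothetical callables standing in for the weighted sums \(C^t\) and \(C^c\). It is not the implementation from [Hunt; ’96].

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Return (best_path, total_cost) minimizing target + concatenation costs.

    targets:     list of n target specifications
    candidates:  list of n lists; candidates[i] holds the candidate units for targets[i]
    target_cost, concat_cost: hypothetical cost callables (assumptions of this sketch)
    """
    n = len(targets)
    # best[i][j]: minimal accumulated cost of any path ending in candidates[i][j]
    best = [[target_cost(targets[0], u) for u in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, n):
        row, ptr = [], []
        for u in candidates[i]:
            costs = [best[i - 1][k] + concat_cost(prev, u)
                     for k, prev in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_min] + target_cost(targets[i], u))
            ptr.append(k_min)
        best.append(row)
        back.append(ptr)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = []
    for i in range(n - 1, -1, -1):
        path.append(candidates[i][j])
        if i > 0:
            j = back[i][j]
    path.reverse()
    return path, total


# Toy usage: each unit is described by a single pitch-like value.
toy_targets = [100, 120, 110]
toy_candidates = [[90, 105], [118, 140], [100, 111]]
toy_path, toy_total = select_units(
    toy_targets,
    toy_candidates,
    target_cost=lambda t, u: abs(t - u),
    concat_cost=lambda a, b: 0.1 * abs(a - b),
)
```

In this toy lattice the search picks 105, 118, 111: each is close to its target and the joins between them are cheap.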

Agenda

• TTS Introduction
• HTS
• Q & A


HMM-based speech synthesis system

[Figure: block diagram of the HMM-based speech synthesis system, split into a training part and a synthesis part. Training part blocks: SPEECH DATABASE, speech signal, excitation parameter extraction, spectral parameter extraction, excitation parameters, spectral parameters, labels, training HMMs, context-dependent HMMs & state duration models. Synthesis part blocks: TEXT, text analysis, labels, parameter generation from HMMs, excitation parameters, spectral parameters, excitation generation, excitation, synthesis filter, SYNTHESIZED SPEECH.]

HTS History

December 25, 2002

HTS version 1.0 was released.

December 29, 2006

HTS version 2.0 was released

October 1, 2007

HTS version 2.0.1 and hts_engine_API version 0.9 were released.

July 7, 2011

HTS version 2.2 was released.

December 25, 2012 (latest)

HTS version 2.3 alpha was released to the hts-users ML members.

More detail:

http://hts.sp.nitech.ac.jp/?History#v5285cba


http://hts.sp.nitech.ac.jp/?History

Before HTS

• HTK
  – The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models.
  – Latest version: 3.4.1 (2009)

• SPTK
  – The Speech Signal Processing Toolkit (SPTK) is a suite of speech signal processing tools.
  – Latest version: 3.7 (2013)

• hts_engine API
  – hts_engine is software that synthesizes speech waveforms from HMMs trained by the HMM-based speech synthesis system (HTS).
  – Latest version: 1.08 (2013)

• Optional
  – Festival
  – STRAIGHT


HTS Process

Speech database
– At least 2 hours of speech is recommended

Label file generation
– A good text analysis tool is a must

Feature extraction
– mcp, lf0, cmp, ...

Model training
– Uses HTK 3.4.1 modified with the HTS patch

HTS Installation


HTS Installation (1) – HTS for HTK patch


HTS Installation (2) – Normal Demo

1. Install Festival, SPTK, HTS, and hts_engine

2. Configure

3. Make


HTS Directory Structure Before Installation


HTS Directory Structure After Installation (include training)

[Figure: directory tree after installation, with callouts for data, scripts, results, and training.]

Voices Data

– .pdf (probability density function file)
  • Binary format
  • Statistics of model parameters

– .winX (window function)
  • Text format
  • Not actually used

– .inf (decision tree file)
  • Text format
  • Selected questions
  • Tree topology


Text-To-Speech using hts_engine


HTS Training Data - Labels

labels/mono: monophone (context-independent) label files. Labels must include time information.

labels/full: context-dependent label files. Time information can be omitted.

labels/gen: context-dependent label files for testing.
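To make the difference between timed and untimed labels concrete, here is a small sketch of a reader for HTK-style label lines of the form `start end name`, where HTK expresses times in units of 100 ns. The function name and the sample label strings are illustrative assumptions, not files from the HTS demo.

```python
HTK_UNITS_PER_SECOND = 10_000_000  # HTK label times are in units of 100 ns


def parse_htk_label(text):
    """Parse label lines into (start_sec, end_sec, name) tuples.

    Lines without time information (allowed for labels/full) yield
    (None, None, name). Hypothetical helper for illustration only.
    """
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) >= 3 and fields[0].isdigit() and fields[1].isdigit():
            start = int(fields[0]) / HTK_UNITS_PER_SECOND
            end = int(fields[1]) / HTK_UNITS_PER_SECOND
            segments.append((start, end, fields[2]))
        else:
            segments.append((None, None, fields[0]))
    return segments


# Made-up monophone label with time information.
mono_example = "0 2500000 sil\n2500000 3600000 h\n3600000 5000000 ax\n"
mono_segments = parse_htk_label(mono_example)

# Made-up context-dependent label without time information.
full_segments = parse_htk_label("x^sil-h+ax=l\n")
```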


How To Get Label Files?


Label information:
– Raw text input
– Markup parser (XML tags)
– Text converter
  • Sentence segmentation
  • Text normalization (e.g. 06/18 → 六月十八號, "June 18th")
– Linguistic analyzer
  • Word segmentation & POS (part-of-speech) tagging
  • Word formation (構詞)
    – Determinative-measure compounds (定量式複合詞), e.g. 二月十本六萬
    – Reduplicated words (重疊詞), e.g. 慢慢看, 快快樂樂
  • Heteronym processing (破音字處理)
    – e.g. 「著」火 vs. 「著」名; 「還」來 vs. 「還」錢來
  • Tone sandhi (變調)
    – e.g. 「一」定 (second tone) vs. 第「一」名 (first tone)
    – e.g. 「不」要 (second tone) vs. 「不」可 (fourth tone)

Time information:
– HTK forced alignment + manual revisions (optional)

[Figure: example label annotated with its fields: time, tone, syllable, word, POS, ...]
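The tone sandhi examples for 一 and 不 can be written as a small rule. This is a simplified sketch of the standard Mandarin rule, assuming the tone of the following syllable is already known. The function name and the `ordinal` flag are hypothetical, and neutral-tone and other special cases are ignored.

```python
def sandhi_tone(char, next_tone=None, ordinal=False):
    """Return the surface tone (1 to 4) of 一 or 不.

    next_tone: tone of the following syllable, or None if there is none.
    ordinal:   True when 一 is used as an ordinal (e.g. 第一名).
    Simplified illustration; real front ends handle many more cases.
    """
    if char == "不":
        # 不 is fourth tone, but second tone before another fourth tone.
        return 2 if next_tone == 4 else 4
    if char == "一":
        if ordinal or next_tone is None:
            return 1  # citation tone for ordinals and phrase-final use
        # Second tone before a fourth tone, otherwise fourth tone.
        return 2 if next_tone == 4 else 4
    raise ValueError("this sketch only handles 一 and 不")


tone_yi_ding = sandhi_tone("一", next_tone=4)                # 一定
tone_di_yi = sandhi_tone("一", next_tone=2, ordinal=True)    # 第一名
tone_bu_yao = sandhi_tone("不", next_tone=4)                 # 不要
tone_bu_ke = sandhi_tone("不", next_tone=3)                  # 不可
```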

HTS Training Data - Feature Data

o /data/mcp
  o mel-cepstral features
  o extracted with the SPTK toolkit (mcep)

o /data/lf0
  o logarithmic fundamental frequency
  o extracted with the SPTK toolkit (pitch)

o /data/cmp
  o composite features (MCP + LF0)
  o built with Perl scripts (data/scripts)


HTS Training Data - Question Set

Example:


HTS Training Data - Others

o /data/scp
  o train.scp for training
  o gen.scp for synthesis

o /data/lists
  o full.list: all unique labels in the training data
  o full_all.list: all unique labels in the training data and the test data

o /data/win
  o window parameters used in feature extraction (dynamic features)
  o lf0.win1 ~ 3, mcp.win1 ~ 3


Introduction to the Basic Principles of HTS

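MCP (Mel-Cepstral Parameter) Delta

– Taking mcp.win2 as an example: "3 -0.5 0.0 0.5" (3 coefficients: -0.5, 0.0, 0.5)

[Figure: the MCP matrix has M dimensions by K frames (N = 1 ... K). The window coefficients -0.5, 0.0, 0.5 are applied to neighbouring frames (for example frames N = 1, 2, 3 for the delta at N = 2) and the weighted frames are summed, giving an M-dimension by K-frame MCP delta matrix.]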
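The window operation can be sketched in a few lines. The example below parses a window line in the `3 -0.5 0.0 0.5` form and applies it frame by frame. Edge frames are handled by repeating the first and last frame, which is an assumption of this sketch and not necessarily what the HTS scripts do. The function names are hypothetical.

```python
def parse_window(line):
    """Parse a window line such as '3 -0.5 0.0 0.5' into its coefficients."""
    fields = line.split()
    count = int(fields[0])
    coeffs = [float(x) for x in fields[1:1 + count]]
    if len(coeffs) != count:
        raise ValueError("window line is shorter than its declared length")
    return coeffs


def apply_window(frames, window):
    """Apply an odd-length window across time to K frames of M dimensions.

    frames: list of K lists, each holding M floats.
    Edge handling (an assumption): out-of-range frames repeat the edge frame.
    """
    half = len(window) // 2
    num_frames = len(frames)
    result = []
    for t in range(num_frames):
        row = [0.0] * len(frames[0])
        for offset, weight in zip(range(-half, half + 1), window):
            src = frames[min(max(t + offset, 0), num_frames - 1)]
            for m, value in enumerate(src):
                row[m] += weight * value
        result.append(row)
    return result


win2 = parse_window("3 -0.5 0.0 0.5")
static = [[1.0], [2.0], [4.0], [7.0]]   # K = 4 frames, M = 1 dimension
delta = apply_window(static, win2)
```

For the second frame the delta is 0.5 * 4.0 - 0.5 * 1.0 = 1.5, half the difference between its two neighbours.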

NGASR Workshop 2010, 2010/1/25

HTS - Feature Representation

Spectral parameters (MCP, M x 1) and excitation parameters (LF0, 1 x 1) are extended with Δ and ΔΔ features by the window functions and combined into one observation vector (cmp):

Stream 1: MCP, Δ-MCP, ΔΔ-MCP (3M x 1); one Gaussian per state
Stream 2: LF0 (1 x 1)
Stream 3: Δ-LF0 (1 x 1)
Stream 4: ΔΔ-LF0 (1 x 1)
Streams 2 to 4: two mixtures per state, “voiced” and “unvoiced”

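Observation of F0

[Figure: log frequency plotted against time; F0 takes continuous values in voiced regions and has no value in unvoiced regions.]

• Voiced: \(\Omega_1 = \mathbb{R}^1\)
• Unvoiced: \(\Omega_2 = \mathbb{R}^0\)

F0 cannot be modeled by a purely continuous or a purely discrete distribution ⇒ multi-space probability distribution HMM (MSD-HMM)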

MSD-HMM for F0 modeling

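[Figure: three-state HMM for F0. Each state i = 1, 2, 3 has a voiced space \(\Omega_1 = \mathbb{R}^1\) with weight \(w_{i,1}\) and an unvoiced space \(\Omega_2 = \mathbb{R}^0\) with weight \(w_{i,2}\) (the voiced/unvoiced weights).]

This makes it possible to model F0 together with its dynamic features.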
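As a rough illustration of the voiced/unvoiced weights, the state-output probability of a two-space MSD can be sketched as follows. It assumes a single one-dimensional Gaussian for the voiced space and an unvoiced weight of one minus the voiced weight. The function and parameter names are hypothetical.

```python
import math


def msd_output_prob(obs, w_voiced, mean, var):
    """State-output probability of a two-space MSD for log F0.

    obs:      log F0 value for a voiced frame, or None for an unvoiced frame
    w_voiced: weight of the voiced space (unvoiced weight is 1 - w_voiced)
    mean, var: parameters of the voiced-space Gaussian
    Illustrative sketch only.
    """
    if obs is None:
        # The unvoiced space is zero-dimensional: only its weight contributes.
        return 1.0 - w_voiced
    gauss = math.exp(-0.5 * (obs - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
    return w_voiced * gauss


p_unvoiced = msd_output_prob(None, w_voiced=0.8, mean=5.0, var=0.04)
p_voiced = msd_output_prob(5.0, w_voiced=0.8, mean=5.0, var=0.04)
```

An unvoiced frame is scored by the unvoiced weight alone, while a voiced frame is scored by the voiced weight times a Gaussian density. This is how one model covers both kinds of frame.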

Structure of F0 state-output distributions

[Figure: per-state output distributions for F0. The log F0 streams \(p\), \(\Delta p\) and \(\Delta^2 p\) each have a voiced and an unvoiced component, modeled by an MSD (Gaussian & discrete).]

Generated F0

[Figure: three F0 contours, each plotted as frequency (axis ticks 50 to 200) against time (0 to 2 s): natural speech, F0 generated without dynamic features, and F0 generated with dynamic features.]

Speech parameter generation algorithm [Tokuda; ’00]

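For a given HMM \(\lambda\), determine the speech parameter vector sequence

\[ o = [o_1^\top, o_2^\top, \ldots, o_T^\top]^\top \]

which maximizes

\[ P(o \mid \lambda) = \sum_{q} P(o \mid q, \lambda) \, P(q \mid \lambda) \approx \max_{q} P(o \mid q, \lambda) \, P(q \mid \lambda) \]

⇒

\[ \hat{q} = \arg\max_{q} P(q \mid \lambda) \]

\[ \hat{o} = \arg\max_{o} P(o \mid \hat{q}, \lambda) \]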
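When the observation vectors stack static and dynamic features, they can be written as o = W c for a window matrix W, and the second maximization reduces to the linear system \(W^\top \Sigma^{-1} W c = W^\top \Sigma^{-1} \mu\). The sketch below solves that system for a one-dimensional static feature with a single delta window (-0.5, 0, 0.5). The diagonal covariances, the repeated-edge handling and all function names are assumptions of this illustration, not the HTS code.

```python
def build_window_matrix(num_frames):
    """Stack static and delta rows into a (2T x T) matrix W, so that o = W c."""
    rows = []
    for t in range(num_frames):
        static = [0.0] * num_frames
        static[t] = 1.0
        delta = [0.0] * num_frames
        # Delta window (-0.5, 0, 0.5); edge frames are repeated (assumption).
        delta[max(t - 1, 0)] += -0.5
        delta[min(t + 1, num_frames - 1)] += 0.5
        rows.extend([static, delta])
    return rows


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def generate_trajectory(means, variances):
    """Return the static trajectory c maximizing the Gaussian likelihood of W c.

    means, variances: length-2T lists ordered
    [static_0, delta_0, static_1, delta_1, ...] along the state sequence.
    """
    num_frames = len(means) // 2
    w = build_window_matrix(num_frames)
    prec = [1.0 / v for v in variances]
    rows = range(2 * num_frames)
    a = [[sum(w[r][i] * prec[r] * w[r][j] for r in rows) for j in range(num_frames)]
         for i in range(num_frames)]
    b = [sum(w[r][i] * prec[r] * means[r] for r in rows) for i in range(num_frames)]
    return solve_linear(a, b)


# Two frames whose static means jump from 0 to 1 while the delta means are 0.
smooth = generate_trajectory(means=[0.0, 0.0, 1.0, 0.0],
                             variances=[1.0, 0.1, 1.0, 0.1])
```

With these toy statistics the generated values are 5/11 and 6/11, so the jump between the two frames shrinks from 1 to 1/11. This is the smoothing effect of dynamic features.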

Structure of state-output distributions

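[Figure: state-output distributions for the observation vector \(o_t\). Spectrum part: mel-cepstrum \(c_t\), \(\Delta c_t\), \(\Delta^2 c_t\). Excitation part: log F0 \(p_t\), \(\Delta p_t\), \(\Delta^2 p_t\), each with a voiced and an unvoiced component.]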

Determination of state sequence (1/3)

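[Figure: three-state left-to-right HMM with self-transitions \(a_{11}, a_{22}, a_{33}\), forward transitions \(a_{12}, a_{23}\) and output probabilities \(b_1(o_t), b_2(o_t), b_3(o_t)\), aligned with the observation sequence \(o = (o_1, o_2, o_3, o_4, o_5, \ldots, o_T)\), the state sequence q (1 1 1 1 2 2 3 3) and the state durations d (4, 10, 5).]

• \(a_{ij}\): state transition probability
• \(b_q(o_t)\): output probability

Determine the state sequence by determining the state durations.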
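Once the state durations are fixed, the state sequence follows directly. The sketch below expands durations such as 4, 10, 5 into a frame-level state sequence. It also shows one common way of choosing durations from Gaussian duration models, \(d_k = m_k + \rho \sigma_k^2\), which is an assumption added here and does not appear on the slide. All names are hypothetical.

```python
def durations_from_models(means, variances, total_frames=None):
    """Pick integer state durations from Gaussian duration models.

    With total_frames=None the durations are the rounded means. Otherwise
    rho = (T - sum(means)) / sum(variances) stretches or shrinks them towards
    T frames (rounding can make the sum differ slightly from T).
    """
    if total_frames is None:
        rho = 0.0
    else:
        rho = (total_frames - sum(means)) / sum(variances)
    return [max(1, round(m + rho * v)) for m, v in zip(means, variances)]


def expand_state_sequence(durations):
    """Repeat each state index (starting at 1) for its number of frames."""
    sequence = []
    for state, frames in enumerate(durations, start=1):
        sequence.extend([state] * frames)
    return sequence


default_durations = durations_from_models([4.0, 10.0, 5.0], [1.0, 4.0, 1.0])
slow_durations = durations_from_models([4.0, 10.0, 5.0], [1.0, 4.0, 1.0],
                                       total_frames=25)
state_seq = expand_state_sequence(default_durations)
```

In this sketch, states with a larger duration variance absorb more of the change when the utterance is stretched to 25 frames.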

Relation between two approaches (2/3)

[Figure: the two pipelines side by side.
Unit selection (serial): Text → Text analysis → F0, Dur. → Units → Synthesis → SYNTHESIZED SPEECH
HMM-based (parallel): Text → Text analysis → Labels → Dur., Spect., F0 (features) → Synthesis → SYNTHESIZED SPEECH]

Comparison of two approaches

Unit selection
• Advantage: high quality at the waveform level (especially in a limited domain)
• Disadvantages: large footprint, discontinuities, unstable quality
• Fixed voice characteristics

HMM-based synthesis
• Advantages: small footprint, smooth, stable quality
• Disadvantage: vocoder sound (domain-independent)
• Various voice characteristics

Text Normalization Examples

– 1750 as seventeen-fifty (if a year) or one-thousand seven-hundred and fifty (if a measure).

– 1/3 as January third (if a date) or one third (if a fraction).

– St. can be saint or street.

– $100 million as one hundred million dollars, not as one hundred dollars million.

– RAM is usually pronounced as a word, and NBA letter by letter.

– Chapter III should be read as Chapter three.

– US means United States and UK means United Kingdom, but what about USE?
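A few of these cases can be handled with simple rules. The sketch below covers only the currency, Roman-numeral chapter and year readings from the list. The function names, the regular expressions and the 0 to 999 number range are assumptions chosen for illustration. Real front ends need context to choose between readings such as year and measure.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}


def number_to_words(n):
    """Spell out an integer from 0 to 999 (limited range for this sketch)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" and " + number_to_words(rest) if rest else "")


def year_to_words(year):
    """Read a year as two two-digit numbers; fits years like 1750 only."""
    high, low = divmod(year, 100)
    return number_to_words(high) + "-" + number_to_words(low)


def roman_to_int(numeral):
    """Convert a Roman numeral using the usual subtractive rule."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        total += -value if nxt in ROMAN and ROMAN[nxt] > value else value
    return total


def normalize(text):
    """Apply the currency and chapter rules to a sentence."""
    # "$100 million" -> "one hundred million dollars"
    text = re.sub(
        r"\$(\d{1,3}) (million|billion)",
        lambda m: f"{number_to_words(int(m.group(1)))} {m.group(2)} dollars",
        text,
    )
    # "Chapter III" -> "Chapter three"
    text = re.sub(
        r"\bChapter ([IVXLC]+)\b",
        lambda m: "Chapter " + number_to_words(roman_to_int(m.group(1))),
        text,
    )
    return text


currency_reading = normalize("It cost $100 million.")
chapter_reading = normalize("Chapter III")
year_reading = year_to_words(1750)
```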


Hybrid approaches (3/4)

Smoothing units

Smooth unit boundaries using HMM statistics
– Microsoft [Plumpe; ’98], OGI [Wouters; ’00]

[Figure: TEXT → Text analysis → Unit selection & concatenation → spectra & state sequences → Spectral smoother → smoothed spectra → Waveform synthesizer → SYNTHESIZED SPEECH]
