Posted 19-Dec-2015
Synthesis Generations
• First Generation
– Perfect speech could, in principle, be generated
– Required perfect setting of the parameters
– Human intervention put upper limits on the achievable quality
• Second Generation
– Memorize pre-stored waveforms for concatenation
– Cannot store enough data to concatenate everything we want
– Only allows pitch and timing changes
• Third Generation
– Introduce statistical models to learn the data's properties
– Allows the output to be modified in many ways
Dynamic Programming
• Definition: A technique that solves recursive problems by storing subproblem results in arrays (tables)
• Description:
– Start with the base case, which initializes the arrays
– Each step of the algorithm fills in table entries
– Later steps access table entries filled in by earlier steps
• Advantages:
– Avoids repeating calculations performed during recursion
– Uses loops without the overhead of creating activation records
• Applications:
– Many applications beyond signal processing
– Dynamic Time Warping: how close are two sequences?
– Hidden Markov Model algorithms
Example: Minimum Edit Distance
• Problem: How can we measure how different one word is from another (e.g., in a spell checker)?
– How many operations will transform one word into another?
– Examples: caat --> cat, fplc --> fireplace
• Definition:
– Levenshtein distance: the smallest number of insertion, deletion, or substitution operations that transforms one string into another
– Each insertion, deletion, or substitution counts as one operation
• Requires a two-dimensional array
– Rows: source word positions; Columns: target word positions
– Cells: distance[r][c] is the distance up to that point
A useful dynamic programming algorithm
Pseudo Code: minDistance(target, source)

  n = characters in source
  m = characters in target
  Create array, distance, with dimensions n+1, m+1
  FOR r = 0 TO n: distance[r,0] = r
  FOR c = 0 TO m: distance[0,c] = c
  FOR each row r = 1 TO n
    FOR each column c = 1 TO m
      IF source[r] = target[c] THEN cost = 0 ELSE cost = 1
      distance[r,c] = minimum of
        distance[r-1,c] + 1,         // deletion
        distance[r,c-1] + 1,         // insertion
        distance[r-1,c-1] + cost     // substitution
  Result is in distance[n,m]
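The pseudocode above translates directly to Python; a minimal sketch (function and variable names are my own):

```python
def min_distance(target: str, source: str) -> int:
    """Levenshtein distance via the dynamic-programming table above."""
    n, m = len(source), len(target)
    # distance[r][c] = edits needed to turn source[:r] into target[:c]
    distance = [[0] * (m + 1) for _ in range(n + 1)]
    for r in range(n + 1):
        distance[r][0] = r          # delete all r source characters
    for c in range(m + 1):
        distance[0][c] = c          # insert all c target characters
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            cost = 0 if source[r - 1] == target[c - 1] else 1
            distance[r][c] = min(distance[r - 1][c] + 1,         # deletion
                                 distance[r][c - 1] + 1,         # insertion
                                 distance[r - 1][c - 1] + cost)  # substitution
    return distance[n][m]

print(min_distance("GUMBO", "GAMBOL"))         # 2, as in the worked example
print(min_distance("EXECUTION", "INTENTION"))  # 5, as in the second example
```

With unit costs for all three operations, GAMBOL --> GUMBO needs one substitution (A to U) and one deletion (L), giving the distance of 2 computed in the slides that follow.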
Example
• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Initialization
G U M B O
0 1 2 3 4 5
G 1
A 2
M 3
B 4
O 5
L 6
Example
• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 1
G U M B O
0 1 2 3 4 5
G 1 0
A 2 1
M 3 2
B 4 3
O 5 4
L 6 5
Example
• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 2
G U M B O
0 1 2 3 4 5
G 1 0 1
A 2 1 1
M 3 2 2
B 4 3 3
O 5 4 4
L 6 5 5
Example
• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 3
G U M B O
0 1 2 3 4 5
G 1 0 1 2
A 2 1 1 2
M 3 2 2 1
B 4 3 3 2
O 5 4 4 3
L 6 5 5 4
Example
• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 4
G U M B O
0 1 2 3 4 5
G 1 0 1 2 3
A 2 1 1 2 3
M 3 2 2 1 2
B 4 3 3 2 1
O 5 4 4 3 2
L 6 5 5 4 3
Example
• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 5
• Result: Distance equals 2
G U M B O
0 1 2 3 4 5
G 1 0 1 2 3 4
A 2 1 1 2 3 4
M 3 2 2 1 2 3
B 4 3 3 2 1 2
O 5 4 4 3 2 1
L 6 5 5 4 3 2
Another Example
• Source: INTENTION, Target: EXECUTION
E X E C U T I O N
0 1 2 3 4 5 6 7 8 9
I 1 1 2 3 4 5 6 6 7 8
N 2 2 2 3 4 5 6 7 7 7
T 3 3 3 3 4 5 5 6 7 8
E 4 3 4 3 4 5 6 6 7 8
N 5 4 4 4 4 5 6 7 7 7
T 6 5 5 5 5 5 5 6 7 8
I 7 6 6 6 6 6 6 5 6 7
O 8 7 7 7 7 7 7 6 5 6
N 9 8 8 8 8 8 8 7 6 5
Hidden Markov Model
• Motivation
– We observe the output
– We don't know which internal states the model is in
– Goal: Determine the most likely internal (hidden) state sequence
– Hence the name, "Hidden"
• Definition (Discrete HMM): Ф = (O, S, A, B, Ω)
1. O = {o1, o2, …, oM} is the set of possible output symbols
2. S = {1, 2, …, N} is the set of possible internal HMM states
3. A = {aij} is the transition probability matrix from state i to state j
4. B = {bi(k)} is the probability of state i outputting ok
5. Ω = {Ωi} is the set of initial state probabilities, where Ωi is the probability that the system starts in state i
HMM Applications

Given an HMM model and an observation sequence:
1. Evaluation Problem: What is the probability that the model generated the observations?
2. Decoding Problem: What is the most likely state sequence S = (s0, s1, s2, …, sT) in the model that produced the observations?
3. Learning Problem: How can we adjust the model parameters to maximize the likelihood that the observations will be correctly recognized?
Hidden Markov Model (HMM)
Natural Language Processing and HMMs
1. Speech Recognition
• Which words generated the observed acoustic signal?
2. Handwriting Recognition
• Which words generated the observed image?
3. Part-of-speech Tagging
• Which parts of speech correspond to the observed words?
• Where are the word boundaries in the acoustic signal?
• Which morphological word variants match the acoustic signal?
4. Translation
• Which foreign words are in the observed signal?
5. Speech Synthesis
• Which database unit fits the synthesis script?
Demo: http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/hmms/s3_pg1.html
Natural Language HMM Assumptions
• A Stochastic Markov process
– System state changes are not deterministic; they vary according to some probability distribution
• Discrete
– There is a countable set of system states, observable at discrete time steps
• Markov Chain: The next state depends solely on the current state
• Output Assumption
– The output at a given state depends solely on that state

P(w1, …, wn) ≈ ∏i=2,n P(wi | wi-1), not P(w1, …, wn) = ∏i=2,n P(wi | w1, …, wi-1)
Demonstration of a Stochastic Process: http://cs.sou.edu/~harveyd/classes/cs415/docs/unm/movie.html
Speech Recognition Example
• Observations: The digital signal features
• Hidden States: The spoken words that generated the features
• Goal: Choose the Word that maximizes P(Word|Observation)
• Bayes' Law gives us something we can calculate:
– P(Word|Observation) = P(Word) P(Observation|Word) / P(Observation)
– Ignore the denominator: it is the same for every candidate word, so it does not affect which word maximizes the expression
• P(Word) can be looked up from a database
– Use bigrams or trigrams to take the context into account
– Chain rule: P(w1…wn) = P(w1) P(w2|w1) P(w3|w1,w2) … P(wn|w1…wn-1)
– If there is no such probability, we can use a smoothing algorithm to insert a value for combinations never encountered
HMM: Trellis Model
Question: How do we find the most likely sequence?
Probabilities
• Forward probability: αt(i)
The probability of the partial observation o1,…,ot and being in state si at time t
• Backward probability: βt(i)
The probability of the partial observation ot+1,…,oT, given being in state si at time t
• Transition probability: ξt(i,j) = P(qt = si, qt+1 = sj | O, λ)
The probability of being in state si at time t and going from state si to state sj, given the complete observation o1,…,oT
Forward Probabilities

Notes
λ = HMM, qt = HMM state at time t, sj = jth state, ot = observation at time t
aij = probability of transitioning from state si to sj
bj(ot) = probability of observation ot resulting from sj
αt(j) = probability of state j at time t given observations o1,o2,…,ot

αt(j) = ∑i=1,N {αt-1(i) aij} bj(ot) and αt(j) = P(o1…ot, qt = sj | λ)
Forward Algorithm Pseudo Code

  forward[i,j] = 0 for all i,j
  forward[0,0] = 1.0
  FOR each time step t
    FOR each state s
      FOR each state transition s to s'
        forward[s',t+1] += forward[s,t] * a(s,s') * b(s',ot)
  RETURN ∑ forward[s,tfinal+1] for all states s

Notes
1. a(s,s') is the transition probability from state s to state s'
2. b(s',ot) is the probability of state s' emitting observation ot
What is the likelihood of each possible observed pronunciation?
Complexity: O(T·S²), where S is the number of states and T the number of time steps
Viterbi Algorithm
• Viterbi is an efficient dynamic programming HMM algorithm that traces through a series of possible states to find the most likely state sequence behind an observation
• Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum

• Forward Algorithm: αt(j) = [∑i=1,N αt-1(i) aij] bj(ot)
• Viterbi: δt(j) = max1≤i≤N {δt-1(i) aij} bj(ot)
Viterbi Algorithm Pseudo Code

  viterbi[i,j] = 0 for all i,j
  viterbi[0,0] = 1.0
  FOR each time step t
    FOR each state s
      FOR each state transition s to s'
        newScore = viterbi[s,t] * a(s,s') * b(s',ot)
        IF newScore > viterbi[s',t+1]
          viterbi[s',t+1] = newScore
          save backpointer s for state s' at time t+1
  Trace the backpointers from the best final state to recover the state sequence
  RETURN the state sequence

Notes
1. a(s,s') is the transition probability from state s to state s'
2. b(s',ot) is the probability of state s' emitting observation ot
What is the likelihood of a word given an observation sequence?
Markov Example
• Problem: Model the probability of stocks being bull, bear, or stable
• Observed: up, down, unchanged
• Hidden: bull, bear, stable
aij Bull Bear Stable
Bull 0.6 0.2 0.2
Bear 0.5 0.3 0.2
Stable 0.4 0.1 0.5
Ωi
Bull 0.5
Bear 0.2
Stable 0.3
(Above: probability matrix aij and initialization matrix Ωi; a state diagram connects bull, bear, and stable)
Example: What is the probability of observing up five days in a row?
HMM Example
• O = {up, down, unchanged (Unch)}
• S = {bull (1), bear (2), stable (3)}
aij 1 2 3
1 0.6 0.2 0.2
2 0.5 0.3 0.2
3 0.4 0.1 0.5
State Ωi
1 0.5
2 0.2
3 0.3
Bi up down Unch.
1 0.7 0.1 0.2
2 0.1 0.6 0.3
3 0.3 0.3 0.4
Observe 'up, up, down, down, up'
What is the most likely sequence of states for this output?
Forward Probabilities

ai,c 0 1 2
0 0.6 0.2 0.2
1 0.5 0.3 0.2
2 0.4 0.1 0.5
State Ωc
0 0.5
1 0.2
2 0.3
bc up down Unch.
0 0.7 0.1 0.2
1 0.1 0.6 0.3
2 0.3 0.3 0.4
X = [up, up]
α1(c) = ∑i α0(i) · ai,c · bc(up)

t=0: α0(i) = bi(up) · Ωi
  α0(0) = 0.7 · 0.5 = 0.35
  α0(1) = 0.1 · 0.2 = 0.02
  α0(2) = 0.3 · 0.3 = 0.09
t=1:
  α1(0) = 0.179
  α1(1) = 0.009
  α1(2) = 0.036

Note: α1(2) = 0.35·0.2·0.3 + 0.02·0.2·0.3 + 0.09·0.5·0.3 = 0.0357
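The forward computation above can be sketched in Python for the stock model (states 0=bull, 1=bear, 2=stable, with the matrices from the slides; the helper function is my own):

```python
# Stock HMM from the slides: states 0=bull, 1=bear, 2=stable
A = [[0.6, 0.2, 0.2],
     [0.5, 0.3, 0.2],
     [0.4, 0.1, 0.5]]                  # A[i][j]: transition i -> j
B = {"up":   [0.7, 0.1, 0.3],
     "down": [0.1, 0.6, 0.3],
     "unch": [0.2, 0.3, 0.4]}          # B[o][i]: P(o | state i)
pi = [0.5, 0.2, 0.3]                   # initial state probabilities (Omega)

def forward(observations):
    """alpha[t][i] = P(o1..ot, q_t = i): the forward trellis."""
    alpha = [[pi[i] * B[observations[0]][i] for i in range(3)]]
    for o in observations[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(3)) * B[o][j]
                      for j in range(3)])
    return alpha

alpha = forward(["up", "up"])
print([round(a, 3) for a in alpha[0]])  # [0.35, 0.02, 0.09]
print([round(a, 4) for a in alpha[1]])  # [0.1792, 0.0085, 0.0357]
```

The t=1 values match the slide: 0.179, 0.009 (0.0085 before rounding), and 0.036 (0.0357).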
Viterbi Example

ai,c 0 1 2
0 0.6 0.2 0.2
1 0.5 0.3 0.2
2 0.4 0.1 0.5
State Ωc
0 0.5
1 0.2
2 0.3
bc up down Unch.
0 0.7 0.1 0.2
1 0.1 0.6 0.3
2 0.3 0.3 0.4
Observed = [up, up]
δ1(c) = maxi {δ0(i) · ai,c} · bc(up)

t=0: δ0(i) = bi(up) · Ωi
  δ0(0) = 0.7 · 0.5 = 0.35
  δ0(1) = 0.02
  δ0(2) = 0.09
t=1:
  δ1(0) = 0.147
  δ1(1) = 0.007
  δ1(2) = 0.021

Note: δ1(2) = 0.021 = 0.35·0.2·0.3, versus 0.02·0.2·0.3 and 0.09·0.5·0.3
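The Viterbi pass for the same stock model can be sketched as follows (same matrices as the forward example; the backpointer bookkeeping is my own):

```python
# Stock HMM (0=bull, 1=bear, 2=stable), as in the forward example
A = [[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]]
B = {"up": [0.7, 0.1, 0.3], "down": [0.1, 0.6, 0.3], "unch": [0.2, 0.3, 0.4]}
pi = [0.5, 0.2, 0.3]

def viterbi(observations):
    """Return the most likely state path and its probability."""
    v = [pi[i] * B[observations[0]][i] for i in range(3)]   # delta_0
    back = []                                               # backpointers
    for o in observations[1:]:
        # scores[j][i]: score of reaching state j from state i
        scores = [[v[i] * A[i][j] * B[o][j] for i in range(3)]
                  for j in range(3)]
        back.append([max(range(3), key=lambda i: scores[j][i])
                     for j in range(3)])
        v = [max(scores[j]) for j in range(3)]
    # Trace the best path backwards from the best final state
    state = max(range(3), key=lambda j: v[j])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.insert(0, state)
    return path, max(v)

path, p = viterbi(["up", "up"])
print(path, round(p, 3))   # [0, 0] 0.147
```

For [up, up] the best path stays in the bull state twice, with δ1(0) = 0.147, matching the example.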
Backward Probabilities
• Similar algorithm to computing the forward probabilities, but in the other direction
• Answers the question: Given the HMM model and that the state at time t is si, what is the probability of generating the partial observation ot+1 … oT?

βt(i) = P(ot+1…oT | qt = si, λ)
βt(i) = ∑j=1,N {aij bj(ot+1) βt+1(j)}
Backward Probabilities

βt(i) = ∑j=1,N {βt+1(j) aij bj(ot+1)} and βt(i) = P(ot+1…oT | qt = si, λ)

Notes
λ = HMM, qt = HMM state at time t, sj = jth state
aij = probability of transitioning from state si to sj
bj(ot) = probability of observation ot resulting from sj
βt(i) = probability of observations ot+1,ot+2,…,oT given state i at time t
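The backward recursion mirrors the forward one. A sketch using the stock model from earlier slides; a small forward helper is included only to check that both directions yield the same P(O) (the helper names are my own):

```python
# Stock HMM (0=bull, 1=bear, 2=stable)
A = [[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]]
B = {"up": [0.7, 0.1, 0.3], "down": [0.1, 0.6, 0.3], "unch": [0.2, 0.3, 0.4]}
pi = [0.5, 0.2, 0.3]

def backward(observations):
    """beta[t][i] = P(o_{t+1}..o_T | q_t = i)."""
    T = len(observations)
    beta = [[1.0] * 3]                       # beta_T(i) = 1 by definition
    for t in range(T - 1, 0, -1):
        o = observations[t]                  # this is o_{t+1} for row t
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[o][j] * nxt[j] for j in range(3))
                        for i in range(3)])
    return beta

def forward_prob(observations):
    """P(O) via the forward recursion, for a consistency check."""
    a = [pi[i] * B[observations[0]][i] for i in range(3)]
    for o in observations[1:]:
        a = [sum(a[i] * A[i][j] for i in range(3)) * B[o][j] for j in range(3)]
    return sum(a)

obs = ["up", "up", "down"]
beta = backward(obs)
# P(O) from the backward side: sum_i pi_i * b_i(o_1) * beta_1(i)
p_bwd = sum(pi[i] * B[obs[0]][i] * beta[0][i] for i in range(3))
p_fwd = forward_prob(obs)
print(round(p_fwd, 6), round(p_bwd, 6))  # both 0.054398
```

That the two directions agree on P(O) is the standard sanity check before using α and β together for re-estimation.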
Parameters for HMM States
• Cepstrals
– Why? They are largely statistically independent, which makes them suitable for classifying outputs
• Delta coefficients
– Why? To overcome the HMM limitation that transitions depend on only one previous state. Speech articulators change slowly, so they don't follow the traditional HMM model. Without delta coefficients, the HMM tends to jump too quickly between states
• Synthesis requires more parameters than ASR
– Examples: additional delta coefficients, duration and F0 modeling, acoustic energy
Cepstral Review

1. Perform a Fourier transform to go from the time to the frequency domain
2. Warp the frequencies using the Mel scale
3. Gather the amplitude data into bins (usually 13)
4. Take the log power of the amplitudes
5. Perform a discrete cosine transform (no complex numbers) to form the cepstrals
6. Compute first- and second-order delta coefficients

Note: Phase data is lost in the process
Training Data
• Question: How do we establish the transition probabilities between states when that information is not available?
– Older method: tedious hand-marking of wave files based on spectrograms
– Optimal method: an exact search is NP-complete, hence intractable
– Newer method: the HMM Baum-Welch algorithm is a popular heuristic to automate the process
• Strategies
– Speech Recognition: train with data from many speakers
– Speech Synthesis: train with data from specific speakers
Baum-Welch Algorithm Pseudo-code

• Initialize the HMM parameters; iterations = 0
• DO
– HMM' = HMM; iterations++
– FOR each training data sequence
• Calculate forward probabilities
• Calculate backward probabilities
• Update the HMM parameters
• UNTIL |HMM - HMM'| < delta OR iterations ≥ MAX
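One re-estimation pass of Baum-Welch can be sketched on the stock model (a minimal single-sequence version, no scaling or convergence loop; all helper names are my own). EM guarantees the likelihood P(O) never decreases after an update:

```python
# Stock HMM (0=bull, 1=bear, 2=stable)
A = [[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]]
B = {"up": [0.7, 0.1, 0.3], "down": [0.1, 0.6, 0.3], "unch": [0.2, 0.3, 0.4]}
pi = [0.5, 0.2, 0.3]
N = 3

def forwards(obs):
    alpha = [[pi[i] * B[obs[0]][i] for i in range(N)]]
    for o in obs[1:]:
        p = alpha[-1]
        alpha.append([sum(p[i] * A[i][j] for i in range(N)) * B[o][j]
                      for j in range(N)])
    return alpha

def backwards(obs):
    beta = [[1.0] * N]
    for t in range(len(obs) - 1, 0, -1):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[obs[t]][j] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta

def reestimate(obs):
    """One EM pass; returns updated (pi, A, B)."""
    alpha, beta = forwards(obs), backwards(obs)
    p_obs = sum(alpha[-1])
    T = len(obs)
    # gamma[t][i] = P(q_t=i | O); xi[t][i][j] = P(q_t=i, q_{t+1}=j | O)
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[obs[t+1]][j] * beta[t+1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = {o: [sum(g[i] for t, g in enumerate(gamma) if obs[t] == o) /
                 sum(g[i] for g in gamma) for i in range(N)] for o in B}
    return new_pi, new_A, new_B

obs = ["up", "up", "down", "down", "up"]
before = sum(forwards(obs)[-1])
pi, A, B = reestimate(obs)
after = sum(forwards(obs)[-1])
print(before, "->", after)   # likelihood is non-decreasing
```

A real implementation would repeat this until the parameter change falls below delta, as in the pseudocode above, and would work in log space to avoid underflow.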
Re-estimation of State Changes

Sum, over time, the forward/backward ways to be in state i at time t and move to state j at time t+1 with the observed output, divided by the forward/backward ways to be in state i at time t:

a'ij = ∑t=1,T-1 {αt(i) aij bj(ot+1) βt+1(j)} / ∑t=1,T-1 {αt(i) βt(i)}

Notes:
– αt(i) is the forward value; βt+1(j) is the backward value; b(ot) is already part of αt(i)
– Each numerator term aij bj(ot+1) links the joint probability of state i at t and state j at t+1
Re-estimation of Other Probabilities
• The probability of an output, o, being observed from a given state, s:

b's(o) = (Number of times in state s observing o) / (Number of times in state s)

• The probability of initially being in state s, given the observed output sequence:

Ω's = ∑j=1,N {α1(s) asj bj(o2) β2(j)} / ∑i=1,N ∑j=1,N {α1(i) aij bj(o2) β2(j)}
Summary of HMM Approaches
• Discrete
– The continuous-valued observed outputs are compared against a codebook of discrete values for HMM observations
– Performs well for smaller dictionaries
• Continuous Mixture Density
– The observed outputs are fed to the HMM in continuous form
– Gaussian mixture: outputs map to a range of distribution parameters
– Applicable for large vocabularies with a large number of parameters
• Semi-Continuous
– No mixture of Gaussian densities
– A tradeoff between discrete and continuous mixture
– Large vocabularies: better than discrete, worse than continuous
HMM Limitations
1. HMM training is a hill-climbing algorithm
– It finds local optima, not global ones
– It is sensitive to initial parameter settings
2. HMMs have trouble modeling the time duration of speech
3. The first-order Markov and independence assumptions don't exactly model speech
4. Underflow occurs when computing Markov probabilities; for this reason, log probabilities are normally used
5. Continuous output model performance is limited by probabilities that incorrectly map to outputs
6. Outputs are interrelated, not independent
Decision Trees

(Figure: two scatter-plot partitions of data points, contrasting a reasonably good partition with a poor partition)

Partition: ask a series of questions, each with a discrete set of answers
CART Algorithm

1. Create a set of questions that can distinguish between the measured variables
a. Singleton questions: Boolean (yes/no or true/false) answers
b. Complex questions: many possible answers
2. Initialize the tree with one root node
3. Compute the entropy for a node to be split
4. Pick the question with the greatest entropy gain
5. Split the tree based on step 4
6. Return to step 3 as long as nodes remain to split
7. Prune the tree to the optimal size by removing leaf nodes with minimal improvement
Classification and regression trees
Note: We build the tree from top down. We prune the tree from bottom up.
Example: Play or not Play?
Outlook Temperature Humidity Windy Play?
sunny hot high false No
sunny hot high true No
overcast hot high false Yes
rain mild high false Yes
rain cool normal false Yes
rain cool normal true No
overcast cool normal true Yes
sunny mild high false No
sunny cool normal false Yes
rain mild normal false Yes
sunny mild normal true Yes
overcast mild high true Yes
overcast hot normal false Yes
rain mild high true No
Questions
1) What is the outlook?
2) What is the temperature?
3) What is the humidity?
4) Is it windy?

Goal: Order the questions in the most efficient way
Example Tree for "Do we play?"

Outlook?
  sunny --> Humidity?
    high --> No
    normal --> Yes
  overcast --> Yes
  rain --> Windy?
    true --> No
    false --> Yes

Goal: Find the optimal tree
Which question to select?
(Figure source: Witten & Eibe)
Computing Entropy
• Entropy: the number of bits needed to store the possible question answers
• Formula for computing the entropy of a question:

Entropy(p1, p2, …, pn) = -p1 log2 p1 - p2 log2 p2 - … - pn log2 pn

where pi is the probability of the ith answer to a question and log2 x is the logarithm base 2 of x

• Examples:
– A coin toss requires one bit (head = 1, tail = 0)
– A question with 30 equally likely answers requires ∑i=1,30 -(1/30) log2(1/30) = -log2(1/30) = 4.907 bits
Example: question “Outlook”
Entropy(“Outlook”=“Sunny”)=Entropy(0.4, 0.6)=-0.4 log2(0.4)-0.6 log2(0.6)=0.971
Five outcomes, 2 for play for P = 0.4, 3 for not play for P=0.6
Entropy(“Outlook” = “Overcast”) = Entropy(1.0, 0.0)= -1 log2(1.0) - 0 log2(0.0) = 0.0
Four outcomes, all for play. P = 1.0 for play and P = 0.0 for no play.
Entropy(“Outlook”=“Rainy”)= Entropy(0.6,0.4)= -0.6 log2(0.6) - 0.4 log2(0.4)= 0.971
Five Outcomes, 3 for play for P=0.6, 2 for not play for P=0.4
Entropy(Outlook) = Entropy(Sunny, Overcast, Rainy) = 5/14*0.971+4/14*0+5/14*0.971 = 0.693
Compute the entropy for the question: What is the outlook?
Computing the Entropy Gain
• Original entropy: Do we play?
Entropy(“Play“)=Entropy(9/14,5/14)=-9/14log2(9/14) - 5/14 log2(5/14)=0.940
14 outcomes, 9 for Play P = 9/14, 5 for not play P=5/14
• Information gain equals (information before) – (information after)
gain("Outlook") = 0.940 – 0.693 = 0.247
• Information gain for other weather questions– gain("Temperature") = 0.029– gain("Humidity") = 0.152– gain("Windy") = 0.048
• Conclusion: Ask, “What is the Outlook?” first
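The entropy and gain numbers above can be reproduced in Python from the 14-row table (a minimal sketch; the function names are my own):

```python
from math import log2
from collections import Counter

# The 14-row "play?" data set from the slides
data = [  # (outlook, temperature, humidity, windy, play)
    ("sunny", "hot", "high", False, "no"), ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"), ("rain", "mild", "high", False, "yes"),
    ("rain", "cool", "normal", False, "yes"), ("rain", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"), ("rain", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"), ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rain", "mild", "high", True, "no"),
]

def entropy(labels):
    """Entropy(p1..pn) = -sum p_i log2 p_i over the label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values() if c)

def gain(attr_index):
    """Information gain = entropy before the split - weighted entropy after."""
    base = entropy([row[-1] for row in data])
    remainder = 0.0
    for v in set(row[attr_index] for row in data):
        subset = [row[-1] for row in data if row[attr_index] == v]
        remainder += len(subset) / len(data) * entropy(subset)
    return base - remainder

for name, idx in [("Outlook", 0), ("Temperature", 1),
                  ("Humidity", 2), ("Windy", 3)]:
    print(f'gain("{name}") = {gain(idx):.3f}')
# Matches the slide: 0.247, 0.029, 0.152, 0.048
```

Since Outlook has the largest gain, the CART algorithm splits on it first, exactly as the conclusion above states.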
Continuing to Split

Within the "sunny" branch:
gain("Temperature") = 0.571 bits
gain("Humidity") = 0.971 bits
gain("Windy") = 0.020 bits

For each child question, do the same thing to form the complete decision tree
Example: After the outlook-sunny node, we can still ask about temperature, humidity, and windiness
The final decision tree
Note: The splitting stops when further splits don't reduce entropy more than some threshold value
Other Models

• Goal: Find database units to use for synthesizing some element of speech
• Other approaches
– Relax the Markov assumption
• Advantage: Can better model speech
• Disadvantage: Complicates the model
– Neural nets
• Disadvantage: Have not been demonstrated to be superior to the HMM approach