Posted 19-Dec-2015
Synthesis Generations
• First Generation
– Perfect speech could, in principle, be generated
– Required perfect setting of the parameters
– Human intervention put upper limits on the achievable quality
• Second Generation
– Memorize pre-stored waveforms for concatenation
– Cannot store enough data to concatenate everything we want
– Only allows pitch and timing changes
• Third Generation
– Introduce statistical models to learn the data's properties
– Allows the output to be modified in many ways
Dynamic Programming
• Definition: A technique that solves recursive problems by storing subproblem results in arrays (tables)
• Description:
– Start with the base case, which initializes the arrays
– Each step of the algorithm fills in table entries
– Later steps access table entries filled in by earlier steps
• Advantages:
– Avoids repeating calculations performed during recursion
– Uses loops without the overhead of creating activation records
• Applications:
– Many applications beyond signal processing
– Dynamic Time Warping: how close are two sequences?
– Hidden Markov Model algorithms
Example: Minimum Edit Distance
• Problem: How can we measure how different one word is from another (e.g., in a spell checker)?
– How many operations will transform one word into another?
– Examples: caat --> cat, fplc --> fireplace
• Definition:
– Levenshtein distance: the smallest number of insertion, deletion, or substitution operations that transforms one string into another
– Each insertion, deletion, or substitution counts as one operation
• Requires a two-dimensional array
– Rows: source word positions; Columns: target word positions
– Cells: distance[r][c] is the distance up to that point
A useful dynamic programming algorithm
Pseudo Code: minDistance(target, source)

  n = characters in source
  m = characters in target
  Create array, distance, with dimensions n+1, m+1
  FOR r = 0 TO n: distance[r,0] = r
  FOR c = 0 TO m: distance[0,c] = c
  FOR each row r = 1 TO n
    FOR each column c = 1 TO m
      IF source[r] = target[c] THEN cost = 0 ELSE cost = 1
      distance[r,c] = minimum of
        distance[r-1,c] + 1,         // deletion
        distance[r,c-1] + 1,         // insertion
        distance[r-1,c-1] + cost     // substitution
  Result is in distance[n,m]
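The pseudocode above translates directly to Python; a minimal sketch (function and variable names are my own):

```python
def min_distance(target: str, source: str) -> int:
    """Levenshtein distance via the dynamic-programming table above."""
    n, m = len(source), len(target)
    # distance[r][c] = edits needed to turn source[:r] into target[:c]
    distance = [[0] * (m + 1) for _ in range(n + 1)]
    for r in range(n + 1):
        distance[r][0] = r          # delete all r source characters
    for c in range(m + 1):
        distance[0][c] = c          # insert all c target characters
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            cost = 0 if source[r - 1] == target[c - 1] else 1
            distance[r][c] = min(distance[r - 1][c] + 1,         # deletion
                                 distance[r][c - 1] + 1,         # insertion
                                 distance[r - 1][c - 1] + cost)  # substitution
    return distance[n][m]

print(min_distance("GUMBO", "GAMBOL"))         # 2, as in the worked example
print(min_distance("EXECUTION", "INTENTION"))  # 5, as in the second example
```

With unit costs for all three operations, GAMBOL --> GUMBO needs one substitution (A to U) and one deletion (L), giving the distance of 2 computed in the slides that follow.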
Example
• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Initialization
G U M B O
0 1 2 3 4 5
G 1
A 2
M 3
B 4
O 5
L 6
Example
• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 1
G U M B O
0 1 2 3 4 5
G 1 0
A 2 1
M 3 2
B 4 3
O 5 4
L 6 5
Example
• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 2
G U M B O
0 1 2 3 4 5
G 1 0 1
A 2 1 1
M 3 2 2
B 4 3 3
O 5 4 4
L 6 5 5
Example
• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 3
G U M B O
0 1 2 3 4 5
G 1 0 1 2
A 2 1 1 2
M 3 2 2 1
B 4 3 3 2
O 5 4 4 3
L 6 5 5 4
Example
• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 4
G U M B O
0 1 2 3 4 5
G 1 0 1 2 3
A 2 1 1 2 3
M 3 2 2 1 2
B 4 3 3 2 1
O 5 4 4 3 2
L 6 5 5 4 3
Example
• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 5
• Result: Distance equals 2
G U M B O
0 1 2 3 4 5
G 1 0 1 2 3 4
A 2 1 1 2 3 4
M 3 2 2 1 2 3
B 4 3 3 2 1 2
O 5 4 4 3 2 1
L 6 5 5 4 3 2
Another Example
• Source: INTENTION, Target: EXECUTION
E X E C U T I O N
0 1 2 3 4 5 6 7 8 9
I 1 1 2 3 4 5 6 6 7 8
N 2 2 2 3 4 5 6 7 7 7
T 3 3 3 3 4 5 5 6 7 8
E 4 3 4 3 4 5 6 6 7 8
N 5 4 4 4 4 5 6 7 7 7
T 6 5 5 5 5 5 5 6 7 8
I 7 6 6 6 6 6 6 5 6 7
O 8 7 7 7 7 7 7 6 5 6
N 9 8 8 8 8 8 8 7 6 5
Hidden Markov Model
• Motivation
– We observe the output
– We don't know which internal states the model is in
– Goal: Determine the most likely internal (hidden) state sequence
– Hence the name, "Hidden"
• Definition (Discrete HMM): Ф = (O, S, A, B, Ω)
1. O = {o1, o2, …, oM} is the set of possible output symbols
2. S = {1, 2, …, N} is the set of possible internal HMM states
3. A = {aij} is the transition probability matrix from state i to state j
4. B = {bi(k)} is the probability of state i outputting ok
5. Ω = {Ωi} is the set of initial state probabilities, where Ωi is the probability that the system starts in state i
HMM Applications

Given an HMM model and an observation sequence:
1. Evaluation Problem: What is the probability that the model generated the observations?
2. Decoding Problem: What is the most likely state sequence S = (s0, s1, s2, …, sT) in the model that produced the observations?
3. Learning Problem: How can we adjust the model parameters to maximize the likelihood that the observations will be correctly recognized?
Hidden Markov Model (HMM)
Natural Language Processing and HMMs
1. Speech Recognition
• Which words generated the observed acoustic signal?
2. Handwriting Recognition
• Which words generated the observed image?
3. Part-of-speech Tagging
• Which parts of speech correspond to the observed words?
• Where are the word boundaries in the acoustic signal?
• Which morphological word variants match the acoustic signal?
4. Translation
• Which foreign words are in the observed signal?
5. Speech Synthesis
• Which database unit fits the synthesis script?
Demo: http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/hmms/s3_pg1.html
Natural Language HMM Assumptions
• A Stochastic Markov process
– System state changes are not deterministic; they vary according to some probability distribution
• Discrete
– There is a countable set of system states, observable at discrete time steps
• Markov Chain: The next state depends solely on the current state
• Output Assumption
– The output at a given state depends solely on that state

P(w1, …, wn) ≈ ∏i=2,n P(wi | wi-1), not P(w1, …, wn) = ∏i=2,n P(wi | w1, …, wi-1)
Demonstration of a Stochastic Process: http://cs.sou.edu/~harveyd/classes/cs415/docs/unm/movie.html
Speech Recognition Example
• Observations: The digital signal features
• Hidden States: The spoken words that generated the features
• Goal: Choose the Word that maximizes P(Word|Observation)
• Bayes' Law gives us something we can calculate:
– P(Word|Observation) = P(Word) P(Observation|Word) / P(Observation)
– Ignore the denominator: it is the same for every candidate word, so it does not affect which word maximizes the expression
• P(Word) can be looked up from a database
– Use bigrams or trigrams to take the context into account
– Chain rule: P(w1…wn) = P(w1) P(w2|w1) P(w3|w1,w2) … P(wn|w1…wn-1)
– If there is no such probability, we can use a smoothing algorithm to insert a value for combinations never encountered
HMM: Trellis Model
Question: How do we find the most likely sequence?
Probabilities
• Forward probability: αt(i)
The probability of the partial observation o1,…,ot and being in state si at time t
• Backward probability: βt(i)
The probability of the partial observation ot+1,…,oT, given being in state si at time t
• Transition probability: ξt(i,j) = P(qt = si, qt+1 = sj | O, λ)
The probability of being in state si at time t and going from state si to state sj, given the complete observation o1,…,oT
Forward Probabilities

Notes
λ = HMM, qt = HMM state at time t, sj = jth state, ot = observation at time t
aij = probability of transitioning from state si to sj
bj(ot) = probability of observation ot resulting from sj
αt(j) = probability of state j at time t given observations o1,o2,…,ot

αt(j) = ∑i=1,N {αt-1(i) aij} bj(ot) and αt(j) = P(o1…ot, qt = sj | λ)
Forward Algorithm Pseudo Code

  forward[i,j] = 0 for all i,j
  forward[0,0] = 1.0
  FOR each time step t
    FOR each state s
      FOR each state transition s to s'
        forward[s',t+1] += forward[s,t] * a(s,s') * b(s',ot)
  RETURN ∑ forward[s,tfinal+1] for all states s

Notes
1. a(s,s') is the transition probability from state s to state s'
2. b(s',ot) is the probability of state s' emitting observation ot
What is the likelihood of each possible observed pronunciation?
Complexity: O(T·S²), where S is the number of states and T the number of time steps
Viterbi Algorithm
• Viterbi is an efficient dynamic programming HMM algorithm that traces through a series of possible states to find the most likely state sequence behind an observation
• Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum

• Forward Algorithm: αt(j) = [∑i=1,N αt-1(i) aij] bj(ot)
• Viterbi: δt(j) = max1≤i≤N {δt-1(i) aij} bj(ot)
Viterbi Algorithm Pseudo Code

  viterbi[i,j] = 0 for all i,j
  viterbi[0,0] = 1.0
  FOR each time step t
    FOR each state s
      FOR each state transition s to s'
        newScore = viterbi[s,t] * a(s,s') * b(s',ot)
        IF newScore > viterbi[s',t+1]
          viterbi[s',t+1] = newScore
          save backpointer s for state s' at time t+1
  Trace the backpointers from the best final state to recover the state sequence
  RETURN the state sequence

Notes
1. a(s,s') is the transition probability from state s to state s'
2. b(s',ot) is the probability of state s' emitting observation ot
What is the likelihood of a word given an observation sequence?
Markov Example
• Problem: Model the probability of stocks being bull, bear, or stable
• Observed: up, down, unchanged
• Hidden: bull, bear, stable
aij Bull Bear Stable
Bull 0.6 0.2 0.2
Bear 0.5 0.3 0.2
Stable 0.4 0.1 0.5
Ωi
Bull 0.5
Bear 0.2
Stable 0.3
(Above: probability matrix aij and initialization matrix Ωi; a state diagram connects bull, bear, and stable)
Example: What is the probability of observing up five days in a row?
HMM Example
• O = {up, down, unchanged (Unch)}
• S = {bull (1), bear (2), stable (3)}
aij 1 2 3
1 0.6 0.2 0.2
2 0.5 0.3 0.2
3 0.4 0.1 0.5
State Ωi
1 0.5
2 0.2
3 0.3
Bi up down Unch.
1 0.7 0.1 0.2
2 0.1 0.6 0.3
3 0.3 0.3 0.4
Observe 'up, up, down, down, up'
What is the most likely sequence of states for this output?
Forward Probabilities

ai,c 0 1 2
0 0.6 0.2 0.2
1 0.5 0.3 0.2
2 0.4 0.1 0.5
State Ωc
0 0.5
1 0.2
2 0.3
bc up down Unch.
0 0.7 0.1 0.2
1 0.1 0.6 0.3
2 0.3 0.3 0.4
X = [up, up]
α1(c) = ∑i α0(i) · ai,c · bc(up)

t=0: α0(i) = bi(up) · Ωi
  α0(0) = 0.7 · 0.5 = 0.35
  α0(1) = 0.1 · 0.2 = 0.02
  α0(2) = 0.3 · 0.3 = 0.09
t=1:
  α1(0) = 0.179
  α1(1) = 0.009
  α1(2) = 0.036

Note: α1(2) = 0.35·0.2·0.3 + 0.02·0.2·0.3 + 0.09·0.5·0.3 = 0.0357
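The forward computation above can be sketched in Python for the stock model (states 0=bull, 1=bear, 2=stable, with the matrices from the slides; the helper function is my own):

```python
# Stock HMM from the slides: states 0=bull, 1=bear, 2=stable
A = [[0.6, 0.2, 0.2],
     [0.5, 0.3, 0.2],
     [0.4, 0.1, 0.5]]                  # A[i][j]: transition i -> j
B = {"up":   [0.7, 0.1, 0.3],
     "down": [0.1, 0.6, 0.3],
     "unch": [0.2, 0.3, 0.4]}          # B[o][i]: P(o | state i)
pi = [0.5, 0.2, 0.3]                   # initial state probabilities (Omega)

def forward(observations):
    """alpha[t][i] = P(o1..ot, q_t = i): the forward trellis."""
    alpha = [[pi[i] * B[observations[0]][i] for i in range(3)]]
    for o in observations[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(3)) * B[o][j]
                      for j in range(3)])
    return alpha

alpha = forward(["up", "up"])
print([round(a, 3) for a in alpha[0]])  # [0.35, 0.02, 0.09]
print([round(a, 4) for a in alpha[1]])  # [0.1792, 0.0085, 0.0357]
```

The t=1 values match the slide: 0.179, 0.009 (0.0085 before rounding), and 0.036 (0.0357).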
Viterbi Example

ai,c 0 1 2
0 0.6 0.2 0.2
1 0.5 0.3 0.2
2 0.4 0.1 0.5
State Ωc
0 0.5
1 0.2
2 0.3
bc up down Unch.
0 0.7 0.1 0.2
1 0.1 0.6 0.3
2 0.3 0.3 0.4
Observed = [up, up]
δ1(c) = maxi {δ0(i) · ai,c} · bc(up)

t=0: δ0(i) = bi(up) · Ωi
  δ0(0) = 0.7 · 0.5 = 0.35
  δ0(1) = 0.02
  δ0(2) = 0.09
t=1:
  δ1(0) = 0.147
  δ1(1) = 0.007
  δ1(2) = 0.021

Note: δ1(2) = 0.021 = 0.35·0.2·0.3, versus 0.02·0.2·0.3 and 0.09·0.5·0.3
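The Viterbi pass for the same stock model can be sketched as follows (same matrices as the forward example; the backpointer bookkeeping is my own):

```python
# Stock HMM (0=bull, 1=bear, 2=stable), as in the forward example
A = [[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]]
B = {"up": [0.7, 0.1, 0.3], "down": [0.1, 0.6, 0.3], "unch": [0.2, 0.3, 0.4]}
pi = [0.5, 0.2, 0.3]

def viterbi(observations):
    """Return the most likely state path and its probability."""
    v = [pi[i] * B[observations[0]][i] for i in range(3)]   # delta_0
    back = []                                               # backpointers
    for o in observations[1:]:
        # scores[j][i]: score of reaching state j from state i
        scores = [[v[i] * A[i][j] * B[o][j] for i in range(3)]
                  for j in range(3)]
        back.append([max(range(3), key=lambda i: scores[j][i])
                     for j in range(3)])
        v = [max(scores[j]) for j in range(3)]
    # Trace the best path backwards from the best final state
    state = max(range(3), key=lambda j: v[j])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.insert(0, state)
    return path, max(v)

path, p = viterbi(["up", "up"])
print(path, round(p, 3))   # [0, 0] 0.147
```

For [up, up] the best path stays in the bull state twice, with δ1(0) = 0.147, matching the example.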
Backward Probabilities
• Similar algorithm to computing the forward probabilities, but in the other direction
• Answers the question: Given the HMM model and that the state at time t is si, what is the probability of generating the partial observation ot+1 … oT?

βt(i) = P(ot+1…oT | qt = si, λ)
βt(i) = ∑j=1,N {aij bj(ot+1) βt+1(j)}
Backward Probabilities

βt(i) = ∑j=1,N {βt+1(j) aij bj(ot+1)} and βt(i) = P(ot+1…oT | qt = si, λ)

Notes
λ = HMM, qt = HMM state at time t, sj = jth state
aij = probability of transitioning from state si to sj
bj(ot) = probability of observation ot resulting from sj
βt(i) = probability of observations ot+1,ot+2,…,oT given state i at time t
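The backward recursion mirrors the forward one. A sketch using the stock model from earlier slides; a small forward helper is included only to check that both directions yield the same P(O) (the helper names are my own):

```python
# Stock HMM (0=bull, 1=bear, 2=stable)
A = [[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]]
B = {"up": [0.7, 0.1, 0.3], "down": [0.1, 0.6, 0.3], "unch": [0.2, 0.3, 0.4]}
pi = [0.5, 0.2, 0.3]

def backward(observations):
    """beta[t][i] = P(o_{t+1}..o_T | q_t = i)."""
    T = len(observations)
    beta = [[1.0] * 3]                       # beta_T(i) = 1 by definition
    for t in range(T - 1, 0, -1):
        o = observations[t]                  # this is o_{t+1} for row t
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[o][j] * nxt[j] for j in range(3))
                        for i in range(3)])
    return beta

def forward_prob(observations):
    """P(O) via the forward recursion, for a consistency check."""
    a = [pi[i] * B[observations[0]][i] for i in range(3)]
    for o in observations[1:]:
        a = [sum(a[i] * A[i][j] for i in range(3)) * B[o][j] for j in range(3)]
    return sum(a)

obs = ["up", "up", "down"]
beta = backward(obs)
# P(O) from the backward side: sum_i pi_i * b_i(o_1) * beta_1(i)
p_bwd = sum(pi[i] * B[obs[0]][i] * beta[0][i] for i in range(3))
p_fwd = forward_prob(obs)
print(round(p_fwd, 6), round(p_bwd, 6))  # both 0.054398
```

That the two directions agree on P(O) is the standard sanity check before using α and β together for re-estimation.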
Parameters for HMM States
• Cepstrals
– Why? They are largely statistically independent, which makes them suitable for classifying outputs
• Delta coefficients
– Why? To overcome the HMM limitation that transitions depend on only one previous state. Speech articulators change slowly, so they don't follow the traditional HMM model. Without delta coefficients, the HMM tends to jump too quickly between states
• Synthesis requires more parameters than ASR
– Examples: additional delta coefficients, duration and F0 modeling, acoustic energy
Cepstral Review

1. Perform a Fourier transform to go from the time to the frequency domain
2. Warp the frequencies using the Mel scale
3. Gather the amplitude data into bins (usually 13)
4. Take the log power of the amplitudes
5. Perform a discrete cosine transform (no complex numbers) to form the cepstrals
6. Compute first- and second-order delta coefficients

Note: Phase data is lost in the process
Training Data
• Question: How do we establish the transition probabilities between states when that information is not available?
– Older method: tedious hand-marking of wave files based on spectrograms
– Optimal method: an exact search is NP-complete, hence intractable
– Newer method: the HMM Baum-Welch algorithm is a popular heuristic to automate the process
• Strategies
– Speech Recognition: train with data from many speakers
– Speech Synthesis: train with data from specific speakers
Baum-Welch Algorithm Pseudo-code

• Initialize the HMM parameters; iterations = 0
• DO
– HMM' = HMM; iterations++
– FOR each training data sequence
• Calculate forward probabilities
• Calculate backward probabilities
• Update the HMM parameters
• UNTIL |HMM - HMM'| < delta OR iterations ≥ MAX
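One re-estimation pass of Baum-Welch can be sketched on the stock model (a minimal single-sequence version, no scaling or convergence loop; all helper names are my own). EM guarantees the likelihood P(O) never decreases after an update:

```python
# Stock HMM (0=bull, 1=bear, 2=stable)
A = [[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]]
B = {"up": [0.7, 0.1, 0.3], "down": [0.1, 0.6, 0.3], "unch": [0.2, 0.3, 0.4]}
pi = [0.5, 0.2, 0.3]
N = 3

def forwards(obs):
    alpha = [[pi[i] * B[obs[0]][i] for i in range(N)]]
    for o in obs[1:]:
        p = alpha[-1]
        alpha.append([sum(p[i] * A[i][j] for i in range(N)) * B[o][j]
                      for j in range(N)])
    return alpha

def backwards(obs):
    beta = [[1.0] * N]
    for t in range(len(obs) - 1, 0, -1):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[obs[t]][j] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta

def reestimate(obs):
    """One EM pass; returns updated (pi, A, B)."""
    alpha, beta = forwards(obs), backwards(obs)
    p_obs = sum(alpha[-1])
    T = len(obs)
    # gamma[t][i] = P(q_t=i | O); xi[t][i][j] = P(q_t=i, q_{t+1}=j | O)
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[obs[t+1]][j] * beta[t+1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = {o: [sum(g[i] for t, g in enumerate(gamma) if obs[t] == o) /
                 sum(g[i] for g in gamma) for i in range(N)] for o in B}
    return new_pi, new_A, new_B

obs = ["up", "up", "down", "down", "up"]
before = sum(forwards(obs)[-1])
pi, A, B = reestimate(obs)
after = sum(forwards(obs)[-1])
print(before, "->", after)   # likelihood is non-decreasing
```

A real implementation would repeat this until the parameter change falls below delta, as in the pseudocode above, and would work in log space to avoid underflow.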
Re-estimation of State Changes

Sum, over time, the forward/backward ways to be in state i at time t and move to state j at time t+1 with the observed output, divided by the forward/backward ways to be in state i at time t:

a'ij = ∑t=1,T-1 {αt(i) aij bj(ot+1) βt+1(j)} / ∑t=1,T-1 {αt(i) βt(i)}

Notes:
– αt(i) is the forward value; βt+1(j) is the backward value; b(ot) is already part of αt(i)
– Each numerator term aij bj(ot+1) links the joint probability of state i at t and state j at t+1
Re-estimation of Other Probabilities
• The probability of an output, o, being observed from a given state, s:

b's(o) = (Number of times in state s observing o) / (Number of times in state s)

• The probability of initially being in state s, given the observed output sequence:

Ω's = ∑j=1,N {α1(s) asj bj(o2) β2(j)} / ∑i=1,N ∑j=1,N {α1(i) aij bj(o2) β2(j)}
Summary of HMM Approaches
• Discrete
– The continuous-valued observed outputs are compared against a codebook of discrete values for HMM observations
– Performs well for smaller dictionaries
• Continuous Mixture Density
– The observed outputs are fed to the HMM in continuous form
– Gaussian mixture: outputs map to a range of distribution parameters
– Applicable for large vocabularies with a large number of parameters
• Semi-Continuous
– No mixture of Gaussian densities
– A tradeoff between discrete and continuous mixture
– Large vocabularies: better than discrete, worse than continuous
HMM Limitations
1. HMM training is a hill-climbing algorithm
– It finds local optima, not global ones
– It is sensitive to initial parameter settings
2. HMMs have trouble modeling the time duration of speech
3. The first-order Markov and independence assumptions don't exactly model speech
4. Underflow occurs when computing Markov probabilities; for this reason, log probabilities are normally used
5. Continuous output model performance is limited by probabilities that incorrectly map to outputs
6. Outputs are interrelated, not independent
Decision Trees

(Figure: two scatter-plot partitions of data points, contrasting a reasonably good partition with a poor partition)

Partition: ask a series of questions, each with a discrete set of answers
CART Algorithm

1. Create a set of questions that can distinguish between the measured variables
a. Singleton questions: Boolean (yes/no or true/false) answers
b. Complex questions: many possible answers
2. Initialize the tree with one root node
3. Compute the entropy for a node to be split
4. Pick the question with the greatest entropy gain
5. Split the tree based on step 4
6. Return to step 3 as long as nodes remain to split
7. Prune the tree to the optimal size by removing leaf nodes with minimal improvement
Classification and regression trees
Note: We build the tree from top down. We prune the tree from bottom up.
Example: Play or not Play?
Outlook Temperature Humidity Windy Play?
sunny hot high false No
sunny hot high true No
overcast hot high false Yes
rain mild high false Yes
rain cool normal false Yes
rain cool normal true No
overcast cool normal true Yes
sunny mild high false No
sunny cool normal false Yes
rain mild normal false Yes
sunny mild normal true Yes
overcast mild high true Yes
overcast hot normal false Yes
rain mild high true No
Questions
1) What is the outlook?
2) What is the temperature?
3) What is the humidity?
4) Is it windy?

Goal: Order the questions in the most efficient way
Example Tree for "Do we play?"

Outlook?
  sunny --> Humidity?
    high --> No
    normal --> Yes
  overcast --> Yes
  rain --> Windy?
    true --> No
    false --> Yes

Goal: Find the optimal tree
Which question to select?
(Figure source: Witten & Eibe)
Computing Entropy
• Entropy: the number of bits needed to store the possible question answers
• Formula for computing the entropy of a question:

Entropy(p1, p2, …, pn) = -p1 log2 p1 - p2 log2 p2 - … - pn log2 pn

where pi is the probability of the ith answer to a question and log2 x is the logarithm base 2 of x

• Examples:
– A coin toss requires one bit (head = 1, tail = 0)
– A question with 30 equally likely answers requires ∑i=1,30 -(1/30) log2(1/30) = -log2(1/30) = 4.907 bits
Example: question “Outlook”
Entropy(“Outlook”=“Sunny”)=Entropy(0.4, 0.6)=-0.4 log2(0.4)-0.6 log2(0.6)=0.971
Five outcomes, 2 for play for P = 0.4, 3 for not play for P=0.6
Entropy(“Outlook” = “Overcast”) = Entropy(1.0, 0.0)= -1 log2(1.0) - 0 log2(0.0) = 0.0
Four outcomes, all for play. P = 1.0 for play and P = 0.0 for no play.
Entropy(“Outlook”=“Rainy”)= Entropy(0.6,0.4)= -0.6 log2(0.6) - 0.4 log2(0.4)= 0.971
Five Outcomes, 3 for play for P=0.6, 2 for not play for P=0.4
Entropy(Outlook) = Entropy(Sunny, Overcast, Rainy) = 5/14*0.971+4/14*0+5/14*0.971 = 0.693
Compute the entropy for the question: What is the outlook?
Computing the Entropy Gain
• Original entropy: Do we play?
Entropy(“Play“)=Entropy(9/14,5/14)=-9/14log2(9/14) - 5/14 log2(5/14)=0.940
14 outcomes, 9 for Play P = 9/14, 5 for not play P=5/14
• Information gain equals (information before) – (information after)
gain("Outlook") = 0.940 – 0.693 = 0.247
• Information gain for other weather questions– gain("Temperature") = 0.029– gain("Humidity") = 0.152– gain("Windy") = 0.048
• Conclusion: Ask, “What is the Outlook?” first
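The entropy and gain numbers above can be reproduced in Python from the 14-row table (a minimal sketch; the function names are my own):

```python
from math import log2
from collections import Counter

# The 14-row "play?" data set from the slides
data = [  # (outlook, temperature, humidity, windy, play)
    ("sunny", "hot", "high", False, "no"), ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"), ("rain", "mild", "high", False, "yes"),
    ("rain", "cool", "normal", False, "yes"), ("rain", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"), ("rain", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"), ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rain", "mild", "high", True, "no"),
]

def entropy(labels):
    """Entropy(p1..pn) = -sum p_i log2 p_i over the label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values() if c)

def gain(attr_index):
    """Information gain = entropy before the split - weighted entropy after."""
    base = entropy([row[-1] for row in data])
    remainder = 0.0
    for v in set(row[attr_index] for row in data):
        subset = [row[-1] for row in data if row[attr_index] == v]
        remainder += len(subset) / len(data) * entropy(subset)
    return base - remainder

for name, idx in [("Outlook", 0), ("Temperature", 1),
                  ("Humidity", 2), ("Windy", 3)]:
    print(f'gain("{name}") = {gain(idx):.3f}')
# Matches the slide: 0.247, 0.029, 0.152, 0.048
```

Since Outlook has the largest gain, the CART algorithm splits on it first, exactly as the conclusion above states.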
Continuing to Split

Within the "sunny" branch:
gain("Temperature") = 0.571 bits
gain("Humidity") = 0.971 bits
gain("Windy") = 0.020 bits

For each child question, do the same thing to form the complete decision tree
Example: After the outlook-sunny node, we can still ask about temperature, humidity, and windiness
The final decision tree
Note: The splitting stops when further splits don't reduce entropy more than some threshold value
Other Models

• Goal: Find database units to use for synthesizing some element of speech
• Other approaches
– Relax the Markov assumption
• Advantage: Can better model speech
• Disadvantage: Complicates the model
– Neural nets
• Disadvantage: Have not been demonstrated to be superior to the HMM approach