Post on 13-Dec-2015
1048576ACOUSTIC-PHONETIC UNIT SIMILARITIESFOR CONTEXT DEPENDENT ACOUSTIC MODEL
PORTABILITY
Viet Bac Le Laurent Besacier Tanja Schultz
CLIPS-IMAG Laboratory UMR CNRS 5524 BP 53 38041 Grenoble Cedex 9 FRANCE Interactive Systems Laboratories Carnegie Mellon University Pittsburgh PA USA
Presenter Hsu-Ting Wei
2
Outline
bull 1 Introductionbull 2 Acoustic-phonetic unit similarities
ndash 21 Phoneme Similarityndash 22 Polyphone Similarityndash 23 Clustered Polyphonic Model Similarity
bull 3 Experimental and resultsbull 4 Conclusion
3
1 Introduction
bull There are at less 6900 languages in the worldbull We are interested in new techniques and tools for rapid p
ortability of speech recognition systems (exASR) when only limited resources are available
4
1 Introduction (cont)
bull In crosslingual acoustic modeling previous approaches have been limited to context independent modelsndash Monophonic acoustic models in target language were initialized us
ing seed models from source languagendash Then these initial models could be rebuilt or adapted using trainin
g data from the target language
bull The cross-lingual context dependent acoustic modeling portability and adaptation can be investigated
5
1 Introduction (cont)
bull 1996 J Kohlerndash J Kohler used HMM distances to calculate the similarity between two mo
nophonic models
ndash Method
bull To measure the distance or the similarity of two phoneme models we use a relative entropy-based distance metric
bull During training the parameters of the mixture Laplacian density phoneme models are estimated
bull Further for each phoneme a set of phoneme tokens X is extracted from a test or development corpus
bull Given two phoneme models and and given the sets of phoneme tokens or observations Xi and Xj the distance between model and is defined by
i j
i j
ijjiji
ijjjij
jiiiji
ddd
XpXpd
XpXpd
2
1
is average thedictance symmetric
|log|log
|log|log
6
1 Introduction (cont)
bull 2000 B Imperlndash proposed that a triphone similarity estimation method based on phoneme
distances and used an agglomerative (bottom-up) clustering process to define a multilingual set of triphones
ndash Method
jiNji
ccccS
SWSWSW
S
N
k
RCL
TRI
1 )( where
)()()()(2
1)(
)()()(
) - -(
phoneme) central theand
phonemecontext right phonemecontext left thedenote and (
- and - triphones twoof similarity The
ji
1kjkikjkiji
Rj
Riji
Lj
Li
Rjj
Lj
Rii
Li
RL
Rjj
Lj
Rii
Li
Confusion matrix
7
1 Introduction (cont)
bull 2001 T Schultzndash proposed PDTS (Polyphone Decision Tree Specialization) to
overcome the problem which the context mismatch across languages increases dramatically for wider contexts
ndash PDTS data-driven method (darr)bull In PDTS the clustered multilingual polyphone decision tree is
adapted to the target language by restarting the decision tree growing process according to the limited adaptation data in the target language
8
1 Introduction (cont)
bull In this paper we investigate a new method for this crosslingual transfer process
bull We do not use the existing decision tree in source language but build a new decision tree just with a small amount of data from the target language
bull Then based on the acoustic-phonetic unit similarities some crosslingual transfer and adaptation processes are applied
9
2 Acoustic-phonetic unit similarities
bull 21 Phoneme Similarityndash 211 Data-driven methodsndash 212 Proposed knowledge-based method
bull 211 Data-driven methods (darr)ndash The acoustic similarity between two phonemes can be obtained
automatically by calculating the distance between two acoustic models
bull HMM distance (Entropy)bull Kullback-Leibler distance bull Bhattacharyya distancebull Euclidean distance bull Confusion matrix
10
2 Acoustic-phonetic unit similarities (cont)
bull 212 Proposed knowledge-based method (uarr)ndash Step 1 Top-down classificationndash Step 2 Bottom-up estimation
bull Step 1 Top-down classificationndash Figure 1 shows a hierarchical graph where each node is classified into different layers
bull k = number of layers
bull Gi = user defined similarity value for layer i (i = 0k-1)
ndash In our work we investigated several settings of k and G i and set G = 09 045 025 01 00 with k = 5 based on a cross-evaluation in crosslingual acoustic modeling experiments
G0 = 09
G1 = 045
G4 = 00
G3 = 01
G2 = 025
11
2 Acoustic-phonetic unit similarities (cont)
bull Step 2 Bottom-up estimationndash d( [i] [u] ) = G2 (=025 in this experiment)
G0 = 09
G1= 045
G4 = 00
G3 = 01
G2 = 025
12
2 Acoustic-phonetic unit similarities (cont)
bull 22 Polyphone Similarityndash Let S be the phoneme set in source language T be the phoneme set in
target language
Final find the nearest polyphones Ps
N
k
ttctscttctsctsd1
kkkk )()()()(2
1)(
13
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarityndash Since the number of polyphones in a language is very large a limited
training corpus usually does not cover enough occurrences of every polyphones
ndash A decision tree-based clustering (figure 3) or an agglomerative clustering procedure (figure 1) is needed to cluster and model similar polyphones in a clustered polyphonic model
14
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarity (cont)
15
3 Experimental and results
bull Syllable-based ASR system
bull In order to build a polyphonic decision tree and to adapt the crosslingual acoustic models ndash Baseline
bull 225 hours of data spoken by 8 speakers were used (Vietnamese speech)
ndash Crosslingual methods
bull Speech data from six languages were used to build these models Arabic Chinese English German Japanese and Spanish
bull Adapted by Vietnamese speech
16
3 Experimental and results (cont)
Initial model
越南人說越南話 (25 小時 )
六國人說自己本國話
( 也許 100 小時 )
加入越南人說越南話 (25 小時 ) 調適
New models
test
test
17
3 Experimental and results (cont)
bull As the number of clustered sub-models increases SER of the baseline system increases proportionally since the amount of data per model decreases due to the limited training data
bull However the crosslingual system is able to overcome this problem by indirectly using data in other languages
18
3 Experimental and results (cont)
bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix
bull We find that the knowledge based method outperforms the data-driven method
19
4 Conclusion
bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units
bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates
2
Outline
bull 1 Introductionbull 2 Acoustic-phonetic unit similarities
ndash 21 Phoneme Similarityndash 22 Polyphone Similarityndash 23 Clustered Polyphonic Model Similarity
bull 3 Experimental and resultsbull 4 Conclusion
3
1 Introduction
bull There are at less 6900 languages in the worldbull We are interested in new techniques and tools for rapid p
ortability of speech recognition systems (exASR) when only limited resources are available
4
1 Introduction (cont)
bull In crosslingual acoustic modeling previous approaches have been limited to context independent modelsndash Monophonic acoustic models in target language were initialized us
ing seed models from source languagendash Then these initial models could be rebuilt or adapted using trainin
g data from the target language
bull The cross-lingual context dependent acoustic modeling portability and adaptation can be investigated
5
1 Introduction (cont)
bull 1996 J Kohlerndash J Kohler used HMM distances to calculate the similarity between two mo
nophonic models
ndash Method
bull To measure the distance or the similarity of two phoneme models we use a relative entropy-based distance metric
bull During training the parameters of the mixture Laplacian density phoneme models are estimated
bull Further for each phoneme a set of phoneme tokens X is extracted from a test or development corpus
bull Given two phoneme models and and given the sets of phoneme tokens or observations Xi and Xj the distance between model and is defined by
i j
i j
ijjiji
ijjjij
jiiiji
ddd
XpXpd
XpXpd
2
1
is average thedictance symmetric
|log|log
|log|log
6
1 Introduction (cont)
bull 2000 B Imperlndash proposed that a triphone similarity estimation method based on phoneme
distances and used an agglomerative (bottom-up) clustering process to define a multilingual set of triphones
ndash Method
jiNji
ccccS
SWSWSW
S
N
k
RCL
TRI
1 )( where
)()()()(2
1)(
)()()(
) - -(
phoneme) central theand
phonemecontext right phonemecontext left thedenote and (
- and - triphones twoof similarity The
ji
1kjkikjkiji
Rj
Riji
Lj
Li
Rjj
Lj
Rii
Li
RL
Rjj
Lj
Rii
Li
Confusion matrix
7
1 Introduction (cont)
bull 2001 T Schultzndash proposed PDTS (Polyphone Decision Tree Specialization) to
overcome the problem which the context mismatch across languages increases dramatically for wider contexts
ndash PDTS data-driven method (darr)bull In PDTS the clustered multilingual polyphone decision tree is
adapted to the target language by restarting the decision tree growing process according to the limited adaptation data in the target language
8
1 Introduction (cont)
bull In this paper we investigate a new method for this crosslingual transfer process
bull We do not use the existing decision tree in source language but build a new decision tree just with a small amount of data from the target language
bull Then based on the acoustic-phonetic unit similarities some crosslingual transfer and adaptation processes are applied
9
2 Acoustic-phonetic unit similarities
bull 21 Phoneme Similarityndash 211 Data-driven methodsndash 212 Proposed knowledge-based method
bull 211 Data-driven methods (darr)ndash The acoustic similarity between two phonemes can be obtained
automatically by calculating the distance between two acoustic models
bull HMM distance (Entropy)bull Kullback-Leibler distance bull Bhattacharyya distancebull Euclidean distance bull Confusion matrix
10
2 Acoustic-phonetic unit similarities (cont)
bull 212 Proposed knowledge-based method (uarr)ndash Step 1 Top-down classificationndash Step 2 Bottom-up estimation
bull Step 1 Top-down classificationndash Figure 1 shows a hierarchical graph where each node is classified into different layers
bull k = number of layers
bull Gi = user defined similarity value for layer i (i = 0k-1)
ndash In our work we investigated several settings of k and G i and set G = 09 045 025 01 00 with k = 5 based on a cross-evaluation in crosslingual acoustic modeling experiments
G0 = 09
G1 = 045
G4 = 00
G3 = 01
G2 = 025
11
2 Acoustic-phonetic unit similarities (cont)
bull Step 2 Bottom-up estimationndash d( [i] [u] ) = G2 (=025 in this experiment)
G0 = 09
G1= 045
G4 = 00
G3 = 01
G2 = 025
12
2 Acoustic-phonetic unit similarities (cont)
bull 22 Polyphone Similarityndash Let S be the phoneme set in source language T be the phoneme set in
target language
Final find the nearest polyphones Ps
N
k
ttctscttctsctsd1
kkkk )()()()(2
1)(
13
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarityndash Since the number of polyphones in a language is very large a limited
training corpus usually does not cover enough occurrences of every polyphones
ndash A decision tree-based clustering (figure 3) or an agglomerative clustering procedure (figure 1) is needed to cluster and model similar polyphones in a clustered polyphonic model
14
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarity (cont)
15
3 Experimental and results
bull Syllable-based ASR system
bull In order to build a polyphonic decision tree and to adapt the crosslingual acoustic models ndash Baseline
bull 225 hours of data spoken by 8 speakers were used (Vietnamese speech)
ndash Crosslingual methods
bull Speech data from six languages were used to build these models Arabic Chinese English German Japanese and Spanish
bull Adapted by Vietnamese speech
16
3 Experimental and results (cont)
Initial model
越南人說越南話 (25 小時 )
六國人說自己本國話
( 也許 100 小時 )
加入越南人說越南話 (25 小時 ) 調適
New models
test
test
17
3 Experimental and results (cont)
bull As the number of clustered sub-models increases SER of the baseline system increases proportionally since the amount of data per model decreases due to the limited training data
bull However the crosslingual system is able to overcome this problem by indirectly using data in other languages
18
3 Experimental and results (cont)
bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix
bull We find that the knowledge based method outperforms the data-driven method
19
4 Conclusion
bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units
bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates
3
1 Introduction
bull There are at less 6900 languages in the worldbull We are interested in new techniques and tools for rapid p
ortability of speech recognition systems (exASR) when only limited resources are available
4
1 Introduction (cont)
bull In crosslingual acoustic modeling previous approaches have been limited to context independent modelsndash Monophonic acoustic models in target language were initialized us
ing seed models from source languagendash Then these initial models could be rebuilt or adapted using trainin
g data from the target language
bull The cross-lingual context dependent acoustic modeling portability and adaptation can be investigated
5
1 Introduction (cont)
bull 1996 J Kohlerndash J Kohler used HMM distances to calculate the similarity between two mo
nophonic models
ndash Method
bull To measure the distance or the similarity of two phoneme models we use a relative entropy-based distance metric
bull During training the parameters of the mixture Laplacian density phoneme models are estimated
bull Further for each phoneme a set of phoneme tokens X is extracted from a test or development corpus
bull Given two phoneme models and and given the sets of phoneme tokens or observations Xi and Xj the distance between model and is defined by
i j
i j
ijjiji
ijjjij
jiiiji
ddd
XpXpd
XpXpd
2
1
is average thedictance symmetric
|log|log
|log|log
6
1 Introduction (cont)
bull 2000 B Imperlndash proposed that a triphone similarity estimation method based on phoneme
distances and used an agglomerative (bottom-up) clustering process to define a multilingual set of triphones
ndash Method
jiNji
ccccS
SWSWSW
S
N
k
RCL
TRI
1 )( where
)()()()(2
1)(
)()()(
) - -(
phoneme) central theand
phonemecontext right phonemecontext left thedenote and (
- and - triphones twoof similarity The
ji
1kjkikjkiji
Rj
Riji
Lj
Li
Rjj
Lj
Rii
Li
RL
Rjj
Lj
Rii
Li
Confusion matrix
7
1 Introduction (cont)
bull 2001 T Schultzndash proposed PDTS (Polyphone Decision Tree Specialization) to
overcome the problem which the context mismatch across languages increases dramatically for wider contexts
ndash PDTS data-driven method (darr)bull In PDTS the clustered multilingual polyphone decision tree is
adapted to the target language by restarting the decision tree growing process according to the limited adaptation data in the target language
8
1 Introduction (cont)
bull In this paper we investigate a new method for this crosslingual transfer process
bull We do not use the existing decision tree in source language but build a new decision tree just with a small amount of data from the target language
bull Then based on the acoustic-phonetic unit similarities some crosslingual transfer and adaptation processes are applied
9
2 Acoustic-phonetic unit similarities
bull 21 Phoneme Similarityndash 211 Data-driven methodsndash 212 Proposed knowledge-based method
bull 211 Data-driven methods (darr)ndash The acoustic similarity between two phonemes can be obtained
automatically by calculating the distance between two acoustic models
bull HMM distance (Entropy)bull Kullback-Leibler distance bull Bhattacharyya distancebull Euclidean distance bull Confusion matrix
10
2 Acoustic-phonetic unit similarities (cont)
bull 212 Proposed knowledge-based method (uarr)ndash Step 1 Top-down classificationndash Step 2 Bottom-up estimation
bull Step 1 Top-down classificationndash Figure 1 shows a hierarchical graph where each node is classified into different layers
bull k = number of layers
bull Gi = user defined similarity value for layer i (i = 0k-1)
ndash In our work we investigated several settings of k and G i and set G = 09 045 025 01 00 with k = 5 based on a cross-evaluation in crosslingual acoustic modeling experiments
G0 = 09
G1 = 045
G4 = 00
G3 = 01
G2 = 025
11
2 Acoustic-phonetic unit similarities (cont)
bull Step 2 Bottom-up estimationndash d( [i] [u] ) = G2 (=025 in this experiment)
G0 = 09
G1= 045
G4 = 00
G3 = 01
G2 = 025
12
2 Acoustic-phonetic unit similarities (cont)
bull 22 Polyphone Similarityndash Let S be the phoneme set in source language T be the phoneme set in
target language
Final find the nearest polyphones Ps
N
k
ttctscttctsctsd1
kkkk )()()()(2
1)(
13
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarityndash Since the number of polyphones in a language is very large a limited
training corpus usually does not cover enough occurrences of every polyphones
ndash A decision tree-based clustering (figure 3) or an agglomerative clustering procedure (figure 1) is needed to cluster and model similar polyphones in a clustered polyphonic model
14
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarity (cont)
15
3 Experimental and results
bull Syllable-based ASR system
bull In order to build a polyphonic decision tree and to adapt the crosslingual acoustic models ndash Baseline
bull 225 hours of data spoken by 8 speakers were used (Vietnamese speech)
ndash Crosslingual methods
bull Speech data from six languages were used to build these models Arabic Chinese English German Japanese and Spanish
bull Adapted by Vietnamese speech
16
3 Experimental and results (cont)
Initial model
越南人說越南話 (25 小時 )
六國人說自己本國話
( 也許 100 小時 )
加入越南人說越南話 (25 小時 ) 調適
New models
test
test
17
3 Experimental and results (cont)
bull As the number of clustered sub-models increases SER of the baseline system increases proportionally since the amount of data per model decreases due to the limited training data
bull However the crosslingual system is able to overcome this problem by indirectly using data in other languages
18
3 Experimental and results (cont)
bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix
bull We find that the knowledge based method outperforms the data-driven method
19
4 Conclusion
bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units
bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates
4
1 Introduction (cont)
bull In crosslingual acoustic modeling previous approaches have been limited to context independent modelsndash Monophonic acoustic models in target language were initialized us
ing seed models from source languagendash Then these initial models could be rebuilt or adapted using trainin
g data from the target language
bull The cross-lingual context dependent acoustic modeling portability and adaptation can be investigated
5
1 Introduction (cont)
bull 1996 J Kohlerndash J Kohler used HMM distances to calculate the similarity between two mo
nophonic models
ndash Method
bull To measure the distance or the similarity of two phoneme models we use a relative entropy-based distance metric
bull During training the parameters of the mixture Laplacian density phoneme models are estimated
bull Further for each phoneme a set of phoneme tokens X is extracted from a test or development corpus
bull Given two phoneme models and and given the sets of phoneme tokens or observations Xi and Xj the distance between model and is defined by
i j
i j
ijjiji
ijjjij
jiiiji
ddd
XpXpd
XpXpd
2
1
is average thedictance symmetric
|log|log
|log|log
6
1 Introduction (cont)
bull 2000 B Imperlndash proposed that a triphone similarity estimation method based on phoneme
distances and used an agglomerative (bottom-up) clustering process to define a multilingual set of triphones
ndash Method
jiNji
ccccS
SWSWSW
S
N
k
RCL
TRI
1 )( where
)()()()(2
1)(
)()()(
) - -(
phoneme) central theand
phonemecontext right phonemecontext left thedenote and (
- and - triphones twoof similarity The
ji
1kjkikjkiji
Rj
Riji
Lj
Li
Rjj
Lj
Rii
Li
RL
Rjj
Lj
Rii
Li
Confusion matrix
7
1 Introduction (cont)
bull 2001 T Schultzndash proposed PDTS (Polyphone Decision Tree Specialization) to
overcome the problem which the context mismatch across languages increases dramatically for wider contexts
ndash PDTS data-driven method (darr)bull In PDTS the clustered multilingual polyphone decision tree is
adapted to the target language by restarting the decision tree growing process according to the limited adaptation data in the target language
8
1 Introduction (cont)
bull In this paper we investigate a new method for this crosslingual transfer process
bull We do not use the existing decision tree in source language but build a new decision tree just with a small amount of data from the target language
bull Then based on the acoustic-phonetic unit similarities some crosslingual transfer and adaptation processes are applied
9
2 Acoustic-phonetic unit similarities
bull 21 Phoneme Similarityndash 211 Data-driven methodsndash 212 Proposed knowledge-based method
bull 211 Data-driven methods (darr)ndash The acoustic similarity between two phonemes can be obtained
automatically by calculating the distance between two acoustic models
bull HMM distance (Entropy)bull Kullback-Leibler distance bull Bhattacharyya distancebull Euclidean distance bull Confusion matrix
10
2 Acoustic-phonetic unit similarities (cont)
bull 212 Proposed knowledge-based method (uarr)ndash Step 1 Top-down classificationndash Step 2 Bottom-up estimation
bull Step 1 Top-down classificationndash Figure 1 shows a hierarchical graph where each node is classified into different layers
bull k = number of layers
bull Gi = user defined similarity value for layer i (i = 0k-1)
ndash In our work we investigated several settings of k and G i and set G = 09 045 025 01 00 with k = 5 based on a cross-evaluation in crosslingual acoustic modeling experiments
G0 = 09
G1 = 045
G4 = 00
G3 = 01
G2 = 025
11
2 Acoustic-phonetic unit similarities (cont)
bull Step 2 Bottom-up estimationndash d( [i] [u] ) = G2 (=025 in this experiment)
G0 = 09
G1= 045
G4 = 00
G3 = 01
G2 = 025
12
2 Acoustic-phonetic unit similarities (cont)
bull 22 Polyphone Similarityndash Let S be the phoneme set in source language T be the phoneme set in
target language
Final find the nearest polyphones Ps
N
k
ttctscttctsctsd1
kkkk )()()()(2
1)(
13
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarityndash Since the number of polyphones in a language is very large a limited
training corpus usually does not cover enough occurrences of every polyphones
ndash A decision tree-based clustering (figure 3) or an agglomerative clustering procedure (figure 1) is needed to cluster and model similar polyphones in a clustered polyphonic model
14
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarity (cont)
15
3 Experimental and results
bull Syllable-based ASR system
bull In order to build a polyphonic decision tree and to adapt the crosslingual acoustic models ndash Baseline
bull 225 hours of data spoken by 8 speakers were used (Vietnamese speech)
ndash Crosslingual methods
bull Speech data from six languages were used to build these models Arabic Chinese English German Japanese and Spanish
bull Adapted by Vietnamese speech
16
3 Experimental and results (cont)
Initial model
越南人說越南話 (25 小時 )
六國人說自己本國話
( 也許 100 小時 )
加入越南人說越南話 (25 小時 ) 調適
New models
test
test
17
3 Experimental and results (cont)
bull As the number of clustered sub-models increases SER of the baseline system increases proportionally since the amount of data per model decreases due to the limited training data
bull However the crosslingual system is able to overcome this problem by indirectly using data in other languages
18
3 Experimental and results (cont)
bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix
bull We find that the knowledge based method outperforms the data-driven method
19
4 Conclusion
bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units
bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates
5
1 Introduction (cont)
bull 1996 J Kohlerndash J Kohler used HMM distances to calculate the similarity between two mo
nophonic models
ndash Method
bull To measure the distance or the similarity of two phoneme models we use a relative entropy-based distance metric
bull During training the parameters of the mixture Laplacian density phoneme models are estimated
bull Further for each phoneme a set of phoneme tokens X is extracted from a test or development corpus
bull Given two phoneme models and and given the sets of phoneme tokens or observations Xi and Xj the distance between model and is defined by
i j
i j
ijjiji
ijjjij
jiiiji
ddd
XpXpd
XpXpd
2
1
is average thedictance symmetric
|log|log
|log|log
6
1 Introduction (cont)
bull 2000 B Imperlndash proposed that a triphone similarity estimation method based on phoneme
distances and used an agglomerative (bottom-up) clustering process to define a multilingual set of triphones
ndash Method
jiNji
ccccS
SWSWSW
S
N
k
RCL
TRI
1 )( where
)()()()(2
1)(
)()()(
) - -(
phoneme) central theand
phonemecontext right phonemecontext left thedenote and (
- and - triphones twoof similarity The
ji
1kjkikjkiji
Rj
Riji
Lj
Li
Rjj
Lj
Rii
Li
RL
Rjj
Lj
Rii
Li
Confusion matrix
7
1 Introduction (cont)
bull 2001 T Schultzndash proposed PDTS (Polyphone Decision Tree Specialization) to
overcome the problem which the context mismatch across languages increases dramatically for wider contexts
ndash PDTS data-driven method (darr)bull In PDTS the clustered multilingual polyphone decision tree is
adapted to the target language by restarting the decision tree growing process according to the limited adaptation data in the target language
8
1 Introduction (cont)
bull In this paper we investigate a new method for this crosslingual transfer process
bull We do not use the existing decision tree in source language but build a new decision tree just with a small amount of data from the target language
bull Then based on the acoustic-phonetic unit similarities some crosslingual transfer and adaptation processes are applied
9
2 Acoustic-phonetic unit similarities
bull 21 Phoneme Similarityndash 211 Data-driven methodsndash 212 Proposed knowledge-based method
bull 211 Data-driven methods (darr)ndash The acoustic similarity between two phonemes can be obtained
automatically by calculating the distance between two acoustic models
bull HMM distance (Entropy)bull Kullback-Leibler distance bull Bhattacharyya distancebull Euclidean distance bull Confusion matrix
10
2 Acoustic-phonetic unit similarities (cont)
bull 212 Proposed knowledge-based method (uarr)ndash Step 1 Top-down classificationndash Step 2 Bottom-up estimation
bull Step 1 Top-down classificationndash Figure 1 shows a hierarchical graph where each node is classified into different layers
bull k = number of layers
bull Gi = user defined similarity value for layer i (i = 0k-1)
ndash In our work we investigated several settings of k and G i and set G = 09 045 025 01 00 with k = 5 based on a cross-evaluation in crosslingual acoustic modeling experiments
G0 = 09
G1 = 045
G4 = 00
G3 = 01
G2 = 025
11
2 Acoustic-phonetic unit similarities (cont)
bull Step 2 Bottom-up estimationndash d( [i] [u] ) = G2 (=025 in this experiment)
G0 = 09
G1= 045
G4 = 00
G3 = 01
G2 = 025
12
2 Acoustic-phonetic unit similarities (cont)
bull 22 Polyphone Similarityndash Let S be the phoneme set in source language T be the phoneme set in
target language
Final find the nearest polyphones Ps
N
k
ttctscttctsctsd1
kkkk )()()()(2
1)(
13
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarityndash Since the number of polyphones in a language is very large a limited
training corpus usually does not cover enough occurrences of every polyphones
ndash A decision tree-based clustering (figure 3) or an agglomerative clustering procedure (figure 1) is needed to cluster and model similar polyphones in a clustered polyphonic model
14
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarity (cont)
15
3 Experimental and results
bull Syllable-based ASR system
bull In order to build a polyphonic decision tree and to adapt the crosslingual acoustic models ndash Baseline
bull 225 hours of data spoken by 8 speakers were used (Vietnamese speech)
ndash Crosslingual methods
bull Speech data from six languages were used to build these models Arabic Chinese English German Japanese and Spanish
bull Adapted by Vietnamese speech
16
3 Experimental and results (cont)
Initial model
越南人說越南話 (25 小時 )
六國人說自己本國話
( 也許 100 小時 )
加入越南人說越南話 (25 小時 ) 調適
New models
test
test
17
3 Experimental and results (cont)
bull As the number of clustered sub-models increases SER of the baseline system increases proportionally since the amount of data per model decreases due to the limited training data
bull However the crosslingual system is able to overcome this problem by indirectly using data in other languages
18
3 Experimental and results (cont)
bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix
bull We find that the knowledge based method outperforms the data-driven method
19
4 Conclusion
bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units
bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates
6
1 Introduction (cont)
bull 2000 B Imperlndash proposed that a triphone similarity estimation method based on phoneme
distances and used an agglomerative (bottom-up) clustering process to define a multilingual set of triphones
ndash Method
jiNji
ccccS
SWSWSW
S
N
k
RCL
TRI
1 )( where
)()()()(2
1)(
)()()(
) - -(
phoneme) central theand
phonemecontext right phonemecontext left thedenote and (
- and - triphones twoof similarity The
ji
1kjkikjkiji
Rj
Riji
Lj
Li
Rjj
Lj
Rii
Li
RL
Rjj
Lj
Rii
Li
Confusion matrix
7
1 Introduction (cont)
bull 2001 T Schultzndash proposed PDTS (Polyphone Decision Tree Specialization) to
overcome the problem which the context mismatch across languages increases dramatically for wider contexts
ndash PDTS data-driven method (darr)bull In PDTS the clustered multilingual polyphone decision tree is
adapted to the target language by restarting the decision tree growing process according to the limited adaptation data in the target language
8
1 Introduction (cont)
bull In this paper we investigate a new method for this crosslingual transfer process
bull We do not use the existing decision tree in source language but build a new decision tree just with a small amount of data from the target language
bull Then based on the acoustic-phonetic unit similarities some crosslingual transfer and adaptation processes are applied
9
2 Acoustic-phonetic unit similarities
bull 21 Phoneme Similarityndash 211 Data-driven methodsndash 212 Proposed knowledge-based method
bull 211 Data-driven methods (darr)ndash The acoustic similarity between two phonemes can be obtained
automatically by calculating the distance between two acoustic models
bull HMM distance (Entropy)bull Kullback-Leibler distance bull Bhattacharyya distancebull Euclidean distance bull Confusion matrix
10
2 Acoustic-phonetic unit similarities (cont)
bull 212 Proposed knowledge-based method (uarr)ndash Step 1 Top-down classificationndash Step 2 Bottom-up estimation
bull Step 1 Top-down classificationndash Figure 1 shows a hierarchical graph where each node is classified into different layers
bull k = number of layers
bull Gi = user defined similarity value for layer i (i = 0k-1)
ndash In our work we investigated several settings of k and G i and set G = 09 045 025 01 00 with k = 5 based on a cross-evaluation in crosslingual acoustic modeling experiments
G0 = 09
G1 = 045
G4 = 00
G3 = 01
G2 = 025
11
2 Acoustic-phonetic unit similarities (cont)
bull Step 2 Bottom-up estimationndash d( [i] [u] ) = G2 (=025 in this experiment)
G0 = 09
G1= 045
G4 = 00
G3 = 01
G2 = 025
12
2 Acoustic-phonetic unit similarities (cont)
bull 22 Polyphone Similarityndash Let S be the phoneme set in source language T be the phoneme set in
target language
Final find the nearest polyphones Ps
N
k
ttctscttctsctsd1
kkkk )()()()(2
1)(
13
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarityndash Since the number of polyphones in a language is very large a limited
training corpus usually does not cover enough occurrences of every polyphones
ndash A decision tree-based clustering (figure 3) or an agglomerative clustering procedure (figure 1) is needed to cluster and model similar polyphones in a clustered polyphonic model
14
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarity (cont)
15
3 Experimental and results
bull Syllable-based ASR system
bull In order to build a polyphonic decision tree and to adapt the crosslingual acoustic models ndash Baseline
bull 225 hours of data spoken by 8 speakers were used (Vietnamese speech)
ndash Crosslingual methods
bull Speech data from six languages were used to build these models Arabic Chinese English German Japanese and Spanish
bull Adapted by Vietnamese speech
16
3 Experimental and results (cont)
Initial model
越南人說越南話 (25 小時 )
六國人說自己本國話
( 也許 100 小時 )
加入越南人說越南話 (25 小時 ) 調適
New models
test
test
17
3 Experimental and results (cont)
bull As the number of clustered sub-models increases SER of the baseline system increases proportionally since the amount of data per model decreases due to the limited training data
bull However the crosslingual system is able to overcome this problem by indirectly using data in other languages
18
3 Experimental and results (cont)
bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix
bull We find that the knowledge based method outperforms the data-driven method
19
4 Conclusion
bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units
bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates
7
1 Introduction (cont)
bull 2001 T Schultzndash proposed PDTS (Polyphone Decision Tree Specialization) to
overcome the problem which the context mismatch across languages increases dramatically for wider contexts
ndash PDTS data-driven method (darr)bull In PDTS the clustered multilingual polyphone decision tree is
adapted to the target language by restarting the decision tree growing process according to the limited adaptation data in the target language
8
1 Introduction (cont)
bull In this paper we investigate a new method for this crosslingual transfer process
bull We do not use the existing decision tree in source language but build a new decision tree just with a small amount of data from the target language
bull Then based on the acoustic-phonetic unit similarities some crosslingual transfer and adaptation processes are applied
9
2 Acoustic-phonetic unit similarities
bull 21 Phoneme Similarityndash 211 Data-driven methodsndash 212 Proposed knowledge-based method
bull 211 Data-driven methods (darr)ndash The acoustic similarity between two phonemes can be obtained
automatically by calculating the distance between two acoustic models
bull HMM distance (Entropy)bull Kullback-Leibler distance bull Bhattacharyya distancebull Euclidean distance bull Confusion matrix
10
2 Acoustic-phonetic unit similarities (cont)
bull 212 Proposed knowledge-based method (uarr)ndash Step 1 Top-down classificationndash Step 2 Bottom-up estimation
bull Step 1 Top-down classificationndash Figure 1 shows a hierarchical graph where each node is classified into different layers
bull k = number of layers
bull Gi = user defined similarity value for layer i (i = 0k-1)
ndash In our work we investigated several settings of k and G i and set G = 09 045 025 01 00 with k = 5 based on a cross-evaluation in crosslingual acoustic modeling experiments
G0 = 09
G1 = 045
G4 = 00
G3 = 01
G2 = 025
11
2 Acoustic-phonetic unit similarities (cont)
bull Step 2 Bottom-up estimationndash d( [i] [u] ) = G2 (=025 in this experiment)
G0 = 09
G1= 045
G4 = 00
G3 = 01
G2 = 025
12
2 Acoustic-phonetic unit similarities (cont)
bull 22 Polyphone Similarityndash Let S be the phoneme set in source language T be the phoneme set in
target language
Final find the nearest polyphones Ps
N
k
ttctscttctsctsd1
kkkk )()()()(2
1)(
13
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarityndash Since the number of polyphones in a language is very large a limited
training corpus usually does not cover enough occurrences of every polyphones
ndash A decision tree-based clustering (figure 3) or an agglomerative clustering procedure (figure 1) is needed to cluster and model similar polyphones in a clustered polyphonic model
14
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarity (cont)
15
3 Experimental and results
bull Syllable-based ASR system
bull In order to build a polyphonic decision tree and to adapt the crosslingual acoustic models ndash Baseline
bull 225 hours of data spoken by 8 speakers were used (Vietnamese speech)
ndash Crosslingual methods
bull Speech data from six languages were used to build these models Arabic Chinese English German Japanese and Spanish
bull Adapted by Vietnamese speech
16
3 Experimental and results (cont)
Initial model
越南人說越南話 (25 小時 )
六國人說自己本國話
( 也許 100 小時 )
加入越南人說越南話 (25 小時 ) 調適
New models
test
test
17
3 Experimental and results (cont)
bull As the number of clustered sub-models increases SER of the baseline system increases proportionally since the amount of data per model decreases due to the limited training data
bull However the crosslingual system is able to overcome this problem by indirectly using data in other languages
18
3 Experimental and results (cont)
bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix
bull We find that the knowledge based method outperforms the data-driven method
19
4 Conclusion
bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units
bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates
8
1 Introduction (cont)
bull In this paper we investigate a new method for this crosslingual transfer process
bull We do not use the existing decision tree in source language but build a new decision tree just with a small amount of data from the target language
bull Then based on the acoustic-phonetic unit similarities some crosslingual transfer and adaptation processes are applied
9
2 Acoustic-phonetic unit similarities
bull 21 Phoneme Similarityndash 211 Data-driven methodsndash 212 Proposed knowledge-based method
bull 211 Data-driven methods (darr)ndash The acoustic similarity between two phonemes can be obtained
automatically by calculating the distance between two acoustic models
bull HMM distance (Entropy)bull Kullback-Leibler distance bull Bhattacharyya distancebull Euclidean distance bull Confusion matrix
10
2 Acoustic-phonetic unit similarities (cont)
bull 212 Proposed knowledge-based method (uarr)ndash Step 1 Top-down classificationndash Step 2 Bottom-up estimation
bull Step 1 Top-down classificationndash Figure 1 shows a hierarchical graph where each node is classified into different layers
bull k = number of layers
bull Gi = user defined similarity value for layer i (i = 0k-1)
ndash In our work we investigated several settings of k and G i and set G = 09 045 025 01 00 with k = 5 based on a cross-evaluation in crosslingual acoustic modeling experiments
G0 = 09
G1 = 045
G4 = 00
G3 = 01
G2 = 025
11
2 Acoustic-phonetic unit similarities (cont)
bull Step 2 Bottom-up estimationndash d( [i] [u] ) = G2 (=025 in this experiment)
G0 = 09
G1= 045
G4 = 00
G3 = 01
G2 = 025
12
2 Acoustic-phonetic unit similarities (cont)
bull 22 Polyphone Similarityndash Let S be the phoneme set in source language T be the phoneme set in
target language
Final find the nearest polyphones Ps
N
k
ttctscttctsctsd1
kkkk )()()()(2
1)(
13
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarityndash Since the number of polyphones in a language is very large a limited
training corpus usually does not cover enough occurrences of every polyphones
ndash A decision tree-based clustering (figure 3) or an agglomerative clustering procedure (figure 1) is needed to cluster and model similar polyphones in a clustered polyphonic model
14
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarity (cont)
15
3 Experimental and results
bull Syllable-based ASR system
bull In order to build a polyphonic decision tree and to adapt the crosslingual acoustic models ndash Baseline
bull 225 hours of data spoken by 8 speakers were used (Vietnamese speech)
ndash Crosslingual methods
bull Speech data from six languages were used to build these models Arabic Chinese English German Japanese and Spanish
bull Adapted by Vietnamese speech
16
3 Experimental and results (cont)
Initial model
越南人說越南話 (25 小時 )
六國人說自己本國話
( 也許 100 小時 )
加入越南人說越南話 (25 小時 ) 調適
New models
test
test
17
3 Experimental and results (cont)
bull As the number of clustered sub-models increases SER of the baseline system increases proportionally since the amount of data per model decreases due to the limited training data
bull However the crosslingual system is able to overcome this problem by indirectly using data in other languages
18
3 Experimental and results (cont)
bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix
bull We find that the knowledge based method outperforms the data-driven method
19
4 Conclusion
bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units
bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates
9
2 Acoustic-phonetic unit similarities
bull 21 Phoneme Similarityndash 211 Data-driven methodsndash 212 Proposed knowledge-based method
bull 211 Data-driven methods (darr)ndash The acoustic similarity between two phonemes can be obtained
automatically by calculating the distance between two acoustic models
bull HMM distance (Entropy)bull Kullback-Leibler distance bull Bhattacharyya distancebull Euclidean distance bull Confusion matrix
10
2 Acoustic-phonetic unit similarities (cont)
bull 212 Proposed knowledge-based method (uarr)ndash Step 1 Top-down classificationndash Step 2 Bottom-up estimation
bull Step 1 Top-down classificationndash Figure 1 shows a hierarchical graph where each node is classified into different layers
bull k = number of layers
bull Gi = user defined similarity value for layer i (i = 0k-1)
ndash In our work we investigated several settings of k and G i and set G = 09 045 025 01 00 with k = 5 based on a cross-evaluation in crosslingual acoustic modeling experiments
G0 = 09
G1 = 045
G4 = 00
G3 = 01
G2 = 025
11
2 Acoustic-phonetic unit similarities (cont)
bull Step 2 Bottom-up estimationndash d( [i] [u] ) = G2 (=025 in this experiment)
G0 = 09
G1= 045
G4 = 00
G3 = 01
G2 = 025
12
2 Acoustic-phonetic unit similarities (cont)
bull 22 Polyphone Similarityndash Let S be the phoneme set in source language T be the phoneme set in
target language
Final find the nearest polyphones Ps
N
k
ttctscttctsctsd1
kkkk )()()()(2
1)(
13
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarityndash Since the number of polyphones in a language is very large a limited
training corpus usually does not cover enough occurrences of every polyphones
ndash A decision tree-based clustering (figure 3) or an agglomerative clustering procedure (figure 1) is needed to cluster and model similar polyphones in a clustered polyphonic model
14
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarity (cont)
15
3 Experimental and results
bull Syllable-based ASR system
bull In order to build a polyphonic decision tree and to adapt the crosslingual acoustic models ndash Baseline
bull 225 hours of data spoken by 8 speakers were used (Vietnamese speech)
ndash Crosslingual methods
bull Speech data from six languages were used to build these models Arabic Chinese English German Japanese and Spanish
bull Adapted by Vietnamese speech
16
3 Experimental and results (cont)
Initial model
越南人說越南話 (25 小時 )
六國人說自己本國話
( 也許 100 小時 )
加入越南人說越南話 (25 小時 ) 調適
New models
test
test
17
3 Experimental and results (cont)
bull As the number of clustered sub-models increases SER of the baseline system increases proportionally since the amount of data per model decreases due to the limited training data
bull However the crosslingual system is able to overcome this problem by indirectly using data in other languages
18
3 Experimental and results (cont)
bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix
bull We find that the knowledge based method outperforms the data-driven method
19
4 Conclusion
bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units
bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates
10
2 Acoustic-phonetic unit similarities (cont)
bull 212 Proposed knowledge-based method (uarr)ndash Step 1 Top-down classificationndash Step 2 Bottom-up estimation
bull Step 1 Top-down classificationndash Figure 1 shows a hierarchical graph where each node is classified into different layers
bull k = number of layers
bull Gi = user defined similarity value for layer i (i = 0k-1)
ndash In our work we investigated several settings of k and G i and set G = 09 045 025 01 00 with k = 5 based on a cross-evaluation in crosslingual acoustic modeling experiments
G0 = 09
G1 = 045
G4 = 00
G3 = 01
G2 = 025
11
2 Acoustic-phonetic unit similarities (cont)
bull Step 2 Bottom-up estimationndash d( [i] [u] ) = G2 (=025 in this experiment)
G0 = 09
G1= 045
G4 = 00
G3 = 01
G2 = 025
12
2 Acoustic-phonetic unit similarities (cont)
bull 22 Polyphone Similarityndash Let S be the phoneme set in source language T be the phoneme set in
target language
Final find the nearest polyphones Ps
N
k
ttctscttctsctsd1
kkkk )()()()(2
1)(
13
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarityndash Since the number of polyphones in a language is very large a limited
training corpus usually does not cover enough occurrences of every polyphones
ndash A decision tree-based clustering (figure 3) or an agglomerative clustering procedure (figure 1) is needed to cluster and model similar polyphones in a clustered polyphonic model
14
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarity (cont)
15
3 Experimental and results
bull Syllable-based ASR system
bull In order to build a polyphonic decision tree and to adapt the crosslingual acoustic models ndash Baseline
bull 225 hours of data spoken by 8 speakers were used (Vietnamese speech)
ndash Crosslingual methods
bull Speech data from six languages were used to build these models Arabic Chinese English German Japanese and Spanish
bull Adapted by Vietnamese speech
16
3 Experimental and results (cont)
Initial model
越南人說越南話 (25 小時 )
六國人說自己本國話
( 也許 100 小時 )
加入越南人說越南話 (25 小時 ) 調適
New models
test
test
17
3 Experimental and results (cont)
bull As the number of clustered sub-models increases SER of the baseline system increases proportionally since the amount of data per model decreases due to the limited training data
bull However the crosslingual system is able to overcome this problem by indirectly using data in other languages
18
3 Experimental and results (cont)
bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix
bull We find that the knowledge based method outperforms the data-driven method
19
4 Conclusion
bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units
bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates
11
2 Acoustic-phonetic unit similarities (cont)
bull Step 2 Bottom-up estimationndash d( [i] [u] ) = G2 (=025 in this experiment)
G0 = 09
G1= 045
G4 = 00
G3 = 01
G2 = 025
12
2 Acoustic-phonetic unit similarities (cont)
bull 22 Polyphone Similarityndash Let S be the phoneme set in source language T be the phoneme set in
target language
Final find the nearest polyphones Ps
N
k
ttctscttctsctsd1
kkkk )()()()(2
1)(
13
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarityndash Since the number of polyphones in a language is very large a limited
training corpus usually does not cover enough occurrences of every polyphones
ndash A decision tree-based clustering (figure 3) or an agglomerative clustering procedure (figure 1) is needed to cluster and model similar polyphones in a clustered polyphonic model
14
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarity (cont)
15
3 Experimental and results
bull Syllable-based ASR system
bull In order to build a polyphonic decision tree and to adapt the crosslingual acoustic models ndash Baseline
bull 225 hours of data spoken by 8 speakers were used (Vietnamese speech)
ndash Crosslingual methods
bull Speech data from six languages were used to build these models Arabic Chinese English German Japanese and Spanish
bull Adapted by Vietnamese speech
16
3 Experimental and results (cont)
Initial model
越南人說越南話 (25 小時 )
六國人說自己本國話
( 也許 100 小時 )
加入越南人說越南話 (25 小時 ) 調適
New models
test
test
17
3 Experimental and results (cont)
bull As the number of clustered sub-models increases SER of the baseline system increases proportionally since the amount of data per model decreases due to the limited training data
bull However the crosslingual system is able to overcome this problem by indirectly using data in other languages
18
3 Experimental and results (cont)
bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix
bull We find that the knowledge based method outperforms the data-driven method
19
4 Conclusion
bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units
bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates
12
2 Acoustic-phonetic unit similarities (cont)
bull 22 Polyphone Similarityndash Let S be the phoneme set in source language T be the phoneme set in
target language
Final find the nearest polyphones Ps
N
k
ttctscttctsctsd1
kkkk )()()()(2
1)(
13
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarityndash Since the number of polyphones in a language is very large a limited
training corpus usually does not cover enough occurrences of every polyphones
ndash A decision tree-based clustering (figure 3) or an agglomerative clustering procedure (figure 1) is needed to cluster and model similar polyphones in a clustered polyphonic model
14
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarity (cont)
15
3 Experimental and results
bull Syllable-based ASR system
bull In order to build a polyphonic decision tree and to adapt the crosslingual acoustic models ndash Baseline
bull 225 hours of data spoken by 8 speakers were used (Vietnamese speech)
ndash Crosslingual methods
bull Speech data from six languages were used to build these models Arabic Chinese English German Japanese and Spanish
bull Adapted by Vietnamese speech
16
3 Experimental and results (cont)
Initial model
越南人說越南話 (25 小時 )
六國人說自己本國話
( 也許 100 小時 )
加入越南人說越南話 (25 小時 ) 調適
New models
test
test
17
3 Experimental and results (cont)
bull As the number of clustered sub-models increases SER of the baseline system increases proportionally since the amount of data per model decreases due to the limited training data
bull However the crosslingual system is able to overcome this problem by indirectly using data in other languages
18
3 Experimental and results (cont)
bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix
bull We find that the knowledge based method outperforms the data-driven method
19
4 Conclusion
bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units
bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates
13
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarityndash Since the number of polyphones in a language is very large a limited
training corpus usually does not cover enough occurrences of every polyphones
ndash A decision tree-based clustering (figure 3) or an agglomerative clustering procedure (figure 1) is needed to cluster and model similar polyphones in a clustered polyphonic model
14
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarity (cont)
15
3 Experimental and results
bull Syllable-based ASR system
bull In order to build a polyphonic decision tree and to adapt the crosslingual acoustic models ndash Baseline
bull 225 hours of data spoken by 8 speakers were used (Vietnamese speech)
ndash Crosslingual methods
bull Speech data from six languages were used to build these models Arabic Chinese English German Japanese and Spanish
bull Adapted by Vietnamese speech
16
3 Experimental and results (cont)
Initial model
越南人說越南話 (25 小時 )
六國人說自己本國話
( 也許 100 小時 )
加入越南人說越南話 (25 小時 ) 調適
New models
test
test
17
3 Experimental and results (cont)
bull As the number of clustered sub-models increases SER of the baseline system increases proportionally since the amount of data per model decreases due to the limited training data
bull However the crosslingual system is able to overcome this problem by indirectly using data in other languages
18
3 Experimental and results (cont)
bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix
bull We find that the knowledge based method outperforms the data-driven method
19
4 Conclusion
bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units
bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates
14
2 Acoustic-phonetic unit similarities (cont)
bull 23 Clustered Polyphonic Model Similarity (cont)
15
3 Experimental and results
bull Syllable-based ASR system
bull In order to build a polyphonic decision tree and to adapt the crosslingual acoustic models ndash Baseline
bull 225 hours of data spoken by 8 speakers were used (Vietnamese speech)
ndash Crosslingual methods
bull Speech data from six languages were used to build these models Arabic Chinese English German Japanese and Spanish
bull Adapted by Vietnamese speech
16
3 Experimental and results (cont)
Initial model
越南人說越南話 (25 小時 )
六國人說自己本國話
( 也許 100 小時 )
加入越南人說越南話 (25 小時 ) 調適
New models
test
test
17
3 Experimental and results (cont)
bull As the number of clustered sub-models increases SER of the baseline system increases proportionally since the amount of data per model decreases due to the limited training data
bull However the crosslingual system is able to overcome this problem by indirectly using data in other languages
18
3 Experimental and results (cont)
bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix
bull We find that the knowledge based method outperforms the data-driven method
19
4 Conclusion
bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units
bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates
15
3 Experimental and results
bull Syllable-based ASR system
bull In order to build a polyphonic decision tree and to adapt the crosslingual acoustic models ndash Baseline
bull 225 hours of data spoken by 8 speakers were used (Vietnamese speech)
ndash Crosslingual methods
bull Speech data from six languages were used to build these models Arabic Chinese English German Japanese and Spanish
bull Adapted by Vietnamese speech
16
3 Experimental and results (cont)
Initial model
越南人說越南話 (25 小時 )
六國人說自己本國話
( 也許 100 小時 )
加入越南人說越南話 (25 小時 ) 調適
New models
test
test
17
3 Experimental and results (cont)
bull As the number of clustered sub-models increases SER of the baseline system increases proportionally since the amount of data per model decreases due to the limited training data
bull However the crosslingual system is able to overcome this problem by indirectly using data in other languages
18
3 Experimental and results (cont)
bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix
bull We find that the knowledge based method outperforms the data-driven method
19
4 Conclusion
bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units
bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates
16
3 Experimental and results (cont)
Initial model
越南人說越南話 (25 小時 )
六國人說自己本國話
( 也許 100 小時 )
加入越南人說越南話 (25 小時 ) 調適
New models
test
test
17
3 Experimental and results (cont)
bull As the number of clustered sub-models increases SER of the baseline system increases proportionally since the amount of data per model decreases due to the limited training data
bull However the crosslingual system is able to overcome this problem by indirectly using data in other languages
18
3 Experimental and results (cont)
bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix
bull We find that the knowledge based method outperforms the data-driven method
19
4 Conclusion
bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units
bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates
17
3 Experimental and results (cont)
bull As the number of clustered sub-models increases SER of the baseline system increases proportionally since the amount of data per model decreases due to the limited training data
bull However the crosslingual system is able to overcome this problem by indirectly using data in other languages
18
3 Experimental and results (cont)
bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix
bull We find that the knowledge based method outperforms the data-driven method
19
4 Conclusion
bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units
bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates
18
3 Experimental and results (cont)
bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix
bull We find that the knowledge based method outperforms the data-driven method
19
4 Conclusion
bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units
bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates
19
4 Conclusion
bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units
bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates