Capturing Word-level Dependencies in Morpheme-based Language Modeling
Martha Yifiru Tachbelie and Wolfgang Menzel
University of Hamburg, Department of Informatics, Natural Language Systems Group
Outline
Language Modeling
Morphology of Amharic
Language Modeling for Amharic
− Capturing word-level dependencies
Language Modeling Experiment
− Word segmentation
− Factored data preparation
− The language models
Speech recognition experiment
− The baseline speech recognition system
− Lattice re-scoring experiment
Conclusion and future work
Language Modeling
Language models are fundamental to many natural language processing applications
Statistical language models are the most widely used ones
− provide an estimate of the probability of a word sequence W for a given task
− require large training data
➢ data sparseness and OOV problems, which are especially serious for morphologically rich languages (a small OOV-rate sketch follows below)
Languages with rich morphology:
− high vocabulary growth rate => a high perplexity and a large number of OOVs
➢ Sub-word units are used in language modeling
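As an illustration of the OOV problem mentioned above, the following minimal Python sketch (file names and whitespace tokenization are assumptions, not from the talk) computes a training vocabulary and the OOV rate of a test set against it.

```python
# Minimal sketch: vocabulary size and OOV rate (illustrative only).
# "train.txt" and "test.txt" are assumed whitespace-tokenized text files.

def read_tokens(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield from line.split()

train_vocab = set(read_tokens("train.txt"))
test_tokens = list(read_tokens("test.txt"))

oov = sum(1 for tok in test_tokens if tok not in train_vocab)
print(f"training vocabulary size: {len(train_vocab)}")
print(f"OOV rate: {oov / len(test_tokens):.2%}")
```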
Morphology of Amharic
Amharic is one of the morphologically rich languages
− spoken mainly in Ethiopia
− the second most widely spoken Semitic language
Exhibits a root-pattern, non-concatenative morphological phenomenon
− e.g. the root sbr (a toy interdigitation sketch follows below)
Uses different affixes to create inflectional and derivational word forms
➢ Data sparseness and OOV are serious problems
➔ Sub-word based language modeling has been recommended (Solomon, 2006)
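To make the root-pattern idea concrete, here is a toy Python sketch that interdigitates a consonantal root with a template; the root sbr is taken from the slide, but the templates and output forms are schematic placeholders, not real Amharic morphology.

```python
# Toy illustration of non-concatenative (root-pattern) morphology.
# The templates and vowels below are schematic placeholders only,
# not an analysis of actual Amharic word forms.

def interdigitate(root, template):
    """Fill each 'C' slot of the template with the next root consonant."""
    consonants = iter(root)
    return "".join(next(consonants) if slot == "C" else slot
                   for slot in template)

print(interdigitate("sbr", "CaCaC"))  # -> "sabar" (schematic form)
print(interdigitate("sbr", "CCuC"))   # -> "sbur"  (schematic form)
```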
Language Modeling for Amharic
Sub-word based language models have been developed
− a substantial reduction in the OOV rate has been obtained
Morphemes have been used as units in language modeling
➢ loss of word-level dependencies
Solution:
− higher order n-grams => model complexity
➢ factored language modeling
Capturing Word-level Dependencies
In FLM a word is viewed as a bundle or vector of K parallel features or factors:
$w_n \equiv \{ f_n^1, f_n^2, \ldots, f_n^K \}$
− Factors: linguistic features
− Some of the features can define the word
➢ the probability can be calculated on the basis of these features (see the sketch below)
In Amharic: roots represent the lexical meaning of a word
➢ root-based models to capture word-level dependencies
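A minimal Python sketch of this factored view of a word; the factor names mirror those used later in the talk, while the class, the example values and the choice of history factors are illustrative assumptions.

```python
# Sketch: a word as a bundle of K parallel factors (the FLM view).
# Example values are placeholders; missing factors default to 'null'.

from dataclasses import dataclass

@dataclass
class FactoredWord:
    word: str
    pos: str = "null"
    prefix: str = "null"
    root: str = "null"
    pattern: str = "null"
    suffix: str = "null"

# w_n = {f_n^1, ..., f_n^K}: any subset of the factors can serve as the
# conditioning history, e.g. only the roots for a root-based model.
w = FactoredWord(word="someword", pos="noun", root="sbr")
root_history = (w.root,)  # root-based history for the next prediction
print(root_history)
```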
Morphological Analysis
There is a need for a morphological analyser
− attempts exist (Bayou, 2000; Bayu, 2002; Amsalu and Gibbon, 2005)
− they suffer from lack of data and cannot be used for our purpose
Unsupervised morphology learning tools
➢ not applicable for this study
Manual segmentation
− 72,428 word types found in a corpus of 21,338 sentences have been segmented
− ambiguities: polysemous or homonymous, geminated or non-geminated forms
Factored Data Preparation
Each word is considered as a bundle of features: Word, POS, prefix, root, pattern and suffix
− W-Word:POS-noun:PR-prefix:R-root:PA-pattern:SU-suffix
− a given tag-value pair may be missing; the tag then takes the special value 'null' (see the sketch below)
When roots are considered in language modeling:
− words not derived from roots would be excluded
➢ stems of these words are considered instead
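The tag-value format above can be produced mechanically; the sketch below (the helper name and the input dictionary are assumptions) builds one factored token, filling missing factors with 'null'.

```python
# Sketch: build a factored token in the
# W-...:POS-...:PR-...:R-...:PA-...:SU-... format; missing tag-value
# pairs default to the special value 'null'.

FACTOR_TAGS = ["W", "POS", "PR", "R", "PA", "SU"]

def factored_token(analysis):
    """analysis: dict mapping factor tags to values, e.g. {"W": ..., "R": ...}."""
    return ":".join(f"{tag}-{analysis.get(tag, 'null')}" for tag in FACTOR_TAGS)

# Hypothetical analysis of one word; the values are placeholders.
print(factored_token({"W": "word", "POS": "noun", "R": "root"}))
# -> W-word:POS-noun:PR-null:R-root:PA-null:SU-null
```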
Factored Data Preparation -- cont.
[Diagram: the manually segmented word list yields a factored representation, which is applied to the text corpus to produce the factored data]
The Language Models
The corpus is divided into training, development test and evaluation test sets (80:10:10)
● Root-based models of order 2 to 5 have been developed
● smoothed with the Kneser-Ney smoothing technique (a training sketch follows below)
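The slides do not name the toolkit used; as a rough illustration of training Kneser-Ney smoothed n-gram models of orders 2 to 5, here is a sketch based on NLTK's nltk.lm module (file names, tokenization and OOV handling are simplifying assumptions).

```python
# Sketch: Kneser-Ney smoothed n-gram models of order 2..5 and their
# perplexity on a held-out set (illustrative; not the original setup).
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

def load_sentences(path):
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f if line.strip()]

train_sents = load_sentences("train.txt")    # root (or word) sequences, assumed
test_sents = load_sentences("dev_test.txt")  # development test set, assumed

for order in range(2, 6):
    train_ngrams, vocab = padded_everygram_pipeline(order, train_sents)
    lm = KneserNeyInterpolated(order)
    lm.fit(train_ngrams, vocab)
    test_ngrams = [ng for sent in test_sents
                   for ng in ngrams(pad_both_ends(sent, n=order), order)]
    print(order, lm.perplexity(test_ngrams))  # OOV handling ignored for brevity
```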
The Language Models -- cont.
Perplexity of root-based models on the development test set:

Root n-gram    Perplexity
Bigram         278.57
Trigram        223.26
Quadrogram     213.14
Pentagram      211.93

− The largest improvement is from bigram to trigram
− Only 295 OOVs
− The best model has:
  − a logprob of -53102.3
  − a perplexity of 204.95 on the evaluation test set
The Language Models -- cont.
A word-based model
− with the same training data
− smoothed with Kneser-Ney smoothing
− The largest improvement is from bigram to trigram
− 2,672 OOVs
− The best model has a logprob of -61106.0

Word n-gram    Perplexity
Bigram         1148.76
Trigram        989.95
Quadrogram     975.41
Pentagram      972.58
The Language Models -- cont.
Word-based models that use an additional feature in the n-gram history have also been developed
− e.g. W/W2,POS2,W1,POS1 predicts the current word from the two preceding words and their POS tags
Root-based models seem better than all the others, but might be less constraining
➢ Speech recognition experiment – lattice rescoring

Language models          Perplexity
W/W2,POS2,W1,POS1        885.81
W/W2,PR2,W1,PR1          857.61
W/W2,R2,W1,R1            896.59
W/W2,PA2,W1,PA1          958.31
W/W2,SU2,W1,SU1          898.89
Speech Recognition Experiment
The baseline speech recognition system (Abate, 2006)
Acoustic model:
− trained on a 20-hour read speech corpus
− a set of intra-word triphone HMMs with 3 emitting states and 12 Gaussian mixtures
Language model:
− trained on a corpus consisting of 77,844 sentences (868,929 tokens or 108,523 types)
− a closed vocabulary backoff bigram model
− smoothed with the absolute discounting method
− perplexity of 91.28 on a test set that consists of 727 sentences (8,337 tokens)
Speech Recognition Experiment -- cont.
Performance:
− the 5k development test set (360 sentences read by 20 speakers) has been used to generate the lattices
− lattices have been generated from the 100 best alternatives for each sentence
− the best path transcription has been decoded
➢ 91.67% word recognition accuracy
Speech Recognition Experiment -- cont.
To make the results comparable
− root-based and factored language models have been developed on a factored version of the corpus used in the baseline system
Speech Recognition Experiment -- cont.
Perplexity of root-based models trained on the corpus used in the baseline speech recognition system:

Root n-gram    Perplexity    Logprob
Bigram         113.57        -18628.9
Trigram        24.63         -12611.8
Quadrogram     11.20         -9510.29
Pentagram      8.72          -8525.42
Speech Recognition Experiment -- cont.
Perplexity of factored models
Language models          Perplexity    Logprob
W/W2,POS2,W1,POS1        10.61         -9298.57
W/W2,PR2,W1,PR1          10.67         -9322.02
W/W2,R2,W1,R1            10.36         -9204.7
W/W2,PA2,W1,PA1          10.89         -9401.08
W/W2,SU2,W1,SU1          10.70         -9330.96
Speech Recognition Experiment -- cont.
Word lattice to factored lattice:
− the word lattices are converted into a factored version (factored lattices)
− decoding the best path transcription with the factored word bigram model (FBL) yields 91.60% word recognition accuracy (see the rescoring sketch below)
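The exact mechanics of the rescoring are not spelled out on the slide; as a schematic illustration of the general idea, the sketch below rescores an n-best list by interpolating each hypothesis's baseline score with a log-probability from a second language model (all names, weights and data structures are assumptions).

```python
# Schematic n-best rescoring: combine the baseline score of each
# hypothesis with a new LM score (e.g. a root-based or factored model)
# and pick the best path. Names and weights are illustrative only.

def rescore_nbest(nbest, new_lm_logprob, lm_weight=0.5):
    """nbest: list of (hypothesis_words, baseline_score) pairs.

    new_lm_logprob: function mapping a word sequence to a log-probability
    under the rescoring model.
    """
    def combined(hyp):
        words, baseline_score = hyp
        return baseline_score + lm_weight * new_lm_logprob(words)

    best_words, _ = max(nbest, key=combined)
    return best_words

# Toy usage with a dummy rescoring model:
nbest = [(["hyp", "one"], -120.0), (["hyp", "two"], -121.5)]
print(rescore_nbest(nbest, new_lm_logprob=lambda words: -2.0 * len(words)))
```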
Speech Recognition Experiment -- cont.
WRA with factored models
Language models               WRA in %
Factored word bigram (FBL)    91.60
FBL + W/W2,POS2,W1,POS1       93.60
FBL + W/W2,PR2,W1,PR1         93.82
FBL + W/W2,R2,W1,R1           93.65
FBL + W/W2,PA2,W1,PA1         93.68
FBL + W/W2,SU2,W1,SU1         93.53
Speech Recognition Experiment -- cont.
WRA with root-based language models
Language models               WRA in %
Factored word bigram (FBL)    91.60
FBL + Bigram                  90.77
FBL + Trigram                 90.87
FBL + Quadrogram              90.99
FBL + Pentagram               91.14
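Word recognition accuracy (WRA) in these tables is typically computed as (N - S - D - I) / N over a minimum edit distance alignment of reference and hypothesis; the self-contained sketch below makes the metric explicit (it approximates S + D + I by the total edit distance).

```python
# Sketch: word recognition accuracy via minimum edit distance
# (substitutions, deletions, insertions against N reference words).

def word_recognition_accuracy(reference, hypothesis):
    n, m = len(reference), len(hypothesis)
    # dp[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    errors = dp[n][m]
    return 100.0 * (n - errors) / n

print(word_recognition_accuracy("a b c d".split(), "a x c d".split()))  # 75.0
```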
Conclusion and Future Work
Root-based models have low perplexity and high logprob
− but they did not contribute to an improvement in word recognition accuracy
Future work:
− improving these models by adding other word features while still maintaining word-level dependencies
− exploring other ways of integrating the root-based models into a speech recognition system
Thank you