OPTIMAL TEXT SELECTION ALGORITHM
ASR Project Meetings
Dt: 08 June 2004
- Rohit Kumar -
LTRC, IIIT Hyderabad
Basic Greedy Algorithm
1. Get the frequency distribution of basic units in a language by analyzing a large corpus.
2. Repeat for as many sentences as you want to select:
   1. For each sentence in the corpus, score the sentence for its desirability in the Optimal Text.
   2. Choose the sentence with the best score and add it to the Optimal Text.
   3. Delete the selected sentence from the corpus.
   4. Update the frequency distribution based on the selected sentence.
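The greedy loop above can be sketched in Python as follows. This is only an illustration under stated assumptions: sentences are given as lists of basic units keyed by a sentence id, the score is taken to be the mean current frequency of the units in the sentence, and the update subtracts `k` times the in-sentence count. The names `greedy_select`, `corpus_units` and `k` are hypothetical and do not come from the slides.

```python
from collections import Counter

def greedy_select(corpus_units, num_sentences, k=1.0):
    """Greedy selection over sentences given as lists of basic units.

    corpus_units: dict mapping sentence id -> list of units (assumed format).
    Returns the ids of the selected sentences, in order of selection.
    """
    # Step 1: frequency distribution of units over the whole corpus.
    freq = Counter(u for units in corpus_units.values() for u in units)
    remaining = dict(corpus_units)
    selected = []

    # Step 2: pick one sentence per iteration.
    for _ in range(num_sentences):
        if not remaining:
            break

        # 2.1 Score every remaining sentence (assumed: mean unit frequency).
        def score(sid):
            units = remaining[sid]
            return sum(freq[u] for u in units) / len(units) if units else 0.0

        best = max(remaining, key=score)       # 2.2 best-scoring sentence
        selected.append(best)
        units = remaining.pop(best)            # 2.3 delete it from the corpus
        for unit, count in Counter(units).items():
            freq[unit] -= k * count            # 2.4 update the distribution
    return selected

corpus = {"s1": ["ka", "ek", "ek"], "s2": ["ka", "ka"], "s3": ["ne"]}
picked = greedy_select(corpus, num_sentences=2)
print(picked)
```

Because the frequencies of already selected units are reduced after every pick, later iterations drift towards sentences whose units have not been covered yet.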
Analysis Step
1. Text Corpus: basically a set of sentences.
2. Phonetizer: gives a sequence of phonemes.
3. Syllabifier: gives a sequence of basic units (diphones, triphones, syllables …).
4. Unit Distribution Analysis: counts the number of occurrences of each basic unit.
5. Result: the Corpus Frequency Distribution, for example:

Unit    Frequency
ka      10000
ek      8756
ne      6593
…
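The counting part of the analysis step could look like the sketch below. It assumes each sentence has already been turned into a list of basic units by the Phonetizer and Syllabifier; the function name `unit_distribution` and the toy corpus are illustrative only.

```python
from collections import Counter

def unit_distribution(sentences_as_units):
    """Count occurrences of each basic unit over a corpus.

    sentences_as_units: iterable of unit lists, one list per sentence
    (assumed to be the Syllabifier output for that sentence).
    """
    freq = Counter()
    for units in sentences_as_units:
        freq.update(units)
    return freq

corpus = [["ka", "ek"], ["ne", "ka"], ["ka"]]
freq = unit_distribution(corpus)

# Print units from most to least frequent, in the Unit / Frequency layout.
for unit, count in freq.most_common():
    print(unit, count)
```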
How to Score each Sentence
The sentence goes through the Phonetizer (which gives a sequence of phonemes) and the Syllabifier (which gives a sequence of basic units: diphones, triphones, syllables …). The Ranking Algorithm then scores the sentence using the Corpus Frequency Distribution.
1. Each unit is scored on the basis of its desirability.
2. Desirability is proportional to the frequency of the unit in the large corpus.
3. Sentence Score = (Sum of Fn(score of the unit) over all units in the sentence) / (Number of units in the sentence)
The scoring function Fn can be either a linear or an inverse function.
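A minimal sketch of the sentence score, assuming the score of a unit is simply its corpus frequency and that Fn is applied to each unit before averaging. The two candidate functions `linear` and `inverse` are one plausible reading of "linear or inverse"; returning 0 for unseen units in `inverse` is an assumption made here to avoid division by zero.

```python
def linear(freq):
    # Linear scoring: the score grows with the corpus frequency.
    return float(freq)

def inverse(freq):
    # Inverse scoring: the score shrinks as the corpus frequency grows.
    # Returning 0.0 for zero frequency is an assumption of this sketch.
    return 1.0 / freq if freq > 0 else 0.0

def sentence_score(units, corpus_freq, fn=linear):
    """Sentence Score = sum of fn(unit frequency) / number of units."""
    if not units:
        return 0.0
    return sum(fn(corpus_freq.get(u, 0)) for u in units) / len(units)

corpus_freq = {"ka": 10000, "ek": 8756, "ne": 6593}
print(sentence_score(["ka", "ek"], corpus_freq))           # linear Fn
print(sentence_score(["ka", "ek"], corpus_freq, inverse))  # inverse Fn
```

Dividing by the number of units keeps long sentences from winning merely because they contain more units.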
How to Update the Frequency Distribution
1. The selected sentence goes through the Phonetizer (which gives a sequence of phonemes) and the Syllabifier (which gives a sequence of basic units: diphones, triphones, syllables …).
2. Unit Distribution Analysis counts the number of occurrences of each basic unit, giving the Sentence Level Unit Frequency Distribution.
3. For each unit in the Sentence Level Unit Frequency Distribution, modify its frequency in the Corpus Frequency Distribution by subtracting K x (frequency of the unit in the sentence).
4. The result is the Modified Corpus Frequency Distribution.
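The update rule can be sketched as below. The slides do not give a value for K, so `k` is treated as a tunable constant; returning a modified copy rather than changing the input in place is a choice made for this illustration.

```python
from collections import Counter

def update_distribution(corpus_freq, sentence_units, k=1.0):
    """Return a modified copy of corpus_freq after selecting one sentence."""
    sentence_freq = Counter(sentence_units)   # sentence-level distribution
    modified = dict(corpus_freq)
    for unit, count in sentence_freq.items():
        # Subtract K x (frequency of the unit in the sentence).
        modified[unit] = modified.get(unit, 0) - k * count
    return modified

corpus_freq = {"ka": 10000, "ek": 8756}
modified = update_distribution(corpus_freq, ["ka", "ka", "ek"], k=10)
print(modified)
```

A larger K pushes the ranking away from units that are already covered more quickly.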
Issues
1. Complete desirable coverage is not possible with a simple one-step selection, as it would bring a large number of sentences into the optimal text.
“Optimal Text means Maximum Coverage and Minimum Size”
How to Solve
Follow multiple small steps, as described ahead.
Our Strategy for Optimal Text Selection
1. From the large database, filter out sentences whose length is not between 5 and 15 words.
2. From the frequency analysis of the units, choose the set of N units (out of M units in total) whose frequency is higher than a threshold (say, around 100).
3. Select the sentences (say X of them) which cover these N units.
4. Repeat the process with the remaining P units (P = M - N), but restrict the number of sentences to at most 2 * X.
5. For all the units that still remain, select words which cover these units.
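Steps 1 to 4 of the strategy might be sketched as follows. Several things here are assumptions: words are counted by whitespace splitting, the 5 and 15 word limits are treated as inclusive, and `cover` uses a plain greedy coverage pass as a stand-in for the score-based selection described earlier. All function names are hypothetical.

```python
def filter_by_length(sentences, min_words=5, max_words=15):
    """Step 1: keep sentences with between min_words and max_words words."""
    return [s for s in sentences if min_words <= len(s.split()) <= max_words]

def split_units(corpus_freq, threshold=100):
    """Step 2: split the M units into N frequent units and P = M - N others."""
    frequent = {u for u, f in corpus_freq.items() if f > threshold}
    rest = set(corpus_freq) - frequent
    return frequent, rest

def cover(sentence_units, targets, max_sentences=None):
    """Steps 3 and 4: pick sentences until targets are covered or the cap is hit.

    sentence_units: dict of sentence id -> list of units (assumed format).
    Returns the chosen ids and the set of units left uncovered.
    """
    uncovered = set(targets)
    remaining = dict(sentence_units)
    chosen = []
    while uncovered and remaining:
        if max_sentences is not None and len(chosen) >= max_sentences:
            break
        best = max(remaining, key=lambda sid: len(uncovered & set(remaining[sid])))
        gained = uncovered & set(remaining.pop(best))
        if not gained:
            break  # no remaining sentence covers anything new
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered

units = {"a": ["ka", "ek"], "b": ["ka"], "c": ["ne"]}
first_pass, left = cover(units, {"ka", "ek", "ne"})
print(first_pass, left)
```

For step 4, the same function would be called on the P remaining units with `max_sentences=2 * len(first_pass)`; whatever is still in the returned uncovered set is what step 5 handles with individual words.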
Phonetizer: a class that takes a text as input and gives a sequence of phonemes as output.
Which phonemes? We will be following ITrans-3 as the notation across all our work.

Word       ITrans-3      Phonemes
namaste    namaste       n , a , m , a , s , t , e
dhanywad   dhanywaad     dh , a , n , y , w , aa , d
textile    t'ekstaail    t' , e , k , s , t , aa , i , l
khabrein   qhabren'      qh , a , b , r , e , n'
krishna    krxshhnaa     k , rx , shh , n , aa
Class Details
1. Class constructor
2. AddText (inputs string, no output)
3. GetPhoneme (no input, outputs one phoneme)
4. IsEmpty (no input, outputs flag if no text to work on left)
GetPhoneme (item 3) is the phonetizing function, which breaks a text into phonemes; it will be broadly the same for all languages.
The list of phonemes is shown next.
Phoneme list
(for Hindi; minor modifications for other languages)
a a1 aa aa* aa1 i ii u uu e e* e1 ai o au
n' : *' *
rx lx rxx lxx
k kh g gh ng
ch chh j jh nj
t' t'h d' d'h nd
t th d dh n n~
p ph b bh m
y r r~ l l' l'~ v sh shh s s- h
q qh gx z dr~ dd~ f y~
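The Phonetizer interface described earlier (constructor, AddText, GetPhoneme, IsEmpty) could be sketched in Python as below. This is a guess at one workable design, not the project's implementation: it assumes the input text is already in ITrans-3 notation, splits it by greedy longest match against a phoneme inventory, and silently skips characters that start no known phoneme. `HINDI_SUBSET` is a small hand-picked subset of the list above, written with a straight apostrophe.

```python
# Assumed subset of the Hindi phoneme list, enough for the examples below.
HINDI_SUBSET = ["a", "aa", "i", "ii", "e", "ai", "k", "kh", "s", "sh", "shh",
                "t", "t'", "th", "l", "n", "n'", "m", "rx"]

class Phonetizer:
    """Splits ITrans-3 text into phonemes by greedy longest match."""

    def __init__(self, phonemes):
        self.phonemes = set(phonemes)
        self.max_len = max(len(p) for p in self.phonemes)
        self.text = ""
        self.pos = 0

    def add_text(self, text):
        # Append new text behind whatever is still unread.
        self.text = self.text[self.pos:] + text
        self.pos = 0

    def _match(self):
        # Try the longest candidate first, so "shh" wins over "sh" and "s".
        for n in range(self.max_len, 0, -1):
            cand = self.text[self.pos:self.pos + n]
            if cand in self.phonemes:
                return cand
        return None

    def is_empty(self):
        # Skip characters (spaces, punctuation) that start no known phoneme.
        while self.pos < len(self.text) and self._match() is None:
            self.pos += 1
        return self.pos >= len(self.text)

    def get_phoneme(self):
        if self.is_empty():
            raise ValueError("no text left to phonetize")
        phoneme = self._match()
        self.pos += len(phoneme)
        return phoneme

def phonetize(text):
    p = Phonetizer(HINDI_SUBSET)
    p.add_text(text)
    out = []
    while not p.is_empty():
        out.append(p.get_phoneme())
    return out

print(phonetize("t'ekstaail"))
print(phonetize("krxshhnaa"))
```

Only the inventory passed to the constructor would need to change per language, which fits the remark that the phonetizing function is broadly the same for all languages.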
Handling English Words
1. Dictionary Lookup
2. Letter to Sounds Module
Implementation: Syllabifier
Basic Units:
Diphones (2 phones), Triphones (3 phones), Syllables
The Syllabifier takes the phonemes from the Phonetizer and gives units. For example, if we are working with triphones:
t’ , e , k , s , t , aa , i , l >> t’-e-k , e-k-s, k-s-t, s-t-aa, t-aa-i, aa-i-l
Class Details
1. Class constructor
2. AddPhoneme (inputs a string, no output)
3. GetUnit (no input, outputs one string)
4. IsEmpty (no input, outputs flag if no phonemes left)
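A small Python sketch of this interface for diphones and triphones is given below; syllable units are not covered. It assumes units are formed by sliding a window of `n` phonemes one phoneme at a time, as in the triphone example, and it treats the Syllabifier as empty once fewer than `n` phonemes are buffered. The method names are Python-style renderings of AddPhoneme, GetUnit and IsEmpty.

```python
from collections import deque

class Syllabifier:
    """Groups phonemes into n-phone units (n=2 diphones, n=3 triphones)."""

    def __init__(self, n=3):
        self.n = n
        self.buffer = deque()

    def add_phoneme(self, phoneme):
        self.buffer.append(phoneme)

    def is_empty(self):
        # No further unit can be formed from fewer than n phonemes.
        return len(self.buffer) < self.n

    def get_unit(self):
        if self.is_empty():
            raise ValueError("not enough phonemes left to form a unit")
        unit = "-".join(list(self.buffer)[:self.n])
        self.buffer.popleft()  # slide the window by one phoneme
        return unit

syl = Syllabifier(n=3)
for ph in ["t'", "e", "k", "s", "t", "aa", "i", "l"]:
    syl.add_phoneme(ph)

units = []
while not syl.is_empty():
    units.append(syl.get_unit())
print(units)
```

In this sketch the last `n - 1` phonemes stay in the buffer, so a fresh Syllabifier per word or sentence avoids units that span a boundary.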
Effort for each Language
1. Collect the corpus (most of it is already available for Hindi, Telugu, Tamil and Marathi).
2. Run automatic cleaning and conversions on the corpus.
   * English word to ITrans conversion by dictionary lookup
3. Modify the Phonetizer (and Syllabifier) for the language.
4. Run the OTS strategy.
5. Manually check and correct the selected corpus.
6. Optional: repeat one or more steps of the OTS strategy if need be.