OPTIMAL TEXT SELECTION ALGORITHM ASR Project Meetings Dt: 08 June 2004 - Rohit Kumar - LTRC, IIIT Hyderabad

OPTIMAL TEXT SELECTION ALGORITHM

ASR Project Meetings

Dt: 08 June 2004

- Rohit Kumar -

LTRC, IIIT Hyderabad

Basic Greedy Algorithm

1. Get the frequency distribution of basic units in a language by analyzing a large corpus.

2. Iterate for as many sentences as you want to select:

   1. For each sentence in the corpus, score the sentence for its desirability in the Optimal Text.
   2. Choose the sentence with the best score into the Optimal Text.
   3. Delete the selected sentence from the corpus.
   4. Update the frequency distribution based on the sentence selected.
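The greedy loop above can be sketched as follows. This is a minimal illustration, not the project's implementation: the names `greedy_select` and `mean_frequency`, the toy corpus, the use of the plain mean frequency as the score, and the subtraction factor `k` are all assumptions made for the example.

```python
from collections import Counter


def mean_frequency(units, freq):
    """Average current frequency of a sentence's units (0.0 for no units)."""
    if not units:
        return 0.0
    return sum(freq[u] for u in units) / len(units)


def greedy_select(corpus, n_select, k=1.0):
    """Greedily pick up to n_select sentence ids from corpus.

    corpus maps a sentence id to its list of basic units (assumed to have
    been produced beforehand by a phonetizer and syllabifier).
    """
    # Frequency distribution of units over the whole corpus.
    freq = Counter()
    for units in corpus.values():
        freq.update(units)

    remaining = dict(corpus)  # work on a copy; the caller's corpus is untouched
    selected = []
    for _ in range(min(n_select, len(remaining))):
        # Score every remaining sentence and take the best one.
        best = max(remaining, key=lambda sid: mean_frequency(remaining[sid], freq))
        selected.append(best)
        # Delete the selected sentence from the working corpus.
        units = remaining.pop(best)
        # Lower the frequency of the units that were just covered.
        for unit, count in Counter(units).items():
            freq[unit] -= k * count
    return selected


# Hypothetical toy corpus: sentence id -> units.
toy_corpus = {
    "s1": ["ka", "ka", "ek"],
    "s2": ["ka", "ne"],
    "s3": ["zz", "zz"],
}
picked = greedy_select(toy_corpus, n_select=2)
print(picked)
```

Because the frequency of "ka" drops after the first pick, the second pick goes to a sentence with units that are not yet covered, which is the point of updating the distribution inside the loop.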

Analysis Step

Text Corpus (basically a set of sentences)
  -> Phonetizer (gives a sequence of phonemes)
  -> Syllabifier (gives a sequence of basic units: diphones, triphones, syllables …)
  -> Unit Distribution Analysis (counts the number of occurrences of each basic unit)
  -> Corpus Frequency Distribution

Unit    Frequency
ka      10000
ek      8756
ne      6593
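The counting part of the analysis step can be sketched in a few lines. The function name `corpus_frequency` and the toy input are assumptions; each sentence is taken to be already converted into a list of basic units.

```python
from collections import Counter


def corpus_frequency(unit_sequences):
    """Count how often each basic unit occurs across all sentences."""
    freq = Counter()
    for units in unit_sequences:
        freq.update(units)
    return freq


# Hypothetical toy input: one list of units per sentence.
corpus_units = [["ka", "ek", "ne"], ["ka", "ne"], ["ka"]]
freq = corpus_frequency(corpus_units)
print(freq.most_common())
```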

How to Score each Sentence

Sentence
  -> Phonetizer (gives a sequence of phonemes)
  -> Syllabifier (gives a sequence of basic units: diphones, triphones, syllables …)
  -> Ranking Algorithm (uses the Corpus Frequency Distribution)

1. Each unit is scored on the basis of its desirability.

2. Desirability is proportional to the frequency of the unit in a large corpus.

3. Sentence Score = Sum of Fn(Score of each unit in the sentence) / Number of Units in the sentence

The scoring function Fn could be either a linear or an inverse function.
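A minimal sketch of the sentence score with both kinds of scoring function. The unit score is taken to be the raw corpus frequency (proportionality constant 1), and the exact forms of `linear` and `inverse`, including the `+ 1` that avoids division by zero, are assumptions rather than something the slides specify.

```python
def linear(frequency):
    """Linear scoring: frequent units make a sentence more desirable."""
    return float(frequency)


def inverse(frequency):
    """Inverse scoring: rare units make a sentence more desirable.

    The +1 is an assumption that keeps the score finite for frequency 0.
    """
    return 1.0 / (1.0 + frequency)


def sentence_score(units, freq, fn):
    """Sum of fn(unit score) over the sentence, divided by the unit count."""
    if not units:
        return 0.0
    return sum(fn(freq.get(u, 0)) for u in units) / len(units)


# Frequencies taken from the example table in the analysis step.
freq = {"ka": 10000, "ek": 8756, "ne": 6593}

print(sentence_score(["ka", "ne"], freq, linear))
print(sentence_score(["ka", "ne"], freq, inverse))
```

Dividing by the number of units keeps long sentences from winning simply because they contain more units.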

How to Update the Frequency Distribution

Sentence
  -> Phonetizer (gives a sequence of phonemes)
  -> Syllabifier (gives a sequence of basic units: diphones, triphones, syllables …)
  -> Unit Distribution Analysis (counts the number of occurrences of each basic unit)
  -> Sentence Level Unit Frequency Distribution

Corpus Frequency Distribution + Sentence Level Unit Frequency Distribution
  -> Modified Corpus Frequency Distribution

For each unit in the sentence frequency distribution, modify its corpus frequency by subtracting

K x (Frequency of Unit in Sentence)
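The update rule can be written as a small function. The name `update_distribution` and the choice to return a new dictionary instead of modifying the input are assumptions; the slides do not give a value for K.

```python
from collections import Counter


def update_distribution(corpus_freq, sentence_units, k=1):
    """Return a modified copy of corpus_freq after selecting one sentence.

    Each unit's corpus frequency is reduced by k times the number of
    times the unit occurs in the selected sentence.
    """
    updated = dict(corpus_freq)
    for unit, count in Counter(sentence_units).items():
        updated[unit] = updated.get(unit, 0) - k * count
    return updated


corpus_freq = {"ka": 10000, "ek": 8756, "ne": 6593}
modified = update_distribution(corpus_freq, ["ka", "ka", "ne"], k=10)
print(modified)
```

A larger K discounts already covered units more aggressively, so later selections move faster toward units that have not yet been covered.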

Issues

1. Complete desirable coverage is not possible with a simple one-step selection, as it would bring a large number of sentences into the optimal text.

“Optimal Text means Maximum Coverage and Minimum Size”

How to Solve

Follow multiple small steps as described ahead

Our Strategy for Optimal Text Selection

1. From the large database, filter out sentences whose length is not between 5 and 15 words.

2. From the frequency analysis of the units, choose a set of N units (out of a total of M units) whose frequency is higher than a threshold (say, around 100).

3. Select the sentences (say X of them) which cover these N units.

4. Repeat the process with the remaining P (P = M - N) units, but restrict the number of sentences to at most 2 * X.

5. For all the units still remaining, select words which cover these units.
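The first two steps of this strategy (length filtering and splitting the units by a frequency threshold) can be sketched as below. The function names and toy data are assumptions, and counting words by whitespace splitting is a simplification; the later selection steps are not covered here.

```python
def filter_by_length(sentences, min_words=5, max_words=15):
    """Keep only sentences whose word count lies in [min_words, max_words]."""
    return [s for s in sentences if min_words <= len(s.split()) <= max_words]


def split_units_by_threshold(freq, threshold=100):
    """Split units into the N frequent ones and the P = M - N others."""
    frequent = {u for u, f in freq.items() if f > threshold}
    others = set(freq) - frequent
    return frequent, others


# Hypothetical toy data.
sentences = [
    "too short",
    "this sentence has exactly six words",
    " ".join(["word"] * 16),
]
kept = filter_by_length(sentences)

freq = {"ka": 10000, "xx": 100, "yy": 3}
frequent, others = split_units_by_threshold(freq)
print(kept, frequent, others)
```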

Phonetizer: A class that takes a text as input and gives a sequence of phonemes as output.

Which phonemes? We will follow ITrans-3 as the notation across all our work.

Word        ITrans-3      Phonemes
namaste     namaste       n , a , m , a , s , t , e
dhanywad    dhanywaad     dh , a , n , y , w , aa , d
textile     t’ekstaail    t’ , e , k , s , t , aa , i , l
khabrein    qhabren’      qh , a , b , r , e , n’
krishna     krxshhnaa     k , rx , shh , n , aa

Class Details

1. Class constructor

2. AddText (inputs string, no output)

3. GetPhoneme (no input, outputs one phoneme)

4. IsEmpty (no input, outputs a flag that is set when no text is left to work on)

GetPhoneme (item 3) is the phonetizing function: it breaks the text into phonemes and will be broadly the same for all languages.

The list of phonemes is shown in the next slide

Phoneme list

(for Hindi; minor modifications for other languages)

a a1 aa aa* aa1 i ii u uu e e* e1 ai o au

n' : *' *

rx lx rxx lxx

k kh g gh ng
ch chh j jh nj
t' t'h d' d'h nd
t th d dh n n~
p ph b bh m

y r r~ l l' l'~ v sh shh s s- h

q qh gx z dr~ dd~ f y~
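The Phonetizer interface described earlier, applied to ITrans text, could be sketched as a longest-match tokenizer. This is an assumption about how the class might work, not the project's code: the method names are Python-style renderings of AddText, GetPhoneme and IsEmpty, the inventory is only a small subset of the phoneme list (plus `w`, which appears in the example table), a plain ASCII apostrophe is used, and the helper `phonetize` is hypothetical.

```python
class Phonetizer:
    # Assumed subset of the phoneme inventory, enough for the example words.
    PHONEMES = {
        "a", "aa", "i", "e", "ai", "k", "qh", "b", "r", "rx", "n", "n'",
        "m", "s", "sh", "shh", "t", "t'", "d", "dh", "y", "w", "l",
    }

    def __init__(self, phonemes=None):
        self._phonemes = set(phonemes or self.PHONEMES)
        self._max_len = max(len(p) for p in self._phonemes)
        self._buffer = ""

    def add_text(self, text):
        """Append ITrans text (a single word, in this sketch) to the buffer."""
        self._buffer += text

    def is_empty(self):
        """True when no text is left to work on."""
        return not self._buffer

    def get_phoneme(self):
        """Remove and return the longest phoneme at the start of the buffer."""
        if self.is_empty():
            raise ValueError("no text left to phonetize")
        for length in range(min(self._max_len, len(self._buffer)), 0, -1):
            candidate = self._buffer[:length]
            if candidate in self._phonemes:
                self._buffer = self._buffer[length:]
                return candidate
        # Unknown symbol: emit one character so the caller still makes progress.
        symbol, self._buffer = self._buffer[0], self._buffer[1:]
        return symbol


def phonetize(word):
    """Hypothetical helper: run a Phonetizer over one ITrans word."""
    p = Phonetizer()
    p.add_text(word)
    result = []
    while not p.is_empty():
        result.append(p.get_phoneme())
    return result


print(phonetize("krxshhnaa"))
```

Trying the longest candidate first is what keeps `shh` from being split into `sh` and `h`, or `aa` into two `a` phonemes.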

Handling English Words

1. Dictionary Lookup

2. Letter-to-Sound Module

Implementation: Syllabifier

Basic Units:

Diphones (2 phones), Triphones (3 phones), Syllables

It takes the phonemes from the Phonetizer and gives units. For example, if we are working with triphones:

t’ , e , k , s , t , aa , i , l >> t’-e-k , e-k-s , k-s-t , s-t-aa , t-aa-i , aa-i-l

Class Details

1. Class constructor
2. AddPhoneme (inputs a string, no output)
3. GetUnit (no input, outputs one string)
4. IsEmpty (no input, outputs a flag if no phonemes are left)
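For diphones and triphones, the Syllabifier amounts to a sliding window over the phoneme stream. The sketch below makes several assumptions: the method names are Python-style renderings of AddPhoneme, GetUnit and IsEmpty, `is_empty` reports true as soon as too few phonemes remain to form another unit, one instance is used per word, and true syllables (which need language-specific rules) are not handled.

```python
from collections import deque
from itertools import islice


class Syllabifier:
    def __init__(self, n=3):
        self._n = n                # 2 for diphones, 3 for triphones
        self._phonemes = deque()

    def add_phoneme(self, phoneme):
        """Queue one phoneme coming from the Phonetizer."""
        self._phonemes.append(phoneme)

    def is_empty(self):
        """True when fewer than n phonemes remain, so no unit can be built."""
        return len(self._phonemes) < self._n

    def get_unit(self):
        """Return the next unit and slide the window forward by one phoneme."""
        if self.is_empty():
            raise ValueError("not enough phonemes left to form a unit")
        unit = "-".join(islice(self._phonemes, self._n))
        self._phonemes.popleft()
        return unit


# The triphone example from the slide, with an ASCII apostrophe.
syl = Syllabifier(n=3)
for ph in ["t'", "e", "k", "s", "t", "aa", "i", "l"]:
    syl.add_phoneme(ph)

units = []
while not syl.is_empty():
    units.append(syl.get_unit())
print(units)
```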

Effort for each Language

1. Collect the Corpus (most of Hindi, Telugu, Tamil, Marathi already available)

2. Automatic Cleaning and Conversions on the Corpus
   * English Word to ITrans Conversion by dictionary lookup

3. Modifying the Phonetizer (and Syllabifier) for the language

4. Running the OTS strategy

5. Manually Checking Selected Corpus and Corrections

6. Optional: Repeating one or more steps of the OTS strategy if need be