Page 1:

Entropy in Machine Transliteration & Phonology

Bhargava Reddy, 110050078

B.Tech Project

Page 2:

Contents

• Entropy (Information Theory)
• Mathematical Formulation
• Cross Entropy
• Transliterability and Transliteration Performance
• WAVE
• Phonology
• Syllables
• Some Syllabification Rules

Page 3:

What is Entropy

• Entropy is the amount of information obtained in each message received
• It characterizes our uncertainty about our source of information (randomness)
• It is the expected value of the information content of a random variable

Based on Shannon's: A Mathematical Theory of Communication

Page 4:

Properties and Mathematical Formulation

• Assume a set of events whose probabilities of occurrence are $p_1, p_2, p_3, \ldots, p_n$
• $H = H(p_1, p_2, \ldots, p_n)$ is the entropy and satisfies the following properties:
1. $H$ should be continuous in the $p_i$
2. If all the $p_i$ are equal, $p_i = \tfrac{1}{n}$, then $H$ should be a monotonic increasing function of $n$
3. If a choice is broken down into two successive choices, the original $H$ should be the weighted sum of the individual values of $H$

Based on Shannon's: A Mathematical Theory of Communication

Page 5:

Explanation of property 3

• Assume that an event has 3 cases with probabilities $\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{6}$; we can break this choice into two successive choices (first a choice between two branches of probability $\tfrac{1}{2}$ each, then a split of the second branch into $\tfrac{2}{3}$ and $\tfrac{1}{3}$) and measure the entropy as

$$H\!\left(\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{6}\right) = H\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) + \tfrac{1}{2}\, H\!\left(\tfrac{2}{3}, \tfrac{1}{3}\right)$$

[Diagram: a two-stage choice tree with first-stage probabilities 1/2, 1/2 and second-stage probabilities 2/3, 1/3, reproducing the outcome probabilities 1/2, 1/3, 1/6]

Based on Shannon's: A Mathematical Theory of Communication

Page 6:

The Formula for Entropy

$$H = -K \sum_{i=1}^{n} p_i \log p_i$$

where $K$ is a positive constant

Note: This is the only form satisfying the three properties mentioned
Based on Shannon's: A Mathematical Theory of Communication
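
As a minimal sketch (Python, not part of the original slides), the formula can be evaluated directly with $K = 1$ and base-2 logarithms, and the property-3 decomposition from Page 5 can be checked numerically:

```python
import math

def entropy(probs, K=1.0, base=2):
    """Shannon entropy H = -K * sum(p_i * log p_i), skipping zero probabilities."""
    return -K * sum(p * math.log(p, base) for p in probs if p > 0)

# Property 3: H(1/2, 1/3, 1/6) = H(1/2, 1/2) + 1/2 * H(2/3, 1/3)
lhs = entropy([1/2, 1/3, 1/6])
rhs = entropy([1/2, 1/2]) + 1/2 * entropy([2/3, 1/3])
print(lhs, rhs)                  # both ≈ 1.459 bits
assert abs(lhs - rhs) < 1e-12
```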

Page 7:

Properties

• Assume there are 2 events with probabilities $p, q$ where $q = 1 - p$. Then

$$H = -(p \log p + q \log q)$$

• $H = 0$ if and only if all the $p_i$ but one are zero, that one having the value 1; otherwise $H$ is positive
• For a given $n$, $H$ is maximum and equal to $\log n$ when all the $p_i$ are equal to $\tfrac{1}{n}$ (a numerical check follows below)

Based on Shannon's: A Mathematical Theory of Communication
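
A quick numerical check of these properties (again a sketch, assuming $K = 1$ and base-2 logarithms): the two-event entropy vanishes only at the degenerate distributions, and for a given $n$ the maximum value $\log n$ is reached at the uniform distribution.

```python
import math

def H2(p):
    """Two-event entropy -(p log p + q log q) with q = 1 - p, in bits."""
    return sum(-x * math.log2(x) for x in (p, 1 - p) if x > 0)

print(H2(0.0), H2(1.0))   # 0.0 and 0.0: H vanishes when one probability is 1
print(H2(0.5))            # 1.0 bit: the maximum for n = 2
n = 8
uniform = [1 / n] * n
print(sum(-p * math.log2(p) for p in uniform), math.log2(n))  # both 3.0: H_max = log n
```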

Page 8:

The Notion of Cross Entropy

• Suppose there are 2 events, $x$ and $y$, with $m$ possibilities for the first and $n$ for the second
• Let $p(i, j)$ be the probability of the joint occurrence of $i$ for the first and $j$ for the second. The entropy of the joint event is

$$H(x, y) = -\sum_{i,j} p(i, j) \log p(i, j)$$

while

$$H(x) = -\sum_{i,j} p(i, j) \log \sum_{j} p(i, j), \qquad H(y) = -\sum_{i,j} p(i, j) \log \sum_{i} p(i, j)$$

• It can be easily shown that $H(x, y) \le H(x) + H(y)$, and equality holds only when the events are independent
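
A small numerical illustration of the inequality $H(x, y) \le H(x) + H(y)$, using an invented 2x2 joint distribution (not from the slides):

```python
import math

def H(probs):
    """Shannon entropy in bits, skipping zero probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Invented joint distribution p(i, j) for two correlated binary events x and y
joint = [[0.4, 0.1],
         [0.1, 0.4]]
p_x = [sum(row) for row in joint]              # marginal distribution of x
p_y = [sum(col) for col in zip(*joint)]        # marginal distribution of y
H_xy = H([p for row in joint for p in row])
print(H_xy, H(p_x) + H(p_y))   # ≈ 1.72 <= 2.0; equality would require independence
```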

Page 9:

Transliterability and Transliteration Performance

• We need to keep in mind that transliteration between a pair of languages is a non-trivial task:
1. The phonemic sets of the two languages are rarely the same
2. The mappings between phonemes and graphemes in the respective languages are rarely one-to-one
• Some language pairs have near-equal phoneme sets and an almost one-to-one mapping between their character sets
• Some share similar, but unequal, phoneme sets yet similar orthography, possibly due to common evolution, such as Hindi and Kannada (many phonological features borrowed from Sanskrit)

Ref: Compositional Machine Transliteration,(2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharya

Page 10:

Transliterability Measure

• A measure of the ease of transliterability between languages should have the following desirable qualities:

1. Rely purely on the orthographic features of the languages (so that it can be easily calculated from parallel names corpora)

2. Capture and weigh the inherent ambiguity in transliteration at the character level (i.e., the average number of character mappings)

3. Weigh the ambiguous transitions for a given character according to the transition frequencies, since highly ambiguous mappings may occur only rarely

The transliterability measure Weighted Average Entropy (WAVE) does this job

Ref: Compositional Machine Transliteration,(2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharya

Page 11:

WAVE

• WAVE depends on the n-gram that is used as the unit of the source and target language names
• We denote WAVE1, WAVE2 and WAVE3 for unigrams, bigrams and trigrams respectively:

$$\mathrm{WAVE} = \sum_{i \in \mathrm{alphabet}} \frac{\mathrm{frequency}(i)}{\sum_{j \in \mathrm{alphabet}} \mathrm{frequency}(j)} \times \mathrm{Entropy}(i)$$

$$\mathrm{Entropy}(i) = -\sum_{k \in \mathrm{Mappings}(i)} P(k \mid i) \log P(k \mid i)$$

where
alphabet = set of uni-, bi- or tri-grams
i, j = source language units (uni-, bi- or tri-grams)
k = target language unit (uni-, bi- or tri-gram)
frequency(i) = frequency of occurrence of the source unit i in the parallel names corpus
Mappings(i) = set of target language uni-, bi- or tri-grams that i can map to

Ref: Compositional Machine Transliteration (2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharya

Page 12:

Motivation

• From the adjacent table we can conclude that the unigram 'a' occurs nearly 150 times more frequently than the unigram 'x'
• This implies that capturing the ambiguities of 'a' will be more beneficial than capturing those of 'x'
• The term 'frequency(i)' captures this effect
• Table IV shows the mappings from the source to the target language
• We can observe that the unigram c maps to 2 characters, स and क
• Whereas p has only one mapping, प
• The term 'Entropy(i)' captures this information and ensures that c is weighted more than p (a toy sketch below illustrates this weighting)

Ref: Compositional Machine Transliteration,(2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharya
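
As an illustrative sketch (not the authors' implementation; the alignment counts below are invented), WAVE can be computed from source-to-target mapping counts by weighting each source unit's mapping entropy with its relative frequency:

```python
import math
from collections import Counter

def wave(mapping_counts):
    """WAVE = sum_i [frequency(i) / sum_j frequency(j)] * Entropy(i), where
    Entropy(i) is the entropy of the target-side mappings of source unit i."""
    freq = {i: sum(counts.values()) for i, counts in mapping_counts.items()}
    total = sum(freq.values())
    score = 0.0
    for i, counts in mapping_counts.items():
        probs = [c / freq[i] for c in counts.values()]
        score += (freq[i] / total) * sum(-p * math.log2(p) for p in probs if p > 0)
    return score

# Invented counts echoing the slide: 'c' maps ambiguously to स and क, 'p' only to प
counts = {"c": Counter({"स": 40, "क": 60}),
          "p": Counter({"प": 100})}
print(wave(counts))   # ≈ 0.49: only the ambiguous unit 'c' contributes
```

With these invented counts Entropy('p') is zero, so only 'c' contributes to the score, mirroring the point that ambiguous units should carry more weight than unambiguous ones.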

Page 13:

Plot Between WAVE and Transliteration Quality

• The following plots are drawn between log(WAVE) and the accuracy measure (for approximately 15k training-corpus entries) for the language pairs En-Hi, En-Ka, En-Ma, Hi-En, Ka-En, Hi-Ka and Ma-Ka

• We can see that as the value of WAVE increases, the accuracy decreases exponentially

• The top-left 2 points in each of the plots are for the Hindi and Marathi languages, which share the same orthography and have largely one-to-one character mappings between them

• We can observe that the different n-grams give almost similar results, which means we can choose the unigram model as the generalization

• Based on these observations we can term two languages with a small WAVE1 measure as more easily transliterable.

Page 14:

Phonology

• Phonetics: Concerned with how speech sounds are produced in the vocal tract, as well as with the physical properties of the speech sound waves generated by the larynx and vocal tract
• Phonology: Refers to the abstract principles that govern the distribution of sounds in a language. It is the subfield of linguistics that studies the structure and systematic patterning of sounds in human language

Linguistics, An Introduction to Language and Communication, Adrian Akmajian

Page 15:

Views of Phonology

• Phonology broadly has 2 views:
• A description of the sounds of a particular language and the rules governing the distribution of these sounds
  Ex: the phonology of English, German or another language
• A part of the general theory of human language that is concerned with the universal properties of natural language sound systems
• The English language has 44 phonemes (20 vowel sounds and 24 consonant sounds)
• These phonemes can be generalized such that they can be adapted to many languages

Linguistics, An Introduction to Language and Communication, Adrian Akmajian

Page 16:

English Language Phonemes Generalized

Page 17:

Syllables

• A syllable is a unit of organization for a sequence of speech sounds
• Syllables are often considered the phonological "building blocks" of words
• Syllabic writing began several hundred years before the first letters
• A word that consists of a single syllable is called a monosyllable. Similar terms include disyllable for a word of 2 syllables, trisyllable for a word of 3 syllables and polysyllable, which may refer to a word of more than 3 syllables

Linguistics, An Introduction to Language and Communication, Adrian Akmajian

Page 18:

Syllable

• A syllable has the following structure: a syllable (σ) consists of an Onset (O), a Nucleus (N) and a Coda (C)

• Across the world’s languages the most common type of syllable has the structure CV(C), that is, a single consonant C followed by a single vowel V, followed in turn (optionally) by a single consonant

Page 19:

Syllable Grouping

• Consider the word napkin, which can be split as “nap-kin”

[Diagram: syllable σ1 has onset n, nucleus æ, coda p; syllable σ2 has onset k, nucleus i, coda n]

Linguistics, An Introduction to Language and Communication, Adrian Akmajian

Page 20:

Some Syllabification Rules

• Aspiration Rule: Phonemes with the features [-continuant, -voiced] are aspirated in syllable-initial position
• /p/ is a [-continuant, -voiced] phoneme
• If the intervocalic consonant p in the sequence apa is the onset of the second syllable, it will be aspirated. If it is the coda of the first syllable, it will not be aspirated
• As you pronounce the sequence apa, place your hand in front of your mouth. You will feel a small puff of air that accompanies the release of the p, regardless of whether you stress the first a or the second
• The presence of aspiration is the evidence we need to conclude that the word apartment is syllabified as “a-part-ment” (a toy sketch of this rule follows below)
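
A toy sketch of the aspiration rule (the phoneme set and the plain-string syllable representation are simplifying assumptions, not from the slides): mark a voiceless stop as aspirated when it stands in syllable-initial position.

```python
# Voiceless stops, i.e. [-continuant, -voiced] phonemes (a simplifying assumption)
VOICELESS_STOPS = {"p", "t", "k"}

def mark_aspiration(syllables):
    """Append 'ʰ' to a voiceless stop that occurs in syllable-initial position."""
    marked = []
    for syl in syllables:
        if syl and syl[0] in VOICELESS_STOPS:
            syl = syl[0] + "ʰ" + syl[1:]
        marked.append(syl)
    return "-".join(marked)

print(mark_aspiration(["a", "part", "ment"]))   # a-pʰart-ment: onset p is aspirated
print(mark_aspiration(["ap", "art", "ment"]))   # ap-art-ment: coda p stays unaspirated
```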

Page 21:

Maximal Onset Principle

• The Principle: The sequence of consonants that combine to form an onset with the vowel on the right is the maximal sequence that can occur at the beginning of a syllable anywhere in the language
• Illustration: Consider the word “constructs”, which is bisyllabic. Between the 2 vowels is the sequence n-s-t-r, which has to be split. Since the maximal sequence that occurs at the beginning of a syllable in English is str-, we split it as “n-str”, and the word is therefore syllabified as “con-structs” (a sketch of this split follows after this list)
• Why not the other splits: Assume the split is “ns-tr”; then t would appear in syllable-initial position and would have to be aspirated, which is not the case here. The other splits can be ruled out similarly
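
The principle can be phrased as a simple greedy split of the intervocalic consonant cluster: give the following syllable the longest suffix of the cluster that is a legal onset in the language. A minimal sketch (the onset inventory here is a tiny invented subset of English onsets):

```python
# A tiny invented subset of legal English onsets
LEGAL_ONSETS = {"str", "st", "tr", "s", "t", "r", "n", "k"}

def split_cluster(cluster):
    """Split an intervocalic consonant cluster into (coda, maximal legal onset)."""
    for start in range(len(cluster)):          # try the longest suffix first
        onset = cluster[start:]
        if onset in LEGAL_ONSETS:
            return cluster[:start], onset
    return cluster, ""                          # no legal onset: everything is coda

print(split_cluster("nstr"))   # ('n', 'str') -> con-structs
print(split_cluster("pk"))     # ('p', 'k')   -> nap-kin
```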

Page 22:

References

• A Mathematical Theory of Communication (1948), C. E. Shannon, The Bell System Technical Journal, July 1948
• Compositional Machine Transliteration (2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharya
• Linguistics, An Introduction to Language and Communication, Adrian Akmajian, Richard A Demers, Ann K Farmer, Robert M Harnish
• Wiki articles on entropy, phonology and transliteration