CS626 : Natural Language Processing/Speech, NLP and the Web (Lecture 34–36: Phonetics and phonology; transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay 16, 27, 28 Oct, 2014 16 oct, 2014 Phonetics-phonology, Pushpak Bhattacharyya 1


Motivation

Language origin: speech sound
Writing came much later

The primordial sound: a-u-m

ॐ

Motivation

Automatic Speech Recognition (ASR)
Text to Speech (TTS)
Speech to Speech Machine Translation
Speech interface to search

Ancient 5 x 5 Indian Classification of Consonants

Group      Consonants    Place
-------    ----------    ---------
क वर्ग     क ख ग घ ङ     Velar
च वर्ग     च छ ज झ ञ     Palatal
ट वर्ग     ट ठ ड ढ ण     Retroflex
त वर्ग     त थ द ध न     Dental
प वर्ग     प फ ब भ म     Labial

Grapheme-phoneme: confused relation

Fish == Ghoti ?

Phonetic Symbols and IPA notation

IPA: vowels

Places of articulation

Place of Articulation

Labial: two lips coming together
  [p] as in possum, [b] as in bear
Dental: tongue against the teeth
  [th] of thing or the [dh] of though
Alveolar: the alveolar ridge is the portion of the roof of the mouth just behind the upper teeth; the tip of the tongue is placed against the alveolar ridge
  Phones [s], [z], [t], and [d]
Palatal: roof of the mouth; the blade of the tongue is placed against the rising back of the alveolar ridge
  Sounds [sh] (shrimp), [ch] (china), [zh] (Asian), and [jh] (jar)
Velar: the velum is a movable muscular flap at the back of the roof of the mouth; the back of the tongue is pressed up against the velum
  Sounds [k] (cuckoo), [g] (goose), and [ŋ] (kingfisher)
Glottal: closing the glottis (by bringing the vocal folds together)
  The glottal stop [q] (IPA [ʔ]) is made by closing the glottis

Manner of Articulation: Stops and Nasals

All consonants are produced by restriction of airflow
Manner of articulation: how the restriction is produced (complete or partial stoppage)
A stop is a consonant in which airflow is completely blocked for a short time
English has voiced stops like [b], [d], and [g] as well as unvoiced stops like [p], [t], and [k]
Stops are also called plosives
Nasal sounds [n], [m], and [ng] are made by lowering the velum and allowing air to pass into the nasal cavity

Vowels (1/2)

Vowels (2/2)

Fricatives

In fricatives, airflow is constricted but not cut off completely. The turbulent airflow that results from the constriction produces a characteristic “hissing” sound.
The English labiodental fricatives [f] and [v] are produced by pressing the lower lip against the upper teeth, allowing a restricted airflow between the upper teeth.
The dental fricatives [th] and [dh] allow air to flow around the tongue between the teeth.
The alveolar fricatives [s] and [z] are produced with the tongue against the alveolar ridge, forcing air over the edge of the teeth.
In the palato-alveolar fricatives [sh] and [zh] the tongue is at the back of the alveolar ridge, forcing air through a groove formed in the tongue.

Affricates, Laterals/Liquids and Taps/Flaps

Affricates are stops followed immediately by fricatives
  English [ch] (chicken); Marathi chaa (e.g., gharaachaa; of the house)
Laterals or liquids: tip of the tongue up against the alveolar ridge or the teeth, with one or both sides of the tongue lowered to allow air to flow over it
  [l] (learn)
Tap or flap: quick motion of the tongue against the alveolar ridge
  [dx] (IPA [ɾ])
  The consonant in the middle of the word lotus ([l ow dx ax s]) is a tap in most dialects of American English; speakers of many UK dialects would use a [t] instead of a tap in this word.

Articulation of consonants: Larynx action/glottis state (1/2)

Vocal cords are pulled apart. The air passes freely through the glottis. This is called the voicelessness state and sounds produced with this configuration of the vocal cords are called voiceless: p t k f θ s ʃ tʃ

Vocal cords are pulled close together. The air passing through the glottis causes the vocal cords to vibrate. This is called the voicing state and sounds produced with this configuration of the vocal cords are called voiced: b d g v ð z ʒ dʒ

Articulation of consonants: Larynx action/glottis state (2/2)

Vocal cords are apart at the back and pulled together at the front. This is called the whisper state.

Vocal cords assume the voicing state but are relaxed. This is called the murmur state.

Phonology: Syllables

Basics of syllables

“A syllable is a unit of spoken language consisting of a single uninterrupted sound formed generally by a vowel and preceded or followed by one or more consonants.”

Vowels are the heart of a syllable (most sonorous element) (svayam raajate iti svaraH)

Consonants act as sounds attached to vowels.

Syllable structure

A syllable consists of 3 major parts:
  Onset (C)
  Nucleus (V)
  Coda (C)
Vowels sit in the nucleus of a syllable
Consonants may get attached as onset or coda
Basic structure: CV

Possible syllable structures

The nucleus is always present
Onset and coda may be absent
Possible structures:
  V
  CV
  VC
  CVC

Syllable theories

Prominence Theory
  E.g. entertaining /entəteɪnɪŋ/
  The peaks of prominence: vowels /e ə eɪ ɪ/
  Number of syllables: 4
Chest Pulse Theory
  Based on muscular activities
Sonority Theory
  Based on the relative sonority (loudness) of segments within words

Introduction to sonority theory

“The sonority of a sound is its loudness relative to other sounds with the same length, stress and pitch.”

Some sounds are more sonorous than others
Words in a language can be divided into syllables
Sonority theory distinguishes syllables on the basis of sounds.

Sonority hierarchy

Defined on the basis of the amount of sound associated
The sonority hierarchy, from most to least sonorous, is as follows:
  Vowels (a, e, i, o, u)
  Liquids and glides (y, r, l, v)
  Nasals (n, m)
  Fricatives (s, z, f, ... sh, th etc.)
  Affricates (ch, j)
  Stops (b, d, g, p, t, k)

Sonority scale

Obstruents can be further classified into:
  Fricatives
  Affricates
  Stops

Sonority theory & syllables

“A syllable is a cluster of sonority, defined by a sonority peak acting as a structural magnet to the surrounding lower sonority elements.”

Represented as waves of sonority, or the sonority profile of that syllable

[Figure: sonority profile with Onset, Nucleus and Coda]

Sonority sequencing principle

“The sonority profile of a syllable must rise until its peak (nucleus), and then fall.”

[Figure: sonority profile rising from the Onset to the Peak (Nucleus) and falling to the Coda]
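The sonority sequencing principle can be checked mechanically once each sound has a sonority rank. The sketch below is only illustrative: the numeric ranks, the one-letter-per-sound simplification and the name `obeys_ssp` are assumptions, with the ordering taken from the sonority hierarchy slide.

```python
# Sonority classes from least to most sonorous, following the hierarchy
# on the slides. One letter stands for one sound (an assumption).
CLASSES = ["bdgptk",   # stops
           "j",        # affricates
           "szf",      # fricatives
           "nm",       # nasals
           "yrlv",     # liquids and glides
           "aeiou"]    # vowels
SONORITY = {ch: rank for rank, letters in enumerate(CLASSES) for ch in letters}


def obeys_ssp(syllable):
    """Return True if sonority strictly rises to a single peak, then falls."""
    ranks = [SONORITY[ch] for ch in syllable.lower()]  # KeyError on unknown letters
    peak = ranks.index(max(ranks))
    rising = all(a < b for a, b in zip(ranks[:peak], ranks[1:peak + 1]))
    falling = all(a > b for a, b in zip(ranks[peak:], ranks[peak + 1:]))
    return rising and falling


print(obeys_ssp("plan"))  # stop < liquid < vowel > nasal
print(obeys_ssp("lpan"))  # liquid before stop breaks the rise
```

Long vowels written with two letters (as in JEET) would need a phoneme-level representation; this sketch does not handle them.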

Examples

ABHIJEET
  Profile-1: A BHI JEET
  Profile-2: ABHI JEET

Maximal onset principle

“The intervocalic consonants are maximally assigned to the onsets of syllables in conformity with universal and language-specific conditions.”

Determines underlying syllable division
Example: DIPLOMA
  DIP LO MA and DI PLO MA
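A minimal sketch of the maximal onset principle on written words follows. It assumes one vowel letter per nucleus and a deliberately tiny, hypothetical set of legal onsets (`LEGAL_ONSETS`); a real system would list the onsets that the language actually permits and would work on phonemes rather than letters.

```python
VOWELS = set("aeiou")
# Hypothetical legal onsets, just enough for the example below.
LEGAL_ONSETS = {"", "d", "p", "l", "m", "pl"}


def syllabify(word):
    """Split a word so that each intervocalic cluster gives the longest
    legal onset to the following syllable."""
    word = word.lower()
    nuclei = [i for i, ch in enumerate(word) if ch in VOWELS]
    if not nuclei:
        return [word]
    boundaries = []
    for left, right in zip(nuclei, nuclei[1:]):
        cluster = word[left + 1:right]
        # Try the longest suffix of the cluster first; "" always matches.
        for k in range(len(cluster) + 1):
            if cluster[k:] in LEGAL_ONSETS:
                boundaries.append(left + 1 + k)
                break
    pieces, start = [], 0
    for b in boundaries:
        pieces.append(word[start:b])
        start = b
    pieces.append(word[start:])
    return pieces


print(syllabify("diploma"))  # 'pl' is a legal onset, so it goes to the second syllable
```

With `pl` in the onset set the division is DI PLO MA; removing `pl` from `LEGAL_ONSETS` would push the `p` back into the coda of the first syllable.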

Syllable Structure: a more detailed look

The number of syllables in a word is roughly (intuitively) the number of vocalic segments in the word.
Thus, a vowel is an obligatory element in the structure of a syllable. This vowel is called the “nucleus”.
Basic configuration: (C)V(C).
The part of the syllable preceding the nucleus is called the onset.
Elements coming after the nucleus are called the coda.
Nucleus and coda together are referred to as the rhyme.

S ≡ Syllable, O ≡ Onset, R ≡ Rhyme, N ≡ Nucleus, Co ≡ Coda

Syllable Structure: Examples

‘word’
‘sprint’

Syllable Structure: Examples

‘may’: No Coda.
‘opt’: No Onset.
‘air’: No Coda, No Onset.

Syllable Structure

Open syllable: ends in a vowel
Closed syllable: ends in a consonant or consonant cluster
Light syllable: a syllable which is open and ends in a short vowel
  General description: CV. Example: ‘air’.
Heavy syllable: closed syllables or syllables ending in a diphthong
  Example: ‘opt’
  Example: ‘may’
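The onset/nucleus/coda split and the light/heavy distinction can be sketched on phoneme lists. The example assumes ARPAbet-style symbols without stress digits; the function names and the choice of which symbols count as diphthongs are assumptions made for illustration.

```python
# ARPAbet-style vowel symbols (stress digits omitted).
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}
# Assumed set of diphthongs for the heavy-syllable test.
DIPHTHONGS = {"AW", "AY", "EY", "OW", "OY"}


def split_syllable(phones):
    """Return (onset, nucleus, coda) for one syllable given as a phoneme list."""
    for i, p in enumerate(phones):
        if p in VOWELS:
            return phones[:i], p, phones[i + 1:]
    raise ValueError("a syllable needs a vowel nucleus")


def is_heavy(phones):
    """Heavy = closed syllable, or open syllable ending in a diphthong."""
    _, nucleus, coda = split_syllable(phones)
    return bool(coda) or nucleus in DIPHTHONGS


print(split_syllable(["S", "P", "R", "IH", "N", "T"]))  # 'sprint'
print(split_syllable(["M", "EY"]))                      # 'may': no coda
print(is_heavy(["AA", "P", "T"]))                       # 'opt': closed
```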

Syllabification: Determining Syllable Boundaries

Given a string of syllables (a word), what is the coda of one and the onset of another?

In a sequence such as VCV, where V is any vowel and C is any consonant, is the medial C the coda of the first syllable (VC.V) or the onset of the second syllable (V.CV)?

To determine the correct groupings, there are some rules, the two most important being the Maximal Onset Principle and the Sonority Hierarchy.

An important grapheme-phoneme data set

CMU Pronouncing Dictionary

A machine-readable pronunciation dictionary for North American English that contains over 125,000 words and their transcriptions.

The current phoneme set contains 39 phonemes

“Parallel” Corpus

Phoneme  Example  Translation
-------  -------  -----------
AA       odd      AA D
AE       at       AE T
AH       hut      HH AH T
AO       ought    AO T
AW       cow      K AW
AY       hide     HH AY D
B        be       B IY

“Parallel” Corpus contd.

Phoneme  Example  Translation
-------  -------  -----------
CH       cheese   CH IY Z
D        dee      D IY
DH       thee     DH IY
EH       Ed       EH D
ER       hurt     HH ER T
EY       ate      EY T
F        fee      F IY
G        green    G R IY N
HH       he       HH IY
IH       it       IH T
IY       eat      IY T
JH       gee      JH IY
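The word/transcription pairs above can be read into a lookup table with a few lines of code. This is a sketch over an inline sample in the "word, then space-separated phonemes" layout shown on the slides; the names `SAMPLE` and `parse_entries` are made up here, and entries in the distributed dictionary may carry extra markers (such as stress digits) that this sketch does not handle.

```python
# A few entries copied from the slides, one per line.
SAMPLE = """\
odd AA D
at AE T
hut HH AH T
cheese CH IY Z
green G R IY N
"""


def parse_entries(text):
    """Map each word to its list of phoneme symbols."""
    lexicon = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, *phones = line.split()
        lexicon[word.lower()] = phones
    return lexicon


lexicon = parse_entries(SAMPLE)
print(lexicon["hut"])
print(len(lexicon["green"]), "phonemes for", len("green"), "letters")
```

Viewed this way the dictionary is a parallel corpus of grapheme strings and phoneme strings, which is what makes it usable as training data for grapheme-to-phoneme models.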

Transliteration: grapheme-phoneme interaction

Work of Arjun (PhD student) and Swapnil (Masters student)

Sandhan

Mono & Cross Lingual Information Retrieval (CLIR) engine for Indian languages
Input: Query in one of the nine Indian languages (Hindi, Marathi, Tamil, Telugu, Bengali, Punjabi, Assamese, Gujarati, Oriya)
Output: In Hindi, English and the query language
Built on top of the Nutch-Solr framework

Online processing

[Figure: pipeline with the components Snippet Translation, Summary Generation, Snippet Generation, Translation/Transliteration, MWE Lookup, NE Lookup, Analyzer, Query Formulation, Index, Information Extraction]

Query translation is a major component in CLIR
Query translation = word translations + word transliteration
Transliteration accuracy directly affects CLIR performance [AbdulJaleel & Larkey, 2003]
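The relation "query translation = word translations + word transliteration" can be pictured as a dictionary lookup with a transliteration fallback. The sketch below is not the Sandhan implementation; `translate_query`, the toy dictionary and the toy transliteration function are all hypothetical stand-ins.

```python
def translate_query(query, bilingual_dict, transliterate):
    """Translate word by word; transliterate words missing from the dictionary."""
    output = []
    for word in query.split():
        if word in bilingual_dict:
            output.append(bilingual_dict[word])
        else:
            # Names and other out-of-dictionary words are transliterated.
            output.append(transliterate(word))
    return " ".join(output)


# Toy stand-ins, for illustration only.
toy_dict = {"nadi": "river"}


def toy_transliterate(word):
    return word.capitalize()


print(translate_query("ganga nadi", toy_dict, toy_transliterate))
```

Every word that falls through to the fallback depends entirely on the transliteration step, which is why transliteration accuracy shows up directly in retrieval quality.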

Query transliteration

Transliteration

Transliteration is the process of transforming a word from the script of one language to that of another

Word pronunciation is generally preserved, or is modified according to its pronunciation in the target language

Examples:

Hindi      Gujarati   English
अर्जुन      અર્જુન      Arjun
अंकुर       અંકુર       Ankur
राहुल       રાહુલ       Rahul
स्वप्निल    સ્વપ્નિલ    Swapnil

Linguistic challenges

Features affecting transliteration
  Schwa deletion
  Diphthongs
  Differences in character set
Each of them adds to ambiguity

What is Schwa deletion

Phenomenon where the implicit schwas of a word are deleted for correct pronunciation
What is a schwa?
  A short ‘a’ vowel sound attached to every consonant by default.
  This sound can be changed into any other vowel sound by use of matras (dependent vowels)
  Schwa is the default vowel for a consonant
  No explicit matra is required to represent a schwa
Examples:
  ‘गलत’ (galat, wrong): ‘a’ deletion after the ‘t’ sound

Effect of Schwa deletion

Schwa deletion is observed in Indo-Aryan languages but not in Dravidian languages.
  For example, in Hindi, ‘अर्जुन’ is pronounced as ‘Arjun’ while in Kannada it is pronounced as ‘Arjuna’
Can be present in any part of the word
  For example, ‘गलती’ (galti, mistake) in Hindi observes implicit schwa deletion after the consonant ‘ल’ (la)
Increases ambiguity in transliteration
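To make the ambiguity concrete, here is a small romanization sketch that adds the implicit schwa after each bare consonant and optionally drops it at the end of the word. It covers only the handful of letters needed for the slide examples; the mapping tables and the name `romanize` are assumptions, and word-final deletion is the only rule implemented.

```python
# Only the letters needed for the examples; code points are Devanagari.
CONSONANTS = {"\u0917": "g", "\u0932": "l", "\u0924": "t",   # ग ल त
              "\u0930": "r", "\u091c": "j", "\u0928": "n"}   # र ज न
VOWELS = {"\u0905": "a"}                                     # अ
MATRAS = {"\u0940": "i", "\u0941": "u"}                      # ी ु
HALANT = "\u094d"                                            # ्


def romanize(word, delete_final_schwa=True):
    """Romanize with an implicit 'a' after each bare consonant."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            bare = nxt not in MATRAS and nxt != HALANT
            final = nxt == ""
            if bare and not (final and delete_final_schwa):
                out.append("a")
        elif ch in MATRAS:
            out.append(MATRAS[ch])
        elif ch in VOWELS:
            out.append(VOWELS[ch])
        # The halant contributes no sound of its own.
    return "".join(out)


galat = "\u0917\u0932\u0924"                    # गलत
arjun = "\u0905\u0930\u094d\u091c\u0941\u0928"  # अर्जुन
print(romanize(galat))                            # final schwa dropped
print(romanize(arjun, delete_final_schwa=False))  # final schwa kept
```

The word-final rule alone does not produce ‘galti’ for ‘गलती’: the medial schwa after ‘ल’ survives, which illustrates why deletion "in any part of the word" is the harder case.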

Diphthongs

Two vowel sounds coming together in a single syllable unit
Schwa along with an independent vowel can together be replaced by a diphthong
  For example, in Hindi, the word ‘मुंबई’ (mumbai, Mumbai) can also be represented as ‘मुंबै’
  Schwa (a) and the independent vowel ‘ई’ (i) combine to form the dependent vowel ‘◌ै’ (ai)
Mostly observed in Dravidian languages
  Any vowel in the middle of the word gets converted to a dependent vowel or diphthong

Differences in character set

Indian languages contain different sets of consonants
Examples: Assamese, Bengali and Tamil have fewer consonants
  In Bengali, the sounds ‘ba’ and ‘va’ are represented by the same character ‘ব’ (ba)
  Tamil does not distinguish between voiced and unvoiced consonants.
  The sounds ‘k’, ‘kh’, ‘g’ and ‘gh’ are represented using a single character

Previous work

Om transliteration scheme (Madhavi et al., 2005) provides a script representation which is common for all Indian languages.

Surana et al. (2008) highlighted the importance of the origin of the word and proposed different ways of transliteration based on its origin

Malik (2006) tried to solve a special case of Punjabi machine transliteration, converting Shahmukhi to Gurumukhi using rule based transliteration

Gupta et al. (2010) used an intermediate notation called WX notation to transliterate a word from one language to another

Rama et al. (2009) proposed the use of an SMT system for machine transliteration. Words are split into constituent characters and each character acts like a word in a sentence.

Approaches for transliteration

Rule based
  Can be as simple as one to one character mapping between languages
  Set of rules for each language pair
Statistical
  Word divided into segments.
  Segmentation can be at character or syllable level.
  Phrases learnt using SMT techniques.

Rule based

Character to character (offset) mapping
Set of rules to tackle mapping ambiguity
  For example, while transliterating to Bengali, all mappings from ‘va’ can be mapped to ‘ba’.

Hindi  Kannada
-----  -------
क      ಕ
ख      ಖ
ग      ಗ
घ      ಘ
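The offset mapping in the table above works because the Unicode blocks for the major Indian scripts are laid out in parallel, so a character can often be moved to another script by adding a constant to its code point. A minimal sketch for Devanagari to Kannada follows; the function name is made up, and since not every shifted code point is an assigned or equivalent character in the target block, a real system needs exception rules on top of this.

```python
DEVANAGARI_START = 0x0900
DEVANAGARI_END = 0x097F
KANNADA_START = 0x0C80


def devanagari_to_kannada(text):
    """Shift each Devanagari code point by the fixed block offset."""
    out = []
    for ch in text:
        cp = ord(ch)
        if DEVANAGARI_START <= cp <= DEVANAGARI_END:
            out.append(chr(cp - DEVANAGARI_START + KANNADA_START))
        else:
            out.append(ch)  # leave everything else untouched
    return "".join(out)


# क ख ग घ, the four consonants from the table.
print(devanagari_to_kannada("\u0915\u0916\u0917\u0918"))
```

Language-specific rules, such as mapping ‘va’ to ‘ba’ for Bengali, would be applied as overrides before or after the shift.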

Unicode transfer based transliteration

Statistical

Character based (CS)
  Word split into individual characters

Hindi               Kannada             English
व ◌ि द ◌् य ◌ा ल य   ವ ◌ಿ ದ ◌್ ಯ ◌ಾ ಲ ಯ   v i d y a l a y
अ र ◌् ज ◌ु न        ಅ ರ ◌್ ಜ ◌ು ನ        a r j u n

Syllable based
  Word split into syllables
  Automatic syllabification difficult

Hindi          Kannada        English
विद् या लय     ವಿದ್ ಯಾ ಲಯ     vid ya lay
अर् जुन        ಅರ್ ಜುನ        ar jun
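For the character based setup, preparing SMT training data amounts to writing each word as a space-separated sequence of its characters, so that every character plays the role of a word in a sentence. A sketch, with hypothetical function names:

```python
def character_segments(word):
    """Write a word as space-separated Unicode code points."""
    return " ".join(word)


def make_parallel_line(source_word, target_word):
    """Return one (source, target) training pair in character-segmented form."""
    return character_segments(source_word), character_segments(target_word)


arjun_hi = "\u0905\u0930\u094d\u091c\u0941\u0928"  # अर्जुन
src, tgt = make_parallel_line(arjun_hi, "arjun")
print(src)
print(tgt)
```

Splitting by code point separates matras and the halant from their consonants, which is the behaviour shown in the first table above.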

Vowel based segmentation

Split the word at vowel boundaries
  उ द्या न ---- u dya n
Learn joint characters as a whole rather than as individual parts

Hindi          Kannada        English
वि द्या ल य    ವಿ ದ್ಯಾ ಲ ಯ    vi dya lay
अ र्जु न       ಅ ರ್ಜು ನ       a rju n

Algorithm

Input: word in an Indian language
Output: vowel segments
Scan the characters of the input word from left to right
For each character c, insert a space after c if
  c is a vowel
  c is a full consonant and the next character is not a vowel
  next character is anusvar OR visarga OR chandra bindu OR Chandra
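The first two conditions of the algorithm can be sketched for Devanagari as below. Several points are assumptions rather than part of the slides: the code point ranges used for vowels, matras and consonants; treating a consonant followed by a halant as "not full"; and, because the last condition is cut off in the slide text, the choice to keep anusvara, visarga and chandrabindu attached to the segment they follow.

```python
HALANT = "\u094d"
# Candrabindu, anusvara, visarga: kept with the preceding segment (assumption).
MODIFIERS = {"\u0901", "\u0902", "\u0903"}


def is_independent_vowel(ch):
    return "\u0904" <= ch <= "\u0914"


def is_matra(ch):
    return "\u093e" <= ch <= "\u094c"


def is_consonant(ch):
    return "\u0915" <= ch <= "\u0939"


def vowel_segments(word):
    """Split a Devanagari word into vowel segments."""
    segments, current = [], ""
    for i, ch in enumerate(word):
        current += ch
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if nxt in MODIFIERS:
            continue  # the sign joins the current segment
        if is_independent_vowel(ch) or is_matra(ch) or ch in MODIFIERS:
            boundary = True
        elif is_consonant(ch):
            # A full consonant is one not followed by a matra or halant.
            boundary = not (is_matra(nxt) or nxt == HALANT)
        else:
            boundary = False
        if boundary:
            segments.append(current)
            current = ""
    if current:
        segments.append(current)
    return segments


udyan = "\u0909\u0926\u094d\u092f\u093e\u0928"  # उद्यान
print(" ".join(vowel_segments(udyan)))
```

On उद्यान this yields the three segments of the slide example, and conjuncts such as द्या stay together as one unit.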

Algorithm - English

Input: word in English
Output: vowel segments for the input word
Scan the characters of the input word from left to right
For each character c, insert a space after c if c is a vowel and the next character is not a vowel.
End
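The English-side rule translates almost directly into code. In this sketch the vowel set (a, e, i, o, u) and the function name are assumptions; segments are returned as a list rather than a space-joined string.

```python
VOWELS = set("aeiou")


def vowel_segments_en(word):
    """Close a segment after each vowel that is not followed by a vowel."""
    segments, current = [], ""
    for i, ch in enumerate(word):
        current += ch
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch.lower() in VOWELS and nxt.lower() not in VOWELS:
            segments.append(current)
            current = ""
    if current:
        segments.append(current)  # trailing consonants form the last segment
    return segments


print(" ".join(vowel_segments_en("udyan")))
print(" ".join(vowel_segments_en("arjun")))
```

Runs of vowels stay in one segment, so a word like "mumbai" keeps "ai" together.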

Experiments

Transliteration within Indian languages (IL-IL)
  System tested for 10 Indian languages, viz. Assamese (as), Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Punjabi (pa), Tamil (ta) and Telugu (te)
Transliteration from Indian language to English (IL-En)
  System tested for transliteration from 10 Indian languages to English

Dataset

Names collected from India Child Names [1] and Bachpan [2]
Parallel data collected from IndoWordNet (used as test data)
Data for each language pair is variable.
  Data size varies from 800 to 25000 words

[1] www.indiachildnames.com
[2] www.bachpan.com

Indian language to Indian language transliteration

Results

N-fold evaluation
  10-fold evaluation for 100 Indian language pairs
Testing on held-out test data
  Both systems tested on test data of 1000 words extracted from IndoWordNet
Accuracy at rank 1 is used for evaluation
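Accuracy at rank 1 counts a test word as correct only when the system's top-ranked candidate equals the reference. A minimal sketch, with a hypothetical function name and made-up toy data:

```python
def accuracy_at_rank_1(ranked_outputs, references):
    """Percentage of items whose top candidate matches the reference."""
    if len(ranked_outputs) != len(references):
        raise ValueError("need one candidate list per reference")
    if not references:
        return 0.0
    hits = sum(1 for candidates, ref in zip(ranked_outputs, references)
               if candidates and candidates[0] == ref)
    return 100.0 * hits / len(references)


# Toy data: the third word has the right answer only at rank 2.
outputs = [["arjun", "arjuna"], ["ankur"], ["rahool", "rahul"]]
gold = ["arjun", "ankur", "rahul"]
print(round(accuracy_at_rank_1(outputs, gold), 2))
```

In a 10-fold setup this figure would be computed on each held-out fold and then averaged.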

N-Fold Evaluation - Accuracies

CS = character segmentation, VS = vowel segmentation. x marks the same-language diagonal, which has no entry.

       pa     as     bn     hi     gu     mr     te     kn     ml     ta
pa CS  x      53.28  77.78  57.47  58.12  57.80  94.00  60.96  59.73  58.13
   VS  x      82.71  93.07  98.29  91.00  81.12  97.88  97.73  98.60  98.42
as CS  40.09  x      54.93  56.14  61.05  48.70  64.79  61.28  57.75  57.38
   VS  82.92  x      84.57  85.81  84.93  82.34  84.25  85.05  82.97  83.48
bn CS  52.16  63.89  x      57.44  77.51  49.21  95.82  61.99  56.50  60.21
   VS  92.98  86.94  x      97.83  92.38  82.37  98.01  97.70  98.92  98.37
hi CS  49.28  57.87  69.36  x      54.16  42.57  91.89  60.74  51.79  55.15
   VS  89.38  85.98  88.96  x      91.56  84.20  95.63  95.26  97.12  96.33
gu CS  62.94  60.43  77.34  87.69  x      52.87  96.94  65.50  54.23  58.00
   VS  91.34  86.79  92.21  99.11  x      85.80  97.38  97.16  98.32  97.82
mr CS  45.56  58.15  57.92  43.55  60.21  x      62.85  73.60  55.13  47.88
   VS  81.60  87.84  84.45  78.45  83.30  x      77.33  77.16  76.56  80.34
te CS  83.68  50.52  85.84  88.21  95.46  43.72  x      64.69  84.23  80.15
   VS  98.27  80.15  97.63  99.12  97.38  77.60  x      99.26  97.29  98.11
kn CS  82.65  53.66  83.26  88.17  94.45  49.07  99.05  x      97.94  90.08
   VS  98.11  82.01  97.23  99.06  96.85  79.35  99.21  x      99.66  99.54
ml CS  84.73  48.62  96.91  88.16  97.18  43.71  98.30  86.23  x      87.51
   VS  99.30  79.19  99.25  99.12  98.62  79.11  99.30  99.65  x      98.43
ta CS  78.75  52.59  82.81  53.83  81.81  44.58  81.87  82.86  81.77  x
   VS  95.68  78.61  95.82  96.18  96.26  76.46  96.61  97.42  97.09  x

Observations – N-fold evaluation

Vowel segmentation (VS) system outperforms character segmentation (CS) system for all pairs

For many pairs, an accuracy difference of around 30% is observed

Page 62: CS626 : Natural Language Processing/Speech, NLP and …pb/cs626-2014/cs626-lect34-36... · CS626 : Natural Language Processing/Speech, ... Places of articulation 16 oct, 2014 Phonetics-phonology,

Baseline – Rule based system

62

'.' marks the diagonal.

    pa    as    bn    hi    gu    mr    te    kn    ml    ta
pa  .     56.1  79.5  94.5  84.7  66.2  94    92.3  93.7  67.7
as  62.2  .     80.1  65.3  59.8  59.8  55.3  58.2  55.7  -
bn  84.6  81    .     94.1  76.7  58    94.7  90.9  96.4  71.7
hi  84.7  64.3  73.1  .     85.3  73.7  93.4  92.9  96.1  73.6
gu  87.5  60.3  77.4  97.9  .     81.4  95.8  95.1  98.4  72.9
mr  65.2  61.1  56.4  73.2  82.9  .     67.5  66.6  67.2  -
te  96.9  56.1  91.9  98.5  96.8  65.8  .     98.9  99.1  73.1
kn  97.8  56.9  90.1  98.7  95.8  67.4  98.6  .     99.2  72.9
ml  99.3  54    95.3  98.9  98.2  68.4  99.4  99.2  .     72.5
ta  78.5  -     78.5  79.1  79.6  -     81.2  81.5  77.5  .

Character set differences impact accuracies!

CS versus VS – Accuracies (Ignoring OOVs)

Each cell: CS/VS. '.' marks the diagonal.

    pa           as           bn           hi           gu           mr           te           kn           ml           ta
pa  .            77.50/82.93  89.80/93.78  96.80/98.60  90.30/89.48  77.80/78.91  95.70/98.00  96.80/98.40  96.90/98.59  98.50/98.29
as  73.17/83.72  .            82.58/86.89  76.38/87.21  74.45/85.32  71.07/81.35  71.69/83.75  73.95/86.58  69.35/82.29  -
bn  90.30/93.09  78.60/87.70  .            97.40/97.79  90.40/93.80  68.20/80.66  96.20/96.99  95.50/97.09  98.40/98.20  97.70/97.98
hi  86.40/87.60  79.30/84.94  79.70/88.29  .            81.20/87.99  72.77/82.88  95.70/96.49  93.30/93.59  95.40/96.79  95.60/95.79
gu  89.30/88.80  83.00/87.01  84.10/91.19  98.70/98.99  .            81.60/82.98  97.00/97.00  95.70/96.70  98.00/98.39  98.00/98.19
mr  78.70/79.88  79.48/89.37  75.40/84.64  66.87/75.88  77.40/81.25  .            67.00/76.46  75.05/79.28  69.62/76.84  -
te  97.40/98.40  75.35/80.76  96.40/98.09  99.20/99.30  97.60/98.20  70.10/77.23  .            98.70/98.80  99.00/97.79  98.50/98.79
kn  97.60/98.40  76.48/82.19  94.60/97.40  98.50/98.90  96.20/96.89  71.64/81.03  99.20/99.60  .            99.50/99.90  98.90/99.30
ml  99.00/99.10  73.45/80.14  99.60/99.70  99.10/99.30  98.40/99.00  72.09/80.05  98.90/99.40  99.80/99.90  .            97.20/97.89
ta  84.10/94.25  -            86.20/95.36  86.80/95.46  86.70/96.68  -            86.50/96.59  86.90/96.18  85.70/96.17  .

Consistent improvement in accuracies!

Side Effect (OOVs)

Each cell: CS/VS number of OOVs. '.' marks the diagonal.

    pa      as      bn      hi      gu      mr      te      kn      ml      ta
pa  .       0/16    0/3     0/2     0/2     0/9     0/2     0/3     0/4     0/4
as  1/17    .       0/0     1/62    2/19    1/51    4/114   2/76    5/153   -
bn  0/1     0/0     .       0/5     0/0     0/7     0/3     0/4     0/2     0/8
hi  0/0     0/4     0/1     .       0/1     0/0     0/4     0/2     0/4     0/3
gu  0/0     0/15    0/1     0/5     .       0/1     0/1     0/1     0/4     0/3
mr  0/1     1/50    0/4     0/0     0/8     .       0/74    2/59    6/171   -
te  0/2     2/101   0/4     0/2     0/2     0/60    .       0/2     0/3     0/9
kn  0/3     1/79    0/0     0/4     0/4     2/67    0/3     .       0/4     0/2
ml  0/1     17/139  0/6     0/5     0/2     4/153   0/4     0/1     .       0/7
ta  0/8     -       0/9     0/9     0/6     -       0/3     0/4     0/8     .

Lesser coverage!!!

OOVs Impact

Each cell: CS/VS. '.' marks the diagonal.

    pa           as           bn           hi           gu           mr           te           kn           ml           ta
pa  .            77.50/81.60  89.80/93.50  96.80/98.40  90.30/89.30  77.80/78.20  95.70/97.80  96.80/98.10  96.90/98.20  98.50/97.90
as  73.10/82.30  .            82.58/86.89  76.30/81.80  74.30/83.70  71.00/77.20  71.40/74.20  73.80/80.00  69.00/69.70  -
bn  90.30/93.00  78.60/87.70  .            97.40/97.30  90.40/93.80  68.20/80.10  96.20/96.70  95.50/96.70  98.40/98.00  97.70/97.20
hi  86.40/87.60  79.30/84.60  79.70/88.20  .            81.20/87.90  72.77/82.88  95.70/96.10  93.30/93.40  95.40/96.40  95.60/95.50
gu  89.30/88.80  83.00/85.70  84.10/91.10  98.70/98.50  .            81.60/82.90  97.00/96.90  95.70/96.60  98.00/98.00  98.00/97.90
mr  78.70/79.80  79.40/84.90  75.40/84.30  66.87/75.88  77.40/80.60  .            67.00/70.80  74.90/74.60  69.20/63.70  -
te  97.40/98.20  75.20/72.60  96.40/97.70  99.20/99.10  97.60/98.00  70.10/72.60  .            98.70/98.60  99.00/97.50  98.50/97.90
kn  97.60/98.10  76.40/75.70  94.60/97.40  98.50/98.50  96.20/96.50  71.50/75.60  99.20/99.30  .            99.50/99.50  98.90/99.10
ml  99.00/99.00  72.20/69.00  99.60/99.10  99.10/98.80  98.40/98.80  71.80/67.80  98.90/99.00  99.80/99.80  .            97.20/97.20
ta  84.10/93.50  -            86.20/94.50  86.80/94.60  86.70/96.10  -            86.50/96.30  86.90/95.80  85.70/95.40  .

Correctness versus coverage

Vowel segmentation increases correctness with reduced coverage

Character segmentation has high coverage with low accuracy

Ideal transliteration system should have high coverage and high accuracy
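The trade-off can be made concrete with a small arithmetic sketch. It assumes that an OOV item produces no output and is counted as wrong once OOVs are included; the slides do not spell out the exact scoring. The test-set size (1000) and the number of correct outputs (946) are made-up illustration values, and the function names are hypothetical.

```python
def accuracy_ignoring_oovs(correct: int, total: int, oovs: int) -> float:
    """Accuracy (%) over the items the system actually attempted."""
    return 100.0 * correct / (total - oovs)


def accuracy_counting_oovs(correct: int, total: int) -> float:
    """Accuracy (%) over all items; OOV items are treated as errors (assumption)."""
    return 100.0 * correct / total


# Made-up numbers: 1000 test items, 9 of them OOV, 946 transliterated correctly.
total, oovs, correct = 1000, 9, 946
print(round(accuracy_ignoring_oovs(correct, total, oovs), 2))
print(round(accuracy_counting_oovs(correct, total), 2))
```

The first figure rewards correctness on the covered items only; the second is lower whenever there are OOVs, which is the coverage penalty described above.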

Hybrid System

Use Character segmentation system for OOVs

Only segments not transliterated by VS system are transliterated using CS system
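A minimal sketch of this fallback, assuming the VS system can be queried per segment and signals an OOV by returning None. The lookup tables and function names below are toy stand-ins, not the actual systems.

```python
from typing import Callable, List, Optional


def hybrid_transliterate(
    segments: List[str],
    vs_lookup: Callable[[str], Optional[str]],
    cs_transliterate: Callable[[str], str],
) -> str:
    """Transliterate with the VS system; use the CS system only for OOV segments."""
    out = []
    for seg in segments:
        target = vs_lookup(seg)  # None means the VS system has no entry (OOV)
        if target is None:
            target = cs_transliterate(seg)  # fall back for this segment only
        out.append(target)
    return "".join(out)


# Toy stand-ins for the two systems (hypothetical data).
VS_TABLE = {"ka": "KA", "ma": "MA"}
CS_TABLE = {"k": "K", "a": "A", "m": "M", "l": "L"}


def toy_vs(seg: str) -> Optional[str]:
    return VS_TABLE.get(seg)


def toy_cs(seg: str) -> str:
    # Character-by-character mapping; unknown characters pass through unchanged.
    return "".join(CS_TABLE.get(ch, ch) for ch in seg)


print(hybrid_transliterate(["ka", "ma", "la"], toy_vs, toy_cs))
```

Here the segment "la" is missing from the toy VS table, so only that segment is routed through the character-level fallback.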

Hybrid System

Each cell: CS/VS. '.' marks the diagonal.

    pa           as           bn           hi           gu           mr           te           kn           ml           ta
pa  .            77.50/82.50  89.80/93.70  96.80/98.60  90.30/89.50  77.80/78.90  95.70/97.90  96.80/98.40  96.90/98.50  98.50/98.30
as  73.10/83.10  .            82.58/86.89  76.30/85.90  74.30/84.80  71.00/80.60  71.40/81.70  73.80/85.20  69.00/78.40  -
bn  90.30/93.10  78.60/87.70  .            97.40/97.80  90.40/93.80  68.20/80.60  96.20/96.90  95.50/97.00  98.40/98.20  97.70/98.00
hi  86.40/87.60  79.30/84.80  79.70/88.30  .            81.20/88.00  72.77/82.88  95.70/96.50  93.30/93.60  95.40/96.70  95.60/95.80
gu  89.30/88.80  83.00/87.00  84.10/91.20  98.70/99.00  .            81.60/83.00  97.00/97.00  95.70/96.70  98.00/98.40  98.00/98.20
mr  78.70/79.90  79.40/88.60  75.40/84.40  66.87/75.88  77.40/81.40  .            67.00/74.60  74.90/78.60  69.20/73.90  -
te  97.40/98.40  75.20/79.80  96.40/98.10  99.20/99.30  97.60/98.20  70.10/76.90  .            98.70/98.80  99.00/97.70  98.50/98.80
kn  97.60/98.40  76.40/81.30  94.60/97.40  98.50/98.90  96.20/96.80  71.50/79.60  99.20/99.60  .            99.50/99.90  98.90/99.30
ml  99.00/99.10  72.20/77.70  99.60/99.60  99.10/99.30  98.40/99.00  71.80/77.70  98.90/99.40  99.80/99.90  .            97.20/97.90
ta  84.10/94.30  -            86.20/95.30  86.80/95.50  86.70/96.60  -            86.50/96.60  86.90/96.20  85.70/95.90  .

Summary

Ta - Hi        Accuracy                F-Score
Rule Based     81.7                    0.85
CS vs. VS      CS: 86.80, VS: 95.46    CS: 0.975, VS: 0.989
OOVs impact    CS: 86.80, VS: 94.60    CS: 0.975, VS: 0.987
Hybrid System  CS: 86.80, VS: 95.50    CS: 0.975, VS: 0.989

Error analysis

Modified Levenshtein distance algorithm to compute the confusion matrix

Errors in the CS-based system persist in the VS-based system, but to a lesser extent

Common errors include confusion between:

Matras

Voiced / unvoiced consonants

Aspirated / unaspirated consonants
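The slides do not say how the Levenshtein algorithm was modified, so the following is only a sketch of one plausible approach: run the standard edit-distance dynamic program, backtrace through it to align the expected and produced strings, and count each (expected, output) mismatch. Deletions and insertions are recorded against NULL. The function names are hypothetical, and the alignment works on Unicode code points, so matras are counted as separate symbols.

```python
from collections import Counter
from typing import Iterable, List, Tuple

NULL = "NULL"


def align_errors(expected: str, output: str) -> List[Tuple[str, str]]:
    """Return the (expected, output) mismatches on one minimum-cost alignment."""
    n, m = len(expected), len(output)
    # dist[i][j] = edit distance between expected[:i] and output[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if expected[i - 1] == output[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitution
                             dist[i - 1][j] + 1,         # expected symbol missing
                             dist[i][j - 1] + 1)         # extra symbol in output
    errors = []
    i, j = n, m
    while i > 0 or j > 0:  # backtrace from the bottom-right corner
        if i > 0 and j > 0:
            cost = 0 if expected[i - 1] == output[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                if cost:
                    errors.append((expected[i - 1], output[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            errors.append((expected[i - 1], NULL))
            i -= 1
        else:
            errors.append((NULL, output[j - 1]))
            j -= 1
    errors.reverse()
    return errors


def confusion_counts(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Accumulate error counts over (expected word, system output) pairs."""
    counts = Counter()
    for expected, output in pairs:
        counts.update(align_errors(expected, output))
    return counts


print(confusion_counts([("थल", "तल"), ("durga", "durg")]).most_common())
```

Sorting the resulting counts in descending order gives a "top errors" list of the kind shown for Tamil - Hindi.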

Top 10 Errors Tamil - Hindi

CS System

    Expected  Output  Count
1   थ (tha)  त (ta)  52
2   त (ta)  थ (tha)  22
3   ख (kha)  क (ka)  19
4   फ (fa)  प (pa)  14
5   झ (jha)  ज (ja)  13
6   ◌् (half-consonant modifier)  NULL  4
7   क (ka)  ख (kha)  2
8   छ (chha)  च (cha)  2
9   न (na)  ◌ं ('ma' or 'na')  2
10  न (na)  ण (na)  2

VS System

    Expected  Output  Count
1   थ (tha)  त (ta)  16
2   झ (jha)  ज (ja)  7
3   ख (kha)  क (ka)  5
4   ◌् (half-consonant modifier)  NULL  4
5   त (ta)  थ (tha)  3
6   फ (fa)  प (pa)  3
7   न (na)  ◌ं ('ma' or 'na')  2
8   न (na)  ण (na)  2
9   म (ma)  ◌ं ('ma' or 'na')  2
10  ◌ी (ee)  ◌ि (i)  2

How much is sufficient for Training?

Increase in training data reduces number of OOVs

A threshold on the number of OOVs helps determine the lower limit for training data size

We used fewer than 50 OOVs as the threshold
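One way to apply such a threshold is sketched below, assuming the training data is grown in fixed steps and OOVs are counted as test segments never seen in training. The function names, step size and toy data are hypothetical; the default of 50 mirrors the threshold mentioned above.

```python
from typing import List, Optional


def count_oovs(train_segments: List[str], test_segments: List[str]) -> int:
    """Number of test segments that never occur in the training data."""
    known = set(train_segments)
    return sum(1 for seg in test_segments if seg not in known)


def minimal_training_size(
    train_segments: List[str],
    test_segments: List[str],
    step: int,
    max_oovs: int = 50,
) -> Optional[int]:
    """Smallest training prefix (in multiples of `step`) with fewer than `max_oovs` OOVs."""
    sizes = list(range(step, len(train_segments), step)) + [len(train_segments)]
    for size in sizes:
        if count_oovs(train_segments[:size], test_segments) < max_oovs:
            return size
    return None  # even the full training data leaves too many OOVs


# Toy data with a tiny threshold so the effect is visible.
train = ["ka", "ma", "la", "ra", "ta", "na"]
test = ["ka", "la", "na", "ta", "pa"]
print(minimal_training_size(train, test, step=2, max_oovs=2))
```

Plotting the OOV count against the training size, rather than stopping at the first size under the threshold, gives the stepwise view used in the following slides.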

Tamil – Hindi stepwise accuracies

Step Wise results – Tamil as source

Tamil reaches higher coverage with less training data

Malayalam needs more training data to cover all segments

Indian language to English transliteration

Challenges – Origin plays an important role

Many Indian-language words are borrowed from Sanskrit; these are called tatsama words.

For example, दुर्ग (durg, fort) is retained in Hindi, Marathi and Kannada

दुर्ग is transliterated as durg in ‘Sindhudurg’ and as durga in ‘Chitradurga’

Sindhudurg reflects Marathi influence, while Chitradurga reflects Kannada influence

Origin of the place influences how it is written in English

Challenges- One to many mapping

Single character in Indian language represented by multiple characters in English

For example, ‘ख’ in Hindi is represented by ‘kh’ in English

Same character in Indian language may be represented by multiple English segments

For example, ‘◌ी’ in Hindi can be represented as ‘i’ as well as ‘ee’
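Both kinds of one-to-many mapping can be illustrated with a toy candidate generator. The mapping table below is a hand-written assumption containing only the characters needed for one word; it is not the mapping learned by either system, and it ignores schwa handling entirely.

```python
from itertools import product
from typing import Dict, List

# Toy character-to-segment table (hypothetical, far from complete).
MAPPING: Dict[str, List[str]] = {
    "ख": ["kh"],        # one Devanagari character, two English characters
    "ी": ["i", "ee"],   # one matra, several possible English segments
    "र": ["r"],
}


def candidates(word: str) -> List[str]:
    """Enumerate every English spelling allowed by the toy table."""
    options = [MAPPING[ch] for ch in word]
    return ["".join(choice) for choice in product(*options)]


print(candidates("खीर"))
```

A real system has to rank such candidates, since the number of combinations grows multiplicatively with each ambiguous character.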

Results

N-fold evaluation performed on 10 Indian languages to English

    as     bn     gu     hi     kn     ml     mr     pa     ta     te
CS  04.62  05.66  12.45  11.71  12.31  12.72  08.24  06.39  07.74  10.25
VS  50.68  37.75  48.87  57.21  48.94  48.49  45.77  39.28  35.39  46.99

Error analysis

Schwa deletion and insertion are the major causes of errors in transliteration

CS system: character deletion (mainly at the end of the word); schwa deletion

VS system: schwa deletion and insertion; confusion in matras

Top Ten Errors – Hindi to English

CS System

Expected  Output  Count
a         NULL    33754
h         NULL    15239
n         NULL    12165
y         NULL    5498
r         NULL    4009
t         NULL    3188
e         NULL    2377
k         NULL    1608
NULL      a       1556
l         NULL    1505

VS System

Expected  Output  Count
a         NULL    9177
NULL      a       3651
NULL      i       2784
n         NULL    2657
NULL      h       1095
e         i       1081
e         NULL    869
NULL      u       730
NULL      e       717
i         e       502

Impact on Sandhan: Marathi CLIR

Query translation evaluation

             Char based Transliteration  Vowel segmentation  Google translate
Not stemmed  0.44375                     0.5375              0.6375
Stemmed      0.475                       0.55625             0.65

Total Marathi-English translations evaluated = 80

Quality  Score
Good     1
Medium   0.5
Bad      0
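The reported figures are consistent with a mean quality score over the 80 judged translations, although the slides do not state the aggregation explicitly. The sketch below assumes that reading; the particular split of 25 good, 21 medium and 34 bad judgements is invented purely to show the arithmetic.

```python
from typing import List

# Quality scale from the slide.
SCORES = {"good": 1.0, "medium": 0.5, "bad": 0.0}


def mean_quality(judgements: List[str]) -> float:
    """Average quality score over all judged translations."""
    return sum(SCORES[j] for j in judgements) / len(judgements)


# Invented split of 80 judgements: 25 + 21 * 0.5 = 35.5 points in total.
judgements = ["good"] * 25 + ["medium"] * 21 + ["bad"] * 34
print(mean_quality(judgements))
```

Under this reading, a score of 0.44375 corresponds to 35.5 points out of a possible 80.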

Examples

Marathi query | Char based system | VS system | Google translate
सागरेश्वर | sagareshawar | sagareshwar | sagares on
ठोसेघर धबधबा | thoseghr waterfall | thoseghar waterfall | thoseghara chute
उरमोडी नदी | urmodi river | uramody river | uramodi river
संत ज्ञानेश्वर | saint jayaneshawar | saint jjnaneshwar | sant gyaneshwar

Error analysis of translations

Incorrect entries in the translation dictionary. Example: district misspelt as distt

Most frequently used translation word not picked. Example: किल्ला (killa, fort) gets translated to stronghold instead of fort

Stemming error. Example: लेणी (leni, cave) gets stemmed to लेणे (lene)