Post on 29-May-2018
CS460/626 : Natural Language Processing/Speech NLP and the WebProcessing/Speech, NLP and the Web
(Lecture 36syllabification and transliteration)transliteration)
Pushpak BhattacharyyaPushpak BhattacharyyaCSE Dept., IIT Bombay
11th A il 201111th April, 2011
Access from one language to another
Language L1 Language L2
T l ti T lit tiTranslation Transliteration
Gandhiji Boy
Cross Lingual Query
Query Translatio
n
Search
Birthplace of Gandhi Search
Engine
Ga d
English Document
Translate Document
Output
Sound based transliterationSound based transliteration
TransliterationScript Script ()(Gandhiji)
Transliteration using pronunciationConvert script to soundConvert sound to scriptConvert sound to script
Similar to interlingua based MT
MeaningSentence in source Language
Sentence in target LanguageLanguage Language
Two approaches to transliterationTwo approaches to transliterationSyllabification
Rule Machine Based Learning Based
Word SyllabifiedApple ap pleFundamental fun da men tal
Use GIZA++ Moses on parallel corpus
Phoneme-based approachPhoneme-based approach
Word in Word in P(w )Source language Target language
P( ps | ws) P ( wt | pt )
P(wt)
Pronunciationin
Source language
PronunciationIn
target languageP ( pt | ps )
Wt* = argmax (P (wt). P (wt | pt) . P (pt | ps) . P (ps | ws) )
Note: Phoneme is the smallest linguistically distinctive unit of sound.
Basics of syllablesBasics of syllablesSyllable is a unit of spoken languageSyllable is a unit of spoken languageconsisting of a single uninterrupted soundformed generally by a Vowel and preceded orfollowed by one or more consonants.
Vowels are the heart of a syllable (MostSonorous Element) (svayam raajate itisvaraH)Consonants act as sounds attached to
lvowels.
Syllable structure
A syllable consists of 3 major parts:-Onset (C)( )Nucleus (V)Coda (C)( )
Vowels sit in the Nucleus of a syllableConsonants may get attached as OnsetConsonants may get attached as Onset or Coda.Basic structure - CVBasic structure CV
Possible syllable structuresThe Nucleus isThe Nucleus is always presentOnset and CodaOnset and Coda may be absentPossiblePossible structures
VVCVVCCVC
Introduction to sonority theoryThe Sonority of a sound is its loudness
relative to other sounds with the same length stress and speech length, stress and speech.
Some sounds are more sonorousSome sounds are more sonorousWords in a language can be divided into syllablesySonority theory distinguishes syllables on the basis of sounds.
Sonority hierarchyDefined on the basis of amount of sound associatedThe sonority hierarchy is as follows:-
Vowels (a, e, i, o, u)Li id ( l )Liquids (y, r, l, v)Nasals (n, m)Fricatives (s z f sh th etc )Fricatives (s, z, f,..sh, th etc.)Affricates (ch, j)Stops (b, d, g, p, t, k)p ( , , g, p, , )
Sonority scaleObstruents canObstruents can be further classified into:-classified into:
FricativesAffricatesAffricatesStops
Sonority theory & syllablesA Syllable is a cluster of sonority, defined by a sonority peak acting as a structural magnet to the surrounding lower sonority elements the surrounding lower sonority elements.
Represented as waves of sonority orRepresented as waves of sonority or Sonority Profile of that syllable
Nucleus
Onset Coda
Sonority sequencing principle
The Sonority Profile of a syllable must rise until its Peak(Nucleus), and then fall.( ),
PeakPeak(Nucleus)
Onset Coda
Maximal onset principleThe Intervocalic consonants are maximally
assigned to the Onsets of syllables inf it ith U i l d Lconformity with Universal and Language-
Specific Conditions.Determines underlying syllable divisionDetermines underlying syllable divisionExample
DIPLOMADIPLOMADIP LO MA & DI PLOMAMA
Syllable Structure: a more detailed look
Count of no. of syllables in a word is roughly/intuitively the no. of vocalic segments in a word.Thus, presence of a vowel is an obligatory element in the structure of a syllable. This vowel is called nucleus.Basic Configuration: (C)V(C).Part of syllable preceding the nucleus is called the onset.Elements coming after the nucleus are called the codaElements coming after the nucleus are called the coda.Nucleus and coda together are referred to as the rhyme.
S Syllable, O OnsetR Rhyme, N NucleusCo Coda
Syllable Structure: ExamplesSyllable Structure: Examples
word
sprint
Syllable Structure: ExamplesSyllable Structure: Examples
may
No Coda.
opt
No Onset.No Onset.
air
No Coda, No Onset.
Syllable StructureSyllable StructureOpen Syllable: ends in vowelClosed syllable: ends in consonant or consonant cluster
Light Syllable: A syllable which is open and ends in a shortLight Syllable: A syllable which is open and ends in a short vowel
General Description CV.Example, air.Example, air .
Heavy Syllable: Closed syllables or syllables ending in diphthong
Example: optExample: optExample, may
Syllabification: Determining Syllable Boundaries
Given a string of syllables (word), what is the coda of one and the onset of another?
In a sequence such as VCV, where V is any vowel and C is any consonant, is the medial C the coda of the first syllable (VC.V) or the onset of the second syllable (V.CV)? ( ) y ( )
To determine the correct groupings, there are some rules, two of them being the most important and significant:two of them being the most important and significant:
Maximal Onset Principle, Sonority Hierarchy
Rule Based SyllabificationRule Based Syllabification
(This and statistical syllabification and application to Transliteration: DDP work by
Ankit Aggrwal 2009)Ankit Aggrwal, 2009)
PhonotacticsPhonotactics
Guidesconstructionofonsetsandcodas
Restrictscombinationofphonemes.
Ingeneral,rulesoperatearoundthesonorityhierarchy.E.g.,Fricative/s/isloweronthesonorityhierarchythanthelateral/l/,sothecombination /sl/ is permitted in onsets and /ls/ is permitted in codas Oppositecombination/sl/ispermittedinonsetsand/ls/ispermittedincodas.Oppositeisnotallowed.
Thus,slipsandpulsearepossibleEnglishwords.
lsipsandpuslarenotpossible.
Constraints on Onsets1
Oneconsonant:Only//cannotoccurinsyllableinitialposition.
Twoconsonant:Werefertothescaleofsonority.Sequencernisruledoutsincethereisadecreaseofsonority.
MinimalSonorityDistance:Distanceinsonoritybetweenthefirstandthesecondelementintheonsetmustbeofatleast2degrees.
Thus,onlyalimitedno.ofpossibletwoconsonantclusters.
Threeconsonant:Restrictedtolicensedtwoconsonantonsetsprecededby/s/.
Also,/s/canonlybefollowedbyavoicelesssound.
Therefore,only/spl/,/spr/,/str/,/skr/,/spj/,/stj/,/skj/,/skw/,/skl/,/ j/ ill b ll d ( li )/smj/willbeallowed.(splinter,spray,strongetc.)
While/sbl/,/sbr/,/sdr/,/sgr/,/sr/willberuledout.
Constraints on Onsets2
Possible2consonatclustersinanOnset
Constraints on Coda1Constraints on Coda
Constraints on Coda2Constraints on Coda
Other ConstraintsOther Constraints
Nucleus:Thefollowingcanoccurasnucleus:Allvowelsounds(monophthongs aswellasdiphthongs).
Syllabic:yBoththeonsetandthecodaareoptional(asseenpreviously).
/j/attheendofanonset(/pj/,/bj/,/tj/,/dj/,/kj/,/fj/,/vj/,/j/,/sj/,/zj/,/hj/,/mj/,/nj/,/lj/,/spj/,/stj/,/skj/)mustbefollowedby/u/or//.
Longvowelsanddiphthongsarenotfollowedby//.//israreinsyllableinitialposition.
Stop+/w/before/u,,,a/areexcluded.
Implementation1p
1. Identify the first nucleus (single or multiple consecutive vowels) in y ( g p )the word.
2. Parse all the preceding consonants as the onset of the first syllable.3. Find the next nucleus in the word. If unsuccessful in finding another
l 'll i l th t t th i ht f thnucleus, we'll simply parse the consonants to the right of the current nucleus as the coda of the first syllable, else we will move to the next step. Now the consonant cluster between these two nuclei have to be divided in two parts, one as coda of 1st syllable and the other onset of 2nd syllable.
4. If the no. of consonants in the cluster is one, it will be the onset of the second nucleus as per the Maximal Onset Principle and Constrains on OnsetConstrains on Onset.
5. If it is two, check whether both of these can go to the onset of the second syllable as per the allowable onsets.
Implementation2p6. If not , the first consonant will be the coda of the first syllable and
the second consonant will be the onset of the second syllable.the second consonant will be the onset of the second syllable.
7. If the no. of consonants in the cluster is three, check if they can b th t f th d ll bl if t h k f th l tbe the onset of the second syllable, if not, check for the last two, if not, parse only the last consonant as the onset of the second syllable.
8. If the no. of consonants in the cluster is more than three, except the last three consonants, parse all the consonants as the coda of the first syllable. With the remaining 3 consonants, apply step 7step 7.
9. Continue the procedure for the rest of the string.
AccuracyAccuracy=(no.ofwordscorrectlysyllabifiedx100) %
no.ofwords
10,000easywordswerechosentoestablishproofofconcept
91wordswerefoundtobeincorrectlysyllabified.
Accuracy:99.09%.
Highprecision,lowrecall(acharacteristicofknowledgebasedAI)
Statistical syllabificationStatistical syllabification
Statistical Machine Syllabification
DataSourcesofdata
1. ElectionCommissionofIndia(ECI)NameList:ThiswebsourceprovidesnativeIndiannameswritteninbothEnglishandHindi.
( )2. DelhiUniversity(DU)StudentList:ThiswebsourcesprovidesnativeIndiannameswritteninEnglishonly.Thesenamesweremanuallytransliteratedforthepurposesoftrainingdata.
3. IndianInstituteofTechnology,Bombay(IITB)StudentList:TheAcademicd a s u e o ec o ogy, o bay ( ) S ude sOfficeofIITBprovidedthisdataofstudentswhograduatedintheyear2007.
4. NamedEntitiesWorkshop(NEWS)2009EnglishHindiparallelnames:Ali t f i d b t E li h d Hi di f i 11k i id dlistofpairednamesbetweenEnglishandHindiofsize11kisprovided.
Westartwith8,000randomlychosennamesfromtheECINameli d ll bif h lllistandsyllabifythemmanually.
Statistical Machine Syllabification
EffectofTrainingFormatSyllableseparatedFormat
Source Targetsudakar sudakarchhagan chhaganj i t h ji t hjitesh jiteshnarayan narayanshiv shivmadhav madhavmohammad mohammadj a y a n t e e d e v i ja yan tee de vi
SyllablemarkedFormat
jayanteedevi jayanteedevi
Source Targetsudakar su_da_karchhagan chha_ganjitesh ji_teshnarayan na_ra_yanshiv shivmadhav ma_dhavmohammad mo_ham_madjayanteedevi ja_yan_tee_de_vi
Statistical Machine Syllabification
Syllableseparated Syllablemarked
80%85%90%95%100%
veAccuracy
y p y
60%65%70%75%80%
1 2 3 4 5
Cumulativ
AccuracyLevel
Top1Accuracy:71.8%(Syllableseparated);80.5%(Syllablemarked)
T 5 A 83 4% (S ll bl d) 90 4% (S ll blTop5Accuracy:83.4%(Syllableseparated);90.4%(Syllablemarked)
Statistical Machine Syllabification
EffectofDataSize4differentexperimentswereperformed:
1. 8k:NamesfromtheECINamelist.
2. 12k:Anadditional4knamesweresyllabifiedtoincreasethedatasize.
3. 18k:IITBStudentListandtheDUStudentListwasincludedandsyllabified.
4. 23k:SomemorenamesfromECINameListandDUStudentListweresyllabifiedandthisdataactsasthefinaldata forus.
93.8%97.5% 98.3% 98.5% 98.6%
90.0%
95.0%
100.0%
ccuracy
8k 12k 18k 23k
70.0%
75.0%
80.0%
85.0%
CumulativeA
1 2 3 4 5
AccuracyLevel
Statistical Machine Syllabification
EffectofLanguageModelngramOrder
99.0%
3gram 4gram 5gram 6gram 7gram
95.0%
97.0%
ccuracy
89.0%
91.0%
93.0%
CumulativeAc
85.0%
87.0%
1 2 3 4 5
AccuracyLevel
Eff t f S ll bifi tiEffect of Syllabification on TransliterationTransliteration
Transliteration using syllabification
Input String
Moses
Syllabified data
(sourceside)Automatic S llabifieMoses Syllabifier
S ll bifi d St i
Learnt syllabifier
Syllabified StringLearnt
transliterator
Moses Automatic TransliteratorSyllabified
data(parallel)(parallel)
Transliterated String
Statistical Machine Transliteration
100%
Syllableseparated Syllablemarked
65%70%75%80%85%90%95%100%
tive
Accuracy
45%50%55%60%65%
1 2 3 4 5 6
Cumulat
A L l
Top1Accuracy:60.1%(Syllableseparated);50.2%(Syllablemarked)
Top 5 Accuracy: 87 2% (Syllable separated); 79 3% (Syllable marked)
AccuracyLevel
Top5Accuracy:87.2%(Syllableseparated);79.3%(Syllablemarked)
No syllabification: Baseline Performance
Topn CorrectCorrect%age
Cumulative%age
Additional Additional
1 1,868 41.5% 41.5%2 520 11.6% 53.1%3 246 5.5% 58.5%4 119 2 6% 61 2%4 119 2.6% 61.2%5 81 1.8% 63.0%
Below5 1,666 37.0% 100.0%4,500
TheTop5Accuracyisjust63.0%whichismuchlowerthanrequired.
40
Manual, i.e., ideal syllabification: , , yskyline Performance
Syllableseparated Syllablemarked
80%85%90%95%100%
veAccuracy
y p y
60%65%70%75%80%
1 2 3 4 5
Cumulativ
AccuracyLevel
Top1Accuracy:71.8%(Syllableseparated);80.5%(Syllablemarked)
T 5 A 87 4% (S ll bl d) 90 4% (S ll blTop5Accuracy:87.4%(Syllableseparated);90.4%(Syllablemarked)
41
Obsevations
Transliteration likely to be helped by syllabificationsyllabificationRule based syllabification: high precision low recallprecision low recallStatistical syllabification: reasonable accuracy at top 5accuracy at top-5
ReferencesReferences1. NasreenAbdulJaleelandLeahS.Larkey.Statisticaltransliterationforenglish
arabiccrosslanguageinformationretrieval.InConferenceonInformationandKnowledgeManagement,pages139146,2003.
2. AnnK.FarmerAndrianAkmajian,RichardM.DemersandRobertM.Harnish.Linguistics:AnIntroductiontoLanguageandCommunication.MITPress,5thgu st cs t oduct o to a guage a d Co u cat o ess, 5tedition,2001.
3. AssociationforComputerLinguistics.CollapsedConsonantandVowelModels:NewApproachesforEnglishPersianTransliterationandBackTransliteration,20072007.
4. SlavenBilacandHozumiTanaka.Directcombinationofspellingandpronunciationinformationforrobustbacktransliteration.InConferencesonComputationalLinguisticsandIntelligentTextProcessing,pages413424,2005.
5. IanLaneBingZhao,NguyenBachandStephanVogel.Aloglinearblocktransliterationmodelbasedonbistreamhmms.HLT/NAACL2007,2007.
References6. H.L.JinandK.F.Wong.AChinesedictionaryconstructionalgorithmfor
informationretrieval.InACMTransactionsonAsianLanguageInformationProcessing,pages281296,December2002.
7. K.KnightandJ.Graehl.Machinetransliteration.InComputationalLinguistics,pages24(4):599612,Dec.1998.pages ( ) 599 6 , ec 998
8. LeeFengChienLongJiang,MingZhouandChenNiu.Namedentitytranslationwithwebminingandtransliteration.InInternationalJointConferenceonArtificialIntelligence(IJCAL07),pages16291634,2007.
9. DanMateescu.EnglishPhoneticsandPhonologicalTheory.2003.
10. DellaPietraP.BrownandR.Mercer.Themathematicsofstatisticalmachinetranslation:Parameterestimation.InComputationalLinguistics,page19(2):263U 311, 1990.19(2):263U 311,1990.
11. GanapathirajuMadhaviPrahalladLavanyaandPrahalladKishore.AsimpleapproachforbuildingtransliterationeditorsforIndianlanguages.ZhejiangUniversitySCIENCE2005,2005.