Proceedings of the 8th National Natural Language Processing Research Symposium, pages 72-80. De La Salle University, Manila, 24-25 November 2011.

Sentence-level Morphological and Phonological Analyzer for Filipino

Alina, Angelo Nico
3315 Michaelangelo BFRV, Las Piñas City
(+63)9179921179
[email protected]

Cambaliza, Carlo
258 Cuenca St. Ayala Alabang Village, Muntinlupa
(+63)9178222822
[email protected]

Sta. Ana, Xedric
103 M.H. del Pilar St. SFDM, Quezon City
(+63)9175255377
[email protected]

Sosa, Judd
14 P. Gomez St. Sta. Ana, San Mateo, Rizal
(+63)9267057654
juddphilip_sosa@yahoo.com

Chu, Shirley
De La Salle University
2401 Taft Avenue, Manila, Philippines
(+632)524-4611
[email protected]

ABSTRACT

This paper discusses filSPAM (Sentence-level Phonological and Morphological Analyzer for Filipino). Given an input sentence in Tagalog, the system outputs the part-of-speech, root word, affixes and phonemic notation of each word in the sentence. The system makes use of the existing systems MAGTag and HATPOST to handle morphological analysis and part-of-speech tagging, respectively. The system has four modules: a POS tagger, which has 54% accuracy; a morphological analyzer, which has 73.02% accuracy; a corpus-based phonological analyzer; and an unknown word handler with two methods, the automaton and the generalized tree, which have 67% and 64% accuracy, respectively.

Categories and Subject Descriptors

D.2.10 [Software Engineering]: Design – methodologies, representation.

General Terms

Algorithms, Documentation, Performance, Experimentation, Languages, Theory

Keywords

Natural language processing, phonological analysis, morphological analysis, part-of-speech tagging

1. INTRODUCTION

Morphological analysis is an important process in natural language processing. It deals with the identification of a root word and its affixes (morphemes) from a morphed word. Phonology, in turn, has to do with how a word is voiced or sounded out. Various approaches and systems are used in morphological analysis to generate rules for different languages, such as MAGTag [1] for Tagalog and KIMMO [9] for Japanese. These differ in their methods for identifying and classifying morphemes as well as for handling ambiguity. Although there are systems which handle morphology for Filipino, most of these are limited in that they are only word-level and do not cover rules for phonology. MAGTag in particular was used for this research as it is the only morphological analysis system for the language that was available and accessible to the proponents. MAGTag uses a set of rules to determine the root word and the affixes of an input word.

Part-of-speech tagging is an integral part of sentence analysis that is concerned with annotating the part-of-speech of each word in a sentence. There are existing tools for part-of-speech tagging for Tagalog, such as HATPOST [2] and TPOST [8]. HATPOST was chosen because it combines rule-based and statistical approaches to increase the accuracy of tagging words.

These components, namely MAGTag and HATPOST, function independently of one another, and each has its own limitations that need to be addressed. This research seeks to construct a sentence-level morphological and phonological analyzer for the Filipino language that integrates the aforementioned components in order to identify the part-of-speech of each Filipino word in a sentence and to generate the root word and phonology of the identified words.

Phonology is the study of meaningful sounds in speech and how they are used in natural language. According to the work of Schachter & Otanes [4], some general rules may be made regarding Tagalog phonology. In Tagalog, vowel length is significant, since some word pairs which have the same spelling yet differ in meaning are distinguished on the basis of their vowel length. Tagalog vowel length is marked by a raised dot, as in /i·/, in phonemic notation. There are also instances wherein different phonemes may be interchanged without any effect on the meaning of the word.

In general, Tagalog words are spelled in the manner that they are pronounced, even in the case of some loan words. This means that consonant phonemes are usually represented according to the letter used in the word. Vowel phonemes, however, each have two allophones, although these allophones may also be used interchangeably without any effect on the meaning of the word. In a syllable, the vowel is understood to be the syllable nucleus, which is the most prominent sound in the syllable. The patterns of Tagalog syllables are often either consonant-vowel or consonant-vowel-consonant. A syllable may also include a consonant cluster, which is considered as one consonant in the pattern. However, there are certain restrictions on which pairs of consonants are accepted as consonant clusters.

In the case of words that begin with a vowel, there is always a glottal stop /’/ at the initial position of the word, since all words must start with a consonant phoneme. The glottal stop /’/ is considered a consonant although it is not represented in conventional spelling. In disyllabic words, if the first syllable does not end in a consonant, the general rule states that the vowel of this syllable is long. Furthermore, in words that do not end with a consonant, a glottal stop /’/ or glottal fricative /h/ may occur in the word-final position; these are consonant phonemes but are likewise not represented in conventional orthography.

Finite-state transducers can be used to represent phonological rules. Figure 1 represents the English flapping rule using a sub-sequential finite-state transducer: an underlying t is realized as a flap after a stressed vowel and any number of r’s, and before an unstressed vowel.

Figure 1. Sub-sequential transducer for English flapping rule.
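The flapping rule stated above can be sketched as a small state machine that holds back a t until the next symbol is known. This is only an illustrative sketch, not the transducer drawn in Figure 1: the token conventions are assumptions (lowercase ARPAbet-like symbols, vowels ending in a stress digit with "1" for stressed and "0" for unstressed, and "dx" for the flap), and the function name `flap` is hypothetical.

```python
def is_vowel(tok):
    # Assumption: vowel tokens end in a stress digit, e.g. "ah1" or "er0".
    return tok[-1].isdigit()


def is_stressed(tok):
    return tok.endswith("1")


def flap(tokens):
    """Rewrite t as dx after a stressed vowel (plus any r's) and before an unstressed vowel."""
    out = []
    state = 0  # 0: no context, 1: stressed vowel + r* seen, 2: context seen and a t is pending
    for tok in tokens:
        if state == 2:
            # The delayed t is resolved by the symbol that follows it.
            if is_vowel(tok) and not is_stressed(tok):
                out.append("dx")
            else:
                out.append("t")
            state = 0
        if is_vowel(tok) and is_stressed(tok):
            out.append(tok)
            state = 1
        elif tok == "r" and state == 1:
            out.append(tok)          # any number of r's keeps the context alive
        elif tok == "t" and state == 1:
            state = 2                # delay the output until the next symbol is read
        else:
            out.append(tok)
            state = 0
    if state == 2:
        out.append("t")              # final-state output: flush a pending t
    return out
```

Delaying the output of t until the following symbol is read is what makes the machine deterministic on its input, which is the defining property of a sub-sequential transducer.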

The phonological-rule induction is based on the Onward Sub-sequential Transducer Inference Algorithm (OSTIA). OSTIA takes as input a training set of input-output pairs and first builds a tree from them. The root of the tree is the initial transducer state, and each leaf of the tree corresponds to an input sample. The output symbols are placed as near as possible to the root of the tree.

Figure 2. Example tree constructed
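The tree construction described above can be sketched as follows. This is a minimal illustration of the tree-building step only; the state-merging phase of OSTIA is not shown. The class and function names are hypothetical, the training pairs reuse transcriptions quoted later in this paper, and the glottal stop is written as an ASCII apostrophe.

```python
from os.path import commonprefix


class Node:
    def __init__(self):
        self.edges = {}    # input symbol -> [output string, child Node]
        self.final = None  # output emitted when the input ends at this node


def build_prefix_tree(pairs):
    """Build a tree over the input words; each full output starts at its word's last node."""
    root = Node()
    for word, phon in pairs:
        node = root
        for ch in word:
            if ch not in node.edges:
                node.edges[ch] = ["", Node()]
            node = node.edges[ch][1]
        node.final = phon
    return root


def make_onward(node):
    """Move common output prefixes toward the root; return the prefix pushed upward."""
    for edge in node.edges.values():
        edge[0] += make_onward(edge[1])
    outs = [edge[0] for edge in node.edges.values()]
    if node.final is not None:
        outs.append(node.final)
    prefix = commonprefix(outs)
    for edge in node.edges.values():
        edge[0] = edge[0][len(prefix):]
    if node.final is not None:
        node.final = node.final[len(prefix):]
    return prefix


def transduce(root, initial, word):
    """Follow the word through the tree, concatenating the outputs along the path."""
    out, node = initial, root
    for ch in word:
        piece, node = node.edges[ch]
        out += piece
    return out + node.final


pairs = [("pera", "pe:rah"), ("paos", "pa'os"), ("batas", "batas")]
root = build_prefix_tree(pairs)
initial = make_onward(root)
```

After `make_onward`, the edge for the first letter p already carries the output shared by pera and paos, while the whole output of batas sits on its first edge, since no other training word shares that branch.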

2. Architectural Design

The Sentence-level Morphological and Phonological Analyzer for Filipino is a Java-based system that analyzes and generates the morphology and phonology of Filipino words in a sentence. The system identifies the part-of-speech, root word, affixes and phonology of each word of the input Tagalog sentence. Its features may be summarized into three functions: Morphological Analysis, Part-of-Speech Tagging and Phonological Analysis. These modules operate independently, meaning that they are able to function apart from each other and the results of one do not affect another. An exception to this will be discussed in 2.4.

2.1 Part-of-Speech Tagging

HATPOST [2] was utilized to handle most of the part-of-speech tagging for the wrapper class. HATPOST (shown in Figure 4) has a set of generated POS tags which consists of the Rabo-Buban tagset along with its own set of defined tags (refer to Table 1). The tagset was not modified further by the proponents.

Figure 3. System Flow

Figure 4. Process Flow of POS Module

2.2 Morphological Analysis

Morphological analysis is handled by MAGTag [1] (shown in Figure 5), which determines the root word and affixes of a word. It uses numerous rules to determine the root word of each word in the sentence. It can also generate the POS tags of those words using the same ruleset.

Figure 5. Process Flow of Morphological Analyzer Module


Table 1. Rabo-Buban Tagset

Part-of-Speech    POS Engine Output    Description
Common Noun       NNC                  Common
Proper Noun       NNP                  Proper
                  NNPP                 Proper Abbreviation
Adjective         JJD                  Describing
                  JJC                  Same-level Comparison
                  JJCC                 Comparison Comparative
                  JJCS                 Comparison Superlative
                  JJCN                 Comparison Negation
                  JJN                  Describing Number
Verb              VBW                  Neutral Infinitive
                  VBS                  Pseudoverb
                  VBH                  Existential
                  VBL                  Linking Verb
                  VBN                  Non-existential
                  VBTS                 Time Past
                  VBTR                 Time Present
                  VBTF                 Time Future
                  VBTP                 Recent Past
                  VBAF                 Actor Focus
                  VBOF                 Object Focus
                  VBOB                 Benefactive Focus
                  VBOL                 Locative Focus
                  VBOI                 Instrumental Focus
                  VBRF                 Referential Focus
Adverb            RBD                  Describing “How”
                  RBN                  Number
                  RBC                  Comparison
                  RBK                  Conditional
                  RBP                  Causative
                  RBB                  Benefactive
                  RBR                  Referential
                  RBQ                  Question
                  RBT                  Agree
                  RBF                  Disagree
                  RBW                  Frequency
                  RBM                  Possibility
                  RBI                  Enclitics
                  RBL                  Place
                  RBJ                  Interjections
                  RBS                  Social Formula

Others:
pronoun           PRS                  Personal Singular
                  PRSP                 Possessive Subject
                  PROP                 Possessive Object
                  PRQP                 Interrogative Plural
                  PRL                  Location
                  PRC                  Comparison
                  PRF                  Found
                  PRI                  Indefinite Number
determiner        DTC                  Common Noun
                  DTCC                 Plural Common Noun
                  DTP                  Proper Noun
                  DTPP                 Plural Proper Noun
conjunction       CCA                  Proposition
                  CCP                  Ligatures
                  CCT                  Undefined
                  CCR                  Undefined
                  CCB                  Undefined
cardinality       CDB                  Digit, Rank, Count
punctuation       PMP                  Period
                  PME                  Exclamation Point
                  PMQ                  Question Mark
                  PMC                  Comma
                  PMS                  Symbol

Unknown:
n/a               etc                  etc

Figure 6. Process Flow of Phonological Analyzer Module

2.3 Phonological Analysis

The phonological analysis module was created by the proponents. This module identifies the phonology of a word by retrieving its corresponding phonology from the database. The database consists of two tables: known and unknown. The known table consists of words and their corresponding phonology from [4], together with words verified by the linguist. The known corpus was also populated with Tagalog function words (particles, conjunctions, prepositions) along with pronouns, as opposed to content words (nouns, verbs), since function words are constant and stable for any language. The listings for the function words were retrieved from [5]. The unknown table consists of words and their corresponding phonology as output by the pre-defined automata used in the unknown word handler.

Figure 7. Process Flow of Unknown Handler Module

2.4 Unknown Word Handler

The unknown word handler module identifies the phonology of words that are not found in the corpus (shown in Figure 7). This module applies general phonological rules to determine the phonology of the word. The word and its corresponding phonology are then added to the unknown table. The rules are implemented manually by the proponents. The words in the unknown table are ultimately added to the known corpus during the latter phases of the module, after the collaborating linguists have verified them.
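The interplay between the two tables and the unknown word handler might look like the sketch below. It is an assumption-laden illustration: the tables are modeled as plain dictionaries rather than database tables, and `lookup_phonology`, `verify` and the `handler` callable are hypothetical names, not part of the actual system.

```python
def lookup_phonology(word, known, unknown, handler):
    """Return the stored phonology of a word, generating it if the word is new."""
    if word in known:
        return known[word]
    if word not in unknown:
        # Generated entries stay in the unknown table until a linguist verifies them.
        unknown[word] = handler(word)
    return unknown[word]


def verify(word, known, unknown):
    """Move a linguist-verified entry from the unknown table to the known table."""
    known[word] = unknown.pop(word)


known = {"pera": "pe:rah"}
unknown = {}
first = lookup_phonology("pera", known, unknown, lambda w: w)
second = lookup_phonology("batas", known, unknown, lambda w: w)
```

Keeping generated entries apart from verified ones means that an incorrect rule output never contaminates the known corpus before a linguist has checked it.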

The unknown handler has two methods for determining phonology, namely the Automata and the Generalized Phonology Tree.

Initial Automata

The initial automaton was based on one- to two-syllable Tagalog words. This representation attempts to determine the stress/vowel length in a disyllabic word and also generates a word-initial glottal stop for words beginning with a vowel. The vowel is marked long if the penultimate syllable does not end in a consonant, as in lunes, or if the following syllable begins with any of the accepted consonant clusters, as in senyas. A special case is /ng/, which is one phoneme represented as /ŋ/. The word-final consonant phoneme for words ending with a vowel is set to the glottal stop /’/ by default. For any word of more than two syllables, only the last two syllables are processed and the preceding syllables are retained as is.

Figure 8. Initial finite-state automaton
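The rules of the initial automaton can be approximated in a few lines, as sketched below. This is not the automaton of Figure 8. Several details are assumptions: vowel length is written with ":", the glottal stop is written as an ASCII apostrophe, the cluster set is an illustrative subset taken from the examples in this paper, and the function name is hypothetical.

```python
VOWELS = set("aeiou")
CLUSTERS = {"ny", "ly"}  # illustrative subset only; the full list of accepted clusters is not given here


def transcribe_initial(word):
    """Apply the initial rules to the last two syllables of a word."""
    s = word.lower().replace("ng", "ŋ")           # /ng/ is the single phoneme /ŋ/
    nuclei = [i for i, ch in enumerate(s) if ch in VOWELS]
    if len(nuclei) >= 2:
        pen, last = nuclei[-2], nuclei[-1]
        between = s[pen + 1:last]                 # consonants separating the last two nuclei
        # Open penultimate syllable, or an accepted cluster starting the final syllable.
        if len(between) <= 1 or between in CLUSTERS:
            s = s[:pen + 1] + ":" + s[pen + 1:]   # mark the penultimate vowel as long
    if s[0] in VOWELS:
        s = "'" + s                               # word-initial glottal stop
    if s[-1] in VOWELS:
        s = s + "'"                               # default word-final glottal stop
    return s
```

Under these assumptions, `transcribe_initial("lunes")` gives `lu:nes` and `transcribe_initial("senyas")` gives `se:nyas`, while a word such as gagamba receives no vowel length because mb is not in the cluster set.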

The initial automaton was tested on 51 words from [4]. Its accuracy was only 27%. The problems encountered involved wrong symbols and word-final consonant phonemes. The symbol /./ represents stress and not vowel length; major stress is denoted by the symbol /:/. Other errors involved incorrect lengthening of vowel sounds, as in /maga:ling/. There is also an ambiguous case wherein the word labi may be transcribed as either /la:bi’/ or /labi’/, which have different meanings. The word-final consonant phoneme is not always the glottal stop /’/ but can also be the glottal fricative /h/, as in /maramih/, which was processed as /marami’/. Some phonological rules were also missed: two consecutive vowel phonemes from two separate syllables require a glottal stop /’/ between the syllables, as in /maba’it/, which was transcribed by the system as /maba:it/. Nevertheless, words that start with a vowel were correctly represented with a glottal stop /’/ at the initial position, as in /’a:teh/.

Revised Automata to handle Multi-syllabic words

The second automaton was modified to handle words with more than two syllables by applying rules to the syllables before the penultimate. The stress symbol /./ used in the previous automaton was also replaced by the major stress symbol /:/. The final consonant phoneme was chosen by a frequency count of which of the two final consonant phonemes (glottal stop /’/ or glottal fricative /h/) occurs more often in the known table. The initial automaton was modified as shown in Figure 9.

Figure 9. Second finite-state automaton
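The frequency count used to choose the default word-final consonant can be sketched as below. The function name is hypothetical, the glottal stop is written as an ASCII apostrophe, and breaking ties in favor of /h/ is an assumption.

```python
from collections import Counter


def default_final_consonant(known_phonologies):
    """Pick whichever of ' (glottal stop) or h ends more entries in the known table."""
    counts = Counter(p[-1] for p in known_phonologies if p and p[-1] in ("'", "h"))
    # Assumption: a tie is resolved in favor of the glottal fricative.
    return "h" if counts["h"] >= counts["'"] else "'"
```

Entries that end in an ordinary consonant, such as batas, do not take part in the count, since they need no added final phoneme.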

The second automaton was tested on the same 51 words from [4]. Its accuracy was 87%. The increase was attributed to the correct generation of the word-final consonant phoneme, wherein /h/ happened to occur more often. Yet there were still words with incorrect vowel lengthening, such as /da:mit/, for which it proved difficult to determine when the rule would or would not be applicable. The second automaton was then tested on 217 arbitrary Tagalog words from [5], which resulted in 55% accuracy.

After consultation with the collaborating linguist, misconceptions regarding the usage of some phonological symbols were addressed. The symbol /:/ denotes vowel length and not major stress. A consonant cluster does not always imply vowel length for the vowel of the syllable preceding it; for example, gagamba was erroneously transcribed as /gaga:mbah/. This is because /mb/ is not an accepted consonant cluster when considering preceding vowel length. A correct output with vowel length before a cluster would be /sigari:lyoh/. The pattern CV:C, a consonant-vowel sequence with vowel length /:/ followed by a consonant before the succeeding syllable, which was observed in the 51 words from [4], does not hold for most Tagalog words. Generally, a consonant-vowel pattern in the second-to-last syllable of the word has vowel length /:/, as in /pe:rah/. The issue with two consecutive vowel phonemes, which should be separated by a glottal stop /’/, was also addressed; for example, paos is now correctly transcribed as /pa’os/.

The concept of word-initial consonant-glide clusters was also not covered. For example, a word with word-initial /uw/ such as buwan is pronounced /bwan/, and a word with /iy/ such as niyog is pronounced /nyog/; the vowel before the glide consonant /w/ or /y/ may be omitted from the phonology. This does not apply to words such as /pruweba/, since the /uw/ is in separate syllables.
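A minimal sketch of the consonant-glide reduction follows. It works on the spelling before any other rule is applied. Restricting the match to a single word-initial consonant is an assumption made here so that words like pruweba are left untouched; the pattern and function name are hypothetical.

```python
import re

# One word-initial consonant, then "uw" or "iy", then a vowel.
GLIDE = re.compile(r"^([bcdfghjklmnpqrstvwxyz])(uw|iy)(?=[aeiou])")


def reduce_initial_glide(word):
    """Drop the vowel before the glide in a word-initial consonant-glide cluster."""
    return GLIDE.sub(lambda m: m.group(1) + m.group(2)[1], word.lower())
```

With this restriction, buwan becomes bwan and niyog becomes nyog, while pruweba and impluwensya are returned unchanged.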

Minor errors were caused by the word-final consonant phoneme. Also, certain words are ambiguous in that they may have two phonological representations based on vowel length, so an output for such a word is not necessarily incorrect. Lastly, Tagalog compound words had not been taken into consideration, as the automaton processes them as one word. The second automaton was modified based on the feedback of the linguist as shown in Figure 10.

Figure 10. Third finite-state automaton

Third Automata indicating improvement suggested by Linguists

The third automaton was tested on the same 217 words from [5]. Its accuracy is 67%. The errors encountered involved words that do not follow the general pattern in which a CV second-to-last syllable has vowel length /:/, as well as words that have the pattern CV:C before the final syllable. For instance, the word batas is pronounced /batas/ but was incorrectly annotated as /ba:tas/. Minor errors still involve the final consonant phoneme. The previous error regarding word-initial consonant-glide clusters has been handled; for example, kuwago is correctly transcribed as /kwa:goh/. However, the automaton incorrectly annotated impluwensya as /implwensyah/, a word to which the aforementioned rule is not applicable.

Generally, the rise in accuracy of the third automaton was attributed to the issues from the second automaton that were addressed. However, some words that were correct in the second automaton became incorrect, and vice versa. For instance, the word hadlang was incorrectly transcribed as /ha:dlaŋ/ by the second automaton but became correct in the third as /hadlaŋ/; on the other hand, the correct /rebe:ldeh/ from the second automaton became the incorrect /rebeldeh/ in the third. This is because the second automaton was based on observations made on the 51 words from [4], which were more applicable to words borrowed from Spanish or English such as /se:rmon/ and /ma:rsoh/. The third automaton, on the other hand, is based on the output of the second automaton tested on Tagalog words, together with the linguist’s feedback on that output. Compound words have also been tackled. This was done by splitting the compound word, running each part through the automaton and concatenating the results, since compound words generally retain the phonology of each constituent word.

For compound words, the notation used affects the meaning of the word. According to the linguist, words which have been compounded do not necessarily have a hyphen. For instance, in bahag-hari, /bahag-hari’/, both words are taken with their literal meanings, while bahaghari, /bahag+hari’/, has an idiomatic meaning. Since ambiguity that requires context is one of the limitations of the phonological analyzer module, words that have two pronunciations will only have one pronunciation in the output. Since the phonology of a compound word is similar to that of a non-compound word, the symbol /+/ is used to denote that the words are combined to form a new word without the symbol /-/. The third representation of the automata was then modified based on this feedback as shown in Figure 11.
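The compound-word treatment described above can be sketched as follows. How an unhyphenated compound is split into its constituents is not specified here, so the sketch simply takes the parts as input. `transcribe_compound` and the `transcribe` callable are hypothetical names, and the glottal stop is written as an ASCII apostrophe.

```python
def transcribe_compound(parts, transcribe, hyphenated):
    """Transcribe each constituent separately and join the results."""
    joiner = "-" if hyphenated else "+"   # "+" marks a compound written without a hyphen
    return joiner.join(transcribe(part) for part in parts)


# Toy word-level transcriber standing in for the automaton.
table = {"bahag": "bahag", "hari": "hari'"}
literal = transcribe_compound(["bahag", "hari"], table.get, hyphenated=True)
idiomatic = transcribe_compound(["bahag", "hari"], table.get, hyphenated=False)
```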


Figure 11. Final automata

Final automata with POS

In an attempt to deal with ambiguity, the part-of-speech is now also included in the database for known words. Ambiguity may involve word pairs that have the same spelling yet possess different phonology. These word pairs may also differ in their part-of-speech. An example is the word sama, which may be either /sa:mah/, a verb, or /sama’/, an adjective. As such, it is useful to also recognize the part-of-speech of the word in identifying its phonology, given that the word pairs have different parts-of-speech. However, this cannot handle word pairs that have different phonology yet the same part-of-speech.

The input is run through the part-of-speech tagger, which produces the corresponding tag for each word. To determine the appropriate phonology to use in the output, the automaton checks whether the tag produced by the POS tagger for each word matches the part-of-speech contained in the database. If the word does not exist with the generated tag, it goes through the normal process using the unknown handler. The generated tag is saved in the table for unknown words along with the phonology.

Specific tags generated by HATPOST and MAGTag were mapped to a general tag as shown in Table 2 since both use different tagsets.

Table 2. Final Automaton Tags used in Corpus

Part-of-Speech               Symbol
Nouns (common, proper)       N
Adjectives                   J
Verbs                        V
Adverbs                      R
Pronouns, Function Words     P
Unknown/Untagged             YYY
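The mapping to general tags and the POS-sensitive lookup can be illustrated as below. This is a sketch under stated assumptions: the mapping is keyed on the two-letter prefixes of the Table 1 tags, determiners and conjunctions are treated as function words, every other tag falls back to YYY, and all names are hypothetical. The tagset of MAGTag is not covered.

```python
# Assumption: general tags are derived from the first two letters of a Table 1 tag.
PREFIX_TO_GENERAL = {"NN": "N", "JJ": "J", "VB": "V", "RB": "R",
                     "PR": "P", "DT": "P", "CC": "P"}


def general_tag(tag):
    """Map a specific tag such as VBTS to a general tag such as V."""
    return PREFIX_TO_GENERAL.get(tag[:2].upper(), "YYY")


def phonology_for(word, tag, known, unknown_handler):
    """Use the tagged entry when one exists, else fall back to the unknown handler."""
    key = (word, general_tag(tag))
    if key in known:
        return known[key]
    return unknown_handler(word)


known = {("sama", "V"): "sa:mah", ("sama", "J"): "sama'"}
```

Keying the known table on the word together with its general tag lets the two readings of sama coexist, which a table keyed on spelling alone could not do.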

2.5 Sentence-level Phonological Analysis

This module identifies the phonology of the sentence as a whole by combining the word-level phonology and applying sentence-level phonological rules.

3. RESULTS AND DISCUSSIONS

This section gives, per module, an overview and some of the design and implementation issues, followed by the corresponding test phases.

For the testing, a general test data set was used for the system. These were articles in the Tagalog language that were retrieved from Tagalog Wikipedia and an internet blog. In the following sections, the two articles used for testing will be referred to as Blog and Wiki (refer to Table 3 for details of each).

The results were manually evaluated by the linguist. The accuracy is computed by getting the percentage of the number of correct words over the total number of words.

Table 3. Details for Test Data

       Source                                                 Word Count    Refer as
1st    http://tl.wikipedia.org/wiki/Pilipinas                 189           Wiki
2nd    http://www.perfspot.com/blogs/blog.asp?BlogId=49231    241           Blog
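The accuracy measure defined above is simple enough to state as a one-line function. The sketch below uses a hypothetical function name and checks it against the counts reported for Blog in Section 3.1 (241 words, with 110 and then 60 incorrectly tagged words).

```python
def accuracy(total_words, incorrect_words):
    """Percentage of correct words over the total number of words."""
    return 100.0 * (total_words - incorrect_words) / total_words


blog_hatpost = round(accuracy(241, 110), 2)     # HATPOST alone
blog_with_magtag = round(accuracy(241, 60), 2)  # after the MAGTag countercheck
```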

3.1 Part-of-Speech Tagger

HATPOST was developed as an undergraduate thesis by Ciego et al. (2007). It is integrated into the system to handle part-of-speech tagging. Integration of HATPOST was straightforward: its essential files were simply transferred to the root directory of the system so that the functions necessary for determining the parts-of-speech could be used.

The evaluation of HATPOST on the articles proved to be highly dependent on its training sets, and the results varied greatly with the quality of those training sets. The POS tagger was fed two articles of varying content. For Blog, the results contained 110 incorrectly tagged words, which implies that 54.36% of the total words were correctly tagged. The POS tagging module of MAGTag was then introduced in order to countercheck whether words that had been tagged as unknown by HATPOST were indeed unknown. The results improved: the number of incorrectly tagged words was reduced to 60, which implies that 75.10% were correctly tagged.

As for Wiki, the results contained 76 incorrectly tagged words, which implies that 59.79% of the total words were correctly tagged. After the introduction of the POS tagging module of MAGTag, only 42 words were incorrectly tagged, which results in 77.78% correctly tagged words.

HATPOST was not altered in the implementation nor was it retrained in any manner.
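The countercheck between the two taggers can be sketched as a simple fallback. This is a simplification: the real taggers work on whole sentences, whereas the sketch uses per-word callables as stand-ins. The names are hypothetical and the "unknown" label is an assumption about the tagger output.

```python
def tag_word(word, primary_tagger, fallback_tagger, unknown_tag="unknown"):
    """Ask the fallback tagger only when the primary tagger returns the unknown tag."""
    tag = primary_tagger(word)
    if tag == unknown_tag:
        tag = fallback_tagger(word)
    return tag


# Toy stand-ins for HATPOST (primary) and the POS module of MAGTag (fallback).
primary = lambda w: {"bansa": "NNC"}.get(w, "unknown")
fallback = lambda w: {"kumain": "VBTS"}.get(w, "unknown")
```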


3.2 Morphological Analyzer

MAGTag was used to analyze the words given to it and output their base forms (root words); it also doubled as a POS tagger when HATPOST failed to recognize a word’s appropriate POS tag. Initially, MAGTag was used as is; however, some errors relating to rule implementation and multiple executions were encountered and had to be resolved.

Unlike HATPOST, MAGTag is purely rule-based in its word analysis. Given this approach, its results are consistent across subsequent executions. The articles that were fed to the POS tagger were the same articles fed to MAGTag for morphological analysis. For Blog, there were 192 correctly analyzed words, which is 77.78% of the total number of words. For Wiki, MAGTag produced 138 correctly analyzed words, which is 73.02% of the total number of words. The errors generated by MAGTag were sorted into different categories.

Errors in Affixations

There were instances wherein the system produced errors after extracting the affixes of a word. These errors consisted of words with wrong affix definitions during the analysis, and occurred in the processing of the rules for the affix pattern of a word. An example is the word bansang: MAGTag did not consider the ng in bansang as an affix, specifically a suffix. If the affixes are wrongly or incompletely extracted, the analysis of the root word will also be incorrect. Results show that for Blog, 31.48% (17/54) of the errors fall under this classification; for Wiki, 47.06% (24/51) of the errors were under this category. Most of the errors consisted of the [g] affix not being extracted for words ending with [an] (e.g., “bansang”).

Errors due to Overanalysis

Overanalysis occurs when a word or a group of words is unnecessarily analyzed. Most occurrences happened upon encountering determiners and adverbs. These are words that are root words in essence but are not technically defined as actual root words. Examples are nang and habang, which MAGTag analyzed as ng and haba, respectively, although they were expected to retain their current forms. Of the errors that occurred, 40.74% (22/54) were classified into this category for Blog and 25.49% (13/51) for Wiki.

Errors due to Underanalysis

There were also instances wherein a word was not thoroughly analyzed. Words which exhibited affix reduplication were classified into this category. These errors consisted of words that needed further iterations of analysis; that is, the words are too complex, or their patterns do not match any of the rules needed to analyze them further. An example is the word katimugan, which results in katimog but should have been timog. Wiki and Blog had 15.69% (8/51) and 27.78% (15/54) of their errors in this category, respectively.

3.3 Phonological Analyzer

The automata used by this module were built through empirical analysis of input-output pairs and general observations on Tagalog phonology based on [4]. The automata were presented to the linguist and modified according to the linguist’s feedback. Two major sets of Tagalog words were used in testing the automata during their different phases. The first is a set of 51 words from [1] and the second is a set of 217 words from [5].

Sentence-level Phonology The sentence-level analyzer was tested on a Tagalog article composed of 11 sentences. According to [5], words in medial position in a sentence have their initial and final consonant phonemes omitted. This is because in sentence-level speech such phonemes are negligible in pronunciation, and omitting them does not affect the meaning of the sentence. The initial and final words, on the other hand, are retained as normal. Punctuation marks are ignored and a space is denoted by / . /. A correct output for the sentence-level analyzer would be /`i:saŋ . kapulu`an . aŋ . bansaŋ . pilipi:nas/. The only errors encountered were caused by erroneous word-level transcriptions of the individual words.
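A minimal sketch of this joining step is given below. It rests on an assumption inferred from the example output rather than stated in the text: only an edge glottal stop (written here with the ASCII stand-in `'`) or a word-final /h/ is treated as an omissible phoneme in medial words. The function name and the list-of-strings input are hypothetical.

```python
GLOTTAL = "'"  # ASCII stand-in for the glottal stop symbol


def sentence_phonology(word_transcriptions):
    """Join word-level transcriptions into one sentence-level transcription."""
    last = len(word_transcriptions) - 1
    out = []
    for i, w in enumerate(word_transcriptions):
        if 0 < i < last:  # medial words only; first and last stay as they are
            if w.startswith(GLOTTAL):
                w = w[1:]
            if w.endswith((GLOTTAL, "h")):
                w = w[:-1]
        out.append(w)
    # Word boundaries are marked with " . " as in the analyzer's output.
    return "/" + " . ".join(out) + "/"


words = ["'i:saŋ", "kapulu'an", "'aŋ", "bansaŋ", "pilipi:nas"]
print(sentence_phonology(words))
```

Because the function only joins what it is given, any error in a word-level transcription carries straight through to the sentence-level output, which matches the observation above.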

Generalized Phonology Tree An experiment was conducted in order to obtain a comparative analysis for the accuracy and efficiency of the initial automaton method. This experiment involved pattern-matching with multiple trees which will be generated based on training data. This was partially derived from the OSTIA algorithm [3].

The algorithm for the Generalized Phonology Tree has two main phases, namely training and output generation. To populate the trees, training data must be provided. The trees consist of input-output syllable pairs represented as CV (consonant-vowel) patterns. Each input-output syllable pair corresponds to a node: the root nodes are the first syllables, and subsequent syllables form the child nodes. Each node has a weight that denotes how many times the pattern occurred in the training data. The resulting generalized tree (an example is shown in Figure 12) can be used to generate phonology from given input words.
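The two phases can be sketched with nested dictionaries. This is a simplified illustration, not the filSPAM implementation: nodes are keyed by literal syllable pairs rather than CV patterns, the words are assumed to be already split into syllables, and the training counts below are toy data.

```python
def new_node():
    return {"weight": 0, "children": {}}


def train(tree, syllable_pairs):
    """Add one word, given as a list of (input_syllable, output_syllable) pairs."""
    children = tree
    for pair in syllable_pairs:
        node = children.setdefault(pair, new_node())
        node["weight"] += 1  # count how often this pattern occurs
        children = node["children"]


def generate(tree, input_syllables):
    """Greedy generation: follow the heaviest child whose input syllable matches."""
    children, output = tree, []
    for syl in input_syllables:
        matches = [(n["weight"], out, n)
                   for (inp, out), n in children.items() if inp == syl]
        if not matches:
            break  # no matching node: the transcription stays incomplete
        _, out, node = max(matches, key=lambda m: m[0])
        output.append(out)
        children = node["children"]
    return "".join(output)


tree = {}
train(tree, [("ba", "ba:"), ("ta", "ta'")])  # toy counts, not corpus data
train(tree, [("ba", "ba:"), ("ta", "tah")])
train(tree, [("ba", "ba:"), ("ta", "tah")])
print(generate(tree, ["ba", "ta"]))
```

The greedy walk stops as soon as no child matches, so an input whose later syllables are missing from the chosen subtree yields a truncated output.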

Figure 12. Sample Generated Tree

Results The final stand-alone automaton achieved 67% accuracy on the 217 words, 66.23% on Wiki and 60.23% on Blog (64% on average). The most common errors involved the word-final consonant phoneme, such as /hila:ga’/ incorrectly transcribed as /hila:gah/. The second type of error, mostly incorrect vowel lengthening, involved words that did not fit the pattern addressed by the second representation of the automaton. These were mainly borrowed words such as /bentilador/ and /te:rnoh/, erroneously transcribed as /bentila:dor/ and /ternoh/ respectively. A less frequent error combines both types, for example /‘aruga’/ incorrectly transcribed as /‘aru:gah/. Table 4 gives the share of each error type in each test as a percentage of the total words.


Table 4. Breakdown of errors from PA tests

Type of Error                   217 words   Wiki     Blog
Incorrect vowel lengthening     23.5%       31.17%   31.82%
Word-final consonant phoneme    8.76%       2.60%    6.81%
Both                            0.46%       n/a      1.13%
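As a quick consistency check, the accuracy of each test can be recovered from Table 4 by subtracting the summed error percentages from 100. The short sketch below does only that arithmetic. The dictionary layout is an illustrative choice, and the n/a entry for Wiki is simply left out.

```python
# Error percentages per test, copied from Table 4 (n/a omitted for Wiki).
errors = {
    "217 words": [23.5, 8.76, 0.46],
    "Wiki": [31.17, 2.60],
    "Blog": [31.82, 6.81, 1.13],
}

# Accuracy is what remains after all error types are subtracted.
accuracy = {name: round(100 - sum(vals), 2) for name, vals in errors.items()}

for name, value in accuracy.items():
    print(f"{name}: {value:.2f}%")
```

The results agree with the reported 67% and 66.23%. For Blog the arithmetic gives 60.24% against the reported 60.23%, a difference that is presumably due to rounding of the individual entries.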

Wiki was used in testing the automaton with POS querying. According to the results, 53% of the words were known, and these were mostly function words such as ang, mga and sa. The other 47% consisted of unknown words and words that were incorrectly tagged by the POS tagger. For these unknown words, the unknown handler was used to generate the phonology. Table 5 shows a more detailed breakdown of these errors.

Table 5. Breakdown of errors from PA w/ POS tests

Type of Error                        Percentage from total unknown words
Found in corpus but incorrect POS    6%
Unknown in both tagger and corpus    35%
Unknown in corpus but tagged         59%

In testing the implementation of the generalized phonology tree, the training data consisted of the 217 words from [5] that were also used for the automaton testing. All of the words were used for both training and testing. In this test, the accuracy was 64%, as opposed to the 67% accuracy of the automaton on the same data set. The tree correctly generated the output for the majority of the words, since their patterns were the most frequent in the training data. The slight drop in accuracy was attributed to certain words having incomplete output because of the issues mentioned in the previous section. Examples of such words are babae and halaan, which received the incomplete transcriptions /baba:/ and /hala:/ respectively. The generation phase therefore needs to be improved in how it processes similar words. With regard to the aforementioned glide-consonant cluster syllable issue, the training phase encountered 5 such words among the 217. Therefore, only 212 words in total were used for the training.

Wiki was then used as testing data against the tree generated from the 212 words. The accuracy was 37%, with most of the errors coming from function words. Most of the function words in the article were not found in the tree and had to be run through the automaton instead. The reason is that the training data did not contain the patterns of most function words, which are mostly monosyllabic. This shows that the results rely heavily on the patterns derived from the training data. Another test was conducted using Wiki as both the training and testing data, and the accuracy improved to 55%.

4. CONCLUSION AND FUTURE WORK

4.1 Conclusion

filSPAM (Sentence-level Phonological and Morphological Analyzer for Filipino) was developed to analyze a given Filipino sentence input and generate the corresponding part-of-speech, root word, and phonology of the input sentence. To complete the task of generating the POS, root word and phonology, the following modules were created:

• The Part-of-Speech tagging module, which successfully tags input sentences using both HATPOST and the POS tagging module of MAGTag.
  • Nothing was modified in the implementation of HATPOST.
• The Morphological Analysis module, which successfully determines root words and their respective affixes using the Morphological Analysis module of MAGTag.
  • The rules were modified to accommodate the overlooked exceptions produced by MAGTag.

Although the exceptions in MAGTag were resolved, the alteration that was made to the rules of MAGTag did not affect the accuracy of its analysis.

For the Phonology module, the Automata and the Generalized Phonology Tree were created to represent and implement phonological rules and to generate phonology, with accuracies of 64% and 52% respectively. As for compound words, the system can only identify them if they include a hyphen.

The Generalized Phonology Tree still has implementation issues to be handled, which led to a lower accuracy than expected. The output generation phase may have to be improved. This can be done by implementing a backtracking method: if the subtree being followed reaches its final node while there are still input syllables to process, the search moves to another appropriate subtree. The tree is also limited by the training data provided. As shown in the testing, the training data used did not include patterns for function words.
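One possible shape of the backtracking step is a depth-first search that tries matching children from heaviest to lightest and abandons a subtree when it runs out of nodes before the input is consumed. The sketch below is an illustration under assumptions: the node layout (`weight`, `children`) and the hand-built toy tree are hypothetical, and the output strings are placeholders rather than verified transcriptions.

```python
def generate_backtrack(children, syllables):
    """Return output syllables covering every input syllable, or None."""
    if not syllables:
        return []
    first, rest = syllables[0], syllables[1:]
    # Try matching children from heaviest to lightest.
    matches = sorted(
        ((node, out) for (inp, out), node in children.items() if inp == first),
        key=lambda m: m[0]["weight"],
        reverse=True,
    )
    for node, out in matches:
        tail = generate_backtrack(node["children"], rest)
        if tail is not None:
            return [out] + tail
    return None  # dead end: the caller moves on to its next candidate


def leaf(weight):
    return {"weight": weight, "children": {}}


# Toy tree: the heavier "ba" subtree cannot finish the three-syllable input.
tree = {
    ("ba", "ba:"): {"weight": 3, "children": {("ta", "tah"): leaf(3)}},
    ("ba", "ba"): {
        "weight": 1,
        "children": {
            ("ba", "ba:"): {"weight": 1, "children": {("e", "'e"): leaf(1)}}
        },
    },
}

print(generate_backtrack(tree, ["ba", "ba", "e"]))
```

A greedy walk would commit to the heavier first node and stop after one syllable, whereas the search above falls back to the lighter subtree and covers the whole input.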

Akin to the rule-based automaton, several of the incorrectly transcribed words were attributed to the word-final consonant phoneme being usually glottal fricative /h/ instead of glottal stop /’/ since both used a similar implementation (frequency count) in determining the phoneme. Therefore, the training-based experiment did not have a significant effect concerning this issue as it bears the same result with the rule-based automaton.

As a stand-alone module, the phonological analyzer achieved 67% accuracy. However, integration with HATPOST and MAGTag lowered its accuracy to 54%. The handling of ambiguous words depends on the output of the POS tagger.

The phonology module also provides a save function for future reference. The function outputs a text file in the format shown in Figure 13:

Figure 13. Format of Output text file

<I> represents the input sentence, <SLP> represents the sentence-level phonology, <W> represents the words in the input sentence, and <WP> represents the word-level phonology. The output file can also be used as training data in building the generalized phonology tree.

This research is a stepping stone towards a back-end for human-computer interaction, specifically using the Tagalog language. This purpose is achieved through understanding and analyzing basic sentence components such as parts-of-speech, root words and word pronunciations.

4.2 Recommendation

During the research and development of filSPAM, several issues were encountered and some are still unaddressed. Aside from these, improvements can also be made to increase the performance of the system. The following suggestions were noted for future research endeavors:

Part-of-Speech Tagger The training data for HATPOST is currently not sufficient. Having appropriate training data for HATPOST will improve the quality and accuracy of the tagging. The training data needs to be of the same context as the expected input to produce the best results. The degree of formality of the words may be of some use in this scenario. Another factor is the genre from which the input was retrieved (e.g., if the input comes from short stories, the POS tagger will perform better if the training sets come from short stories as well).

Morphological Analyzer The architecture of MAGTag can be improved. The implementation of MAGTag's word analysis is tedious to modify. Therefore, a complete and careful redesign of the code of the Analysis Module would help improve its performance.

Phonological Analyzer The phonology module cannot handle words that are spelled the same but have two different pronunciations. Ambiguity can be solved through context, and context can be identified through the use of part-of-speech. For example, /ora:san/ means to record time while /orasan/ refers to a watch. However, part-of-speech is not always enough. For example, /ba:ta’/ meaning a child and /ba:tah/ meaning a cloth are both nouns and can be differentiated by identifying their thematic roles: /ba:ta’/ is an agent/actor while /ba:tah/ is an instrument/object.
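Disambiguation by part-of-speech can be sketched as a lookup keyed on the word together with its tag, falling back to another handler when the pair is unknown. The table, the `lookup` function, and the plain "verb"/"noun" labels below are placeholders for illustration. They are not the tagset or the corpus format used by filSPAM.

```python
# Hypothetical pronunciation table keyed by (word, POS label).
PRONUNCIATIONS = {
    ("orasan", "verb"): "/ora:san/",  # to record time
    ("orasan", "noun"): "/orasan/",   # a watch
}


def lookup(word, pos, fallback):
    """Return the stored transcription for (word, pos), else use the fallback."""
    key = (word.lower(), pos)
    if key in PRONUNCIATIONS:
        return PRONUNCIATIONS[key]
    # Unknown pair: hand the word to another handler, e.g. a rule-based one.
    return fallback(word)


print(lookup("orasan", "verb", fallback=lambda w: "/" + w + "/"))
print(lookup("orasan", "noun", fallback=lambda w: "/" + w + "/"))
```

A key of this kind cannot separate /ba:ta’/ from /ba:tah/, since both are nouns. Handling that case would require extending the key with a further feature such as the thematic role.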

The Generalized Phonology Tree was constructed using training data consisting of words from various sources. With this training data, different training approaches can be used, such as an HMM (Hidden Markov Model) [7]. Instead of using the frequency of a node, that is, the number of times the node occurred in the training data, as the basis for choosing which node to visit, a probability measure can also be used.
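The simplest probability measure of this kind is the relative frequency of each child under its parent. The sketch below shows only that normalization step and is not an HMM. The node layout with a `weight` field and the example counts are assumptions made for illustration.

```python
def child_probabilities(children):
    """Turn raw child weights into relative frequencies that sum to 1."""
    total = sum(node["weight"] for node in children.values())
    if total == 0:
        return {}
    return {pair: node["weight"] / total for pair, node in children.items()}


# Toy children of one node: two competing outputs for the same input syllable.
children = {
    ("ta", "ta'"): {"weight": 1, "children": {}},
    ("ta", "tah"): {"weight": 3, "children": {}},
}

print(child_probabilities(children))
```

At a single node this ranks the children exactly as the raw counts do. The difference appears when probabilities are multiplied along a whole path, so that complete candidate transcriptions can be compared with each other.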

To improve the sentence-level phonology, the concepts of intonation and assimilation can be used. In assimilation, a sound accommodates itself to a neighboring sound: the word-final consonant phoneme of one word can blend into the sound at the beginning of the following word, such as ang pader becoming /am pader/. This phenomenon mostly involves Tagalog nasal consonant phonemes such as /n/, /m/ and /ŋ/ followed by words beginning with a labial/labiodental consonant such as /p/ or /b/. It commonly occurs during normal rapid conversation and does not affect the meaning of the words. Intonation, on the other hand, deals with pitch phenomena in varying positions in a sentence. An utterance may have a different connotation or may give emphasis if used with a different pitch.
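The assimilation rule described above can be sketched as a pass over adjacent word transcriptions. The function name, the list-of-strings input, and the restriction to /p/ and /b/ as triggers are assumptions made for this illustration.

```python
NASALS = ("n", "m", "ŋ")
LABIALS = ("p", "b")


def assimilate(words):
    """Turn a word-final nasal into /m/ when the next word starts with /p/ or /b/."""
    result = list(words)
    for i in range(len(result) - 1):
        if result[i].endswith(NASALS) and result[i + 1].startswith(LABIALS):
            result[i] = result[i][:-1] + "m"
    return result


print(assimilate(["aŋ", "pader"]))
```

Only the final phoneme of the first word changes. Words followed by a non-labial onset are left untouched.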

To improve the overall accuracy of the system, external dictionaries such as FilWordNet and external systems such as fiLex could be integrated. FilWordNet may be used to augment the training data used by HATPOST once its tagset has been converted to the Rabo-Buban Tagset. Since part-of-speech is not enough to resolve ambiguity in determining the phonology of certain words, the thematic roles from fiLex as well as the POS from HATPOST and MAGTag may be used as additional parameters in building the Generalized Phonology Tree.

5. REFERENCES

[1] Aquino, M., Fernandez, E., & Villanueva, K. (2007). Morphological Analyzer and Generator for Tagalog. De La Salle University, Manila.

[2] Ciego, R., Huang, Z., Navarro, G., Roxas, R., & Torres, M. (2007). Hybrid Approach to Tagalog Part-of-Speech Tagging. De La Salle University, Manila.

[3] Gildea, D., & Jurafsky, D. (1995). Automatic Induction of Finite State Transducers for Simple Phonological Rules.

[4] Otanes, F., & Schachter, N. (1972). Tagalog Reference Grammar. University of California Press, Berkeley, Los Angeles.

[5] Tagalog Dictionary. (n.d.). Retrieved February 23, 2011 from http://www.seasite.niu.edu/Tagalog/Dictionary/diction.htm

[6] Llamzon, T. (1976). Modern Tagalog: A Functional Structural Description. The Hague: Mouton.

[7] Rabiner, L. R. (1989). A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.

[8] Rabo, V. (2004). TPOST: A Template-based, n-gram part-of-speech tagger for Tagalog. De La Salle University, Manila.

[9] Yukiko, S. (1983). A two-level morphological analysis for Japanese. Retrieved November 22, 2010 from http://www2.parc.com/istl/members/karttune/publications/archive/kimmo/kimmo-japanese.pdf
