
  • Bengali Part-Of-Speech Tagging


    A

    PROJECT REPORT

    ON

    PART-OF-SPEECH TAGGING

    FOR BENGALI

    IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE DEGREE OF

    MASTER OF COMPUTER SCIENCE

    DEPARTMENT OF COMPUTER SCIENCE

    ASSAM UNIVERSITY, SILCHAR

    2016

    Submitted by:

    DEEPANKAR DAS

    Roll: 101614 No.: 22220380

    Under the Guidance of

    PROF. BIPUL SYAM PURKAYASTHA

    HEAD OF DEPARTMENT, PROFESSOR

    DEPARTMENT OF COMPUTER SCIENCE

    ASSAM UNIVERSITY, SILCHAR-788011


    CERTIFICATE

    This is to certify that Deepankar Das, bearing Roll: 101614 No.: 22220380, has
    carried out his work for the project entitled “PART-OF-SPEECH TAGGING FOR
    BENGALI” under my supervision, in partial fulfillment of the requirements for the
    award of the degree of Master of Science in Computer Science of Assam University,
    Silchar. He has worked sincerely in preparing this project and has fulfilled
    all the requirements laid down in the regulations of the MSc (2 years) 4th
    Semester Examination (Paper MS-405) of the Department of Computer Science,
    Assam University, Silchar, for the session 2015-2016.

    Date: Signature of the Guide

    Place: (PROF. BIPUL SYAM PURKAYASTHA)

    Supervisor, Professor

    Department of Computer Science

    Assam University, Silchar


    CERTIFICATE

    This is to certify that Deepankar Das, bearing Roll: 101614 No.: 22220380, has
    carried out his work for the project entitled “PART-OF-SPEECH TAGGING FOR
    BENGALI” under my supervision, in partial fulfillment of the requirements for the
    award of the degree of Master of Science in Computer Science of Assam University,
    Silchar. He has worked sincerely in preparing this project and has fulfilled
    all the requirements laid down in the regulations of the MSc (2 years) 4th
    Semester Examination (Paper MS-405) of the Department of Computer Science,
    Assam University, Silchar, for the session 2015-2016.

    Date: Signature of the HOD

    Place: (PROF. BIPUL SYAM PURKAYASTHA)

    HOD, Professor

    Department of Computer Science

    Assam University, Silchar


    DECLARATION

    I, Deepankar Das, a student of the 4th semester (MSc 2 years), Department of Computer
    Science, do hereby solemnly declare that I have duly worked on my project
    entitled “PART-OF-SPEECH TAGGING FOR BENGALI” under the supervision of Prof. Bipul
    Syam Purkayastha, Professor, Department of Computer Science, Assam
    University, Silchar.

    Date: Signature

    Place: ( Deepankar Das )

    MSc 4th Semester

    Roll: 101614 No.: 22220380

    Regn. No.: 02-110018703 of 2011-12

    Department of Computer Science

    Assam University, Silchar


    ACKNOWLEDGEMENT

    At the very outset, I take the privilege to convey my gratitude to those
    persons whose co-operation, suggestions and heartfelt support helped
    me to accomplish the term paper successfully.

    I take immense pleasure in expressing my sincere thanks and profound
    gratitude to my respected guide Prof. Bipul Syam Purkayastha, Head
    of the Department of Computer Science, Assam University, Silchar, for
    his excellent and able guidance, valuable suggestions and the
    encouragement he rendered for completing the term paper.

    I am also indebted to my family members, friends and well-wishers, who
    encouraged me to do this work with vigor and seriousness.

    Last but not least, I would like to acknowledge the cooperation I
    received from the entire staff of our department, and to thank all those
    who directly or indirectly extended their helping hands and moral
    support while I was making this project.

    ( Deepankar Das )


    Table of Contents

    Chapter 1 Introduction
    1.1 NLP
    1.2 Applications of NLP
    1.3 POS Tagging
    1.4 The Part-of-Speech Tagging Problem
    1.5 Applications of POS Tagging
    1.6 Motivation
    1.7 Goals of Our Work
    1.8 Organization of the report

    Chapter 2 Prior Work
    2.1 Prior Work in POS Tagging
    2.2 Linguistics Taggers
    2.3 POS Tagging Approaches
    2.4 Indian Language POS Taggers

    Chapter 3 Foundational Consideration
    3.1 Corpora Collection
    3.2 The Tagset

    Chapter 4 Tagging with Rule Based Approach
    4.1 Rule Based Approach
    4.2 Our Approach

    Chapter 5 Experimental Result & Discussion
    5.1 Tools Used
    5.2 Graphical User Interface
    5.3 Experimental Result
    5.4 Result Discussion

    Chapter 6 Conclusion & Future Direction
    6.1 Conclusion
    6.2 Future Work

    References


    Abstract

    Part-of-Speech (POS) tagging is the process of assigning the appropriate part of
    speech, or lexical category, to each word in a natural language sentence. It is an
    important part of Natural Language Processing (NLP) and is useful for most NLP
    applications. It is often the first stage of natural language processing, after which
    further processing such as chunking and parsing is done.

    POS tagging is considered one of the basic, necessary tools of NLP. A simplified form
    of it is commonly taught to school-age children as the identification of words as nouns,
    pronouns, verbs, adjectives, adverbs, prepositions, conjunctions, interjections, etc. The
    development of a POS tagger for any Indian language will influence several pipelined
    modules of a natural language understanding system, including Information Extraction (IE),
    Information Retrieval (IR), Machine Translation (MT), Partial Parsing (PP) and Word Sense
    Disambiguation (WSD). Our objective in this work is to develop an effective POS tagger for
    the Bengali language. Once performed by hand, POS tagging is now done in the context of
    computational linguistics, using algorithms which associate discrete terms, as well as
    hidden parts of speech, with a set of descriptive tags. POS tagging algorithms fall into
    two distinct groups: rule-based and stochastic. E. Brill's tagger, one of the first and
    most widely used English POS taggers, employs rule-based algorithms.

    Bengali is the main language spoken in Bangladesh, the second most commonly
    spoken language in India, and the seventh most commonly spoken language in the world, with
    nearly 230 million total speakers (189 million native speakers). Natural language processing
    of Bengali is in its infancy. POS tagging of Bengali is a necessary component for most NLP
    applications of Bengali.

    The developed system was tested with a set of experimental data and the results were
    analysed. The system achieves an accuracy of over 74.50%. The performance can be improved
    by increasing the size of the lexicon.


    CHAPTER 1

    Introduction


    1.1 NLP

    The goal of natural language processing (NLP) is to build computational models of natural
    language for its analysis and generation. There are two motivations. First, there is the
    technological motivation of building intelligent computer systems such as machine translation
    systems, natural language interfaces to databases, man-machine interfaces to computers in
    general, speech understanding systems, text analysis and understanding systems,
    computer-aided instruction systems, and systems that read and understand printed or
    handwritten text. Second, there is a cognitive and linguistic motivation: to gain better
    insight into how humans communicate using natural language (NL).

    NLP is a field of computer science and linguistics concerned with the interactions between
    computers and human (natural) languages; it began as a branch of artificial intelligence. In
    theory, natural language processing is a very attractive method of human-computer
    interaction. Natural language understanding is sometimes referred to as an AI-complete
    problem because it seems to require extensive knowledge about the outside world and the
    ability to manipulate it. NLP is a collection of techniques used to extract grammatical
    structure and meaning from input in order to perform a useful task; natural language
    generation, conversely, builds output based on the rules of the target language and the task
    at hand. NLP is useful in tutoring systems, duplicate detection, computer-supported
    instruction and database interfaces, as it provides a pathway to increased interactivity and
    productivity.

    The tools of work in NLP are grammar formalisms, algorithms and data structures,
    formalisms for representing world knowledge, reasoning mechanisms, etc. Many of these have
    been taken from, and inherit results from, Computer Science, Artificial Intelligence,
    Linguistics, Logic, and Philosophy.

    1.2 Applications of NLP

    Automatic summarization: Produce a readable summary of a chunk of text. Often used to

    provide summaries of text of a known type such as articles in the financial section of a

    newspaper.

    Machine translation: Automatically translate text from one human language to another. This

    is one of the most difficult problems, and is a member of a class of problems colloquially


    termed "AI-complete", i.e. requiring all of the different types of knowledge that humans

    possess (grammar, semantics, facts about the real world, etc.) in order to solve properly.

    Morphological segmentation: Separate words into individual morphemes and identify the

    class of the morphemes. The difficulty of this task depends greatly on the complexity of the

    morphology (i.e. the structure of words) of the language being considered. English has fairly

    simple morphology, especially inflectional morphology, and thus it is often possible to ignore

    this task entirely and simply model all possible forms of a word (e.g. "open, opens, opened,

    opening") as separate words. In languages such as Turkish, however, such an approach is not

    possible, as each dictionary entry has thousands of possible word forms. The same holds for

    Manipuri, which is a highly agglutinative Indian language.

    Named entity recognition (NER): Given a stream of text, determine which items in the text

    map to proper names, such as people or places, and what the type of each such name is (e.g.

    person, location, organization). Note that, although capitalization can aid in recognizing

    named entities in languages such as English, this information cannot aid in determining the

    type of named entity, and in any case is often inaccurate or insufficient. For example, the first

    word of a sentence is also capitalized, and named entities often span several words, only

    some of which are capitalized. Furthermore, many other languages in non-Western scripts

    (e.g. Chinese or Arabic) do not have any capitalization at all, and even languages with

    capitalization may not consistently use it to distinguish names. For example, German

    capitalizes all nouns, regardless of whether they refer to names, and French and Spanish do

    not capitalize names that serve as adjectives.

    Natural language generation: Convert information from computer databases into readable

    human language.

    Natural language understanding: Convert chunks of text into more formal representations,
    such as first-order logic structures, that are easier for computer programs to manipulate.
    Natural language understanding involves identifying the intended semantics from the
    multiple possible semantics that can be derived from a natural language expression, which
    usually takes the form of organized notations of natural language concepts. Introducing and
    creating a language metamodel and ontology are efficient, though empirical, solutions. An
    explicit formalization of natural language semantics, free of confusion with implicit
    assumptions such as the closed world assumption (CWA) vs. the open world assumption, or
    subjective Yes/No vs. objective True/False, is expected for the construction of a basis for
    semantics formalization.

    Optical character recognition (OCR): Given an image representing printed text, determine

    the corresponding text.

    Part-of-speech tagging (POST): Given a sentence, determine the part of speech for each
    word. Many words, especially common ones, can serve as multiple parts of speech. For
    example, "book" can be a noun ("the book on the table") or a verb ("to book a flight"); "set"
    can be a noun, verb or adjective; and "out" can be any of at least five different parts of
    speech. Some languages have more such ambiguity than others. Languages with little
    inflectional morphology, such as English, are particularly prone to such ambiguity. Chinese is
    also prone to it because it is a tonal language in speech, and such tonal inflection is not
    readily conveyed by the written characters used to express the intended meaning.

    Parsing: Determine the parse tree (grammatical analysis) of a given sentence. The grammar

    for natural languages is ambiguous and typical sentences have multiple possible analyses. In

    fact, perhaps surprisingly, for a typical sentence there may be thousands of potential parses

    (most of which will seem completely nonsensical to a human).

    Question answering: Given a human-language question, determine its answer. Typical

    questions have a specific right answer (such as "What is the capital of Canada?"), but

    sometimes open-ended questions are also considered (such as "What is the meaning of

    life?"). Recent works have looked at even more complex questions.

    Relationship extraction: Given a chunk of text, identify the relationships among named

    entities (e.g. who is the wife of whom).

    Sentence breaking (also known as sentence boundary disambiguation): Given a chunk of

    text, find the sentence boundaries. Sentence boundaries are often marked by periods or other

    punctuation marks, but these same characters can serve other purposes (e.g. marking

    abbreviations).

    Sentiment analysis: Extract subjective information usually from a set of documents, often

    using online reviews to determine "polarity" about specific objects. It is especially useful for

    identifying trends of public opinion in the social media, for the purpose of marketing.


    Speech recognition: Given a sound clip of a person or people speaking, determine the textual
    representation of the speech. This is the opposite of text to speech and is one of the extremely
    difficult problems colloquially termed "AI-complete" (see above). In natural speech there are
    hardly any pauses between successive words, and thus speech segmentation is a necessary
    subtask of speech recognition (see below). Note also that in most spoken languages, the
    sounds representing successive letters blend into each other in a process termed
    coarticulation, so the conversion of the analog signal to discrete characters can be a very
    difficult process.

    Speech segmentation: Given a sound clip of a person or people speaking, separate it into

    words. A subtask of speech recognition and typically grouped with it.

    Topic segmentation and recognition: Given a chunk of text, separate it into segments each of

    which is devoted to a topic, and identify the topic of the segment.

    Word segmentation: Separate a chunk of continuous text into separate words. For a language

    like English, this is fairly trivial, since words are usually separated by spaces. However, some

    written languages like Chinese, Japanese and Thai do not mark word boundaries in such a

    fashion, and in those languages text segmentation is a significant task requiring knowledge of

    the vocabulary and morphology of words in the language.

    Word sense disambiguation: Many words have more than one meaning; we have to select the
    meaning which makes the most sense in context. For this problem, we are typically given a
    list of words and associated word senses, e.g. from a dictionary or from an online resource
    such as WordNet.

    In some cases, sets of related tasks are grouped into subfields of NLP that
    are often considered separately from NLP as a whole. Examples include:

    Information retrieval (IR): This is concerned with storing, searching and retrieving

    information. It is a separate field within computer science (closer to databases), but IR relies

    on some NLP methods (for example, stemming). Some current research and applications seek

    to bridge the gap between IR and NLP.

    Information extraction (IE): This is concerned in general with the extraction of semantic
    information from text. This covers tasks such as named entity recognition, coreference
    resolution, relationship extraction, etc.
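
    The sentence-breaking task listed earlier in this section shows why even a simple NLP
    step needs more than a punctuation check. The following Python sketch is illustrative
    only and is not part of the reported system: the abbreviation list and the function name
    `split_sentences` are assumptions, and the code handles only whitespace-separated
    English text.

```python
# Hypothetical, deliberately small abbreviation list; a real system needs far more.
ABBREVIATIONS = {"dr.", "prof.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}


def split_sentences(text):
    """Split text after '.', '!' or '?' unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?" and token.lower() not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without closing punctuation
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Prof. Das wrote the report. It covers tagging."))
```

    The period after "Prof." is not treated as a boundary because the token is in the
    abbreviation list, which is exactly the ambiguity the task description mentions.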


    1.3 POS Tagging

    Part-of-Speech (POS) tagging is the process of automatically annotating lexical categories:
    it assigns an appropriate part-of-speech tag to each word in a sentence of a natural
    language. The development of an automatic POS tagger requires either a comprehensive set of
    linguistically motivated rules or a large annotated corpus. However, such rules and corpora
    have been developed for only a few languages, such as English. POS taggers for Indian
    languages are not readily available, owing to the lack of such rules and of large annotated
    corpora.

    A part of speech is a grammatical category; the categories commonly include nouns,
    pronouns, verbs, adjectives, adverbs, prepositions, conjunctions and interjections. Parts of
    speech can be divided into two broad categories: closed classes and open classes. Closed
    classes are those that have relatively fixed membership. For example, pronouns form a closed
    class because there is a fixed set of them in English; new pronouns are rarely added. Nouns,
    by contrast, form an open class because new nouns are continually added in every language.
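
    One practical consequence of this distinction is that closed-class words can, in
    principle, be listed in a lexicon, while unknown words are most likely to belong to an open
    class. The Python sketch below illustrates the idea under stated assumptions: the word
    lists are small, incomplete English samples chosen for illustration, and the function name
    `word_class` is hypothetical.

```python
# Illustrative, incomplete closed-class word lists for English (assumed, not exhaustive).
CLOSED_CLASS = {
    "pronoun": {"i", "you", "he", "she", "it", "we", "they"},
    "preposition": {"in", "on", "at", "by", "with"},
    "conjunction": {"and", "or", "but"},
}


def word_class(word):
    """Return the closed class of a word, or 'open-class' if it is in no list."""
    lowered = word.lower()
    for label, members in CLOSED_CLASS.items():
        if lowered in members:
            return label
    return "open-class"


print(word_class("They"))    # found in the pronoun list
print(word_class("selfie"))  # a newly coined noun, so not in any closed list
```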

    The linguistic approach, the classical approach to POS tagging, was first explored in
    the 1960s and 1970s (Harris, 1962; Klein and Simmons, 1963; Greene and Rubin, 1971). Rules
    for tagging were engineered by hand. The most representative of these pioneering taggers was
    TAGGIT (Greene and Rubin, 1971), which was used for the initial tagging of the Brown Corpus.
    The development of ENGTWOL (an English tagger based on the constraint grammar architecture)
    can be considered the most important in this direction (Karlsson et al., 1995). These taggers
    typically use rule-based models written manually by linguists. The advantage of this model is
    that the rules are written from a linguistic point of view and can be made to capture complex
    kinds of information, which allows the construction of an extremely accurate system. But
    handling all the rules is not easy and requires expertise. The context frame rules have to be
    developed by language experts, so a rule-based POS tagger is costly and difficult to develop.
    Further, with rule-based POS tagging, transferring the tagger to another language means
    starting from scratch again.

    On the other hand, recent machine learning techniques make use of annotated corpora to
    acquire high-level language knowledge for different tasks, including POS tagging. This
    knowledge is estimated from corpora that are usually tagged with the correct part-of-speech
    labels for the words. Machine-learning-based tagging techniques allow taggers to be developed
    in a shorter time, and they can be transferred for use with corpora of other languages.
    Several machine learning algorithms have been developed for the POS disambiguation task.
    These algorithms range from instance-based learning to several graphical models. The
    knowledge acquired may be in the form of rules, decision trees, probability distributions,
    etc. The knowledge encoded in stochastic methods may or may not have a direct linguistic
    interpretation. Typically, however, such taggers need to be trained with a large amount of
    annotated data to achieve high accuracy. Although significant amounts of annotated corpora
    are often unavailable for most languages, it is easier to obtain large volumes of
    un-annotated text for most of them. The implication is that one may explore the power of
    semi-supervised and unsupervised learning mechanisms to obtain a POS tagger.

    Our interest is in developing a tagger for the Bengali language. Annotated corpora are
    not readily available for this language, but the language is morphologically rich. The use of
    the morphological features of a word, as well as word suffixes, can enable us to develop a POS
    tagger with limited resources. In the present work, these morphological features (affixes)
    have been incorporated in different machine learning models (Maximum Entropy,
    Conditional Random Field, etc.) to perform the POS tagging task. This approach can be
    generalized for use with any morphologically rich language in a poor-resource scenario.
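
    To make the idea of affix features concrete, the following Python sketch extracts
    fixed-length prefixes and suffixes of a word as a feature dictionary of the kind that a
    Maximum Entropy or CRF model could consume. It is a minimal illustration, not the feature
    set used in this report: the function name `affix_features`, the feature names and the
    maximum affix length of 3 are all assumptions. Note also that slicing a Python string
    works on Unicode code points, so for Bengali text it may separate a consonant from its
    dependent vowel sign; a real system would need script-aware segmentation.

```python
def affix_features(word, max_len=3):
    """Return prefix and suffix features of length 1..max_len for one word.

    An affix is only emitted when at least one character of the word remains
    outside it, so very short words produce few or no features.
    """
    features = {}
    for n in range(1, max_len + 1):
        if len(word) > n:
            features[f"prefix_{n}"] = word[:n]
            features[f"suffix_{n}"] = word[-n:]
    return features


print(affix_features("tagging"))
```

    For the English word "tagging", the length-3 suffix feature is "ing", the kind of cue
    that lets a model guess the tag of a word it has never seen.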

    The development of a tagger requires either an exhaustive set of linguistic rules or a
    large amount of annotated text. However, no tagged corpus was available to us for this task,
    so we had to start by creating tagged resources for Bengali. Manual part-of-speech tagging is
    a time-consuming and difficult process. We therefore tried to work with methods that can
    carry out the part-of-speech tagging task effectively using only a small amount of tagged
    resources.

    1.4 The Part-of-Speech Tagging Problem

    Natural languages are ambiguous in nature. Ambiguity appears at different levels of the
    natural language processing (NLP) task. Many words take multiple part-of-speech tags, and
    the correct tag depends on the context.

    Consider, for instance, the following English and Bengali sentences:

    1. Keep the book on the top shelf.

    2. সকালে তারা ক্ষেতে লাঙল দিয়ে কাজ করে

    The sentences contain a lot of POS ambiguity, which should be resolved before they can
    be understood. For instance, in example sentence 1, the words “keep” and “book” can each be
    a noun or a verb; “on” can be a preposition, an adverb or an adjective; finally, “top” can
    be either an adjective or a noun. Similarly, in Bengali example sentence 2, the word “তারা”
    can be either a noun or a pronoun; “দিয়ে” can be either a verb or a postposition; and “করে”
    can be a noun, a verb, or a postposition. In most cases POS ambiguity can be resolved by
    examining the context of the surrounding words. Figure 1 shows a detailed analysis of the POS
    ambiguity of an English sentence considering only the basic 8 tags. A box with a single line
    indicates the correct tag for a word where no ambiguity exists, i.e. only one tag is
    possible for the word. By contrast, a box with a double line indicates the correct POS tag
    of a word from a set of possible tags.

    Figure 1: POS ambiguity of an English sentence with eight basic tags.

    Figure 2: POS ambiguity of a Bengali sentence with tagset of experiment.

    Figure 2 illustrates the ambiguity classes for the Bengali sentence according to the
    tagset used for our experiment. As we are using a fine-grained tagset compared to the basic 8
    tags, the number of possible tags for a word increases. POS tagging is the task of assigning
    the appropriate grammatical tag to each word of an input text in its context of appearance.
    Essentially, the POS tagging task resolves ambiguity by selecting the correct tag from the set
    of possible tags for a word in a sentence.

    [Figure 2 body: the sentence সকালে তারা ক্ষেতে লাঙল দিয়ে কাজ করে with candidate tags N, PR, V and PSP shown under its words.]
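
    The selection of one tag from a word's set of possible tags can be sketched in Python
    for English example sentence 1. The ambiguity classes for “keep”, “book”, “on” and “top”
    follow the discussion in this section; everything else is an assumption made for
    illustration: the DET tag, the unambiguous entries for “the” and “shelf”, the three
    context rules and the function names. These are not the rules of the tagger described in
    this report.

```python
# Ambiguity classes for example sentence 1. The entries for "the" and "shelf"
# and the DET tag are assumed for illustration.
LEXICON = {
    "keep": {"NOUN", "VERB"},
    "the": {"DET"},
    "book": {"NOUN", "VERB"},
    "on": {"PREP", "ADV", "ADJ"},
    "top": {"ADJ", "NOUN"},
    "shelf": {"NOUN"},
}


def choose_tag(index, candidates, previous_tag, next_candidates):
    """Pick one tag from `candidates` using three hypothetical context rules."""
    if len(candidates) == 1:
        return next(iter(candidates))
    # Rule 1: a sentence-initial word that can be a verb is read as an imperative.
    if index == 0 and "VERB" in candidates:
        return "VERB"
    # Rule 2: after a determiner, prefer an adjective when the next word must be
    # a noun; otherwise prefer a noun.
    if previous_tag == "DET":
        if "ADJ" in candidates and next_candidates == {"NOUN"}:
            return "ADJ"
        if "NOUN" in candidates:
            return "NOUN"
    # Rule 3: a possible preposition directly before a determiner is a preposition.
    if "PREP" in candidates and next_candidates == {"DET"}:
        return "PREP"
    # Fallback: a deterministic but arbitrary choice.
    return sorted(candidates)[0]


def tag_sentence(words):
    """Tag words left to right; every word must appear in LEXICON."""
    tags = []
    for i, word in enumerate(words):
        candidates = LEXICON[word.lower()]
        previous_tag = tags[-1] if tags else None
        has_next = i + 1 < len(words)
        next_candidates = LEXICON[words[i + 1].lower()] if has_next else set()
        tags.append(choose_tag(i, candidates, previous_tag, next_candidates))
    return tags


print(tag_sentence("Keep the book on the top shelf".split()))
# ['VERB', 'DET', 'NOUN', 'PREP', 'DET', 'ADJ', 'NOUN']
```

    Each ambiguous word is resolved only by looking at its neighbours, which is the sense
    in which context resolves POS ambiguity.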


    1.5 Applications of POS Tagging

    The POS disambiguation task is useful in several natural language processing tasks. It is
    often the first stage of natural language understanding, after which further processing, e.g.
    chunking and parsing, is done. Part-of-speech tagging is of interest for a number of
    applications, including speech synthesis and recognition, machine translation and
    lexicography.

    Most natural language understanding systems are formed by a set of pipelined modules,
    each specific to a particular level of analysis of the natural language text. The
    development of a POS tagger influences several pipelined modules of the natural language
    understanding task. As POS tagging is the first step towards natural language understanding,
    it is important to achieve a high level of accuracy; errors at this stage may hamper the
    later stages. In the following, we briefly discuss some of the above applications of POS
    tagging.

    Speech synthesis and recognition: Part of speech gives a significant amount of information
    about a word and its neighbours, which can be useful in a language model for speech
    recognition (Heeman et al., 1997). The part of speech of a word also tells us something about
    how the word is pronounced, depending on its grammatical category (the noun is pronounced
    OBject and the verb obJECT).

    Information retrieval and extraction: By augmenting a query given to a retrieval
    system with POS information, more refined information extraction is possible. For
    example, if a person wants to search for documents containing “book” as a noun, adding
    the POS information will eliminate irrelevant documents that contain “book” only as a verb.
    Also, patterns used for information extraction from text often use POS references.

    Machine translation: The probability of translating a word in the source
    language into a word in the target language depends on the
    POS category of the source word.
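
    The information-retrieval case above can be sketched in Python as a search over a
    POS-tagged document collection. The two documents, their tags and the function name
    `search` are invented for illustration only; the point is that matching on (word, tag)
    pairs rather than on words alone filters out the unwanted reading.

```python
# A tiny hand-tagged collection; the documents and tags are illustrative only.
DOCS = {
    "d1": [("the", "DET"), ("book", "NOUN"), ("is", "VERB"), ("old", "ADJ")],
    "d2": [("please", "ADV"), ("book", "VERB"), ("a", "DET"), ("flight", "NOUN")],
}


def search(word, tag, docs=DOCS):
    """Return the sorted ids of documents containing `word` tagged as `tag`."""
    return sorted(doc_id for doc_id, tokens in docs.items() if (word, tag) in tokens)


print(search("book", "NOUN"))  # only the document where "book" is a noun
print(search("book", "VERB"))  # only the document where "book" is a verb
```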

    As mentioned earlier, POS tagging has been used in several other applications, such as a
    preprocessor for high-level syntactic processing (noun phrase chunking), lexicography,
    stylometry, and word sense disambiguation. These applications are discussed in some detail in
    (Church, 1988; Ramshaw and Marcus, 1995; Wilks and Stevenson, 1998).


    1.6 Motivation

    A lot of work has been done in part of speech tagging of several languages, such as English.

    While some work has been done on the part of speech tagging of different Indian languages

    (Ray et al., 2003; Shrivastav et al., 2006; Arulmozhi et al., 2006; Singh et al., 2006; Dalal et

    al., 2007), the effort is still in its infancy. Very little work has been done previously with part

    of speech tagging of Bengali. Bengali is the main language spoken in Bangladesh, the second

    most commonly spoken language in India, and the seventh most commonly spoken language

    in the world.

    Apart from being required for further language analysis, Bengali POS tagging is of
    interest due to a number of applications such as speech synthesis and recognition. Part of
    speech gives a significant amount of information about a word and its neighbours, which can
    be useful in a language model for different speech and natural language processing
    applications. Development of a Bengali POS tagger will also influence several pipelined
    modules of a natural language understanding system, including information extraction and
    retrieval, machine translation, partial parsing and word sense disambiguation. Existing POS
    tagging techniques show that developing a reasonably accurate POS tagger requires either an
    exhaustive set of linguistic rules or a large amount of annotated text. We have the following
    observations.

    i. POS tagging has a wide range of applications.

    ii. Reputed companies such as Google and Microsoft are concentrating on NLP
    applications, so POS tagging has gained importance.

    iii. Part-of-speech tagging using a rule based approach is a challenging task, because the
    tagger must resolve ambiguities.

    Therefore, there is a pressing need to develop an automatic part-of-speech tagger for
    Bengali. With this motivation, we set out the major goals of this report.

    1.7 Goals of Our Work

    The primary goal of the thesis is to develop a reasonably good accuracy part-of-speech

    tagger for Bengali. To address this broad objective, we identify the following goals:

    We wish to investigate different machine learning algorithms to develop a part-of-speech
    tagger for Bengali.


    Bengali is a morphologically rich language. We wish to use the morphological
    features of a word, as well as word suffixes, to enable us to develop a POS tagger with
    limited resources.

    Since stemming is one of the pre-processing steps in developing an effective POS tagger,
    we wish to stem a few Bengali text documents.

    1.8 Organization of the Report

    The rest of this report is organized into chapters as follows:

    Chapter 2 provides a review of the previous work on POS tagging. A comparative review
    of the work is not given in this chapter because such an attempt is extremely difficult, owing
    to the large number of publications in this area and the variety of theories
    and techniques used by researchers over the years. Instead, a brief review of the work,
    organized by the different techniques used for POS tagging, is presented. This chapter also
    presents a discussion of English language POS taggers and Indian language POS
    taggers.

    Chapter 3 supplies information about several important issues related to POS
    tagging that can greatly influence the performance of a tagger, i.e. the corpora and the
    Bengali tagset.

    Chapter 4 describes the developed system and the way it was
    developed. The system architecture is also shown in this chapter.

    Chapter 5 presents the experimental results and discusses them.

    Chapter 6 presents the general conclusion, summarizes the work and its contributions,
    and discusses the scope for future research.


    CHAPTER 2

    Prior Work


    2.1 Prior Work in POS Tagging

    The area of automated part-of-speech tagging has been enriched over the last few decades by
    contributions from several researchers. Since its inception in the 1960s and 1970s
    (Harris, 1962; Klein and Simmons, 1963; Greene and Rubin, 1971), many new concepts have
    been introduced to improve the efficiency of taggers and to construct POS taggers for
    several languages. Initially, people manually engineered rules for tagging. Linguistic taggers
    incorporate the knowledge as a set of rules or constraints written by linguists. More recently,
    several statistical or probabilistic models have been used for the POS tagging task to
    provide transportable, adaptive taggers. Several sophisticated machine learning algorithms
    have been developed that acquire more robust information. In general, all the statistical
    models rely on manually POS labelled corpora to learn the underlying language model, which
    are difficult to acquire for a new language. Finally, combinations of several sources of
    information (linguistic, statistical and automatically learned) are used in current
    research.

    This chapter provides a brief review of the prior work in POS tagging. For the sake of
    conciseness, we do not aim to give a comprehensive review of the related work. Instead,
    we provide a brief review of the different techniques used in POS tagging. Further, we focus
    on a detailed review of the Indian language POS taggers.

    2.2 Linguistic Taggers

    Automated part-of-speech tagging was initially explored in the 1960s and 1970s, when
    people manually engineered rules for tagging. The most representative of these pioneering
    taggers was TAGGIT (Greene and Rubin, 1971), which was used for the initial tagging of the
    Brown Corpus. Since then, a lot of effort has been devoted to improving the quality
    of the tagging process in terms of accuracy and efficiency.

    Recent linguistic taggers incorporate the knowledge as a set of rules or constraints
    written by linguists. The current models are expressive and accurate, and they are used in very
    efficient disambiguation algorithms. The linguistic rules range from a few hundred to several
    thousand, and they usually require years of labour. The development of ENGTWOL (an
    English tagger based on the constraint grammar architecture) can be considered the most
    important work in this direction. The constraint grammar formalism has also been applied to
    other languages such as Turkish.


    The accuracy reported by the first rule-based linguistic English tagger was slightly
    below 80%. A Constraint Grammar for English tagging presented by Samuelsson and
    Voutilainen (1997) achieves a recall of 99.5% with a very high precision of around 97%. The
    advantages of linguistic taggers are that the models are written from a linguistic point of
    view and explicitly describe linguistic phenomena, and that the models may contain many
    complex kinds of information. Both allow the construction of extremely accurate systems.
    However, the linguistic models are developed by introspection (sometimes with the aid of
    reference corpora). This makes it particularly costly to obtain a good language model.
    Transporting the model to other languages would require starting over again.

    2.3 POS Tagging Approaches

    POS taggers are broadly classified into three categories: rule based, empirical based
    and hybrid. In the rule based approach, hand-written rules are used to resolve
    tag ambiguity. Empirical POS taggers are further classified into example based and
    stochastic taggers. Stochastic taggers are either HMM based, choosing the tag
    sequence which maximizes the product of word likelihood and tag sequence probability, or
    cue-based, using decision trees or maximum entropy models to combine probabilistic
    features. Stochastic taggers are further classified into supervised and unsupervised
    taggers, each of which is categorized into different
    groups based on the particular algorithm used. Fig. 2.3 shows the classification of
    part-of-speech tagging approaches.

    2.3.1 Rule Based POS tagging

    Rule based POS tagging models apply a set of hand-written rules and use
    contextual information to assign POS tags to words. These rules are often known as context
    frame rules. For example, a context frame rule might say something like: “If an
    ambiguous/unknown word X is preceded by a Determiner and followed by a Noun, tag it as
    an Adjective”. One of the first and most widely used English POS taggers that employs rule
    based algorithms is Brill's tagger. The earliest algorithms for automatically assigning
    parts of speech were based on a two-stage architecture. The first stage used a dictionary to
    assign each word a list of potential parts of speech. The second stage used large lists of
    hand-written disambiguation rules to bring this list down to a single part of speech for each
    word. The ENGTWOL tagger is based on the same two-stage architecture, although both the
    lexicon and the disambiguation rules are much more sophisticated than in the early algorithms.
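The two-stage architecture and the example context frame rule described above can be sketched as follows. This is a minimal illustration, not the implementation of any tagger named in this report: the toy lexicon, the `DT` tag name for determiners and the function names are all assumptions introduced here.

```python
# Toy lexicon (an assumption for illustration): word -> set of possible tags.
LEXICON = {
    "the": {"DT"},
    "light": {"N", "JJ"},   # ambiguous between noun and adjective
    "rain": {"N"},
}

def lookup(words):
    """Stage 1: dictionary look-up. Unknown words get an empty candidate set."""
    return [set(LEXICON.get(word.lower(), ())) for word in words]

def apply_context_rule(candidates):
    """Stage 2: Determiner + X + Noun => X is an Adjective, for ambiguous/unknown X."""
    resolved = [set(c) for c in candidates]
    for i in range(1, len(resolved) - 1):
        ambiguous_or_unknown = len(resolved[i]) != 1
        if ambiguous_or_unknown and resolved[i - 1] == {"DT"} and resolved[i + 1] == {"N"}:
            resolved[i] = {"JJ"}
    return resolved

tags = apply_context_rule(lookup(["the", "light", "rain"]))
```

Here the ambiguous word "light" is reduced to the single tag `JJ` because it sits between a determiner and a noun. A real rule based tagger would apply many such rules, typically in a fixed order.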

    Fig.2.3 : Classification of POS tagging

    2.3.2 Empirical Based POS tagging

    The relative failure of rule-based approaches, the increasing availability of machine
    readable text and the increase in capability of hardware (CPU, memory, disk space) with
    decrease in cost are some of the reasons why researchers prefer corpus based POS tagging. The
    empirical approach to part-of-speech tagging is further divided into two categories: the
    example based approach and the stochastic approach. The literature shows that the majority of
    developed POS taggers belong to the empirical approach.


    2.3.2(a) Example Based POS tagging

    The example based approach depends on a tagged corpus, from which the tagger is
    trained using a machine learning technique. In example based
    morphosyntactic tagging, the problem is formulated as a classification task. The
    features usually include the POS of neighbouring tokens, their orthographic forms and
    sometimes also fixed-width affixes of the word forms.

    2.3.2(b) Stochastic based POS tagging

    The stochastic approach finds the most frequently used tag for a specific word in
    the annotated training data and uses this information to tag that word in unannotated text.
    A stochastic approach requires a sufficiently large corpus and calculates the frequency,
    probability or other statistics of each word in the corpus. The problem with this approach
    is that it can come up with sequences of tags for sentences that are not acceptable according
    to the grammar rules of the language. The use of probabilities in tagging is quite old:
    probabilities were first used in 1965, a complete probabilistic tagger with Viterbi decoding
    was sketched by Bahl and Mercer (1976), and various stochastic taggers were built in the 1980s
    (Marshall, 1983; Garside, 1987; Church, 1988; DeRose, 1988). Supervised and unsupervised
    are the two broad categories of the stochastic approach.
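The word-frequency idea described above can be sketched as a most-frequent-tag baseline. This is only an illustrative sketch: the toy training sentences, the function names and the choice of `N` as the default tag for unseen words are assumptions, not details taken from any tagger cited here.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}

def tag_words(words, model, default_tag="N"):
    """Tag new text; words never seen in training get `default_tag` (an assumption)."""
    return [(word, model.get(word, default_tag)) for word in words]

# Hypothetical toy training data, for illustration only.
training = [
    [("I", "PR"), ("read", "V"), ("the", "DT"), ("book", "N")],
    [("the", "DT"), ("book", "N"), ("is", "V"), ("good", "JJ")],
    [("please", "INJ"), ("book", "V"), ("a", "DT"), ("ticket", "N")],
]
model = train_most_frequent_tag(training)
tagged = tag_words(["book", "the", "ticket"], model)
```

Because "book" is a noun in two of its three training occurrences, this baseline tags it as a noun everywhere, even in a verb position. That is the kind of grammatically unacceptable output the paragraph above warns about, and it is what sequence models such as HMMs try to avoid.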

    Supervised POS tagging: Supervised POS tagging models require pre-tagged
    corpora, which are used in training to learn information about the tagset, word-tag
    frequencies, rule sets etc. The performance of the models generally increases with the
    size of this corpus. Two familiar examples of supervised
    POS taggers are Hidden Markov Models and Support Vector Machines.

    Hidden Markov Model (HMM) based POS tagging: An alternative to the
    word frequency approach is the n-gram approach, which calculates the
    probability of a given sequence of tags. It determines the best tag for a word
    by calculating the probability that it occurs with the n previous tags, where the
    value of n is set to 1, 2 or 3 for practical purposes. These are known as the
    unigram, bigram and trigram models. The most common algorithm for
    implementing an n-gram approach for tagging new text is the Viterbi
    algorithm for HMMs. The Viterbi algorithm is a search algorithm that
    avoids the exponential expansion of a breadth first search by trimming the
    search tree at each level using the best 'm' Maximum Likelihood Estimates
    (MLE), where 'm' represents the number of tags of the following word. For a
    given sentence or word sequence, HMM taggers choose the tag sequence that
    maximizes the quantity in formula (1):

    $$P(\text{word} \mid \text{tag}) \times P(\text{tag} \mid \text{previous } n \text{ tags}) \qquad (1)$$

    A bigram-HMM tagger of this kind chooses the tag $t_i$ for word $w_i$ that is most
    probable given the previous tag $t_{i-1}$ and the current word $w_i$:

    $$t_i = \arg\max_{j} P(t_j \mid t_{i-1}, w_i) \qquad (2)$$
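A minimal sketch of Viterbi decoding for a bigram HMM is given below. The probability tables are hypothetical toy values chosen for illustration (they are not estimated from any corpus), the function and variable names are assumptions, and a small floor value stands in for a proper smoothing method.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for a non-empty `words` list."""
    tiny = 1e-12  # floor for unseen events so that log() is always defined
    # best[i][t]: log-probability of the best tag path ending in tag t at position i
    best = [{t: math.log(start_p.get(t, tiny)) + math.log(emit_p[t].get(words[0], tiny))
             for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Keep only the best predecessor for each tag (the "trimming" step).
            prev, score = max(
                ((p, best[i - 1][p] + math.log(trans_p[p].get(t, tiny))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + math.log(emit_p[t].get(words[i], tiny))
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Hypothetical toy model, for illustration only.
TAGS = ["PR", "V", "N"]
START = {"PR": 0.6, "N": 0.3, "V": 0.1}
TRANS = {
    "PR": {"V": 0.8, "N": 0.1, "PR": 0.1},
    "V": {"N": 0.7, "PR": 0.2, "V": 0.1},
    "N": {"V": 0.5, "N": 0.4, "PR": 0.1},
}
EMIT = {
    "PR": {"they": 0.9},
    "V": {"book": 0.4, "flights": 0.01},
    "N": {"book": 0.5, "flights": 0.5},
}
best_path = viterbi(["they", "book", "flights"], TAGS, START, TRANS, EMIT)
```

In this toy model "book" is slightly more likely to be emitted by a noun than by a verb, yet the decoder still tags it as a verb, because the transition from a pronoun to a verb is far more probable than the transition from a pronoun to a noun.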

    Support Vector Machines (SVM): SVM is a machine learning algorithm for
    binary classification, which has been successfully applied to a number of
    practical problems, including NLP. Let $\{(x_1, y_1), \dots, (x_N, y_N)\}$ be the set of N
    training examples, where each instance $x_i$ is a vector in $\mathbb{R}^N$ and
    $y_i \in \{-1, +1\}$ is the class label. In its basic form, an SVM learns a linear hyperplane
    that separates the set of positive examples from the set of negative examples with
    maximal margin (the margin is defined as the distance of the hyperplane to the
    nearest of the positive and negative examples). This learning bias has proved
    to have good properties in terms of generalization bounds for the induced classifiers.
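The maximal-margin hyperplane described above is usually written as the following optimisation problem. The weight vector $w$ and bias $b$ are symbols introduced here for illustration and do not appear in the original text; this is the standard hard-margin formulation, not something specific to this report.

$$\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, N$$

Under this scaling the distance from the hyperplane to the nearest training example is $1 / \lVert w \rVert$, so minimising $\lVert w \rVert$ maximises the margin.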

    The SVMTool is intended to comply with all the requirements of modern
    NLP technology by combining simplicity, flexibility, robustness, portability
    and efficiency with state-of-the-art accuracy. This is achieved by working in
    the Support Vector Machines (SVM) learning framework and by offering
    NLP researchers a highly customizable sequential tagger generator.

    Unsupervised POS Tagging: Unlike the supervised models, the unsupervised POS

    tagging models do not require a pre-tagged corpus. Instead, they use advanced computational

    methods like the Baum-Welch algorithm to automatically induce tagsets, transformation rules

    etc. Based on the information, they either calculate the probabilistic information needed by

    the stochastic taggers or induce the contextual rules needed by rule-based systems or

    transformation based systems.

    Transformation-based POS tagging: In general, the supervised tagging
    approach requires large pre-annotated corpora for training, which
    are difficult to obtain in most cases. Recently, however, a good amount of work has been
    done on automatically inducing transformation rules. One approach to
    automatic rule induction is to run an untagged text through a tagging model
    and get the initial output. A human then goes through the output of this first
    phase and corrects any erroneously tagged words by hand. This tagged text is
    then submitted to the tagger, which learns correction rules by comparing the
    two sets of data. Several iterations of this process are sometimes necessary
    before the tagging model can achieve considerable performance. The
    transformation based approach is similar to the rule based approach in the
    sense that it depends on a set of rules for tagging.
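The step in which the tagger "learns correction rules by comparing the two sets of data" can be sketched as below. This is a simplified illustration that assumes a single rule template, "change tag A to tag B when the previous tag is C", and picks the rule with the largest net number of corrections. Real transformation-based learners use many templates and repeat the step until no rule helps. The function name and the toy tag sequences are assumptions.

```python
from collections import Counter

def best_transformation(initial, gold):
    """Return the (from_tag, to_tag, prev_tag) rule with the highest net gain, or None."""
    gain = Counter()
    # Count the errors each candidate rule would fix.
    for i in range(1, len(initial)):
        if initial[i] != gold[i]:
            gain[(initial[i], gold[i], initial[i - 1])] += 1
    # Subtract the correct tags the same rule would wrongly change.
    for rule in list(gain):
        from_tag, _to_tag, prev_tag = rule
        for i in range(1, len(initial)):
            if initial[i] == from_tag and initial[i - 1] == prev_tag and gold[i] == from_tag:
                gain[rule] -= 1
    return max(gain, key=gain.get) if gain else None

# Hypothetical data: output of an initial tagger and the hand-corrected tags.
initial_tags = ["DT", "N", "N", "DT", "N", "N", "DT", "N"]
corrected_tags = ["DT", "JJ", "N", "DT", "JJ", "N", "DT", "N"]
rule = best_transformation(initial_tags, corrected_tags)
```

For this toy input the learned rule is "change N to JJ when the previous tag is DT": it fixes two errors and breaks one correct tag, for a net gain of one.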

    2.3.3 Hybrid Based Tagger

    A hybrid approach combines the features of both the rule based and the stochastic
    approaches. Like rule based systems, hybrid taggers use rules to specify tags. Like
    stochastic systems, they use machine learning to induce rules automatically from a tagged
    training corpus. The transformation-based learning (TBL) tagger, or Brill tagger, shares
    features of the hybrid approach. This approach inherits the advantages and disadvantages of
    both the rule based and the stochastic approaches.

    2.4 Indian Language POS Taggers

    There has been a lot of interest in Indian language POS tagging in recent years. POS tagging
    is one of the basic steps in many language processing tasks, so it is important to build good
    POS taggers for these languages. However, we found that very little work has been done
    on Bengali POS tagging and that very limited resources are available.
    The oldest work on Indian language POS tagging that we found is by Bharati et al. (1995).
    They presented a framework for Indian languages where POS tagging is implicit and
    is merged with the parsing problem in their work on a computational Paninian parser.

    For Bengali, Dandapat et al. (2007) studied the possibility of developing a tagger
    using HMM and Maximum Entropy (ME) models. They used a morphological analyzer
    to compensate for the shortage of annotated corpora. With these two models they implemented
    a supervised tagger and a semi-supervised tagger and reported an accuracy of around 88% for
    the two approaches. Ekbal et al. (2007) annotated a news corpus and developed an SVM based
    tagger. They reported an accuracy of 86.84% for their tagger.


    An attempt at Hindi POS disambiguation was made by Ray et al. (2003). The
    part-of-speech tagging problem was solved as an essential requirement for local word
    grouping. Lexical sequence constraints were used to assign the correct POS labels for Hindi.
    A morphological analyzer was used to find the possible POS of every word in a sentence.

    A rule based POS tagger for Tamil (Arulmozhi et al., 2004) has been developed using a
    combination of lexical rules and context sensitive rules. They used a very coarse grained
    tagset of only 12 tags. They reported an accuracy of 83.6% using only lexical rules and
    88.6% after applying the context sensitive rules. The accuracies reported in the work were
    measured on a very small reference set of 1000 words.

    Shrivastav et al. (2006) presented a CRF based statistical tagger for
    Hindi. They used 24 different features (lexical features and spelling features) to generate the
    model parameters. They experimented on a corpus of around 12,000 tokens annotated
    with a tagset of size 23. The reported accuracy was 88.95% with 4-fold cross validation.

    Smriti et al. (2006) describe a technique for morphology-based POS tagging
    in a limited resource scenario. The system uses a decision tree based
    learning algorithm (CN2). They used a stemmer, a morphological analyzer and a verb group
    analyzer to assign morphotactic tags to all the words, which identify the Ambiguity
    Scheme and Unknown Words. Further, a manually annotated corpus was used to generate
    If-Then rules to assign the correct POS tags for each ambiguity scheme and for unknown words.
    A tagset of 23 tags was used for the experiment. An accuracy of 93.5% was reported with
    4-fold cross validation on modestly sized corpora (around 16,000 words).

    In 2006, two machine learning contests on part-of-speech tagging and
    chunking for Indian languages were organized to provide a platform for researchers to work
    on a common problem. Both contests were conducted for three Indian languages:
    Hindi, Bengali and Telugu. All the languages used a common tagset of 27 tags. The results of
    the contests give an overall picture of Indian language POS tagging. The first contest was
    conducted by the NLP Association of India (NLPAI) and IIIT-Hyderabad in the summer of 2006.


    CHAPTER 3

    Foundational Considerations


    In this chapter we discuss several important issues related to the POS tagging problem, which
    can greatly influence the performance of a tagger. One important issue in POS tagging is
    collecting and annotating corpora. Most statistical techniques rely on some amount of
    annotated data to learn the underlying language model. The size of the corpus and the amount
    of corpus ambiguity have a direct influence on the performance of a tagger. Finally, there are
    several other issues, e.g. how to handle unknown words and which smoothing techniques to use,
    that contribute to the performance of a tagger.

    In the following sections, we discuss two important issues related to POS tagging.
    The first section discusses the process of corpora collection. In the second section we present
    the tagset used for our experiments.

    3.1. Corpora Collection

    The compilation of raw text corpora is generally no longer a big problem, since nowadays most
    documents are written in a machine readable format and are available on the web. However,
    collecting raw corpora is somewhat more difficult for Bengali (this may be true for other
    Indian languages as well) than for English and other European languages. This is because
    many different encoding standards are in use. Also, the number of Bengali
    documents available on the web is comparatively limited.

    Raw corpora do not carry much linguistic information. Corpora acquire higher
    linguistic value when they are annotated, that is, when some amount of linguistic information
    (part-of-speech tags, semantic labels, syntactic analysis, named entities etc.) is embedded
    in them. Although many corpora (both raw and annotated) are available for English and other
    European languages, we had no tagged data for Bengali with which to start the POS tagging
    task. The raw corpus developed at TDIL was available to us. We used a portion of the TDIL
    corpus to develop the annotated data for the experiments.

    3.2. The Tagset

    With respect to the tagset, the main feature that concerns us is its granularity, which is
    directly related to the size of the tagset. If the tagset is coarse, the tagging accuracy will
    be much higher, since only the important distinctions are considered, and the classification
    may be easier both for human annotators and for the machine. However, some important
    information may be missed because of the coarse grained tagset. On the other hand, a too
    fine-grained tagset may enrich the supplied information, but the performance of the automatic
    POS tagger may decrease. A much richer model has to be designed to capture the encoded
    information when using a fine grained tagset, and hence it is more difficult to learn.

    So, when designing a tagset for the POS disambiguation task, some
    issues need to be considered. Such issues include the type of application (some
    applications may require more complex information, whereas category information alone may
    be sufficient for other tasks) and the tagging technique to be used (rule based, which can
    adopt large tagsets very well, or supervised/unsupervised learning). Further, a large amount
    of annotated corpus is usually required for statistical POS taggers. A too fine grained tagset
    might be difficult for human annotators to use during the development of a large annotated
    corpus. Hence, the availability of resources needs to be considered during the design of a
    tagset.

    The Bureau of Indian Standards (BIS) has recommended the use of a common
    tagset for the part-of-speech annotation of Indian languages. The tagset, which incorporates
    the advice of experts and stakeholders in the area of natural language processing and
    language technology for Indian languages, has to be followed in annotation tasks carried
    out for Indian languages after August 2010.

    The BIS tagset has a total of 38 annotation level tags, which are common to all the
    Indian languages covered under this tagset. We use the eight (8) basic part-of-speech
    tags, i.e. Noun, Pronoun, Verb, Adjective, Adverb, Preposition/Postposition, Conjunction and
    Interjection, along with Residuals and Quantifiers from the BIS tagset.

    The table below lists the individual tags used in our experiments, with examples:


    Category                      Annotation tag    Examples
    Noun                          N                 িীপঙ্কর , রাম, লযাম , দিল্লী etc
    Pronoun                       PR                ক্ষস, দতদি,তা, দযদি, আদম, তুদম , আমরা, তারা etc
    Verb                          V                 কদর, করাম, খাওো, ে, ক্ষদখ etc
    Adjective                     JJ                খারাপ, ভাবা, েড়, ক্ষছটা etc
    Adverb                        RB                অদিকতর, অিবূর, এতটা, etc
    Preposition / Postposition    PSP               ক্ষেবক, হইবত, উপবর, দভতর etc
    Conjunction                   CC                এেং, দকন্তু , অেচ, অেো
    Interjection                  INJ               প্লীজ,িন্নোি,সােিাি, হাাঁ, etc
    Residuals                     RD                । , , , ?, “” , ‘ ‘ ,
    Quantifiers                   QT                প্রেম , ,১,২.etc

    Table 3.2: The tagset for Bengali with 10 tags


    CHAPTER 4

    Tagging with Rule Based

    Approach


    In the first section we describe the rule based approach to POS tagging, which we adopt
    because only a small labelled training set is available to us for Bengali POS tagging. The
    second section is devoted to our particular approach to Bengali POS tagging using the rule
    based approach.

    4.1. Rule Based Approach

    Rule based POS tagging models apply a set of hand-written rules and use contextual
    information to assign POS tags to each word in a sentence. These rules are often known as
    context frame rules. Most rule based taggers have a two-stage architecture. The first
    stage is simply a dictionary look-up procedure, which returns a set of potential tags and
    appropriate syntactic features for each word. The second stage uses a set of hand-written rules
    to discard contextually illegitimate tags and obtain a single best POS for each word. A context
    frame rule might say something like: “If the current word is a postposition then there is a
    high probability that the previous word is a noun.” For example, in the sentence
    “ক্ষস লদিির উপর পাের ছুবর মার।” the word “লদিির” has the noun-adjective {N, JJ} ambiguity.
    Since it is followed by a postposition, the rule above resolves this ambiguity in favour of
    the noun tag.

    In addition to contextual information, many taggers use morphological information to

    help in the disambiguation process. An example of a rule that makes use of morphological

    information is: IF word ends with –“ইরেছি / ছিলাম ” and preceding word is a verb THEN

    label it a verb (V).
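The two kinds of rule described above, one contextual and one morphological, can be sketched as follows. This is a hedged illustration rather than the rule set of the system described in this report: the function names are assumptions, and the romanized words and the suffix "chhilam" are hypothetical stand-ins for Bengali-script data.

```python
def resolve_before_postposition(candidates):
    """Context rule: a word directly before a postposition (PSP) is read as a noun (N)."""
    resolved = [set(c) for c in candidates]
    for i in range(1, len(resolved)):
        previous_is_ambiguous = len(resolved[i - 1]) > 1
        if resolved[i] == {"PSP"} and previous_is_ambiguous and "N" in resolved[i - 1]:
            resolved[i - 1] = {"N"}
    return resolved

def apply_suffix_rule(words, tags, suffixes=("chhilam",)):
    """Morphological rule: a word with a listed suffix that follows a verb is a verb (V)."""
    result = list(tags)
    for i in range(1, len(words)):
        if words[i].endswith(suffixes) and result[i - 1] == "V":
            result[i] = "V"
    return result

# Candidate tag sets for a three-word sequence: pronoun, ambiguous word, postposition.
context_result = resolve_before_postposition([{"PR"}, {"N", "JJ"}, {"PSP"}])

# Hypothetical romanized tokens; "UNK" marks a word not found in the lexicon.
suffix_result = apply_suffix_rule(["kheye", "phelechhilam"], ["V", "UNK"])
```

The first rule narrows the {N, JJ} ambiguity to N, and the second assigns V to the unknown word because it carries a listed suffix and follows a verb.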

    Speed is an advantage of rule based taggers and, unlike stochastic taggers, they are
    deterministic. However, considerable effort is required to write the disambiguation rules.
    Also, a rule based tagger is usable for only one language, i.e. it is language dependent.
    Using it for another language requires rewriting most of the program.

    4.2. Our Approach

    4.2.1 System Flow Diagram

    This section is concerned with how the processing tasks are designed. Here we address
    the following questions:

    Which modules need to be designed?

    How are they interconnected?


    [Flow diagram: Start → Show the GUI → Accept Bengali language input → Divide the sentence
    into tokens → Tokens with suffix/affix? (Yes: split tokens into their stems by stemming;
    No: continue) → Assign the tags to tokens in the tagger → Find ambiguous words → Assign the
    tags to ambiguous words using POS tagging rules → View the result → Stop]

    Fig 4.2.1: Flow diagram


    Fig 4.2.1 shows the flow of data through the
    system. The system consists of the following components/modules:

    GUI (Graphical User Interface): The interface through which the user communicates with
    the back-end files. The interface should be simple in appearance and easy to maintain.

    Tokenizer: This module generates the tokens of the given input sentence. It also
    calls the other modules when required. The tokens of the sentence are stored
    in a String array for further processing.

    Stemming: The stemming module reduces a word to its stem, i.e. root. Stemming is an
    important and common requirement of many Natural Language Processing
    tasks. It is useful for indexing and search systems, which are
    key components of text mining applications and IR systems. It has also
    been used to improve the performance of spelling checkers where morphological
    analysis would be computationally expensive. A stemmer can also reduce the size of a
    dictionary, which is the main reason to use a stemmer in spelling checker applications
    on mobile and other handheld devices.

    Tagging: The tagging module assigns tags to tokens. It also searches for ambiguous
    words and marks them with special symbols according to their type. Words that are
    not present in the lexicon are treated as unknown.
    Ambiguous words are words that can act as a noun or an adjective, or as an adjective
    or an adverb, depending on the context.

    Resolving ambiguity: The ambiguity identified in the tagging module is resolved
    using Bengali grammar rules.

    Displaying results: This module displays the final result. The tokens, i.e. the
    words in the sentences, are shown with their corresponding parts of speech.
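The tokenizer, stemming and tagging modules listed above can be sketched as a small pipeline. This is not the system's actual code (which, according to the description, stores tokens in a String array). The suffix list, the lexicon entries and the romanized words are hypothetical stand-ins for Bengali-script data, and `UNK` is an assumed marker for unknown words.

```python
import re

# Hypothetical romanized stand-ins; a real system would use Bengali script
# and a much larger lexicon and suffix list.
SUFFIXES = ("der", "ke", "ra", "er")      # tried in this order
LEXICON = {
    "ami": {"PR"},
    "chhele": {"N"},
    "dekhi": {"V"},
    "bhalo": {"JJ", "RB"},                # ambiguous entry
}
PUNCTUATION = {"\u0964", ",", "?", "!"}   # \u0964 is the Bengali danda

def tokenize(sentence):
    """Split on whitespace and emit punctuation marks as separate tokens."""
    return re.findall(r"[^\s\u0964,?!]+|[\u0964,?!]", sentence)

def stem(token):
    """Strip the first matching suffix, keeping at least two characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 2:
            return token[: -len(suffix)]
    return token

def candidate_tags(token):
    """Look up the token, then its stem; fall back to UNK (unknown)."""
    if token in PUNCTUATION:
        return {"RD"}
    return LEXICON.get(token) or LEXICON.get(stem(token)) or {"UNK"}

def tag_sentence(sentence):
    """Return (token, candidate tags, is_ambiguous) for every token."""
    result = []
    for token in tokenize(sentence):
        tags = candidate_tags(token)
        result.append((token, tags, len(tags) > 1))
    return result

tagged = tag_sentence("ami chheleder bhalo dekhi \u0964")
```

Tokens flagged as ambiguous (here the entry with both JJ and RB) would then be passed to the ambiguity resolution module, which applies the grammar rules.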


    CHAPTER 5

    Experimental Result &

    Discussion


    5.1 Tools Used

    Software: A few software tools were used in the development of the project,
    as listed below:

    - JDK 1.7.0_05

    - NetBeans IDE 7.1.1: NetBeans IDE lets you quickly and easily develop
      Java desktop, mobile and web applications. It can be downloaded directly from
      https://netbeans.org/downloads/

    Fig 5.1.1 NetBeans IDE

    - Notepad: Notepad is a simple plain-text editor for Microsoft Windows that can be
      used to create documents. It has been included
      in all versions of Microsoft Windows since Windows 1.0 in 1985, so there is no need
      to download it. The resulting file is
      typically saved with the .txt extension. It looks like a simple application, but it is
      very useful in software development: it can be used to write source code in languages
      such as C, C++, Java and HTML, saved with the appropriate extensions.

    Fig 5.1.2 Notepad


    Hardware: We designed and developed the whole system on an ACCR notebook with the
    following specifications:

    Processor: Intel(R) Pentium(R) CPU 2030M @ 2.50GHz

    RAM : 4.00 GB

    HDD : 500 GB

    Although this system is adequate for development, it is slow when handling large amounts of
    data: the larger the data, the slower the system responds. This is due to the
    processor; with an i3 or better processor, the speed should improve.

    5.2 Graphical User Interface

    Snapshot1: This is the welcome screen of our project. Click the Proceed button to go

    to the Tagging section.

    Fig 5.2.1: Welcome Screen


    Snapshot 2: Here we first enter the Bengali sentence to be tagged in the
    blank text field and then press the TAG button. The RESET button
    removes all text from the text field.

    Fig 5.2.2 : The Tagging Menu


    5.3 Experimental Results

    The system has been tested with a set of data. The input text was taken from the corpus
    discussed in Chapter 3. Only four results are shown in the following
    snapshots.

    Result I:


    Result II:


    Result III:


    Result IV:


    5.4 Result Discussion

    Accuracy of the tagger is computed as the ratio of the number of words correctly tagged by

    the system to the total number of tested words.

    Accuracy = (Number of correctly tagged words / Total number of tested words) × 100%
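
    The accuracy measure defined above can be sketched in Java as follows. This is a minimal illustration, not the report's own evaluation code: the class name TaggerAccuracy, the method accuracy, and the tag labels in the example are all assumptions made for the sketch.

```java
import java.util.Arrays;
import java.util.List;

final class TaggerAccuracy {

    /**
     * Returns the percentage of positions at which the predicted tag
     * equals the reference (manually assigned) tag.
     */
    static double accuracy(List<String> referenceTags, List<String> predictedTags) {
        if (referenceTags.size() != predictedTags.size()) {
            throw new IllegalArgumentException("Tag sequences must have the same length");
        }
        if (referenceTags.isEmpty()) {
            throw new IllegalArgumentException("No words to evaluate");
        }
        int correct = 0;
        for (int i = 0; i < referenceTags.size(); i++) {
            if (referenceTags.get(i).equals(predictedTags.get(i))) {
                correct++;
            }
        }
        return 100.0 * correct / referenceTags.size();
    }

    public static void main(String[] args) {
        // Hypothetical tags for a four-word sentence; not taken from the report's corpus.
        List<String> reference = Arrays.asList("NN", "JJ", "NN", "VB");
        List<String> predicted = Arrays.asList("NN", "NN", "NN", "VB");
        System.out.println("Accuracy (%): " + accuracy(reference, predicted));
    }
}
```

    In this example three of the four predicted tags match the reference, so the ratio is 3/4, that is, 75%.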

    The following are the observations that have been made during testing the system.

    Test      No. of tested words    Accuracy
    Test 1    150                    67%
    Test 2    400                    71%
    Test 3    800                    78%
    Test 4    1200                   82%

    The overall accuracy of the system was computed as the mean of the four test results, which gives 74.50%.
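
    The averaging step can be checked with a short Java sketch. The method simpleMean reproduces the unweighted mean used in this report, (67 + 71 + 78 + 82) / 4 = 74.50. The method weightedMean is an alternative that is not used in the report: it weights each test by its number of tested words and, assuming the rounded percentages in the table are exact, gives about 78.14%. The class and method names are assumptions made for the illustration.

```java
final class OverallAccuracy {

    /** Unweighted mean of the per-test accuracies (the averaging used in the report). */
    static double simpleMean(double[] accuracies) {
        double sum = 0.0;
        for (double a : accuracies) {
            sum += a;
        }
        return sum / accuracies.length;
    }

    /** Alternative: each test counts in proportion to its number of tested words. */
    static double weightedMean(int[] testedWords, double[] accuracies) {
        if (testedWords.length != accuracies.length) {
            throw new IllegalArgumentException("Arrays must have the same length");
        }
        double weightedSum = 0.0;
        long totalWords = 0;
        for (int i = 0; i < accuracies.length; i++) {
            weightedSum += testedWords[i] * accuracies[i];
            totalWords += testedWords[i];
        }
        return weightedSum / totalWords;
    }

    public static void main(String[] args) {
        // Values copied from the results table (Test 1 to Test 4).
        int[] testedWords = {150, 400, 800, 1200};
        double[] accuracies = {67.0, 71.0, 78.0, 82.0};
        System.out.println("Simple mean (%): " + simpleMean(accuracies));
        System.out.println("Word-weighted mean (%): " + weightedMean(testedWords, accuracies));
    }
}
```

    The weighted figure is higher because the larger tests, which scored better, contribute more words. Which of the two figures is more appropriate depends on whether each test or each word is treated as the unit of evaluation.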



    CHAPTER 6

    Conclusion & Future Works


    6.1 Conclusion

    Part-of-speech tagging plays an important role in many speech and language processing applications in NLP. Its importance has grown as major companies such as Google and Microsoft concentrate on natural language processing applications. Many tools are currently available for part-of-speech tagging. In this report, our effort was a computational linguistic analysis of the Bengali language through the development of a tagging system, which achieved an overall accuracy of 74.50%. The results show that the performance of the tagger depends on the size of the lexicon and the corpus, and that performance can be increased by increasing the size of the lexicon.
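
    The dependence on lexicon size can be illustrated with a small Java sketch. It assumes a plain lexicon lookup in which any word missing from the lexicon receives a placeholder tag; this is a simplification for illustration and not necessarily how the tagger in this report works. The class name LexiconLookupSketch, the UNK tag, the tag labels and the romanised example words are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class LexiconLookupSketch {

    // Placeholder tag for words that are not in the lexicon (an assumption of this sketch).
    static final String UNKNOWN = "UNK";

    /** Tags each word by direct lexicon lookup; unknown words get the UNKNOWN tag. */
    static String[] tag(String[] words, Map<String, String> lexicon) {
        String[] tags = new String[words.length];
        for (int i = 0; i < words.length; i++) {
            String found = lexicon.get(words[i]);
            tags[i] = (found != null) ? found : UNKNOWN;
        }
        return tags;
    }

    /** Percentage of words that are present in the lexicon. */
    static double coverage(String[] words, Map<String, String> lexicon) {
        if (words.length == 0) {
            throw new IllegalArgumentException("No words to check");
        }
        int known = 0;
        for (String word : words) {
            if (lexicon.containsKey(word)) {
                known++;
            }
        }
        return 100.0 * known / words.length;
    }

    public static void main(String[] args) {
        // Romanised stand-ins for a three-word Bengali sentence.
        String[] sentence = {"ami", "bhat", "khai"};

        Map<String, String> lexicon = new HashMap<String, String>();
        lexicon.put("ami", "PRP");
        System.out.println("Small lexicon coverage (%): " + coverage(sentence, lexicon));

        // Enlarging the lexicon leaves fewer words untagged.
        lexicon.put("bhat", "NN");
        lexicon.put("khai", "VB");
        System.out.println("Larger lexicon coverage (%): " + coverage(sentence, lexicon));
    }
}
```

    With one lexicon entry only one of the three words can be tagged; after two more entries are added, all three are covered. A tagger of this kind can only be correct on words it knows, which is one reason a larger lexicon tends to raise accuracy.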

    6.2 Future Work

    Work remains to be done in several directions. Though we attained an accuracy of 74.50% for known words, there is still room to enhance the performance of the tagger. This can be achieved by extending the tagset and enlarging the lexicon so that the tagger can classify text less ambiguously. Our results can also be compared with those achieved by tagging systems for other Indian languages.


    References

    Church K. W. 1988. A stochastic parts program and noun phrase parser for unrestricted text.

    Proceedings of the second conference on Applied Natural Language Processing.

    Austin, Texas, 136-143.

    Ramshaw L. A. and Marcus M. P. 1995. Text chunking using transformation-based learning.

    In Proc. Third Workshop on Very Large Corpora. ACL, 1995

    Wilks Y., and Stevenson M. 1997. Combining Independent Knowledge Sources for Word

    Sense Disambiguation. In Proceedings of the Third Conference on Recent Advances

    in Natural Language Processing Conference (RANLP-97), Bulgaria. 1-7.

    Heeman, P. A. and J. F. Allen. 1997. Incorporating POS tagging into language modelling. In

    Proceedings of the 5th European Conference on Speech Communication and

    Technology (Eurospeech), Rhodes, Greece.

    Ray P. R., Harish V., Basu A. and Sarkar S., 2003. Part of Speech Tagging and Local Word

    Grouping Techniques for Natural Language Processing. In Proceedings 1st

    International Conference on Natural Language Processing

    Shrivastav M., Melz R., Singh S., Gupta K. and Bhattacharyya P., 2006. Conditional

    Random Field Based POS Tagger for Hindi. In Proceedings of the MSPIL, Bombay. 63-68.

    Dandapat, S., Sarkar, S., Basu, A. (2007) “Automatic Part-of-Speech Tagging for Bengali: An

    Approach for Morphologically Rich Languages in a Poor Resource Scenario”. In:

    Association for Computational Linguistics, pp 221-224.

    Bharati, A., Chaitanya V., Sangal R. (1995) “Natural Language Processing: A Paninian

    Perspective”. Prentice-Hall India, New Delhi.

    Arulmozhi P., Rao R. K. and Sobha L., 2006. A Hybrid POS Tagger for a Relatively Free

    Word Order Language. In Proceedings of the Modeling and Shallow Parsing of

    Indian Language (MSPIL), Bombay. 79-85.


    Singh S., Gupta K., Shrivastav M. and Bhattacharyya P. 2006. Morphological Richness

    Offset Resource Demand – Experience in constructing a POS Tagger for Hindi. In

    Proceedings of COLING/ACL 06. 779-786.

    Dalal A., Nagaraj K., Sawant U., Shelke S. and Bhattacharyya P. 2007. Building Feature Rich

    POS Tagger for Morphologically Rich Languages: Experience in Hindi. In

    Proceedings of ICON, India.

    Greene B. B. and Rubin G. M., 1971. Automatic grammatical tagging of English. Technical

    Report, Department of Linguistics, Brown University.

    Samuelsson C., Voutilainen A. 1997. Comparing a linguistic and a stochastic tagger. In

    Proceedings of the eighth conference on European chapter of the Association for

    Computational Linguistics (EACL), Madrid, Spain. 246-253

    Ekbal, A., Bandyopadhyay, S., (2007) ”Lexicon Development and POS tagging using A

    Tagged for Marathi Text” 2014 in proceeding of: International Journal of Computer

    Science and Information Technologies, Vol.5 (2),2014,1322-1326.


    APPENDIX

    CD