
    Chapter 7 Evaluation and Results

    7.1 Introduction

The evaluation of a Machine Translation system and the measurement of translation performance is a difficult and complex task. Many factors are involved, the most important being that natural language is not exact in the way that mathematical models and theories in science are. Consequently, even commercial MT systems cannot translate all texts reliably. [158]

    7.2 Types of Evaluation [95]

Three broad classes of MT evaluation strategy are described below:

    Typological Evaluation seeks to specify which particular linguistic constructions

    the system handles satisfactorily and which it does not. The principal tool for such

    an investigation is a test suite – a set of sentences which individually represent

    specified constructions and hence constitute performance probes.

Declarative Evaluation seeks to specify how an MT system performs relative to various dimensions of translation quality.

Operational Evaluation seeks to establish how effective an MT system is likely to be (e.g. in terms of cost-effectiveness) as part of a given translation process.


    7.2.1 Typological Evaluation

Typological evaluation is primarily of interest to system developers. Potential users may not be familiar with the linguistic descriptions used, nor is it likely to be apparent how frequently some missing or badly handled construction might occur in their particular text type. The system is tested with a suite of sentences that illustrate particular types of linguistic constructions the system is likely to encounter in its lifetime. If the system is intended to operate within a particular subject field, then its design will naturally reflect this sublanguage. The test-suite approach has an advantage over a corpus-based approach: a corpus contains a large amount of redundancy, i.e. most constructions will be encountered more than once, whereas in a test suite each construction appears only once.

Once a corpus has been established for the task in hand, statistical information regarding the type and frequency of the lexical and grammatical phenomena it contains should be obtained, in order to evaluate the system's ability to translate the sentences in the corpus. If good observed frequency data are not available, the system's potential may be over- or underestimated. At present, however, this statistical information must almost certainly be gathered by hand (a laborious process), as a tool capable of parsing texts in this way is not available. Assuming that the relative frequency of the phenomena contained in the corpus has been established, the test suite can then be constructed.


    7.3 Metrics used for Automatic Evaluation

Human evaluation of an MT system is time-consuming and laborious, which makes it impractical for developers; moreover, the human effort spent cannot be reused. Human evaluations of Machine Translation (MT) consider many aspects of translation, including adequacy, fidelity and fluency [159]. A metric that evaluates Machine Translation output represents the quality of that output. The quality of a translation is inherently subjective and there is no objective or quantifiable "good". Therefore, any metric must assign quality scores that correlate with human judgments of quality. That is, a metric should give high scores to translations that humans score highly, and low scores to those that humans score poorly. Human judgment is the benchmark for assessing automatic metrics, as humans are the end-users of any translation output. [161]

Many automated measures have been proposed to facilitate fast and cheap evaluation of MT systems. Most efforts focus on devising metrics that measure the closeness of the MT system's output to one or more human translations; the closer the output, the better it is considered to be.

Some methods for automatic evaluation of MT are discussed below:

BLEU (BiLingual Evaluation Understudy): The rationale behind the development of BLEU is that human evaluation of Machine Translation can be time-consuming and expensive. An automatic evaluation metric, on the other hand, can be used for frequent tasks such as monitoring incremental system changes during development, which would be infeasible with manual evaluation. The quality of translation is indicated as a number between 0 and 1 and is measured as statistical closeness to a given set of good-quality human reference translations. It does not directly take into account translation intelligibility or grammatical correctness. The primary computation in BLEU is to compare the n-grams of the candidate with the n-grams of the reference translations and count the number of matches. These matches are position independent: the more matches, the better the candidate translation. To compute the modified n-gram precision for any n, each candidate n-gram count is clipped by the maximum number of times that n-gram occurs in any reference translation; the clipped counts are summed and divided by the total number of candidate n-grams. The modified n-gram precision on a multi-sentence test set is computed by the formula:

p_n = [ Σ_{C ∈ {Candidates}} Σ_{n-gram ∈ C} Count_clip(n-gram) ] / [ Σ_{C′ ∈ {Candidates}} Σ_{n-gram′ ∈ C′} Count(n-gram′) ]


    This means that a word-weighted average of the sentence-level modified

    precision is used rather than a sentence-weighted average.
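To make the clipped-count computation concrete, the following Python sketch (illustrative only, using hypothetical toy data; it is not the implementation used in this work) computes corpus-level modified n-gram precision as defined above.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of n-grams occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidates, reference_sets, n):
    """Corpus-level modified n-gram precision (clipped counts / total counts)."""
    clipped, total = 0, 0
    for cand, refs in zip(candidates, reference_sets):
        cand_counts = ngram_counts(cand, n)
        # Maximum count of each n-gram over all reference translations
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        # Clip each candidate n-gram count by its reference maximum
        clipped += sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total += sum(cand_counts.values())
    return clipped / total if total else 0.0

# Toy example: "the" and "cat" are each clipped to 1, so p1 = 2/4 = 0.5
candidate = ["the", "cat", "the", "cat"]
references = [["the", "cat", "sat", "there"]]
print(modified_precision([candidate], [references], 1))
```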

NIST: NIST is another method for evaluating the quality of text translated using Machine Translation. It is based on the BLEU metric with some alterations. It calculates how informative a particular n-gram is, so that rarer (more informative) n-grams receive greater weight. Its brevity penalty is also modified so that small variations in translation length do not greatly affect the overall score.

METEOR: The current version of the METEOR automatic evaluation metric scores Machine Translation hypotheses by aligning them to one or more reference translations. Alignments are based on exact, stem, synonym and paraphrase matches between words and phrases. A lexical similarity score is then calculated from the alignment for each hypothesis-reference pair. The metric includes several free parameters that are tuned to emulate various human judgment tasks, including adequacy, ranking, and HTER.

    Word Error Rate: WER works at the word level. It was originally used for

    measuring the performance of speech recognition systems, but is also used in

    the evaluation of Machine Translation. The metric is based on the calculation of

    the number of words that differ between a piece of machine translated text and a

    reference translation. This measure is based on the Levenshtein distance — the


    minimum number of substitutions, deletions and insertions that have to be

    performed to convert the automatic translation into a valid translation.
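A minimal sketch of this computation in Python is given below; it computes the word-level Levenshtein distance by dynamic programming and normalises it by the reference length. The example sentences are hypothetical.

```python
def word_error_rate(hypothesis, reference):
    """WER = word-level Levenshtein distance / number of reference words."""
    hyp, ref = hypothesis.split(), reference.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + sub)    # substitution
    return dist[len(ref)][len(hyp)] / len(ref)

# Two words ("me", "was") must be inserted, so WER = 2/7
print(word_error_rate("he told his name surinder",
                      "he told me his name was surinder"))
```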

PER (Position-Independent Word Error Rate): A shortcoming of the WER measure is that it does not allow reordering of words. To overcome this problem, the position-independent word error rate (PER) compares the words of the two sentences without taking word order into account. [162]
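For comparison, the sketch below shows one common bag-of-words formulation of PER (definitions in the literature vary slightly in how surplus hypothesis words are penalised); the example data are hypothetical.

```python
from collections import Counter

def position_independent_error_rate(hypothesis, reference):
    """Unmatched words (ignoring order) divided by the reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    matches = sum((Counter(hyp) & Counter(ref)).values())  # bag-of-words overlap
    errors = max(len(hyp), len(ref)) - matches
    return errors / len(ref)

# Word order is ignored, so only the missing word "the" counts: PER = 1/6
print(position_independent_error_rate("bag of cloth was searched",
                                       "the bag of cloth was searched"))
```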

TER (Translation Edit Rate) [163]: TER measures the amount of post-editing that a

    human would have to perform to change a system output so it exactly matches a

    reference translation. Possible edits include insertions, deletions, and

    substitutions of single words as well as shifts of word sequences. All edits have

    equal cost.

7.4 Related Work

The ALPAC report in 1966 was a study comparing different levels of human translation with Machine Translation output, using human subjects as judges. It considered two variables, Fidelity and Intelligibility. Fidelity (or Accuracy) is a measure of how much information the translated sentence retains compared with the original. Intelligibility is a measure of the understandability of the results of automatic translation. [164] The Advanced Research Projects Agency (ARPA) created a methodology to evaluate Machine Translation systems and continues to perform evaluations based on this methodology. The evaluation programme was initiated in 1991 and continues to this day. It involved testing several systems based on different theoretical approaches: statistical, rule-based and human-assisted. A number of methods for evaluating the output of these systems were tested in 1992, including comprehension evaluation, quality panel evaluation, and evaluation based on adequacy and fluency. [161] The first approach to metric combination based on human likeness was given by Corston-Oliver, who used decision trees to distinguish between human-generated ('good') and machine-generated ('bad') translations. [165] They suggested using classifier confidence scores directly as a quality indicator. High levels of classification accuracy were obtained; however, they focused on evaluating only the well-formedness of automatic translations (i.e., subaspects of fluency). Preliminary results using Support Vector Machines were also discussed. Kulesza and Shieber extended the approach of Corston-Oliver to take into account aspects of quality beyond fluency alone. Instead of decision trees, they trained Support Vector Machines (SVMs), using features inspired by well-known metrics such as BLEU, NIST, WER, and PER. Metric quality was evaluated both in terms of classification accuracy and in terms of correlation with human assessment at the sentence level. A significant improvement with respect to standard individual metrics was reported. [167]

In a different research line, Akiba et al. suggested directly predicting human scores of acceptability, approached as a multiclass classification task. They used decision-tree classifiers trained on multiple edit-distance features based on combinations of lexical, morphosyntactic and lexical-semantic information (e.g., word, stem, part-of-speech, and semantic classes from a thesaurus). Promising results were obtained in terms of local accuracy over an internal predefined set of overall quality assessment categories. Quirk presented a similar approach, also aiming to approximate human quality judgements, with the particular feature that human references were not required. More recently, Paul extended these works to account for separate aspects of quality: adequacy, fluency and acceptability. They used SVM classifiers to combine the outcomes of different automatic metrics at the lexical level (BLEU, NIST, METEOR, GTM, WER, PER and TER). Also very recently, Albrecht and Hwa re-examined the SVM-classification approach of Kulesza and Shieber and Corston-Oliver and, inspired by the work of Quirk, suggested a regression-based learning approach to metric combination, with and without human references. [170-171] Their results outperformed those of Kulesza and Shieber in terms of correlation with human assessments. In a different approach, Ye suggested treating sentence-level MT evaluation as a ranking problem. They used the Ranking SVM algorithm to sort candidate translations. Assessments were based on a 1-4 scale similar to the overall quality categories used by Akiba. [161-165]

    7.5 Approach followed for Evaluation of Machine Translation System

Different metrics are used to evaluate the different stages of the Machine Translation System. Stage 1 evaluates the tagging of words with their parts of speech, Stage 2 corresponds to the evaluation of the phrase chunker, and Stage 3 is the evaluation of the final translation. These stages are discussed in the sub-sections below.

    7.5.1 Evaluation of Part-of-Speech Tagging

The most commonly used evaluation measure for part-of-speech tagging is accuracy, expressed either as a percentage or as a value between 0 and 1. This accuracy measure is defined below. For evaluation, the part-of-speech tagger was applied to a set of 1000 sentences collected from crime news in various newspapers and from legal documents. These sentences contain about 8665 words. The output was manually evaluated to mark the correct and incorrect tag assignments. The tagset contains a total of 527 tags, and a separate set of 1000 sentences with 8922 words was used as the tagged corpus for training the part-of-speech tagger. Table 7.1 provides the tagging results.

Table 7.1 Part-of-Speech Tagging Results

Total Words    Correct Tags    Incorrect Tags    Unknown Words
8665           7757            407               501

Accuracy = (Total Number of Words having Correct Tags) / (Total Number of Words Tagged)


Based on the data given in the above table and the accuracy definition, the following accuracy measures (in percentage) were calculated. Accuracy 1 represents the typical accuracy of our tagger, i.e. the total number of words having a unique correct tag divided by the total number of words tagged. In the sentences chosen for testing the system, some words are not recognized because they are not present in the morph database. These include some proper nouns that are not recognized by the proper-noun gazetteer because they are not followed or preceded by any of the special words stored in our database for recognizing proper nouns. Accuracy 2 is therefore calculated by dividing the number of correct tags by the difference between the total number of words and the number of unknown words. The accuracy results are shown below.

Table 7.2 Part-of-Speech Tagging Accuracy

Accuracy 1    Accuracy 2
89.52         95.01

Accuracy 1 = (No. of Words having Correct Tags) / (Total No. of Words)
Accuracy 2 = (No. of Words having Correct Tags) / (Total No. of Words − Unknown Words)
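As a quick check, the two accuracies can be recomputed directly from the counts in Table 7.1, as in the short Python sketch below.

```python
# Counts from Table 7.1
total_words   = 8665
correct_tags  = 7757
unknown_words = 501

accuracy_1 = correct_tags / total_words                    # including unknown words
accuracy_2 = correct_tags / (total_words - unknown_words)  # excluding unknown words

print(f"Accuracy 1 = {accuracy_1:.2%}")   # 89.52%
print(f"Accuracy 2 = {accuracy_2:.2%}")   # 95.01%
```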


Comparison with Existing Systems

The accuracy measure was compared with the accuracy of the rule-based tagger developed at Punjabi University, Patiala. The rule-based tagger has an accuracy of 80.29% including unknown words and 88.86% excluding unknown words. By using the statistical approach, the tagging accuracy including unknown words is increased to 89.52%, and excluding unknown words it is raised to 95.01%. One of the popular POS taggers is the TnT tagger, which has been shown to have high accuracy for English and some other languages; it provides an overall tagging accuracy of 96.64% and, specifically, 97.01% on known words. Dandapat et al. reported 95% accuracy for Hindi using HMM. [136] Shacham reported 87.27% accuracy for Hebrew. [174] Bharati and Mannem reported 67-77% accuracy for Hindi, Telugu, and Bengali using 24 tags and applying various statistical techniques. [142] A hybrid POS tagger for Tamil, using an HMM technique and a rule-based system, had a precision of 97.2%. The accuracy of the tagger in the present Machine Translation System is comparable to these highly accurate taggers.

    7.5.2 Evaluation of Phrase Chunking

Phrase chunking covers noun phrases, verb phrases, adjective phrases and postpositional phrases. The accuracy of the phrase chunker can be calculated using the formulas defined below:

Precision = (Number of Correct Proposed Chunks) / (Number of Proposed Chunks)

Recall = (Number of Correct Proposed Chunks) / (Number of Correct Chunks)

Precision can be seen as a measure of exactness or conformity, and recall is a measure of completeness. In other words, precision tells how accurate the system is and recall specifies how complete the system is. On a measurement scale of 0 to 1, a value close to 1 is desirable for both of these measures. The Fβ measure, or just F-measure (for β = 1), is the weighted harmonic mean of precision and recall. The value of β allows precision and recall to be weighted differently. In all the experiments conducted in this and the next section, β is set to 1, thus giving equal weight to precision and recall:

F_β = ((β² + 1) × Precision × Recall) / (β² × Precision + Recall)

For β = 1 this reduces to

F = (2 × Precision × Recall) / (Precision + Recall)

The phrase chunker for the present study was manually evaluated on 1000 sentences whose structure falls within our system's input scope. The results for three phrase chunk types are provided in Table 7.3 below. According to the input scope of the sentences, sentences containing only an adjective are almost negligible, so adjective phrases are not taken into account while evaluating the chunker.


Table 7.3 Phrase Chunking Counts

Phrase Type              Proposed Chunks    Correct Proposed Chunks    Correct Chunks
Noun Phrase              1534               1426                       1512
Postpositional Phrase    244                200                        278
Verb Phrase              1022               910                        964
Grand Total              2800               2536                       2754

Based on the values in Table 7.3, the precision, recall, and F-measure values for the different chunk types are shown in Table 7.4 below. These values are expressed as percentages.

Table 7.4 Phrase Chunking Results

Phrase Group Type        Precision    Recall    F-measure
Noun Phrase              92.95        94.31     93.62
Postpositional Phrase    81.96        71.94     76.62
Verb Phrase              89.04        94.39     91.63
Average                  87.98        86.88     87.29

As per the above table, the average precision comes out to be 87.98%, the average recall 86.88%, and the average F-measure 87.29%, which shows that if the words are tagged accurately and the structure of the sentences follows the stated assumptions, the precision and recall of the phrase chunker can reach high levels. Precision and recall for noun phrases and verb phrases are higher than for postpositional phrases, as the structure of postpositional phrases is more complex whereas noun phrases and verb phrases are relatively simple. Performance can be further increased by training the system with more phrases. A small worked example that recomputes these figures from the counts in Table 7.3 is given below.
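The Python sketch below applies the precision, recall and F-measure formulas defined above to the chunk counts of Table 7.3; it reproduces the values of Table 7.4 up to rounding.

```python
def precision_recall_f(proposed, correct_proposed, correct_reference, beta=1.0):
    """Precision, recall and F-beta from chunk counts."""
    precision = correct_proposed / proposed
    recall = correct_proposed / correct_reference
    f_beta = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f_beta

# Counts from Table 7.3: (proposed, correct proposed, correct in reference)
chunk_counts = {
    "Noun Phrase":           (1534, 1426, 1512),
    "Postpositional Phrase": ( 244,  200,  278),
    "Verb Phrase":           (1022,  910,  964),
}
for phrase_type, counts in chunk_counts.items():
    p, r, f = precision_recall_f(*counts)
    print(f"{phrase_type}: P = {p:.2%}, R = {r:.2%}, F = {f:.2%}")
```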

Comparison with Existing Systems

Singh et al. followed a rule-based approach for Hindi and reported 91% precision and 100% recall using 5 phrase tags. [153] Tamil text chunking had a precision of 97.4%. [175] The Hindi phrase chunker developed at IIT Kharagpur has a precision of 87.22% and a recall of 94.62%. The recall of the chunker can be improved by the addition of more rules.

    7.5.3 Evaluation of Final Translation

The collection of sentences used as input for evaluating a Machine Translation System varies across research efforts according to their specific research considerations. The common types of test material are described below.

    7.5.3.1 Selection of a Set

It is a very important aspect of MT evaluation to make an appropriate selection of sentences for evaluating the Machine Translation System. The sentences may be drawn from the following:


Test Corpora: A test corpus is a collection of naturally occurring text in electronic form.

Test Suites: A test suite is a collection of artificially constructed inputs, where each input is designed to probe a system's treatment of a specific phenomenon or set of phenomena. Inputs may be in the form of sentences, sentence fragments, or even sequences of sentences.

Test Collections: A test collection is a set of inputs associated with a corresponding set of expected outputs.

For the present system, a random selection of sentences was made to be used as input in the evaluation process. These sentences correspond to crime-related news taken either from various Punjabi newspapers or from First Information Reports (FIRs) gathered from local police stations and lawyers. Many of these sentences were complex, so they were first divided into simple sentences following the assumptions adopted for the system. Sentence length has also been restricted to a maximum of 12 words, and phrase length is restricted to a maximum of 6 words. The test data set is shown in Table 7.5.

Table 7.5 Test Data Set for the Evaluation of the Punjabi to English Machine Translation System

Total Sentences    1000
Total Words        8665


    7.5.3.2 Selection of Tests for Evaluation

There are a number of tests available for evaluating Machine Translation systems. In the evaluation procedure for the present Machine Translation System, both qualitative (subjective) and quantitative tests have been applied. The subjective evaluation includes two tests, an Intelligibility Test and an Accuracy Test, while the quantitative evaluation includes one test, the Word Error Rate (WER) Test. These tests are explained below.

    7.5.3.2.1 Intelligibility Tests:

A traditional way of assessing the quality of translation is to assign scores to output sentences. This test is used to check the intelligibility of the MT system. The intelligibility of a translated sentence is affected by grammatical errors, mistranslations and untranslated words. A four-point scale is adequate here: it measures intelligibility only, has low scatter and is sufficiently discriminatory, since the evaluation covers several hundred sentences and the average, expressed as a percentage, is sufficiently precise. The scoring scale gives top marks to sentences that read like perfect target-language sentences and bottom marks to those that are so badly degraded as to prevent the average translator/evaluator from guessing what a reasonable sentence might be in the context. In between these two extremes, output sentences are assigned higher or lower scores depending on their degree of degradation. [176] The scale is given in Table 7.6.

Table 7.6 Score Sheet for Intelligibility Test

Score    Significance
3        The sentence is perfectly clear and intelligible. It is grammatically correct.
2        The sentence is generally clear and intelligible. Despite some inaccuracies, one can understand the information to be conveyed.
1        The general idea is intelligible only after considerable study. The sentence contains grammatical errors and/or poor word choice.
0        The sentence is unintelligible. The meaning of the sentence is not understandable.

    7.5.3.2.2 Accuracy Test / Fidelity Measure

    By measuring intelligibility we get only a partial view of translation quality. A

    highly intelligible output sentence need not be a correct translation of the source

    sentence. It is important to check whether the meaning of the source language

    sentence is preserved in the translation. This property is called Accuracy or

    Fidelity [176]. Scoring for accuracy is normally done in combination with (but

    after) scoring for intelligibility. As with intelligibility, some sort of scoring scheme

    for accuracy must be devised. Whilst it might initially seem tempting to just have

    simple `Accurate' and `Inaccurate' labels, this could be somewhat unfair to an MT

    system which routinely produces translations which are only slightly deviant in

  • 238

    meaning. The evaluation procedure is fairly similar to the one used for the scoring

    of intelligibility. However the scorers obviously have to refer to the source

    language text (or a high quality translation of it in case they cannot speak the

    source language), so that they can compare the meaning of input and output

    sentences.

A four-point scale is used, in which the highest score is assigned to sentences that are completely faithful and the lowest score is assigned to sentences that are not understandable and unacceptable. The scale is given in Table 7.7.

Table 7.7 Score Sheet for Accuracy Test

Score    Significance
3        Completely faithful.
2        Fairly faithful: more than 50% of the original information passes into the translation.
1        Barely faithful: less than 50% of the original information passes into the translation.
0        Completely unfaithful. Does not make sense.


    7.5.4 Experiments

To evaluate the system, about 30 evaluators were chosen. They are well qualified, most of them in the teaching profession, with knowledge of both languages and of the rules for translating Punjabi sentences into English. Some of them are more familiar with English and have less knowledge of Punjabi but know Hindi; these evaluators were assigned the experiments related to the intelligibility test. The ratings for the individual translated sentences were then summed separately for intelligibility and accuracy to obtain average scores, and the percentages of accurate and intelligible sentences were calculated.

    7.5.4.1 Intelligibility Evaluation

The evaluators did not have any knowledge of the source language, i.e. Punjabi. They judged each sentence of the target language, i.e. English, which is the output of the translator, on the basis of its comprehensibility. The target user is assumed to be a layman who is interested only in the comprehensibility of translations. Intelligibility in this case is affected by grammatical errors, mistranslations, and untranslated words.

    7.5.4.1.1 Scoring

The scoring is based on the degree of intelligibility and comprehensibility. A four-point scale is used, in which the highest score is assigned to sentences that read like perfect target-language sentences and the lowest score is assigned to sentences that are not understandable. The details are as follows:

    Score 3: The sentence is perfectly clear and intelligible. It is grammatically correct

    and reads like ordinary text.

    Score 2: The sentence is generally clear and intelligible. Despite some

    inaccuracies, one can understand immediately what it means.

    Score 1: The general idea is intelligible only after considerable study. The

    sentence contains grammatical errors and/or poor word choice.

    Score 0: The sentence is unintelligible. Studying the meaning of the sentence is

    hopeless. Even allowing for context, one feels that guessing would be too

    unreliable.

    7.5.4.1.2 Results

According to the responses of the 30 respondents, who were asked to judge the translated sentences on the basis of the 4-point scale discussed above, the observations are as given in Table 7.8.


Table 7.8 Summary of Respondents' Perception of Translated Sentences for Intelligibility Rating
(number of sentences per score, out of 1000, for each respondent)

Respondent    Score 0    Score 1    Score 2    Score 3
1             187        110        342        361
2             215        215        198        372
3             118        171        302        409
4             154        129        206        511
5             186        186        225        403
6             146        169        310        375
7             195        195        311        299
8             164        110        301        425
9             138        151        226        485
10            125        164        392        319
11            168        169        263        400
12            237        105        281        377
13            165        146        298        391
14            115        95         346        444
15            210        116        272        402
16            138        148        275        439
17            106        118        265        511
18            188        188        274        350
19            165        147        370        318
20            154        118        307        421
21            147        163        375        315
22            118        167        314        401
23            105        200        318        377
24            118        105        408        369
25            125        94         336        445
26            201        156        252        391
27            148        154        230        468
28            180        117        242        461
29            198        168        218        416
30            141        121        321        417
Percentage    15.85      14.65      29.26      40.24


The responses of the evaluators were analyzed and the following results were observed:

40.24% of sentences got a score of 3, i.e. they were perfectly clear and intelligible.
29.26% of sentences got a score of 2, i.e. they were generally clear and intelligible.
14.65% of sentences got a score of 1, i.e. they were hard to understand.
15.85% of sentences got a score of 0, i.e. they were not understandable.

So we can say that about 69.50% of the sentences are intelligible, namely those with a score of 2 or above. A sketch of how these percentages are computed from Table 7.8 is given below.
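A minimal sketch of the aggregation is given below (each respondent rated the same 1000 sentences, so the column totals are divided by 30 × 1000).

```python
def score_distribution(rating_counts, sentences_per_respondent=1000):
    """Percentage of sentences per score (0-3), pooled over all respondents.

    rating_counts: list of rows, one per respondent, each row giving the
    number of sentences assigned score 0, 1, 2 and 3 (as in Table 7.8).
    """
    total_ratings = len(rating_counts) * sentences_per_respondent
    return [100.0 * sum(row[s] for row in rating_counts) / total_ratings
            for s in range(4)]

# With the 30 rows of Table 7.8 this yields approximately
# [15.85, 14.65, 29.26, 40.24]; the share of intelligible sentences
# (score 2 or above) is pct[2] + pct[3].
```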

Sample Translations with Intelligibility Scores

S No.    English Sentence                         Score
1        He told his name surinder singh          2
2        Bag of cloth was searched                3
3        I was present on moment                  1
4        I should be informed of case number      3
5        Mouth of bottle is sealed with lid       3
6        18 bottle whisky found                   2
7        I is three boys and a girl               0
8        Elder son of mohan is balwinder singh    2
9        Age of balwinder is 34 year              2
10       He works in grain market                 3

    7.5.4.2 Accuracy Evaluation / Fidelity Measure

The evaluators were provided with the source text along with the translated text. A highly intelligible output sentence need not be a correct translation of the source sentence, so it is important to check whether the meaning of the source-language sentence is preserved in the translation. This property is called accuracy.

7.5.4.2.1 Scoring

The scoring is based on the degree of fidelity to the source sentence. A four-point scale is used, in which the highest score is assigned to sentences that are completely faithful and the lowest score is assigned to sentences that are not understandable and unacceptable. The description of the scale is given below:

Weight     Description
Score 3    Completely faithful.
Score 2    Fairly faithful: more than 50% of the original information passes into the translation.
Score 1    Barely faithful: less than 50% of the original information passes into the translation.
Score 0    Completely unfaithful. It does not make any sense.

    7.5.4.2.2 Results

According to the responses of the 30 respondents, who were asked to judge the translated sentences on the basis of the 4-point scale discussed above, the observations are as given in Table 7.9.

Table 7.9 Summary of Respondents' Perception of Translated Sentences for Accuracy Rating
(number of sentences per score, out of 1000, for each respondent)

Respondent    Score 3    Score 2    Score 1    Score 0
1             412        262        211        115
2             422        276        200        102
3             471        230        210        89
4             472        202        220        106
5             434        280        175        111
6             507        202        170        121
7             432        307        139        122
8             525        228        162        85
9             452        208        175        165
10            365        316        205        114
11            436        280        165        119
12            431        303        145        121
13            365        312        218        105
14            411        260        232        97
15            459        230        218        93
16            359        290        253        98
17            552        230        135        83
18            332        403        167        98
19            459        260        151        130
20            440        235        212        113
21            422        270        210        98
22            444        236        217        103
23            325        349        217        109
24            346        322        182        150
25            332        367        180        121
26            441        270        192        97
27            547        208        153        92
28            539        221        150        90
29            552        220        142        86
30            411        290        165        134
Percentage    43.65      26.89      18.57      10.89

Table 7.9 shows the total number of sentences rated by each respondent in each score category. It shows that 43.65% of sentences got a score of 3, i.e. they are completely faithful; 26.89% of sentences got a score of 2, i.e. they are fairly faithful; 18.57% got a score of 1, i.e. barely faithful; and 10.89% got a score of 0, i.e. completely unfaithful.

So we can say that about 70.54% of the sentences are faithful, i.e. they are completely correct translations or more than 50% of the information is conveyed in the translation; these sentences belong to the categories with a score of 2 or above. The results also show that the percentage of sentences conveying no meaning at all is the lowest, whereas completely meaningful or correct sentences form the largest share. This supports the acceptability of the system.


Some Sample Translations with Accuracy Scores

1. Punjabi: ਉਸ ਨੰੂ ਗੁਰੂ ਨਾਨਕ ਦੇਵ ਹਸਪਤਾਲ ਿਵਚ ਦਾਖਲ ਕਰਵਾਇਆ ਿਗਆ
   Transliteration: us nūṃ gurū nānak dēv haspatāl vic dākhal karvāiā giā
   English: He was admitted in guru nanak dev hospital
   Score: 3

2. Punjabi: ਉਹ ਆਪਣੇ ਕੁਝ ਸਾਥੀਆਂ ਨਾਲ ਆਏ ਸਨ
   Transliteration: uh āpaṇē kujh sāthīāṃ nāl āē san
   English: They came with their some friends
   Score: 2

3. Punjabi: ਉਸ ਨੇ ਹਮਲਾ ਕੀਤਾ
   Transliteration: us nē hamlā kītā
   English: He attacked
   Score: 3

4. Punjabi: ਮੋਹਨ ਉਹਨਾ ਨੰੂ ਜਖਮੀ ਕਰਕੇ ਫਰਾਰ ਹੋ ਿਗਆ
   Transliteration: mōhan uhnā nūṃ jakhmī karkē pharār hō giā
   English: Mohan ran away by injuring them
   Score: 1

5. Punjabi: ਉਹਨੇ ਮੋਟਰ ਸਾਇਕਲ ਿਵਚ ਗੱਡੀ ਮਾਰ ਿਦਤੀ
   Transliteration: uhnē mōṭar sāikal vic gaḍḍī mār ditī
   English: He stuck car in motorcycle
   Score: 1

6. Punjabi: ਭਗਵਾਨ ਿਸੰਘ ਦੀ ਮੌਤ ਹੋ ਗਈ।
   Transliteration: bhagvān siṅgh dī maut hō gaī.
   English: Bhagwan singh died
   Score: 3

7. Punjabi: ਉਹ ਆਪਣੇ ਿਰਸ਼ਤਦਾਰਾਂ ਕੋਲ ਜਾ ਿਰਹਾ ਸੀ।
   Transliteration: uh āpaṇē rishtadārāṃ kōl jā rihā sī.
   English: He was going to his relatives
   Score: 2

8. Punjabi: ਉਹ ਿਵਆਹ ਦੇਖਣ ਜਾ ਿਰਹਾ ਸੀ।
   Transliteration: uh viāh dēkhaṇ jā rihā sī.
   English: He was going to see marriage
   Score: 3

9. Punjabi: ਮੈ ਿਬਸ਼ਨ ਿਸੰਘ ਪੁੱਤਰ ਹਮੀਰ ਿਸੰਘ ਖੰਨੇ ਦਾ ਰਿਹਣਵਾਲਾ ਹਾ ਂ
   Transliteration: mai bishan siṅgh puttar hamīr siṅgh khannē dā rahiṇvālā hāṃ
   English: I am resident of khanna bishan singh son hameer singh
   Score: 0

10. Punjabi: ਚੋਰ ਹਾਰਡਵੇਅਰ ਦਾ ਸਾਮਾਨ ਚੋਰੀ ਕਰਕੇ ਲੈ ਗਏ
    Transliteration: cōr hārḍavēar dā sāmān cōrī karkē lai gaē
    English: Thieves took luggage of hardware
    Score: 1

    7.5.4.3 Word Error Analysis

Error analysis is done against a pre-classified error list. All the errors in the translated text were identified and their frequencies were noted. Errors were counted but not weighted. After analyzing the test sentences, 1129 words out of 8665 were found to be incorrect, i.e. the word error rate is 13.02%.

Table 7.10 Percentage of Each Type of Error Out of the Total Errors Found

Type of Word Error               Number of Words    Percentage of Errors
Wrongly Translated Words         113                10.01%
Untranslated Words               338                29.93%
Wrong Choice of Words            375                33.21%
Addition and Removal of Words    303                26.83%
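The percentages in Table 7.10 follow directly from the error counts, as the short sketch below shows (the values agree with the table up to rounding).

```python
# Error counts from Table 7.10 (1129 erroneous words in total)
error_counts = {
    "Wrongly Translated Words":      113,
    "Untranslated Words":            338,
    "Wrong Choice of Words":         375,
    "Addition and Removal of Words": 303,
}
total_errors = sum(error_counts.values())   # 1129
for error_type, count in error_counts.items():
    print(f"{error_type}: {100 * count / total_errors:.2f}%")
```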


From the above table it is concluded that most of the errors are due to wrong choice of words; the main reason is that word-sense ambiguity has not been resolved, even though the system has been limited to the legal domain, and even within that domain certain words remain ambiguous. Since the types of sentences and phrases are limited to a particular order, some words that do not fit the phrase rules are left untranslated. Because of structural differences between the languages, some words such as 'to', 'has', 'have' and 'had' need to be inserted, and some words also need to be deleted, which accounts for the addition and removal errors. As the tagging has high accuracy, the proportion of wrongly translated words is low.

7.6 Comparison with Other Existing Systems

The accuracy levels of other existing systems are compared with the present system in Table 7.11.

Table 7.11 Comparison of the Present System with Other Existing Systems

MT System                           Accuracy                                           Test Used
Hinglish                            Satisfactory results in more than 90% of cases    Accuracy Test
Mantra (English-Hindi)              93%                                                Accuracy Test
English-Arabic                      85%                                                Accuracy Test
Hindi-to-Punjabi                    94%                                                Intelligibility Test
                                    90.84%                                             Accuracy Test
Punjabi-English (present system)    69.50%                                             Intelligibility Test
                                    70.54%                                             Accuracy Test


The comparison shows that the present system has lower accuracy than the other systems listed. This is because word-level ambiguity is not resolved here. After adding a word sense disambiguation (WSD) module, the accuracy of the system can be improved considerably.

    7.7 Conclusion

By applying subjective tests and quantitative metrics for evaluation, the Machine Translation System for translating legal documents from Punjabi to English is found to score 69.50% on the intelligibility test and 70.54% on the accuracy test. The accuracy can be improved by training the system with a larger corpus and by adding a word sense disambiguation module. Improving the post-processing module can further raise the accuracy and intelligibility levels.