
    Chapter 7 Evaluation and Results

    7.1 Introduction

The evaluation of a Machine Translation system and the measurement of translation performance is a difficult and complex task. Many factors are involved, the most important being that natural language is not exact in the way that mathematical models and theories in science are. Consequently, even commercial MT systems cannot translate all texts reliably. [158]

    7.2 Types of Evaluation [95]

Three broad classes of MT evaluation strategy are described below:

    Typological Evaluation seeks to specify which particular linguistic constructions

    the system handles satisfactorily and which it does not. The principal tool for such

    an investigation is a test suite – a set of sentences which individually represent

    specified constructions and hence constitute performance probes.

Declarative Evaluation seeks to specify how an MT system performs relative to various dimensions of translation quality.

Operational Evaluation seeks to establish how effective an MT system is likely to be (e.g. in terms of cost-effectiveness) as part of a given translation process.


    7.2.1 Typological Evaluation

Typological evaluation is primarily of interest to system developers. Potential users may not be familiar with the linguistic descriptions used, nor is it likely to be apparent how frequently some missing or badly handled construction might occur in their particular text type. The system is tested with a suite of sentences that illustrate particular types of linguistic constructions the system is likely to encounter in its lifetime. If the system is intended to operate within a particular subject field, then its design will naturally reflect this sublanguage. The test-suite approach has an advantage over a corpus-based approach: a corpus contains a large amount of redundancy, i.e. most constructions will be encountered more than once, whereas in a test suite each construction appears only once.

Once a corpus has been established for the task in hand, statistical information regarding the type and frequency of the lexical and grammatical phenomena it contains should be obtained, in order to evaluate the system's ability to translate the sentences in the corpus. If good observed frequency data are not available, the system's potential may be over- or underestimated. At present, however, this statistical information must almost certainly be gathered by hand (a laborious process), as a tool capable of parsing texts in this way is not available. Assuming that the relative frequency of the phenomena contained in the corpus has been established, the test suite can then be constructed.


    7.3 Metrics used for Automatic Evaluation

Human evaluation of an MT system is time-consuming and laborious, which makes it impractical for developers; moreover, the human effort spent cannot be reused. Human evaluations of Machine Translation (MT) consider many aspects of translation, including adequacy, fidelity and fluency [159]. A metric that evaluates Machine Translation output represents the quality of that output. The quality of a translation is inherently subjective and there is no objective or quantifiable "good". Therefore, any metric must assign quality scores that correlate with human judgments of quality. That is, a metric should give high scores to translations that humans score highly, and low scores to those that humans score poorly. Human judgment is the benchmark for assessing automatic metrics, as humans are the end-users of any translation output. [161]

Many automated measures have been proposed to facilitate fast and cheap evaluation of MT systems. Most efforts focus on devising metrics that measure the closeness of the MT system's output to one or more human translations; the closer the output, the better it is considered to be.

Some methods for automatic evaluation of MT are discussed below:

BLEU (BiLingual Evaluation Understudy): The rationale behind the development of BLEU is that human evaluation of Machine Translation can be time-consuming and expensive. An automatic evaluation metric, on the other hand, can be used for frequent tasks such as monitoring incremental system changes during development, which would be infeasible with manual evaluation. The quality of translation is indicated as a number between 0 and 1 and is measured as statistical closeness to a given set of good-quality human reference translations. It does not directly take into account translation intelligibility or grammatical correctness. The primary computation in BLEU is to compare the n-grams of the candidate with the n-grams of the reference translations and count the number of matches. These matches are position independent: the more matches, the better the candidate translation. To compute the modified n-gram precision for any n, each candidate n-gram count is clipped by the maximum number of times that n-gram occurs in any reference translation; the clipped counts are summed and divided by the total number of candidate n-grams. The modified n-gram precision on a multi-sentence test set is computed by the formula:

p_n = [ Σ_{C ∈ {Candidates}} Σ_{n-gram ∈ C} Count_clip(n-gram) ] / [ Σ_{C′ ∈ {Candidates}} Σ_{n-gram′ ∈ C′} Count(n-gram′) ]


    This means that a word-weighted average of the sentence-level modified

    precision is used rather than a sentence-weighted average.
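To make the clipped-count computation concrete, the following Python sketch (illustrative only, using hypothetical toy data; it is not the implementation used in this work) computes corpus-level modified n-gram precision as defined above.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of n-grams occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidates, reference_sets, n):
    """Corpus-level modified n-gram precision (clipped counts / total counts)."""
    clipped, total = 0, 0
    for cand, refs in zip(candidates, reference_sets):
        cand_counts = ngram_counts(cand, n)
        # Maximum count of each n-gram over all reference translations
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        # Clip each candidate n-gram count by its reference maximum
        clipped += sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total += sum(cand_counts.values())
    return clipped / total if total else 0.0

# Toy example: "the" and "cat" are each clipped to 1, so p1 = 2/4 = 0.5
candidate = ["the", "cat", "the", "cat"]
references = [["the", "cat", "sat", "there"]]
print(modified_precision([candidate], [references], 1))
```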

NIST: NIST is another method for evaluating the quality of text translated using Machine Translation. It is based on the BLEU metric with some alterations. It calculates how informative a particular n-gram is, so that rarer (more informative) n-grams receive greater weight. Its brevity penalty is also modified so that small variations in translation length do not greatly affect the overall score.

METEOR: The current version of the METEOR automatic evaluation metric scores Machine Translation hypotheses by aligning them to one or more reference translations. Alignments are based on exact, stem, synonym and paraphrase matches between words and phrases. A lexical similarity score is then calculated from the alignment for each hypothesis-reference pair. The metric includes several free parameters that are tuned to emulate various human judgment tasks, including adequacy, ranking, and HTER.

    Word Error Rate: WER works at the word level. It was originally used for

    measuring the performance of speech recognition systems, but is also used in

    the evaluation of Machine Translation. The metric is based on the calculation of

    the number of words that differ between a piece of machine translated text and a

    reference translation. This measure is based on the Levenshtein distance — the


    minimum number of substitutions, deletions and insertions that have to be

    performed to convert the automatic translation into a valid translation.
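A minimal sketch of this computation in Python is given below; it computes the word-level Levenshtein distance by dynamic programming and normalises it by the reference length. The example sentences are hypothetical.

```python
def word_error_rate(hypothesis, reference):
    """WER = word-level Levenshtein distance / number of reference words."""
    hyp, ref = hypothesis.split(), reference.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + sub)    # substitution
    return dist[len(ref)][len(hyp)] / len(ref)

# Two words ("me", "was") must be inserted, so WER = 2/7
print(word_error_rate("he told his name surinder",
                      "he told me his name was surinder"))
```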

PER (Position-Independent Word Error Rate): A shortcoming of the WER measure is that it does not allow reordering of words. To overcome this problem, the position-independent word error rate (PER) compares the words of the two sentences without taking word order into account. [162]
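For comparison, the sketch below shows one common bag-of-words formulation of PER (definitions in the literature vary slightly in how surplus hypothesis words are penalised); the example data are hypothetical.

```python
from collections import Counter

def position_independent_error_rate(hypothesis, reference):
    """Unmatched words (ignoring order) divided by the reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    matches = sum((Counter(hyp) & Counter(ref)).values())  # bag-of-words overlap
    errors = max(len(hyp), len(ref)) - matches
    return errors / len(ref)

# Word order is ignored, so only the missing word "the" counts: PER = 1/6
print(position_independent_error_rate("bag of cloth was searched",
                                       "the bag of cloth was searched"))
```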

TER (Translation Edit Rate) [163]: TER measures the amount of post-editing that a

    human would have to perform to change a system output so it exactly matches a

    reference translation. Possible edits include insertions, deletions, and

    substitutions of single words as well as shifts of word sequences. All edits have

    equal cost.

7.4 Related Work

The ALPAC report in 1966 was a study comparing different levels of human translation with Machine Translation output, using human subjects as judges. It considered two variables, Fidelity and Intelligibility. Fidelity (or Accuracy) is a measure of how much information the translated sentence retains compared with the original. Intelligibility is a measure of the understandability of the results of automatic translation. [164] The Advanced Research Projects Agency (ARPA) created a methodology to evaluate Machine Translation systems and continues to perform evaluations based on this methodology. The evaluation programme was initiated in 1991 and continues to this day. It involved testing several systems based on different theoretical approaches: statistical, rule-based and human-assisted. A number of methods for evaluating the output of these systems were tested in 1992, including comprehension evaluation, quality panel evaluation, and evaluation based on adequacy and fluency. [161] The first approach to metric combination based on human likeness was given by Corston-Oliver, who used decision trees to distinguish between human-generated ('good') and machine-generated ('bad') translations. [165] They suggested using classifier confidence scores directly as a quality indicator. High levels of classification accuracy were obtained; however, they focused on evaluating only the well-formedness of automatic translations (i.e., subaspects of fluency). Preliminary results using Support Vector Machines were also discussed. Kulesza and Shieber extended the approach of Corston-Oliver to take into account aspects of quality beyond fluency alone. Instead of decision trees, they trained Support Vector Machines (SVMs), using features inspired by well-known metrics such as BLEU, NIST, WER, and PER. Metric quality was evaluated both in terms of classification accuracy and in terms of correlation with human assessment at the sentence level. A significant improvement with respect to standard individual metrics was reported. [167]

In a different research line, Akiba et al. suggested directly predicting human scores of acceptability, approached as a multiclass classification task. They used decision-tree classifiers trained on multiple edit-distance features based on combinations of lexical, morphosyntactic and lexical-semantic information (e.g., word, stem, part-of-speech, and semantic classes from a thesaurus). Promising results were obtained in terms of local accuracy over an internal predefined set of overall quality assessment categories. Quirk presented a similar approach, also aiming to approximate human quality judgements, with the particular feature that human references were not required. More recently, Paul extended these works to account for separate aspects of quality: adequacy, fluency and acceptability. They used SVM classifiers to combine the outcomes of different automatic metrics at the lexical level (BLEU, NIST, METEOR, GTM, WER, PER and TER). Also very recently, Albrecht and Hwa re-examined the SVM-classification approach of Kulesza and Shieber and Corston-Oliver and, inspired by the work of Quirk, suggested a regression-based learning approach to metric combination, with and without human references. [170-171] Their results outperformed those of Kulesza and Shieber in terms of correlation with human assessments. In a different approach, Ye suggested treating sentence-level MT evaluation as a ranking problem. They used the Ranking SVM algorithm to sort candidate translations. Assessments were based on a 1-4 scale similar to the overall quality categories used by Akiba. [161-165]

    7.5 Approach followed for Evaluation of Machine Translation System

Different metrics are used to evaluate the different stages of the Machine Translation System. Stage 1 evaluates the tagging of words with their parts of speech, Stage 2 corresponds to the evaluation of the phrase chunker, and Stage 3 is the evaluation of the final translation. These stages are discussed in the sub-sections below.

    7.5.1 Evaluation of Part-of-Speech Tagging

The most commonly used evaluation measure for part-of-speech tagging is accuracy, expressed either as a percentage or as a value between 0 and 1. This accuracy measure is defined below. For evaluation, the part-of-speech tagger was applied to a set of 1000 sentences collected from crime news in various newspapers and from legal documents. These sentences contain about 8665 words. The output was manually evaluated to mark the correct and incorrect tag assignments. The tagset contains a total of 527 tags, and a separate set of 1000 sentences with 8922 words was used as the tagged corpus for training the part-of-speech tagger. Table 7.1 provides the tagging results.

Table 7.1 Part-of-Speech Tagging Results

Total Words    Correct Tags    Incorrect Tags    Unknown Words
8665           7757            407               501

Accuracy = (Total Number of Words having Correct Tags) / (Total Number of Words Tagged)


Based on the data given in the above table and the accuracy definition, the following accuracy measures (in percentage) were calculated. Accuracy 1 represents the typical accuracy of our tagger, i.e. the total number of words having a unique correct tag divided by the total number of words tagged. In the sentences chosen for testing the system, some words are not recognized because they are not present in the morph database. These include some proper nouns that are not recognized by the proper-noun gazetteer because they are not followed or preceded by any of the special words stored in our database for recognizing proper nouns. Accuracy 2 is therefore calculated by dividing the number of correct tags by the difference between the total number of words and the number of unknown words. The accuracy results are shown below.

Table 7.2 Part-of-Speech Tagging Accuracy

Accuracy 1    Accuracy 2
89.52         95.01

Accuracy 1 = (No. of Words having Correct Tags) / (Total No. of Words)
Accuracy 2 = (No. of Words having Correct Tags) / (Total No. of Words − Unknown Words)
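As a quick check, the two accuracies can be recomputed directly from the counts in Table 7.1, as in the short Python sketch below.

```python
# Counts from Table 7.1
total_words   = 8665
correct_tags  = 7757
unknown_words = 501

accuracy_1 = correct_tags / total_words                    # including unknown words
accuracy_2 = correct_tags / (total_words - unknown_words)  # excluding unknown words

print(f"Accuracy 1 = {accuracy_1:.2%}")   # 89.52%
print(f"Accuracy 2 = {accuracy_2:.2%}")   # 95.01%
```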


Comparison with Existing Systems

The accuracy measure was compared with the accuracy of the rule-based tagger developed at Punjabi University, Patiala. The rule-based tagger has an accuracy of 80.29% including unknown words and 88.86% excluding unknown words. By using the statistical approach, the tagging accuracy including unknown words is increased to 89.52%, and excluding unknown words it is raised to 95.01%. One of the popular POS taggers is the TnT tagger, which has been shown to have high accuracy for English and some other languages; it provides an overall tagging accuracy of 96.64% and, specifically, 97.01% on known words. Dandapat et al. reported 95% accuracy for Hindi using HMM. [136] Shacham reported 87.27% accuracy for Hebrew. [174] Bharati and Mannem reported 67-77% accuracy for Hindi, Telugu, and Bengali using 24 tags and applying various statistical techniques. [142] A hybrid POS tagger for Tamil, using an HMM technique and a rule-based system, had a precision of 97.2%. The accuracy of the tagger in the present Machine Translation System is comparable to these highly accurate taggers.

    7.5.2 Evaluation of Phrase Chunking

Phrase chunking covers noun phrases, verb phrases, adjective phrases and postpositional phrases. The accuracy of the phrase chunker can be calculated using the formulas defined below:

Precision = (Number of Correct Proposed Chunks) / (Number of Proposed Chunks)

Recall = (Number of Correct Proposed Chunks) / (Number of Correct Chunks)

Precision can be seen as a measure of exactness or conformity, and recall is a measure of completeness. In other words, precision tells how accurate the system is and recall specifies how complete the system is. On a measurement scale of 0 to 1, a value close to 1 is desirable for both of these measures. The Fβ measure, or just F-measure (for β = 1), is the weighted harmonic mean of precision and recall. The value of β allows precision and recall to be weighted differently. In all the experiments conducted in this and the next section, β is set to 1, thus giving equal weight to precision and recall:

F_β = ((β² + 1) × Precision × Recall) / (β² × Precision + Recall)

For β = 1 this reduces to

F = (2 × Precision × Recall) / (Precision + Recall)

The phrase chunker for the present study was manually evaluated on 1000 sentences whose structure falls within our system's input scope. The results for three phrase chunk types are provided in Table 7.3 below. According to the input scope of the sentences, sentences containing only an adjective are almost negligible, so adjective phrases are not taken into account while evaluating the chunker.


Table 7.3 Phrase Chunking Counts

Phrase Type              Proposed Chunks    Correct Proposed Chunks    Correct Chunks
Noun Phrase              1534               1426                       1512
Postpositional Phrase    244                200                        278
Verb Phrase              1022               910                        964
Grand Total              2800               2536                       2754

Based on the values in Table 7.3, the precision, recall, and F-measure values for the different chunk types are shown in Table 7.4 below. These values are expressed as percentages.

Table 7.4 Phrase Chunking Results

Phrase Group Type        Precision    Recall    F-measure
Noun Phrase              92.95        94.31     93.62
Postpositional Phrase    81.96        71.94     76.62
Verb Phrase              89.04        94.39     91.63
Average                  87.98        86.88     87.29

As per the above table, the average precision comes out to be 87.98%, the average recall 86.88%, and the average F-measure 87.29%, which shows that if the words are tagged accurately and the structure of the sentences follows the stated assumptions, the precision and recall of the phrase chunker can reach high levels. Precision and recall for noun phrases and verb phrases are higher than for postpositional phrases, as the structure of postpositional phrases is more complex whereas noun phrases and verb phrases are relatively simple. Performance can be further increased by training the system with more phrases. A small worked example that recomputes these figures from the counts in Table 7.3 is given below.
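The Python sketch below applies the precision, recall and F-measure formulas defined above to the chunk counts of Table 7.3; it reproduces the values of Table 7.4 up to rounding.

```python
def precision_recall_f(proposed, correct_proposed, correct_reference, beta=1.0):
    """Precision, recall and F-beta from chunk counts."""
    precision = correct_proposed / proposed
    recall = correct_proposed / correct_reference
    f_beta = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f_beta

# Counts from Table 7.3: (proposed, correct proposed, correct in reference)
chunk_counts = {
    "Noun Phrase":           (1534, 1426, 1512),
    "Postpositional Phrase": ( 244,  200,  278),
    "Verb Phrase":           (1022,  910,  964),
}
for phrase_type, counts in chunk_counts.items():
    p, r, f = precision_recall_f(*counts)
    print(f"{phrase_type}: P = {p:.2%}, R = {r:.2%}, F = {f:.2%}")
```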

Comparison with Existing Systems

Singh et al. followed a rule-based approach for Hindi and reported 91% precision and 100% recall using 5 phrase tags. [153] Tamil text chunking had a precision of 97.4%. [175] The Hindi phrase chunker developed at IIT Kharagpur has a precision of 87.22% and a recall of 94.62%. The recall of the chunker can be improved by the addition of more rules.

    7.5.3 Evaluation of Final Translation

The collection of sentences used as input for evaluating a Machine Translation System varies across research efforts according to their specific research considerations. The common types of test material are described below.

    7.5.3.1 Selection of a Set

It is a very important aspect of MT evaluation to make an appropriate selection of sentences for evaluating the Machine Translation System. The sentences may be drawn from the following:


Test Corpora: A test corpus is a collection of naturally occurring text in electronic form.

Test Suites: A test suite is a collection of artificially constructed inputs, where each input is designed to probe a system's treatment of a specific phenomenon or set of phenomena. Inputs may be in the form of sentences, sentence fragments, or even sequences of sentences.

Test Collections: A test collection is a set of inputs associated with a corresponding set of expected outputs.

For the present system, a random selection of sentences was made to be used as input in the evaluation process. These sentences correspond to crime-related news taken either from various Punjabi newspapers or from First Information Reports (FIRs) gathered from local police stations and lawyers. Many of these sentences were complex, so they were first divided into simple sentences following the assumptions adopted for the system. Sentence length has also been restricted to a maximum of 12 words, and phrase length is restricted to a maximum of 6 words. The test data set is shown in Table 7.5.

Table 7.5 Test Data Set for the Evaluation of the Punjabi to English Machine Translation System

Total Sentences    1000
Total Words        8665


    7.5.3.2 Selection of Tests for Evaluation

There are a number of tests available for evaluating Machine Translation systems. In the evaluation procedure for the present Machine Translation System, both qualitative (subjective) and quantitative tests have been applied. The subjective evaluation includes two tests, an Intelligibility Test and an Accuracy Test, while the quantitative evaluation includes one test, the Word Error Rate (WER) Test. These tests are explained below.

    7.5.3.2.1 Intelligibility Tests:

A traditional way of assessing the quality of translation is to assign scores to output sentences. This test is used to check the intelligibility of the MT system. The intelligibility of a translated sentence is affected by grammatical errors, mistranslations and untranslated words. A four-point scale is adequate here: it measures intelligibility only, has low scatter and is sufficiently discriminatory, since the evaluation covers several hundred sentences and the average, expressed as a percentage, is sufficiently precise. The scoring scale gives top marks to sentences that read like perfect target-language sentences and bottom marks to those that are so badly degraded as to prevent the average translator/evaluator from guessing what a reasonable sentence might be in the context. In between these two extremes, output sentences are assigned higher or lower scores depending on their degree of degradation. [176] The scale is given in Table 7.6.

Table 7.6 Score Sheet for Intelligibility Test

Score    Significance
3        The sentence is perfectly clear and intelligible. It is grammatically correct.
2        The sentence is generally clear and intelligible. Despite some inaccuracies, one can understand the information to be conveyed.
1        The general idea is intelligible only after considerable study. The sentence contains grammatical errors and/or poor word choice.
0        The sentence is unintelligible. The meaning of the sentence is not understandable.

    7.5.3.2.2 Accuracy Test / Fidelity Measure

    By measuring intelligibility we get only a partial view of translation quality. A

    highly intelligible output sentence need not be a correct translation of the source

    sentence. It is important to check whether the meaning of the source language

    sentence is preserved in the translation. This property is called Accuracy or

    Fidelity [176]. Scoring for accuracy is normally done in combination with (but

    after) scoring for intelligibility. As with intelligibility, some sort of scoring scheme

    for accuracy must be devised. Whilst it might initially seem tempting to just have

    simple `Accurate' and `Inaccurate' labels, this could be somewhat unfair to an MT

    system which routinely produces translations which are only slightly deviant in

  • 238

    meaning. The evaluation procedure is fairly similar to the one used for the scoring

    of intelligibility. However the scorers obviously have to refer to the source

    language text (or a high quality translation of it in case they cannot speak the

    source language), so that they can compare the meaning of input and output

    sentences.

A four-point scale is used, in which the highest score is assigned to sentences that are completely faithful and the lowest score is assigned to sentences that are not understandable and unacceptable. The scale is given in Table 7.7.

Table 7.7 Score Sheet for Accuracy Test

Score    Significance
3        Completely faithful.
2        Fairly faithful: more than 50% of the original information passes into the translation.
1        Barely faithful: less than 50% of the original information passes into the translation.
0        Completely unfaithful. Does not make sense.


    7.5.4 Experiments

To evaluate the system, about 30 evaluators were chosen. They are well qualified, most of them in the teaching profession, with knowledge of both languages and of the rules for translating Punjabi sentences into English. Some of them are more familiar with English and have less knowledge of Punjabi but know Hindi; these evaluators were assigned the experiments related to the intelligibility test. The ratings for the individual translated sentences were then summed separately for intelligibility and accuracy to obtain average scores, and the percentages of accurate and intelligible sentences were calculated.

    7.5.4.1 Intelligibility Evaluation

The evaluators did not have any knowledge of the source language, i.e. Punjabi. They judged each sentence of the target language, i.e. English, which is the output of the translator, on the basis of its comprehensibility. The target user is assumed to be a layman who is interested only in the comprehensibility of translations. Intelligibility in this case is affected by grammatical errors, mistranslations, and untranslated words.

    7.5.4.1.1 Scoring

The scoring is based on the degree of intelligibility and comprehensibility. A four-point scale is used, in which the highest score is assigned to sentences that read like perfect target-language sentences and the lowest score is assigned to sentences that are not understandable. The details are as follows:

    Score 3: The sentence is perfectly clear and intelligible. It is grammatically correct

    and reads like ordinary text.

    Score 2: The sentence is generally clear and intelligible. Despite some

    inaccuracies, one can understand immediately what it means.

    Score 1: The general idea is intelligible only after considerable study. The

    sentence contains grammatical errors and/or poor word choice.

    Score 0: The sentence is unintelligible. Studying the meaning of the sentence is

    hopeless. Even allowing for context, one feels that guessing would be too

    unreliable.

    7.5.4.1.2 Results

According to the responses of the 30 respondents, who were asked to judge the translated sentences on the basis of the 4-point scale discussed above, the observations are as given in Table 7.8.


Table 7.8 Summary of Respondents' Perception of Translated Sentences for Intelligibility Rating
(number of sentences per score, out of 1000, for each respondent)

Respondent    Score 0    Score 1    Score 2    Score 3
1             187        110        342        361
2             215        215        198        372
3             118        171        302        409
4             154        129        206        511
5             186        186        225        403
6             146        169        310        375
7             195        195        311        299
8             164        110        301        425
9             138        151        226        485
10            125        164        392        319
11            168        169        263        400
12            237        105        281        377
13            165        146        298        391
14            115        95         346        444
15            210        116        272        402
16            138        148        275        439
17            106        118        265        511
18            188        188        274        350
19            165        147        370        318
20            154        118        307        421
21            147        163        375        315
22            118        167        314        401
23            105        200        318        377
24            118        105        408        369
25            125        94         336        445
26            201        156        252        391
27            148        154        230        468
28            180        117        242        461
29            198        168        218        416
30            141        121        321        417
Percentage    15.85      14.65      29.26      40.24


The responses of the evaluators were analyzed and the following results were observed:

40.24% of sentences got a score of 3, i.e. they were perfectly clear and intelligible.
29.26% of sentences got a score of 2, i.e. they were generally clear and intelligible.
14.65% of sentences got a score of 1, i.e. they were hard to understand.
15.85% of sentences got a score of 0, i.e. they were not understandable.

So we can say that about 69.50% of the sentences are intelligible, namely those with a score of 2 or above. A sketch of how these percentages are computed from Table 7.8 is given below.
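A minimal sketch of the aggregation is given below (each respondent rated the same 1000 sentences, so the column totals are divided by 30 × 1000).

```python
def score_distribution(rating_counts, sentences_per_respondent=1000):
    """Percentage of sentences per score (0-3), pooled over all respondents.

    rating_counts: list of rows, one per respondent, each row giving the
    number of sentences assigned score 0, 1, 2 and 3 (as in Table 7.8).
    """
    total_ratings = len(rating_counts) * sentences_per_respondent
    return [100.0 * sum(row[s] for row in rating_counts) / total_ratings
            for s in range(4)]

# With the 30 rows of Table 7.8 this yields approximately
# [15.85, 14.65, 29.26, 40.24]; the share of intelligible sentences
# (score 2 or above) is pct[2] + pct[3].
```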

Sample Translations with Intelligibility Scores

S No.    English Sentence                         Score
1        He told his name surinder singh          2
2        Bag of cloth was searched                3
3        I was present on moment                  1
4        I should be informed of case number      3
5        Mouth of bottle is sealed with lid       3
6        18 bottle whisky found                   2
7        I is three boys and a girl               0
8        Elder son of mohan is balwinder singh    2
9        Age of balwinder is 34 year              2
10       He works in grain market                 3

    7.5.4.2 Accuracy Evaluation / Fidelity Measure

The evaluators were provided with the source text along with the translated text. A highly intelligible output sentence need not be a correct translation of the source sentence, so it is important to check whether the meaning of the source-language sentence is preserved in the translation. This property is called accuracy.

7.5.4.2.1 Scoring

The scoring is based on the degree of fidelity to the source sentence. A four-point scale is used, in which the highest score is assigned to sentences that are completely faithful and the lowest score is assigned to sentences that are not understandable and unacceptable. The description of the scale is given below:

Weight     Description
Score 3    Completely faithful.
Score 2    Fairly faithful: more than 50% of the original information passes into the translation.
Score 1    Barely faithful: less than 50% of the original information passes into the translation.
Score 0    Completely unfaithful. It does not make any sense.

    7.5.4.2.2 Results

According to the responses of the 30 respondents, who were asked to judge the translated sentences on the basis of the 4-point scale discussed above, the observations are as given in Table 7.9.

Table 7.9 Summary of Respondents' Perception of Translated Sentences for Accuracy Rating
(number of sentences per score, out of 1000, for each respondent)

Respondent    Score 3    Score 2    Score 1    Score 0
1             412        262        211        115
2             422        276        200        102
3             471        230        210        89
4             472        202        220        106
5             434        280        175        111
6             507        202        170        121
7             432        307        139        122
8             525        228        162        85
9             452        208        175        165
10            365        316        205        114
11            436        280        165        119
12            431        303        145        121
13            365        312        218        105
14            411        260        232        97
15            459        230        218        93
16            359        290        253        98
17            552        230        135        83
18            332        403        167        98
19            459        260        151        130
20            440        235        212        113
21            422        270        210        98
22            444        236        217        103
23            325        349        217        109
24            346        322        182        150
25            332        367        180        121
26            441        270        192        97
27            547        208        153        92
28            539        221        150        90
29            552        220        142        86
30            411        290        165        134
Percentage    43.65      26.89      18.57      10.89

Table 7.9 shows the total number of sentences rated by each respondent in each score category. It shows that 43.65% of sentences got a score of 3, i.e. they are completely faithful; 26.89% of sentences got a score of 2, i.e. they are fairly faithful; 18.57% got a score of 1, i.e. barely faithful; and 10.89% got a score of 0, i.e. completely unfaithful.

So we can say that about 70.54% of the sentences are faithful, i.e. they are completely correct translations or more than 50% of the information is conveyed in the translation; these sentences belong to the categories with a score of 2 or above. The results also show that the percentage of sentences conveying no meaning at all is the lowest, whereas completely meaningful or correct sentences form the largest share. This supports the acceptability of the system.


Some Sample Translations with Accuracy Scores

1. Punjabi: ਉਸ ਨੰੂ ਗੁਰੂ ਨਾਨਕ ਦੇਵ ਹਸਪਤਾਲ ਿਵਚ ਦਾਖਲ ਕਰਵਾਇਆ ਿਗਆ
   Transliteration: us nūṃ gurū nānak dēv haspatāl vic dākhal karvāiā giā
   English: He was admitted in guru nanak dev hospital
   Score: 3

2. Punjabi: ਉਹ ਆਪਣੇ ਕੁਝ ਸਾਥੀਆਂ ਨਾਲ ਆਏ ਸਨ
   Transliteration: uh āpaṇē kujh sāthīāṃ nāl āē san
   English: They came with their some friends
   Score: 2

3. Punjabi: ਉਸ ਨੇ ਹਮਲਾ ਕੀਤਾ
   Transliteration: us nē hamlā kītā
   English: He attacked
   Score: 3

4. Punjabi: ਮੋਹਨ ਉਹਨਾ ਨੰੂ ਜਖਮੀ ਕਰਕੇ ਫਰਾਰ ਹੋ ਿਗਆ
   Transliteration: mōhan uhnā nūṃ jakhmī karkē pharār hō giā
   English: Mohan ran away by injuring them
   Score: 1

5. Punjabi: ਉਹਨੇ ਮੋਟਰ ਸਾਇਕਲ ਿਵਚ ਗੱਡੀ ਮਾਰ ਿਦਤੀ
   Transliteration: uhnē mōṭar sāikal vic gaḍḍī mār ditī
   English: He stuck car in motorcycle
   Score: 1

6. Punjabi: ਭਗਵਾਨ ਿਸੰਘ ਦੀ ਮੌਤ ਹੋ ਗਈ।
   Transliteration: bhagvān siṅgh dī maut hō gaī.
   English: Bhagwan singh died
   Score: 3

7. Punjabi: ਉਹ ਆਪਣੇ ਿਰਸ਼ਤਦਾਰਾਂ ਕੋਲ ਜਾ ਿਰਹਾ ਸੀ।
   Transliteration: uh āpaṇē rishtadārāṃ kōl jā rihā sī.
   English: He was going to his relatives
   Score: 2

8. Punjabi: ਉਹ ਿਵਆਹ ਦੇਖਣ ਜਾ ਿਰਹਾ ਸੀ।
   Transliteration: uh viāh dēkhaṇ jā rihā sī.
   English: He was going to see marriage
   Score: 3

9. Punjabi: ਮੈ ਿਬਸ਼ਨ ਿਸੰਘ ਪੁੱਤਰ ਹਮੀਰ ਿਸੰਘ ਖੰਨੇ ਦਾ ਰਿਹਣਵਾਲਾ ਹਾ ਂ
   Transliteration: mai bishan siṅgh puttar hamīr siṅgh khannē dā rahiṇvālā hāṃ
   English: I am resident of khanna bishan singh son hameer singh
   Score: 0

10. Punjabi: ਚੋਰ ਹਾਰਡਵੇਅਰ ਦਾ ਸਾਮਾਨ ਚੋਰੀ ਕਰਕੇ ਲੈ ਗਏ
    Transliteration: cōr hārḍavēar dā sāmān cōrī karkē lai gaē
    English: Thieves took luggage of hardware
    Score: 1

    7.5.4.3 Word Error Analysis

Error analysis is done against a pre-classified error list. All the errors in the translated text were identified and their frequencies were noted. Errors were counted but not weighted. After analyzing the test sentences, 1129 words out of 8665 were found to be incorrect, i.e. the word error rate is 13.02%.

Table 7.10 Percentage of Each Type of Error Out of the Total Errors Found

Type of Word Error               Number of Words    Percentage of Errors
Wrongly Translated Words         113                10.01%
Untranslated Words               338                29.93%
Wrong Choice of Words            375                33.21%
Addition and Removal of Words    303                26.83%
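The percentages in Table 7.10 follow directly from the error counts, as the short sketch below shows (the values agree with the table up to rounding).

```python
# Error counts from Table 7.10 (1129 erroneous words in total)
error_counts = {
    "Wrongly Translated Words":      113,
    "Untranslated Words":            338,
    "Wrong Choice of Words":         375,
    "Addition and Removal of Words": 303,
}
total_errors = sum(error_counts.values())   # 1129
for error_type, count in error_counts.items():
    print(f"{error_type}: {100 * count / total_errors:.2f}%")
```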


From the above table it is concluded that most of the errors are due to wrong choice of words; the main reason is that word-sense ambiguity has not been resolved, even though the system has been limited to the legal domain, and even within that domain certain words remain ambiguous. Since the types of sentences and phrases are limited to a particular order, some words that do not fit the phrase rules are left untranslated. Because of structural differences between the languages, some words such as 'to', 'has', 'have' and 'had' need to be inserted, and some words also need to be deleted, which accounts for the addition and removal errors. As the tagging has high accuracy, the proportion of wrongly translated words is low.

7.6 Comparison with Other Existing Systems

The accuracy levels of other existing systems are compared with the present system in Table 7.11.

Table 7.11 Comparison of the Present System with Other Existing Systems

MT System                           Accuracy                                           Test Used
Hinglish                            Satisfactory results in more than 90% of cases    Accuracy Test
Mantra (English-Hindi)              93%                                                Accuracy Test
English-Arabic                      85%                                                Accuracy Test
Hindi-to-Punjabi                    94%                                                Intelligibility Test
                                    90.84%                                             Accuracy Test
Punjabi-English (present system)    69.50%                                             Intelligibility Test
                                    70.54%                                             Accuracy Test


The comparison shows that the present system has lower accuracy than the other systems listed. This is because word-level ambiguity is not resolved here. After adding a word sense disambiguation (WSD) module, the accuracy of the system can be improved considerably.

    7.7 Conclusion

By applying subjective tests and quantitative metrics for evaluation, the Machine Translation System for translating legal documents from Punjabi to English is found to score 69.50% on the intelligibility test and 70.54% on the accuracy test. The accuracy can be improved by training the system with a larger corpus and by adding a word sense disambiguation module. Improving the post-processing module can further raise the accuracy and intelligibility levels.