2007. Muñoz & Conde. Effects of Serial Translation Evaluation (Preprint, Color)

Effects of Serial Translation Evaluation

Ricardo Muñoz Martín & José Tomás Conde Ruano¹

PETRA Research Group

University of Granada

Evaluating, in translation, is a prototypical concept with many extensions. Readers tend

to view it as a matter of acceptability, adequacy, or quality, whereas other stakeholders

conceive of it as an activity, or part of an activity, of proofreading, grading, correcting,

revising, editing, assessing, and so on. Means and goals also differ quite often.

These circumstances, together with enormously varied personal criteria and standards in

evaluators, support the generally accepted view that evaluation cannot be studied.

However, we are trying to find out whether studying the way subjects actually evaluate

might shed light on some regularities which could help us better understand what is at

stake. In other words, we are trying to find out whether there is some order in

subjectivity.

Evaluating several translations from the same original is rather unnatural
in the market. It arises almost exclusively in translator training, translator hiring,
and translation criticism, and the third case is rather different from the other two.

However, translator training and hiring are crucial activities for the industry. Can the

repeated activity teach us something about evaluating translations? Does the repetition

have an influence on the outcome of the evaluation? Those were the questions raised in

this piece of research, which is a part of a larger effort by Tomás Conde, under the

supervision of Ricardo Muñoz, within the activities of the Research Group Expertise

and Environment in Translation (PETRA).

The overarching purpose of this research project is to map intra- and intergroup

coincidences and differences at evaluating² translations. This preliminary field study

can be described as a piece of descriptive-relational research. It is descriptive, since it

seeks to depict what already exists in a group or population, and it is relational since it

investigates the connection between variables that are already present in the group or

population. At this stage, and after a pilot study, we have already processed data from
10 students. The small number of subjects limits the weight of the results, but
we think the findings are interesting enough to circulate. In the near future, the
research project will compare variables across larger numbers of subjects and also
across different population groups; apart from translation students, we will study
professional translators, translation teachers, and addressees.

1. Materials and methods

Thirty-five students in their fourth year of the translation degree at the University of
Granada were invited to “assess / correct / proofread / edit / revise” four sets of 12
translations each, corresponding to four originals, according to their beliefs and
intuition, and to the best of their knowledge. This report analyzes the data of ten of
these subjects, the first to be completed. As for the texts, two of the originals (A and C)
dealt with politics, and the other two (B and D), with technical procedures for painting
machinery. Translations had been carried out by students from an earlier course, and they
were chosen from among those that had not been assigned a very good grade by the teacher,
so that no single translation would serve as a model for the evaluators. The sets were
sequenced alternately (A, politics; B, technical painting; C, politics; and D, technical
painting) to prompt evaluators to think of them as separate tasks. Translations within each
set were randomly ordered and coded so that subjects evaluated them blind. Originals and
translations were provided as digital files, but printouts were provided upon request, and
subjects were also allowed to print them out.

¹ Corresponding author: Tomás Conde Ruano, Dpto. Traducción e Interpretación. Universidad de
Granada. Granada E-18071 Spain. [email protected]

² We will use evaluate to cover the set of activities carried out by all subjects,
independently of the way they envisioned their task.

The only constraints imposed on all subjects alike were the following three: they
had to (1) process the translations in the order they were given; (2) work on a whole set
of translations in a single session; and (3) classify each translation into one of four
categories: very good, good, bad, and very bad. To allow for computing
averages, quality judgments were assigned numerical values: very bad, 1; bad, 2; good,
3; very good, 4.
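
The numerical coding just described can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name and the ten example judgments are hypothetical, and only the four category values come from the text.

```python
from statistics import mean

# Numerical values for the four categories, as stated in the text.
QUALITY_SCALE = {"very bad": 1, "bad": 2, "good": 3, "very good": 4}


def average_judgment(labels):
    """Return the mean numerical value of a list of category labels."""
    return mean(QUALITY_SCALE[label] for label in labels)


# Hypothetical judgments by ten evaluators for a single translation.
example = ["bad", "bad", "very bad", "good", "bad",
           "bad", "good", "very bad", "bad", "bad"]
print(average_judgment(example))
```

Treating the four ordinal categories as equally spaced numbers is what makes averaging possible; the sketch simply makes that assumption explicit.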

We wanted to move away from popular approaches to evaluating translations which

focus on mistakes and assign arbitrary values to poorly defined categories. To do so,

some concepts had to be operationalized. Since evaluators do not only mark mistakes,
we defined phenomenon as ‘what motivates an evaluator to act on a particular text
segment’. Phenomena were classified into two groups: normalized phenomena and
not-normalized phenomena. Normalized phenomena include typos, punctuation and spelling
mistakes, formatting variations, concordance, syntax, weights and measurements, and the
like, where there is a proper option sanctioned by an authority (normally, the RAE).
Not-normalized phenomena include instances such as word order, paraphrase, register, and
different interpretations of the original. The classification is neither homogeneous nor
totally sharp, but it reflects the nature of the phenomena pretty well, does not
demand a strong heuristic effort, and brings about a considerable reduction of
undetermined phenomena (5.99%). Further, phenomena were taken to be
more or less salient according to the number of subjects who singled them out. Hence, a
text segment where seven evaluators perform an action is thought of as more salient
than a text segment where only three evaluators seem to notice a phenomenon. For reasons
of space, saliency is restricted here to phenomena where more than half of the
evaluators coincide.
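
As an illustration of the saliency criterion, the sketch below keeps only the segments on which more than half of the evaluators acted. The data structure (a mapping from a segment identifier to the set of evaluators who acted on it), the segment names, and the function name are assumptions made for the example; they are not taken from the study's database.

```python
def salient_segments(actions_by_segment, n_evaluators=10):
    """Return {segment: number of evaluators} for segments singled out
    by more than half of the evaluators."""
    threshold = n_evaluators / 2
    return {
        segment: len(evaluators)
        for segment, evaluators in actions_by_segment.items()
        if len(evaluators) > threshold
    }


# Hypothetical segments and the evaluators who acted on each of them.
observed = {
    "seg-1": {"I02", "I03", "I04", "I05", "I06", "I07", "I08"},  # 7 evaluators
    "seg-2": {"I03", "I05", "I09"},                              # 3 evaluators
    "seg-3": {"I02", "I03", "I04", "I05", "I06"},                # exactly half
    "seg-4": {"I02", "I03", "I04", "I05", "I06", "I11"},         # 6 evaluators
}
print(salient_segments(observed))
```

With ten evaluators, a segment marked by exactly five of them is not salient under this reading, which matches the ">5" label used later in the graphics.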

An (evaluative) action is ‘any mark introduced by the evaluator on the text’. In this

study, actions have been limited to those which remain once the evaluator is done and

turns in the translation. Hence, inconsistencies and online changes, which might be
very informative, have not been taken into account here, and will be the subject of future
research. Professionals –and some teachers– often quote the amount of work needed to

fix a translation. Since actions may entail varying quantities of work depending on their

nature, systematicity, and other factors which may be specific for each phenomenon,

they were operationalized as quantity of actions, and their types. The classification of

actions into types was made according to a behavioral criterion, as observed in the

evaluated translations. Actions observed so far can be divided into those made in the

body of the text and those made at the margin (which also includes before and after the

body of the text). In both cases, actions could also be classified into additions,

suppressions, changes, marks, annotations, and comments. Also, some evaluators chose

to code their marks so as to classify phenomena in some way.

Evaluators do not seem to have clear or conscious criteria to evaluate translations,

and many of those who state a set of parameters turn out to apply them rather unevenly

in actual practice. Hence, to try to account for their standards we defined demand as ‘the

sum of conscious and unconscious expectations an evaluator seems to think that a

translation should meet’. Demand was operationalized from two perspectives: 1) level,

that is, whether evaluators seem to expect more or less from a translation as reflected in

their quality judgments; 2) evenness, or the uniformity or lack of variation in the level

of demand. The second perspective may indicate the existence of clear and/or stable

criteria for evaluating, or else an attempt to pursue some even-handedness.

Order effects were defined as ‘any consistent tendency across evaluators which

cannot be explained as a feature of the translations when considered separately’. We

searched for three types of order effects: (1) within the whole task; (2) within each set;

and (3), within texts. Since translations were evaluated in the same order, task effects

were analyzed simply by observing changes in the progression of the task. To search for

set effects, translations were grouped into three subsets, in such a way that subset I

includes translations 1-4 from each set, subset II includes translations 5-8, and subset III

includes translations 9-12. For example, subset I includes the actions carried out in

translations A01-04, B01-04, C01-04, and D01-04. For effects within translations,

originals were divided into three sections (initial, middle, and final) of roughly identical

length, and translations were divided accordingly. Data were entered in an Access

database and later analyzed with SPSS 12.0.
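
The grouping of translations into subsets I to III can be expressed compactly. The following is only a sketch of the rule stated above (positions 1-4, 5-8, and 9-12 within each set); the function name and the assumption that codes look like "A01" or "D12" are ours.

```python
def subset_of(code):
    """Map a translation code such as 'B07' to its subset: 'I', 'II' or 'III'."""
    position = int(code[1:])          # position of the translation within its set
    if not 1 <= position <= 12:
        raise ValueError(f"unexpected translation code: {code!r}")
    return ("I", "II", "III")[(position - 1) // 4]


for code in ("A01", "B04", "C05", "D08", "A09", "D12"):
    print(code, subset_of(code))
```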

2. General results and discussion

For the purpose of framing our findings on effects at serial evaluation, we will first need

to characterize subjects and their behaviors. Analytical parameters emerged from the

detailed study of results and their comparison. The first parameter was final quality

judgments, which were assigned numerical values, to allow for computing averages:

very bad, 1; bad, 2; good, 3; very good, 4.

Set     1    2    3    4    5    6    7    8    9   10   11   12   set ave.
A     2.1  2.2  1.2  1.5  1.6  1.1  2.1  1.6  1.8  2.9  2.1  2.4   1.883
B     2.4  2.6  2.7  1.7  2.5  1.9  2.8  2.4  2.6  2.6  2.2  2.2   2.383
C     2.2  2.2  1.6  1.2  2.0  2.2  1.3  3.0  2.5  2.3  2.3  1.5   2.025
D     3.3  2.2  1.8  1.6  2.6  2.4  2.3  2.0  1.6  2.0  1.9  2.0   2.141

Table 1. Average quality judgments.

Table 1 displays average quality judgments for the 48 translations. In set A, translations
A02 and A10 received the best grades, whereas A03 and A06 got the lowest. The median
value of all translations was 2.1. Technical translations received higher grades than general
translations. The number of words in a translation does not correlate with the corresponding
quality judgments, although evaluators I05 and I03 tended to think that long translations are
good (correlations of 0.295 and 0.291, respectively, significant at 0.05).
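
As a worked check, the sketch below recomputes the set averages from the twelve per-translation averages in Table 1 and compares the political originals (A and C) with the technical ones (B and D). The values are copied from the table; because they are already rounded to one decimal, the recomputed figures may differ from the authors' in the last digit.

```python
from statistics import mean

# Average quality judgments for translations 1-12 of each set (Table 1).
TABLE_1 = {
    "A": [2.1, 2.2, 1.2, 1.5, 1.6, 1.1, 2.1, 1.6, 1.8, 2.9, 2.1, 2.4],
    "B": [2.4, 2.6, 2.7, 1.7, 2.5, 1.9, 2.8, 2.4, 2.6, 2.6, 2.2, 2.2],
    "C": [2.2, 2.2, 1.6, 1.2, 2.0, 2.2, 1.3, 3.0, 2.5, 2.3, 2.3, 1.5],
    "D": [3.3, 2.2, 1.8, 1.6, 2.6, 2.4, 2.3, 2.0, 1.6, 2.0, 1.9, 2.0],
}


def set_averages(table):
    """Return the mean quality judgment of each set."""
    return {name: mean(values) for name, values in table.items()}


averages = set_averages(TABLE_1)
general = mean([averages["A"], averages["C"]])     # political originals
technical = mean([averages["B"], averages["D"]])   # technical originals

for name, value in averages.items():
    print(f"set {name}: {value:.3f}")
print(f"general {general:.2f} vs. technical {technical:.2f}")
```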

Graphic 1 shows the frequency of average quality judgments in the task, which is
close to a normal distribution (a Gaussian bell curve), except that the curve is
displaced to the left, probably because translations were purposefully chosen among the
worst ones, to give evaluators enough to act on. Only nine translations were deemed
Good or Very good (green columns). When the continuum is divided into three equal
intervals, only two translations reach the highest third (blue background).

Graphic 1. Frequency of quality judgment averages in the task.

2.1. Demand

2.1.1. Level

Table 2 provides some information to capture some specifics of subjects’ behavior.

Correlations between quality judgments by evaluators were statistically significant

between I02 and I06 (0.426), I04 and I10 (0.627), I05 and I07 (0.375), and I07 with I08

(0.384) and with I09 (0.596).

subject / set        A               B               C               D             Total
                aver.   s.d.    aver.   s.d.    aver.   s.d.    aver.   s.d.    aver.   s.d.
I02             2.08    0.996   1.82    0.603   2.25    1.055   2.92    0.669   2.28    0.926
I03             2.83    0.835   2.64    0.505   2.92    0.669   2.58    0.669   2.74    0.675
I04             2.42    0.793   2.73    0.786   2.83    0.835   2.50    0.905   2.62    0.822
I05             1.08    0.289   2.25    0.622   1.67    0.651   1.75    1.138   1.69    0.829
I06             1.83    0.835   1.50    0.674   1.17    0.389   1.58    0.996   1.52    0.772
I07             2.25    0.965   2.92    0.793   2.17    0.937   1.75    0.452   2.27    0.893
I08             2.08    0.900   2.42    0.900   2.33    1.155   2.17    1.030   2.25    0.978
I09             2.00    0.853   2.92    0.793   2.42    0.793   1.58    0.515   2.23    0.881
I10*            2.67    0.492   2.33    0.778     -       -       -       -     2.50    0.659
I11             2.25    0.965   2.58    1.084   2.50    0.905   2.45    0.934   2.45    0.951

* Only two set averages are given for I10, and the sets they belong to are not shown; they are
listed here in the order in which they appear.

Table 2. Quality judgment, per subject and set.

Evaluator I03 has the best opinion of the translations (set A average, 2.83; set C

average, 2.92; general average, 2.74). Other generous or lenient evaluators are I04 (2.62
average), I10 (2.50 average) and I11 (2.45 average). On the other hand, I06 is the

harshest or most demanding evaluator (set C average, 1.17; general average, 1.52),
followed by I05 (1.69 average). Graphic 2 shows that subjects can be classified into
three groups: I05 and I06 are the most demanding evaluators; I03, I04, I10, and I11 are

the lenient, and I02, I07, I08 and I09 are in between. The intermediate group is

remarkably homogeneous.

Graphic 2. Quality judgment, per subject.

2.1.2. Evenness

Graphic 3 displays each evaluator’s average quality judgments per set. I05 seems
especially tough in set A (1.08 average) when compared to the average, and I02 is very
generous in set D (2.92). On the other hand, I08 is very regular throughout the sets (2.25
average), followed by I11, I04 and I03, who have better general opinions of the
translations. Clearly, lenient evaluators (plus medium evaluator I08) seem more even
than the rest across all texts.
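
One simple way to put a number on evenness is the spread between a subject's highest and lowest set average. The sketch below applies it to the set averages in Table 2. Using the range as the measure is our assumption for illustration, not necessarily how the authors assessed evenness, and I10 is left out because Table 2 reports only two set averages for that subject.

```python
# Set averages (A, B, C, D) per subject, copied from Table 2.
SET_AVERAGES = {
    "I02": [2.08, 1.82, 2.25, 2.92],
    "I03": [2.83, 2.64, 2.92, 2.58],
    "I04": [2.42, 2.73, 2.83, 2.50],
    "I05": [1.08, 2.25, 1.67, 1.75],
    "I06": [1.83, 1.50, 1.17, 1.58],
    "I07": [2.25, 2.92, 2.17, 1.75],
    "I08": [2.08, 2.42, 2.33, 2.17],
    "I09": [2.00, 2.92, 2.42, 1.58],
    "I11": [2.25, 2.58, 2.50, 2.45],
}


def spread(values):
    """Range of a subject's set averages: smaller means more even."""
    return max(values) - min(values)


# Subjects ordered from the most even to the least even.
by_evenness = sorted(SET_AVERAGES, key=lambda s: spread(SET_AVERAGES[s]))
most_even = set(by_evenness[:4])

for subject in by_evenness:
    print(subject, round(spread(SET_AVERAGES[subject]), 2))
```

Under this measure the four most even subjects are I03, I04, I08, and I11, the same four that the text singles out as regular.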

Graphic 3. Set averages of subjects’ quality judgments.

2.2. Actions

2.2.1. Quantity

The number of actions correlates significantly with quality judgments (-0.535) when
considered text by text, but not when analyzed by subjects. The total number of actions
is 11909 (table 3). Within sets, C and D show the largest variations: some translations
received up to four times as many actions as others.
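
To show what a text-by-text correlation of this kind involves, here is a minimal Pearson coefficient in plain Python, applied to set A only: the twelve average quality judgments from Table 1 against the twelve average action counts from Table 3. It is a sketch on rounded averages for a single set, so it is not expected to reproduce the -0.535 reported for all 48 texts; it only illustrates the direction of the relationship.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Set A, translations 01-12: average quality judgments (Table 1)
# and average number of actions (Table 3).
QUALITY_A = [2.1, 2.2, 1.2, 1.5, 1.6, 1.1, 2.1, 1.6, 1.8, 2.9, 2.1, 2.4]
ACTIONS_A = [44.9, 37.1, 56.8, 56.7, 42.4, 62.7, 32.9, 47.6, 63.3, 30.1, 37.2, 35.3]

r = pearson(QUALITY_A, ACTIONS_A)
print(f"r = {r:.3f}")
```

A negative coefficient means that translations receiving more actions tend to receive lower quality judgments.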

text / set        A                 B                 C                 D
              aver.   s.d.      aver.   s.d.      aver.   s.d.      aver.   s.d.
01            44.90   21.702    24.00    9.684    27.00   15.727     8.80    5.181
02            37.10   15.366    18.00    5.249    18.20    6.374    15.80    8.664
03            56.80   20.471    16.30    4.347    22.60    8.579    22.70    7.675
04            56.70   25.975    19.90    6.226    29.60   11.138    26.00    5.598
05            42.40   20.250    16.90    4.701    19.80    8.053    10.40    5.296
06            62.70   24.784    22.10    6.557    14.50   10.157    12.50    8.100
07            32.90   15.871    14.90    7.534    24.10   10.682    12.00    6.716
08            47.60   20.007    17.60    8.072    23.40   19.945    16.80    8.257
09            63.30   18.331    18.30    8.433    13.10    7.370    11.70    4.877
10            30.10   17.272    14.60    8.605    13.50    7.706    17.50    7.634
11            37.20   17.561    15.50    7.200    14.50    8.567    17.70    6.701
12            35.30   15.151    17.30    8.629    22.90    9.597    13.40    5.621
set aver.     45.58   19.4      17.95    7.103    20.27   10.32     15.44    6.693

Table 3. Quantity of actions, per translation.

Subjects differ widely in the number of actions they carry out (table 4). Subject I02
performed only 627 actions in total, while I10 reached 1877, approximately three times as many.

set / subject     I02     I03     I04     I05     I06     I07     I08     I09     I10     I11     aver.
A               23.08   59      42.5    62.5    37.58   28.42   26.33   47.83   72.67   55.92   45.58
B                8.91   24.92   15.5    30.83   15.83   14      14.67   13.33   30.17   34.5    20.27
C               13.75   19.25   20.67   14.67   19.92   11.17   15.58   12.08   25.17   27.25   17.95
D                6.5    16.33   13.58   16.33   16.5    11.33   16.25   11.83   28.42   17.33   15.44
Total average   13      30      23      31      22      16      18      21      39      34
Nr. of actions  627     1434    1107    1492    1078    779     874     1021    1877    1620

Table 4. Quantity of actions, per subject.
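
The last two rows of Table 4 are linked by simple arithmetic: each subject evaluated 48 translations, so the average per translation is the subject's total divided by 48. The sketch below checks this, along with the overall total and the ratio between the most and least active subjects. The totals are copied from the table; the variable names are ours.

```python
N_TRANSLATIONS = 48  # four sets of twelve translations

# Total number of actions per subject (Table 4, last row).
TOTAL_ACTIONS = {
    "I02": 627, "I03": 1434, "I04": 1107, "I05": 1492, "I06": 1078,
    "I07": 779, "I08": 874, "I09": 1021, "I10": 1877, "I11": 1620,
}

# Average actions per translation, rounded to whole actions.
per_translation = {
    subject: round(total / N_TRANSLATIONS)
    for subject, total in TOTAL_ACTIONS.items()
}
grand_total = sum(TOTAL_ACTIONS.values())
ratio = TOTAL_ACTIONS["I10"] / TOTAL_ACTIONS["I02"]

print(per_translation)
print("grand total:", grand_total)
print(f"I10 / I02 = {ratio:.2f}")
```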

Graphic 4. Set averages for subjects’ quantity of actions.

Graphic 4 displays the average quantity of actions that each subject performed for every
set, ordered from left to right in decreasing total quantity of actions. Curiously, four out
of the five subjects who performed the most actions were the lenient evaluators, and
medium evaluators performed fewer actions than demanding ones.

2.2.2. Types

Marking is the only type of action which correlates significantly at 0.01 with quality

judgments (so do changes at the margin, but they are very few). Evaluators I03, I04,

I07, I09 and I11 tended to act on the text adding, suppressing, and changing text. On the

opposite pole, subjects I02 and I08 were clearly oriented to offer feedback to the

translator or the researcher, whereas the rest did not seem to have a clear pattern of

behavior.

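actions / subjects      I02    I03    I04    I05    I06    I07    I08    I09    I10    I11    total
Classification          612      -      -      -      -      -    821      -    747      -     2180
Mark                      -      -      -    568    636    142      3      2    127     39     1517
margin   Addition         -      -      -     1*      -      -      -      -      -      -        1
         Note             -      -     22      -      -      -      -      -      -      -       22
         Change           -      -      -     54    104      -      -      -      -      -      158
in text  Addition         2    288    113    150     32    125      -     99    172    175     1156
         Suppression      -    141    134     85     47     65      -     90     87    206      855
         Change           -    962    838    620    212    401      -    809    727   1151     5720
         Note             2     28      -     14     46     46     50     21     17     49      273
Doubtful                 11     15      -      -     1*      -      -      -      -      -       27
Total                   627   1434   1107   1492   1078    779    874   1021   1877   1620    11909

Empty cells (-) and column positions are recovered from the row and column totals.
* One of these two single actions belongs to I05 and the other to I06; the totals do not
show which is which.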

Table 5. Types of actions, per subject.

Actions co-occur in certain patterns. Adding in text strongly correlates with changing in

text (0.896), suppressing (0.864) and marking (0.721). Other correlations show

emergent profiles of coherent behavior: there seems to be a general tendency in that

evaluators either tend to try to fix the translations for later use (text-oriented), or else

seem to aim at providing explanations of the sense of their action to the translator or the

researcher (feedback-oriented). Graphic 5 shows the distribution of the five most
common actions in the subjects. Subjects I02 and I08 only classify phenomena, whereas

I03, I04 and I09 focus on changing, adding, and suppressing in the body of texts. In any

case, lenient evaluators seem to focus on introducing changes in the translations,

whereas demanding evaluators tend to mark phenomena more often.

Graphic 5. Types of actions, per subject.

When contrasted to their level of demand, demanding evaluators turn out to prefer to

just mark phenomena, medium evaluators perform more classifications, and lenient

evaluators introduce more changes and suppressions.

As for comments, no clear pattern emerged from their use. It is worth noting,
however, that I05 –one of the more demanding evaluators– was the subject who made
the most of them (37.5% of all), followed by I10 (14.4%). On the other hand, the subjects
who introduced the fewest comments were I04 (1.4%) and I03 (2.8%), the two most lenient
subjects.

2.3. Summary of subjects’ profiles

Evaluators show consistent tendencies to adopt (1) a higher or lower level of demand,

and (2) to confront different texts with a higher or lower degree of evenness. Their

actions on the translations may be (3) more or less abundant; and (4) text-oriented or

feedback-oriented; and (5) supported with a few or many comments.

           demand            actions
        Level   Even    Quant   Type   Comm
I02       2       1       1       1      3
I03       1       2       3       4      1
I04       1       2       2       4      1
I05       3       1       3       3      4
I06       3       1       2       2      2
I07       2       1       1       3      3
I08       2       2       1       1      1
I09       2       1       2       4      1
I10       1       2       3       1      3
I11       1       2       3       3      2

Table 6. Summary of subjects’ characteristics.

Table 6 displays a summary of criteria, where evaluators have been grouped according

to their results. Column I displays the level of demand, from the most lenient (1) to the

most demanding (3). Column II displays the level of evenness, from the least even (1)
to the most even evaluator (2). Column III displays the order of subjects according to

the quantity of actions on texts, from the fewest (1), to the most abundant (3). Column

IV ranks subjects from the most feedback-oriented (1) to the most text-oriented (4).

Finally, column V ranks subjects according to the number of comments they introduced,

from the fewest (1) to the most abundant (4). In brief, demanding evaluators tend to be

feedback-oriented, and are uneven in their level of demand. Medium evaluators tend to

perform few actions and are also pretty uneven. I02 and I07 behave similarly. Lenient

evaluators seem more homogeneous: they are text-oriented, perform many actions when

evaluating, and tend to be pretty even in their judgments. I03 and I04 seem particularly

close in the way they evaluate.

Of course, 10 evaluators are far too few for the data to support firm conclusions,
but they are interesting from two perspectives: First, they point to possible

tendencies and relationships between variables when analyzing the behavior of

evaluators; hence, this research strategy seems promising and deserves further research.

Second, the variation in evaluators’ behaviors is the background to contrast the

regularities found in all of them. These regularities can be explained as order effects in

serial translation evaluation.

3. Order effects

3.1. Order effects in the whole task

Graphic 4 showed that the number of actions decreases dramatically from set A to the

rest in all subjects. This is the first and most obvious order effect, and might be due to

the lack of experience of the students as evaluators. They would start performing many

actions to progressively realize that it meant too much work or else that it was not

necessary to perform so many actions to carry out the task.

As for the type of actions, graphic 6 shows that whereas the number of changes,

additions and suppressions decreases between set A and set D, classifications are the

only type of action that seems to increase. This supports the notion that decreasing

actions by subjects might be due to their adjusting the effort to the task. In fact,

classifications increase because one of the subjects changed her strategy: she stopped

changing and started classifying in the middle of the task.

Graphic 6. Type of actions in different sets.

Graphic 7 shows that salient phenomena >5 –that is, phenomena singled out by more
than five evaluators– stay around 20% in sets A, C, and D. The original in set B was
the first technical text and students were not familiar with the subject matter.

This might explain the drop in coincidences. The relative increase in normalized

phenomena within salient phenomena probably indicates that evaluators felt

uncomfortable with the text. In any case, it is worth pointing out that normalized

phenomena only account for ca. 5% of all actions.

Graphic 7. Percentage of salient phenomena (>5) in each set.

3.2. Order effects within sets

Table 7 shows the amount of actions in the three subsets. There is a general tendency to

reduce the quantity of actions per subset, which may be due to an improvement in

efficiency. Again, this supports the notion of the evaluators learning how to carry out

the task as they were doing it. The exception is set D, where subset III has more actions

than subset II, but it also has a lower quality judgment average.

Subsets / Sets      A       B       C       D     Subset total
I                1955     974     782     733        4444
II               1856     818     715     517        3906
III              1659     640     657     603        3559

Table 7. Amount of actions per subsets.
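
The figures in Table 7 can be checked and summarized with a few lines of code. The sketch below adds up the actions per subset across the four sets and lists the sets in which the count does not fall steadily from subset I to subset III. The values are copied from the table; the variable names are ours.

```python
# Actions in subsets I, II and III of each set (Table 7).
SUBSET_ACTIONS = {
    "A": [1955, 1856, 1659],
    "B": [974, 818, 640],
    "C": [782, 715, 657],
    "D": [733, 517, 603],
}

# Total actions per subset, summed across the four sets.
subset_totals = [
    sum(counts[i] for counts in SUBSET_ACTIONS.values()) for i in range(3)
]

# Sets where actions do not decrease steadily from subset I to subset III.
exceptions = [
    name for name, (first, second, third) in SUBSET_ACTIONS.items()
    if not first > second > third
]

print("subset totals:", subset_totals)
print("grand total:", sum(subset_totals))
print("exceptions to the decreasing trend:", exceptions)
```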

While there is a tendency for most types of action to appear less in subsets II and III

across sets, suppressions increase in sets A, B, and D; additions and changes, in set D;

and classifications in sets C and D. The increase of suppressions throughout three sets

may be taken to indicate that evaluators have a clearer notion of the relevance of the

information. The rise of classifications throughout sets C and D may be thought of as a

consequence of the evaluators getting tired of repeating the same action.

Graphic 8. Percentage of salient (>5) phenomena, in each subset.

Graphic 8 shows a steady increase in the percentage of salient phenomena across the

three subsets, probably an indicator of the degree of certainty in the evaluators. The

drop in normalized phenomena in subset II might be explained as a function of the

degree of confidence of the evaluators in the task.

3.3. Order effects within the texts

Table 8 shows the relationship between number of actions in different translation

sections and quality judgments. All of them are statistically significant, but the closer to
the end, the stronger the correlation. The tendency to increasing significance is evident

at sentence level, since actions in the first sentences in the translations do not correlate

with quality judgments. The relationship between quality judgments and actions in

translation text segments which received a special typographic treatment or else stood

out due to their position in the text, such as titles, headings, captions and the like,

showed a lower significance than regular segments. Hence, visual prominence was ruled

out as an explanation for first and last sentence correlations.

Translations                Pearson     Sig. (2-tailed)
Segment    outstanding      -0.324*     0.025
           regular          -0.525**    0.000
Sentence   first            -0.057      0.701
           last             -0.514**    0.000
           rest             -0.522**    0.000
Section    initial          -0.411**    0.004
           central          -0.548**    0.000
           final            -0.597**    0.000

** Correlation significant at 0.01

Table 8. Relationship between quality judgment and actions.

Hence, evaluators seem to identify phenomena and perform actions on them in all

sections of the translations, but the further down in the text, the stronger the effect on

their judgment of the quality of the translation as a whole. Interestingly, this does not

correspond to the percentage of salient phenomena, which drop in central sections.

Graphic 9. Percentage of salient phenomena (>5) in initial, central, and

final sections of translations.

Quality judgments are independent of the quantity of actions introduced, when

considered by subject. Lenient evaluators do perform more actions than demanding

evaluators, and medium evaluators perform the fewest, as shown in graphic 10.

Graphic 10. Quantity of actions in initial, central, and final sections by

lenient, medium, and demanding evaluators.

Another interesting effect can be traced when the number of actions in translations’
sections (graphic 11) –that is, their initial, middle, and final parts– is related to
average quality judgments. Bad and Good translations show a similar pattern of

subjects’ behavior, where initial sections contain an amount of actions which slightly

decreases in central sections to minimally rise again in final sections. Very Bad

translations, however, show a steady increase in the number of actions across sections,

and Very good translations present a constant decrease in the number of actions as the

text progresses. This might point to an emotional involvement of evaluators in the

process.

Graphic 11. Quantity of actions in initial, central, and final sections of
translations, per average quality judgment.
