Rater sensitivity to lexical accuracy, sophistication and range when assessing writing


Assessing Writing 18 (2013) 173–181


Erik Fritz a,∗, Rachael Ruegg b

a Osaka Institute of Technology, 5-16-1, Omiya, Asahi-ku, Osaka 535-8585, Japan
b Akita International University, 193-2 Okutsubakidai, Yuwa, Akita 010-1292, Japan

Article info

Article history:
Received 4 June 2012
Received in revised form 6 February 2013
Accepted 8 February 2013

Keywords:
Lexis
Writing
Lexical accuracy
Lexical range
Lexical sophistication

Abstract

Although raters can be trained to evaluate the lexical qualities of student essays, the question remains as to what extent raters follow the “lexis” scale descriptors in the rating scale when evaluating or rate according to their own criteria. The current study examines the extent to which 27 trained university EFL raters take various lexical qualities into account while using an analytic rating scale to assess timed essays. In this experiment, the lexical content of 27 essays was manipulated before rating. This was done in order to determine if raters were sensitive to range, accuracy or sophistication when rating writing for lexis. Using a between-subjects ANOVA design, it was found that raters were sensitive to accuracy, but not range or sophistication, when rating essays for lexis. The implications for rater training and using rating scales are discussed.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Writing, whether it is an essay for a test or a shopping list, has a purpose and an audience, even if that audience is the same as the person writing the message. Once the message has been written, the audience is then tasked with deciphering what the writer intended to convey. If the message being read has been written by someone else, the deciphering process can prove difficult, especially when considering the writing of second language learners. The quality of a message depends on a number of components, one of which is the lexis the writer uses. Evaluating the quality of the lexis, by giving a numerical score to an essay, can be a complicated process with many factors to consider.

∗ Corresponding author.
E-mail addresses: [email protected] (E. Fritz), [email protected] (R. Ruegg).

1075-2935/$ – see front matter © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.asw.2013.02.001

Furthermore, the reason a rater gives a particular score or what he or she takes into consideration when evaluating the lexis used in an essay is not widely understood. This study, therefore, intends to find out more about what aspects affect raters’ lexis scores. The study was designed to look at three lexical qualities, two that were mentioned in the rating scale (range and accuracy) and one that was not (sophistication), in order to determine what factors affect lexis scores in timed writing samples of second language writers.

2. Review of literature

In terms of lexis, a good quality essay can be defined as containing the following characteristics: a variety of different words, a selection of both low-frequency and topic-appropriate words, a high percentage of content words, and no or very few lexical errors (Read, 2000). An essay with these lexical qualities would be expected to receive a high lexical score on any writing assessment that uses an analytic scale to measure lexis.

The variety of words in a piece of writing, or lexical range, can affect a rater’s judgment about the quality of a piece of writing (Engber, 1995; Grobe, 1981). In addition, the number of lexical errors in a piece of writing can affect a rater’s opinion of the overall quality of an essay (Engber, 1995; Santos, 1988).

Little research, however, has been conducted on how raters make decisions when rating an essay for lexis. One study by Lumley (2005) used think-aloud protocols to ascertain how four trained raters scored second language learners’ essays using an analytic scale. One part of the scale, called “Task Fulfilment and Appropriacy (TFA)”, included in the scale descriptors the appropriacy of lexis in terms of choice, range, and accuracy. Although two other sub-categories were included in the TFA descriptors, Lumley (2005) separately coded all mentions of lexis by the raters when talking about the essay and justifying their scores. In his study, only 12.3% of what raters mentioned when rating this category had to do with lexis. Lumley (2005) writes of the four raters, “...lexis does not seem to play a major role in text evaluations...” (p. 207).

A study by Cumming, Kantor, and Powers (2002) examined think-aloud protocols from 17 raters of TOEFL essays (without using a formal rating scale) and categorized 35 different decision-making behaviours made by the raters about the essays. One behaviour, called “consider lexis”, showed a mean of 2.5% out of the overall 34.6% of language-focused behaviours mentioned in the protocols. Although think-aloud protocols do not record all the cognitive processes that occur while rating an essay – just the comments that are verbalized – they still offer a window into the decision processes of raters. The limited mention of lexis in the think-aloud protocols of the aforementioned studies, then, seems to indicate a limited focus on lexis in the rating process.

Even with an analytic scale that entails a separate score being awarded for lexis, the question of how raters come to a decision on a lexis score still remains.

In an observational study examining the lexis scores of 140 timed writings of EFL students (Ruegg, Fritz, & Holland, 2011), it was shown that of several lexical qualities, accuracy was the only one significantly predictive of lexis scores (b = −0.103). These results indicate that the more errors an essay had, the lower the lexis score was. That the other lexical qualities had no statistically significant effect on the lexis score, however, calls for further attention. One of the limitations of the previous study was that the actual lexis present in the sample essays did not represent a wide range of different capabilities between students. The present study used an experimental design in order to ascertain whether, when essays contain lexis that varies widely in quality, raters are sensitive to these variations, as demonstrated by the scores they assign for lexis.

A study by Freedman (1979) used a similar experimental design to the present study, although her study did not focus on lexis. In her study, essays were rewritten to be stronger or weaker in relation to content, organization, sentence structure and mechanics. Raters were then asked to rate the essays in terms of the same four categories. It was found that content and organization affected scores more than sentence structure and mechanics, whereas it was on sentence structure and mechanics that the raters commented the most. To our knowledge, however, no experimentally designed research has been carried out in relation to lexical quality in writing.

Lumley (2002) states that because rating is so complex “we can never really be sure which of the multitude of influences raters have relied on in making their judgments, or how they arbitrated between conflicting components of the scale” (p. 268). The current study endeavours to find a way to get to the heart of what raters are truly sensitive to when rating lexis in writing.

The research questions for the present study are: (1) Are raters sensitive to the accuracy of words used when rating essays for lexis? (2) Are raters sensitive to the sophistication of words used when rating essays for lexis? (3) Are raters sensitive to the range of words used when rating essays for lexis?

It was expected that we would find a positive relationship between lexis scores and both lexical range and lexical sophistication. That is, greater lexical range and sophistication in an essay should predict a higher lexis score. On the other hand, we would expect to find a negative relationship between the number of lexical errors and lexis scores. That is, a high number of inaccurate lexical items in an essay should predict a lower lexis score.

3. Methodology

3.1. Context

This study investigates the writing section of an in-house, general proficiency test given annually at a foreign languages university in Japan. The writing test consists of an argumentative essay written in 30 minutes in response to a single prompt. Each essay is twice-rated using a set of four analytic rating scales designed specifically for the test. Most students range from 18 to 22 years of age and are majoring in English as a foreign language.

3.2. Participants

In total there were 27 raters who rated the essays for this study. The 27 raters were all experienced and qualified EFL instructors at the university where the study was carried out. Sixteen of the raters (59.26%) were males and 11 (40.74%) were females. They were from the United States, England, Australia, Canada, Scotland, Jamaica, Bulgaria, Japan, Ireland and New Zealand. All had master’s degrees in TESOL or in the area of linguistics. All raters participated in a rater norming session within a week of the test administration. The purpose of the rater norming session was to increase interrater reliability.

3.3. Testing instrument

Every year the members of the institute’s testing research group meet to decide the writing prompt for the test. The following prompt was selected for the writing section of the test:

“Eating meat is bad for your health and raising animals just to eat causes them to suffer. Sheep, pigs and chickens spend their lives in terrible conditions until they are killed for the supermarket shelves. You can get everything you need for a healthy diet without meat. Therefore, more people should become vegetarian.” Give your reaction to the above statement and support your answer with specific reasons and examples.

The rating scale for the test was the same one that had been used during several prior administrations. The descriptors for the lexis band can be found in Appendix A. A low score of 0 meant that basically the writer wrote only a few words. A high score of 4 meant that, according to the rating scale, a writer would have demonstrated the ability to use a wide range of lexis with high accuracy. Raters are specifically directed to evaluate writing in terms of accuracy and range, but nowhere in the rating scale is sophistication mentioned.

3.4. Procedures

After the prompt had been selected, a student was employed to write a 30-minute essay based on the prompt. Although there were 895 actual examinees for the writing test, all 27 of the manipulated essays stem from one student’s essay. This student was not required to take the test. It was expected, however, that this student’s essay would be within the range of the actual test essays because students in different departments are comparable in terms of writing ability. This student signed a confidentiality agreement in Japanese stating that she would not discuss the details of her employment with any students or teachers at the university.

In order to investigate the lexical content of the essay, first, all function words were removed. Function words were defined as those listed by Nation (2001: 430–431). They were excluded from the investigation because they are inherently more grammatical in nature than lexical. Words from the prompt were also removed as students are informed on the test booklet that they will not be given credit for the use of words in the prompt and raters are also trained not to give credit for those words. Thirty-two words remained after the removal of the function words and words from the prompt; these are the words that were focused on for this study. This original essay, with the 32 content words underlined, can be seen in Appendix B. Three different manipulations were carried out to create essays of varying lexical quality, with respect to lexical range, lexical sophistication and lexical accuracy. For all three lexical qualities, low, medium and high levels were determined, based on the content of the original essay. An equal number of essays were created at each level for each variable. An essay with every possible combination of qualities was created, ranging from the essay with the lowest lexical quality (low range, low sophistication, low accuracy) to the highest lexical quality (high range, high sophistication, high accuracy). This resulted in 27 different combinations and there were therefore 27 essays in the study.
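To make the design concrete, the sketch below simply enumerates the 3 × 3 × 3 combinations of levels described above. It is an illustration of the design only; the variable names are ours, not the authors’.

```python
from itertools import product

# Illustrative sketch of the 3 x 3 x 3 design: one essay version per
# combination of low/medium/high for range, sophistication and accuracy.
LEVELS = ("low", "medium", "high")

versions = list(product(LEVELS, LEVELS, LEVELS))   # 27 combinations
assert len(versions) == 27

for i, (rng, soph, acc) in enumerate(versions, start=1):
    print(f"essay {i:02d}: range={rng}, sophistication={soph}, accuracy={acc}")
```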

In the manipulation of lexical range, the 32 content words were taken into consideration to ascertain how many were the same or similar in meaning. Out of the 32 words, there were 18 unique meanings expressed. The remaining 14 words were the same or similar in meaning to those 18. For example, different variations of the words ‘think’, ‘know’ and ‘believe’ accounted for seven of the 32 content words. It would be possible to replace these seven words with just one word without changing the original meaning significantly. It was therefore found that at least 18 different words were necessary to retain the meaning of the original essay. Thus, essays with low lexical range had just 18 different content words. Essays with high lexical range had different words for all 32 content words. Essays with medium lexical range would ideally be placed halfway between the low and the high essays and therefore, they each had 25 different content words. The remaining seven words were repetitions of other words. Synonyms for all 32 content words were found using an online thesaurus (www.thesaurus.com) and appropriate ones were selected that would not seem out of context or change the meaning of the original essay.
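Operationally, the range manipulation amounts to controlling the number of distinct content-word types among the 32 content-word tokens (18 for low, 25 for medium, 32 for high). A minimal sketch of that type count follows; the short word list is invented for illustration and is not the study’s actual lexis, which appears in its appendices.

```python
# Hypothetical content-word tokens from a manipulated essay; repeated tokens
# lower the type count and hence the lexical range.
content_tokens = [
    "think", "believe", "keep", "animal", "suffer", "health", "diet",
    "think", "keep", "animal",   # repetitions reduce range
]

types = set(content_tokens)
print(f"{len(content_tokens)} tokens, {len(types)} types")
# In the study: low range = 18 types, medium = 25, high = 32, out of 32 tokens.
```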

In the manipulation of lexical sophistication, all 32 content words and their synonyms were analysed using Paul Nation’s RANGE software (Nation, 2005). A selection of words was chosen from the 1000 word level, the 3000 word level and from above the 3000 word level (listed as ‘not in the lists’ in the RANGE output). An essay with low lexical sophistication consisted of 32 content words from the 1000 word level. An essay with medium lexical sophistication consisted of 32 content words from the 3000 word level. An essay with high lexical sophistication consisted of 32 content words from above the 3000 word level.
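The classification itself is a lookup against frequency-band word lists, as the RANGE program performs. The sketch below mimics that step with tiny placeholder sets; the real 1000- and 3000-word lists come with Nation’s software and are not reproduced here.

```python
# Placeholder band lists; the real ones are Nation's frequency lists.
FIRST_1000 = {"eat", "think", "people", "animal", "keep"}
FIRST_3000 = {"suffer", "diet", "condition", "manage"}

def band(word: str) -> str:
    """Return the frequency band a word falls into."""
    if word in FIRST_1000:
        return "1000 word level"
    if word in FIRST_3000:
        return "3000 word level"
    return "not in the lists"   # above the 3000 word level

for w in ("eat", "suffer", "carnivorous"):
    print(w, "->", band(w))
```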

In the manipulation of lexical accuracy, two types of errors were identified, based on the lexical errors outlined by Ruegg, Fritz, and Holland (2011): (1) using the word out of context; and (2) using the wrong part of speech. Changes to the accuracy of lexis were based on these two types of errors. ‘Type-one’ errors relate to word choice and ‘type-two’ errors relate to word formation. There are those who would argue with the classification of using the wrong part of speech as a lexical error, rather than a grammatical one. However, Nation (2001) states that there are three concepts involved in knowing a word: word form, word meaning and word use. Of these three concepts, both of the error types in this study fall within Nation’s definition of word meaning.

The following is an example of using a word out of context or a type-one error:

Original Sentence: First I think there are trouble people who are keeping animals.
Manipulated Sentence: First I think there are trouble people who are managing animals.

Here is an example of a type-two error:

Original Sentence: First I think there are trouble people who are keeping animals.
Manipulated Sentence: First I think there are trouble people who are keeper animals.

In an essay with low accuracy, all 32 content words were inaccurate. In an essay with medium accuracy, half of the content words (16) were inaccurate and the other half were accurate. In an essay with high accuracy, all 32 content words were accurate. The essay with the lowest lexical quality (low range, low sophistication, low accuracy) can be seen in Appendix C. The essay with the highest lexical quality (high range, high sophistication, high accuracy) can be seen in Appendix D.

After the manipulations had been made and before the administration of the test, a different student was employed to handwrite the 27 versions of the essay on official test paper. This student signed the same confidentiality agreement as the student who wrote the original essay. Immediately after the test administration, the essays created for this study were combined with the actual test essays for rating.

Student identity numbers were fabricated for the essays created in this study, which were similar to real student ID numbers in order to make it appear as though the essays were written by actual first and second year students at the university. Raters of the writing section of the test are always assigned a random selection of essay numbers to rate for each administration of the test. The essays created for this study were also randomly assigned. Each rater was assigned 39 actual test essays to rate and, in addition to this, rated three of the essays created for this study. The essays for this study were placed at the beginning, in the middle and at the end of the list of assigned essays in an attempt to ensure that the essays created for this study would not be rated sequentially.

3.5. Analysis

The actual writing test data consisted of 895 essays. All 922 essays were rated using a four-band analytic rating scale, with a separate lexis band (see Appendix A). The actual test essays were double rated. The data from the 27 essays created for this study was checked for inconsistencies. Although it was intended for each essay to be rated 3 times, the final number of ratings for each essay ranged from 1 to 4 because of rater error and a few ratings being excluded. Rater error occurred because a few raters, who were supposed to rate the manipulated essays, instead rated other essays by mistake. This also occurs during real test rating, and when it does, it is sometimes necessary to ask an additional rater to rate an essay which only has one rating. However, in the case of these manipulated essays, raters had already rated three each and asking them to rate more similar essays may have led to them realising that the essays were similar and becoming suspicious.

Five of the essay ratings for manipulated essays had to be excluded because the rater had given a score of 0 for all four rating scales. Although the manipulated essay was weak, it was much stronger than an essay that would be expected to receive a score of 0 for every rating scale. Therefore, it is believed that the raters who assigned a 0 for every rating scale may have perceived that the essay was very similar to another one they had already rated and therefore penalised a student who they thought had cheated. In total there were 72 ratings used to calculate the 27 essays’ scores. Raw scores were used for the essays in this study.

Separate between-subjects ANOVAs were run, each with the lexis score as the dependent variable and lexical range, sophistication and accuracy as the independent variables, respectively. For each of the independent variables, three groups were created: low, medium and high. For lexical range, the high group had the widest range of lexis used, the medium group had some range of types and the low group had the lowest range of types. For lexical sophistication, the high group contained low frequency words (above the 3000 word level), the medium group contained medium frequency words (3000 word level) and the low group contained high frequency words (1000 word level). For lexical accuracy, the high group contained no lexical errors, the medium group contained 16 lexical errors, and the low group contained 32 lexical errors.
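The analysis reduces to three one-way ANOVAs on the same lexis scores, one per manipulated variable. The sketch below shows how one such ANOVA could be run in Python on a toy ratings table; the column names and scores are placeholders, not the study’s data.

```python
import pandas as pd
from scipy.stats import f_oneway

# Toy ratings table: one row per rating, with the lexis score and the
# accuracy level of the essay that was rated. Values are invented.
ratings = pd.DataFrame({
    "lexis":    [1, 2, 2, 2, 3, 2, 3, 4, 4],
    "accuracy": ["low"] * 3 + ["medium"] * 3 + ["high"] * 3,
})

groups = [g["lexis"].values for _, g in ratings.groupby("accuracy")]
f_stat, p_value = f_oneway(*groups)   # one-way between-subjects ANOVA
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
# Repeating the same call with "sophistication" and "range" as the grouping
# column reproduces the study's three separate ANOVAs.
```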

Table 1. Lexical accuracy.

Group     N    Mean   SD
Low       21   1.62   1.117
Medium    25   2.12   .781
High      25   2.44   .961

Table 2. Lexical sophistication.

Group     N    Mean   SD
Low       24   1.79   .509
Medium    24   2.29   1.160
High      23   2.17   1.154

4. Results

The descriptive statistics for each variable can be seen in Tables 1 through 3. The descriptive statistics include the number of essays rated, as well as the means and standard deviations of the lexis scores given by raters for each group. As for the inferential statistics, an ANOVA was run for each of the three independent variables, with only accuracy showing statistical significance (see Table 4).

For accuracy, the trends go from low to high, with higher accuracy on average receiving a higher score, as can be seen in Table 1. Based on histograms, it was determined that the data was normally distributed. Levene’s Test indicated that the assumption of homogeneity of variance was tenable. For accuracy, the results were statistically significant, F(2, 68) = 4.26, p < .05. Eta-squared was .111, indicating that 11.1% of the variance in scores can be attributed to lexical accuracy.
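For readers who want to reproduce this kind of summary, the sketch below shows one way to run Levene’s test and to derive eta-squared from the ANOVA sums of squares (SS between / SS total). The data frame is the same toy table used in the earlier sketch, not the study’s ratings.

```python
import pandas as pd
from scipy.stats import levene
import statsmodels.api as sm
from statsmodels.formula.api import ols

ratings = pd.DataFrame({
    "lexis":    [1, 2, 2, 2, 3, 2, 3, 4, 4],
    "accuracy": ["low"] * 3 + ["medium"] * 3 + ["high"] * 3,
})

# Homogeneity of variance across the low/medium/high groups.
groups = [g["lexis"].values for _, g in ratings.groupby("accuracy")]
print("Levene:", levene(*groups))

# Eta-squared from the one-way ANOVA table: SS_between / SS_total.
model = ols("lexis ~ C(accuracy)", data=ratings).fit()
anova_table = sm.stats.anova_lm(model, typ=1)
eta_squared = anova_table.loc["C(accuracy)", "sum_sq"] / anova_table["sum_sq"].sum()
print(f"eta-squared = {eta_squared:.3f}")
```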

For range, the data in Table 3 show no apparent pattern. Again, Levene’s Test showed that the assumption of homogeneity of variance was not violated. The between-subjects results were not statistically significant for range, F(2, 68) = 0.069, p > .05.

As can be seen from the data in Table 2, for frequency there were no apparent patterns. Although the assumption of normality was violated, ANOVA is known to be robust to this violation when sample sizes are similar (Kirk, 1995). The results for frequency were also not statistically significant, F(2, 68) = 1.68, p > .05.

To return to our expectations from the outset of the study, it was confirmed that as accuracy increased, so did the lexical score. Our expectations for sophistication and range were not confirmed in this study. The scores where range was manipulated are very similar. As for sophistication, there was a slight difference in scores but this difference was not significant, especially between the high and low groups.

Table 3. Lexical range.

Group     N    Mean   SD
Low       25   2.08   .812
Medium    20   2.15   1.040
High      26   2.04   1.148

Table 4. Between-subjects factors ANOVA.

Independent variable   df   F        p      η²
Accuracy               2    4.262*   .018   .111
Sophistication         2    1.680    .194   .047
Range                  2    .069     .933   .002

* p < .05.

5. Discussion

That raters are attuned to the accuracy of words used in an essay is not surprising. Correcting an essay for errors is perhaps the clichéd image when one thinks of an English teacher. It may be that raters, pressed for time and tasked with evaluating many variables at once, naturally evaluate the most obvious and easiest to spot of the lexical qualities, lexical errors. One issue to consider is whether a rating scale, either for the classroom or for a large-scale writing test, should include separate categories for grammar and lexis. Having separate rating scales for lexis and grammar, where both entail the evaluation of accuracy, may be seen to complicate the rating of lexical accuracy. Therefore, it is advisable to clearly distinguish lexical from grammatical accuracy during the rater training process if there are two separate rating scales for lexis and grammar.

Apart from lexical accuracy, lexical range is also explicitly mentioned in the rating scale. The results show, however, that the range of words used did not significantly affect the lexis scores. This means that, although raters were explicitly trained to evaluate essays on the range of lexis used, they did not follow the rating scale in this respect. Students, therefore, were given scores on their essays that did not inform them of characteristics which were supposed to be assessed by this writing task. This finding suggests possible implications for rater training, where spending more time attuning raters to particular aspects of the rating scale may increase the construct validity of the test.

On the other hand, lexical sophistication is not mentioned at all in the rating scale, nor was it discussed in the training sessions. The results show that the frequency of words used did not play a deciding role in how raters evaluated the lexis score of an essay. This can be seen as a positive result since this variable was not mentioned in the rating scale and low frequency words are not intended to result in higher lexis scores.

Some may believe that raters may not be able to evaluate all characteristics mentioned in a rating scale, no matter how much training they receive. Contrary to this idea, several previous studies have found rater training to be effective (Alderson, Clapham, & Wall, 2002; Brown, 2003; Lumley & McNamara, 1995; Weigle, 1994, 1998). Weigle (1994) found that norming raters to the scale and terms in the descriptors modified their expectations of student writing and helped to clarify the scoring criteria.

Explaining how to determine a good or poor range of lexis in an essay during rater training, for example, may help raters focus their attention on the scale more. Terms that appear in rating scales, such as ‘word formation’, may be interpreted in different ways and should be explained in detail during the norming process to ensure that raters go into the rating process with the same concepts in mind. In this study, an effort was made during the rater training session to explain and differentiate the terms used in the scales.

6. Limitations and suggestions for further research

Concerning the administration of the rating of essays used in this research, the raters were given the same instructions as they would usually be given to rate the essays; they were told to rate all essays as usual and let the rating room supervisor know about any concerns they might have. Some of the raters may have noticed something peculiar when rating the essays created for this study. One possible reason was that each rater needed to rate three of the manipulated essays in addition to the normal essays in order to ensure adequate numbers for statistical purposes. However, as raters were only required to rate 39 actual test essays, a total of three similar essays was possibly too many and may have aroused some suspicion. In future research it would be advisable to carefully consider the total number of actual test essays to be rated before deciding how many research essays each rater should rate.

Another possible reason for some of the raters’ concerns was that the administration of the test was managed differently than in previous administrations. Because of this, there were a few examinees that cheated on the writing section of the test by copying the writing of the person sitting next to them. This may have made raters more alert to possible cheating.

In addition to this, two more points are important to mention. Firstly, it is possible that the lexical manipulations made to the essays were too advanced for the test population, even though care was taken in order to make the writing appropriate. Additionally, the original essay was written by a third-year university student studying English at the same university. Her writing was well within the range of writing ability typical of students studying at the university. Nevertheless, the level of the essay was at the lower end of the range. Because of the limited number of raters involved in the writing test at this university, it was not possible to include essays at a range of different levels. In future studies of this type, however, it would be advisable to manipulate a range of essays in order to ascertain whether similar effects arise in different levels of essay quality.

The raters’ concerns are not considered to have affected the ratings in this study as the few ratings that seemed unlikely, given the content of the essays, were deleted from the sample. However, it would be preferable not to arouse any suspicion in raters when conducting this kind of research. Therefore, an effort should be made to limit the number of manipulated essays rated by each rater, while ensuring that actual examinees do not cheat. It is also advisable not to manipulate certain qualities in essays too extremely.

7. Conclusions

In the rating scale for the current study it is stated that raters should consider lexical range (referred to as ‘variety’ in the rating scales) and lexical accuracy (referred to as ‘control’ in the rating scales) when rating writing for lexis. In addition to range and accuracy, and although it was not mentioned in the scale, it was considered by the researchers that raters might also be sensitive to the frequency of words used (lexical sophistication). It is clear, however, that raters in this study were only sensitive to lexical accuracy.

One possible reason for this might be that readers are looking at so many factors when rating essays with analytic rating scales that they have very little time to cognitively process everything on the scale. It may be that the most salient feature, or the easiest to grade, is lexical accuracy because errors are usually easy to spot. Other lexical qualities, such as range, may take more time to evaluate.

After analysing the findings of this study, it seems that raters were not sensitive to range, even though it was mentioned in the rating scale. Although during rater training sessions time is at a premium, clearly it is necessary to focus more on the evaluation of lexical range during these sessions. This may help to make scores more reliable and the test more valid.

More research manipulating lexical qualities, or other aspects of writing quality, such as grammar, would help to further understand what qualities of writing raters are sensitive to when rating writing. A replication of this study is recommended to verify the conclusions made by the researchers. In addition, a think-aloud protocol could be carried out to gather information about the thought processes of the raters while rating.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.asw.2013.02.001.

References

Alderson, J., Clapham, C., & Wall, D. (2002). Language test construction and evaluation. Cambridge: Cambridge University Press.
Brown, A. (2003). Interviewer variation and co-construction of speaking proficiency. Language Testing, 20, 1–25.
Cumming, A., Kantor, R., & Powers, D. (2002). Decision making while rating ESL/EFL writing tasks: A descriptive framework. The Modern Language Journal, 86(1), 67–96.
Engber, C. (1995). The relationship of lexical proficiency to the quality of ESL compositions. Journal of Second Language Writing, 4(2), 139–155.
Freedman, S. (1979). How characteristics of student essays influence teachers’ evaluations. Journal of Educational Psychology, 71(3), 328–338.
Grobe, C. (1981). Syntactic maturity, mechanics, and vocabulary as predictors of quality ratings. Research in the Teaching of English, 15(1), 75–85.
Kirk, R. (1995). Experimental design: Procedures for the behavioral sciences. Boston: Brooks/Cole Publishing Company.
Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do they really mean to raters? Language Testing, 19(3), 246–276.
Lumley, T. (2005). Assessing second language writing: The rater’s perspective. Frankfurt am Main: Peter Lang.
Lumley, T., & McNamara, T. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12, 54–71.
Nation, I. S. P. (2001). Learning vocabulary in another language. Cambridge: Cambridge University Press.
Nation, I. S. P. (2005). Range and frequency: Programs for Windows based PCs [Computer software and manual]. Retrieved from http://www.victoria.ac.nz/lals/staff/paul-nation/nation.aspx
Read, J. (2000). Assessing vocabulary. Cambridge: Cambridge University Press.
Ruegg, R., Fritz, E., & Holland, J. (2011). Rater sensitivity to qualities of lexis in writing. TESOL Quarterly, 45(1), 63–80.
Santos, T. (1988). Professors’ reactions to the academic writing of nonnative-speaking students. TESOL Quarterly, 22(1), 69–90.
Weigle, S. (1994). Effects of training on raters of ESL compositions. Language Testing, 11(1), 197–223.
Weigle, S. (1998). Using FACETS to model rater training effects. Language Testing, 15, 263–287.

Erik Fritz is a senior university lecturer in Japan. His research interests include vocabulary, writing, assessment and corpus linguistics.

Rachael Ruegg is currently a PhD candidate at Macquarie University and a lecturer at Akita International University in Akita, Japan. Her research interests include vocabulary, writing, feedback and assessment. She can be contacted at: [email protected].