Assessing Writing 12 (2007) 108–128
Available online at www.sciencedirect.com
Little coherence, considerable strain for reader: A comparison between two rating scales for the assessment of coherence
Ute Knoch
Language Testing Research Centre, University of Melbourne, Level 3, 245 Cardigan Street, Carlton, Victoria 3052, Australia
Available online 17 October 2007
Abstract
The category of coherence in rating scales has often been criticized for being vague. Typical descriptors
might describe students' writing as having a "clear progression of ideas" or "lacking logical sequencing".
These descriptors inevitably require subjective interpretation on the part of the raters.
A number of researchers (Connor & Farmer, 1990; Intaraprawat & Steffensen, 1995) have attempted to
measure coherence more objectively. However, these efforts have thus far not been reflected in rating scale descriptors. For the purpose of this study, the results of an adaptation of topical structure analysis (Connor and Farmer, 1990; Schneider and Connor, 1990), which proved successful in distinguishing different degrees of coherence in 602 academic writing scripts, were used to formulate a new rating scale. The study investigates whether such an empirically grounded scale can be used to assess coherence in students' writing more
reliably and with greater discrimination than the more traditional measure. The validation process involves
a multi-faceted Rasch analysis of scores derived from multiple ratings of 100 scripts using the old and new
rating descriptors as well as a qualitative analysis of questionnaires canvassed from the raters. The findings
are discussed in terms of their implications for rating scale development.
© 2007 Elsevier Inc. All rights reserved.
Keywords: Writing assessment; Rating scales; Coherence; Rating scale validation; Multi-faceted Rasch analysis
1. Introduction
Because writing assessment requires subjective evaluations of writing quality by raters, the raw
score candidates receive might not reflect their actual writing ability. In an attempt to reduce the
Tel.: +61 3 8344 5206; fax: +61 3 8344 5163.
E-mail address: [email protected].
1075-2935/$ – see front matter © 2007 Elsevier Inc. All rights reserved.
doi:10.1016/j.asw.2007.07.002
variability between raters and therefore to increase the reliability of ratings, attempts have been
made to improve certain features of the rating process, most commonly through rater training
(Elder, Knoch, Barkhuizen, & von Randow, 2005; McIntyre, 1993; Weigle, 1994a, 1994b, 1998).
However, despite all the efforts put into training raters, it has been shown that differences in rater
reliability persist and can account for as much as 35% of variance in students' written performance (Cason & Cason, 1984). Some researchers have suggested that a better specification of scoring
criteria might lead to an increase in rater reliability (Hamp-Lyons, 1991; North, 1995, 2003;
North & Schneider, 1998). One reason for the variability found in writing performance might lie
in the way rating scales are designed. Fulcher (2003) has shown that most existing rating scales
are developed based on intuitive methods, which means that they are either adapted from already
existing scales or they are based on what developers think might be common features in the writing
samples in question. However, for rating scales to be more valid, it has been contended that rating
scales should be based on empirical investigation of actual writing samples (North & Schneider,
1998; Turner & Upshur, 2002; Upshur & Turner, 1995, 1999).
2. The assessment of coherence in writing
Lee (2002) defines coherence as the relationships that link the ideas in a text to create mean-
ing. Although a number of attempts have been undertaken in second language writing research
to operationalize coherence (Cheng & Steffensen, 1996; Connor & Farmer, 1990; Crismore,
Markkanen, & Steffensen, 1993; Intaraprawat & Steffensen, 1995), this has not been reflected
in rating scales commonly used in the assessment of writing. Watson Todd, Thienpermpool and
Keyuravong (2004), for example, criticized the level descriptors for coherence in a number of
rating scales as being vague and lacking enough detail for raters to base their decisions on. They
quote a number of rating scale descriptors used for measuring coherence. The commonly used
and much cited Jacobs scale (Jacobs, Zinkgraf, Wormuth, Hartfiel, & Hughey, 1981), for exam-
ple, describes high quality writing as "well organized" and exhibiting "logical sequencing". In other scales, less successful writing has been described, for example, as being "fragmentary so that comprehension of the intended communication is virtually impossible" (TEEP Attribute Writing
Scales, cited in Watson Todd et al., 2004). Watson Todd et al. therefore argue that while analytic
criteria are intended to increase the reliability of rating, the descriptors quoted above inevitably
require subjective interpretations by the raters and might lead to confusion. Although one reason for these vague descriptions of coherence might lie in the rather vague nature of coherence itself, Hoey (1991) was able to show that judges are able to reach consensus on the level of coherence.
A notable exception to the scales described above is a scale for coherence developed by
Bamberg (1984). Although Bamberg was able to develop more explicit descriptors for a number
of different aspects of writing related to coherence (e.g., organization and topic development),
her holistic scale descriptors mix a variety of aspects at the descriptor level. The descriptor for
level 2, for example, describes the writing as incoherent, refers to topic identification, setting of
context, the use of cohesive devices, the absence of an appropriate conclusion, flow of discourse
and errors. Although the scale has five levels, when Bamberg's raters used the scale, they seemed
to only be able to identify three levels. It is possible that because this holistic scale mixes so many
aspects at the descriptor level, raters were overusing the inner three band levels of the scale and
avoiding the extreme levels.
It seems that no existing rating scale for coherence has been able to operationalize this aspect
of writing in a manner that can be successfully used by raters. The aim of this study was therefore
to attempt to develop a rating scale for coherence which is empirically-based.
2.1. Topical structure analysis (TSA)
In the second language writing literature, several attempts have been made to measure coher-
ence. To be transferable into a rating scale, the method chosen for this study needs to be sufficiently
simple to be used by raters who are rating a number of scripts in a limited amount of time. Several
methods were investigated as part of the literature review for this study. Crismore et al.'s (1993)
metadiscoursal markers were excluded because insufficient tokens were found in students' essays
and measures like topic-based analysis (Watson Todd, 1998; Watson Todd et al., 2004) needed to
be excluded for being too complicated and time consuming. For this study, topical structure anal-
ysis (TSA) was chosen and adapted because it was the only attempt at operationalizing coherence
which was sufficiently simple to be transferred into a rating scale.
TSA, based on topic and comment analysis, was first described by Lautamatti (1987) from
the Prague School of Linguistics in the context of text readability to analyze topic development
in reading material. She defined the topic of a sentence as what the sentence is about and the
comment of a sentence as what is said about the theme. Lautamatti described three types of progression, which create coherence in a text. These types of progression advance the discourse topic by developing a sequence of sentence topics. Through this sequence of sentence topics, local coherence is created. The three types of progression can be summarized as follows (Hoenisch, 1996; a schematic code sketch follows the examples):
1. Parallel progression, in which topics of successive sentences are the same, producing a repetition of topic that reinforces the idea for the reader (a, a, a).
Example: Paul walked on the street. He was carrying a backpack.
2. Sequential progression, in which topics of successive sentences are always different, as the comment of one sentence becomes, or is used to derive, the topic of the next (a, b, c, d).
Example: Paul walked on the street. The street was crowded.
3. Extended parallel progression, in which the first and the last topics of a piece of text are the same but are interrupted with some sequential progression (a, b, a).
Example: Paul walked on the street. Many people were out celebrating the public holiday.
He had trouble finding his friends.
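As a concrete illustration (not part of Lautamatti's method itself, and with all names hypothetical), the three progression types can be sketched as a simple classification over sentence topics and comments that have already been identified; real TSA coding relies on human judgment about synonymy and inference, which no such sketch captures:

    def classify_progression(prev_topic, prev_comment, topic, earlier_topics):
        """Label the link between two adjacent sentences, given their topics."""
        if topic == prev_topic:
            return "parallel"           # same topic repeated
        if topic == prev_comment:
            return "sequential"         # previous comment becomes the new topic
        if topic in earlier_topics:
            return "extended parallel"  # returns to an earlier topic
        return "other"

For the second example above, classify_progression("Paul", "the street", "the street", {"Paul"}) would return "sequential".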
Witte (1983a, 1983b) introduced TSA into writing research. He compared two groups of per-
suasive writing scripts, one rated high and one rated low, on the use of the three types of progression
described above. He found that the higher level writers used less sequential progression and more
extended and parallel progression. There were, however, several shortcomings of Witte's study.
Firstly, the raters were not professional raters, but rather were solicited from a variety of profes-
sions. Secondly, Witte did not use a standardized scoring scheme. He also conducted the study
in a controlled revision situation in which the students revised a text written by another person.
Furthermore, Witte did not report any intercoder reliability analysis.
In 1990, Schneider and Connor set out to compare the use of topical structures by 45 writers
taking the Test of Written English (TWE). They grouped the 45 argumentative essays into three
different levels: high, medium and low. As with Witte's study, Schneider and Connor did not report any intercoder reliability statistics. The findings were contradictory to Witte's: the higher level writers used more sequential progression while the low and middle groups used more parallel
progression. There was no difference between the levels in the use of extended parallel progres-
sion. Schneider and Connor drew up clear guidelines on how to code TSA and also suggested
a reinterpretation of sequential progression as part of their discussion section. They suggested
dividing sequential progression into the following subcategories:
1. Direct sequential progression, in which the comment of the previous sentence becomes the topic of the following sentence. The topic and comment are either word derivations (e.g., science, scientist) or they form a part-whole relation (e.g., these groups: housewives, children).
2. Indirect sequential progression, in which the comment of the previous sentence becomes the
topic of the following sentence but topic and comment are only related by semantic sets (e.g.,
scientists, their inventions and discoveries, the invention of the radio, telephone and television).
3. Unrelated sequential progression, in which topics are not clearly related to either the previous sentence topic or the discourse topic.
Wu (1997), in his doctoral dissertation, applied Schneider and Connor's revised categories to analyze two groups of scripts rated using the scale developed by Jacobs et al. (1981). He found in his analysis no statistically significant difference in terms of the use of parallel progression between high and low level writers. Higher level writers used slightly more extended parallel progression and more direct sequential progression.
A more recent study using TSA to compare groups of writing based on holistic ratings was undertaken by Burneikaite and Zabiliute (2003). Using the
writing based on holistic ratings, was undertaken by Burneikaite and Zabiliute (2003). Using the
original criteria of topical structure developed by Lautamatti and Witte, they investigated the use
of topical structure in argumentative essays by three groups of students rated as high, middle
and low based on a rating scale adapted from Tribble (1996). They found that the lower level
writers over-used parallel progression whilst the higher level writers used a balance between
parallel and extended parallel progression. The differences in terms of sequential progression
were small, although they could show that lower level writers use this type of progression slightly
less regularly. Burneikaite and Zabiliute failed to report any interrater reliability statistics.
All studies conducted since Witte's study in 1983 show generally very similar findings; however,
there are slight differences. Two out of three studies found that lower level writers used more
parallel progression than higher level writers; however, Wu (1997) found no significant difference.
All three studies found that higher level writers used more extended parallel progression. In terms
of sequential progression the differences in findings can be explained by the different ways this
category was used. Schneider and Connor (1990), and Burneikaite and Zabiliute (2003) used the
definition of sequential progression with no subcategories. Both studies found that higher level
writers used more sequential progression. Wu (1997) found no differences between different
levels of writing using this same category. However, he was able to show that higher level writers
used more related sequential progression. It is also not entirely clear how much task type or topic
familiarity influences the use of topical structure and if findings can be transferred from one
writing situation to another.
3. The study
The aim of this study was to investigate whether TSA can successfully be operationalized into a
rating scale to assess coherence in writing. The study was undertaken in three phases. Firstly, 602
writing samples were analyzed to establish the topical structure used by writers at five levels of
writing ability. The findings were then transferred into a rating scale. To validate this scale, eight
raters were trained and then rated 100 writing samples. The findings were compared to previous
ratings of the same 100 scripts by the same raters using an existing rating scale for coherence.
Fig. 1. Research design.
After the rating rounds, raters were given a questionnaire to fill in to canvass their opinions about
the rating scale and a subset of five raters was interviewed.
Fig. 1 illustrates the design of the study visually.
The research questions were as follows:
RQ1: What are the features of topical structure displayed at different levels of expository writing?
RQ2: How reliable and valid is TSA when used to assess coherence in expository writing?
RQ3: What are raters' perceptions of using TSA in a rating scale as compared to more conven-
tional rating scales?
4. Method
4.1. Context of the research
This study was conducted in the context of the Diagnostic English Language Needs Assessment
(DELNA), which is administered at the University of Auckland, New Zealand. DELNA is a
university-funded procedure designed to identify the English language needs of undergraduate
students following their admission to the University, so that the most appropriate language support
can be offered. DELNA is administered to both native and non-native speakers of English. This
context was selected by the researcher purely because of its availability and because the rating
scale used to assess the writing task (see description below) is representative of many other rating
scales used in writing performance assessment across the world. A more detailed description of
the assessment and the rating scale can be found in the section below.
4.1.1. The assessment instrument
DELNA includes a screening component which consists of a speed-reading and a vocabulary
task. This is used to eliminate highly proficient users of English and exempts them from the time-
consuming and resource-intensive diagnostic procedure. The diagnostic component comprises
objectively scored reading and listening tasks and a subjectively scored writing task.
The writing section is an expository writing task in which students are given a table or graph
of information which they are asked to describe and interpret. Candidates have 30 minutes to
complete the task. The writing task is routinely double (or if necessary triple) marked analyti-
cally on nine traits (organization, coherence, style, data description, interpretation, development
of ideas, sentence structure, grammatical accuracy, vocabulary and spelling) on a six-point scale
ranging from four to nine. The assessment criteria were developed in-house, initially based on an
existing scale. A number of validity studies have been conducted on the DELNA battery, which
included validation of the rating scale (Elder & Erlam, 2001; Elder & von Randow, 2002). The
wording of the scale has been changed a number of times based on the feedback of raters after
training sessions or during focus groups. The DELNA rating scale reflects common practice in performance assessment in that the descriptors are graded using qualifiers like "adequate", "appropriate", "sufficient", "severe" or "slight". The coherence scale uses descriptors like "skilful coherence, message able to be followed effortlessly" or "little coherence, considerable strain for reader". Strain is graded across the level descriptors from "slight" and "some" to "considerable" and "severe".
4.1.2. The writing samples
To identify the specific features of topical structure used by writers taking DELNA, 602 writing
samples, which were produced as part of the 2004 administration of the assessment, were randomly
selected. The samples were originally hand-written by the candidates. The mean number of words
for the scripts was 269, ranging from 75 to 613.
4.1.3. The candidates
Three hundred twenty-nine of the writing samples were produced by females and 247 by
males (roughly reflecting the gender distribution of DELNA), whilst 26 writers did not spec-
ify their gender. The L1 of the students (as reported in a self-report questionnaire) varied.
Forty-two percent (or 248 students, N = 591) have an Asian first language, 36% (217) are
native speakers of English, 9% (52) are speakers of a European language other than English,
5% (31) have either a Pacific Island language or Maori as first language, and 4% (21) speak
either an Indian or a language from Sri Lanka as first language. The remaining 4% (22)
were grouped as others. Eleven students did not fill in the self-report questionnaire. The
scripts used in this analysis were all rated by two DELNA raters. In case of discrepancies
between the scores, the scores were averaged and rounded (in the case of a .5 result after averaging, the score was rounded down). The 602 scripts were awarded the following average
marks (Table 1).
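This aggregation rule is simple enough to state as code; the following is a minimal sketch, where the function name and the two-rater signature are illustrative rather than taken from the DELNA system:

    def final_score(rating1, rating2):
        """Average two ratings on the 4-9 DELNA scale; a .5 result rounds down."""
        average = (rating1 + rating2) / 2
        return int(average)  # truncation rounds an x.5 average down for positive scores

    assert final_score(6, 7) == 6  # 6.5 rounds down to 6
    assert final_score(7, 7) == 7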
4.1.4. The raters
The eight DELNA raters taking part in this study were drawn from a larger pool of raters based
on their availability at the time of the study. All raters have high levels of English proficiency
although not all are native speakers of English. Most have experience in other rating contexts, for
example, as accredited raters of the International English Language Testing System (IELTS). All
have postgraduate degrees in either English, Applied Linguistics or Teaching English to Speakers of Other Languages (TESOL). All raters have several years of experience as DELNA raters and take part in regular training and moderation sessions, either face-to-face or online (Elder,
Barkhuizen, Knoch, & von Randow, 2007; Knoch, Read, & von Randow, 2007).
Table 1
Score distribution of 602 writing samples
DELNA score   Frequency   Percent (%)
4             23          4
5             115         19
6             253         42
7             172         29
8             26          4
4.2. Procedures: analysis of writing samples
4.2.1. Pilot study
While coding the 602 writing scripts, the categories of parallel, direct sequential and unrelated
sequential progression were used as defined by Schneider and Connor (1990) and Wu (1997).
However, other categories had to be changed or added to better account for the data. Firstly,
extended parallel progression was changed to extended progression to account for cases in which
the topic of a sentence is identical to a comment occurring more than two sentences earlier.
Similarly, indirect sequential progression was modified to indirect progression to also include
cases in which the indirect link is back to the previous topic. Then, a category was created that
accounts for features very specific to writers whose L1 is not English. At very low levels, these
writers often attempt to create a coherent link back to the previous sentence but fail because, for
example, they use an incorrect linking device or a false pronominal. This category was called
coherence break. Another category was established to account for coherence that is created not
by topic progression but by features such as linking devices (e.g., however, also, but). This category also includes cases in which the writer clearly signals the ordering of an essay or paragraph early on, so that the reader can follow any piece of discourse without needing topic progression as
guidance. Table 2 below presents all categories of topical structure used in the main analysis with
definitions and examples.1
4.2.2. Main analysis
To analyze the data, the writing scripts were first typed and then divided into t-units following
Schneider and Connor (1990) and Wu (1997). The next step was to identify sentence topics. For
this, Wus (1997) criteria were used (see Appendix A). Then each t-unit was coded into one of
the seven categories as described in Table 2. The percentage of each category was recorded into
a spreadsheet. The mean DELNA score produced by the two DELNA raters was also added for
each candidate. To identify which categories were used by students at different proficiency levels,
the final score was correlated with the percentage of occurrence of each category. The results of
this analysis can be found in the results section under research question 1 below. Finally, to ensure
intercoder reliability, t-unit coding, topic identification and TSA were all undertaken by a second
researcher (on a subset of 50 scripts) and intercoder reliability was calculated.
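The agreement and correlation computations described here are straightforward; the sketch below shows one way they might be carried out, assuming the codes and scores are held in parallel Python lists (the paper does not state what software was used, and all names are illustrative):

    from scipy.stats import pearsonr

    def exact_agreement(coder1, coder2):
        """Proportion of units assigned the same code by both coders."""
        assert len(coder1) == len(coder2)
        return sum(a == b for a, b in zip(coder1, coder2)) / len(coder1)

    def category_score_correlation(category_share, delna_score):
        """Pearson r between the percentage of t-units per script coded as a
        given TSA category and the script's final DELNA score."""
        r, p = pearsonr(category_share, delna_score)
        return r, p

Applied to the two coders' decisions over the 50 double-coded scripts, exact_agreement would yield the proportions reported in Section 5.1 (e.g., .959 for t-unit identification).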
4.3. Procedures: rating scale validation
4.3.1. Procedure
The raters rated 100 scripts using the current DELNA criteria and then the same 100 using the
new scale based on TSA. The scripts were selected to represent a range of proficiency levels. The
raters were given the scripts in five sets of 20 scripts over a time period of about eight weeks.
They all participated in a rater moderation session to ensure they were thoroughly trained. All
raters were further instructed to rate no more than ten scripts in one session to avoid fatigue.
After rating the two sets of 100 scripts, the raters filled in a questionnaire canvassing their
opinions about the scales. The questionnaire (part of a larger-scale study) allowed the raters to
record any opinions or suggestions they had with respect to the coherence scale. The questionnaire
questions were as follows:
1 All examples were taken from the data used in this study.
-
7/31/2019 Little coherence, considerable strain for reader , A comparison between two rating scales for the assessment of
8/21
U. Knoch / Assessing Writing 12 (2007) 108128 115
Table 2
Categories of topical structure analysis used in main analysis with examples
Parallel progression
Topics of successive sentences are the same (or synonyms)
Maori and PI males are just as active as the rest of NZ. They also have other interests
Direct sequential progression
The comment of the previous sentence becomes the topic of the following sentence
The graph showing the average minutes per week spent on hobbies and games by age group and sex, shows many
differences in the time spent by females and males in NZ on hobbies and games
These differences include on age factor
Indirect progression
The topic or the comment of the previous sentence becomes the topic of the following sentence. The topic or comment is only indirectly related (by inference, e.g., related semantic sets)
The main reasons for the increase in the number of immigrates is the development of some third-world countries. e.g., China. People in those countries have got that amount of money to support themselves living in a foreign country
Superstructure
Coherence is created by a linking device instead of topic progression
Reasons may be the advance in transportation and the promotion of New Zealands natural environment and
green image. For example, the filming of The Lord of the rings brought more tourists to explore the
beautiful nature of NZ
Extended progression
A topic or a comment before the previous sentence becomes the topic of the new sentence
The first line graph shows New Zealanders arriving in and departing from New Zealand between 2000 and 2002.
The horizontal axis shows the times and the vertical axis shows the number of passengers which are New
Zealanders. The number of New Zealanders leaving and arriving have increased slowly from 2000 to 2002.
Coherence break
Attempt at coherence fails because of an error
The reasons for the change on the graph. Its all depends on their personal attitude
Unrelated progression
Topic of a sentence is not related to the topic or comment in the previous sentence
The increase in tourist arrivers has a direct affect to New Zealand economy in recent years. The government reveals
that unemployment rate is down to 4% which is a great news to all New Zealanders
(1) What did you like about the scales?
(2) Were there any descriptors that you found difficult to apply? If yes, please say why.
(3) Please write specific comments that you have about the scales below. You could, for example, write how you used them, note any problems that you encountered that you haven't mentioned above, or mention anything else you consider important.
A subset of five raters was also interviewed after the study was concluded.
-
7/31/2019 Little coherence, considerable strain for reader , A comparison between two rating scales for the assessment of
9/21
116 U. Knoch / Assessing Writing 12 (2007) 108128
Table 3
TSA correlations with final DELNA writing score
                                  Final writing score
Parallel progression              -.215 a
Direct sequential progression     .292 a
Superstructure                    .258 a
Indirect progression              .220 a
Extended progression              .07
Unrelated progression             -.202 a
Coherence break                   -.246 a

n = 602.
a p < .01.
4.3.2. Data analysis
The results of the two rating rounds were analyzed using multi-faceted Rasch measurement in the form of the computer program FACETS (Linacre, 2006). FACETS is a generalization of Wright and Masters' (1982) partial credit model that makes possible the analysis of data
from assessments that have more than the traditional two facets associated with multiple-choice
tests (i.e., items and examinees). In the many-facet Rasch model, each facet of the assessment
situation (e.g., candidates, raters, trait) is represented by one parameter. The model states that
the likelihood of a particular rating on a given rating scale from a particular rater for a particular
student can be predicted mathematically from the proficiency of the student and the severity of
the rater. The advantage of using multi-faceted Rasch measurement is that it models all facets in the analysis onto a common logit scale, which is an interval scale. Because of this, it becomes
possible to establish not only the relative difficulty of items, ability of candidates and severity of
raters as well as the scale step difficulty, but also how large these differences are. Multi-faceted
Rasch measurement is particularly useful in rating scale validation as it provides a number of
useful measures such as rating scale discrimination, rater agreement and severity statistics and
information with respect to the functioning of the different band levels in a scale.
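For reference, the many-facet rating scale model that FACETS estimates is commonly written as follows (a standard formulation, e.g. in Linacre's work; the equation itself is not printed in this paper):

    \log \frac{P_{nijk}}{P_{nij(k-1)}} = B_n - C_j - D_i - F_k

where P_{nijk} is the probability of candidate n receiving a rating of k from rater j on trait i, B_n is the ability of the candidate, C_j the severity of the rater, D_i the difficulty of the trait, and F_k the difficulty of scale step k relative to step k-1.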
To make the multi-faceted Rasch analysis used in this study more powerful, a fully crossed
design was chosen; that is, all eight raters rated the same 100 writing scripts on both occasions.
Although such a fully crossed design is not necessary for FACETS to run the analysis, it makes
the analysis more stable and therefore better conclusions can be drawn from the results (Myford
& Wolfe, 2003).
5. Results
5.1. RQ1: What are the features of topical structure displayed at different levels of writing?
The results of the intercoder reliability analysis show a high level of agreement for the two
researchers coding the data. The proportion of exact agreement for the t-unit identification is .959,
for the identification of the t-unit topics is .931 and for the TSA categories (as shown in Table 2)
is .865.
A Pearson correlation of the proportion that each TSA category was used in each essay with
the overall writing score was performed in order to establish which categories are used at different
levels of writing. The results of the correlation are reported in Table 3.
Table 4
TSA-based rating scale for coherence
Level Coherence
4 Frequent: unrelated progression, coherence breaks
Infrequent: sequential progression, superstructure, indirect progression
5 As level 4, but coherence might be achieved in stretches of discourse by
overusing parallel progression. Only some coherence breaks
6 Mixture of most categories
Superstructure relatively rare
Few coherence breaks
7 Frequent: sequential progression
Superstructure occurring more frequently
Infrequent: Parallel progression
Possibly no coherence breaks
8 Writer makes regular use of superstructures, sequential progression
Few incidences of unrelated progression
No coherence breaks
Table 3 shows that the variables used for the analysis of TSA in the essays can be divided into
three groups. The first group consists of variables that were used more by students whose essays
received a higher overall writing score. The three variables in this group are direct sequential
progression, indirect progression, and superstructure. The second group is made up of variables
that were used more by weaker writers. Variables in this group are coherence breaks, unrelated
sequential progression, and parallel progression. The third group consists of variables that were
used equally by the strong and the weak writers. The only variable that falls into this category is
extended progression.
The correlational results in Table 3 do not indicate the distribution over the different
DELNA writing levels. Therefore, box plots were created for each variable, to indicate how the
proportion of usage changes over the different DELNA band levels. The box plots can be found
in Appendix B (Figs. 2–8). The box plots show that although there is a lot of overlap between
the different levels of writing within each variable, there are clear trends in the distribution of the
variables over the proficiency levels. The only exception seems to be parallel progression, where
writers at level 4 seemed to use fewer instances of parallel progression than writers at level 5.
The quantitative results shown in the box plots were then used to develop the TSA-based rating
scale. The trends for the different types of TSA categories observed in the box plots were used as the basis for the level descriptors. Because raters could not be expected to identify small trends in the writing samples, only general trends were used for the different level descriptors.
For example, raters were not asked to count each incident of each category of topical structure,
rather they were guided as to what features they could expect least or most commonly at different
levels. Because the strongest students were filtered out during the DELNA screening procedure,
and therefore no scripts at band level 9 were analyzed, the TSA-based rating scale only has five
levels. However, the possibility of a sixth level exists. The scale is reproduced in Table 4.
5.2. RQ2: How reliable and valid is TSA when used to assess coherence in writing?
FACETS provides a group of statistics which investigate the spread of raters in terms of
harshness and leniency (see Table 5). The rater fixed chi square tests the assumption that all
the raters share the same severity measure, after accounting for measurement error. A significant
Table 5
Rater separation statistics
                           DELNA scale               TSA-based scale
Rater fixed chi square     66.9 (d.f. 7, p = .00)    216.8 (d.f. 7, p = .00)
Rater separation ratio     2.94                      5.62
Table 6
Rater infit mean square values
Rater   Infit mean square,   Point biserial,   Infit mean square,   Point biserial,
        DELNA scale          DELNA scale       TSA-based scale      TSA-based scale
2       1.11                 .67               1.07                 .65
4       1.30                 .54               1.08                 .70
5       1.53                 .68               1.11                 .79
7       .69                  .68               .98                  .56
9       .91                  .68               .95                  .65
12      1.08                 .50               .97                  .71
13      .67                  .61               .73                  .70
14      .69                  .65               1.07                 .58
Mean    1.00                 .63               .99                  .67
S.D.    .31                  .07               .12                  .09
fixed chi square means that the severity measures of at least two raters included in the analysis
are significantly different. The fixed chi square value for both scales is significant,2 showing that two or more raters are significantly different in terms of leniency or harshness; however, the fixed chi square value of the TSA-based scale is larger, indicating a greater difference between raters
in terms of severity. The rater separation ratio provides an indication of the spread of the rater
severity measures. The closer the separation ratio is to zero, the closer the raters are together in
terms of their severity. Again, the larger separation ratio of the TSA-based scale shows that the
raters differed more in terms of leniency and harshness.
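For reference, the separation ratio is conventionally defined as the ratio of the "true" spread of the severity measures to their average measurement error (again a standard definition rather than one given in the paper):

    G = \frac{SD_{true}}{RMSE}, \qquad SD_{true} = \sqrt{SD_{obs}^2 - RMSE^2}

where SD_{obs} is the observed standard deviation of the rater severity measures and RMSE is the root mean square of their standard errors; a ratio near zero means the observed spread is explained almost entirely by measurement error.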
Another important output of the rater measurement report is the infit mean square statistics with
the rater point biserial correlations (see Table 6). The infit mean square has an expected mean of
1. Raters with very low infit mean square statistics (lower than .7) do not show enough variation in their ratings, meaning they are overly consistent and possibly overuse the inner band levels of the scale, whilst raters with infit mean square values higher than 1.3 show too much variation in their ratings, meaning they rate inconsistently. Table 6 shows that two raters rated near the margins of
acceptability when using the existing rating scale, whilst no raters rated too inconsistently when
using the TSA-based scale. Three raters, however, showed not enough variation in their ratings
when using the DELNA scale, shown by the infit mean square values lower than .7.
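As background (a standard definition, not printed in the paper), the infit mean square is the information-weighted mean of the squared standardized residuals z_n of a rater's observations, while the outfit mean square reported for the scale categories below is the unweighted mean:

    \text{infit MS} = \frac{\sum_n W_n z_n^2}{\sum_n W_n}, \qquad \text{outfit MS} = \frac{1}{N} \sum_n z_n^2

where W_n is the model variance of observation n. Both statistics have an expected value of 1 under the model, which is why values below .7 and above 1.3 are read as over- and under-consistency respectively.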
The rater point biserial correlation coefficient is reported for each rater individually as well
as for the raters as a group. It summarizes the degree to which a particular raters ratings are
consistent with the ratings of the rest of the raters. The point biserial correlation is concerned with
2 Myford and Wolfe (2004) note that the fixed chi square test is very sensitive to sample size. Because of this, the fixed
chi square value is often significant, even if the actual variation in terms of leniency and harshness between the raters is small.
Table 7
Candidate separation statistics
                              DELNA scale                 TSA-based scale
Candidate fixed chi square    833.5 (d.f. 99, p = .00)    736.6 (d.f. 99, p = .00)
Candidate separation ratio    3.93                        4.13
the degree to which raters are ranking candidates in a similar fashion. Myford and Wolfe (2004)
suggest that the expected values for this correlation are between .3 and .7, with a correlation of .7
being high for rating data. The point biserial correlation coefficient for the DELNA scale is .63,
whilst the TSA-based rating scale results in an average correlation coefficient of .67.
The findings from Table 6 indicate that raters, when using the TSA-based scale, with its more
defined categories, seemed to be able not only to rank candidates more similarly, but also to achieve greater consistency in their ratings.
Similarly to the rater measurement report described above, FACETS also generates a candidate
measurement report. The first group of statistics in this report is the candidate separation statistics.
The candidate fixed chi square tests the assumption that all candidates are of the same level of
performance. The candidate fixed chi square values in Table 7 indicate that the ratings based on
the TSA-based scale are slightly more discriminating (seen by the lower fixed chi square value).
The same trend can be seen when the candidate separation ratio is examined. The candidate
separation ratio indicates the number of statistically significant levels of candidate performance.
This statistic also shows that when raters used the TSA-based scale, their ratings were slightly more
discriminating. Although the existing DELNA scale has 6 levels of descriptors for coherence, the
raters only separated the candidates into 3.93 levels. The TSA-based scale consists of five levels
and the raters separated the candidates into 4.13 levels when using it. The higher discrimination
ability of the new scale is a product of the higher rater point biserial correlation.
For the comparison of the rating scale categories, FACETS produces scale category statistics.
The tables for the existing DELNA and the TSA-based scale are reproduced in Tables 8 and 9
respectively. The first column in each table shows the raw scores represented by the two rating
scales. Please note that the TSA-based scale has one fewer category to award, and therefore only ranges
from four to eight. The second column shows the number of times (counts) each of these scores
was used by the raters as a group, the third column shows these numbers as percentages of overall
use. When looking at the counts and percentages, it is clear that the raters when using the existing
DELNA scale, under-used the outside categories: in particular, category 4 was rarely awarded.
This table also underlines the evidence that the raters, when using the existing DELNA scale,
Table 8
Scale category statistics: DELNA scale

Score   Counts   (%)   Average measure   Expected measure   Outfit mean square   Step calibration measure
4       3        0     -1.69             -2.21              1.2
5       59       7     -1.43             -1.46              1.0                  -4.82
6       288      36    -0.41             -.38               1.0                  -2.53
7       299      37    1.04              .98                .9                   .28
8       127      16    2.46              2.34               .8                   2.58
9       24       3     2.95              3.24               1.3                  4.49
Table 9
Scale category statistics: TSA-based scale

Score   Counts   (%)   Average measure   Expected measure   Outfit mean square   Step calibration measure
4       38       5     -2.03             -2.01              1.0
5       165      21    -.75              -.81               1.1                  -2.89
6       239      30    .33               .39                1.0                  -.57
7       205      26    1.49              1.49               .9                   1.10
8       153      19    2.73              2.70               .9                   2.36
displayed a central tendency effect. The scores are far more widely spread when the raters used
the TSA-based scale; however level 4 seemed still slightly underused, only being awarded in 5%
of all cases. Low frequencies might indicate that the categories are unnecessary or redundant and
should possibly be collapsed.
Column four indicates the average candidate measures at each scale category. These measures
should increase as the scale category increases. This is the case for both scales. When this pattern
is seen to be occurring, it shows that the rating scale points are appropriately ordered and are func-
tioning properly. This means that higher ratings do correspond to more of the variable being rated.
Column five shows the expected average candidate measure at each category, as estimated by
the FACETS program. The closer the expected and the actual average measures, the closer the
outfit mean square value in column six will be to 1. It can be seen that the outfit mean square values
for both scales are generally close to 1, however category 9 of the existing scale is slightly high,
which might mean that it is not contributing meaningfully to the measurement of the variable of
coherence. Bond and Fox (2001) suggest that this might be a good reason for collapsing a category.
Column seven gives the step calibration measures, which denote the point at which the prob-
ability curves for two adjacent scale categories cross (Linacre, 1999). Thus, the rating scale
category threshold represents the point at which the probability is 50% of a candidate being rated
in one or the other of these two adjacent categories, given that the candidate is in one of them. The
rating scale category thresholds should increase monotonically and be equally distanced (Linacre,
1999) so that none are too close or too far apart. This is generally the case for both rating scales
under investigation. Myford and Wolfe (2004) argue that if the rating scale category thresholds are
widely dispersed, the raters might be exhibiting a central tendency effect. This is very slightly the case for the existing DELNA scale.
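In terms of the model sketched in Section 4.3.2, the step calibration F_k locates this crossing point (a standard property of the model rather than output specific to this study): the probabilities of two adjacent categories k-1 and k are equal when

    B_n = C_j + D_i + F_k

so monotonically increasing, well-spaced values of F_k, as in Tables 8 and 9, indicate that each scale category in turn becomes the most probable one somewhere along the ability range.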
Overall, the results from research question two indicate that the raters, when using the TSA-
based scale, were able to discern more levels of ability among the candidates and ranked the
candidates more similarly. They were also able to use more levels on the scale reliably. All of these findings provide evidence that the TSA-based scale functions better than the existing scale. However, the
raters differed more in terms of leniency and harshness when using the TSA-based scale, which
is undesirable but less crucial in situations where scripts are routinely double-rated.
5.3. RQ3: What are raters' perceptions of using TSA in a rating scale as compared to more
conventional rating scales when rating writing?
The interviews provided some evidence that raters experienced problems when using the less
specific level descriptors of the DELNA scale. Rater 12, for example, described his problems
when using the DELNA scale in the following quote:
. . . sometimes I look at [the descriptor] I'm going "what do you mean by that?" . . . You just
kind of have to find a way around it 'cause it's not really descriptive enough, yeah
Rater 4 provided evidence of a strategy that she resorted to when experiencing problems
assigning a level:
I just tend to go with my gut feeling. So I don't spend a lot of time worrying about it ... but
I think this is a very good example of where, if I have an overall sense that a script is really
a seven, I'd be likely to give it a seven in coherence.
Raters were asked in a questionnaire about their perceptions of the rating scale category of
coherence in the TSA-based scale. Four raters commented that it took them a while to get used
to the rating scale and that they had concerns about it not being very marker friendly (e.g.,
Rater 5). Most of these raters, however, mentioned that they became accustomed to the category
after having marked a number of scripts. Rater 2, for example, mentioned that he likes the scale because it "gives me a lot more guidance than the DELNA scale and I feel that I am doing the writers more justice in this way".
One rater, however, commented that the TSA-based scale is narrower than the DELNA coher-
ence scale as it focuses only on topical structure and not on other aspects of coherence.
Overall, the raters found the TSA-based scale more objective and more descriptive.
6. Discussion and conclusion
The analysis of the 602 writing scripts using TSA was able to show that this measure is
successful in differentiating between different proficiency levels. The redesign of the categories
suggested by Schneider and Connor (1990) was valuable in improving the usefulness of the
measure. In particular, the new categories of superstructure and coherence break were found to discriminate between different levels of writing ability. Apart from being useful in the context
of rating scale development and assessment, this method could be applied to teaching, as was
suggested by Connor and Farmer (1990). Overall, TSA was shown to be useful as an
objective discourse analytic measure of coherence.
The comparison of the ratings based on the two different rating scales was able to provide
evidence that the raters rated more accurately when using the TSA-based scale. Raters used more
band levels and ranked the candidates more similarly when using this scale. It appears, then, that when rating a fuzzy concept such as coherence, more specific rating scale descriptors enable raters to identify more levels of candidate ability. Helping raters to divide performances into as many ability levels as possible is the aim of rating scales.
The raters, when in doubt which band level to award to a performance when using the more
impressionistic descriptors of coherence on the DELNA scale, seemed to resort to two different
strategies. Either they used most band levels on the scale, but did so inconsistently, or they overused
the band levels 6 and 7 and avoided the extreme levels, especially levels 4 and 9. Whilst this might
be less of a problem if the trait is only one of many on an analytic rating scale and the score
is reported as an averaged score, in a diagnostic context, in which we would like to report the
strengths and weaknesses of a candidate to stakeholders, this might result in a loss of valuable
information. Alderson (2005) suggests that diagnostic tests should focus more on specific rather
than global abilities, and therefore it could be argued that the TSA-based descriptors might be
particularly useful in a diagnostic context.
The interview and questionnaire data also provided evidence that the raters focused more
on the descriptors when these were less vague and impressionistic. If we are able to arrive at
descriptors which enable raters to rely less on their gut feeling about the overall quality of a writing performance and to focus more on the descriptions of performance in the level descriptors, then we should arrive at more reliable and probably more valid ratings. This study was able to show that developing descriptors empirically might be a first step in this direction.
An important consideration with respect to the two scales discussed in this study is practi-
cality. Two types of practicality need to be considered: practicality of scale development and
practicality of scale use. The TSA-based scale was clearly more time-consuming to develop than
the pre-existing DELNA scale. To do a detailed empirical analysis of a large number of writ-
ing performances is labor-intensive and therefore might not be practical in some contexts. The
practicality of the scale use is another issue that needs to be considered. In this case, there was
evidence from the interviews and questionnaires that raters needed more time when rating with
the TSA-based scale. However, most reported becoming accustomed to these more detailed scale
descriptors.
One limitation of this study is that TSA does not cover all aspects of coherence. So whilst
the TSA-based scale is more detailed in its descriptions, some aspects of coherence which raters
might look for when using more conventional rating descriptors might be lost, which lowers the
content validity of the scale. However, it seems that the existing rating scale also resulted in two raters rating too inconsistently, possibly because they were judging different aspects in different scripts, and in others overusing the inner scale categories. Lumley (2002) was able to
show that when raters are confronted with aspects of writing which are not specifically men-
tioned in the scale descriptors, they inevitably use their own knowledge or feelings to resolve the problem by resorting to individual strategies. However, this study was able to show that rating scales with very
specific level descriptors can help avoid play-it-safe methods and make it easier to arrive at a level
(which is what rating scales are ultimately designed for), even though some content validity is
sacrificed.
It is also important to mention that the raters taking part in this study were far more famil-
iar with the current DELNA scale, having used it for many years. Being confronted with the
TSA-based scale for this research project meant a departure from the norm. It is therefore
possible that the raters in this study varied more in terms of severity when using the TSA-
based scale because they were less familiar with it. It might be possible that if they were
to use the TSA-based scale more regularly and receive more training, the variance in terms
of leniency and harshness might be reduced. It seems important to ensure that rating patterns over time remain stable, and that central tendency effects are avoided, by subjecting individual
trait scales to regular quantitative and qualitative validation studies, and addressing varia-
tions both through rater training (as is usually the case) and better specification of scoring
criteria.
This research was able to show the value of developing descriptors based on empirical inves-
tigation. Even an aspect of writing as vague and elusive as coherence was operationalized for
this study. Rating scale developers should consider this method of scale development as a viable
alternative to intuitive development methods which are commonly used around the world. Over-
all, it can be said, however, that more detailed, empirically developed rating scales might lend themselves to being more discriminating and result in higher levels of rater reliability than more conventional rating scales. Further research is necessary to establish if this is also the case for
other traits, as this study only looked at a scale for coherence. Also, it would be interesting to
pursue a similar study in the context of speaking assessment.
Appendix A. Criteria for identifying sentence topics (taken from Wu, 1997)
1. Sentence topics are defined as the leftmost NP dominated by the finite verb in the t-unit. It is
what the t-unit is about.
2. Exceptions:
   a. Cleft sentences
      i. It is the scientist who ensures that everyone reaches his office on time
      ii. It is Jane we all admire
   b. Anticipatory pronoun it
      i. It is well known that a society benefits from the work of its members
      ii. It is clear that he doesn't agree with me
   c. Existential there
      i. There often exists in our society a certain dichotomy of art and science
      ii. There are many newborn children who are helpless
   d. Introductory phrase
      i. Biologists now suggest that language is species-specific to the human race.
Appendix B. Box plots comparing TSA over different levels
See Figs. 28.
Fig. 2. Proportion of parallel progression over five DELNA levels.
Fig. 3. Proportion of direct sequential progression over five DELNA levels.
Fig. 4. Proportion of superstructure over five DELNA levels.
Fig. 5. Proportion of indirect progression over five DELNA levels.
Fig. 6. Proportion of extended progression over five DELNA band levels.
Fig. 7. Proportion of unrelated progression over five DELNA band levels.
Fig. 8. Proportion of coherence breaks over five DELNA band levels.
References
Alderson, C. (2005). Diagnosing foreign language proficiency: The interface between learning and assessment. London: Continuum.
Bamberg, B. (1984). Assessing coherence: A reanalysis of essays written for the National Assessment of Educational Progress. Research in the Teaching of English, 18(3), 305–319.
Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Lawrence Erlbaum.
Burneikaite, N., & Zabiliute, J. (2003). Information structuring in learner texts: A possible relationship between the topical structure and the holistic evaluation of learner essays. Studies about Language, 4, 1–11.
Cason, G. J., & Cason, C. L. (1984). A deterministic theory of clinical performance rating. Evaluation and the Health Professions, 7, 221–247.
Cheng, X., & Steffensen, M. S. (1996). Metadiscourse: A technique for improving student writing. Research in the Teaching of English, 30(2), 149–181.
Connor, U., & Farmer, F. (1990). The teaching of topical structure analysis as a revision strategy for ESL writers. In B. Kroll (Ed.), Second language writing: Research insights for the classroom. Cambridge: Cambridge University Press.
Crismore, A., Markkanen, R., & Steffensen, M. S. (1993). Metadiscourse in persuasive writing: A study of texts written by American and Finnish university students. Written Communication, 10, 39–71.
Elder, C. (2003). The DELNA initiative at the University of Auckland. TESOLANZ Newsletter, 12(1), 15–16.
Elder, C., Barkhuizen, G., Knoch, U., & von Randow, J. (2007). Evaluating rater responses to an online rater training program. Language Testing, 24(1), 37–64.
Elder, C., & Erlam, R. (2001). Development and validation of the Diagnostic English Language Needs Assessment (DELNA): Final report. Auckland: University of Auckland, Department of Applied Language Studies and Linguistics.
Elder, C., Knoch, U., Barkhuizen, G., & von Randow, J. (2005). Individual feedback to enhance rater training: Does it work? Language Assessment Quarterly, 2(3), 175–196.
Elder, C., & von Randow, J. (2002). Report on the 2002 pilot of DELNA at the University of Auckland. Auckland: University of Auckland, Department of Applied Language Studies and Linguistics.
Fulcher, G. (1987). Tests of oral performance: The need for data-based criteria. ELT Journal, 41(4), 287–291.
Fulcher, G. (1996). Does thick description lead to smart tests? A data-based approach to rating scale construction. Language Testing, 13(2), 208–238.
Fulcher, G. (2003). Testing second language speaking. London: Pearson Longman.
Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts. Norwood, NJ: Ablex Publishing Corporation.
Hoenisch, S. (1996). The theory and method of topical structure analysis. Retrieved 30 April 2007, from http://www.criticism.com/da/tsa-method.php
Hoey, M. (1991). Patterns of lexis in text. Oxford: Oxford University Press.
Intaraprawat, P., & Steffensen, M. S. (1995). The use of metadiscourse in good and poor ESL essays. Journal of Second Language Writing, 4(3), 253–272.
Jacobs, H., Zinkgraf, S., Wormuth, D., Hartfiel, V., & Hughey, J. (1981). Testing ESL composition: A practical approach. Rowley, MA: Newbury House.
Knoch, U., Read, J., & von Randow, J. (2007). Re-training writing raters online: How does it compare with face-to-face training? Assessing Writing, 12(1), 26–43.
Lautamatti, L. (1987). Observations on the development of the topic of simplified discourse. In U. Connor & R. B. Kaplan (Eds.), Writing across languages: Analysis of L2 text (pp. 87–114). Reading, MA: Addison-Wesley.
Lee, I. (2002). Teaching coherence to ESL students: A classroom inquiry. Journal of Second Language Writing, 11, 135–159.
Linacre, J. M. (1999). Investigating rating scale category utility. Journal of Outcome Measurement, 3(2), 103–122.
Linacre, J. M. (2006). Facets Rasch measurement computer program. Chicago: Winsteps.
Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do they really mean to the raters? Language Testing, 19(3), 246–276.
McIntyre, P. N. (1993). The importance and effectiveness of moderation training on the reliability of teachers' assessments of ESL writing samples. Unpublished master's thesis, University of Melbourne, Australia.
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386–422.
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5(2), 189–227.
North, B. (1995). The development of a common framework scale of descriptors of language proficiency based on a theory of measurement. System, 23(4), 445–465.
North, B. (2003). Scales for rating language performance: Descriptive models, formulation styles, and presentation formats. TOEFL Research Paper. Princeton, NJ: Educational Testing Service.
North, B., & Schneider, G. (1998). Scaling descriptors for language proficiency scales. Language Testing, 15(2), 217–263.
Schneider, M., & Connor, U. (1990). Analyzing topical structure in ESL essays. Studies in Second Language Acquisition, 12(4), 411–427.
Tribble, C. (1996). Writing. Oxford: Oxford University Press.
Turner, C. E., & Upshur, J. A. (2002). Rating scales derived from student samples: Effects of the scale maker and the student sample on scale content and student scores. TESOL Quarterly, 36(1), 49–70.
Upshur, J. A., & Turner, C. E. (1995). Constructing rating scales for second language tests. ELT Journal, 49(1), 3–12.
Upshur, J. A., & Turner, C. E. (1999). Systematic effects in the rating of second-language speaking ability: Test method and learner discourse. Language Testing, 16(1), 82–111.
Watson Todd, R. (1998). Topic-based analysis of classroom discourse. System, 26, 303–318.
Watson Todd, R., Thienpermpool, P., & Keyuravong, S. (2004). Measuring the coherence of writing using topic-based analysis. Assessing Writing, 9, 85–104.
Weigle, S. C. (1994a). Effects of training on raters of English as a second language compositions: Quantitative and qualitative approaches. Unpublished doctoral dissertation, University of California, Los Angeles.
Weigle, S. C. (1994b). Effects of training on raters of ESL compositions. Language Testing, 11(2), 197–223.
Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263–287.
Witte, S. (1983a). Topical structure analysis and revision: An exploratory study. College Composition and Communication, 34(3), 313–341.
Witte, S. (1983b). Topical structure and writing quality: Some possible text-based explanations of readers' judgments of students' writing. Visible Language, 17, 177–205.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
Wu, J. (1997). Topical structure analysis of English as a second language (ESL) texts written by college Southeast Asian refugee students. Unpublished doctoral dissertation, University of Minnesota.