
Assessing Writing 12 (2007) 108–128

    Available online at www.sciencedirect.com

Little coherence, considerable strain for reader: A comparison between two rating scales for the

    assessment of coherence

    Ute Knoch

Language Testing Research Centre, University of Melbourne, Level 3, 245 Cardigan Street, Carlton, Victoria 3052, Australia

    Available online 17 October 2007

    Abstract

The category of coherence in rating scales has often been criticized for being vague. Typical descriptors

might describe students' writing as having 'a clear progression of ideas' or 'lacking logical sequencing'.

These descriptors inevitably require subjective interpretation on the part of the raters.

    A number of researchers (Connor & Farmer, 1990; Intaraprawat & Steffensen, 1995) have attempted to

measure coherence more objectively. However, these efforts have thus far not been reflected in rating scale descriptors. For the purpose of this study, the results of an adaptation of topical structure analysis (Connor and

Farmer, 1990; Schneider and Connor, 1990), which proved successful in distinguishing different degrees of

coherence in 602 academic writing scripts, were used to formulate a new rating scale. The study investigates

whether such an empirically grounded scale can be used to assess coherence in students' writing more

    reliably and with greater discrimination than the more traditional measure. The validation process involves

    a multi-faceted Rasch analysis of scores derived from multiple ratings of 100 scripts using the old and new

rating descriptors as well as a qualitative analysis of questionnaires completed by the raters. The findings

    are discussed in terms of their implications for rating scale development.

© 2007 Elsevier Inc. All rights reserved.

    Keywords: Writing assessment; Rating scales; Coherence; Rating scale validation; Multi-faceted Rasch analysis

    1. Introduction

    Because writing assessment requires subjective evaluations of writing quality by raters, the raw

    score candidates receive might not reflect their actual writing ability. In an attempt to reduce the

    Tel.: +61 3 83445206; fax: +61 3 83445163.

    E-mail address: [email protected].

1075-2935/$ – see front matter © 2007 Elsevier Inc. All rights reserved.

    doi:10.1016/j.asw.2007.07.002


    variability between raters and therefore to increase the reliability of ratings, attempts have been

    made to improve certain features of the rating process, most commonly through rater training

    (Elder, Knoch, Barkhuizen, & von Randow, 2005; McIntyre, 1993; Weigle, 1994a, 1994b, 1998).

    However, despite all the efforts put into training raters, it has been shown that differences in rater

reliability persist and can account for as much as 35% of variance in students' written performance (Cason & Cason, 1984). Some researchers have suggested that a better specification of scoring

    criteria might lead to an increase in rater reliability (Hamp-Lyons, 1991; North, 1995, 2003;

    North & Schneider, 1998). One reason for the variability found in writing performance might lie

    in the way rating scales are designed. Fulcher (2003) has shown that most existing rating scales

are developed based on intuitive methods, which means that they are either adapted from already

    existing scales or they are based on what developers think might be common features in the writing

samples in question. However, it has been contended that, to be more valid, rating

    scales should be based on empirical investigation of actual writing samples (North & Schneider,

    1998; Turner & Upshur, 2002; Upshur & Turner, 1995, 1999).

    2. The assessment of coherence in writing

Lee (2002) defines coherence as the relationships that link the ideas in a text to create meaning.

Although a number of attempts have been undertaken in second language writing research

    to operationalize coherence (Cheng & Steffensen, 1996; Connor & Farmer, 1990; Crismore,

    Markkanen, & Steffensen, 1993; Intaraprawat & Steffensen, 1995), this has not been reflected

    in rating scales commonly used in the assessment of writing. Watson Todd, Thienpermpool and

    Keyuravong (2004), for example, criticized the level descriptors for coherence in a number of

    rating scales as being vague and lacking enough detail for raters to base their decisions on. They

    quote a number of rating scale descriptors used for measuring coherence. The commonly used

and much cited Jacobs scale (Jacobs, Zinkgraf, Wormuth, Hartfiel, & Hughey, 1981), for example, describes high quality writing as 'well organized' and exhibiting 'logical sequencing'. In

other scales, less successful writing has been described, for example, as being 'fragmentary so that

comprehension of the intended communication is virtually impossible' (TEEP Attribute Writing

    Scales, cited in Watson Todd et al., 2004). Watson Todd et al. therefore argue that while analytic

    criteria are intended to increase the reliability of rating, the descriptors quoted above inevitably

require subjective interpretations by the raters and might lead to confusion. Although one reason

    for these vague descriptions of coherence might lie in the rather vague nature of coherence, Hoey

(1991) was able to show that judges are able to reach consensus on the level of coherence.

A notable exception to the scales described above is a scale for coherence developed by

    Bamberg (1984). Although Bamberg was able to develop more explicit descriptors for a number

    of different aspects of writing related to coherence (e.g., organization and topic development),

    her holistic scale descriptors mix a variety of aspects at the descriptor level. The descriptor for

level 2, for example, describes the writing as 'incoherent' and refers to topic identification, setting of

    context, the use of cohesive devices, the absence of an appropriate conclusion, flow of discourse

and errors. Although the scale has five levels, when Bamberg's raters used the scale, they seemed

    to only be able to identify three levels. It is possible that because this holistic scale mixes so many

    aspects at the descriptor level, raters were overusing the inner three band levels of the scale and

avoiding the extreme levels.

It seems that no existing rating scale for coherence has been able to operationalize this aspect

    of writing in a manner that can be successfully used by raters. The aim of this study was therefore

to attempt to develop an empirically based rating scale for coherence.


    2.1. Topical structure analysis (TSA)

In the second language writing literature, several attempts have been made to measure coherence.

To be transferable into a rating scale, the method chosen for this study needs to be sufficiently

simple to be used by raters who are rating a number of scripts in a limited amount of time. Several methods were investigated as part of the literature review for this study. Crismore et al.'s (1993)

metadiscoursal markers were excluded because insufficient tokens were found in students' essays

    and measures like topic-based analysis (Watson Todd, 1998; Watson Todd et al., 2004) needed to

    be excluded for being too complicated and time consuming. For this study, topical structure anal-

    ysis (TSA) was chosen and adapted because it was the only attempt at operationalizing coherence

    which was sufficiently simple to be transferred into a rating scale.

TSA, which draws on the topic–comment analysis of the Prague School of Linguistics, was first described by Lautamatti (1987) in the context of text readability, to analyze topic development

in reading material. She defined the topic of a sentence as what the sentence is about and the

comment of a sentence as what is said about the topic. Lautamatti described three types of progression, which create coherence in a text. These types of progression advance the discourse

topic by developing a sequence of sentence topics. Through this sequence of sentence topics, local

coherence is created. The three types of progression can be summarized as follows (Hoenisch,

    1996):

1. Parallel progression, in which topics of successive sentences are the same, producing a repetition of topic that reinforces the idea for the reader (a, a, a).

    Example: Paul walked on the street. He was carrying a backpack.

2. Sequential progression, in which topics of successive sentences are always different, as the

comment of one sentence becomes, or is used to derive, the topic of the next (a, b, c).

Example: Paul walked on the street. The street was crowded.

3. Extended parallel progression, in which the first and the last topics of a piece of text are the

same but are interrupted with some sequential progression (a, b, a).

    Example: Paul walked on the street. Many people were out celebrating the public holiday.

    He had trouble finding his friends.
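To make the three progression types concrete, here is a minimal sketch that labels the link between successive t-units, assuming each t-unit has already been reduced to a (topic, comment) pair. The function name and the substring matching are this example's own simplifications; the coding scheme in the studies discussed here also matches synonyms and derived word forms, which plain string comparison cannot.

```python
# Illustrative sketch only: labels Lautamatti's progression types from
# (topic, comment) pairs. Real TSA coding also matches synonyms and
# derived forms; substring matching is a simplification.

def classify_progressions(units):
    """units: list of (topic, comment) tuples, one per t-unit."""
    labels = []
    earlier_topics = set()
    for i in range(1, len(units)):
        topic, _ = units[i]
        prev_topic, prev_comment = units[i - 1]
        if topic == prev_topic:
            labels.append("parallel")           # (a, a, a)
        elif topic in prev_comment:
            labels.append("sequential")         # (a, b, c)
        elif topic in earlier_topics:
            labels.append("extended parallel")  # (a, b, a)
        else:
            labels.append("unrelated")
        earlier_topics.add(prev_topic)
    return labels

units = [("Paul", "walked on the street"),
         ("the street", "was crowded"),
         ("Paul", "had trouble finding his friends")]
print(classify_progressions(units))  # ['sequential', 'extended parallel']
```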

Witte (1983a, 1983b) introduced TSA into writing research. He compared two groups of persuasive writing scripts, one rated high and one rated low, on the use of the three types of progression described above. He found that the higher level writers used less sequential progression, and more

extended and parallel progression. There were, however, several shortcomings of Witte's study.

Firstly, the raters were not professional raters, but rather were solicited from a variety of professions. Secondly, Witte did not use a standardized scoring scheme. He also conducted the study

    in a controlled revision situation in which the students revised a text written by another person.

    Furthermore, Witte did not report any intercoder reliability analysis.

    In 1990, Schneider and Connor set out to compare the use of topical structures by 45 writers

    taking the Test of Written English (TWE). They grouped the 45 argumentative essays into three

different levels: high, medium, low. As with Witte's study, Schneider and Connor did not report

any intercoder reliability statistics. Their findings contradicted Witte's: the higher level writers used more sequential progression, while the low and middle groups used more parallel

    progression. There was no difference between the levels in the use of extended parallel progres-

    sion. Schneider and Connor drew up clear guidelines on how to code TSA and also suggested


    a reinterpretation of sequential progression as part of their discussion section. They suggested

    dividing sequential progression into the following subcategories:

    1. Direct sequential progression, in which the comment of the previous sentence becomes the

topic of the following sentence. The topic and comment are either word derivations (e.g., science, scientist) or they form a part–whole relation (these groups, housewives, children).

    2. Indirect sequential progression, in which the comment of the previous sentence becomes the

    topic of the following sentence but topic and comment are only related by semantic sets (e.g.,

    scientists, their inventions and discoveries, the invention of the radio, telephone and television).

    3. Unrelated sequential progression, in which topics are not clearly related to either the previous

sentence topic or discourse topic.

Wu (1997), in his doctoral dissertation, applied Schneider and Connor's revised categories to

analyze two groups of scripts rated using the scale developed by Jacobs et al. (1981). He found in his analysis no statistically significant difference in terms of the use of parallel progression between

high and low level writers. Higher level writers used slightly more extended parallel progression

and more direct sequential progression. A more recent study using TSA to compare groups of

    writing based on holistic ratings, was undertaken by Burneikaite and Zabiliute (2003). Using the

    original criteria of topical structure developed by Lautamatti and Witte, they investigated the use

    of topical structure in argumentative essays by three groups of students rated as high, middle

    and low based on a rating scale adapted from Tribble (1996). They found that the lower level

    writers over-used parallel progression whilst the higher level writers used a balance between

    parallel and extended parallel progression. The differences in terms of sequential progression

were small, although they could show that lower level writers used this type of progression slightly

    less regularly. Burneikaite and Zabiliute failed to report any interrater reliability statistics.

All studies conducted since Witte's study in 1983 show generally very similar findings; however,

    there are slight differences. Two out of three studies found that lower level writers used more

    parallel progression than higher level writers; however, Wu (1997) found no significant difference.

    All three studies found that higher level writers used more extended parallel progression. In terms

of sequential progression, the differences in findings can be explained by the different ways this

category was used. Schneider and Connor (1990) and Burneikaite and Zabiliute (2003) used the

    definition of sequential progression with no subcategories. Both studies found that higher level

writers used more sequential progression. Wu (1997) found no differences between different levels of writing using this same category. However, he was able to show that higher level writers

    used more related sequential progression. It is also not entirely clear how much task type or topic

familiarity influences the use of topical structure and whether findings can be transferred from one

    writing situation to another.

    3. The study

    The aim of this study was to investigate whether TSA can successfully be operationalized into a

    rating scale to assess coherence in writing. The study was undertaken in three phases. Firstly, 602

writing samples were analyzed to establish the topical structure used by writers at five levels of writing ability. The findings were then transferred into a rating scale. To validate this scale, eight

    raters were trained and then rated 100 writing samples. The findings were compared to previous

    ratings of the same 100 scripts by the same raters using an existing rating scale for coherence.


    Fig. 1. Research design.

After the rating rounds, raters completed a questionnaire canvassing their opinions about

the rating scale, and a subset of five raters was interviewed.

Fig. 1 illustrates the design of the study.

    The research questions were as follows:

RQ1: What are the features of topical structure displayed at different levels of expository writing?

RQ2: How reliable and valid is TSA when used to assess coherence in expository writing?

RQ3: What are raters' perceptions of using TSA in a rating scale as compared to more conventional rating scales?

    4. Method

    4.1. Context of the research

    This study was conducted in the context of the Diagnostic English Language Needs Assessment

(DELNA), which is administered at the University of Auckland, New Zealand. DELNA is a university-funded procedure designed to identify the English language needs of undergraduate

    students following their admission to the University, so that the most appropriate language support

    can be offered. DELNA is administered to both native and non-native speakers of English. This

    context was selected by the researcher purely because of its availability and because the rating

    scale used to assess the writing task (see description below) is representative of many other rating

    scales used in writing performance assessment across the world. A more detailed description of

    the assessment and the rating scale can be found in the section below.

4.1.1. The assessment instrument

DELNA includes a screening component, which consists of a speed-reading and a vocabulary

    task. This is used to eliminate highly proficient users of English and exempts them from the time-

    consuming and resource-intensive diagnostic procedure. The diagnostic component comprises

    objectively scored reading and listening tasks and a subjectively scored writing task.

    The writing section is an expository writing task in which students are given a table or graph

    of information which they are asked to describe and interpret. Candidates have 30 minutes to

    complete the task. The writing task is routinely double (or if necessary triple) marked analyti-

    cally on nine traits (organization, coherence, style, data description, interpretation, development

    of ideas, sentence structure, grammatical accuracy, vocabulary and spelling) on a six-point scale

ranging from four to nine. The assessment criteria were developed in-house, initially based on an existing scale. A number of validity studies have been conducted on the DELNA battery, which

    included validation of the rating scale (Elder & Erlam, 2001; Elder & von Randow, 2002). The

    wording of the scale has been changed a number of times based on the feedback of raters after


    training sessions or during focus groups. The DELNA rating scale reflects common practice in

performance assessment in that the descriptors are graded using adjectives like 'adequate', 'appropriate', 'sufficient', 'severe' or 'slight'. The coherence scale uses descriptors like 'skilful coherence,

message able to be followed effortlessly' or 'little coherence, considerable strain for reader'.

Strain is graded between different level descriptors from 'slight', 'some', 'considerable' to 'severe'.

    4.1.2. The writing samples

    To identify the specific features of topical structure used by writers taking DELNA, 602 writing

    samples, which were produced as part of the 2004 administration of the assessment, were randomly

    selected. The samples were originally hand-written by the candidates. The mean number of words

    for the scripts was 269, ranging from 75 to 613.

4.1.3. The candidates

Three hundred twenty-nine of the writing samples were produced by females and 247 by

    males (roughly reflecting the gender distribution of DELNA), whilst 26 writers did not spec-

    ify their gender. The L1 of the students (as reported in a self-report questionnaire) varied.

Forty-two percent (or 248 students, N = 591) have an Asian first language, 36% (217) are

    native speakers of English, 9% (52) are speakers of a European language other than English,

5% (31) have either a Pacific Island language or Maori as first language, and 4% (21) speak

either an Indian language or a language from Sri Lanka as first language. The remaining 4% (22)

    were grouped as others. Eleven students did not fill in the self-report questionnaire. The

    scripts used in this analysis were all rated by two DELNA raters. In case of discrepancies

between the scores, the scores were averaged and rounded (in the case of a .5 result after averaging, the score was rounded down). The 602 scripts were awarded the following average

    marks (Table 1).
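As a minimal sketch of this score-resolution rule (the function name is illustrative; since the two scores are integers, the average can only end in .0 or .5, so flooring implements 'round .5 down' exactly):

```python
import math

# Two raters' integer scores are averaged; a .5 average is rounded down.
def resolve_score(rater1, rater2):
    avg = (rater1 + rater2) / 2
    return math.floor(avg)  # floor rounds x.5 down and leaves x.0 unchanged

print(resolve_score(6, 7))  # 6.5 -> 6
print(resolve_score(6, 8))  # 7.0 -> 7
```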

    4.1.4. The raters

    The eight DELNA raters taking part in this study were drawn from a larger pool of raters based

    on their availability at the time of the study. All raters have high levels of English proficiency

    although not all are native speakers of English. Most have experience in other rating contexts, for

    example, as accredited raters of the International English Language Testing System (IELTS). All

have postgraduate degrees in either English, Applied Linguistics or Teaching English to Speakers of Other Languages (TESOL). All raters have several years of experience as DELNA raters and

take part in regular training and moderation sessions, either face-to-face or online (Elder,

    Barkhuizen, Knoch, & von Randow, 2007; Knoch, Read, & von Randow, 2007).

    Table 1

    Score distribution of 602 writing samples

    DELNA score Frequency Percent (%)

4 23 4

5 115 19

6 253 42

7 172 29

8 26 4


4.2. Procedures: analysis of writing samples

    4.2.1. Pilot study

    While coding the 602 writing scripts, the categories of parallel, direct sequential and unrelated

sequential progression were used as defined by Schneider and Connor (1990) and Wu (1997). However, other categories had to be changed or added to better account for the data. Firstly,

    extended parallel progression was changed to extended progression to account for cases in which

    the topic of a sentence is identical to a comment occurring more than two sentences earlier.

    Similarly, indirect sequential progression was modified to indirect progression to also include

    cases in which the indirect link is back to the previous topic. Then, a category was created that

    accounts for features very specific to writers whose L1 is not English. At very low levels, these

writers often attempt to create a coherent link back to the previous sentence but fail because, for

example, they use an incorrect linking device or a faulty pronominal reference. This category was called

    coherence break. Another category was established to account for coherence that is created not

by topic progression but by features such as linking devices (e.g., however, also, but). This category also includes cases in which the writer clearly signals the ordering of an essay or paragraph early

on, so that the reader can follow any piece of discourse without needing topic progression as

    guidance. Table 2 below presents all categories of topical structure used in the main analysis with

    definitions and examples.1

    4.2.2. Main analysis

    To analyze the data, the writing scripts were first typed and then divided into t-units following

    Schneider and Connor (1990) and Wu (1997). The next step was to identify sentence topics. For

this, Wu's (1997) criteria were used (see Appendix A). Then each t-unit was coded into one of

    the seven categories as described in Table 2. The percentage of each category was recorded into

    a spreadsheet. The mean DELNA score produced by the two DELNA raters was also added for

    each candidate. To identify which categories were used by students at different proficiency levels,

    the final score was correlated with the percentage of occurrence of each category. The results of

    this analysis can be found in the results section under research question 1 below. Finally, to ensure

    intercoder reliability, t-unit coding, topic identification and TSA were all undertaken by a second

    researcher (on a subset of 50 scripts) and intercoder reliability was calculated.
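A hedged sketch of this pipeline is shown below. Function names and category labels are illustrative abbreviations of Table 2; in the study itself the t-unit coding was done by hand.

```python
from collections import Counter
from scipy.stats import pearsonr

CATEGORIES = ["parallel", "direct sequential", "indirect", "superstructure",
              "extended", "coherence break", "unrelated"]

def category_proportions(codes):
    """codes: hand-assigned TSA labels, one per t-unit of a script."""
    counts = Counter(codes)
    return {c: counts[c] / len(codes) for c in CATEGORIES}

def correlate_with_scores(coded_scripts, final_scores):
    """Pearson r of each category's per-script proportion with the final score."""
    return {c: pearsonr([category_proportions(s)[c] for s in coded_scripts],
                        final_scores)
            for c in CATEGORIES}

def exact_agreement(coder1, coder2):
    """Intercoder reliability: proportion of units given the same label."""
    return sum(a == b for a, b in zip(coder1, coder2)) / len(coder1)
```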

4.3. Procedures: rating scale validation

    4.3.1. Procedure

    The raters rated 100 scripts using the current DELNA criteria and then the same 100 using the

    new scale based on TSA. The scripts were selected to represent a range of proficiency levels. The

raters were given the scripts in five sets of 20 scripts over a period of about eight weeks.

    They all participated in a rater moderation session to ensure they were thoroughly trained. All

    raters were further instructed to rate no more than ten scripts in one session to avoid fatigue.

    After rating the two sets of 100 scripts, the raters filled in a questionnaire canvassing their

    opinions about the scales. The questionnaire (part of a larger-scale study) allowed the raters to

    record any opinions or suggestions they had with respect to the coherence scale. The questionnaire

    questions were as follows:

    1 All examples were taken from the data used in this study.


    Table 2

    Categories of topical structure analysis used in main analysis with examples

    Parallel progression

    Topics of successive sentences are the same (or synonyms)

    Maori and PI males are just as active as the rest of NZ. They also have other interests

    Direct sequential progression

    The comment of the previous sentence becomes the topic of the following sentence

    The graph showing the average minutes per week spent on hobbies and games by age group and sex, shows many

    differences in the time spent by females and males in NZ on hobbies and games

    These differences include on age factor

    Indirect progression

The topic or the comment of the previous sentence becomes the topic of the following sentence. The topic or

comment are only indirectly related (by inference, e.g., related semantic sets)

The main reasons for the increase in the number of immigrates is the development of some third-world countries.

    e.g., China. People in those countries have got that amount of money to support themselves living in a foreign

    country

    Superstructure

    Coherence is created by a linking device instead of topic progression

    Reasons may be the advance in transportation and the promotion of New Zealands natural environment and

    green image. For example, the filming of The Lord of the rings brought more tourists to explore the

    beautiful nature of NZ

    Extended progression

A topic or a comment before the previous sentence becomes the topic of the new sentence

    The first line graph shows New Zealanders arriving in and departing from New Zealand between 2000 and 2002.

    The horizontal axis shows the times and the vertical axis shows the number of passengers which are New

    Zealanders. The number of New Zealanders leaving and arriving have increased slowly from 2000 to 2002.

    Coherence break

    Attempt at coherence fails because of an error

    The reasons for the change on the graph. Its all depends on their personal attitude

    Unrelated progression

    Topic of a sentence is not related to the topic or comment in the previous sentence

    The increase in tourist arrivers has a direct affect to New Zealand economy in recent years. The government reveals

    that unemployment rate is down to 4% which is a great news to all New Zealanders

    (1) What did you like about the scales?

    (2) Were there any descriptors that you found difficult to apply? If yes, please say why.

    (3) Please write specific comments that you have about the scales below. You could for example

write how you used them, any problems that you encountered that you haven't mentioned

    above or you can mention anything else you consider important.

A subset of five raters was also interviewed after the study was concluded.


    Table 3

    TSA correlations with final DELNA writing score

    Final writing score

Parallel progression −.215a

Direct sequential progression .292a

Superstructure .258a

Indirect progression .220a

Extended progression .07

Unrelated progression −.202a

Coherence break −.246a

n = 602. a p < .01.

4.3.2. Data analysis

The results of the two rating rounds were analyzed using multi-faceted Rasch measurement

    in the form of the computer program FACETS (Linacre, 2006). FACETS is a generalization

of Wright and Masters' (1982) partial credit model that makes possible the analysis of data

    from assessments that have more than the traditional two facets associated with multiple-choice

    tests (i.e., items and examinees). In the many-facet Rasch model, each facet of the assessment

    situation (e.g., candidates, raters, trait) is represented by one parameter. The model states that

    the likelihood of a particular rating on a given rating scale from a particular rater for a particular

    student can be predicted mathematically from the proficiency of the student and the severity of

the rater. The advantage of using multi-faceted Rasch measurement is that it models all facets

in the analysis onto a common logit scale, which is an interval scale. Because of this, it becomes possible to establish not only the relative difficulty of items, the ability of candidates, the severity of

raters and the scale step difficulties, but also how large these differences are. Multi-faceted

    Rasch measurement is particularly useful in rating scale validation as it provides a number of

    useful measures such as rating scale discrimination, rater agreement and severity statistics and

    information with respect to the functioning of the different band levels in a scale.
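For reference, the model just described can be written out. This is the standard three-facet rating scale formulation (after Linacre), not a formula reproduced from the article:

```latex
% Probability of candidate n receiving category k rather than k-1
% from rater j on trait i:
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
% B_n: ability of candidate n      D_i: difficulty of trait i
% C_j: severity of rater j         F_k: difficulty of scale step k
```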

    To make the multi-faceted Rasch analysis used in this study more powerful, a fully crossed

    design was chosen; that is, all eight raters rated the same 100 writing scripts on both occasions.

    Although such a fully crossed design is not necessary for FACETS to run the analysis, it makes

    the analysis more stable and therefore better conclusions can be drawn from the results (Myford

    & Wolfe, 2003).

    5. Results

    5.1. RQ1: What are the features of topical structure displayed at different levels of writing?

    The results of the intercoder reliability analysis show a high level of agreement for the two

    researchers coding the data. The proportion of exact agreement for the t-unit identification is .959,

    for the identification of the t-unit topics is .931 and for the TSA categories (as shown in Table 2)

is .865.

A Pearson correlation of the proportion of each TSA category used in each essay with

the overall writing score was performed in order to establish which categories are used at different

levels of writing. The results of the correlation are reported in Table 3.


    Table 4

    TSA-based rating scale for coherence

    Level Coherence

    4 Frequent: unrelated progression, coherence breaks

Infrequent: sequential progression, superstructure, indirect progression

5 As level 4, but coherence might be achieved in stretches of discourse by

overusing parallel progression. Only some coherence breaks

    6 Mixture of most categories

    Superstructure relatively rare

    Few coherence breaks

    7 Frequent: sequential progression

    Superstructure occurring more frequently

    Infrequent: Parallel progression

    Possibly no coherence breaks

    8 Writer makes regular use of superstructures, sequential progression

    Few incidences of unrelated progression

    No coherence breaks

    The table shows that the variables used for the analysis of TSA in the essays can be divided into

    three groups. The first group consists of variables that were used more by students whose essays

    received a higher overall writing score. The three variables in this group are direct sequential

    progression, indirect progression, and superstructure. The second group is made up of variables

    that were used more by weaker writers. Variables in this group are coherence breaks, unrelated

    sequential progression, and parallel progression. The third group consists of variables that were

    used equally by the strong and the weak writers. The only variable that falls into this category is

    extended progression.

    The table showing the correlational results does not indicate the distribution over the different

    DELNA writing levels. Therefore, box plots were created for each variable, to indicate how the

    proportion of usage changes over the different DELNA band levels. The box plots can be found

in Appendix B (Figs. 2–8). The box plots show that although there is a lot of overlap between

    the different levels of writing within each variable, there are clear trends in the distribution of the

    variables over the proficiency levels. The only exception seems to be parallel progression, where

    writers at level 4 seemed to use fewer instances of parallel progression than writers at level 5.

    The quantitative results shown in the box plots were then used to develop the TSA-based rating

scale. The trends for the different types of TSA categories observed in the box plots were used as the basis for the level descriptors. Because raters were presumably unable to identify small

trends in the writing samples, only general trends were used for the different level descriptors.

For example, raters were not asked to count each incident of each category of topical structure;

rather, they were guided as to which features they could expect to be least or most common at different

levels. Because the strongest students were filtered out during the DELNA screening procedure,

no scripts at band level 9 were analyzed, and the TSA-based rating scale therefore only has five

levels. However, the possibility of a sixth level exists. The scale is reproduced in Table 4.
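Purely to illustrate how Table 4 orders the categories, the following sketch assigns a band from category proportions. Every numeric cut-off is a hypothetical invention of this example; the published scale deliberately gives qualitative guidance ('frequent', 'infrequent') rather than counts, and in the study the levels were assigned by trained raters, not automatically.

```python
def tsa_band(p):
    """p: per-script TSA category proportions (keys as in the earlier sketch).
    All thresholds below are hypothetical illustrations, not from the scale."""
    if p["coherence break"] > 0.15 or p["unrelated"] > 0.25:
        return 4  # frequent unrelated progression and coherence breaks
    if p["parallel"] > 0.40 and p["coherence break"] > 0.0:
        return 5  # stretches of overused parallel progression, some breaks
    if p["superstructure"] >= 0.15 and p["coherence break"] == 0.0:
        return 8  # regular superstructure, no coherence breaks
    if p["direct sequential"] > 0.35 and p["parallel"] < 0.15:
        return 7  # frequent sequential, infrequent parallel progression
    return 6      # mixture of most categories
```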

5.2. RQ2: How reliable and valid is TSA when used to assess coherence in writing?

    FACETS provides a group of statistics which investigates the spread of raters in terms of

    harshness and leniency (see Table 5). The rater fixed chi square tests the assumption that all

    the raters share the same severity measure, after accounting for measurement error. A significant


    Table 5

    Rater separation statistics

DELNA scale TSA-based scale

Rater fixed chi square value 66.9, d.f. 7, p = .00 216.8, d.f. 7, p = .00

Rater separation ratio 2.94 5.62

    Table 6

    Rater infit mean square values

Rater Infit mean square (DELNA scale) Point biserial (DELNA scale) Infit mean square (TSA-based scale) Point biserial (TSA-based scale)

    2 1.11 .67 1.07 .65

    4 1.30 .54 1.08 .70

    5 1.53 .68 1.11 .79

    7 .69 .68 .98 .56

    9 .91 .68 .95 .65

    12 1.08 .50 .97 .71

    13 .67 .61 .73 .70

    14 .69 .65 1.07 .58

    Mean 1.00 .63 .99 .67

    S.D. .31 .07 .12 .09

    fixed chi square means that the severity measures of at least two raters included in the analysis

are significantly different. The fixed chi square value for both scales is significant,2 showing that two or more raters are significantly different in terms of leniency or harshness; however, the fixed

chi square value of the TSA-based scale is bigger, indicating a larger difference between raters

    in terms of severity. The rater separation ratio provides an indication of the spread of the rater

    severity measures. The closer the separation ratio is to zero, the closer the raters are together in

    terms of their severity. Again, the larger separation ratio of the TSA-based scale shows that the

    raters differed more in terms of leniency and harshness.

    Another important output of the rater measurement report is the infit mean square statistics with

    the rater point biserial correlations (see Table 6). The infit mean square has an expected mean of

1. Raters with very low infit mean square statistics (lower than .7) do not show enough variation in

their ratings, meaning they are overly consistent and possibly overuse the inner band levels of the scale, whilst raters with infit mean square values higher than 1.3 show too much variation in their

ratings, meaning they rate inconsistently. Table 6 shows that two raters rated near the margins of

    acceptability when using the existing rating scale, whilst no raters rated too inconsistently when

using the TSA-based scale. Three raters, however, showed too little variation in their ratings

when using the DELNA scale, as shown by their infit mean square values lower than .7.
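The consistency check described above is easy to express in code; the bounds (.7 and 1.3) and the DELNA-scale infit values come from the text and Table 6, while the variable names are this example's own:

```python
# Bounds cited in the text; DELNA-scale infit values from Table 6.
INFIT_LOW, INFIT_HIGH = 0.7, 1.3

delna_infit = {2: 1.11, 4: 1.30, 5: 1.53, 7: 0.69,
               9: 0.91, 12: 1.08, 13: 0.67, 14: 0.69}

for rater, infit in sorted(delna_infit.items()):
    if infit >= INFIT_HIGH:
        print(f"Rater {rater}: too much variation in ratings (infit {infit})")
    elif infit < INFIT_LOW:
        print(f"Rater {rater}: too little variation in ratings (infit {infit})")
```

Run as is, this flags raters 4 and 5 at or above the upper bound and raters 7, 13 and 14 below the lower one, matching the observations in the text.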

    The rater point biserial correlation coefficient is reported for each rater individually as well

    as for the raters as a group. It summarizes the degree to which a particular raters ratings are

    consistent with the ratings of the rest of the raters. The point biserial correlation is concerned with

    2 Myford and Wolfe (2004) note that the fixed chi square test is very sensitive to sample size. Because of this, the fixed

    chi square value is often significant, even if the actual variation in terms of leniency and harshness between the raters is

small.


    Table 7

    Candidate separation statistics

DELNA scale TSA-based scale

Candidate fixed chi square 833.5, d.f. 99, p = .00 736.6, d.f. 99, p = .00

Candidate separation ratio 3.93 4.13

    the degree to which raters are ranking candidates in a similar fashion. Myford and Wolfe (2004)

    suggest that the expected values for this correlation are between .3 and .7, with a correlation of .7

    being high for rating data. The point biserial correlation coefficient for the DELNA scale is .63,

    whilst the TSA-based rating scale results in an average correlation coefficient of .67.

    The findings from Table 6 indicate that raters, when using the TSA-based scale, with its more

    defined categories, seemed to be able to not only rank candidates more similarly, but also to

achieve consistency in their ratings.

As with the rater measurement report described above, FACETS also generates a candidate

    measurement report. The first group of statistics in this report is the candidate separation statistics.

    The candidate fixed chi square tests the assumption that all candidates are of the same level of

    performance. The candidate fixed chi square values in Table 7 indicate that the ratings based on

    the TSA-based scale are slightly more discriminating (seen by the lower fixed chi square value).

    The same trend can be seen when the candidate separation ratio is examined. The candidate

    separation ratio indicates the number of statistically significant levels of candidate performance.

This statistic also shows that when raters used the TSA-based scale, their ratings were slightly more

    discriminating. Although the existing DELNA scale has 6 levels of descriptors for coherence, the

raters only separated the candidates into 3.93 levels. The TSA-based scale consists of five levels and the raters separated the candidates into 4.13 levels when using it. The higher discrimination

    ability of the new scale is a product of the higher rater point biserial correlation.

    For the comparison of the rating scale categories, FACETS produces scale category statistics.

    The tables for the existing DELNA and the TSA-based scale are reproduced in Tables 8 and 9

    respectively. The first column in each table shows the raw scores represented by the two rating

scales. Please note that the TSA-based scale has one fewer category to award and therefore only ranges

    from four to eight. The second column shows the number of times (counts) each of these scores

    was used by the raters as a group, the third column shows these numbers as percentages of overall

use. When looking at the counts and percentages, it is clear that the raters, when using the existing

DELNA scale, underused the outside categories: in particular, category 4 was rarely awarded. This table also underlines the evidence that the raters, when using the existing DELNA scale,

    Table 8

Scale category statistics: DELNA scale

Score Counts (%) Average measure Expected measure Outfit mean square Step calibration measure

4 3 0 −1.69 −2.21 1.2

5 59 7 −1.43 −1.46 1.0 −4.82

6 288 36 −0.41 −.38 1.0 −2.53

7 299 37 1.04 .98 .9 .28

8 127 16 2.46 2.34 .8 2.58

9 24 3 2.95 3.24 1.3 4.49


    Table 9

Scale category statistics: TSA-based scale

Score Counts (%) Average measure Expected measure Outfit mean square Step calibration measure

4 38 5 −2.03 −2.01 1.0

5 165 21 −.75 −.81 1.1 −2.89

6 239 30 .33 .39 1.0 −.57

7 205 26 1.49 1.49 .9 1.10

8 153 19 2.73 2.70 .9 2.36

    displayed a central tendency effect. The scores are far more widely spread when the raters used

the TSA-based scale; however, level 4 still seemed slightly underused, only being awarded in 5%

of all cases. Low frequencies might indicate that a category is unnecessary or redundant and

should possibly be collapsed.

Column four indicates the average candidate measures at each scale category. These measures

    should increase as the scale category increases. This is the case for both scales. When this pattern

is seen to be occurring, it shows that the rating scale points are appropriately ordered and are functioning properly. This means that higher ratings do correspond to more of the variable being rated.

    Column five shows the expected average candidate measure at each category, as estimated by

    the FACETS program. The closer the expected and the actual average measures, the closer the

    outfit mean square value in column six will be to 1. It can be seen that the outfit mean square values

for both scales are generally close to 1; however, category 9 of the existing scale is slightly high,

    which might mean that it is not contributing meaningfully to the measurement of the variable of

coherence. Bond and Fox (2001) suggest that this might be a good reason for collapsing a category.

Column seven gives the step calibration measures, which denote the point at which the prob-

    ability curves for two adjacent scale categories cross (Linacre, 1999). Thus, the rating scale

    category threshold represents the point at which the probability is 50% of a candidate being rated

    in one or the other of these two adjacent categories, given that the candidate is in one of them. The

    rating scale category thresholds should increase monotonically and be equally distanced (Linacre,

    1999) so that none are too close or too far apart. This is generally the case for both rating scales

    under investigation. Myford and Wolfe (2004) argue that if the rating scale category thresholds are

widely dispersed, the raters might be exhibiting a central tendency effect. This is marginally

the case for the existing DELNA scale.
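In terms of the Rasch model given earlier, the step calibration F_k is the point at which two adjacent category probabilities are equal, which follows from setting the log-odds to zero:

```latex
% Categories k-1 and k are equally likely when the log-odds vanish:
P_{nij(k-1)} = P_{nijk} \iff B_n - D_i - C_j = F_k
```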

    Overall, the results from research question two indicate that the raters, when using the TSA-

    based scale, were able to discern more levels of ability among the candidates and ranked the

candidates more similarly. They were also able to use more levels on the scale reliably. All of this

provides evidence that the TSA-based scale functions better than the existing scale. However, the

    raters differed more in terms of leniency and harshness when using the TSA-based scale, which

    is undesirable but less crucial in situations where scripts are routinely double-rated.

5.3. RQ3: What are raters' perceptions of using TSA in a rating scale as compared to more

    conventional rating scales when rating writing?

    The interviews provided some evidence that raters experienced problems when using the less

    specific level descriptors of the DELNA scale. Rater 12, for example, described his problems

    when using the DELNA scale in the following quote:


. . . sometimes I look at [the descriptor] I'm going 'what do you mean by that?' . . . You just

kind of have to find a way around it 'cause it's not really descriptive enough, yeah

    Rater 4 provided evidence of a strategy that she resorted to when experiencing problems

    assigning a level:

I just tend to go with my gut feeling. So I don't spend a lot of time worrying about it ... but

I think this is a very good example of where, if I have an overall sense that a script is really

a seven, I'd be likely to give it a seven in coherence.

    Raters were asked in a questionnaire about their perceptions of the rating scale category of

    coherence in the TSA-based scale. Four raters commented that it took them a while to get used

to the rating scale and that they had concerns about it not being very marker-friendly (e.g.,

Rater 5). Most of these raters, however, mentioned that they became accustomed to the category

after having marked a number of scripts. Rater 2, for example, mentioned that he likes the scale because it 'gives me a lot more guidance than the DELNA scale and I feel that I am doing the

writers more justice in this way'.

One rater, however, commented that the TSA-based scale is narrower than the DELNA coherence scale, as it focuses only on topical structure and not on other aspects of coherence.

    Overall, the raters found the TSA-based scale more objective and more descriptive.

    6. Discussion and conclusion

The analysis of the 602 writing scripts using TSA was able to show that this measure is

    successful in differentiating between different proficiency levels. The redesign of the categories

    suggested by Schneider and Connor (1990) was valuable in improving the usefulness of the

    measure. Especially the new categories of superstructure and coherence breaks were found to be

    discriminating between different levels of writing ability. Apart from being useful in the context

    of rating scale development and assessment, this method could be applied to teaching, as was

suggested by Connor and Farmer (1990). Overall, TSA was shown to be useful as an

    objective discourse analytic measure of coherence.

    The comparison of the ratings based on the two different rating scales was able to provide

    evidence that the raters rated more accurately when using the TSA-based scale. Raters used more

band levels and ranked the candidates more similarly when using this scale. Therefore, when given more specific descriptors for rating a fuzzy concept such as coherence, raters

were able to identify more levels of candidate ability. Helping raters to divide performances into

as many ability levels as possible is the aim of rating scales.

When in doubt about which band level to award a performance under the more

impressionistic descriptors of coherence on the DELNA scale, the raters seemed to resort to two different

strategies. Either they used most band levels on the scale, but did so inconsistently, or they overused

    the band levels 6 and 7 and avoided the extreme levels, especially levels 4 and 9. Whilst this might

    be less of a problem if the trait is only one of many on an analytic rating scale and the score

    is reported as an averaged score, in a diagnostic context, in which we would like to report the

strengths and weaknesses of a candidate to stakeholders, this might result in a loss of valuable information. Alderson (2005) suggests that diagnostic tests should focus more on specific rather

    than global abilities, and therefore it could be argued that the TSA-based descriptors might be

    particularly useful in a diagnostic context.


    The interview and questionnaire data also provided evidence that the raters focused more

    on the descriptors when these were less vague and impressionistic. If we are able to arrive at

descriptors which enable raters to rely less on their gut feeling about the overall quality of a writing

    performance but focus more on the descriptions of performance in the level descriptors, then we

inevitably arrive at more reliable and probably more valid ratings. This study was able to show that developing descriptors empirically might be the first step in this direction.

    An important consideration with respect to the two scales discussed in this study is practi-

    cality. Two types of practicality need to be considered: practicality of scale development and

    practicality of scale use. The TSA-based scale was clearly more time-consuming to develop than

the pre-existing DELNA scale. To do a detailed empirical analysis of a large number of writing performances is labor-intensive and therefore might not be practical in some contexts. The

    practicality of the scale use is another issue that needs to be considered. In this case, there was

    evidence from the interviews and questionnaires that raters needed more time when rating with

    the TSA-based scale. However, most reported becoming accustomed to these more detailed scale

descriptors.

One limitation of this study is that TSA does not cover all aspects of coherence. So whilst

    the TSA-based scale is more detailed in its descriptions, some aspects of coherence which raters

    might look for when using more conventional rating descriptors might be lost, which lowers the

content validity of the scale. However, it seems that the existing rating scale resulted in

two raters rating too inconsistently, possibly because they were judging different aspects

in different scripts, and in others overusing the inner scale categories. Lumley (2002) was able to

show that when raters are confronted with aspects of writing which are not specifically mentioned in the scale descriptors, they inevitably draw on their own knowledge or feelings and

resort to coping strategies. However, this study was able to show that rating scales with very

    specific level descriptors can help avoid play-it-safe methods and make it easier to arrive at a level

    (which is what rating scales are ultimately designed for), even though some content validity is

    sacrificed.

It is also important to mention that the raters taking part in this study were far more familiar with the current DELNA scale, having used it for many years. Being confronted with the

    TSA-based scale for this research project meant a departure from the norm. It is therefore

    possible that the raters in this study varied more in terms of severity when using the TSA-

    based scale because they were less familiar with it. It might be possible that if they were

    to use the TSA-based scale more regularly and receive more training, the variance in terms

    of leniency and harshness might be reduced. It seems important to ensure that rating pat-terns over time remain stable, and avoid central tendency effects, by subjecting individual

    trait scales to regular quantitative and qualitative validation studies, and addressing varia-

    tions both through rater training (as is usually the case) and better specification of scoring

    criteria.

This research was able to show the value of developing descriptors based on empirical investigation. Even an aspect of writing as vague and elusive as coherence was operationalized for

    this study. Rating scale developers should consider this method of scale development as a viable

alternative to intuitive development methods, which are commonly used around the world. Overall, it can be said, however, that more detailed, empirically developed rating scales might lend

themselves to being more discriminating and result in higher levels of rater reliability than more conventional rating scales. Further research is necessary to establish if this is also the case for

    other traits, as this study only looked at a scale for coherence. Also, it would be interesting to

    pursue a similar study in the context of speaking assessment.


    Appendix A. Criteria for identifying sentence topics (taken from Wu, 1997)

    1. Sentence topics are defined as the leftmost NP dominated by the finite verb in the t-unit. It is

    what the t-unit is about.

2. Exceptions:

a. Cleft sentences

i. It is the scientist who ensures that everyone reaches his office on time

ii. It is Jane we all admire

b. Anticipatory pronoun it

i. It is well known that a society benefits from the work of its members

ii. It is clear that he doesn't agree with me

c. Existential there

i. There often exists in our society a certain dichotomy of art and science

ii. There are many newborn children who are helpless

d. Introductory phrase

i. Biologists now suggest that language is species-specific to the human race.
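As a rough computational sketch of criterion 1 (the leftmost NP dominated by the finite verb), a dependency parser could be used. spaCy is a tool choice of this illustration, not of the original study, and the exceptions above (clefts, anticipatory 'it', existential 'there', introductory phrases) would each need fuller handling than the simple skips shown here:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # requires the small English model

def sentence_topic(sentence):
    """Heuristic: return the leftmost noun phrase acting as subject."""
    doc = nlp(sentence)
    for chunk in doc.noun_chunks:
        root = chunk.root
        if root.dep_ == "expl":  # existential "there" (exception c)
            continue
        if root.lower_ == "it":  # anticipatory/cleft "it" (exceptions a, b)
            continue
        if root.dep_ in ("nsubj", "nsubjpass"):
            return chunk.text
    return None

print(sentence_topic("Paul walked on the street."))  # -> "Paul"
```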

    Appendix B. Box plots comparing TSA over different levels

See Figs. 2–8.

    Fig. 2. Proportion of parallel progression over five DELNA levels.


    Fig. 3. Proportion of direct sequential progression over five DELNA levels.

    Fig. 4. Proportion of superstructure over five DELNA levels.


Fig. 5. Proportion of indirect progression over five DELNA levels.

Fig. 6. Proportion of extended progression over five DELNA band levels.


Fig. 7. Proportion of unrelated progression over five DELNA band levels.

Fig. 8. Proportion of coherence breaks over five DELNA band levels.
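The quantities plotted in Figs. 2–8 are simple proportions: for each script, the number of t-units assigned to a given progression type (or coherence break) divided by the total number of t-units in the script. As a minimal sketch, assuming the classification of each t-unit has already been carried out, the following code computes these proportions; the category labels and function name are illustrative only.

    from collections import Counter

    # Categories mirroring Figs. 2-8 (labels are illustrative assumptions).
    CATEGORIES = ["parallel", "direct sequential", "superstructure",
                  "indirect", "extended", "unrelated", "coherence break"]

    def progression_proportions(classified_t_units):
        """Given one category label per t-unit in a script, return the
        proportion of each progression type in that script."""
        if not classified_t_units:
            return {cat: 0.0 for cat in CATEGORIES}
        counts = Counter(classified_t_units)
        total = len(classified_t_units)
        return {cat: counts.get(cat, 0) / total for cat in CATEGORIES}

    # Example: a short script of six classified t-units.
    script = ["parallel", "direct sequential", "parallel",
              "extended", "unrelated", "parallel"]
    print(progression_proportions(script))
    # -> {'parallel': 0.5, 'direct sequential': 0.166..., ..., 'coherence break': 0.0}

Box plots of these per-script proportions, grouped by DELNA band, then yield displays like those described above.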


    References

Alderson, C. (2005). Diagnosing foreign language proficiency: The interface between learning and assessment. London: Continuum.
Bamberg, B. (1984). Assessing coherence: A reanalysis of essays written for the National Assessment of Educational Progress. Research in the Teaching of English, 18 (3), 305–319.
Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Lawrence Erlbaum.
Burneikaite, N., & Zabiliute, J. (2003). Information structuring in learner texts: A possible relationship between the topical structure and the holistic evaluation of learner essays. Studies about Language, 4, 1–11.
Cason, G. J., & Cason, C. L. (1984). A deterministic theory of clinical performance rating. Evaluation and the Health Professions, 7, 221–247.
Cheng, X., & Steffensen, M. S. (1996). Metadiscourse: A technique for improving student writing. Research in the Teaching of English, 30 (2), 149–181.
Connor, U., & Farmer, F. (1990). The teaching of topical structure analysis as a revision strategy for ESL writers. In: B. Kroll (Ed.), Second language writing: Research insights for the classroom. Cambridge: Cambridge University Press.
Crismore, A., Markkanen, R., & Steffensen, M. S. (1993). Metadiscourse in persuasive writing: A study of texts written by American and Finnish university students. Written Communication, 10, 39–71.
Elder, C. (2003). The DELNA initiative at the University of Auckland. TESOLANZ Newsletter, 12 (1), 15–16.
Elder, C., Barkhuizen, G., Knoch, U., & von Randow, J. (2007). Evaluating rater responses to an online rater training program. Language Testing, 24 (1), 37–64.
Elder, C., & Erlam, R. (2001). Development and validation of the diagnostic English language needs assessment (DELNA): Final report. Auckland: University of Auckland, Department of Applied Language Studies and Linguistics.
Elder, C., Knoch, U., Barkhuizen, G., & von Randow, J. (2005). Individual feedback to enhance rater training: Does it work? Language Assessment Quarterly, 2 (3), 175–196.
Elder, C., & von Randow, J. (2002). Report on the 2002 pilot of DELNA at the University of Auckland. Auckland: University of Auckland, Department of Applied Language Studies and Linguistics.
Fulcher, G. (1987). Tests of oral performance: The need for data-based criteria. ELT Journal, 41 (4), 287–291.
Fulcher, G. (1996). Does thick description lead to smart tests? A data-based approach to rating scale construction. Language Testing, 13 (2), 208–238.
Fulcher, G. (2003). Testing second language speaking. London: Pearson Longman.
Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In: L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts. Norwood, NJ: Ablex Publishing Corporation.
Hoenisch, S. (1996). The theory and method of topical structure analysis. Retrieved 30 April 2007, from http://www.criticism.com/da/tsa-method.php
Hoey, M. (1991). Patterns of lexis in text. Oxford: Oxford University Press.
Intaraprawat, P., & Steffensen, M. S. (1995). The use of metadiscourse in good and poor ESL essays. Journal of Second Language Writing, 4 (3), 253–272.
Jacobs, H., Zinkgraf, S., Wormuth, D., Hartfiel, V., & Hughey, J. (1981). Testing ESL composition: A practical approach. Rowley, MA: Newbury House.
Knoch, U., Read, J., & von Randow, J. (2007). Re-training writing raters online: How does it compare with face-to-face training? Assessing Writing, 12 (1), 26–43.
Lautamatti, L. (1987). Observations on the development of the topic of simplified discourse. In: U. Connor & R. B. Kaplan (Eds.), Writing across languages: Analysis of L2 text (pp. 87–114). Reading, MA: Addison-Wesley.
Lee, I. (2002). Teaching coherence to ESL students: A classroom inquiry. Journal of Second Language Writing, 11, 135–159.
Linacre, J. M. (1999). Investigating rating scale category utility. Journal of Outcome Measurement, 3 (2), 103–122.
Linacre, J. M. (2006). Facets Rasch measurement computer program. Chicago: Winsteps.
Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do they really mean to the raters? Language Testing, 19 (3), 246–276.
McIntyre, P. N. (1993). The importance and effectiveness of moderation training on the reliability of teachers' assessments of ESL writing samples. Unpublished master's thesis, University of Melbourne, Australia.
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4 (4), 386–422.
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5 (2), 189–227.

North, B. (1995). The development of a common framework scale of descriptors of language proficiency based on a theory of measurement. System, 23 (4), 445–465.
North, B. (2003). Scales for rating language performance: Descriptive models, formulation styles, and presentation formats. TOEFL Research Paper. Princeton, NJ: Educational Testing Service.
North, B., & Schneider, G. (1998). Scaling descriptors for language proficiency scales. Language Testing, 15 (2), 217–263.
Schneider, M., & Connor, U. (1990). Analyzing topical structure in ESL essays. Studies in Second Language Acquisition, 12 (4), 411–427.
Tribble, C. (1996). Writing. Oxford: Oxford University Press.
Turner, C. E., & Upshur, J. A. (2002). Rating scales derived from student samples: Effects of the scale maker and the student sample on scale content and student scores. TESOL Quarterly, 36 (1), 49–70.
Upshur, J. A., & Turner, C. E. (1995). Constructing rating scales for second language tests. ELT Journal, 49 (1), 3–12.
Upshur, J. A., & Turner, C. E. (1999). Systematic effects in the rating of second-language speaking ability: Test method and learner discourse. Language Testing, 16 (1), 82–111.
Watson Todd, R. (1998). Topic-based analysis of classroom discourse. System, 26, 303–318.
Watson Todd, R., Thienpermpool, P., & Keyuravong, S. (2004). Measuring the coherence of writing using topic-based analysis. Assessing Writing, 9, 85–104.
Weigle, S. C. (1994a). Effects of training on raters of English as a second language compositions: Quantitative and qualitative approaches. Unpublished doctoral dissertation, University of California, Los Angeles.
Weigle, S. C. (1994b). Effects of training on raters of ESL compositions. Language Testing, 11 (2), 197–223.
Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15 (2), 263–287.
Witte, S. (1983a). Topical structure analysis and revision: An exploratory study. College Composition and Communication, 34 (3), 313–341.
Witte, S. (1983b). Topical structure and writing quality: Some possible text-based explanations of readers' judgments of students' writing. Visible Language, 17, 177–205.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
Wu, J. (1997). Topical structure analysis of English as a second language (ESL) texts written by college Southeast Asian refugee students. Unpublished doctoral dissertation, University of Minnesota.