
Language Testing 2005; 22 (3): 281–300. DOI: 10.1191/0265532205lt309oa. © 2005 Edward Arnold (Publishers) Ltd

Address for correspondence: Cyril J. Weir, Professor in ELT; Director, Centre for Research in Testing, Evaluation and Curriculum, Roehampton University, Roehampton Lane, London, SW15 5PH, UK; email: [email protected]

Limitations of the Common European Framework for developing comparable examinations and tests

Cyril J. Weir, Roehampton University

The Common European Framework of Reference (CEFR) posits six levels of proficiency and defines these largely in relation to empirically derived difficulty estimates based on stakeholder perceptions of what language functions expressed by ‘Can-do’ statements can be successfully performed at each level. Though also containing much valuable information on language proficiency and advice for practitioners, in its present form the CEFR is not sufficiently comprehensive, coherent or transparent for uncritical use in language testing. First, the descriptor scales take insufficient account of how variation in terms of contextual parameters may affect performances by raising or lowering the actual difficulty level of carrying out the target ‘Can-do’ statement. In addition, a test’s theory-based validity – a function of the processing involved in carrying out these ‘Can-do’ statements – must also be addressed by any specification on which a test is based. Failure to explicate such context and theory-based validity parameters – i.e., to comprehensively define the construct to be tested – vitiates current attempts to use the CEFR as the basis for developing comparable test forms within and across languages and levels, and hampers attempts to link separate assessments, particularly through social moderation.

I Introduction

The Common European Framework of Reference (CEFR) attempts to describe language proficiency through a group of scales composed of ascending level descriptors couched in terms of outcomes. These descriptor scales are supplemented by a broad compendium of useful information on consensus views regarding language learning, teaching and assessment.

In relation to testing, the following strong claims are made for the CEFR:

… the Framework can be used:


1) for the specification of the content of tests and examinations;
2) for stating the criteria to determine the attainment of a learning objective;
3) for describing the levels of proficiency in existing tests and examinations thus enabling comparisons to be made across different systems of qualifications.

(Council of Europe, 2001: 178)

North (2004) advocates a lighter approach to its use:

A key idea always present in the development of the CEFR was to use the descriptor scales … to profile the content of courses, assessments and examinations. These can then be related to each other through their CEFR profile without making direct comparisons between them or claiming that one is an exact equivalent of the other.

The Association of Language Testers in Europe (ALTE) sees the vertical system of ‘natural levels’ in the scales as providing a basis for building arguments about how their own examinations could be aligned to this common reference tool. However, they do not regard it as a suitable basis for developing their examinations ab initio and, like North, are sensitive to the possibility of false assumptions of equivalence where tests constructed for different purposes and audiences and which view and assess language constructs in different ways are located together on the same scale point.

Other articles in this issue exemplify how the CEFR has contributed usefully to test development in Europe and identify a variety of qualities that make it the least arbitrary sequence of scaled proficiency descriptors available to us at the moment. In contrast, in this article I have been asked, as the title indicates, to confine myself to discussing whether the use of the existing CEFR for test development or comparability is problematic in any way. Accordingly I only consider the four areas of concern listed below where existing research on the CEFR suggests further support for test developers is required:

• the scales are premised on an incomplete and unevenly applied range of contextual variables/performance conditions (context validity);

• little account is taken of the nature of cognitive processing at different levels of ability (theory-based validity);

• activities are seldom related to the quality of actual performance expected to complete them (scoring validity);

• the wording for some of the descriptors is not consistent or transparent enough in places for the development of tests.


Limitations of space preclude a full treatment of all four and so I have chosen to focus on context and theory-based validity as these impact the most on test development and comparability. Only brief attention is paid to scoring validity and transparency problems.

The needs of test/examination constructors may differ markedly from the users of test results (see Alderson, 1991). The latter may be satisfied with fewer, more generic, less detailed, context-independent and non-theory-based specifications within and across languages. To develop tests at different levels of proficiency, however, we need more comprehensive specifications (see, for example, Bachman, 1990; Bachman and Palmer, 1996; Davidson and Lynch, 2002; Weir, 2004). To establish equivalence between the same level examinations across forms from year to year, or between examinations offered at a particular level by different providers, comprehensive specification of the construct to be measured is as essential as demonstrating statistical equivalence.
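By way of illustration only, the sketch below (in Python) shows one way such a specification might be made explicit and machine-readable: a single task is described by a handful of the contextual parameters discussed later in this article (purpose, response format, discourse mode, length, timing, topic, lexis), so that two forms or two providers' tasks at the 'same' level can be compared parameter by parameter rather than by level label alone. The field names and example values are my own assumptions for the sketch, not content drawn from the CEFR or from any existing specification scheme.

from dataclasses import dataclass

@dataclass
class TaskSpec:
    """Hypothetical specification of one reading task at one CEFR level.

    The fields mirror some of the context-validity parameters discussed in
    this article; the values used below are invented examples, not
    calibrated CEFR content.
    """
    cefr_level: str             # e.g. "B2"
    purpose: str                # e.g. "expeditious search reading"
    response_format: str        # e.g. "short-answer questions"
    discourse_mode: str         # genre / rhetorical pattern of the text
    text_length_words: int      # target text length
    time_allowed_minutes: int   # time allowance for the task
    topic_domain: str           # e.g. "public", "educational"
    assumed_lexical_range: str  # e.g. "most frequent 4000 word families"

def differing_parameters(a: TaskSpec, b: TaskSpec) -> dict:
    """Return the parameters on which two task specifications differ."""
    return {name: (getattr(a, name), getattr(b, name))
            for name in a.__dataclass_fields__
            if getattr(a, name) != getattr(b, name)}

# Two invented specifications for tasks that both claim to target "B2".
form_a = TaskSpec("B2", "expeditious search reading for specific information",
                  "short-answer questions", "expository magazine article",
                  700, 12, "public", "most frequent 4000 word families")
form_b = TaskSpec("B2", "careful reading for main ideas",
                  "four-option multiple choice", "expository magazine article",
                  450, 20, "public", "most frequent 4000 word families")

# The level label matches, but the construct-relevant conditions do not.
print(differing_parameters(form_a, form_b))

On this view, a claim that two forms sit at the same level would be supported by showing that their specification profiles match, or differ only in documented ways, alongside the usual statistical evidence.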

Despite the strong claims made above I recognize that the CEFR was not designed specifically to meet the needs of language testers and that it will require considerable, long-term research, much reflective test development by providers, and prolonged critical interaction between stakeholders in the field to address these deficiencies.

To demonstrate that their tests are fair measurements of a specified level of proficiency, providers need to furnish evidence that they are construct-valid, i.e., that they adequately address context, theory-based and scoring parameters of validity appropriate to the level of language ability under consideration (Weir, 2004). A framework is required that helps identify the elements of both context and processing and the relationships between these at varying levels of proficiency, i.e., one that addresses both situational and interactional authenticity (see Bachman and Palmer, 1996). Since the CEFR is deficient in both, it is not surprising that a number of studies have experienced difficulty in attempting to use the CEFR for test development or comparability purposes (see Huhta et al., 2002; Jones, 2002; Little et al., 2002; Alderson et al., 2004; Morrow, 2004).

Little et al. (2002: 64) point to such a limitation for using the CEFR in assessment in their attempts to develop English-language provision for refugees in Ireland. They warn that ‘[e]vidence of the difficulty that learners encounter in using the CEFR to maintain on-going reflective self-assessment suggests a need for more detailed descriptions of proficiency relevant to particular domains of language learning.’


In a similar vein, Huhta et al. (2002: 133) comment: ‘The theoretical dimensions of people’s skills and language use which CEFR discusses are on a very high level of abstraction.’ It is interesting to note that Huhta et al. (2002: 134) feel the need to draw on resources other than the CEFR in their work on the DIALANG Assessment Framework (DAF):

[S]ince these inventories are summaries with narrow exemplification and since the DAF was developed as a resource for test writers, we found it useful to develop links to the more extensive lists of examples in the Council of Europe Waystage, Threshold and Vantage Level documents.

Alderson (n.d.), in a rare substantive review of these early European language benchmarks, notes that even these earlier Council of Europe publications (Waystage, Threshold and Vantage Level documents) are not particularly helpful in distinguishing between levels of proficiency in terms of contextual variables:

[T]he grammar identified for each level appears not to be too different for Threshold and Waystage … Under Degree of Skill, which is supposedly concerned with ‘how well learners should be able to do what is specified’, one would expect, prima facie, that this would differ between Waystage and Threshold learners … the authors … say it is impossible to specify. No differences are given, and the chapters on Degree of Skill (Threshold Chapter 14, Waystage Chapter 12) are identical … Appendix 3 (of the DAF) compares and contrasts the Communicative Activities contained in Waystage, Threshold and Vantage … As the authors of DAF note, ‘each level lacks specifications for production and mediation’. Appendix 4 of DAF compares Texts, as appearing in Waystage, Threshold and Vantage (Chapters 6, 9 and 9 respectively). The differences between Waystage and Threshold appear minimal. Appendix 5 to DAF contrasts Functions in the three publications. The similarities are striking; the differences, although there, do not seem significant.

A framework for testing purposes would need to comprehensively address at different levels of proficiency the components of validity detailed in Figure 1. Context validity is concerned with the social dimensions of a task, including the setting of the task, and in particular the linguistic and social demands arising out of the task or resulting from interaction with an interlocutor/addressee. It is likely that different demands will be made on participants in terms of coping with these contextual variables at the different levels of proficiency addressed by the CEFR scales. Exams need to reflect how constructs differ from level to level in terms of the contextual conditions affecting task performance. They need, where applicable, to operationalize appropriate, significant, divergent conditions (single or configured) under which tasks are performed that enable us to differentiate between performances of a task at adjacent proficiency levels. The positioning of this aspect of validity is represented by the context validity box in Figure 1 and a number of individual parameters that may potentially aid differentiation between levels are listed in Figure 2 (see below).

Secondly, we need to establish the nature and extent of the cognitive and metacognitive processing participants have to perform in carrying out tasks at various levels on scales such as the CEFR (see the theory-based validity box in Figure 1 and Figure 3 (see below) for a processing model for writing). The CEFR does not currently offer a view of how language develops across these proficiency levels in terms of cognitive or metacognitive processing (however, for a general discussion of such processing, see Council of Europe, 2001: Section 3.6). The likelihood is that such processing will become more complex and comprehensive as language proficiency develops between CEFR levels A1 and C2. Defining such progression will, however, be no simple matter.

Furthermore, the issue of quality of performance necessarily involves us in scoring validity. We need to know how well the participants are expected to carry out a task at a particular level in terms of clear and explicitly specified criteria of assessment that are symbiotically linked to the context-based and theory-based parameters of the construct being measured. In addition to appropriate scoring criteria we need convincing evidence on test raters and rating which confirms that we can rely on the results of the test. The parameters affecting test reliability in the rating process are listed in Figure 4 (see Weir and Milanovic, 2003). Any comparison of tests needs to take account of all these parameters of scoring validity.

Figure 1 An internal construct validity framework

Figure 2 Parameters of context validity (Source: Weir, 2004)

Figure 3 Parameters of theory-based validity: an example for writing (Source: Weir, 2004)

II Context validity

The CEFR claims to be comprehensive: ‘it should attempt to specify as full a range of language knowledge, skills and use as possible … and that all users should be able to describe their objectives, etc., by reference to it’ (Council of Europe, 2001: 7). However, although the Framework volume introduces many concepts, it does not actually incorporate these into the scales. It has a frequently recurring advice box (Council of Europe, 2001):

Users of the framework may wish to consider and where appropriate state…

This advice box appears in the discussion on:

• context of language use (pp. 44–51), including domains, situations, conditions and constraints, mental contexts of participants;

• communication themes (pp. 51–53), including thematic areas/topics, sub-areas and specific notions the learners will need in selected domains;

• communicative tasks and purposes (pp. 53–56) the learners may need to perform in the various domains;

• texts (pp. 93–100), including media and text types;

• communicative language competences (pp. 108–30), including linguistic, sociolinguistic and pragmatic competences; and

• tasks and their role in language teaching (pp. 157–166).

Figure 4 Parameters of scoring validity (Source: Weir, 2004)

Scoring validity – Rating:
• Criteria/rating scale
• Raters
• Rating procedures
  – Rater training
  – Standardization
  – Rating conditions
  – Rating
  – Moderation
  – Statistical analysis
• Grading and awarding

Decisions with regard to the content of tests are mostly left to the individual user, and apart from taxonomic generic lists no further help is provided for the test writer to determine what may or may not be appropriate at the six different levels. The new draft Manual on relating examinations to the CEFR (Council of Europe, 2003; Figueras et al., this issue) similarly expects test developers to decide for themselves what contextual parameters are appropriate at different levels on the CEFR scales. Thus, there is very little help for those wishing to understand how language proficiency develops in these respects. The CEFR may never have been intended to provide a comprehensive account of this but language testers need to, however difficult this may prove. Only the communicative language activities and strategies (pp. 57–90) are laid out explicitly for the six levels of the framework based on the earlier, rigorous work by North (2000), and in this particular respect the CEFR does provide a valuable guide for test developers.

Jones’ (2002: 181) finding that ‘different people tend to understand “Can-do” somewhat differently’ is indicative of the need to search for greater precision and explicitness in test specification than is currently provided by the CEFR. Kaftandjieva and Takala (2002: 113) argue for making the CEFR more comprehensive when they report: ‘the construct of language proficiency in writing is not quite so well defined as the other two constructs … The assessment of the level of a particular writing “Can-do” statement could vary between level 2 (A2) and level 6 (C2) (p. 126).’ Jones (2002: 177) had a similar experience in trying to map ALTE ‘Can-do’ statements on to the CEFR: ‘One problem is that in the current analysis the highest level (C2) statements are not well distinguished from the level below (C1).’ The likely root cause is that so few contextual parameters or descriptions of successful performance are attached to such ‘Can-do’ statements. Both the context and the quality of performance may be needed to ground these distinctions.


Hawkey and Barker (2004) – reporting on the work of Cambridge ESOL in developing a common scale for writing to be used across the levels in its main suite examinations – note that it was necessary to draw upon corpus analysis to identify features of texts that helped discriminate between candidates along the CEFR scale, as they were unable to use functional competence to maintain these distinctions. Saville (personal communication) also draws attention to the quality of the product in writing as the locus for differences at the higher levels of proficiency:

The distinctive features are often difficult to characterise at C2 level, but tend to relate to discoursal and pragmatic features and how these lead to complex or sophisticated uses of grammar and lexis. C2 might be characterised by more of the following: complexity, precision, flexibility, sophistication, naturalness, impact on reader, independence than C1.

The potential importance of a number of these contextual parameters for better grounding distinctions between adjacent proficiency levels on the CEF scales is discussed next (see Figure 2 for a summary of these), and the effects of such choices on theory-based validity noted.

1 Purpose

Having a clear purpose in completing a test task will facilitate goal setting and monitoring, two key meta-cognitive strategies in language processing. There is a symbiotic relationship between the choices we make in relation to test purpose (and the other contextual variables listed) and the processing that results in task completion.

The CEFR is not helpful about the purposes for which we use language at the different levels and lacks specification of the sub-skills of comprehension, such as comprehension of main ideas, important details, inferences and conclusions; recognizing the structure of a text; and recognizing connections between its parts (Alderson et al., 2004: 44). The purpose of a reading activity, for example, will determine the type of reading performed on it (see Urquhart and Weir, 1998). If the purpose is to go quickly through a passage to extract dates and figures, or through an index to find a page reference, then scanning is likely to be the type of reading called for. To get information for a university assignment might involve expeditious search reading of several articles and/or books with careful reading kicking in when appropriate information is located. To establish the general gist of an article to decide if it is worth reading, you might skim it quickly. Are these purposes and the different cognitive processing they engender appropriate across all proficiency levels?

Exam providers need to clearly specify the extent to which purposes differ from level to level and provide evidence on how this criterion can be used to distinguish between adjacent proficiency levels. Clearly, convincing a listener of your viewpoint is of a different order than directing somebody to the post office a few streets away, but which level(s) is each appropriate for? It may be that a configuration of purpose with other conditions – such as length, time constraints, topic, and channel – better grounds such distinctions between adjacent levels, and we need to address this.

2 Response format

Alderson et al. (2004: 10) note that there is nothing in the CEFR about response format, even though the CEFR is supposed to be a reference point for assessment: ‘in a sense test method is the task, and so a consideration of test methods is crucial.’ Methods may well be used across different levels but, as Weir (2004: Chapter 8) points out, the techniques selected will have clear implications for the context and theory-based validity of the assessment. A decision to use short-answer questions as against true–false items in reading will impact on the processing that takes place. In writing, choice of format will determine whether knowledge telling or knowledge transformation occurs in task completion; these are two very different processing experiences.

3 Time constraints

Alderson (2000: 30) notes that ‘Speed should not be measured without reference to comprehension, but at present comprehension is all too often measured without reference to speed.’ In testing reading it is important to consider the time constraints for the processing of text and answering the items set on it. If time allotment is not carefully planned, it may result in unpredictable performance. If too much time is given in a reading test or is not strictly controlled per section, candidates may simply read a passage intensively, and questions designed to test ability to process text expeditiously (i.e., selectively and quickly) to elicit specified information may no longer activate such operations (for an example of a major research project where this unfortunately happened, see Weir et al., 2000). If time is more than sufficient in an expeditious reading task, then careful cumulative, linear processing rather than quick selective expeditious processing will result. Decisions relating to timing clearly impact on processing and, hence, on the theory-based validity of our test tasks. Test writers need to specify how much time will be required for carrying out various activities at different levels. Without this, construct under-representation may occur. Some notion of the differences across levels in terms of the speed required for reading various texts expeditiously or carefully, dependent on purpose, is currently missing from the CEFR.
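As a rough illustration of the kind of timing specification at issue, the short calculation below (in Python) converts a text length and time allowance into an implied reading rate and checks whether it exceeds an assumed careful-reading rate; the figures are invented for the sketch and are not CEFR values or empirically established norms.

# Hypothetical check of whether a reading task's time allowance forces
# expeditious rather than careful reading (all figures invented).
CAREFUL_READING_WPM = 200   # assumed comfortable careful-reading rate

def implied_rate(text_length_words: int, minutes_allowed: float) -> float:
    """Words per minute needed just to get through the text once."""
    return text_length_words / minutes_allowed

def forces_expeditious_reading(text_length_words: int, minutes_allowed: float) -> bool:
    """True if the implied rate exceeds the assumed careful-reading rate,
    i.e. the task cannot be completed by careful, linear reading alone."""
    return implied_rate(text_length_words, minutes_allowed) > CAREFUL_READING_WPM

# Example: a 1000-word text with 4 minutes allowed implies 250 wpm,
# so candidates would have to read selectively (scan or search read).
print(implied_rate(1000, 4))                 # 250.0
print(forces_expeditious_reading(1000, 4))   # True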

4 Channel

The channel for the communication can have an obvious impact on the performance in a speaking test. For example, it may place a greater burden on candidates if they have to conduct a telephone conversation at distance as against carrying out a face-to-face conversation in the same room. Jones (2002: 173) found that in his research on comparing the ALTE and CEFR ‘Can-do’ statements:

… where statements moved in difficulty could often be explained by certain features of the statement. Features which generally made statements more difficult included the use of very specific exemplification … reference to a difficult channel of communication (e.g. the telephone). Features which generally made statements easier included generality or brevity. (emphasis added)

If the positioning of a ‘Can-do’ statement on the scales shifts once such contextual parameters are identified, then doubts must arise about the generalizability of the scales in their present form.

5 Discourse modes

Alderson (n.d.) points out that:

The CEFR (2001) itself contains little information as to the content of any given level, other than what is contained in the numerous scales. For example, it is not easy to determine what sort of written and spoken texts might be appropriate for each level.

Similarly, Urquhart and Weir (1998: 141 ff.) argue that test developers must generate evidence on what genre, rhetorical tasks and patterns of exposition are appropriate at each proficiency level.


In their work on DIALANG, Huhta et al. (2002: 137) also found that the CEFR provides little assistance on what discourse types are suitable across skills across levels. They argue that:

The CEFR discussion of text types only lists different sources for texts, for instance, books, magazines … While texts can be so categorized, this classification does not help characterise the nature of language use in the text. Such modalities contain a plurality of different genres and text-types. Yet, if discourse forms and typical patterns of textual organisation are taken into account texts can be grouped into categories, which share common textual features. This was the type of textual categorization which was needed in DIALANG.

6 Text length

Johnston (1984: 151) noted that the texts employed in reading comprehension tests tended to be many and brief, and this trend has continued to this day. An interesting comparison can be made with the Test of English for Educational Purposes (TEEP) (Weir, 1983). This test was designed to test academic English and involved selecting texts of over 1000 words as a means of testing expeditious reading skills, on the grounds that such texts were more representative of real-life demands than texts of the length (relatively speaking) used in many exams for similar purposes (e.g., ELTS and TOEFL). Alderson et al. (2004: 12) point out that length is not defined in the CEFR, which relies on ill-defined terms such as short or long: ‘… individuals cannot decide for themselves what is “short” or “long”.’

7 Topic

The content knowledge required for completing a particular task will affect the way it is completed. The relationship between the content of the text and the candidate’s background knowledge (general knowledge that may or may not be relevant to content of a particular text, which includes cultural knowledge) and subject matter knowledge (specific knowledge directly relevant to text topic and content) needs to be considered (see Douglas, 2000). The CEFR provides no guidance as to the topics that might be more or less suitable at any level in the CEFR.

8 Lexical competence

The CEFR provides little assistance in identifying the breadth and depth of productive or receptive lexis that might be needed to operate at the various levels. Some general guidance is given on the learner’s lexical resources for productive language use but, as Huhta et al. (2002: 131) point out, ‘no examples of typical vocabulary or structures are included in the descriptors.’ The argument that the CEFR ‘is intended to be applicable to a wide range of different languages’ (Huhta et al., 2002) is used as an explanation, but this offers little comfort to the test writer who has to select texts or activities uncertain as to the lexical breadth or knowledge required at a particular level within the CEFR.

Alderson et al. (2004: 13) make a related point that many of the terms in the CEFR remain undefined. They cite the use of ‘simple’ in the scales and argue that difficulties arise in interpreting it because the CEFR:

does not contain any guidance, even at a general level, of what might be simple in terms of structures, lexis or any other linguistic level. Therefore the CEFR would need to be supplemented with lists of grammatical structures and lexical items for each language to be tested, which could be referred to if terms like ‘simple’ are to have any meaning for item writers or item bank compilers.

They note that even if structure and lexis were identified for each level in the framework, problems may still occur for tests with multilingual audiences because simplicity might also be a function of a candidate’s language background. Jones (2002: 174) refers to this in his research on linking ALTE and CEFR ‘Can-do’ statements: ‘It appears that a number of statements vary in difficulty according to the respondent’s first language (or the language of the questionnaire).’

9 Structural competence

Alderson (2000: 37) refers to the:

importance of knowledge of particular syntactic structures, or the ability to process them, to some aspects of second language reading … The ability to parse sentences into their correct syntactic structure appears to be an important element in understanding text.

Texts with less complex grammar on the whole tend to be easier than texts with more complex grammar. Recent work by Shiotsu (2003) demonstrates the importance of syntactic knowledge in explaining variance in tests of reading administered to Japanese undergraduates.

The CEFR provides little help in deciding what level or range of syntax might help define a particular proficiency level. In relation to teaching, Keddle (2004: 43–44) notes:

However, there are challenges in using the CEFR in schools. It doesn’t measure grammar-based progression and this creates a barrier between the descriptors and the students’ achievements … The integration of the CEFR with pre-existing courses was problematic, not least because of the mismatch between the checklists and the standard, accepted, grammar syllabus progression in secondary school classrooms … As a course designer, I would be more comfortable if there were more guidance in relation to basic grammar areas.

10 Functional competence

Based on the foundations of the earlier functional-notional approach to language in Europe (Threshold Level, Waystage and Vantage studies) and the ground-breaking empirical work of North (2000) in calibrating functions on to a common scale, functional competence is well mapped out in the CEFR and is one of its major strengths. Its merits are discussed elsewhere in this volume and it bears testimony to the rigour that will be needed to ground the other parameters of context validity discussed above but also highlights the value of such efforts.

III Theory-based validity

In a strong tradition established through the earlier Threshold Level, Waystage and Vantage studies, the CEFR has an avowedly sociolinguistic orientation and limited attention is paid to analysis of psycholinguistic parameters. Test developers need to be aware of prevailing theories concerning the language processing required in real-life language use. Where such use is to be operationalized in a test, this underlying processing should be replicated as far as is possible, i.e., it must demonstrate theory-based validity (for the theory of the cognitive processes involved in speaking, see Levelt, 1993; Fulcher, 2003; for writing, see Grabe and Kaplan, 1996; Hyland, 2002; for reading, see Urquhart and Weir, 1998; Alderson, 2000; Grabe and Stoller, 2002; for listening, see Buck, 2001; Rost, 2002).

Figure 3 provides an overview of the writing process as an example of theory-based validity parameters testers need to address in their definitions of construct at varying levels of proficiency. The figure details the executive processes and resources that may be called into play in a writing task, depending on the requirements of the test task in terms of the parameters outlined under context validity above.

In responding to writing tasks the candidate is likely to call on explicit or implicit knowledge of the executive writing processes and resources listed in Figure 3 to varying degrees (for discussion of these metacognitive strategies, see Bachman, 1990; Douglas, 2000). Few people would argue that answering a multiple-choice test of structure and written expression equates with the executive processing and resources required in completing an academic essay requiring knowledge transformation. There are also clearly differences between what happens in filling out a form, writing a short letter, writing a 45-minute timed essay, composing a university dissertation or compiling a lengthy report in respect of the processes needed to complete the task. These differences have to be clarified and a consensus view developed on what is appropriate in terms of theory-based validity at the different levels in the CEFR.

Goal-setting, topic and genre modification, generation and organization of ideas are all likely to vary according to the task, and monitoring of content will vary in accordance with how explicit the assessment criteria have been made in the task rubric. The task will also make demands on both content and language knowledge and these demands will vary from task to task. Similarly these need to be made explicit in relation to proficiency level.

In short, testers need more detail on the components of processing in both receptive and productive modes and how these develop through the different levels of proficiency. Alderson (n.d.) expresses concern that the CEFR is none too helpful in this respect: ‘What sort of operation – be that strategy, sub skill, or pragmatic inference … might be applicable at which level?’

In the Dutch CEFR Construct project, Alderson et al. (2004: 3, 9) go to the heart of this problem:

The CEFR, being such a comprehensive description of language use, can also be considered, implicitly at least, as a theory of language development, but the ‘Can-do’ scales for reading and listening provide a taxonomy of behaviours rather than a theory of development in listening and reading activities … The first major gap in the CEF we identified was a description of the operations that comprehension consists of and how a theory of comprehension develops.

Decisions taken with regard to context validity parameters have important effects on the mental processing that results when candidates perform the tasks in the test situation. In this view context and theory-based validity are inextricably linked. Both these aspects of validity will also impact on scoring criteria. The demands of the task and the processing it involves will have a direct bearing on the criteria that can and should be appropriately applied to the output of the task. There is a symbiotic relation between these elements of validity. The CEFR does not specify these with sufficient coherence at and across the different levels to enable valid tests to be developed. See Figure 4.


Scoring validity is an area in which the CEFR has almost no advice to offer, despite its claim to be a comprehensive and coherent basis for test comparability. Two different examinations might be located on the same CEFR level after social moderation and calibration but, unless they both exhibit equivalent scoring validity in terms of each of the parameters in Figure 4, they are simply not comparable. Test users need to be sure that each instrument can offer adequate evidence that it is meeting consensus requirements in terms of all these aspects of the scoring procedure. Lack of rigour in any one of these is likely to negate the validity of a test.

IV Transparency problems in using the CEFR

To fulfil its functions the CEFR, as well as being comprehensive and coherent, also needs to be transparent: ‘information must be clearly formulated and explicit, available and readily comprehensible to users’ (Council of Europe, 2001: 7).

Alderson et al. (2004), in the Dutch CEFR Construct project, identified a number of practical problems at the microlinguistic level with using the CEFR scales as a basis for test specification. First, they expressed concern over the synonymy or not of the variety of terms used to describe an operation. They provide rather a large number of examples to illustrate this. In level B2 eight different verbs are used to indicate comprehension: understand, scan, monitor, obtain, select, evaluate, locate and identify. They raise the critical issue of whether these are ‘stylistic synonyms’ or whether they in fact represent ‘real differences in cognitive processes’ (Alderson et al., 2004: 9). As mentioned in the section above on theory-based validity, the CEFR provides no information on the latter. Alderson et al. (2004: 10) conclude: ‘we will have to have recourse to theories of comprehension to resolve the issue.’

They also show how ‘many formulations in the “Can-do” statements are inconsistent’. They provide the following examples:

• Similar descriptions occur at different levels.

• Some operations, e.g., ‘recognize’, are only mentioned at three out of the six levels (A1, B1, C1).

• Some verbs do not appear in all reading and listening levels, e.g., ‘infer’, which only occurs at C1, whereas you might expect it to appear in all levels.

• Some conditions appear in some levels and not others, e.g., ‘with a large degree of independence’ is only mentioned at B2 level.

Potential confusion in terminology and internal inconsistencies in the scales need sorting out in order to optimize the benefits of these scales for test development purposes.

V Future directions

Great store is set on the CEFR being useful in helping define objectives for pedagogy and assessment (Council of Europe, 2001: Chapter 1) but deficiencies still remain. Language test constructors not only need to know what and how learners ‘Can-do’ at each of the six CEFR scale levels but also under what performance conditions these activities are normally carried out and what quality level in terms of specified criteria the performance is usually expected to meet. We need to address the demands of theory-based validity, context-based validity and scoring validity more comprehensively than the present CEFR does.

North (2002: 163), in a report on his own research into a CEFR-based self-assessment tool for university entrance, concludes:

What is perhaps of wider interest is the high stability in the values of CEFR descriptors in an instrument consisting of 50% new formulations. In addition, the fact that all the new formulations landed in the bands intended, is an indication that after a detailed common reference framework has been established, well crafted descriptors defining other aspects in terms of CEFR levels can be successfully targeted … further development can ‘work’ from a measurement perspective as well as from an educational perspective.

This is encouraging news for language test developers interested in creating a more comprehensive and valid set of descriptors. Eventually, if a new version of the CEFR is developed by testers that better defines content by level (in terms of the context validity elements outlined above and summarized in Figure 2), the coverage of an exam can be profiled against this and then, through calibration and standard setting, the results of content analysis can be checked against psychometric value.
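A minimal sketch of that final check, under the assumption that each item carries both an intended CEFR level from the content analysis and an empirically calibrated difficulty estimate (for example a logit value from a calibration exercise), is given below; the item data and the crude ordering test are invented purely for illustration and do not represent any actual standard-setting procedure.

# Hypothetical check that content-based level assignments agree with
# empirically calibrated item difficulties (all values invented).
from collections import defaultdict
from statistics import mean

# (intended CEFR level from content analysis, calibrated difficulty in logits)
items = [
    ("A2", -1.8), ("A2", -1.2), ("B1", -0.4), ("B1", 0.1),
    ("B2", 0.7), ("B2", 1.1), ("C1", 1.9), ("C1", 2.4),
]

order = ["A1", "A2", "B1", "B2", "C1", "C2"]

by_level = defaultdict(list)
for level, difficulty in items:
    by_level[level].append(difficulty)

# Mean calibrated difficulty per intended level, in CEFR order.
means = [(lvl, mean(by_level[lvl])) for lvl in order if lvl in by_level]
print(means)

# In this crude sense, the content analysis is consistent with the
# psychometric evidence if mean difficulty rises monotonically with level.
monotonic = all(m1 < m2 for (_, m1), (_, m2) in zip(means, means[1:]))
print("content/psychometric ordering consistent:", monotonic)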

The scale’s theory-based validity, though of no less importance, may take longer to address. The nature of cognitive and metacognitive processing in spoken and written modes is less overt and susceptible to investigation than context variables, which are more readily amenable to expert scrutiny and empirical investigation.


It is crucial that the CEFR is not seen as a prescriptive device but rather a heuristic, which can be refined and developed by language testers to better meet their needs. For this particular constituency it currently exhibits a number of serious limitations such that comparisons based entirely on the scales alone might prove to be misleading, given the insufficient attention paid in these scales to issues of validity. The CEFR as presently constituted does not enable us to say tests are comparable let alone equip us to develop comparable tests.

Exam boards and other institutions offering high-stakes tests need to demonstrate and share how they are seeking to meet the demands of context, theory-based and scoring validity outlined in this article and crucially how they in fact operationalize criterial distinctions between levels in their tests. Future research needs to investigate whether the parameters discussed in this article either singly or in configuration can help better ground the distinctions in proficiency represented by each level on the CEFR scales.

VI References

Alderson, J.C. 1991: Bands and scores. In Alderson, J.C. and North, B., editors, Language testing in the 1990s. London: Macmillan, 71–86.

—— 2000: Assessing reading. Cambridge: Cambridge University Press.

—— editor, 2002: Common European Framework of Reference for Languages: learning, teaching, assessment: case studies. Strasbourg: Council of Europe.

—— n.d.: Waystage and threshold. Or does the emperor have any clothes? Unpublished article.

Alderson, J.C., Figueras, N., Kuijper, H., Nold, G., Takala, S. and Tardieu, C. 2004: The development of specifications for item development and classification within the Common European Framework of Reference for Languages: learning, teaching, assessment. Reading and listening. Final report of the Dutch CEF construct project. Unpublished document.

Bachman, L.F. 1990: Fundamental considerations in language testing. Oxford: Oxford University Press.

Bachman, L.F. and Palmer, A. 1996: Language testing in practice. Oxford: Oxford University Press.

Buck, G. 2001: Assessing listening. Cambridge: Cambridge University Press.

Council of Europe 2001: Common European Framework of Reference for Languages: learning, teaching, assessment. Cambridge: Cambridge University Press.

—— 2003: Relating language examinations to the Common European Framework of Reference for Languages: learning, teaching, assessment (CEF). Manual: preliminary pilot version. DGIV/EDU/LANG 2003, 5. Strasbourg: Language Policy Division.

Davidson, F. and Lynch, B.K. 2002: Testcraft: a teacher’s guide to writing and using language test specifications. New Haven, CT: Yale University Press.

Douglas, D. 2000: Assessing language for specific purposes. Cambridge: Cambridge University Press.

Fulcher, G. 2003: Testing second language speaking. Harlow: Pearson.

Grabe, W. and Kaplan, R. 1996: Theory and practice of writing. London: Longman.

Grabe, W. and Stoller, F.L. 2002: Teaching and researching reading. London: Longman.

Hawkey, R. and Barker, F. 2004: Developing a common scale for the assessment of writing. Assessing Writing 9, 122–59.

Huhta, A., Luoma, S., Oscarson, M., Sajavaara, K., Takala, S. and Teasdale, A. 2002: A diagnostic language assessment system for adult learners. In Alderson, 2002: 130–46.

Hyland, K. 2002: Teaching and researching writing. London: Longman.

Johnston, P. 1984: Prior knowledge and reading comprehension test bias. Reading Research Quarterly 19, 219–39.

Jones, N. 2002: Relating the ALTE framework to the Common European Framework of Reference. In Alderson, 2002: 167–83.

Kaftandjieva, F. and Takala, S. 2002: Council of Europe scales of language proficiency: a validation study. In Alderson, 2002: 106–29.

Keddle, J.S. 2004: The CEF and the secondary school syllabus. In Morrow, 2004: 43–54.

Levelt, W.J.M. 1993: Speaking: from intention to articulation. Cambridge, MA: MIT Press.

Little, D., Lazenby Simpson, B. and O’Connor, F. 2002: Meeting the English language needs of refugees in Ireland. In Alderson, 2002: 53–67.

Morrow, K., editor, 2004: Insights from the Common European Framework. Oxford: Oxford University Press.

North, B. 2000: The development of a common framework scale of language proficiency. New York: Peter Lang.

—— 2002: A CEF-based self-assessment tool for university entrance. In Alderson, 2002: 146–66.

—— 2004: Relating assessments, examinations, and courses to the CEF. In Morrow, 2004: 77–90.

Rost, M. 2002: Teaching and researching listening. London: Longman.

Shiotsu, T. 2003: Linguistic knowledge and processing efficiency as predictors of L2 reading ability: a component skills analysis. Unpublished PhD dissertation, University of Reading.

Urquhart, A.H. and Weir, C.J. 1998: Reading in a second language: process, product and practice. Harlow: Longman.

Weir, C.J. 1983: Identifying the language problems of overseas students in tertiary education in the UK. Unpublished PhD dissertation, University of London.

Weir, C.J. 2004: Language testing and validation: an evidence-based approach. Basingstoke: Palgrave Macmillan.

Weir, C.J. and Milanovic, M., editors, 2003: Continuity and innovation: the history of the CPE 1913–2002. Studies in Language Testing 15. Cambridge: Cambridge University Press.

Weir, C.J., Yang, H. and Jin, Y. 2000: An empirical investigation of the componentiality of L2 reading in English for academic purposes. Studies in Language Testing 12. Cambridge: Cambridge University Press.
