Language Assessment: Principles and Classroom Practices


Chapter 2: Principles of Language Assessment (pages 25-51)

Book: Language Assessment: Principles and Classroom Practices
Written by: H. Douglas Brown and Priyanvada Abeywickrama
Summarized by: Saeed Mojarradi
Professor: Dr. Zoghi

This chapter addresses how principles of language assessment can and should be applied to formal tests, with the ultimate recognition that these principles also apply to assessments of all kinds.

How do you know if a test is effective, appropriate, useful, or, in down-to-earth terms, a good test?

Can it be given within appropriate administrative constraints? Is it dependable? Does it accurately measure what you want it to measure?

Is the language in the test representative of real-world language use? Does the test provide information that is useful for the learner?

There are five cardinal criteria for “testing a test”:

1- Practicality
2- Reliability
3- Validity
4- Authenticity
5- Washback

Practicality refers to the logistical, down-to-earth, administrative issues involved in making, giving, and scoring an assessment. These include costs, the amount of time it takes to construct and to administer, ease of scoring, and ease of interpreting/reporting the results.

A practical test

- Stays within budgetary limits
- Can be completed by the test-taker within appropriate time constraints
- Has clear directions for administration
- Appropriately utilizes available human resources
- Does not exceed available material resources
- Considers the time and effort involved for both design and scoring

Which tests are impractical?

- A test of language proficiency that takes a student five hours to complete is impractical.
- A test that requires individual one-to-one proctoring is impractical for a group of several hundred test-takers and only a handful of examiners.
- A test that takes a few minutes for a student to take and several hours for an examiner to evaluate is impractical for most classroom situations.
- A test that can be scored only by computer is impractical if scoring takes too long.


Reliability

A reliable test is consistent and dependable. If you give the same test to the same student or matched students on two different occasions, the test should yield similar results.

A reliable test …

- Is consistent in its conditions across two or more administrations
- Gives clear directions for scoring/evaluation
- Has uniform rubrics for scoring/evaluation
- Lends itself to consistent application of those rubrics by the scorer
- Contains items/tasks that are unambiguous to the test-taker
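To make the idea of consistency across administrations concrete, here is a minimal sketch, not from the book, that treats reliability as the correlation between two administrations of the same test; the student scores and the pearson_r helper are invented for illustration.

```python
# Illustrative sketch: test-retest reliability as a correlation
# between two administrations of the same test.
# The scores below are hypothetical, invented for this example.

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Same five students, two administrations of the same test
first_administration = [72, 85, 60, 90, 78]
second_administration = [70, 88, 62, 91, 75]

reliability = pearson_r(first_administration, second_administration)
print(f"Estimated test-retest reliability: {reliability:.2f}")  # close to 1.0 means consistent results
```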

What is unreliability?

The issue of the reliability of tests can be better understood by considering a number of factors that can contribute to their unreliability.

There are four possible factors regarding such fluctuations:

a) The student
b) The scoring
c) The test administration
d) The test itself

Student-Related Reliability

The most common learner-related issue in reliability is caused by temporary illness, fatigue, a bad day, anxiety, and other physical or psychological factors, which may make an observed score deviate from one’s true score.

Also included in this category are such factors as a test-taker’s “test-wiseness,” or strategies for efficient test-taking.

Rater Reliability

Human error, subjectivity, and bias may enter into the scoring process. Inter-rater reliability occurs when two or more scorers yield consistent scores of the same test. Failure to achieve inter-rater reliability could stem from lack of adherence to scoring criteria, inexperience, inattention, or even preconceived biases.

Rater reliability issues are not limited to contexts in which two or more scorers are involved. Intra-rater unreliability, an internal factor, is a common occurrence for classroom teachers.
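As a rough, hypothetical illustration of checking rater consistency, the following sketch compares two raters’ rubric scores on the same set of essays; the scores and the simple agreement measures are invented and do not represent the book’s procedure.

```python
# Illustrative sketch: a simple check of inter-rater consistency.
# Two raters score the same five essays on a 1-5 rubric (hypothetical scores).

rater_a = [4, 3, 5, 2, 4]
rater_b = [4, 3, 4, 2, 5]

# Exact-agreement rate: proportion of essays given identical scores
agreements = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
agreement_rate = agreements / len(rater_a)

# Average absolute difference: how far apart the raters are on average
mean_abs_diff = sum(abs(a - b) for a, b in zip(rater_a, rater_b)) / len(rater_a)

print(f"Exact agreement: {agreement_rate:.0%}")       # 60% in this example
print(f"Mean score difference: {mean_abs_diff:.1f}")  # 0.4 rubric points in this example
```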

Test Administration Reliability

Unreliability may also result from the conditions in which the test is administered. The author recounts witnessing the administration of a test of aural comprehension in which an audio player was used to deliver items for comprehension, but because of noise outside the building, students sitting next to open windows could not hear the stimuli accurately.

Other sources of unreliability are found in photocopying variations, the amount of light in different parts of the room, variations in temperature, and even the condition of desks and chairs.


Test Reliability

Sometimes the nature of the test itself can cause measurement errors. Tests with multiple-choice items must be carefully designed to include a number of characteristics that will guard against unreliability.

For example, the items need to be evenly difficult, distractors need to be well designed, and items need to be well distributed to make the test reliable. In classroom-based assessment, test unreliability can be caused by many factors, including rater bias. This typically occurs with subjective tests with open-ended responses (e.g., essay responses) that require a judgment on the part of the teacher to determine correct and incorrect answers.

Objective tests, in contrast, have predetermined fixed responses, a format which of course increases their reliability.

Further unreliability may be caused by poorly written test items – that is, items that are ambiguous or that have more than one correct answer. Also, a test that contains too many items may ultimately cause test-takers to become fatigued by the time they reach the later items and respond incorrectly.
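To make the item characteristics mentioned above more concrete, here is a small, hypothetical sketch of basic item analysis: it computes an item-facility index (the proportion of test-takers answering the item correctly) and tallies how often each distractor was chosen. The responses and the answer key are invented for illustration.

```python
# Illustrative sketch: basic item analysis for one multiple-choice item.
# Responses are hypothetical; 'C' is the keyed (correct) answer.
from collections import Counter

responses = ['C', 'A', 'C', 'C', 'B', 'C', 'D', 'C', 'A', 'C']
key = 'C'

# Item facility (difficulty index): proportion answering correctly.
# Values near 0 or 1 suggest the item is too hard or too easy.
item_facility = sum(1 for r in responses if r == key) / len(responses)

# Distractor analysis: a distractor that nobody chooses is doing no work.
distractor_counts = Counter(r for r in responses if r != key)

print(f"Item facility: {item_facility:.2f}")           # 0.60 here
print("Distractor choices:", dict(distractor_counts))  # {'A': 2, 'B': 1, 'D': 1} here
```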

Validity

The most complex criterion of an effective test, and arguably the most important principle, is validity: the extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment.

Samuel Messick defined validity as an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on scores or other modes of assessment.

A valid test …

- Measures exactly what it proposes to measure
- Does not measure irrelevant or contaminating variables
- Relies as much as possible on empirical evidence (performance)
- Involves performance that samples the test’s criterion (objective)
- Offers useful, meaningful information about a test-taker’s ability
- Is supported by a theoretical rationale or argument

A valid test of reading ability actually measures reading ability. To measure writing ability, one might ask students to write as many words as they can in 15 minutes, and then simply count the words for the final score. Such a test would be easy to administer (practical) and the scoring quite dependable (reliable), but it would not constitute a valid test of writing ability without some consideration of comprehensibility, rhetorical discourse elements…

How is the validity of a test established?

There is no final, absolute measure of validity, but several different kinds of evidence may be invoked in support. As Messick emphasized, it is important to note that validity is a matter of degree, not all or none.

Statistical correlation with other related but independent measures is another widely accepted form of evidence. Other concerns about a test’s validity may focus on the consequences of a test, beyond measuring the criteria themselves, or even on the test-taker’s perception of validity.


Content-Related Evidence

If a test actually samples the subject matter about which conclusions are to be drawn, and if it requires the test-taker to perform the behavior that is being measured, it can claim content-related evidence of validity, often popularly referred to as content-related validity.

If a course has perhaps 10 objectives but only two are covered in a test, then content validity suffers.

Another way of understanding content validity is to consider the difference between direct and indirect testing. Direct testing involves the test-taker in actually performing the target task. In an indirect test, learners are not performing the task itself but rather a task that is related in some way. If you intend to test learners’ oral production of syllable stress and your test task is to have learners mark (with written accent marks) stressed syllables in a list of written words, you could argue that you are indirectly testing their oral production. A direct test of syllable production would require that students actually produce target words orally.

Criterion-Related Evidence

A second form of evidence of the validity of a test may be found in what is called criterion-related evidence, also referred to as criterion-related validity, or the extent to which the criterion of the test has actually been reached.

In the case of teacher-made classroom assessments, criterion-related evidence is best demonstrated through a comparison of the results of an assessment with the results of some other measure of the same criterion.

Criterion-related evidence usually falls into one of two categories:

- Concurrent validity
- Predictive validity

A test has concurrent validity if its results are supported by other concurrent performance beyond the assessment itself. For example, the validity of a high score on the final exam of a foreign language course will be substantiated by actual proficiency in the language. The predictive validity of an assessment becomes important in the case of placement tests, admissions assessment batteries, and achievement tests designed to determine a student’s readiness to move on to another unit. The assessment criterion in such cases is not to measure concurrent ability but to assess and predict a test-taker’s likelihood of future success.
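As a hedged illustration of how criterion-related evidence might be examined in practice, the sketch below correlates classroom final-exam scores with an independent measure of the same criterion (concurrent) and with performance in a later course (predictive). All scores are invented, and the use of statistics.correlation simply stands in for whatever statistical procedure an assessor might choose.

```python
# Illustrative sketch: criterion-related evidence as a correlation between
# a classroom test and independent measures of the same criterion.
# All scores are hypothetical. Requires Python 3.10+ for statistics.correlation.
from statistics import correlation

final_exam = [82, 74, 91, 65, 88, 70]       # teacher-made final exam scores
oral_interview = [78, 70, 94, 60, 85, 72]   # independent rating of actual proficiency (concurrent)
next_course_avg = [80, 68, 93, 62, 90, 74]  # performance in the following course (predictive)

concurrent = correlation(final_exam, oral_interview)
predictive = correlation(final_exam, next_course_avg)

print(f"Concurrent evidence (exam vs. interview): r = {concurrent:.2f}")
print(f"Predictive evidence (exam vs. next course): r = {predictive:.2f}")
# Higher positive correlations lend support to the criterion-related validity
# of the exam; low correlations would call it into question.
```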

Construct-Related Evidence

A third kind of evidence that can support validity, but one that does not play as large a role for classroom teachers, is construct-related validity, commonly referred to as construct validity.

A construct is any theory, hypothesis, or model that attempts to explain observed phenomena in our universe of perceptions.

Constructs may or may not be directly or empirically measured – their verification often requires inferential data. Proficiency, communicative competence, and fluency are examples of linguistic constructs; self-esteem and motivation are psychological constructs.


In the field of assessment, construct validity asks, “Does this test actually tap into the theoretical construct as it has been defined?” Let’s say you are assessing a student’s oral fluency. To possess construct validity, your test should account for the various components of fluency: speed, rhythm, juncture, (lack of) hesitations, and other elements within the construct of fluency.

Imagine, for example, that you have been given a procedure for conducting an oral interview. The scoring analysis for the interview includes several factors in the final score:

- Pronunciation
- Fluency
- Grammatical accuracy
- Vocabulary use
- Sociolinguistic appropriateness

If you were asked to conduct an oral proficiency interview that evaluated only pronunciation and grammar, you could be justifiably suspicious about the construct validity of that test.
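To illustrate why omitting components weakens construct validity, here is a small, hypothetical sketch that combines all five factors above into a single interview score; the weights and rubric ratings are invented for illustration and are not the book’s scoring scheme.

```python
# Illustrative sketch: combining the five interview factors into one score.
# Weights and rubric scores (1-5 scale) are hypothetical, for illustration only.

weights = {
    "pronunciation": 0.2,
    "fluency": 0.2,
    "grammatical_accuracy": 0.2,
    "vocabulary_use": 0.2,
    "sociolinguistic_appropriateness": 0.2,
}

student_ratings = {
    "pronunciation": 4,
    "fluency": 3,
    "grammatical_accuracy": 4,
    "vocabulary_use": 5,
    "sociolinguistic_appropriateness": 3,
}

# Weighted composite: every component of the construct contributes to the score.
final_score = sum(weights[factor] * student_ratings[factor] for factor in weights)
print(f"Composite interview score: {final_score:.1f} / 5")  # 3.8 here

# Scoring only pronunciation and grammar would ignore three of the five
# components and weaken the construct validity of the interview.
```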

Construct validity is a major issue in validating large-scale standardized tests of proficiency.

Consequential Validity (Impact)

Consequential validity encompasses all the consequences of a test, including such considerations as its accuracy in measuring intended criteria, its effect on the preparation of test-takers, and the (intended and unintended) social consequences of a test’s interpretation and use.

The term impact is used to refer to consequential validity, perhaps more broadly, encompassing the many consequences of assessment before and after a test administration. The impact of test-taking and the use of test scores can, according to Bachman and Palmer, be seen at both a macro level (the effect on society and education systems) and a micro level (the effect on individual test-takers).

At the macro level, Choi argued that the wholesale employment of standardized tests for such gatekeeping purposes as college admission deprives students of crucial opportunities to learn and acquire productive language skills, causing test consumers to become increasingly disillusioned with EFL testing.

At the micro level, specifically the classroom instructional level, another important consequence of a test falls into the category of washback. Gronlund and Waugh (2008) encouraged teachers to consider the effect of assessments on students’ motivation, subsequent performance in a course, independent learning, study habits, and attitude toward schoolwork.

Face Validity

Face validity refers to the degree to which a test looks right, and appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of the examinees who take it.

Despite the intuitive appeal of the concept of face validity, it remains a notion that cannot be empirically measured or theoretically justified under the category of validity. Many assessment experts view face validity as a superficial factor that is too dependent on the whim of the perceiver.

Teachers can increase students’ perception of fair tests by using:


- A well-constructed, expected format with familiar tasks
- Tasks that can be accomplished within an allotted time limit
- Items that are clear and uncomplicated
- Directions that have been rehearsed in their previous course work
- Tasks that relate to their course work (content validity)
- A difficulty level that presents a reasonable challenge

Example: A dictation test and a cloze test were used as a placement test for a group of learners of English as a second language. Some learners were upset because such tests, on the face of it, did not appear to them to test their true abilities in English. They felt that a multiple-choice grammar test would have been the appropriate format to use.

Validity is a complex concept, yet it is indispensable to the teacher’s understanding of what makes a good test.

Authenticity

A fourth major principle of language testing is authenticity, a concept that is difficult to define, especially within the art and science of evaluating and designing tests. Bachman and Palmer (1996) defined authenticity as the degree of correspondence of the characteristics of a given language test task to the features of a target language task.

In a test, authenticity may be present in the following ways. An authentic test …

- Contains language that is as natural as possible
- Has items that are contextualized rather than isolated
- Includes meaningful, relevant, interesting topics
- Provides some thematic organization to items, such as through a story line or episode
- Offers tasks that replicate real-world tasks

The authenticity of test tasks in recent years has increased noticeably. Two or three decades ago, unconnected, boring, contrived items were accepted as a necessary component of testing.

Washback

A facet of consequential validity discussed above is the effect of testing on teaching and learning, otherwise known in the language assessment field as washback. Messick (1996) reminded us that the washback effect may refer to both the promotion and the inhibition of learning, thus emphasizing what may be referred to as beneficial versus harmful or negative washback.

The following factors comprise the concept of washback. A test that provides beneficial washback …

- Positively influences what and how teachers teach
- Positively influences what and how learners learn
- Offers learners a chance to adequately prepare
- Gives learners feedback that enhances their language development
- Is more formative in nature than summative
- Provides conditions for peak performance by the learner


In large-scale assessment, washback often refers to the effects that tests have on instruction in terms of how students prepare for the test. Cram courses and “teaching to the test” are examples of washback that may have both negative and positive effects.

Teachers can provide information that “washes back” to students in the form of useful diagnoses of strengths and weaknesses.

Washback also includes the effects of an assessment on teaching and learning prior to the assessment itself.

Informal performance assessment is by nature more likely to have built-in washback effects because the teacher is usually providing interactive feedback.

Formal tests can also have positive washback, but they provide no beneficial washback if the students receive only a simple letter grade or a single overall numerical score.

The challenge to teachers is to create classroom tests that serve as learning devices through which washback is achieved. Students’ incorrect responses can become windows of insight into further work.

Teachers can suggest strategies for success as part of their “coaching” role. Washback enhances a number of basic principles of language acquisition: intrinsic motivation, autonomy, self-confidence, language ego, interlanguage, and strategic investment, among others.

Another viewpoint on washback is achieved by a quick consideration of the differences between formative and summative tests.

1- Formative tests, by definition, provide washback in the form of information to the learner on progress toward goals.

2- Summative tests, which provide assessment at the end of a course or program, do not need to offer much in the way of washback.

Finally, washback implies that students have ready access to you to discuss the feedback and evaluation you have given. Whereas you almost certainly have known teachers with whom you wouldn’t dare argue about a grade, an interactive, cooperative, collaborative classroom can promote an atmosphere of dialogue between students and teachers regarding evaluative judgments.

Applying Principles to the Evaluation of Classroom Tests

Are there other principles that should be invoked in evaluating and designing assessments? The answer, of course, is yes. Language assessment is an extraordinarily broad discipline with many branches, interest areas, and issues. The process of designing effective assessment instruments is far too complex to be reduced to five principles. Good test construction, for example, is governed by research-based rules of test preparation, sampling of tasks, item design and construction, scoring responses, ethical standards, and so on.


However, if validity is not substantiated, all other considerations may be rendered useless. The five principles can be applied to the evaluation of a classroom test by asking questions such as the following:

1- Are the test procedures practical?
2- Is the test itself reliable?
3- Can you ensure rater reliability?
4- Does the procedure demonstrate content validity?
5- Has the impact of the test been carefully accounted for?
6- Is the procedure “biased for best”?
7- Are the test tasks as authentic as possible?
8- Does the test offer beneficial washback to the learner?

Note: the answers to these eight questions can be found on pages 40-48 of the book.

The end

Best wishes!

Saeed Mojarradi
1393-08-18
