Testing for Language Teachers

TESTING FOR LANGUAGE TEACHERS — ARTHUR HUGHES
Mohammad Pazhouhesh, Khayyam University of Mashhad; Farhangian University of Mashhad, Beheshti Campus

Transcript of Testing for Language Teachers

Page 1: Testing for Language Teachers

TESTING FOR LANGUAGE TEACHERS

ARTHUR HUGHES

Mohammad Pazhouhesh
Khayyam University of Mashhad
Farhangian University of Mashhad, Beheshti Campus

Page 2: Testing for Language Teachers

TEACHING AND TESTING

BACKWASH: the effect of testing on teaching and learning. If a test is important, it can dominate all teaching and learning activities. Backwash can be harmful or beneficial.

Harmful backwash: the test content and testing techniques are at variance with the objectives of the course.

Beneficial backwash: the test content and testing techniques support the objectives of the course, with an immediate positive effect on teaching.

Page 3: Testing for Language Teachers

All measures of mental ability are necessarily indirect, incomplete, imprecise, subjective, and relative.

To minimize the effects of these limitations:
A. Provide clear theoretical definitions of the abilities we want to measure;
B. Specify precisely the conditions, or operations, that we will follow in eliciting and observing performance;
C. Quantify the observations so as to ensure our measurement scales have the properties we require.

Page 4: Testing for Language Teachers

KINDS OF TESTS AND TESTING

GENERAL TYPES OF TESTS
Proficiency tests
Achievement tests
Diagnostic tests
Placement tests
Selection tests
Competition tests
Aptitude tests: language aptitude tests, vocational aptitude tests

Page 5: Testing for Language Teachers

PROFICIENCY TESTS
measure language ability regardless of any previous training;
are not based on the content or objectives of language courses;
require a specification of what candidates must be able to do in order to be considered proficient.

Proficiency: having sufficient command of the language.

Used for a particular purpose, such as: a translator in the United Nations; a student seeking admission to American or British universities.
Used for a general purpose: general proficiency tests such as FCE, CPE, TOEFL, IELTS.

Page 6: Testing for Language Teachers

ACHIEVEMENT TESTS
are directly related to language courses;
are used to determine whether students have achieved the objectives of the course.

Kinds of achievement tests:
1. Final achievement tests
2. Progress achievement tests

Page 7: Testing for Language Teachers

Final achievement tests are administered at the end of a course of study. Their content is related to the course concerned.

Syllabus-content approach: should the test be based directly on a detailed course syllabus?
Disadvantage: if the syllabus is badly designed, the results of the test could be misleading.

Objectives-based approach: should the test be based on course objectives?
Advantages:
• compelling course designers to be explicit about course objectives;
• making it possible for the test to show how far objectives have been achieved;
• compelling the course designers to choose a syllabus which is consistent with the course objectives;
• working against poor teaching practice;
• promoting a more beneficial backwash effect.

Page 8: Testing for Language Teachers

PROGRESS ACHIEVEMENT TESTS measure the progress of students. One way to measure progress is to administer the final achievement test repeatedly.

Disadvantage: the low scores in early stages are discouraging.

The alternative is to establish a series of well-defined short-term objectives. These should make a clear progression towards the final achievement test based on course objectives.

Pop quizzes make a rough check on students' progress and keep students on their toes.

Page 9: Testing for Language Teachers

DIAGNOSTIC TESTS
• are used to identify learners' strengths and weaknesses;
• are intended to ascertain what learning still needs to take place;
• can tell us that someone is particularly weak in, say, speaking as opposed to reading in a language.
• Proficiency tests may prove adequate for this purpose.
• A teacher may even need to analyze samples of a person's performance in writing or speaking in order to create profiles of the student's ability in certain categories.

Page 10: Testing for Language Teachers

PLACEMENT TESTS
are intended to place students at the stage of the teaching program most appropriate to their abilities;
are used to assign students to classes at different levels;
are constructed for particular situations;
depend on the identification of the key features at different levels of teaching.

APTITUDE TESTS
indicate an individual's facility for acquiring specific skills and learning;
are used to measure aptitude for learning and to predict future performance.

Page 11: Testing for Language Teachers

DIRECT VS. INDIRECT TESTING

Direct tests require the candidate to perform precisely the skill that we wish to measure.

If we want to know how well candidates can write compositions, we get them to write compositions.
If we want to know how well they pronounce a language, we ask them to speak.

The tasks, and the texts used, should be as authentic as possible.

Direct testing is easier to carry out when measuring the productive skills.

Page 12: Testing for Language Teachers

Attractions of direct testing:
1. It is straightforward to create the conditions needed to elicit the required behavior.
2. Assessment and interpretation are straightforward.
3. It has a helpful backwash effect.

SEMI-DIRECT TESTING
Example: speaking tests in which candidates respond to tape-recorded stimuli, their own responses being recorded and later scored.

Page 13: Testing for Language Teachers

INDIRECT TESTING measures the abilities that underlie the skills tested.
Example: one section of the TOEFL serves as an indirect measure of writing ability, using items such as:
"At first the old woman seemed unwilling to accept anything that was offered her by my friend and I."

The main appeal of indirect testing:
testing a large number of elements in one test;
giving it to a large number of students;
correcting it objectively.

The main problem of indirect testing:
the relationship between performance on the test and actual performance of the skills being tested is weak in strength and uncertain in nature.

Page 14: Testing for Language Teachers

DISCRETE POINT VS. INTEGRATIVE TESTING

A. DISCRETE POINT TESTING
refers to the testing of one element at a time, item by item;
might take the form of a series of items, each testing a particular grammatical structure;
is a testing approach which cuts up language skills and components into smaller parts and then tests them one by one;
is an atomistic approach to language teaching and learning.

Page 15: Testing for Language Teachers

B. INTEGRATIVE TESTING
requires the candidate to combine many language elements in the completion of a task, for example:
writing a composition;
taking notes while listening to a lecture;
taking a dictation;
completing a cloze passage.

Unlike discrete-point tests, integrative tests tend to be direct, although some integrative methods, such as the cloze procedure, are indirect. Diagnostic tests of grammar tend to be discrete point.

Page 16: Testing for Language Teachers

NORM-REFERENCED VS. CRITERION-REFERENCED TESTING

A. NORM-REFERENCED TESTING (NRT)
relates one candidate's performance to that of other candidates. We are not told directly what the student is capable of doing in the language.

B. CRITERION-REFERENCED TESTING (CRT)
provides direct information about what a candidate can actually do in the language.

Page 17: Testing for Language Teachers

OBJECTIVE VS. SUBJECTIVE TESTING

A. OBJECTIVE TESTING
No judgment is required on the part of the scorer (e.g. multiple-choice tests).

B. SUBJECTIVE TESTING
Judgment is called for on the part of the scorer (e.g. composition).

There are different degrees of subjectivity in testing: scoring a composition is more subjective than scoring short-answer items.
Objectivity in scoring brings greater reliability to testing.
Scoring rubrics can increase the reliability of subjective tests such as composition.

Page 18: Testing for Language Teachers

COMPUTER ADAPTIVE TESTING
There is no real need for strong candidates to attempt easy items, and no need for weak candidates to attempt difficult items. Adaptive testing is an efficient way of collecting information on testees' ability:
The test initially presents items of average difficulty.
Those who respond correctly are presented with a more difficult item.
Those who respond incorrectly are presented with an easier item.
The computer adapts the items to the testees' level.
Oral interviews are typically a form of adaptive testing.
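The adaptive loop described above can be sketched in a few lines of Python. This is a simplified illustration under invented assumptions — a nine-level item bank, a one-step difficulty adjustment, and a simulated candidate; real adaptive tests estimate ability with item response theory rather than a simple up/down rule.

```python
# Illustrative sketch of the adaptive loop: start at average difficulty,
# move up after a correct answer, down after an incorrect one.
# The item bank and candidate model are invented for demonstration.

def run_adaptive_test(item_bank, answers_correctly, n_items=5):
    """item_bank: dict mapping difficulty levels 1..9 to items.
    answers_correctly: callable(item) -> bool, simulating the candidate."""
    level = 5          # begin with an item of average difficulty
    score = 0
    for _ in range(n_items):
        item = item_bank[level]
        if answers_correctly(item):
            score += 1
            level = min(level + 1, 9)   # correct: present a harder item
        else:
            level = max(level - 1, 1)   # incorrect: present an easier item
    return score, level

# A simulated candidate who can handle items up to difficulty 6
bank = {d: f"item-{d}" for d in range(1, 10)}
score, final_level = run_adaptive_test(bank, lambda it: int(it.split("-")[1]) <= 6)
```

After a few items the difficulty settles around the level the candidate can just manage, which is what makes the procedure efficient.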

Page 19: Testing for Language Teachers

COMMUNICATIVE LANGUAGE TESTING
measures the ability to take part in acts of communication, including reading and listening.
It is assumed that it is usually communicative ability that we want to test.

Page 20: Testing for Language Teachers

VALIDITY: Definition
A test is valid if it measures accurately what it is intended to measure.

Types of validity: construct, content, criterion-related, face.

CHAPTER 4 : VALIDITY


Page 22: Testing for Language Teachers

Construct Validity
the degree to which a test measures what it claims, or purports, to be measuring.

Construct: an attribute, ability, or skill that happens in the human brain and is defined by established theories. Intelligence, motivation, anxiety, proficiency, and fear are all examples of constructs. They exist in theory and have been observed to exist in practice. Constructs exist in the human brain and are not directly observable.

There are two types of construct validity: convergent and discriminant validity. Construct validity is established by looking at numerous studies that use the test being evaluated.


Page 23: Testing for Language Teachers

2. CONTENT VALIDITY
The test content is a representative sample of the language skills being tested. The test is content valid if it includes a proper sample.

Importance of content validity:
the greater a test's content validity, the more likely its construct validity;
a test without content validity is likely to have a harmful backwash effect, since areas that are not tested are likely to become ignored in teaching and learning.

Page 24: Testing for Language Teachers

3. CRITERION-RELATED VALIDITY
The degree to which results on the test agree with those provided by an independent criterion.

Kinds of criterion-related validity:
A. Concurrent validity is established when the test and the criterion are administered at the same time.
B. Predictive validity concerns the degree to which a test can predict candidates' future performance.

Page 25: Testing for Language Teachers

VALIDITY COEFFICIENT
A mathematical measure of similarity, used to show the degree of validity.
Perfect validity results in a coefficient of 1.00.
Total lack of validity results in a coefficient of 0.00.
What counts as satisfactory validity depends on the test's purpose and importance; a coefficient of 0.70 might be considered low if the test is important.

VALIDITY IN SCORING
A reading test may call for short written responses. If the scoring of these responses takes into account spelling and grammar, then it is not valid in scoring.
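In practice the validity coefficient is the Pearson correlation between scores on the test and scores on the independent criterion. The sketch below computes it from invented score lists; the data are illustrative only.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

test_scores      = [55, 60, 62, 70, 75, 80, 85, 90]   # scores on the new test
criterion_scores = [50, 58, 65, 68, 72, 84, 83, 92]   # e.g. independent ratings
validity = pearson(test_scores, criterion_scores)     # close to 1.0 for this data
```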

Page 26: Testing for Language Teachers

4. FACE VALIDITY
The way the test looks to examinees, test administrators, educators, and the like. If you want to test students' pronunciation but you do not ask them to speak, your test lacks face validity. If your test contains items or materials which are not acceptable to candidates, teachers, educators, etc., it lacks face validity.

HOW TO MAKE TESTS MORE VALID?
Write explicit specifications for the test, which include all the constructs to be measured.
Make sure that you include a representative sample of the content.
Use direct testing.
Make sure the scoring is valid.
Make the test reliable.

Page 27: Testing for Language Teachers

RELIABILITY refers to the stability or consistency of scores: nearly the same scores for the same individuals in two sessions. Multiple-choice tests have a high coefficient of reliability. (See the tables on p. 37.)

RELIABILITY COEFFICIENT
The ideal coefficient is 1.00; total lack of reliability is 0.00.
Satisfactory reliability depends on the purpose and importance of the test:
Vocabulary, structure, and reading tests: .90–.99
Auditory comprehension tests: .80–.89
Oral production tests: .70–.79

CHAPTER 5 : RELIABILITY

Page 28: Testing for Language Teachers

HOW TO ESTIMATE RELIABILITY?
(i.e., the way in which the reliability coefficient is arrived at)

Test-retest method: the same students take the same test twice, and the two sets of scores are compared.

Drawbacks of this method:
If the second administration is too soon, the students will remember the test, and their scores will be affected.
If the interval is too long, the students will forget or improve, and that too will affect the scores.


Page 29: Testing for Language Teachers

The alternate forms method
Two equivalent forms of the test are used; the problem is that such forms are usually not available.

The split-half method
The most common method of estimating reliability. The subjects take the test one time, but each subject is given two scores, one for each half of the test. The two sets of scores are then used to obtain the reliability coefficient.
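The split-half procedure can be sketched numerically: each candidate gets one score on the odd-numbered items and one on the even-numbered items, the two sets of half-scores are correlated, and the half-test correlation is commonly "stepped up" with the Spearman-Brown formula to estimate full-test reliability. The response matrix below is invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# rows = candidates, columns = items (1 = correct, 0 = incorrect)
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 1, 0, 0, 0, 1],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1],
]
odd_half  = [sum(row[0::2]) for row in responses]   # items 1, 3, 5
even_half = [sum(row[1::2]) for row in responses]   # items 2, 4, 6
r_half = pearson(odd_half, even_half)
# Spearman-Brown step-up: estimated reliability of the full-length test
reliability = 2 * r_half / (1 + r_half)
```

Note that the stepped-up coefficient is always at least as high as the half-test correlation, reflecting the principle that a longer test is more reliable.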


Page 30: Testing for Language Teachers

THE STANDARD ERROR OF MEASUREMENT AND THE TRUE SCORE
All test scores are estimates; all tests contain some degree of error. You have to use a statistic known as the standard error of measurement to estimate the limits within which an obtained score is likely to diverge from the true score.
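The usual formula from classical test theory is SEM = s·√(1 − r), where s is the standard deviation of the test scores and r the reliability coefficient. The numbers below are invented for illustration.

```python
import math

sd = 10.0            # standard deviation of the obtained scores
reliability = 0.91   # reliability coefficient of the test

sem = sd * math.sqrt(1 - reliability)   # standard error of measurement (3.0 here)

obtained = 60
# With roughly 68% confidence the true score lies within one SEM,
# and with roughly 95% confidence within two SEMs, of the obtained score.
band_68 = (obtained - sem, obtained + sem)          # about (57, 63)
band_95 = (obtained - 2 * sem, obtained + 2 * sem)  # about (54, 66)
```

The more reliable the test, the smaller the SEM, and the tighter the band within which the true score is likely to lie.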


Page 31: Testing for Language Teachers

SCORER RELIABILITY
Consistency of scoring: nearly the same score for the same test. In other words, comparing the scores given by two or more scorers for the same students. In composition tests, scores usually fluctuate; in multiple-choice tests, agreement is nearly perfect. If the scoring of a test is not reliable, then the test results cannot be reliable either.


Page 32: Testing for Language Teachers

HOW TO MAKE TESTS MORE RELIABLE?
1) Take enough samples of behavior
The more items you have on a test, the more reliable the test will be.
Considerations when adding extra items:
Additional items should be independent of each other and of existing items.
Each additional item should represent a fresh start for the candidate.
Tests should be neither too long nor too short.


Page 33: Testing for Language Teachers

2) Exclude items which do not discriminate well between weaker and stronger students
Items on which strong and weak students perform with a similar degree of success contribute little to the reliability of a test. Items that are too easy or too difficult should be excluded. A small number of easy items may be kept at the beginning of a test to give candidates confidence and reduce the stress they feel.


Page 34: Testing for Language Teachers

3) Do not allow candidates too much freedom
Giving candidates a choice of questions has a negative effect on reliability; in general, candidates should not be given a choice.

4) Write unambiguous items.
5) Provide clear and explicit instructions.
6) Ensure that tests are well laid out and perfectly legible.
7) Make candidates familiar with the format and testing techniques.


Page 35: Testing for Language Teachers

8) Provide uniform and non-distracting conditions of administration.
9) Use items that permit scoring which is as objective as possible.
10) Provide a detailed scoring key.
11) Train scorers.
12) All scorers should follow the same criteria for scoring.
13) Identify candidates by number, not name.
14) Employ multiple, independent scoring.


Page 36: Testing for Language Teachers

RELIABILITY AND VALIDITY
A valid test must be reliable; however, a reliable test may not be valid at all. Increasing the reliability of a test may come at the expense of validity. There will always be some tension between reliability and validity, and the tester has to balance gains in one against losses in the other.


Page 37: Testing for Language Teachers

CHAPTER 6 : ACHIEVING BENEFICIAL BACKWASH

Test the abilities whose development you want to encourage.

Beware of the reasons commonly given for not testing particular abilities:
the desire for objectivity (as with multiple-choice formats);
the subjective scoring that subjective tests require;
the expense involved in terms of time and money.

Determine the points that should be tested and give them sufficient weight in relation to the other abilities.

Page 38: Testing for Language Teachers

How to achieve beneficial backwash:
I. Sample widely and unpredictably.
II. Use direct testing.
III. Make testing criterion-referenced.
IV. Base achievement tests on objectives.
V. Ensure the test is known and understood by students and teachers.
VI. Where necessary, provide assistance to teachers.
VII. Count the cost.

Page 39: Testing for Language Teachers

CHAPTER 7: STAGES OF TEST DEVELOPMENT
1) Make a full and clear statement of the testing 'problem'.
2) Write complete specifications for the test.
3) Write and moderate items.
4) Try the items on native speakers.
5) Try the items on non-native speakers.
6) Analyze the results of the trial and make necessary changes.
7) Calibrate scales.
8) Validate.
9) Write handbooks for test takers, test users, and staff.
10) Train any necessary staff (interviewers, raters, etc.).

Page 40: Testing for Language Teachers

Stating the problem
The questions to be answered in order to state the problem:
i) What kind of test is it to be?
ii) What is its precise purpose?
iii) What abilities are to be tested?
iv) How detailed must the results be?
v) How accurate must the results be?
vi) How important is backwash?
vii) What constraints are set by the unavailability of expertise, facilities, and time?

Page 41: Testing for Language Teachers

2) Writing specifications for the test

i) Determining content:
specifying instructional objectives;
preparing a table of specifications;
determining the number of items.

ii) Necessary operations by the test developer — specification of:
text types (letters, forms, academic essays);
addressees of texts;
length of text(s);
topics (familiar/unfamiliar);
readability;
structural and vocabulary range;
dialect, accent, style;
speed of processing (words to be read per minute, rate of speech).
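As a concrete (and entirely invented) illustration of the headings above, a written test specification might record entries like the following; the field names and values are hypothetical, not Hughes's.

```python
# A hypothetical test specification, recording the decisions listed above.
spec = {
    "purpose": "final achievement test for an intermediate reading course",
    "abilities_tested": ["reading for gist", "scanning", "vocabulary in context"],
    "text_types": ["letters", "forms", "short academic essays"],
    "text_length_words": (250, 400),       # min, max length of each passage
    "topics": "familiar, non-specialist",
    "readability": "intermediate",
    "structure": {
        "sections": ["grammar", "vocabulary", "reading"],
        "items_per_section": 20,
        "passages": 3,
    },
    "medium": "paper and pencil",
    "timing_minutes": 60,
    "scoring": "objective, from a detailed answer key",
}
```

Writing the decisions down in this explicit form is what allows item writers, moderators, and validators to check the finished test against the original plan.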

Page 42: Testing for Language Teachers

iii) Structure, timing, medium/channel and techniques:
test structure (test sections: grammar, vocabulary, reading);
number of items;
number of passages;
medium/channel (tape, paper and pencil, ...);
timing;
techniques.

iv) Critical levels of performance
v) Scoring procedures: subjective or objective

3) Writing and moderating items
i) Sampling (based on the content)
ii) Writing items
iii) Moderating items (reviewing)

Page 43: Testing for Language Teachers

4)–5) Pretesting
Informal trial of items on native speakers; trialing items on non-native speakers (pretesting).

6) Item analysis (analysis of the results), covering:
reliability;
level of difficulty;
discrimination index;
distractors;
clarity of instructions and items;
timing.
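Two of the item-analysis statistics listed above — level of difficulty (facility value) and a discrimination index — can be sketched as follows. The response matrix is invented, and the simple top-third/bottom-third index shown is only one of several discrimination measures in use.

```python
# responses[c][i] = 1 if candidate c answered item i correctly
responses = [
    [1, 1, 1, 1],   # strongest candidate
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],   # weakest candidate
]
n = len(responses)
n_items = len(responses[0])

# Facility value: proportion of candidates who answered the item correctly.
facility = [sum(row[i] for row in responses) / n for i in range(n_items)]

# Discrimination index: compare the top- and bottom-scoring thirds.
ranked = sorted(responses, key=sum, reverse=True)
third = n // 3
top, bottom = ranked[:third], ranked[-third:]
discrimination = [
    (sum(row[i] for row in top) - sum(row[i] for row in bottom)) / third
    for i in range(n_items)
]
# Item 4 here has discrimination 0.0: strong and weak candidates do equally
# well on it, so it contributes little and would be a candidate for exclusion.
```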

7) Calibration of scales
8) Validation
9) Writing handbooks for test takers, test users, and staff
10) Training staff

Page 44: Testing for Language Teachers

7) Calibration of scales
For testing speaking and writing, a team of experts looks at samples of the skills concerned and assigns each of them to a point on the relevant scale.

8) Validation
Essential for proficiency tests and repeatedly used tests.

9) Writing handbooks for test takers, test users, and staff
Essential for proficiency tests and repeatedly used tests.

10) Training staff
Essential for proficiency tests and repeatedly used tests.

See pp. 66 – 72 for examples of test development