Aligning tests to standards
Transcript of Aligning Tests to Standards
© Fariba Chamani, 2015
Glen Fulcher (2010), Practical Language Testing
Chapter 5: Aligning Tests to Standards
Content of this chapter
- It's as old as the hills
- The definition of 'standards'
- The uses of standards
- Unintended consequences revisited
- Using standards for harmonization and identity
- How many standards can we afford?
- Performance level descriptors (PLDs) and test scores
- Some initial decisions
- Standard-setting methodologies
- Evaluating standard setting
- Training
- The special case of the CEFR
- You can always count on uncertainty
It's as old as the hills
Standard setting = the process of establishing one or more cut scores on examinations.
Standards-based assessment = using tests to assess learner performance and achievement in relation to an absolute standard. It is a development of criterion-referenced testing, uses large-scale standardized tests, and pre-dates the criterion-referenced testing movement.
Definition of 'standard'
Standard = a level of performance required or experienced (Davies et al., 1999).
Example: the standard required for entry to the university is an A in English.
The uses of standards
- Educational purposes (achievement tests)
- Professional purposes (certification of aircraft engineers)
- Political purposes (No Child Left Behind (NCLB) & Adequate Yearly Progress (AYP))
- Immigration policy purposes
Unintended consequences
- In the case of NCLB, the English language learner (ELL) group always scores below the standard, and resources are not channeled to where they are most needed.
- The mandatory use of English in tests of content subjects puts pressure on indigenous people to abandon education in their own language.
- The use of language tests for immigration leads to fraudulent practices and short-term 'paper marriages'.
Using standards for harmonization & identity
Standards can be used to enforce conformity to a single model that helps to create and maintain political unity and identity.
Examples:
- the Carolingian empire of Charlemagne (CE 800–814)
- the CEFR (now)
Carolingian empire of Charlemagne
Within the empire of Charlemagne in Central and Western Europe, various groups followed different calendars, and the main Christian festivals fell on different dates.
In order to bring uniformity, Charlemagne set a new standard for the 'computists' who worked out the times of the festivals: they were required to pass a test in order to obtain their certificate.
There are no 'correct answers' for the questions in the Carolingian test; they are scored as 'correct' because they are defined as such by the standard, and the standard is arbitrarily chosen with the intention of harmonizing practice.
CEFR (Common European Framework of Reference)
CEFR = a set of standards (six-level scales and their descriptors) that provides a European model for language testing and learning, intended to enhance European identity and harmonization.
Teachers are expected to align their curricula and tests to CEFR standards ('linking'); otherwise many European institutions will not recognize the certificates they award.
Problems with the CEFR
- It drains creativity among teachers.
- The same set of standards is used for all people, across different contexts, with different purposes.
- Validation is based on linking the test to the CEFR, which runs counter to validity theories.
- The use of standards and tests for harmonization ultimately leads to a desire for more control.
How many standards can we afford?
- The number of performance levels depends on the goals and the use of the test.
- Choosing the fewest performance levels (pass or fail) is ideal, because the more numerous the classes, the greater the danger that a small difference in marks will change a classification.
- The Index of Separation estimates the number of performance levels into which a test can reliably place test takers.
- Sometimes we have to use numerous categories, for example to motivate young learners.
Performance level descriptors (PLDs) & test scores
PLDs are often developed using intuitive and experiential methods, and the labels and descriptors are simple reflections of the values of policy makers.
There are typically around four levels: 'advanced – proficient – basic – below basic'.
The PLDs provide a conceptual hierarchy of performance that is an indication of the ability or knowledge of the test taker.
Standard setting is the process of deciding on a cut score for a test to mark the boundary between two PLDs. If we have two performance levels (pass and fail), we'll need a single cut score.
Standards-based tests, CRT & scoring rubrics
It is said that the tests used in standards-based testing are criterion-referenced, yet for Glaser the criterion was the domain, and it had nothing to do with standard setting and classification.
The standards-based testing movement has interpreted 'criterion' to mean 'standard'.
The focus within PLDs is on general levels of competence, proficiency or performance, while scoring rubrics address only single items.
Some initial decisions
All standard-setting methods involve expert judgemental decision making at some level (Jaeger, 1979).
Decision 1: Compensatory or non-compensatory marking? In compensatory marking, strength in other areas 'compensates' for weakness in one area.
Decision 2: What classification errors can you tolerate?
Decision 3: Are you going to allow test takers who 'fail' a test to retake it? If so, what time lapse is required before retaking the test?
(Cycle diagram: standard-setting methods — test-centered, examinee-centered, criterion-referenced, norm-referenced)
Classification of standard-setting methodologies
Test-centered:
- Angoff
- Ebel
- Nedelsky
- Bookmark
Examinee-centered:
- Method of contrasting groups
- Method of borderline group
Common process of standard setting
1. Select an appropriate standard-setting method, depending upon the purpose of the standard setting, the available data, and personnel.
2. Select a panel of judges based upon explicit criteria.
3. Prepare the PLDs and other materials as appropriate.
4. Train the judges to use the method selected.
5. Rate items or persons; collect and store data.
6. Provide feedback on ratings and initiate discussion, so that judges can explain their ratings, listen to others, and revise their views or decisions before another round of judging.
7. Collect final ratings and establish cut scores.
8. Ask the judges to evaluate the process.
9. Document the process in order to justify the conclusions reached.
Test-centered methods
The judges are presented with individual items or tasks and are required to make a decision about the expected performance on them by a test taker who is just below the border between two standards.
Angoff method
Experts are given a set of items and rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly.
The average of these probabilities across judges or raters gives the cut score.
If the test contains polytomous items or tasks, the proportion of the maximum score is used instead of the probability (modified Angoff).
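The Angoff computation can be sketched in a few lines of Python. The ratings below are invented for illustration, and the cut score is expressed as the expected number of items a borderline candidate would answer correctly (the per-item averages summed over items).

```python
# Sketch of the Angoff method with hypothetical judge ratings.
# Each judge assigns every item the probability that a borderline
# test taker would answer it correctly.

def angoff_cut_score(ratings):
    """ratings: one list of per-item probabilities per judge.
    Returns the expected number of items correct for a borderline
    candidate, used as the raw cut score."""
    n_judges = len(ratings)
    n_items = len(ratings[0])
    # Average each item's probability across judges, then sum over items.
    per_item_means = [sum(judge[i] for judge in ratings) / n_judges
                      for i in range(n_items)]
    return sum(per_item_means)

# Three judges rating a four-item test (illustrative values only)
ratings = [
    [0.9, 0.7, 0.5, 0.3],
    [0.8, 0.6, 0.6, 0.4],
    [1.0, 0.8, 0.4, 0.2],
]
cut = angoff_cut_score(ratings)  # 2.4 items out of 4
```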
Advantages & disadvantages of the Angoff method
Advantages:
- Clarity
- Simplicity
Disadvantages:
- Cognitive difficulty in all judges conceptualizing the borderline learner in precisely the same way
Ebel method (2 rounds)
Experts independently classify test items by:
I. level of difficulty: easy / medium / hard
II. level of relevance: essential / important / acceptable / questionable
Ebel method
The judges estimate the percentage of items a borderline test taker would get correct for each cell. The percentage for each cell is then multiplied by the number of items in it: if the 'easy/essential' cell has 20 items, 20 × 85 = 1700.
These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge.
Finally, these are averaged across judges to give the final cut score.
All items can be classified into the 12 cells of a 3×4 grid defined by the three difficulty and four relevance categories, as in the example.
| Category | Expert 3 (A, B, A×B) | Expert 4 (A, B, A×B) | Expert 5 (A, B, A×B) |
|---|---|---|---|
| Essential / Easy | 11, 60, 660 | 10, 70, 700 | 13, 75, 975 |
| Essential / Medium | 1, 25, 25 | 3, 25, 75 | 1, 0, 0 |
| Essential / Hard | 0, 10, 0 | 1, 0, 0 | 0, 0, 0 |
| Questionable / Easy | 0, 0, 0 | 0, 0, 0 | 0, 0, 0 |
| Questionable / Medium | 0, 0, 0 | 0, 0, 0 | 0, 0, 0 |
| Questionable / Hard | 0, 0, 0 | 0, 0, 0 | 0, 0, 0 |
| Mean | 25.1 | 26.7 | 35 |

A = number of items in a category; B = % of items a borderline test taker would perform correctly.
Mean for all experts: 28
Cut score: 12
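The cell-by-cell arithmetic described above can be sketched for a single judge as follows; the cell counts and percentages here are invented for illustration, not the figures from the worked example.

```python
# Sketch of one judge's Ebel cut score (hypothetical cell values).
# Each (relevance, difficulty) cell holds the number of items in it
# and the judge's estimated % correct for a borderline test taker.

def ebel_cut_score(cells):
    """cells: list of (n_items, pct_correct) pairs, one per grid cell.
    Returns the judge's cut score as a percentage of the total test."""
    total_items = sum(n for n, _ in cells)
    # Weight each cell: e.g. 20 'easy/essential' items at 85% -> 1700
    weighted_sum = sum(n * pct for n, pct in cells)
    return weighted_sum / total_items

# Three occupied cells out of the 12-cell grid (illustrative only)
cells = [(20, 85), (10, 50), (5, 20)]
cut_pct = ebel_cut_score(cells)  # 2300 / 35, about 65.7%
```

Per-judge cut scores computed this way are then averaged across the panel to give the final cut score.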
Problems with Ebel
- The complex cognitive requirement of classifying items according to two criteria, in relation to an imagined borderline student, may be challenging for the judges.
- Since it is assumed that some items may have questionable relevance to the construct of interest, the method implicitly throws into doubt the rigor of the test development process and the validity arguments.
Nedelsky method (multiple-choice)
The experts estimate which options of each multiple-choice item a borderline test taker would be able to eliminate.
In a four-option item with three distractors, if a candidate can eliminate all 3 distractors, the chance of getting the item right is 1 (100%); but if he can rule out only 1 of the distractors, the chance of answering the item correctly is 1 in 3 (33%).
These probabilities are averaged across all items for each judge, and then across all judges, to arrive at a cut score.
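The elimination logic above can be sketched for a single judge; the per-item elimination counts are hypothetical, and the cut score is expressed as the expected number of items answered correctly.

```python
# Sketch of the Nedelsky method for 4-option multiple-choice items
# (hypothetical ratings). For each item, the judge records how many
# distractors a borderline candidate could eliminate; the success
# probability is then 1 / (number of options remaining).

def nedelsky_cut_score(eliminated_per_item, n_options=4):
    probs = [1 / (n_options - e) for e in eliminated_per_item]
    return sum(probs)  # expected number of items answered correctly

# One judge, five items: distractors eliminated per item
eliminated = [3, 1, 0, 2, 3]
cut = nedelsky_cut_score(eliminated)  # 1 + 1/3 + 1/4 + 1/2 + 1
```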
Problems with the Nedelsky method
- It assumes that test takers answer multiple-choice items by eliminating the options that they think are distractors and then guessing randomly among the remaining options. However, it is highly unlikely that test takers answer items in this way.
- The Nedelsky method tends to produce lower cut scores than other methods, and is therefore likely to increase the number of false positives.
Bookmark method
Essential materials:
- Directions to Bookmark participants
- Ordered item booklet
- Booklet guideline
- Student exemplar papers
- Scoring guide

Basic steps of the procedure:
- Round I: Experts are told how many cut scores must be established; they work in small groups, and all the essential material is introduced to them.
- Round II: The cut scores established by every expert are reviewed, and the same procedure as in the first round is repeated.
- Round III: The percentage of students falling into each performance level, and each median cut score from Round II, are presented; after discussion, individual judgments are made.
Procedures in the Bookmark method
Judges are presented with the necessary materials. They are then asked to keep in mind a borderline student and place a 'bookmark' in the booklet between two items, such that the candidate is more likely to answer the items below the bookmark correctly and the items above it incorrectly.
The bookmarks are discussed in the group, and finally the median of the bookmarks for each cut point is taken as the group's recommendation for that cut point.
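Aggregating the bookmarks is a simple median computation; the page numbers and cut-point names below are invented for illustration.

```python
# Sketch of aggregating bookmark placements (page numbers invented).
# Each judge marks the page in the ordered item booklet where the
# borderline student stops getting items right; the group's
# recommendation for each cut point is the median placement.
from statistics import median

bookmarks = {
    "cut_point_1": [14, 16, 15, 18, 15],
    "cut_point_2": [27, 30, 29, 28, 31],
}
recommendations = {cut: median(pages) for cut, pages in bookmarks.items()}
# recommendations holds the group's recommended page per cut point
```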
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard; the test is then administered to the test takers to discover where the cut score should lie.
Borderline group method
The judges define what borderline candidates are like, and then identify borderline candidates who fit the definition.
Once the students have been placed into groups, the test can be administered. The median score of the group defined as borderline is used as the cut score.
The main problem: the cut score is dependent upon the particular group used in the study.
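The borderline group computation is just the median of the borderline group's test scores; the scores below are hypothetical.

```python
# Sketch of the borderline group method (hypothetical scores).
# The cut score is the median test score of the candidates that
# the judges classified as borderline.
from statistics import median

borderline_scores = [41, 38, 45, 40, 43, 39, 44]
cut_score = median(borderline_scores)  # the middle score of the group
```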
Method of contrasting groups
The procedure involves testing two groups of examinees:
- The classification into 'competent' and 'non-competent' groups must be done using independent criteria, such as teacher judgments.
- The test is then given and the score distributions are calculated. There are likely to be overlaps in the distributions.
- The cut score is placed where the overlap in the distributions is observed.
(Figure: overlapping score distributions for the competent and non-competent groups)
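One common way to operationalise 'where the overlap is' is to pick the score that minimises classification errors; that choice is an assumption for this sketch, not the chapter's prescription, and the scores are invented.

```python
# Sketch of the contrasting groups method (hypothetical scores).
# Candidates are first classified by independent criteria; the cut
# score is then chosen in the overlap region so as to minimise
# misclassifications (false passes plus false fails).

def contrasting_groups_cut(competent, non_competent):
    candidates = sorted(set(competent) | set(non_competent))
    best_cut, best_errors = None, None
    for c in candidates:
        # A candidate passes if score >= c.
        errors = (sum(s < c for s in competent)          # false fails
                  + sum(s >= c for s in non_competent))  # false passes
        if best_errors is None or errors < best_errors:
            best_cut, best_errors = c, errors
    return best_cut

competent = [55, 60, 48, 52, 70]
non_competent = [40, 45, 50, 38, 47]
cut = contrasting_groups_cut(competent, non_competent)
```

Because both groups' full score distributions are known, the false-positive and false-negative counts at any candidate cut score fall out of the same loop, which is exactly the decision-error information this method uniquely provides.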
Which method is the 'best'?
It depends on what kind of judgments you can get for your standard-setting study and on the quality of the judges you have available.
However, the contrasting-groups approach is recommended where possible, because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores.
The problem is getting the judgments of a number of people on a large group of individuals
Evaluating standard setting (Kane, 1994)
Procedural evidence:
- What procedures were used for the standard setting to ensure that the process is systematic?
- Were the judges properly trained in the methodology and allowed to express their views freely?
Internal evidence:
- Deals with the consistency of results arising from the procedure.
- It also estimates the extent of agreement between judges (Cohen's kappa).
External evidence:
- Correlation of the scores of learners in a borderline-group study with some other test of the same construct.
- High correlation = the established cut scores are defensible.
Training: a critical part of standard setting
Training activities include familiarization with the PLDs and the test, looking at the scoring keys, making practice judgments, and getting feedback.
Different views may lead to disagreements among the judges. Training should not be designed to eliminate these variations, but to allow free discussion among judges. If the judges do not converge, the outcome should still be accepted by the researchers.
The training process should not force agreement ('cloning'), because removing the judges' individuality and inducing agreement is a threat to validity.
The special case of the CEFR
- The CEFR Manual contains performance level descriptors for standard setting, in order to introduce a common language and a single reporting system into Europe.
- It recommends five processes to 'relate' language examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment. These processes are: familiarization, specification, standardization training/benchmarking, standard setting, and validation.
- Familiarization, standard setting, and validation are uncontentious, because they reflect common international assessment practice that is not unique to Europe; the other two processes, however, are problematic.
PLDs in the CEFR & in other standards-based systems

| CEFR | Other standards-based systems |
|---|---|
| The use of PLDs is institutionalized, and their meaning is generalized across nations. | PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed. |
| Standardization facilitates 'the implementation of a common understanding' of the CEFR, and training is cloning rather than familiarization. | Standardization and training ensure that everyone understands the standard-setting method, yet judgments are freely made. |
| Benchmarking = the process of rating individual performance samples using the CEFR PLDs. | Benchmarking = the typical performances that are identified after standard setting. |
| Standard setting = 'mapping' the existing cut scores from tests onto CEFR levels. | Standard setting = establishing cut scores on tests. |
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals
Thank You For Your Attention
Content of this chapterItrsquos as old as the hillsThe definition of lsquostandardsrsquoThe uses of standards Unintended consequences revisited Using standards for harmonization and identity How many standards can we afford Performance level descriptors (PLDs) and test scores Some initial decisions Standard-setting methodologies Evaluating standard setting Training The special case of the CEFR You can always count on uncertainty
Itrsquos as old as the hills Standard setting = The process of establishing one or more cut scores on examinations
Standards-based assessment = Using tests to assess learner performance and achievement in relation to an absolute standard
A development of criterion-referenced testing Using large-scale standardized tests Pre-dating the criterion-referenced testing move
Definition of lsquostandardrsquo
Standard = a level of performance required or experienced (Davies et al 1999)
Example The standard required for entry to the university is an A in English
The uses of standards Educational purposes (achievement tests)
Professional purposes (certification of aircraft engineers)
Political purposes (NCLB amp AYP)
Immigration Policy purposes
Unintended consequences In case of NCLB ELL group is always lower than the standard amp resources are not channeled to where they are most needed
Mandatory use of English in tests of content subjects puts pressure on the indigenous people to abandon education in their own language
The use of language tests for immigration leads to fraudulent practices amp short-term paper marriages
Using standards for harmonization amp identity
To enforce conformity to a single model that helps to create and maintain political unity and identity
ExamplesCarolingian empire of Charlemagne (CE 800ndash814)
CEFR (Now)
Carolingian empire of Charlemagne Within the empire of Charlemagne in Central and Western Europe various groups followed different calendars and the main Christian festivals fell on different dates
In order to bring uniformity Charlemagne set a new standard for lsquocomputistsrsquo who worked out the time of festivals They required to pass a test in order to get their certificate
There are no lsquocorrect answersrsquo for the questions in the Carolingian test they are scored as lsquocorrectrsquo because they are defined as such by the standard and the standard is arbitrarily chosen with the intention of harmonizing practice
CEFR (Common European Framework of Reference )
CEFR = A set of standards (six-level scales and their
descriptors ) that provides a European model for language
testing and learning to enhance European identity and
harmonization
Teachers should align their curriculum and tests to CEFR
standards (Linking) otherwise many European institutions
will not recognize the certificate they awarded
Problems with CEFRIt drains creativity among teachers
The same set of standards are used for all people across different contexts with different purposes
Validation is based on linking the test to the CEFR This is against validity theories
The use of standards and tests for harmonization ultimately leads to a desire for more control
How many standards can we afford
The number of performance levels depends on the goals and the use of the test
Choosing the fewest performance levels (pass or fail) is ideal because the more numerous the classes the greater will be the danger of a small difference in marks
Index of Separation estimates the number of performance levels into which a test can reliably place test takers
Sometimes we have to use numerous categories to motivate young learners
Performance level descriptors (PLDs) amp Test scores
PLDs are often developed based on intuitive and experiential method amp the labels and descriptors are simple reflections of the values of policy makers
There are around four levels lsquoadvanced ndash proficient ndash basic ndash below basicrsquo
The PLDs provide a conceptual hierarchy of performance that is an indication of the ability or knowledge of the test taker
Standard-setting is the process of deciding on a cut score for a test to mark the boundary between two PLDs If we have two performance levels (pass and fail) wersquoll need a single cut score
Standard based tests CRT amp scoring rubrics
It is said that tests used in standards-based testing are criterion- referenced yet for Glaser the criterion was the domain and it does not have anything to do with standard setting and classification
The standards-based testing movement has interpreted lsquocriterionrsquo to mean lsquostandardrsquo
The focus within PLDs is on the general levels of competence proficiency or performance while scoring rubrics address only single items
Some initial decisions All standard setting methods involve expert judgemental decision making at some level (Jaegar 1979)
Decision 1 Compensatory or non-compensatory marking The strength in other areas lsquocompensatesrsquo for the weakness in one area
Decision 2 What classification errors can you tolerate
Decision 3 Are you going to allow test takers who lsquofailrsquo a test to retake it If so what time lapse is required to retake the test
Second Page Lorem ipsum dolor sit amet consectetur adipisicing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua Ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur Excepteur sint occaecat cupidatat non proident sunt in culpa qui officia deserunt mollit anim id est laborum
Cycle Diagram
Test-centered
Criterion-referenced
Norm-referenced
Examinee-centered
Standard-Setting Methods
Classification of
Standard-setting methodologies
Test-centered
bull Angoffbull Ebelbull Nedelskybull Bookmark
Examinee-centered
bull Method of Contrasting Groups
bull Method of Borderline group
Common process of standard setting
Select an appropriate standard setting method depending upon the purpose of the standard setting available data and personnel
Select a panel of judges based upon explicit criteria Prepare the PLDs and other materials as appropriate Train the judges to use the method select Rate items or persons collect and store data Provide feedback on rating and initiate discussion for judges to
explain their ratings listen to others and revise their views or decisions before another round of judging
Collect final ratings and establish cut scores Ask the judges to evaluate the process Document the process in order to justify the conclusions reached
Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards
Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly
The average of these probabilities across judges or raters is the cut score
If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)
Advantages amp disadvantages
Clarity
Simplicity
Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way
+ -
Ebel method 2 Rounds Experts classify independently test items by
I level of difficulty
II level of relevance
easy medium hard
essential important acceptable questionabl
e
Ebel method The judges estimate the percentage of items a borderline
test taker would get correct for each cell Then the percentage for each cell is multiplied by the
number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700
These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge
Finally these are averaged across judges to give a final cut score
All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example
categories Expert 3 Expert 4 Expert 5 Number of items
in a category
(А)
correctly performed
items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
EssentialEasy 11 60 660 10 70 700 13 75 975
Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0
Questionable
Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0
Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35
Mean for all experts 28
Cut-score 12
hellip
Problems with EBELThe complex cognitive requirements of classifying items
according to two criteria in relation to an imagined borderline
student may be challenging for the judges
As it is assumed that some items may have questionable
relevance to the construct of interest it implicitly throws into
doubt the rigor of the test development process and validity
arguments
Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline
test taker would be able to eliminate
In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )
These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score
Problems with Nedelsky method
It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way
Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives
Bookmark method
Directions to Bookmark participants
Ordered item booklet
Booklet guideline
Student exemplar papers
Scoring guide
Essential materials
Standard Setting
Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments
Overview of established cut-scores by every expert repeating of the same procedure as
in the first step
Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is
introduced to them
Basic steps of the procedure
Round III
Round II
Round I
Procedures in Bookmark method
Judges are presented with the necessary materials Then they are asked to keep in mind a borderline
student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly
The bookmarks are discussed in group and finally
the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie
Borderline group method The judges define what borderline candidates are
like and then identify borderline candidates who fit the definition
Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score
The main problem the cut score is dependent upon the group being used in the study
Method of contrasting groupsProcedure includes testing of two groups of examinees
bullThe classification must be done using independent criteria such as teacher judgments
bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions
bull The cut score will be where overlap is observed in the distributions
Competent Non-competent
Which method is the lsquobestrsquo
It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available
However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores
The problem is getting the judgments of a number of people on a large group of individuals
Evaluating standard setting (Kane 1994)
Procedural evidence
bull What procedures were used for the standard-setting to ensure that the process is systematic
bull Were the judges properly trained in the methodology and allowed to express their views freely
Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )
External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible
Training a critical part of standard setting
Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback
Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers
The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity
The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard
setting in order to introduce a common language and a single reporting system into Europe
bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation
bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic
PLDs in the CEFR & in other standards-based systems

CEFR:
• The use of PLDs is institutionalized & their meaning is generalized across nations.
• Standardization facilitates 'the implementation of a common understanding' of the CEFR, and training is cloning rather than familiarization.
• Benchmarking = the process of rating individual performance samples using the CEFR PLDs.
• Standard setting = 'mapping' the existing cut scores from tests onto CEFR levels.

Other standards-based systems:
• PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed.
• Standardization & training ensure that everyone understands the standard-setting method, yet judgments are freely made.
• Benchmarking = the typical performances that are identified after standard setting.
• Standard setting = establishing cut scores on tests.
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices; they are open to change, and even rejection, in the service of language education.
Standards-based testing fails when it is used as a policy tool to achieve control of educational systems, with the intention of imposing a single acceptable teaching and assessment discourse upon professionals.
Thank You For Your Attention
Definition of lsquostandardrsquo
Standard = a level of performance required or experienced (Davies et al 1999)
Example The standard required for entry to the university is an A in English
The uses of standards Educational purposes (achievement tests)
Professional purposes (certification of aircraft engineers)
Political purposes (NCLB amp AYP)
Immigration Policy purposes
Unintended consequences In case of NCLB ELL group is always lower than the standard amp resources are not channeled to where they are most needed
Mandatory use of English in tests of content subjects puts pressure on the indigenous people to abandon education in their own language
The use of language tests for immigration leads to fraudulent practices amp short-term paper marriages
Using standards for harmonization amp identity
To enforce conformity to a single model that helps to create and maintain political unity and identity
ExamplesCarolingian empire of Charlemagne (CE 800ndash814)
CEFR (Now)
Carolingian empire of Charlemagne Within the empire of Charlemagne in Central and Western Europe various groups followed different calendars and the main Christian festivals fell on different dates
In order to bring uniformity Charlemagne set a new standard for lsquocomputistsrsquo who worked out the time of festivals They required to pass a test in order to get their certificate
There are no lsquocorrect answersrsquo for the questions in the Carolingian test they are scored as lsquocorrectrsquo because they are defined as such by the standard and the standard is arbitrarily chosen with the intention of harmonizing practice
CEFR (Common European Framework of Reference )
CEFR = A set of standards (six-level scales and their
descriptors ) that provides a European model for language
testing and learning to enhance European identity and
harmonization
Teachers should align their curriculum and tests to CEFR
standards (Linking) otherwise many European institutions
will not recognize the certificate they awarded
Problems with CEFRIt drains creativity among teachers
The same set of standards are used for all people across different contexts with different purposes
Validation is based on linking the test to the CEFR This is against validity theories
The use of standards and tests for harmonization ultimately leads to a desire for more control
How many standards can we afford
The number of performance levels depends on the goals and the use of the test
Choosing the fewest performance levels (pass or fail) is ideal because the more numerous the classes the greater will be the danger of a small difference in marks
Index of Separation estimates the number of performance levels into which a test can reliably place test takers
Sometimes we have to use numerous categories to motivate young learners
Performance level descriptors (PLDs) amp Test scores
PLDs are often developed based on intuitive and experiential method amp the labels and descriptors are simple reflections of the values of policy makers
There are around four levels lsquoadvanced ndash proficient ndash basic ndash below basicrsquo
The PLDs provide a conceptual hierarchy of performance that is an indication of the ability or knowledge of the test taker
Standard-setting is the process of deciding on a cut score for a test to mark the boundary between two PLDs If we have two performance levels (pass and fail) wersquoll need a single cut score
Standard based tests CRT amp scoring rubrics
It is said that tests used in standards-based testing are criterion- referenced yet for Glaser the criterion was the domain and it does not have anything to do with standard setting and classification
The standards-based testing movement has interpreted lsquocriterionrsquo to mean lsquostandardrsquo
The focus within PLDs is on the general levels of competence proficiency or performance while scoring rubrics address only single items
Some initial decisions All standard setting methods involve expert judgemental decision making at some level (Jaegar 1979)
Decision 1 Compensatory or non-compensatory marking The strength in other areas lsquocompensatesrsquo for the weakness in one area
Decision 2 What classification errors can you tolerate
Decision 3 Are you going to allow test takers who lsquofailrsquo a test to retake it If so what time lapse is required to retake the test
Second Page Lorem ipsum dolor sit amet consectetur adipisicing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua Ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur Excepteur sint occaecat cupidatat non proident sunt in culpa qui officia deserunt mollit anim id est laborum
Cycle Diagram
Test-centered
Criterion-referenced
Norm-referenced
Examinee-centered
Standard-Setting Methods
Classification of
Standard-setting methodologies
Test-centered
bull Angoffbull Ebelbull Nedelskybull Bookmark
Examinee-centered
bull Method of Contrasting Groups
bull Method of Borderline group
Common process of standard setting
Select an appropriate standard setting method depending upon the purpose of the standard setting available data and personnel
Select a panel of judges based upon explicit criteria Prepare the PLDs and other materials as appropriate Train the judges to use the method select Rate items or persons collect and store data Provide feedback on rating and initiate discussion for judges to
explain their ratings listen to others and revise their views or decisions before another round of judging
Collect final ratings and establish cut scores Ask the judges to evaluate the process Document the process in order to justify the conclusions reached
Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards
Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly
The average of these probabilities across judges or raters is the cut score
If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)
Advantages amp disadvantages
Clarity
Simplicity
Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way
+ -
Ebel method 2 Rounds Experts classify independently test items by
I level of difficulty
II level of relevance
easy medium hard
essential important acceptable questionabl
e
Ebel method The judges estimate the percentage of items a borderline
test taker would get correct for each cell Then the percentage for each cell is multiplied by the
number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700
These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge
Finally these are averaged across judges to give a final cut score
All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example
categories Expert 3 Expert 4 Expert 5 Number of items
in a category
(А)
correctly performed
items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
EssentialEasy 11 60 660 10 70 700 13 75 975
Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0
Questionable
Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0
Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35
Mean for all experts 28
Cut-score 12
hellip
Problems with EBELThe complex cognitive requirements of classifying items
according to two criteria in relation to an imagined borderline
student may be challenging for the judges
As it is assumed that some items may have questionable
relevance to the construct of interest it implicitly throws into
doubt the rigor of the test development process and validity
arguments
Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline
test taker would be able to eliminate
In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )
These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score
Problems with Nedelsky method
It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way
Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives
Bookmark method
Directions to Bookmark participants
Ordered item booklet
Booklet guideline
Student exemplar papers
Scoring guide
Essential materials
Standard Setting
Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments
Overview of established cut-scores by every expert repeating of the same procedure as
in the first step
Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is
introduced to them
Basic steps of the procedure
Round III
Round II
Round I
Procedures in Bookmark method
Judges are presented with the necessary materials Then they are asked to keep in mind a borderline
student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly
The bookmarks are discussed in group and finally
the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie
Borderline group method The judges define what borderline candidates are
like and then identify borderline candidates who fit the definition
Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score
The main problem the cut score is dependent upon the group being used in the study
Method of contrasting groupsProcedure includes testing of two groups of examinees
bullThe classification must be done using independent criteria such as teacher judgments
bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions
bull The cut score will be where overlap is observed in the distributions
Competent Non-competent
Which method is the lsquobestrsquo
It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available
However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores
The problem is getting the judgments of a number of people on a large group of individuals
Evaluating standard setting (Kane 1994)
Procedural evidence
bull What procedures were used for the standard-setting to ensure that the process is systematic
bull Were the judges properly trained in the methodology and allowed to express their views freely
Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )
External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible
Training a critical part of standard setting
Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback
Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers
The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity
The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard
setting in order to introduce a common language and a single reporting system into Europe
bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation
bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic
PLDs in the CEFR and in other standards-based systems
In the CEFR:
• The use of PLDs is institutionalized, and their meaning is generalized across nations.
• Standardization facilitates 'the implementation of a common understanding' of the CEFR, and training is cloning rather than familiarization.
• Benchmarking = the process of rating individual performance samples using the CEFR PLDs.
• Standard setting = 'mapping' the existing cut scores from tests onto CEFR levels.
In other standards-based systems:
• PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed.
• Standardization and training ensure that everyone understands the standard-setting method, yet judgments are freely made.
• Benchmarking = identifying typical performances after standard setting.
• Standard setting = establishing cut scores on tests.
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices; they are open to change, and even rejection, in the service of language education.
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems, with the intention of imposing a single acceptable teaching and assessment discourse upon professionals.
Thank You For Your Attention
Unintended consequences In case of NCLB ELL group is always lower than the standard amp resources are not channeled to where they are most needed
Mandatory use of English in tests of content subjects puts pressure on the indigenous people to abandon education in their own language
The use of language tests for immigration leads to fraudulent practices amp short-term paper marriages
Using standards for harmonization amp identity
To enforce conformity to a single model that helps to create and maintain political unity and identity
ExamplesCarolingian empire of Charlemagne (CE 800ndash814)
CEFR (Now)
Carolingian empire of Charlemagne Within the empire of Charlemagne in Central and Western Europe various groups followed different calendars and the main Christian festivals fell on different dates
In order to bring uniformity Charlemagne set a new standard for lsquocomputistsrsquo who worked out the time of festivals They required to pass a test in order to get their certificate
There are no lsquocorrect answersrsquo for the questions in the Carolingian test they are scored as lsquocorrectrsquo because they are defined as such by the standard and the standard is arbitrarily chosen with the intention of harmonizing practice
CEFR (Common European Framework of Reference )
CEFR = A set of standards (six-level scales and their
descriptors ) that provides a European model for language
testing and learning to enhance European identity and
harmonization
Teachers should align their curriculum and tests to CEFR
standards (Linking) otherwise many European institutions
will not recognize the certificate they awarded
Problems with CEFRIt drains creativity among teachers
The same set of standards are used for all people across different contexts with different purposes
Validation is based on linking the test to the CEFR This is against validity theories
The use of standards and tests for harmonization ultimately leads to a desire for more control
How many standards can we afford
The number of performance levels depends on the goals and the use of the test
Choosing the fewest performance levels (pass or fail) is ideal because the more numerous the classes the greater will be the danger of a small difference in marks
Index of Separation estimates the number of performance levels into which a test can reliably place test takers
Sometimes we have to use numerous categories to motivate young learners
Performance level descriptors (PLDs) amp Test scores
PLDs are often developed based on intuitive and experiential method amp the labels and descriptors are simple reflections of the values of policy makers
There are around four levels lsquoadvanced ndash proficient ndash basic ndash below basicrsquo
The PLDs provide a conceptual hierarchy of performance that is an indication of the ability or knowledge of the test taker
Standard-setting is the process of deciding on a cut score for a test to mark the boundary between two PLDs If we have two performance levels (pass and fail) wersquoll need a single cut score
Standard based tests CRT amp scoring rubrics
It is said that tests used in standards-based testing are criterion- referenced yet for Glaser the criterion was the domain and it does not have anything to do with standard setting and classification
The standards-based testing movement has interpreted lsquocriterionrsquo to mean lsquostandardrsquo
The focus within PLDs is on the general levels of competence proficiency or performance while scoring rubrics address only single items
Some initial decisions All standard setting methods involve expert judgemental decision making at some level (Jaegar 1979)
Decision 1 Compensatory or non-compensatory marking The strength in other areas lsquocompensatesrsquo for the weakness in one area
Decision 2 What classification errors can you tolerate
Decision 3 Are you going to allow test takers who lsquofailrsquo a test to retake it If so what time lapse is required to retake the test
Second Page Lorem ipsum dolor sit amet consectetur adipisicing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua Ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur Excepteur sint occaecat cupidatat non proident sunt in culpa qui officia deserunt mollit anim id est laborum
Cycle Diagram
Test-centered
Criterion-referenced
Norm-referenced
Examinee-centered
Standard-Setting Methods
Classification of
Standard-setting methodologies
Test-centered
bull Angoffbull Ebelbull Nedelskybull Bookmark
Examinee-centered
bull Method of Contrasting Groups
bull Method of Borderline group
Common process of standard setting
Select an appropriate standard setting method depending upon the purpose of the standard setting available data and personnel
Select a panel of judges based upon explicit criteria Prepare the PLDs and other materials as appropriate Train the judges to use the method select Rate items or persons collect and store data Provide feedback on rating and initiate discussion for judges to
explain their ratings listen to others and revise their views or decisions before another round of judging
Collect final ratings and establish cut scores Ask the judges to evaluate the process Document the process in order to justify the conclusions reached
Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards
Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly
The average of these probabilities across judges or raters is the cut score
If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)
Advantages amp disadvantages
Clarity
Simplicity
Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way
+ -
Ebel method 2 Rounds Experts classify independently test items by
I level of difficulty
II level of relevance
easy medium hard
essential important acceptable questionabl
e
Ebel method The judges estimate the percentage of items a borderline
test taker would get correct for each cell Then the percentage for each cell is multiplied by the
number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700
These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge
Finally these are averaged across judges to give a final cut score
All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example
categories Expert 3 Expert 4 Expert 5 Number of items
in a category
(А)
correctly performed
items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
EssentialEasy 11 60 660 10 70 700 13 75 975
Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0
Questionable
Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0
Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35
Mean for all experts 28
Cut-score 12
hellip
Problems with EBELThe complex cognitive requirements of classifying items
according to two criteria in relation to an imagined borderline
student may be challenging for the judges
As it is assumed that some items may have questionable
relevance to the construct of interest it implicitly throws into
doubt the rigor of the test development process and validity
arguments
Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline
test taker would be able to eliminate
In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )
These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score
Problems with Nedelsky method
It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way
Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives
Bookmark method
Directions to Bookmark participants
Ordered item booklet
Booklet guideline
Student exemplar papers
Scoring guide
Essential materials
Standard Setting
Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments
Overview of established cut-scores by every expert repeating of the same procedure as
in the first step
Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is
introduced to them
Basic steps of the procedure
Round III
Round II
Round I
Procedures in Bookmark method
Judges are presented with the necessary materials Then they are asked to keep in mind a borderline
student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly
The bookmarks are discussed in group and finally
the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie
Borderline group method The judges define what borderline candidates are
like and then identify borderline candidates who fit the definition
Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score
The main problem the cut score is dependent upon the group being used in the study
Method of contrasting groupsProcedure includes testing of two groups of examinees
bullThe classification must be done using independent criteria such as teacher judgments
bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions
bull The cut score will be where overlap is observed in the distributions
Competent Non-competent
Which method is the lsquobestrsquo
It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available
However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores
The problem is getting the judgments of a number of people on a large group of individuals
Evaluating standard setting (Kane 1994)
Procedural evidence
bull What procedures were used for the standard-setting to ensure that the process is systematic
bull Were the judges properly trained in the methodology and allowed to express their views freely
Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )
External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible
Training a critical part of standard setting
Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback
Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers
The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity
The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard
setting in order to introduce a common language and a single reporting system into Europe
bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation
bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic
PLDs in the CEFR and in other standards-based systems

CEFR:
• PLDs: their use is institutionalized, and their meaning is generalized across nations.
• Standardization facilitates 'the implementation of a common understanding' of the CEFR, and training is cloning rather than familiarization.
• Benchmarking: the process of rating individual performance samples using the CEFR PLDs.
• Standard setting: 'mapping' the existing cut scores from tests onto CEFR levels.

Other standards-based systems:
• PLDs: evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed.
• Standardization and training: ensure that everyone understands the standard-setting method, yet judgments are freely made.
• Benchmarking: the typical performances identified after standard setting.
• Standard setting: establishing cut scores on tests.
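The contrast in the last pair of points can be made concrete. In a generic standards-based system, a panel establishes cut scores directly on its own test; under the CEFR Manual, an existing score scale is instead mapped onto the fixed A1–C2 levels. A minimal sketch of such a mapping, with invented thresholds on a 0–100 scale:

```python
# Hypothetical cut scores mapping a 0-100 test scale onto CEFR levels.
# The thresholds are invented for illustration, not drawn from any real study.
CEFR_CUTS = [(90, "C2"), (78, "C1"), (64, "B2"), (50, "B1"), (35, "A2"), (20, "A1")]

def cefr_level(score):
    """Return the highest CEFR level whose cut score the candidate meets."""
    for cut, level in CEFR_CUTS:
        if score >= cut:
            return level
    return "below A1"

print(cefr_level(70))  # "B2" under these invented thresholds
```

Note that nothing in the mapping itself validates the chosen thresholds; it simply relabels the existing scale in CEFR terms.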
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices; they are open to change, and even rejection, in the service of language education.
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems, with the intention of imposing a single acceptable teaching and assessment discourse upon professionals.
Thank You For Your Attention
- Glen Fulcher (2010) Practical Language Testing, Chapter 5: Aligning tests to standards
- Content of this chapter
- It's as old as the hills
- Definition of 'standard'
- The uses of standards
- Unintended consequences
- Using standards for harmonization & identity
- Carolingian empire of Charlemagne
- CEFR (Common European Framework of Reference)
- Problems with the CEFR
- How many standards can we afford?
- Performance level descriptors (PLDs) & test scores
- Standards-based tests, CRT & scoring rubrics
- Some initial decisions
- Standard-setting methodologies
- Common process of standard setting
- Test-centered methods
- Angoff method
- Advantages & disadvantages
- Ebel method
- Ebel method (2)
- Problems with Ebel
- Nedelsky method (multiple choice)
- Problems with the Nedelsky method
- Bookmark method
- Procedures in the Bookmark method
- Examinee-centered methods
- Borderline group method
- Method of contrasting groups
- Which method is the 'best'?
- Evaluating standard setting (Kane 1994)
- Training: a critical part of standard setting
- The special case of the CEFR
- PLDs in the CEFR & in other standards-based systems
- You can always count on uncertainty
CEFR (Common European Framework of Reference)
CEFR = a set of standards (six-level scales and their descriptors) that provides a European model for language testing and learning, intended to enhance European identity and harmonization.
Teachers should align their curricula and tests to CEFR standards ('linking'); otherwise many European institutions will not recognize the certificates they award.
Problems with the CEFR
It drains creativity among teachers.
The same set of standards is used for all people, across different contexts, with different purposes.
Validation is based on linking the test to the CEFR; this runs against validity theories.
The use of standards and tests for harmonization ultimately leads to a desire for more control.
How many standards can we afford?
The number of performance levels depends on the goals and the use of the test.
Choosing the fewest performance levels (pass/fail) is ideal, because the more numerous the classes, the greater the danger that a small difference in marks changes a classification.
The Index of Separation estimates the number of performance levels into which a test can reliably place test takers.
Sometimes we have to use more numerous categories, for example to motivate young learners.
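The chapter does not give the formula here, so this is only a hedged sketch of one common operationalization of a separation index (the Rasch-style separation G = sqrt(R / (1 − R)) and the strata formula (4G + 1) / 3); the function names are illustrative, not from the book:

```python
import math

def separation_index(reliability):
    """Assumed formula: G = sqrt(R / (1 - R)) for test reliability R."""
    return math.sqrt(reliability / (1.0 - reliability))

def distinct_levels(reliability):
    """Assumed strata formula (4G + 1) / 3: the number of performance
    levels the test can reliably distinguish, rounded down."""
    return int((4.0 * separation_index(reliability) + 1.0) / 3.0)

# A test with reliability 0.90 gives G = 3.0, i.e. about 4 distinct levels.
```

On this reading, a highly reliable test supports more performance levels, while a weak test can defensibly support only pass/fail.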
Performance level descriptors (PLDs) & test scores
PLDs are often developed through intuitive and experiential methods, and the labels and descriptors are simple reflections of the values of policy makers.
There are usually around four levels: 'advanced – proficient – basic – below basic'.
The PLDs provide a conceptual hierarchy of performance that is an indication of the ability or knowledge of the test taker.
Standard-setting is the process of deciding on a cut score for a test to mark the boundary between two PLDs. If we have two performance levels (pass and fail), we'll need a single cut score.
Standards-based tests, CRT & scoring rubrics
Tests used in standards-based testing are said to be criterion-referenced, yet for Glaser the criterion was the domain; it had nothing to do with standard setting and classification.
The standards-based testing movement has interpreted 'criterion' to mean 'standard'.
The focus within PLDs is on general levels of competence, proficiency or performance, while scoring rubrics address only single items.
Some initial decisions
All standard-setting methods involve expert judgemental decision making at some level (Jaeger 1979).
Decision 1: Compensatory or non-compensatory marking? In compensatory marking, strength in one area 'compensates' for weakness in another.
Decision 2: What classification errors can you tolerate?
Decision 3: Are you going to allow test takers who 'fail' a test to retake it? If so, what time lapse is required before retaking the test?
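Decision 1 can be sketched in a few lines; the scores and cut values below are hypothetical, chosen only to show how the two rules can disagree about the same candidate:

```python
def compensatory_pass(subscores, cut_total):
    """Strength in one area can offset weakness in another:
    only the total score is compared with the cut score."""
    return sum(subscores) >= cut_total

def non_compensatory_pass(subscores, cut_each):
    """Every area must independently clear its own cut score."""
    return all(s >= c for s, c in zip(subscores, cut_each))

# Reading 70, writing 40: the total (110) clears a compensatory cut of
# 100, but fails a non-compensatory rule requiring 50 in each skill.
scores = [70, 40]
```

The same performance profile can thus pass under one marking model and fail under the other, which is why this decision must be made before any cut scores are set.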
[Diagram: classification of standard-setting methods – test-centered and examinee-centered, within the criterion-referenced and norm-referenced traditions]
Standard-setting methodologies
Test-centered: Angoff, Ebel, Nedelsky, Bookmark
Examinee-centered: method of contrasting groups, borderline group method
Common process of standard setting
- Select an appropriate standard-setting method, depending upon the purpose of the standard setting and the available data and personnel.
- Select a panel of judges based upon explicit criteria.
- Prepare the PLDs and other materials as appropriate.
- Train the judges to use the method selected.
- Rate items or persons; collect and store data.
- Provide feedback on ratings and initiate discussion, so judges can explain their ratings, listen to others, and revise their views or decisions before another round of judging.
- Collect final ratings and establish cut scores.
- Ask the judges to evaluate the process.
- Document the process in order to justify the conclusions reached.
Test-centered methods: The judges are presented with individual items or tasks and asked to decide on the expected performance on them by a test taker who is just below the border between two standards.
Angoff method: Experts are given a set of items and rate the probability that a hypothetical borderline learner would answer each item correctly.
These probabilities are summed for each judge to give an expected score for the borderline candidate, and the average across judges is the cut score.
If the test contains polytomous items or tasks, the proportion of the maximum score is used instead of the probability (modified Angoff).
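The Angoff arithmetic can be sketched as follows; the judges and probabilities are hypothetical:

```python
def angoff_cut_score(ratings):
    """ratings[j][i] = judge j's probability that a borderline test
    taker answers item i correctly. Each judge's probabilities sum to
    an expected number-correct score; the cut score is the mean of
    those sums across judges."""
    judge_sums = [sum(judge) for judge in ratings]
    return sum(judge_sums) / len(judge_sums)

# Two judges rating three items (hypothetical probabilities):
ratings = [
    [0.8, 0.6, 0.4],  # judge 1: expected score 1.8
    [0.7, 0.5, 0.3],  # judge 2: expected score 1.5
]
cut = angoff_cut_score(ratings)  # (1.8 + 1.5) / 2 = 1.65
```

For the modified Angoff, the same aggregation applies, with proportions of the maximum task score in place of probabilities.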
Advantages & disadvantages
+ Clarity and simplicity
- Cognitive difficulty: the judges may not all conceptualize the borderline learner in precisely the same way
Ebel method (2 rounds): Experts independently classify test items by
I. level of difficulty: easy, medium, hard
II. level of relevance: essential, important, acceptable, questionable
Ebel method: The judges estimate the percentage of items in each cell that a borderline test taker would get correct.
The percentage for each cell is then multiplied by the number of items in it; so if the 'easy/essential' cell has 20 items and an estimate of 85%, then 20 × 85 = 1700.
These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge.
Finally, these are averaged across judges to give a final cut score.
All items can be classified into the 12 cells of a 3 × 4 grid defined by the three difficulty and four relevance categories, as in the example.
Example (A = number of items in a category, B = % correctly performed):

Category               Expert 3 (A / B / A×B)   Expert 4 (A / B / A×B)   Expert 5 (A / B / A×B)
Essential / Easy       11 / 60 / 660            10 / 70 / 700            13 / 75 / 975
Essential / Medium      1 / 25 / 25              3 / 25 / 75              1 / 0 / 0
Essential / Hard        0 / 10 / 0               1 / 0 / 0                0 / 0 / 0
Questionable / Easy     0 / 0 / 0                0 / 0 / 0                0 / 0 / 0
Questionable / Medium   0 / 0 / 0                0 / 0 / 0                0 / 0 / 0
Questionable / Hard     0 / 0 / 0                0 / 0 / 0                0 / 0 / 0
Mean                   25.1                     26.7                     35
Mean for all experts: 28
Cut-score: 12
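The per-judge arithmetic described above can be sketched as follows, using the slide's own 'easy/essential' example (20 items at 85%) plus two made-up cells; the cut score comes out as a percentage:

```python
def ebel_cut_score(judges):
    """For each judge: cells of (number_of_items, expected_percent_correct)
    for a borderline test taker. Sum n * pct over the cells, divide by the
    total number of items, then average the judges' percentage cut scores."""
    per_judge = []
    for cells in judges:
        total_items = sum(n for n, _ in cells)
        weighted = sum(n * pct for n, pct in cells)  # e.g. 20 items * 85 = 1700
        per_judge.append(weighted / total_items)
    return sum(per_judge) / len(per_judge)

# One judge, three occupied cells (easy/essential: 20 items at 85%, etc.):
judge = [(20, 85), (10, 60), (10, 30)]
cut_pct = ebel_cut_score([judge])  # (1700 + 600 + 300) / 40 = 65.0
```

With several judges, each supplies their own grid and the percentages are averaged, as in the table above.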
Problems with the Ebel method
The complex cognitive requirement of classifying items according to two criteria, in relation to an imagined borderline student, may be challenging for the judges.
As the method assumes that some items may have questionable relevance to the construct of interest, it implicitly throws into doubt the rigor of the test development process and the validity arguments.
Nedelsky method (multiple choice): The experts estimate which options of each multiple-choice item a borderline test taker would be able to eliminate.
In a four-option item with three distractors, if a candidate can eliminate all 3 distractors, the chance of getting the item right is 1 (100%); if he can rule out only 1 of the distractors, the chance of answering the item correctly is 1 in 3 (33%).
These probabilities are summed across items for each judge, and the judges' totals are then averaged to arrive at a cut score.
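A single judge's Nedelsky calculation can be sketched as below; the elimination counts are hypothetical, and with several judges the resulting totals would be averaged:

```python
def nedelsky_cut_score(eliminated_counts, n_options=4):
    """eliminated_counts[i] = distractors a borderline test taker can rule
    out on item i. The chance of a correct answer is 1 / (options left);
    summing over items gives this judge's raw cut score."""
    return sum(1.0 / (n_options - e) for e in eliminated_counts)

# Three four-option items: 3, 1 and 0 distractors eliminated.
cut = nedelsky_cut_score([3, 1, 0])  # 1.0 + 1/3 + 1/4
```

Because every item contributes at least the chance-guessing probability (1/4 here), the method's floor is above zero, which is one reason it tends toward low cut scores.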
Problems with the Nedelsky method
It assumes that test takers answer multiple-choice items by eliminating the options they think are distractors and then guessing randomly among the remaining options. However, it is highly unlikely that test takers actually answer items in this way.
The Nedelsky method tends to produce lower cut scores than other methods, and is therefore likely to increase the number of false positives.
Bookmark method
Essential materials: directions to Bookmark participants; ordered item booklet; booklet guideline; student exemplar papers; scoring guide.
Basic steps of the procedure:
Round I: Experts are informed of the number of cut scores to establish; they work in small groups, and all the essential material is introduced to them.
Round II: Overview of the cut scores established by every expert; the same procedure as in the first round is repeated.
Round III: Presentation of the percentage of students falling into each performance level and of each median cut score from Round II; after discussion, individual judgments are made.
Procedures in the Bookmark method
Judges are presented with the necessary materials. They are then asked to keep in mind a borderline student and place a 'bookmark' in the booklet between two items, such that the candidate is more likely to answer the items below it correctly and the items above it incorrectly.
The bookmarks are discussed in the group and, finally, the median of the bookmarks for each cut point is taken as that group's recommendation for that cut point.
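A deliberately simplified sketch of the median-bookmark step is below. Operational Bookmark studies usually derive the cut from IRT ability estimates at a chosen response probability; here the cut is just read off as the difficulty of the item at the median bookmark, and all the numbers are hypothetical:

```python
import statistics

def bookmark_cut(bookmarks, ordered_difficulties):
    """bookmarks[j] = position in the ordered item booklet where judge j
    placed the bookmark. The group recommendation is the median bookmark;
    the cut score here is the difficulty of the item at that position."""
    median_mark = int(statistics.median(bookmarks))
    return ordered_difficulties[median_mark]

# Items ordered easiest to hardest by a difficulty estimate:
difficulties = [-2.0, -1.0, -0.2, 0.5, 1.3, 2.1]
cut = bookmark_cut([2, 3, 3, 4, 3], difficulties)  # median bookmark 3 -> 0.5
```

With multiple cut points (e.g. basic/proficient/advanced), each judge places one bookmark per cut and the median is taken separately for each.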
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard; the test is then administered to discover where the cut score should lie.
Borderline group method: The judges define what borderline candidates are like and then identify candidates who fit the definition.
Once the students have been placed into groups, the test can be administered. The median score of the group defined as borderline is used as the cut score.
The main problem: the cut score is dependent upon the particular group used in the study.
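The borderline-group calculation itself is just a median; a minimal sketch with made-up scores:

```python
import statistics

def borderline_cut(borderline_scores):
    """The cut score is the median test score of the group that the
    judges classified as borderline before the test was administered."""
    return statistics.median(borderline_scores)

# Hypothetical scores of five candidates judged 'borderline':
cut = borderline_cut([48, 55, 51, 60, 47])  # median = 51
```

Rerunning this with a different borderline group would yield a different median, which is exactly the group-dependence problem noted above.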
Method of contrasting groups: The procedure involves testing two groups of examinees, one judged competent and one not competent.
- The classification must be made using independent criteria, such as teacher judgments.
- The test is then given and the score distributions are calculated; there are likely to be overlaps between the distributions.
- The cut score is placed where the overlap in the distributions is observed.
[Figure: overlapping score distributions for the competent and non-competent groups]
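One common way to operationalize 'placing the cut in the overlap' (an assumption here, not a procedure the slides spell out) is to scan candidate cuts and minimize misclassifications; all scores below are hypothetical:

```python
def contrasting_groups_cut(competent, not_competent):
    """Scan candidate cut scores and pick the one minimizing total
    classification errors: competent examinees below the cut (false
    negatives) plus not-competent examinees at or above it (false
    positives)."""
    best_cut, best_errors = None, float("inf")
    for cut in sorted(set(competent) | set(not_competent)):
        false_neg = sum(1 for s in competent if s < cut)
        false_pos = sum(1 for s in not_competent if s >= cut)
        if false_neg + false_pos < best_errors:
            best_cut, best_errors = cut, false_neg + false_pos
    return best_cut, best_errors

# Hypothetical scores for the two independently classified groups:
competent = [55, 60, 62, 70, 58]
not_competent = [40, 45, 52, 57, 48]
cut, errors = contrasting_groups_cut(competent, not_competent)  # 55, 1 error
```

Counting false negatives and false positives explicitly is what makes this the one method that lets you report likely decision errors alongside the cut score.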
Which method is the 'best'?
It depends on what kind of judgments you can get for your standard-setting study and on the quality of the judges available.
However, the contrasting-groups approach is recommended where possible, because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores.
The problem is getting the judgments of a number of people on a large group of individuals.
Evaluating standard setting (Kane 1994)
Procedural evidence:
- What procedures were used to ensure that the standard-setting process was systematic?
- Were the judges properly trained in the methodology and allowed to express their views freely?
Internal evidence:
- Deals with the consistency of results arising from the procedure.
- It also estimates the extent of agreement between judges (e.g. Cohen's kappa).
External evidence:
- Correlation of the scores of learners in a borderline group study with some other test of the same construct.
- A high correlation suggests the established cut scores are defensible.
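The internal-evidence statistic mentioned above, Cohen's kappa, can be sketched for two judges making pass/fail classifications; the ratings are hypothetical:

```python
def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two judges' classifications:
    kappa = (observed agreement - expected agreement) / (1 - expected)."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    categories = set(ratings_a) | set(ratings_b)
    expected = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Two judges classifying six candidates:
a = ["pass", "pass", "fail", "fail", "pass", "fail"]
b = ["pass", "fail", "fail", "fail", "pass", "pass"]
kappa = cohens_kappa(a, b)  # observed 4/6, expected 0.5 -> kappa = 1/3
```

Kappa of 1 means perfect agreement and 0 means chance-level agreement, so a value of 1/3 here would flag weak consistency between the judges.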
Training: a critical part of standard setting
Training activities include familiarization with the PLDs and the test, looking at the scoring keys, making practice judgments, and getting feedback.
Different views may lead to disagreements among the judges. Training should not be designed to eliminate these variations but to allow free discussion among judges; if the judges do not converge, the outcome should be accepted by the researchers.
The training process should not force agreement ('cloning'), because removing the judges' individuality and inducing agreement is a threat to validity.
The special case of the CEFR
- The CEFR Manual contains performance level descriptors for standard setting, in order to introduce a common language and a single reporting system into Europe.
- It recommends five processes to 'relate' language examinations to the Common European Framework of Reference (CEFR) for Languages: Learning, Teaching, Assessment. These processes are: familiarization; specification; standardization training/benchmarking; standard-setting; and validation.
- Familiarization, standard-setting and validation are uncontentious, because they reflect common international assessment practice that is not unique to Europe; the other two processes, however, are problematic.
PLDs in the CEFR & in other standards-based systems

CEFR:
- The use of PLDs is institutionalized, and their meaning is generalized across nations.
- Standardization facilitates 'the implementation of a common understanding' of the CEFR, and training is cloning rather than familiarization.
- Benchmarking = the process of rating individual performance samples using the CEFR PLDs.
- Standard-setting = 'mapping' the existing cut scores from tests onto CEFR levels.

Other standards-based systems:
- PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed.
- Standardization and training ensure that everyone understands the standard-setting method, yet judgments are freely made.
- Benchmarking = identifying typical performances after standard-setting.
- Standard-setting = establishing cut scores on tests.
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus, rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices; they are open to change, and even rejection, in the service of language education.
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems, with the intention of imposing a single acceptable teaching and assessment discourse upon professionals.
Thank You For Your Attention
- Glen Fulcher (2010) Practical Language Testing Chapter 5 A
- Content of this chapter
- Itrsquos as old as the hills
- Definition of lsquostandardrsquo
- The uses of standards
- Unintended consequences
- Using standards for harmonization amp identity
- Carolingian empire of Charlemagne
- CEFR (Common European Framework of Reference )
- Problems with CEFR
- How many standards can we afford
- Performance level descriptors (PLDs) amp Test scores
- Standard based tests CRT amp scoring rubrics
- Some initial decisions
- Slide 15
- Standard-setting methodologies
- Common process of standard setting
- Test-centered methods
- Angoff method
- Advantages amp disadvantages
- Ebel method
- Ebel method (2)
- Slide 23
- Problems with EBEL
- Nedelsky method (Multiple-choice)
- Problems with Nedelsky method
- Bookmark method
- Slide 28
- Procedures in Bookmark method
- Examinee-centered methods
- Borderline group method
- Method of contrasting groups
- Slide 33
- Which method is the lsquobestrsquo
- Evaluating standard setting (Kane 1994)
- Training a critical part of standard setting
- The special case of the CEFR
- PLDs in CEFR amp in other standard-based systems
- You can always count on uncertainty
- Slide 40
-
Problems with CEFRIt drains creativity among teachers
The same set of standards are used for all people across different contexts with different purposes
Validation is based on linking the test to the CEFR This is against validity theories
The use of standards and tests for harmonization ultimately leads to a desire for more control
How many standards can we afford
The number of performance levels depends on the goals and the use of the test
Choosing the fewest performance levels (pass or fail) is ideal because the more numerous the classes the greater will be the danger of a small difference in marks
Index of Separation estimates the number of performance levels into which a test can reliably place test takers
Sometimes we have to use numerous categories to motivate young learners
Performance level descriptors (PLDs) amp Test scores
PLDs are often developed based on intuitive and experiential method amp the labels and descriptors are simple reflections of the values of policy makers
There are around four levels lsquoadvanced ndash proficient ndash basic ndash below basicrsquo
The PLDs provide a conceptual hierarchy of performance that is an indication of the ability or knowledge of the test taker
Standard-setting is the process of deciding on a cut score for a test to mark the boundary between two PLDs If we have two performance levels (pass and fail) wersquoll need a single cut score
Standard based tests CRT amp scoring rubrics
It is said that tests used in standards-based testing are criterion- referenced yet for Glaser the criterion was the domain and it does not have anything to do with standard setting and classification
The standards-based testing movement has interpreted lsquocriterionrsquo to mean lsquostandardrsquo
The focus within PLDs is on the general levels of competence proficiency or performance while scoring rubrics address only single items
Some initial decisions All standard setting methods involve expert judgemental decision making at some level (Jaegar 1979)
Decision 1 Compensatory or non-compensatory marking The strength in other areas lsquocompensatesrsquo for the weakness in one area
Decision 2 What classification errors can you tolerate
Decision 3 Are you going to allow test takers who lsquofailrsquo a test to retake it If so what time lapse is required to retake the test
Second Page Lorem ipsum dolor sit amet consectetur adipisicing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua Ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur Excepteur sint occaecat cupidatat non proident sunt in culpa qui officia deserunt mollit anim id est laborum
Cycle Diagram
Test-centered
Criterion-referenced
Norm-referenced
Examinee-centered
Standard-Setting Methods
Classification of
Standard-setting methodologies
Test-centered
bull Angoffbull Ebelbull Nedelskybull Bookmark
Examinee-centered
bull Method of Contrasting Groups
bull Method of Borderline group
Common process of standard setting
Select an appropriate standard setting method depending upon the purpose of the standard setting available data and personnel
Select a panel of judges based upon explicit criteria Prepare the PLDs and other materials as appropriate Train the judges to use the method select Rate items or persons collect and store data Provide feedback on rating and initiate discussion for judges to
explain their ratings listen to others and revise their views or decisions before another round of judging
Collect final ratings and establish cut scores Ask the judges to evaluate the process Document the process in order to justify the conclusions reached
Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards
Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly
The average of these probabilities across judges or raters is the cut score
If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)
Advantages amp disadvantages
Clarity
Simplicity
Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way
+ -
Ebel method 2 Rounds Experts classify independently test items by
I level of difficulty
II level of relevance
easy medium hard
essential important acceptable questionabl
e
Ebel method The judges estimate the percentage of items a borderline
test taker would get correct for each cell Then the percentage for each cell is multiplied by the
number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700
These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge
Finally these are averaged across judges to give a final cut score
All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example
categories Expert 3 Expert 4 Expert 5 Number of items
in a category
(А)
correctly performed
items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
EssentialEasy 11 60 660 10 70 700 13 75 975
Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0
Questionable
Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0
Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35
Mean for all experts 28
Cut-score 12
hellip
Problems with EBELThe complex cognitive requirements of classifying items
according to two criteria in relation to an imagined borderline
student may be challenging for the judges
As it is assumed that some items may have questionable
relevance to the construct of interest it implicitly throws into
doubt the rigor of the test development process and validity
arguments
Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline
test taker would be able to eliminate
In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )
These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score
Problems with Nedelsky method
It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way
Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives
Bookmark method
Directions to Bookmark participants
Ordered item booklet
Booklet guideline
Student exemplar papers
Scoring guide
Essential materials
Standard Setting
Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments
Overview of established cut-scores by every expert repeating of the same procedure as
in the first step
Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is
introduced to them
Basic steps of the procedure
Round III
Round II
Round I
Procedures in Bookmark method
Judges are presented with the necessary materials Then they are asked to keep in mind a borderline
student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly
The bookmarks are discussed in group and finally
the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie
Borderline group method The judges define what borderline candidates are
like and then identify borderline candidates who fit the definition
Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score
The main problem the cut score is dependent upon the group being used in the study
Method of contrasting groupsProcedure includes testing of two groups of examinees
bullThe classification must be done using independent criteria such as teacher judgments
bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions
bull The cut score will be where overlap is observed in the distributions
Competent Non-competent
Which method is the lsquobestrsquo
It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available
However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores
The problem is getting the judgments of a number of people on a large group of individuals
Evaluating standard setting (Kane 1994)
Procedural evidence
bull What procedures were used for the standard-setting to ensure that the process is systematic
bull Were the judges properly trained in the methodology and allowed to express their views freely
Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )
External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible
Training a critical part of standard setting
Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback
Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers
The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity
The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard
setting in order to introduce a common language and a single reporting system into Europe
bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation
bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic
PLDs in CEFR amp in other standard-based systems
The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations
Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization
Benchmarking = the process of rating individual performance samples using the CEFR PLDs
Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels
PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed
Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made
Benchmarking = the typical performances that are identified after standard-setting
Standard-setting = establishing cut scores on tests
CEFR Other standard-based systems
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals
Thank You For Your Attention
- Glen Fulcher (2010) Practical Language Testing Chapter 5 A
- Content of this chapter
- Itrsquos as old as the hills
- Definition of lsquostandardrsquo
- The uses of standards
- Unintended consequences
- Using standards for harmonization amp identity
- Carolingian empire of Charlemagne
- CEFR (Common European Framework of Reference )
- Problems with CEFR
- How many standards can we afford
- Performance level descriptors (PLDs) amp Test scores
- Standard based tests CRT amp scoring rubrics
- Some initial decisions
- Slide 15
- Standard-setting methodologies
- Common process of standard setting
- Test-centered methods
- Angoff method
- Advantages amp disadvantages
- Ebel method
- Ebel method (2)
- Slide 23
- Problems with EBEL
- Nedelsky method (Multiple-choice)
- Problems with Nedelsky method
- Bookmark method
- Slide 28
- Procedures in Bookmark method
- Examinee-centered methods
- Borderline group method
- Method of contrasting groups
- Slide 33
- Which method is the lsquobestrsquo
- Evaluating standard setting (Kane 1994)
- Training a critical part of standard setting
- The special case of the CEFR
- PLDs in CEFR amp in other standard-based systems
- You can always count on uncertainty
- Slide 40
-
How many standards can we afford
The number of performance levels depends on the goals and the use of the test
Choosing the fewest performance levels (pass or fail) is ideal because the more numerous the classes the greater will be the danger of a small difference in marks
Index of Separation estimates the number of performance levels into which a test can reliably place test takers
Sometimes we have to use numerous categories to motivate young learners
Performance level descriptors (PLDs) amp Test scores
PLDs are often developed based on intuitive and experiential method amp the labels and descriptors are simple reflections of the values of policy makers
There are around four levels lsquoadvanced ndash proficient ndash basic ndash below basicrsquo
The PLDs provide a conceptual hierarchy of performance that is an indication of the ability or knowledge of the test taker
Standard-setting is the process of deciding on a cut score for a test to mark the boundary between two PLDs If we have two performance levels (pass and fail) wersquoll need a single cut score
Standard based tests CRT amp scoring rubrics
It is said that tests used in standards-based testing are criterion- referenced yet for Glaser the criterion was the domain and it does not have anything to do with standard setting and classification
The standards-based testing movement has interpreted lsquocriterionrsquo to mean lsquostandardrsquo
The focus within PLDs is on the general levels of competence proficiency or performance while scoring rubrics address only single items
Some initial decisions All standard setting methods involve expert judgemental decision making at some level (Jaegar 1979)
Decision 1 Compensatory or non-compensatory marking The strength in other areas lsquocompensatesrsquo for the weakness in one area
Decision 2 What classification errors can you tolerate
Decision 3 Are you going to allow test takers who lsquofailrsquo a test to retake it If so what time lapse is required to retake the test
Second Page Lorem ipsum dolor sit amet consectetur adipisicing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua Ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur Excepteur sint occaecat cupidatat non proident sunt in culpa qui officia deserunt mollit anim id est laborum
Cycle Diagram
Test-centered
Criterion-referenced
Norm-referenced
Examinee-centered
Standard-Setting Methods
Classification of
Standard-setting methodologies
Test-centered:
• Angoff
• Ebel
• Nedelsky
• Bookmark
Examinee-centered:
• Method of contrasting groups
• Borderline group method
Common process of standard setting
1. Select an appropriate standard-setting method, depending upon the purpose of the standard setting, the available data, and personnel.
2. Select a panel of judges based upon explicit criteria.
3. Prepare the PLDs and other materials as appropriate.
4. Train the judges to use the method selected.
5. Rate items or persons; collect and store data.
6. Provide feedback on ratings and initiate discussion, so that judges can explain their ratings, listen to others, and revise their views or decisions before another round of judging.
7. Collect final ratings and establish cut scores.
8. Ask the judges to evaluate the process.
9. Document the process in order to justify the conclusions reached.
Test-centered methods
The judges are presented with individual items or tasks and are required to make a decision about the expected performance on them by a test taker who is just below the border between two standards.
Angoff method
Experts are given a set of items and rate the probability that a hypothetical learner who is on the borderline would answer each test item correctly. The average of these probabilities across judges or raters gives the cut score. If the test contains polytomous items or tasks, the proportion of the maximum score is used instead of the probability (modified Angoff).
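The Angoff arithmetic can be sketched as follows. This is a minimal illustration: summing a judge's probabilities gives that judge's expected raw score for the borderline candidate, and these are averaged across judges. The ratings in the test are invented; a real study uses many more items and judges.

```python
def angoff_cut_score(ratings):
    """Angoff cut score from per-judge item probabilities.

    ratings: one list per judge, giving the estimated probability that a
    borderline candidate answers each item correctly.
    """
    # Each judge's expected raw score for the borderline candidate
    per_judge = [sum(judge) for judge in ratings]
    # The cut score is the average of these expected scores across judges
    return sum(per_judge) / len(per_judge)
```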
Advantages and disadvantages
+ Clarity
+ Simplicity
− Cognitive difficulty of all judges conceptualizing the borderline learner in precisely the same way
Ebel method: 2 rounds
Experts independently classify test items by:
I. level of difficulty: easy, medium, hard
II. level of relevance: essential, important, acceptable, questionable
Ebel method
The judges estimate the percentage of items a borderline test taker would get correct for each cell. The percentage for each cell is then multiplied by the number of items in that cell: so if the 'easy/essential' cell has 20 items, 20 × 85 = 1700. These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge. Finally, these are averaged across judges to give a final cut score.
All items can be classified into 12 cells in a 3×4 grid defined by the three difficulty and four relevance categories, as in the example:
Categories              Expert 3             Expert 4             Expert 5
                        A     B     A·B      A     B     A·B      A     B     A·B
Essential / Easy        11    60    660      10    70    700      13    75    975
Essential / Medium      1     25    25       3     25    75       1     0     0
Essential / Hard        0     10    0        1     0     0        0     0     0
Questionable / Easy     0     0     0        0     0     0        0     0     0
Questionable / Medium   0     0     0        0     0     0        0     0     0
Questionable / Hard     0     0     0        0     0     0        0     0     0
Mean                    251                  267                  35
(A = number of items in a category; B = correctly performed items)
Mean for all experts: 28
Cut-score: 12
…
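The per-judge calculation described above (weight each cell's estimated percentage by its item count, sum across cells, divide by the total number of items, then average across judges) can be sketched as below. The cell names and figures are illustrative, not taken from the table:

```python
def ebel_cut_score(judge_grids, total_items):
    """Ebel cut score (as a percentage) averaged across judges.

    judge_grids: one dict per judge, mapping a relevance/difficulty cell
    to (number_of_items, estimated_percent_correct).
    """
    per_judge = []
    for grid in judge_grids:
        # Sum of (items in cell x estimated % correct), e.g. 20 x 85 = 1700
        weighted = sum(n * pct for n, pct in grid.values())
        # Dividing by the total item count yields this judge's cut score
        per_judge.append(weighted / total_items)
    return sum(per_judge) / len(per_judge)
```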
Problems with Ebel
The complex cognitive requirements of classifying items according to two criteria, in relation to an imagined borderline student, may be challenging for the judges.
As it is assumed that some items may have questionable relevance to the construct of interest, the method implicitly throws into doubt the rigor of the test development process and validity arguments.
Nedelsky method (multiple-choice)
The experts estimate which options of each multiple-choice item a borderline test taker would be able to eliminate. In a four-option item with three distractors, if a candidate can eliminate all 3 distractors, the chance of getting the item right is 1 (100%); but if he can only rule out 1 of the distractors, the chance of answering the item correctly is 1 in 3 (33%). These probabilities are averaged across all items for each judge, and then across all judges, to arrive at a cut score.
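Following the description above, a sketch of the computation; the per-item elimination counts in the test are invented, and the averaging mirrors the slide's wording (across items per judge, then across judges):

```python
def nedelsky_cut_score(judge_estimates, n_options=4):
    """Nedelsky cut score for a multiple-choice test.

    judge_estimates: per judge, a list giving the number of distractors a
    borderline candidate could eliminate on each item.
    """
    per_judge = []
    for counts in judge_estimates:
        # Probability of guessing correctly among the remaining options
        probs = [1 / (n_options - k) for k in counts]
        per_judge.append(sum(probs) / len(probs))
    return sum(per_judge) / len(per_judge)
```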
Problems with Nedelsky method
It assumes that test takers answer multiple-choice items by eliminating the options they think are distractors and then guessing randomly among the remaining options. However, it is highly unlikely that test takers answer items in this way.
The Nedelsky method tends to produce lower cut scores than other methods, and is therefore likely to increase the number of false positives.
Bookmark method
Essential materials:
• Directions to Bookmark participants
• Ordered item booklet
• Booklet guideline
• Student exemplar papers
• Scoring guide
Basic steps of the procedure:
• Round I: Experts are informed of the number of cut scores to establish; they work in small groups, and all the essential material is introduced to them.
• Round II: Overview of the cut scores established by every expert; the same procedure as in the first round is repeated.
• Round III: Presentation of the percentage of students falling into each performance level and of each median cut score from Round II; after discussion, individual judgments are made again.
Procedures in Bookmark method
Judges are presented with the necessary materials. They are then asked to keep in mind a borderline student and place a 'bookmark' in the booklet between two items, such that the candidate is more likely to answer the items below the bookmark correctly and the items above it incorrectly. The bookmarks are discussed in the group, and finally the median of the bookmarks for each cut point is taken as that group's recommendation for that cut point.
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard; the test is then administered to the test takers to discover where the cut score should lie.
Borderline group method
The judges define what borderline candidates are like, and then identify borderline candidates who fit the definition. Once the students have been placed into groups, the test can be administered. The median score of the group defined as borderline is used as the cut score.
The main problem: the cut score is dependent upon the group being used in the study.
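The computation itself is simple, as a sketch shows (the scores are invented). The median, rather than the mean, is used because it is robust to a few judges misidentifying borderline candidates:

```python
import statistics

def borderline_group_cut(borderline_scores):
    # The cut score is the median test score of the candidates
    # the judges classified as 'borderline'
    return statistics.median(borderline_scores)
```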
Method of contrasting groups
The procedure involves testing two groups of examinees:
• The classification into groups must be done using independent criteria, such as teacher judgments.
• The test is then given and the score distributions are calculated; there are likely to be overlaps in the distributions.
• The cut score is placed where the overlap in the distributions is observed.
[Figure: overlapping score distributions for the 'competent' and 'non-competent' groups]
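One common way to operationalize "where the distributions overlap" (not the only one, and an assumption rather than Fulcher's own formula) is to choose the cut score that minimizes total misclassifications across the two groups. The example scores are invented:

```python
def contrasting_groups_cut(competent, non_competent):
    """Cut score minimizing misclassifications between two judged groups.

    A misclassification is a 'competent' examinee scoring below the cut,
    or a 'non-competent' examinee scoring at or above it.
    """
    candidates = sorted(set(competent) | set(non_competent))
    return min(
        candidates,
        key=lambda c: sum(s < c for s in competent)
                    + sum(s >= c for s in non_competent),
    )
```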
Which method is the 'best'?
It depends on what kind of judgments you can get for your standard-setting study, and on the quality of the judges you have available. However, the contrasting-groups approach is recommended where possible, because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores. The problem is getting the judgments of a number of people on a large group of individuals.
Evaluating standard setting (Kane 1994)
Procedural evidence:
• What procedures were used for the standard setting, to ensure that the process is systematic?
• Were the judges properly trained in the methodology and allowed to express their views freely?
Internal evidence:
• Deals with the consistency of results arising from the procedure.
• Estimates the extent of agreement between judges (e.g. Cohen's kappa).
External evidence:
• Correlation of scores of learners in a borderline-group study with some other test of the same construct.
• High correlation = the established cut scores are defensible.
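Cohen's kappa, mentioned above as internal evidence, measures agreement between two judges' classifications corrected for chance agreement. A minimal sketch (the pass/fail labels in the example are invented):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges' categorical classifications."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed proportion of agreement
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each judge's marginal proportions
    p_exp = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (p_obs - p_exp) / (1 - p_exp)
```

Kappa is 1 for perfect agreement and 0 when agreement is no better than chance, which is why it is preferred over raw percent agreement for judging consistency.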
Training: a critical part of standard setting
Training activities include familiarization with the PLDs and the test, looking at the scoring keys, making practice judgments, and getting feedback.
Different views may lead to disagreements among the judges. Training should not be designed to eliminate these variations, but to allow free discussion among judges; if the judges do not converge, the outcome should be accepted by the researchers.
The training process should not force agreement ('cloning'), because removing the judges' individuality and inducing agreement is a threat to validity.
The special case of the CEFR
• The CEFR Manual contains performance level descriptors for standard setting, in order to introduce a common language and a single reporting system into Europe.
• It recommends five processes to 'relate' language examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR): familiarization, specification, standardization training/benchmarking, standard-setting, and validation.
• Familiarization, standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe; however, the other two processes are problematic.
PLDs in the CEFR and in other standards-based systems
CEFR:
• The use of PLDs is institutionalized, and their meaning is generalized across nations.
• Standardization facilitates 'the implementation of a common understanding of the CEFR', and training is cloning rather than familiarization.
• Benchmarking = the process of rating individual performance samples using the CEFR PLDs.
• Standard-setting = 'mapping' the existing cut scores from tests onto CEFR levels.
Other standards-based systems:
• PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed.
• Standardization and training ensure that everyone understands the standard-setting method, yet judgments are freely made.
• Benchmarking = the typical performances that are identified after standard-setting.
• Standard-setting = establishing cut scores on tests.
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus, rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices; they are open to change, and even rejection, in the service of language education.
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems, with the intention of imposing a single acceptable teaching and assessment discourse upon professionals.
Thank You For Your Attention
Performance level descriptors (PLDs) amp Test scores
PLDs are often developed based on intuitive and experiential method amp the labels and descriptors are simple reflections of the values of policy makers
There are around four levels lsquoadvanced ndash proficient ndash basic ndash below basicrsquo
The PLDs provide a conceptual hierarchy of performance that is an indication of the ability or knowledge of the test taker
Standard-setting is the process of deciding on a cut score for a test to mark the boundary between two PLDs If we have two performance levels (pass and fail) wersquoll need a single cut score
Standard based tests CRT amp scoring rubrics
It is said that tests used in standards-based testing are criterion- referenced yet for Glaser the criterion was the domain and it does not have anything to do with standard setting and classification
The standards-based testing movement has interpreted lsquocriterionrsquo to mean lsquostandardrsquo
The focus within PLDs is on the general levels of competence proficiency or performance while scoring rubrics address only single items
Some initial decisions All standard setting methods involve expert judgemental decision making at some level (Jaegar 1979)
Decision 1 Compensatory or non-compensatory marking The strength in other areas lsquocompensatesrsquo for the weakness in one area
Decision 2 What classification errors can you tolerate
Decision 3 Are you going to allow test takers who lsquofailrsquo a test to retake it If so what time lapse is required to retake the test
Second Page Lorem ipsum dolor sit amet consectetur adipisicing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua Ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur Excepteur sint occaecat cupidatat non proident sunt in culpa qui officia deserunt mollit anim id est laborum
Cycle Diagram
Test-centered
Criterion-referenced
Norm-referenced
Examinee-centered
Standard-Setting Methods
Classification of
Standard-setting methodologies
Test-centered
bull Angoffbull Ebelbull Nedelskybull Bookmark
Examinee-centered
bull Method of Contrasting Groups
bull Method of Borderline group
Common process of standard setting
Select an appropriate standard setting method depending upon the purpose of the standard setting available data and personnel
Select a panel of judges based upon explicit criteria Prepare the PLDs and other materials as appropriate Train the judges to use the method select Rate items or persons collect and store data Provide feedback on rating and initiate discussion for judges to
explain their ratings listen to others and revise their views or decisions before another round of judging
Collect final ratings and establish cut scores Ask the judges to evaluate the process Document the process in order to justify the conclusions reached
Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards
Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly
The average of these probabilities across judges or raters is the cut score
If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)
Advantages amp disadvantages
Clarity
Simplicity
Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way
+ -
Ebel method 2 Rounds Experts classify independently test items by
I level of difficulty
II level of relevance
easy medium hard
essential important acceptable questionabl
e
Ebel method The judges estimate the percentage of items a borderline
test taker would get correct for each cell Then the percentage for each cell is multiplied by the
number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700
These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge
Finally these are averaged across judges to give a final cut score
All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example
categories Expert 3 Expert 4 Expert 5 Number of items
in a category
(А)
correctly performed
items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
EssentialEasy 11 60 660 10 70 700 13 75 975
Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0
Questionable
Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0
Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35
Mean for all experts 28
Cut-score 12
hellip
Problems with EBELThe complex cognitive requirements of classifying items
according to two criteria in relation to an imagined borderline
student may be challenging for the judges
As it is assumed that some items may have questionable
relevance to the construct of interest it implicitly throws into
doubt the rigor of the test development process and validity
arguments
Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline
test taker would be able to eliminate
In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )
These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score
Problems with Nedelsky method
It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way
Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives
Bookmark method
Directions to Bookmark participants
Ordered item booklet
Booklet guideline
Student exemplar papers
Scoring guide
Essential materials
Standard Setting
Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments
Overview of established cut-scores by every expert repeating of the same procedure as
in the first step
Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is
introduced to them
Basic steps of the procedure
Round III
Round II
Round I
Procedures in Bookmark method
Judges are presented with the necessary materials Then they are asked to keep in mind a borderline
student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly
The bookmarks are discussed in group and finally
the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie
Borderline group method The judges define what borderline candidates are
like and then identify borderline candidates who fit the definition
Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score
The main problem the cut score is dependent upon the group being used in the study
Method of contrasting groupsProcedure includes testing of two groups of examinees
bullThe classification must be done using independent criteria such as teacher judgments
bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions
bull The cut score will be where overlap is observed in the distributions
Competent Non-competent
Which method is the lsquobestrsquo
It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available
However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores
The problem is getting the judgments of a number of people on a large group of individuals
Evaluating standard setting (Kane 1994)
Procedural evidence
bull What procedures were used for the standard-setting to ensure that the process is systematic
bull Were the judges properly trained in the methodology and allowed to express their views freely
Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )
External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible
Training a critical part of standard setting
Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback
Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers
The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity
The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard
setting in order to introduce a common language and a single reporting system into Europe
bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation
bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic
PLDs in CEFR amp in other standard-based systems
The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations
Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization
Benchmarking = the process of rating individual performance samples using the CEFR PLDs
Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels
PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed
Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made
Benchmarking = the typical performances that are identified after standard-setting
Standard-setting = establishing cut scores on tests
CEFR Other standard-based systems
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals
Thank You For Your Attention
- Glen Fulcher (2010) Practical Language Testing Chapter 5 A
- Content of this chapter
- Itrsquos as old as the hills
- Definition of lsquostandardrsquo
- The uses of standards
- Unintended consequences
- Using standards for harmonization amp identity
- Carolingian empire of Charlemagne
- CEFR (Common European Framework of Reference )
- Problems with CEFR
- How many standards can we afford
- Performance level descriptors (PLDs) amp Test scores
- Standard based tests CRT amp scoring rubrics
- Some initial decisions
- Slide 15
- Standard-setting methodologies
- Common process of standard setting
- Test-centered methods
- Angoff method
- Advantages amp disadvantages
- Ebel method
- Ebel method (2)
- Slide 23
- Problems with EBEL
- Nedelsky method (Multiple-choice)
- Problems with Nedelsky method
- Bookmark method
- Slide 28
- Procedures in Bookmark method
- Examinee-centered methods
- Borderline group method
- Method of contrasting groups
- Slide 33
- Which method is the lsquobestrsquo
- Evaluating standard setting (Kane 1994)
- Training a critical part of standard setting
- The special case of the CEFR
- PLDs in CEFR amp in other standard-based systems
- You can always count on uncertainty
- Slide 40
-
Standard based tests CRT amp scoring rubrics
It is said that tests used in standards-based testing are criterion- referenced yet for Glaser the criterion was the domain and it does not have anything to do with standard setting and classification
The standards-based testing movement has interpreted lsquocriterionrsquo to mean lsquostandardrsquo
The focus within PLDs is on the general levels of competence proficiency or performance while scoring rubrics address only single items
Some initial decisions All standard setting methods involve expert judgemental decision making at some level (Jaegar 1979)
Decision 1 Compensatory or non-compensatory marking The strength in other areas lsquocompensatesrsquo for the weakness in one area
Decision 2 What classification errors can you tolerate
Decision 3 Are you going to allow test takers who lsquofailrsquo a test to retake it If so what time lapse is required to retake the test
Second Page Lorem ipsum dolor sit amet consectetur adipisicing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua Ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur Excepteur sint occaecat cupidatat non proident sunt in culpa qui officia deserunt mollit anim id est laborum
Cycle Diagram
Test-centered
Criterion-referenced
Norm-referenced
Examinee-centered
Standard-Setting Methods
Classification of
Standard-setting methodologies
Test-centered
bull Angoffbull Ebelbull Nedelskybull Bookmark
Examinee-centered
bull Method of Contrasting Groups
bull Method of Borderline group
Common process of standard setting
Select an appropriate standard setting method depending upon the purpose of the standard setting available data and personnel
Select a panel of judges based upon explicit criteria Prepare the PLDs and other materials as appropriate Train the judges to use the method select Rate items or persons collect and store data Provide feedback on rating and initiate discussion for judges to
explain their ratings listen to others and revise their views or decisions before another round of judging
Collect final ratings and establish cut scores Ask the judges to evaluate the process Document the process in order to justify the conclusions reached
Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards
Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly
The average of these probabilities across judges or raters is the cut score
If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)
Advantages amp disadvantages
Clarity
Simplicity
Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way
+ -
Ebel method 2 Rounds Experts classify independently test items by
I level of difficulty
II level of relevance
easy medium hard
essential important acceptable questionabl
e
Ebel method The judges estimate the percentage of items a borderline
test taker would get correct for each cell Then the percentage for each cell is multiplied by the
number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700
These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge
Finally these are averaged across judges to give a final cut score
All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example
categories Expert 3 Expert 4 Expert 5 Number of items
in a category
(А)
correctly performed
items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
EssentialEasy 11 60 660 10 70 700 13 75 975
Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0
Questionable
Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0
Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35
Mean for all experts 28
Cut-score 12
hellip
Problems with EBELThe complex cognitive requirements of classifying items
according to two criteria in relation to an imagined borderline
student may be challenging for the judges
As it is assumed that some items may have questionable
relevance to the construct of interest it implicitly throws into
doubt the rigor of the test development process and validity
arguments
Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline
test taker would be able to eliminate
In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )
These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score
Problems with Nedelsky method
It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way
Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives
Bookmark method
Directions to Bookmark participants
Ordered item booklet
Booklet guideline
Student exemplar papers
Scoring guide
Essential materials
Standard Setting
Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments
Overview of established cut-scores by every expert repeating of the same procedure as
in the first step
Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is
introduced to them
Basic steps of the procedure
Round III
Round II
Round I
Procedures in Bookmark method
Judges are presented with the necessary materials Then they are asked to keep in mind a borderline
student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly
The bookmarks are discussed in group and finally
the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie
Borderline group method The judges define what borderline candidates are
like and then identify borderline candidates who fit the definition
Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score
The main problem the cut score is dependent upon the group being used in the study
Method of contrasting groups
The procedure involves testing two groups of examinees, one judged competent and one judged non-competent.
• The classification must be made using independent criteria, such as teacher judgments.
• The test is then given and the score distributions are calculated; the distributions are likely to overlap.
• The cut score is placed where the overlap is observed in the distributions.
[Figure: overlapping score distributions for the 'competent' and 'non-competent' groups]
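One common way to operationalize 'where the overlap is observed' is to choose the cut that minimizes total decision errors; a sketch under that assumption, with invented scores (the group labels are taken to come from the independent criteria, e.g. teacher judgments):

```python
# Hypothetical scores for the two independently classified groups.
competent = [55, 60, 48, 62, 58, 50, 65, 57]
non_competent = [35, 42, 47, 38, 52, 44, 40, 49]

def errors(cut):
    """False negatives (competent scoring below the cut) plus
    false positives (non-competent scoring at or above it)."""
    fn = sum(s < cut for s in competent)
    fp = sum(s >= cut for s in non_competent)
    return fn + fp

# Try every observed score as a candidate cut; keep the one with the
# fewest decision errors.
cut = min(sorted(set(competent + non_competent)), key=errors)
# -> 48 with these data (0 false negatives, 2 false positives)
```

Because every candidate cut comes with explicit false-positive and false-negative counts, this is the method that lets you quantify likely decision errors.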
Which method is the 'best'?
It depends on what kind of judgments you can get for your standard-setting study and on the quality of the judges you have available. However, the contrasting-groups approach is recommended where possible, because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores. The practical problem is obtaining the judgments of a number of people on a large group of individuals.
Evaluating standard setting (Kane, 1994)
Procedural evidence:
• What procedures were used for the standard setting to ensure that the process was systematic?
• Were the judges properly trained in the methodology and allowed to express their views freely?
Internal evidence:
• Deals with the consistency of the results arising from the procedure.
• Also estimates the extent of agreement between judges (e.g. Cohen's kappa).
External evidence:
• Correlation of the scores of learners in a borderline-group study with some other test of the same construct.
• A high correlation suggests that the established cut scores are defensible.
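Cohen's kappa can be computed directly from two judges' classifications; a sketch with invented pass (1) / fail (0) decisions for ten candidates:

```python
# Hypothetical classifications of the same ten candidates by two judges.
judge_a = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
judge_b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

def cohens_kappa(a, b):
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    p_e = sum((a.count(c) / n) * (b.count(c) / n)    # chance agreement from
              for c in set(a) | set(b))              # the judges' marginals
    return (p_o - p_e) / (1 - p_e)

kappa = cohens_kappa(judge_a, judge_b)  # ≈ 0.58 here
```

Kappa near 1 indicates strong agreement beyond chance; values near 0 mean the judges agree no more often than random classification would.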
Training: a critical part of standard setting
Training activities include familiarization with the PLDs and the test, looking at the scoring keys, making practice judgments, and getting feedback.
Different views may lead to disagreements among the judges. Training should not be designed to eliminate these variations but to allow free discussion among judges; if the judges do not converge, the outcome should be accepted by the researchers. The training process should not force agreement ('cloning'), because removing the judges' individuality and inducing agreement is a threat to validity.
The special case of the CEFR
• The CEFR Manual contains performance level descriptors for standard setting, intended to introduce a common language and a single reporting system into Europe.
• It recommends five processes to 'relate' language examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR): familiarization, specification, standardization training/benchmarking, standard setting, and validation.
• Familiarization, standard setting, and validation are uncontentious, because they reflect common international assessment practice that is not unique to Europe; the other two processes, however, are problematic.
PLDs in the CEFR & in other standards-based systems

CEFR:
• The use of PLDs is institutionalized, and their meaning is generalized across nations.
• Standardization facilitates 'the implementation of a common understanding' of the CEFR, and training is cloning rather than familiarization.
• Benchmarking = the process of rating individual performance samples using the CEFR PLDs.
• Standard setting = 'mapping' the existing cut scores from tests onto CEFR levels.

Other standards-based systems:
• PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed.
• Standardization and training ensure that everyone understands the standard-setting method, yet judgments are freely made.
• Benchmarking = the typical performances that are identified after standard setting.
• Standard setting = establishing cut scores on tests.
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices; they are open to change, and even rejection, in the service of language education.
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems, with the intention of imposing a single acceptable teaching and assessment discourse upon professionals.
Thank You For Your Attention
- Glen Fulcher (2010) Practical Language Testing, Chapter 5: Aligning Tests to Standards
- Content of this chapter
- It's as old as the hills
- Definition of 'standard'
- The uses of standards
- Unintended consequences
- Using standards for harmonization & identity
- Carolingian empire of Charlemagne
- CEFR (Common European Framework of Reference)
- Problems with CEFR
- How many standards can we afford?
- Performance level descriptors (PLDs) & test scores
- Standards-based tests, CRT & scoring rubrics
- Some initial decisions
- Standard-setting methodologies
- Common process of standard setting
- Test-centered methods
- Angoff method
- Advantages & disadvantages
- Ebel method
- Problems with Ebel
- Nedelsky method (multiple-choice)
- Problems with the Nedelsky method
- Bookmark method
- Procedures in the Bookmark method
- Examinee-centered methods
- Borderline group method
- Method of contrasting groups
- Which method is the 'best'?
- Evaluating standard setting (Kane, 1994)
- Training: a critical part of standard setting
- The special case of the CEFR
- PLDs in CEFR & in other standards-based systems
- You can always count on uncertainty
Some initial decisions
All standard-setting methods involve expert judgmental decision making at some level (Jaeger, 1979).
Decision 1: Compensatory or non-compensatory marking? In compensatory marking, strength in other areas 'compensates' for weakness in one area.
Decision 2: What classification errors can you tolerate?
Decision 3: Are you going to allow test takers who 'fail' a test to retake it? If so, what time lapse is required before the retake?
[Diagram: classification of standard-setting methods — test-centered vs examinee-centered, criterion-referenced vs norm-referenced]
Standard-setting methodologies
Test-centered:
• Angoff
• Ebel
• Nedelsky
• Bookmark
Examinee-centered:
• Method of contrasting groups
• Borderline group method
Common process of standard setting
• Select an appropriate standard-setting method, depending upon the purpose of the standard setting, the available data, and the personnel.
• Select a panel of judges based upon explicit criteria.
• Prepare the PLDs and other materials as appropriate.
• Train the judges to use the method selected.
• Rate items or persons; collect and store the data.
• Provide feedback on the ratings and initiate discussion, so that judges can explain their ratings, listen to others, and revise their views or decisions before another round of judging.
• Collect final ratings and establish cut scores.
• Ask the judges to evaluate the process.
• Document the process in order to justify the conclusions reached.
Test-centered methods
The judges are presented with individual items or tasks and are required to make a decision about the expected performance on them by a test taker who is just below the border between two standards.

Angoff method
Experts are given a set of items and rate the probability that a hypothetical borderline learner would answer each test item correctly. The average of these probabilities across judges or raters gives the cut score. If the test contains polytomous items or tasks, the proportion of the maximum score is used instead of the probability (the modified Angoff method).
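A sketch of the calculation with invented ratings (each number is one judge's probability that a borderline candidate answers that item correctly):

```python
# Rows: judges; columns: items. All values are hypothetical.
ratings = [
    [0.8, 0.6, 0.4, 0.7, 0.5],  # judge 1
    [0.7, 0.5, 0.5, 0.6, 0.6],  # judge 2
    [0.9, 0.6, 0.3, 0.7, 0.4],  # judge 3
]

# Each judge's sum is the raw score they expect of a borderline candidate;
# the cut score is the average of these sums across judges.
per_judge = [sum(row) for row in ratings]
cut_score = sum(per_judge) / len(per_judge)  # ≈ 2.93 out of 5 items
```

Summing over items first turns each judge's probabilities into an expected raw score, so the averaged result is already on the test's score scale.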
Advantages & disadvantages
+ Clarity and simplicity.
− Cognitive difficulty: all judges must conceptualize the borderline learner in precisely the same way.
Ebel method
Two rounds. Experts independently classify test items by:
I. level of difficulty: easy, medium, hard
II. level of relevance: essential, important, acceptable, questionable
The judges estimate the percentage of items a borderline test taker would get correct for each cell. The percentage for each cell is then multiplied by the number of items in it; so, if the 'easy/essential' cell has 20 items, 20 × 85 = 1,700. These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge. Finally, these are averaged across judges to give the final cut score.
All items can be classified into 12 cells in a 3 × 4 grid defined by the three difficulty and four relevance categories, as in the example below (A = number of items in a category; B = % of correctly performed items):

Category                Expert 3           Expert 4           Expert 5
                        A    B    A·B     A    B    A·B     A    B    A·B
Essential / Easy        11   60   660     10   70   700     13   75   975
Essential / Medium      1    25   25      3    25   75      1    0    0
Essential / Hard        0    10   0       1    0    0       0    0    0
Questionable / Easy     0    0    0       0    0    0       0    0    0
Questionable / Medium   0    0    0       0    0    0       0    0    0
Questionable / Hard     0    0    0       0    0    0       0    0    0
Mean                         25.1              26.7              35

Mean for all experts: 28
Cut-score: 12
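The per-judge arithmetic described above can be sketched as follows. The cell values are hypothetical (only the 20 × 85 cell comes from the text), and in a real study all 12 cells would be filled in:

```python
# Hypothetical judge data: {(relevance, difficulty): (items A, % correct B)}.
cells = {
    ("essential", "easy"):   (20, 85),   # the slide's example: 20 × 85 = 1,700
    ("essential", "medium"): (10, 60),
    ("essential", "hard"):   (5, 30),
    ("important", "easy"):   (8, 80),
}

total_items = sum(a for a, _ in cells.values())   # 43 items here
weighted = sum(a * b for a, b in cells.values())  # 1700 + 600 + 150 + 640 = 3090
judge_cut_percent = weighted / total_items        # one judge's cut, as a percentage
# The final cut score is the average of these values across all judges.
```

The weighted sum divided by the item count is just the item-weighted mean of the judges' percentage estimates.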
Problems with Ebel
The complex cognitive requirement of classifying items according to two criteria, in relation to an imagined borderline student, may be challenging for the judges. And since it is assumed that some items may have questionable relevance to the construct of interest, the method implicitly throws into doubt the rigor of the test-development process and its validity arguments.
Nedelsky method (multiple-choice)
The experts estimate which options of each multiple-choice item a borderline test taker would be able to eliminate. In a four-option item with three distractors, if a candidate can eliminate all 3 distractors, the chance of getting the item right is 1 (100%); but if he can rule out only 1 of the distractors, the chance of answering the item correctly is 1 in 3 (33%). These probabilities are averaged across all items for each judge, and then across all judges, to arrive at a cut score.
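The item probabilities follow directly from the elimination counts. A sketch with hypothetical judgments for five four-option items; note that where the slide averages the probabilities, summing them instead gives the expected raw score on the whole test, and either can serve as the basis of the cut score:

```python
# For each item, the judge's estimate of how many of the three distractors
# a borderline candidate could eliminate (hypothetical values).
options = 4
eliminated = [3, 1, 2, 0, 3]

# Probability of a correct random guess among the remaining options.
item_probs = [1 / (options - e) for e in eliminated]
# -> [1.0, 0.333..., 0.5, 0.25, 1.0]

# Summing gives one judge's expected raw score for the borderline candidate;
# these are then averaged across judges to set the cut score.
judge_expected_score = sum(item_probs)
```

The guessing assumption is visible in the formula itself: every surviving option is treated as equally likely, which is exactly the behavior the criticism above targets.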
Problems with Nedelsky method
It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way
Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives
Bookmark method
Directions to Bookmark participants
Ordered item booklet
Booklet guideline
Student exemplar papers
Scoring guide
Essential materials
Standard Setting
Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments
Overview of established cut-scores by every expert repeating of the same procedure as
in the first step
Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is
introduced to them
Basic steps of the procedure
Round III
Round II
Round I
Procedures in Bookmark method
Judges are presented with the necessary materials Then they are asked to keep in mind a borderline
student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly
The bookmarks are discussed in group and finally
the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie
Borderline group method The judges define what borderline candidates are
like and then identify borderline candidates who fit the definition
Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score
The main problem the cut score is dependent upon the group being used in the study
Method of contrasting groupsProcedure includes testing of two groups of examinees
bullThe classification must be done using independent criteria such as teacher judgments
bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions
bull The cut score will be where overlap is observed in the distributions
Competent Non-competent
Which method is the lsquobestrsquo
It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available
However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores
The problem is getting the judgments of a number of people on a large group of individuals
Evaluating standard setting (Kane 1994)
Procedural evidence
bull What procedures were used for the standard-setting to ensure that the process is systematic
bull Were the judges properly trained in the methodology and allowed to express their views freely
Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )
External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible
Training a critical part of standard setting
Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback
Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers
The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity
The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard
setting in order to introduce a common language and a single reporting system into Europe
bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation
bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic
PLDs in CEFR amp in other standard-based systems
The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations
Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization
Benchmarking = the process of rating individual performance samples using the CEFR PLDs
Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels
PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed
Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made
Benchmarking = the typical performances that are identified after standard-setting
Standard-setting = establishing cut scores on tests
CEFR Other standard-based systems
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals
Thank You For Your Attention
- Glen Fulcher (2010) Practical Language Testing Chapter 5 A
- Content of this chapter
- Itrsquos as old as the hills
- Definition of lsquostandardrsquo
- The uses of standards
- Unintended consequences
- Using standards for harmonization amp identity
- Carolingian empire of Charlemagne
- CEFR (Common European Framework of Reference )
- Problems with CEFR
- How many standards can we afford
- Performance level descriptors (PLDs) amp Test scores
- Standard based tests CRT amp scoring rubrics
- Some initial decisions
- Slide 15
- Standard-setting methodologies
- Common process of standard setting
- Test-centered methods
- Angoff method
- Advantages amp disadvantages
- Ebel method
- Ebel method (2)
- Slide 23
- Problems with EBEL
- Nedelsky method (Multiple-choice)
- Problems with Nedelsky method
- Bookmark method
- Slide 28
- Procedures in Bookmark method
- Examinee-centered methods
- Borderline group method
- Method of contrasting groups
- Slide 33
- Which method is the lsquobestrsquo
- Evaluating standard setting (Kane 1994)
- Training a critical part of standard setting
- The special case of the CEFR
- PLDs in CEFR amp in other standard-based systems
- You can always count on uncertainty
- Slide 40
-
Second Page Lorem ipsum dolor sit amet consectetur adipisicing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua Ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur Excepteur sint occaecat cupidatat non proident sunt in culpa qui officia deserunt mollit anim id est laborum
Cycle Diagram
Test-centered
Criterion-referenced
Norm-referenced
Examinee-centered
Standard-Setting Methods
Classification of
Standard-setting methodologies
Test-centered
bull Angoffbull Ebelbull Nedelskybull Bookmark
Examinee-centered
bull Method of Contrasting Groups
bull Method of Borderline group
Common process of standard setting
Select an appropriate standard setting method depending upon the purpose of the standard setting available data and personnel
Select a panel of judges based upon explicit criteria Prepare the PLDs and other materials as appropriate Train the judges to use the method select Rate items or persons collect and store data Provide feedback on rating and initiate discussion for judges to
explain their ratings listen to others and revise their views or decisions before another round of judging
Collect final ratings and establish cut scores Ask the judges to evaluate the process Document the process in order to justify the conclusions reached
Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards
Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly
The average of these probabilities across judges or raters is the cut score
If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)
Advantages amp disadvantages
Clarity
Simplicity
Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way
+ -
Ebel method 2 Rounds Experts classify independently test items by
I level of difficulty
II level of relevance
easy medium hard
essential important acceptable questionabl
e
Ebel method The judges estimate the percentage of items a borderline
test taker would get correct for each cell Then the percentage for each cell is multiplied by the
number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700
These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge
Finally these are averaged across judges to give a final cut score
All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example
categories Expert 3 Expert 4 Expert 5 Number of items
in a category
(А)
correctly performed
items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
EssentialEasy 11 60 660 10 70 700 13 75 975
Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0
Questionable
Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0
Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35
Mean for all experts 28
Cut-score 12
hellip
Problems with EBELThe complex cognitive requirements of classifying items
according to two criteria in relation to an imagined borderline
student may be challenging for the judges
As it is assumed that some items may have questionable
relevance to the construct of interest it implicitly throws into
doubt the rigor of the test development process and validity
arguments
Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline
test taker would be able to eliminate
In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )
These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score
Problems with Nedelsky method
It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way
Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives
Bookmark method
Directions to Bookmark participants
Ordered item booklet
Booklet guideline
Student exemplar papers
Scoring guide
Essential materials
Standard Setting
Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments
Overview of established cut-scores by every expert repeating of the same procedure as
in the first step
Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is
introduced to them
Basic steps of the procedure
Round III
Round II
Round I
Procedures in Bookmark method
Judges are presented with the necessary materials Then they are asked to keep in mind a borderline
student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly
The bookmarks are discussed in group and finally
the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie
Borderline group method The judges define what borderline candidates are
like and then identify borderline candidates who fit the definition
Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score
The main problem the cut score is dependent upon the group being used in the study
Method of contrasting groupsProcedure includes testing of two groups of examinees
bullThe classification must be done using independent criteria such as teacher judgments
bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions
bull The cut score will be where overlap is observed in the distributions
Competent Non-competent
Which method is the lsquobestrsquo
It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available
However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores
The problem is getting the judgments of a number of people on a large group of individuals
Evaluating standard setting (Kane 1994)
Procedural evidence
bull What procedures were used for the standard-setting to ensure that the process is systematic
bull Were the judges properly trained in the methodology and allowed to express their views freely
Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )
External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible
Training a critical part of standard setting
Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback
Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers
The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity
The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard
setting in order to introduce a common language and a single reporting system into Europe
bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation
bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic
PLDs in CEFR amp in other standard-based systems
The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations
Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization
Benchmarking = the process of rating individual performance samples using the CEFR PLDs
Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels
PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed
Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made
Benchmarking = the typical performances that are identified after standard-setting
Standard-setting = establishing cut scores on tests
CEFR Other standard-based systems
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals
Thank You For Your Attention
- Glen Fulcher (2010) Practical Language Testing Chapter 5 A
- Content of this chapter
- Itrsquos as old as the hills
- Definition of lsquostandardrsquo
- The uses of standards
- Unintended consequences
- Using standards for harmonization amp identity
- Carolingian empire of Charlemagne
- CEFR (Common European Framework of Reference )
- Problems with CEFR
- How many standards can we afford
- Performance level descriptors (PLDs) amp Test scores
- Standard based tests CRT amp scoring rubrics
- Some initial decisions
- Slide 15
- Standard-setting methodologies
- Common process of standard setting
- Test-centered methods
- Angoff method
- Advantages amp disadvantages
- Ebel method
- Ebel method (2)
- Slide 23
- Problems with EBEL
- Nedelsky method (Multiple-choice)
- Problems with Nedelsky method
- Bookmark method
- Slide 28
- Procedures in Bookmark method
- Examinee-centered methods
- Borderline group method
- Method of contrasting groups
- Slide 33
- Which method is the lsquobestrsquo
- Evaluating standard setting (Kane 1994)
- Training a critical part of standard setting
- The special case of the CEFR
- PLDs in CEFR amp in other standard-based systems
- You can always count on uncertainty
- Slide 40
-
Standard-setting methodologies
Test-centered
bull Angoffbull Ebelbull Nedelskybull Bookmark
Examinee-centered
bull Method of Contrasting Groups
bull Method of Borderline group
Common process of standard setting
Select an appropriate standard setting method depending upon the purpose of the standard setting available data and personnel
Select a panel of judges based upon explicit criteria Prepare the PLDs and other materials as appropriate Train the judges to use the method select Rate items or persons collect and store data Provide feedback on rating and initiate discussion for judges to
explain their ratings listen to others and revise their views or decisions before another round of judging
Collect final ratings and establish cut scores Ask the judges to evaluate the process Document the process in order to justify the conclusions reached
Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards
Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly
The average of these probabilities across judges or raters is the cut score
If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)
Advantages amp disadvantages
Clarity
Simplicity
Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way
+ -
Ebel method 2 Rounds Experts classify independently test items by
I level of difficulty
II level of relevance
easy medium hard
essential important acceptable questionabl
e
Ebel method The judges estimate the percentage of items a borderline
test taker would get correct for each cell Then the percentage for each cell is multiplied by the
number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700
These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge
Finally these are averaged across judges to give a final cut score
All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example
Categories             | Expert 3           | Expert 4           | Expert 5
                       | A    B(%)   A×B    | A    B(%)   A×B    | A    B(%)   A×B
Essential / Easy       | 11   60     660    | 10   70     700    | 13   75     975
Essential / Medium     | 1    25     25     | 3    25     75     | 1    0      0
Essential / Hard       | 0    10     0      | 1    0      0      | 0    0      0
Questionable / Easy    | 0    0      0      | 0    0      0      | 0    0      0
Questionable / Medium  | 0    0      0      | 0    0      0      | 0    0      0
Questionable / Hard    | 0    0      0      | 0    0      0      | 0    0      0
Mean                   | 25.1               | 26.7               | 35

A = number of items in a category; B = % of items correctly performed.
Mean for all experts: 28
Cut-score: 12
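A sketch of one judge's computation (cell values are illustrative; note that the partial table above lists only some of the 12 cells, so its means are not reproduced here):

```python
# Ebel method sketch: for each difficulty-by-relevance cell, a judge
# supplies the number of items in the cell (A) and the percentage a
# borderline test taker would get correct (B). The judge's cut score,
# as a percentage, is sum(A * B) / total number of items.
# Cell values are invented for illustration.

cells = {
    ("essential", "easy"):   (11, 60),  # (A items, B percent correct)
    ("essential", "medium"): (1, 25),
    ("essential", "hard"):   (0, 10),
}

total_items = sum(a for a, _ in cells.values())
weighted = sum(a * b for a, b in cells.values())
judge_cut_percent = weighted / total_items  # this judge's cut score in %

print(round(judge_cut_percent, 1))
```

In a full study, these per-judge percentages would then be averaged across the panel, as the slide describes.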
Problems with Ebel:
The complex cognitive requirements of classifying items according to two criteria in relation to an imagined borderline student may be challenging for the judges.
As it is assumed that some items may have questionable relevance to the construct of interest, it implicitly throws into doubt the rigor of the test development process and validity arguments.
Nedelsky method (multiple-choice): The experts estimate which options of each multiple-choice item a borderline test taker would be able to eliminate.
In a four-option item with three distractors, if a candidate can eliminate all 3 distractors, the chance of getting the item right is 1 (100%); but if he can only rule out 1 of the distractors, the chance of answering the item correctly is 1 in 3 (33%).
These probabilities are averaged across all items for each judge, and then across all judges, to arrive at a cut score.
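The per-item arithmetic can be sketched as follows (the eliminations below are invented, and only a single judge is shown; per-judge values would then be averaged across the panel as the slide states):

```python
# Nedelsky method sketch: for each multiple-choice item, a judge counts
# the distractors a borderline test taker could eliminate. The chance of
# answering correctly is then 1 / (number of remaining options).
# Values are illustrative, for four-option items (3 distractors each).

OPTIONS = 4
eliminated = [3, 1, 2, 0, 3]  # distractors eliminated per item, one judge

probs = [1 / (OPTIONS - e) for e in eliminated]
judge_value = sum(probs) / len(probs)  # averaged across items, per the slide

print([round(p, 2) for p in probs])  # per-item chance of a correct answer
print(round(judge_value, 2))
```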
Problems with Nedelsky method:
It assumes that test takers answer multiple-choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options. However, it is highly unlikely that test takers answer items in this way.
The Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives.
Bookmark method

Essential materials:
• Directions to Bookmark participants
• Ordered item booklet
• Booklet guideline
• Student exemplar papers
• Scoring guide

Basic steps of the procedure:
Round I: Experts are informed about the essential number of cut-scores to establish. Experts work in small groups, and all the essential material is introduced to them.
Round II: Overview of the cut-scores established by every expert; the same procedure as in the first step is repeated.
Round III: Presentation of the percentage of students falling into each performance level and of each median cut-score from Round 2. After discussion, individual judgments are made.
Procedures in Bookmark method:
Judges are presented with the necessary materials. Then they are asked to keep in mind a borderline student and place a 'bookmark' in the booklet between two items, such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly.
The bookmarks are discussed in groups, and finally the median of the bookmarks for each cut point is taken as that group's recommendation for that cut-point.
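The final aggregation step can be sketched as follows (bookmark placements are invented; items are assumed to be ordered easiest to hardest in the booklet):

```python
# Bookmark method sketch: each judge places a bookmark at a position in
# the ordered item booklet; the group's recommendation for a cut point
# is the median of the bookmark positions. Placements are illustrative.
from statistics import median

bookmarks = [23, 25, 24, 27, 25]  # item index chosen by each judge
group_recommendation = median(bookmarks)
print(group_recommendation)  # prints 25
```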
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard; the test is then administered to the test takers to discover where the cut score should lie.
Borderline group method: The judges define what borderline candidates are like and then identify borderline candidates who fit the definition.
Once the students have been placed into groups, the test can be administered. The median score for the group defined as borderline is used as the cut score.
The main problem: the cut score is dependent upon the group being used in the study.
Method of contrasting groups: The procedure includes testing two groups of examinees.
• The classification must be done using independent criteria, such as teacher judgments.
• The test is then given, and the score distributions are calculated. There are likely to be overlaps in the distributions.
• The cut score is placed in the region where the two distributions overlap.
[Figure: overlapping score distributions for the competent and non-competent groups]
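A minimal sketch of how a cut score and its likely decision errors could be derived from two such groups (the scores and the brute-force search are illustrative assumptions, not the chapter's procedure):

```python
# Contrasting-groups sketch: scores from a group independently judged
# competent and a group judged non-competent (e.g. by teachers). Each
# candidate cut score is tried, and the one minimizing misclassifications
# is kept; the same counts give the likely false negatives/positives.
# All scores are invented for illustration.

competent     = [62, 70, 75, 58, 66, 80, 73, 69]
non_competent = [45, 52, 60, 40, 55, 48, 63, 50]

def errors(cut):
    false_neg = sum(s < cut for s in competent)       # competent but fails
    false_pos = sum(s >= cut for s in non_competent)  # non-competent but passes
    return false_neg + false_pos

best_cut = min(range(40, 81), key=errors)
print(best_cut, errors(best_cut))
```

This is the property highlighted later in the chapter: because both group memberships and scores are known, the false-positive and false-negative counts for any proposed cut score can be estimated directly.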
Which method is the 'best'?
It depends on what kind of judgments you can get for your standard-setting study and on the quality of the judges that you have available.
However, using the contrasting-groups approach is recommended where possible, because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores.
The problem is getting the judgments of a number of people on a large group of individuals.
Evaluating standard setting (Kane, 1994)
Procedural evidence:
• What procedures were used for the standard setting to ensure that the process is systematic?
• Were the judges properly trained in the methodology and allowed to express their views freely?
Internal evidence:
• Deals with the consistency of results arising from the procedure.
• It also estimates the extent of agreement between judges (e.g. Cohen's kappa).
External evidence:
• Correlation of scores of learners in a borderline-group study with some other test of the same construct.
• High correlation = the established cut scores are defensible.
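As a minimal sketch of the internal-evidence statistic, Cohen's kappa for two judges' pass/fail classifications can be computed like this (the ratings are invented):

```python
# Cohen's kappa sketch: chance-corrected agreement between two judges
# who each classified the same candidates as "pass" or "fail".
# kappa = (observed agreement - chance agreement) / (1 - chance agreement)
# Assumes the judges are not in perfect chance agreement (denominator > 0).

def cohens_kappa(r1, r2):
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    labels = set(r1) | set(r2)
    expected = sum((r1.count(l) / n) * (r2.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

judge1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail"]
judge2 = ["pass", "fail", "fail", "pass", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(judge1, judge2), 2))  # prints 0.5
```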
Training: a critical part of standard setting
Training activities include familiarization with the PLDs and the test, looking at the scoring keys, making practice judgments, and getting feedback.
Different views may lead to disagreements among the judges. Training should not be designed to eliminate these variations but to allow free discussion among judges. If the judges do not converge, the outcome should be accepted by the researchers.
The training process should not force agreement ('cloning'), because removing the judges' individuality and inducing agreement is a threat to validity.
The special case of the CEFR
• The CEFR Manual contains performance level descriptors for standard setting, in order to introduce a common language and a single reporting system into Europe.
• It recommends five processes to 'relate' language examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). These processes are: familiarization; specification; standardization, training and benchmarking; standard setting; and validation.
• Familiarization, standard setting and validation are uncontentious, because they reflect common international assessment practice that is not unique to Europe; however, the other two processes are problematic.
PLDs in the CEFR and in other standards-based systems

CEFR:
• The use of PLDs is institutionalized, and their meaning is generalized across nations.
• Standardization facilitates 'the implementation of a common understanding of the CEFR', and training is cloning rather than familiarization.
• Benchmarking = the process of rating individual performance samples using the CEFR PLDs.
• Standard setting = 'mapping' the existing cut scores from tests onto CEFR levels.

Other standards-based systems:
• PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed.
• Standardization and training ensure that everyone understands the standard-setting method, yet judgments are freely made.
• Benchmarking = the typical performances that are identified after standard setting.
• Standard setting = establishing cut scores on tests.
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices; they are open to change, and even rejection, in the service of language education.
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems, with the intention of imposing a single acceptable teaching and assessment discourse upon professionals.
Thank You For Your Attention
Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards
Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly
The average of these probabilities across judges or raters is the cut score
If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)
Advantages amp disadvantages
Clarity
Simplicity
Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way
+ -
Ebel method 2 Rounds Experts classify independently test items by
I level of difficulty
II level of relevance
easy medium hard
essential important acceptable questionabl
e
Ebel method The judges estimate the percentage of items a borderline
test taker would get correct for each cell Then the percentage for each cell is multiplied by the
number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700
These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge
Finally these are averaged across judges to give a final cut score
All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example
categories Expert 3 Expert 4 Expert 5 Number of items
in a category
(А)
correctly performed
items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
EssentialEasy 11 60 660 10 70 700 13 75 975
Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0
Questionable
Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0
Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35
Mean for all experts 28
Cut-score 12
hellip
Problems with EBELThe complex cognitive requirements of classifying items
according to two criteria in relation to an imagined borderline
student may be challenging for the judges
As it is assumed that some items may have questionable
relevance to the construct of interest it implicitly throws into
doubt the rigor of the test development process and validity
arguments
Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline
test taker would be able to eliminate
In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )
These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score
Problems with Nedelsky method
It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way
Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives
Bookmark method
Directions to Bookmark participants
Ordered item booklet
Booklet guideline
Student exemplar papers
Scoring guide
Essential materials
Standard Setting
Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments
Overview of established cut-scores by every expert repeating of the same procedure as
in the first step
Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is
introduced to them
Basic steps of the procedure
Round III
Round II
Round I
Procedures in Bookmark method
Judges are presented with the necessary materials Then they are asked to keep in mind a borderline
student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly
The bookmarks are discussed in group and finally
the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie
Borderline group method The judges define what borderline candidates are
like and then identify borderline candidates who fit the definition
Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score
The main problem the cut score is dependent upon the group being used in the study
Method of contrasting groupsProcedure includes testing of two groups of examinees
bullThe classification must be done using independent criteria such as teacher judgments
bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions
bull The cut score will be where overlap is observed in the distributions
Competent Non-competent
Which method is the lsquobestrsquo
It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available
However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores
The problem is getting the judgments of a number of people on a large group of individuals
Evaluating standard setting (Kane 1994)
Procedural evidence
bull What procedures were used for the standard-setting to ensure that the process is systematic
bull Were the judges properly trained in the methodology and allowed to express their views freely
Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )
External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible
Training a critical part of standard setting
Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback
Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers
The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity
The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard
setting in order to introduce a common language and a single reporting system into Europe
bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation
bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic
PLDs in CEFR amp in other standard-based systems
The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations
Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization
Benchmarking = the process of rating individual performance samples using the CEFR PLDs
Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels
PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed
Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made
Benchmarking = the typical performances that are identified after standard-setting
Standard-setting = establishing cut scores on tests
CEFR Other standard-based systems
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals
Thank You For Your Attention
- Glen Fulcher (2010) Practical Language Testing Chapter 5 A
- Content of this chapter
- Itrsquos as old as the hills
- Definition of lsquostandardrsquo
- The uses of standards
- Unintended consequences
- Using standards for harmonization amp identity
- Carolingian empire of Charlemagne
- CEFR (Common European Framework of Reference )
- Problems with CEFR
- How many standards can we afford
- Performance level descriptors (PLDs) amp Test scores
- Standard based tests CRT amp scoring rubrics
- Some initial decisions
- Slide 15
- Standard-setting methodologies
- Common process of standard setting
- Test-centered methods
- Angoff method
- Advantages amp disadvantages
- Ebel method
- Ebel method (2)
- Slide 23
- Problems with EBEL
- Nedelsky method (Multiple-choice)
- Problems with Nedelsky method
- Bookmark method
- Slide 28
- Procedures in Bookmark method
- Examinee-centered methods
- Borderline group method
- Method of contrasting groups
- Slide 33
- Which method is the lsquobestrsquo
- Evaluating standard setting (Kane 1994)
- Training a critical part of standard setting
- The special case of the CEFR
- PLDs in CEFR amp in other standard-based systems
- You can always count on uncertainty
- Slide 40
-
Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards
Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly
The average of these probabilities across judges or raters is the cut score
If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)
Advantages amp disadvantages
Clarity
Simplicity
Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way
+ -
Ebel method 2 Rounds Experts classify independently test items by
I level of difficulty
II level of relevance
easy medium hard
essential important acceptable questionabl
e
Ebel method The judges estimate the percentage of items a borderline
test taker would get correct for each cell Then the percentage for each cell is multiplied by the
number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700
These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge
Finally these are averaged across judges to give a final cut score
All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example
categories Expert 3 Expert 4 Expert 5 Number of items
in a category
(А)
correctly performed
items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
EssentialEasy 11 60 660 10 70 700 13 75 975
Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0
Questionable
Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0
Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35
Mean for all experts 28
Cut-score 12
hellip
Problems with EBELThe complex cognitive requirements of classifying items
according to two criteria in relation to an imagined borderline
student may be challenging for the judges
As it is assumed that some items may have questionable
relevance to the construct of interest it implicitly throws into
doubt the rigor of the test development process and validity
arguments
Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline
test taker would be able to eliminate
In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )
These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score
Problems with Nedelsky method
It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way
Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives
Bookmark method
Directions to Bookmark participants
Ordered item booklet
Booklet guideline
Student exemplar papers
Scoring guide
Essential materials
Standard Setting
Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments
Overview of established cut-scores by every expert repeating of the same procedure as
in the first step
Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is
introduced to them
Basic steps of the procedure
Round III
Round II
Round I
Procedures in Bookmark method
Judges are presented with the necessary materials Then they are asked to keep in mind a borderline
student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly
The bookmarks are discussed in group and finally
the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie
Borderline group method The judges define what borderline candidates are
like and then identify borderline candidates who fit the definition
Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score
The main problem the cut score is dependent upon the group being used in the study
Method of contrasting groupsProcedure includes testing of two groups of examinees
bullThe classification must be done using independent criteria such as teacher judgments
bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions
bull The cut score will be where overlap is observed in the distributions
Competent Non-competent
Which method is the lsquobestrsquo
It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available
However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores
The problem is getting the judgments of a number of people on a large group of individuals
Evaluating standard setting (Kane 1994)
Procedural evidence
bull What procedures were used for the standard-setting to ensure that the process is systematic
bull Were the judges properly trained in the methodology and allowed to express their views freely
Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )
External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible
Training a critical part of standard setting
Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback
Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers
The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity
The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard
setting in order to introduce a common language and a single reporting system into Europe
bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation
bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic
PLDs in CEFR amp in other standard-based systems
The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations
Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization
Benchmarking = the process of rating individual performance samples using the CEFR PLDs
Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels
PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed
Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made
Benchmarking = the typical performances that are identified after standard-setting
Standard-setting = establishing cut scores on tests
CEFR Other standard-based systems
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals
Thank You For Your Attention
- Glen Fulcher (2010) Practical Language Testing Chapter 5 A
- Content of this chapter
- Itrsquos as old as the hills
- Definition of lsquostandardrsquo
- The uses of standards
- Unintended consequences
- Using standards for harmonization amp identity
- Carolingian empire of Charlemagne
- CEFR (Common European Framework of Reference )
- Problems with CEFR
- How many standards can we afford
- Performance level descriptors (PLDs) amp Test scores
- Standard based tests CRT amp scoring rubrics
- Some initial decisions
- Slide 15
- Standard-setting methodologies
- Common process of standard setting
- Test-centered methods
- Angoff method
- Advantages amp disadvantages
- Ebel method
- Ebel method (2)
- Slide 23
- Problems with EBEL
- Nedelsky method (Multiple-choice)
- Problems with Nedelsky method
- Bookmark method
- Slide 28
- Procedures in Bookmark method
- Examinee-centered methods
- Borderline group method
- Method of contrasting groups
- Slide 33
- Which method is the lsquobestrsquo
- Evaluating standard setting (Kane 1994)
- Training a critical part of standard setting
- The special case of the CEFR
- PLDs in CEFR amp in other standard-based systems
- You can always count on uncertainty
- Slide 40
-
Angoff method Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly
The average of these probabilities across judges or raters is the cut score
If the test contains polytomous items or tasks the proportion of the maximum score is used instead of the probability (modified Angoff)
Advantages amp disadvantages
Clarity
Simplicity
Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way
+ -
Ebel method 2 Rounds Experts classify independently test items by
I level of difficulty
II level of relevance
easy medium hard
essential important acceptable questionabl
e
Ebel method The judges estimate the percentage of items a borderline
test taker would get correct for each cell Then the percentage for each cell is multiplied by the
number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700
These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge
Finally these are averaged across judges to give a final cut score
All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example
categories Expert 3 Expert 4 Expert 5 Number of items
in a category
(А)
correctly performed
items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
EssentialEasy 11 60 660 10 70 700 13 75 975
Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0
Questionable
Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0
Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35
Mean for all experts 28
Cut-score 12
hellip
Problems with Ebel:
• The complex cognitive requirement of classifying items according to two criteria in relation to an imagined borderline student may be challenging for the judges.
• Because it assumes that some items may have questionable relevance to the construct of interest, it implicitly throws into doubt the rigor of the test development process and the validity argument.
Nedelsky method (multiple choice): The experts estimate which options of each multiple-choice item a borderline test taker would be able to eliminate.
In a four-option item with three distractors, if a candidate can eliminate all 3 distractors, the chance of getting the item right is 1 (100%); but if he can rule out only 1 of the distractors, the chance of answering the item correctly is 1 in 3 (33%).
These probabilities are averaged across all items for each judge, and then across all judges, to arrive at a cut score.
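A minimal sketch of this arithmetic (the option counts are invented): for each item we record how many options the borderline candidate could *not* eliminate, take 1 divided by that count as the guessing probability, and average as the text describes, yielding a proportion-correct cut score.

```python
# Rows = judges, columns = items; each value is the number of options
# remaining after the borderline candidate's eliminations (1-4).
remaining = [
    [1, 3, 2, 4],  # judge 1
    [2, 3, 2, 4],  # judge 2
]

def nedelsky_cut(remaining):
    # Mean guessing probability per judge, then mean across judges.
    per_judge = [sum(1 / r for r in row) / len(row) for row in remaining]
    return sum(per_judge) / len(per_judge)

print(nedelsky_cut(remaining))  # proportion-correct cut score
```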
Problems with the Nedelsky method:
• It assumes that test takers answer multiple-choice items by eliminating the options they think are distractors and then guessing randomly among the remaining options. However, it is highly unlikely that test takers answer items in this way.
• The Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives.
Bookmark method

Essential materials:
• Directions to Bookmark participants
• Ordered item booklet
• Booklet guideline
• Student exemplar papers
• Scoring guide

Basic steps of the procedure:
• Round I: Experts are informed of the number of cut scores to establish; they work in small groups, and all the essential material is introduced to them.
• Round II: Overview of the cut scores established by every expert, followed by a repetition of the same procedure as in the first round.
• Round III: Presentation of the percentage of students falling into each performance level and each median cut score from Round II; after discussion, individual judgments are made again.
Procedures in the Bookmark method
Judges are presented with the necessary materials. They are then asked to keep a borderline student in mind and place a 'bookmark' in the ordered item booklet between two items, such that the candidate is more likely to answer the items below the bookmark correctly and the items above it incorrectly.
The bookmarks are discussed in the group, and finally the median of the bookmarks for each cut point is taken as the group's recommendation for that cut point.
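The final aggregation step can be sketched as follows (the bookmark positions are invented):

```python
import statistics

# One bookmark position (page in the ordered item booklet) per judge.
bookmarks = [23, 25, 24, 27, 24]

# The group's recommended cut point is the median bookmark.
cut_point = statistics.median(bookmarks)
print(cut_point)  # 24
```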
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard; the test is then administered to those test takers to discover where the cut score should lie.

Borderline group method: The judges define what borderline candidates are like and then identify the candidates who fit the definition.
Once the students have been placed into groups, the test is administered. The median score of the group defined as borderline is used as the cut score.
The main problem: the cut score is dependent upon the group used in the study.
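A minimal sketch of the borderline-group calculation (the scores are invented):

```python
import statistics

# Test scores of the candidates the judges classified as borderline.
borderline_scores = [41, 38, 45, 40, 43, 39]

# The cut score is the median score of the borderline group.
cut_score = statistics.median(borderline_scores)
print(cut_score)  # 40.5
```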
Method of contrasting groups: The procedure involves testing two groups of examinees, one classified as competent and one as non-competent.
• The classification must be made using independent criteria, such as teacher judgments.
• The test is then given and the score distributions are calculated; the two distributions are likely to overlap.
• The cut score is placed where the overlap between the distributions is observed.
(Figure: overlapping score distributions of the competent and non-competent groups.)
Which method is the 'best'?
It depends on what kind of judgments you can get for your standard-setting study and on the quality of the judges you have available.
However, the contrasting-groups approach is recommended where possible, because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores.
The practical problem is getting judgments from a number of people on a large group of individuals.
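The decision-error calculation that makes contrasting groups attractive can be sketched as follows (group memberships and scores are invented): given independently classified groups, any candidate cut score implies countable false positives and false negatives.

```python
# Scores of examinees independently classified (e.g. by teachers).
competent = [55, 60, 48, 52, 58, 47]
non_competent = [40, 45, 38, 50, 42, 36]

def decision_errors(cut, competent, non_competent):
    # False negative: a competent examinee who would fail at this cut.
    false_negatives = sum(s < cut for s in competent)
    # False positive: a non-competent examinee who would pass at this cut.
    false_positives = sum(s >= cut for s in non_competent)
    return false_positives, false_negatives

print(decision_errors(46, competent, non_competent))  # (1, 0)
```

Trying several cut values in the overlap region shows the trade-off between the two error types, which is exactly the information the other methods cannot provide.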
Evaluating standard setting (Kane, 1994)

Procedural evidence:
• What procedures were used for the standard setting to ensure that the process was systematic?
• Were the judges properly trained in the methodology and allowed to express their views freely?

Internal evidence:
• Deals with the consistency of the results arising from the procedure.
• Also estimates the extent of agreement between judges (e.g. Cohen's kappa).

External evidence:
• Correlation of the scores of learners in a borderline-group study with some other test of the same construct.
• High correlation = the established cut scores are defensible.
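As one internal-evidence check, Cohen's kappa for two judges' pass/fail classifications of the same candidates can be computed as follows (a hypothetical sketch; the judge data are invented):

```python
def cohens_kappa(a, b):
    # Chance-corrected agreement between two raters over the same cases.
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    # Expected agreement if each rater classified independently at their
    # own marginal rates.
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

judge1 = ["pass", "pass", "fail", "pass", "fail"]
judge2 = ["pass", "fail", "fail", "pass", "fail"]
print(cohens_kappa(judge1, judge2))  # about 0.615
```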
Training: a critical part of standard setting
Training activities include familiarization with the PLDs and the test, looking at the scoring keys, making practice judgments, and getting feedback.
Different views may lead to disagreements among the judges. Training should not be designed to eliminate these variations but to allow free discussion among judges; if the judges do not converge, the outcome should be accepted by the researchers.
The training process should not force agreement ('cloning'), because removing the judges' individuality and inducing agreement is a threat to validity.
The special case of the CEFR
• The CEFR Manual contains performance level descriptors for standard setting, introduced to provide a common language and a single reporting system in Europe.
• It recommends five processes to 'relate' language examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). These processes are familiarization, specification, standardization training/benchmarking, standard setting, and validation.
• Familiarization, standard setting, and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe; the other two processes, however, are problematic.
PLDs in the CEFR & in other standards-based systems

CEFR:
• The use of PLDs is institutionalized, and their meaning is generalized across nations.
• Standardization facilitates 'the implementation of a common understanding' of the CEFR, and training is cloning rather than familiarization.
• Benchmarking = the process of rating individual performance samples using the CEFR PLDs.
• Standard setting = 'mapping' the existing cut scores from tests onto CEFR levels.

Other standards-based systems:
• PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed.
• Standardization and training ensure that everyone understands the standard-setting method, yet judgments are freely made.
• Benchmarking = identifying typical performances after standard setting.
• Standard setting = establishing cut scores on tests.
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices; they are open to change, and even rejection, in the service of language education.
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems, with the intention of imposing a single acceptable teaching and assessment discourse upon professionals.
Thank You For Your Attention
Slide index:
- Glen Fulcher (2010), Practical Language Testing, Chapter 5: Aligning Tests to Standards
- Content of this chapter
- It's as old as the hills
- Definition of 'standard'
- The uses of standards
- Unintended consequences
- Using standards for harmonization & identity
- Carolingian empire of Charlemagne
- CEFR (Common European Framework of Reference)
- Problems with the CEFR
- How many standards can we afford?
- Performance level descriptors (PLDs) & test scores
- Standards-based tests, CRT & scoring rubrics
- Some initial decisions
- Standard-setting methodologies
- Common process of standard setting
- Test-centered methods
- Angoff method
- Advantages & disadvantages
- Ebel method
- Problems with Ebel
- Nedelsky method (multiple choice)
- Problems with the Nedelsky method
- Bookmark method
- Procedures in the Bookmark method
- Examinee-centered methods
- Borderline group method
- Method of contrasting groups
- Which method is the 'best'?
- Evaluating standard setting (Kane, 1994)
- Training: a critical part of standard setting
- The special case of the CEFR
- PLDs in the CEFR & in other standards-based systems
- You can always count on uncertainty
Advantages amp disadvantages
Clarity
Simplicity
Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way
+ -
Ebel method 2 Rounds Experts classify independently test items by
I level of difficulty
II level of relevance
easy medium hard
essential important acceptable questionabl
e
Ebel method The judges estimate the percentage of items a borderline
test taker would get correct for each cell Then the percentage for each cell is multiplied by the
number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700
These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge
Finally these are averaged across judges to give a final cut score
All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example
categories Expert 3 Expert 4 Expert 5 Number of items
in a category
(А)
correctly performed
items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
EssentialEasy 11 60 660 10 70 700 13 75 975
Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0
Questionable
Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0
Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35
Mean for all experts 28
Cut-score 12
hellip
Problems with EBELThe complex cognitive requirements of classifying items
according to two criteria in relation to an imagined borderline
student may be challenging for the judges
As it is assumed that some items may have questionable
relevance to the construct of interest it implicitly throws into
doubt the rigor of the test development process and validity
arguments
Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline
test taker would be able to eliminate
In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )
These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score
Problems with Nedelsky method
It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way
Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives
Bookmark method
Directions to Bookmark participants
Ordered item booklet
Booklet guideline
Student exemplar papers
Scoring guide
Essential materials
Standard Setting
Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments
Overview of established cut-scores by every expert repeating of the same procedure as
in the first step
Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is
introduced to them
Basic steps of the procedure
Round III
Round II
Round I
Procedures in Bookmark method
Judges are presented with the necessary materials Then they are asked to keep in mind a borderline
student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly
The bookmarks are discussed in group and finally
the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie
Borderline group method The judges define what borderline candidates are
like and then identify borderline candidates who fit the definition
Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score
The main problem the cut score is dependent upon the group being used in the study
Method of contrasting groupsProcedure includes testing of two groups of examinees
bullThe classification must be done using independent criteria such as teacher judgments
bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions
bull The cut score will be where overlap is observed in the distributions
Competent Non-competent
Which method is the lsquobestrsquo
It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available
However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores
The problem is getting the judgments of a number of people on a large group of individuals
Evaluating standard setting (Kane 1994)
Procedural evidence
bull What procedures were used for the standard-setting to ensure that the process is systematic
bull Were the judges properly trained in the methodology and allowed to express their views freely
Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )
External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible
Training a critical part of standard setting
Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback
Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers
The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity
The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard
setting in order to introduce a common language and a single reporting system into Europe
bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation
bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic
PLDs in CEFR amp in other standard-based systems
The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations
Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization
Benchmarking = the process of rating individual performance samples using the CEFR PLDs
Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels
PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed
Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made
Benchmarking = the typical performances that are identified after standard-setting
Standard-setting = establishing cut scores on tests
CEFR Other standard-based systems
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals
Thank You For Your Attention
- Glen Fulcher (2010) Practical Language Testing Chapter 5 A
- Content of this chapter
- Itrsquos as old as the hills
- Definition of lsquostandardrsquo
- The uses of standards
- Unintended consequences
- Using standards for harmonization amp identity
- Carolingian empire of Charlemagne
- CEFR (Common European Framework of Reference )
- Problems with CEFR
- How many standards can we afford
- Performance level descriptors (PLDs) amp Test scores
- Standard based tests CRT amp scoring rubrics
- Some initial decisions
- Slide 15
- Standard-setting methodologies
- Common process of standard setting
- Test-centered methods
- Angoff method
- Advantages amp disadvantages
- Ebel method
- Ebel method (2)
- Slide 23
- Problems with EBEL
- Nedelsky method (Multiple-choice)
- Problems with Nedelsky method
- Bookmark method
- Slide 28
- Procedures in Bookmark method
- Examinee-centered methods
- Borderline group method
- Method of contrasting groups
- Slide 33
- Which method is the lsquobestrsquo
- Evaluating standard setting (Kane 1994)
- Training a critical part of standard setting
- The special case of the CEFR
- PLDs in CEFR amp in other standard-based systems
- You can always count on uncertainty
- Slide 40
-
Ebel method 2 Rounds Experts classify independently test items by
I level of difficulty
II level of relevance
easy medium hard
essential important acceptable questionabl
e
Ebel method The judges estimate the percentage of items a borderline
test taker would get correct for each cell Then the percentage for each cell is multiplied by the
number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700
These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge
Finally these are averaged across judges to give a final cut score
All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example
categories Expert 3 Expert 4 Expert 5 Number of items
in a category
(А)
correctly performed
items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
EssentialEasy 11 60 660 10 70 700 13 75 975
Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0
Questionable
Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0
Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35
Mean for all experts 28
Cut-score 12
hellip
Problems with EBELThe complex cognitive requirements of classifying items
according to two criteria in relation to an imagined borderline
student may be challenging for the judges
As it is assumed that some items may have questionable
relevance to the construct of interest it implicitly throws into
doubt the rigor of the test development process and validity
arguments
Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline
test taker would be able to eliminate
In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )
These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score
Problems with Nedelsky method
It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way
Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives
Bookmark method
Directions to Bookmark participants
Ordered item booklet
Booklet guideline
Student exemplar papers
Scoring guide
Essential materials
Standard Setting
Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments
Overview of established cut-scores by every expert repeating of the same procedure as
in the first step
Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is
introduced to them
Basic steps of the procedure
Round III
Round II
Round I
Procedures in Bookmark method
Judges are presented with the necessary materials Then they are asked to keep in mind a borderline
student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly
The bookmarks are discussed in group and finally
the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie
Borderline group method The judges define what borderline candidates are
like and then identify borderline candidates who fit the definition
Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score
The main problem the cut score is dependent upon the group being used in the study
Method of contrasting groupsProcedure includes testing of two groups of examinees
bullThe classification must be done using independent criteria such as teacher judgments
bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions
bull The cut score will be where overlap is observed in the distributions
Competent Non-competent
Which method is the lsquobestrsquo
It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available
However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores
The problem is getting the judgments of a number of people on a large group of individuals
Evaluating standard setting (Kane 1994)
Procedural evidence
bull What procedures were used for the standard-setting to ensure that the process is systematic
bull Were the judges properly trained in the methodology and allowed to express their views freely
Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )
External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible
Training a critical part of standard setting
Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback
Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers
The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity
The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard
setting in order to introduce a common language and a single reporting system into Europe
bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation
bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic
PLDs in CEFR amp in other standard-based systems
The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations
Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization
Benchmarking = the process of rating individual performance samples using the CEFR PLDs
Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels
PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed
Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made
Benchmarking = the typical performances that are identified after standard-setting
Standard-setting = establishing cut scores on tests
CEFR Other standard-based systems
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals
Thank You For Your Attention
- Glen Fulcher (2010) Practical Language Testing Chapter 5 A
- Content of this chapter
- Itrsquos as old as the hills
- Definition of lsquostandardrsquo
- The uses of standards
- Unintended consequences
- Using standards for harmonization amp identity
- Carolingian empire of Charlemagne
- CEFR (Common European Framework of Reference )
- Problems with CEFR
- How many standards can we afford
- Performance level descriptors (PLDs) amp Test scores
- Standard based tests CRT amp scoring rubrics
- Some initial decisions
- Slide 15
- Standard-setting methodologies
- Common process of standard setting
- Test-centered methods
- Angoff method
- Advantages amp disadvantages
- Ebel method
- Ebel method (2)
- Slide 23
- Problems with EBEL
- Nedelsky method (Multiple-choice)
- Problems with Nedelsky method
- Bookmark method
- Slide 28
- Procedures in Bookmark method
- Examinee-centered methods
- Borderline group method
- Method of contrasting groups
- Slide 33
- Which method is the lsquobestrsquo
- Evaluating standard setting (Kane 1994)
- Training a critical part of standard setting
- The special case of the CEFR
- PLDs in CEFR amp in other standard-based systems
- You can always count on uncertainty
- Slide 40
-
Ebel method The judges estimate the percentage of items a borderline
test taker would get correct for each cell Then the percentage for each cell is multiplied by the
number of items so if the lsquoeasyessentialrsquo cell has 20 items 20 1113088 85 = 1700
These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge
Finally these are averaged across judges to give a final cut score
All items could be classified 12 cells in a 34 grid defined by the three difficulty and four relevance category As in the example
categories Expert 3 Expert 4 Expert 5 Number of items
in a category
(А)
correctly performed
items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
Number of items
in a category
(А)
correctly
performed items
(В)
АВ
EssentialEasy 11 60 660 10 70 700 13 75 975
Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0
Questionable
Easy 0 0 0 0 0 0 0 0 0Medium 0 0 0 0 0 0 0 0 0
Hard 0 0 0 0 0 0 0 0 0Mean 251 267 35
Mean for all experts 28
Cut-score 12
hellip
Problems with EBELThe complex cognitive requirements of classifying items
according to two criteria in relation to an imagined borderline
student may be challenging for the judges
As it is assumed that some items may have questionable
relevance to the construct of interest it implicitly throws into
doubt the rigor of the test development process and validity
arguments
Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline
test taker would be able to eliminate
In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )
These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score
Problems with Nedelsky method
It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way
Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives
Bookmark method
Directions to Bookmark participants
Ordered item booklet
Booklet guideline
Student exemplar papers
Scoring guide
Essential materials
Standard Setting
Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments
Overview of established cut-scores by every expert repeating of the same procedure as
in the first step
Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is
introduced to them
Basic steps of the procedure
Round III
Round II
Round I
Procedures in Bookmark method
Judges are presented with the necessary materials Then they are asked to keep in mind a borderline
student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly
The bookmarks are discussed in group and finally
the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie
Borderline group method The judges define what borderline candidates are
like and then identify borderline candidates who fit the definition
Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score
The main problem the cut score is dependent upon the group being used in the study
Method of contrasting groupsProcedure includes testing of two groups of examinees
bullThe classification must be done using independent criteria such as teacher judgments
bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions
bull The cut score will be where overlap is observed in the distributions
Competent Non-competent
Which method is the 'best'?
It depends on what kind of judgments you can get for your standard-setting study and on the quality of the judges you have available. However, the contrasting-groups approach is recommended where possible, because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores.
The practical problem is getting judgments from a number of people on a large group of individuals.
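The contrasting-groups logic can be sketched as a search over the overlap region, scoring each candidate cut by the decision errors it would produce. The score data below are invented for illustration.

```python
# Hypothetical test scores for two independently classified groups.
competent = [62, 70, 75, 81, 68, 77, 84, 73]
noncompetent = [45, 58, 52, 64, 49, 61, 55, 66]

def decision_errors(cut):
    # False negatives: competent examinees who would fall below the cut.
    fn = sum(1 for s in competent if s < cut)
    # False positives: non-competent examinees at or above the cut.
    fp = sum(1 for s in noncompetent if s >= cut)
    return fn + fp

# Search the full score range for the cut with the fewest total errors.
candidates = range(min(competent + noncompetent), max(competent + noncompetent) + 1)
best_cut = min(candidates, key=decision_errors)
print(best_cut, decision_errors(best_cut))  # 67 1
```

With these invented distributions the best cut (67) still misclassifies one competent examinee, which is exactly the kind of error-rate information the other methods cannot provide.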
Evaluating standard setting (Kane, 1994)
Procedural evidence:
• What procedures were used in the standard setting to ensure that the process is systematic?
• Were the judges properly trained in the methodology and allowed to express their views freely?
Internal evidence:
• Deals with the consistency of results arising from the procedure.
• Also estimates the extent of agreement between judges (e.g. Cohen's kappa).
External evidence:
• Correlation of the scores of learners in a borderline group study with some other test of the same construct.
• High correlation = the established cut scores are defensible.
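As a toy illustration of internal evidence, Cohen's kappa can be computed for two judges classifying the same examinees as master ("M") or non-master ("N"). The classifications below are invented for the example.

```python
# Invented master/non-master classifications from two judges.
judge1 = ["M", "M", "N", "M", "N", "N", "M", "N", "M", "M"]
judge2 = ["M", "N", "N", "M", "N", "M", "M", "N", "M", "M"]

n = len(judge1)
# Observed proportion of agreement.
observed = sum(a == b for a, b in zip(judge1, judge2)) / n

# Chance agreement expected from each judge's marginal proportions.
cats = set(judge1) | set(judge2)
expected = sum((judge1.count(c) / n) * (judge2.count(c) / n) for c in cats)

# Kappa corrects observed agreement for chance agreement.
kappa = (observed - expected) / (1 - expected)
print(round(kappa, 2))  # 0.58
```

Kappa near 0 would mean agreement no better than chance; values approaching 1 support the consistency of the judges' classifications.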
Training: a critical part of standard setting
Training activities include familiarization with the PLDs and the test, looking at the scoring keys, making practice judgments, and getting feedback.
Different views may lead to disagreements among the judges. Training should not be designed to eliminate these variations but to allow free discussion among judges; if the judges do not converge, the outcome should be accepted by the researchers.
The training process should not force agreement ('cloning'), because removing the judges' individuality and inducing agreement is a threat to validity.
The special case of the CEFR
• The CEFR Manual contains performance level descriptors for standard setting, in order to introduce a common language and a single reporting system into Europe.
• It recommends five processes to 'relate' language examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). These processes are: familiarization; specification; standardization training/benchmarking; standard setting; and validation.
• Familiarization, standard setting, and validation are uncontentious, because they reflect common international assessment practice that is not unique to Europe; the other two processes, however, are problematic.
PLDs in the CEFR and in other standards-based systems

CEFR:
• The use of PLDs is institutionalized, and their meaning is generalized across nations.
• Standardization facilitates 'the implementation of a common understanding of the CEFR', and training is cloning rather than familiarization.
• Benchmarking = the process of rating individual performance samples using the CEFR PLDs.
• Standard setting = 'mapping' the existing cut scores from tests onto CEFR levels.

Other standards-based systems:
• PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed.
• Standardization and training ensure that everyone understands the standard-setting method, yet judgments are freely made.
• Benchmarking = the typical performances that are identified after standard setting.
• Standard setting = establishing cut scores on tests.
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices; they are open to change and even rejection in the service of language education.
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems, with the intention of imposing a single acceptable teaching and assessment discourse upon professionals.
Thank You For Your Attention
- Glen Fulcher (2010) Practical Language Testing, Chapter 5: Aligning tests to standards
- Content of this chapter
- It's as old as the hills
- Definition of 'standard'
- The uses of standards
- Unintended consequences
- Using standards for harmonization & identity
- Carolingian empire of Charlemagne
- CEFR (Common European Framework of Reference)
- Problems with CEFR
- How many standards can we afford?
- Performance level descriptors (PLDs) & test scores
- Standards-based tests, CRT & scoring rubrics
- Some initial decisions
- Slide 15
- Standard-setting methodologies
- Common process of standard setting
- Test-centered methods
- Angoff method
- Advantages & disadvantages
- Ebel method
- Ebel method (2)
- Slide 23
- Problems with Ebel
- Nedelsky method (multiple-choice)
- Problems with Nedelsky method
- Bookmark method
- Slide 28
- Procedures in Bookmark method
- Examinee-centered methods
- Borderline group method
- Method of contrasting groups
- Slide 33
- Which method is the 'best'?
- Evaluating standard setting (Kane, 1994)
- Training: a critical part of standard setting
- The special case of the CEFR
- PLDs in CEFR & other standards-based systems
- You can always count on uncertainty
- Slide 40
All items can be classified into 12 cells in a 3 × 4 grid defined by the three difficulty and four relevance categories, as in the example. For each expert, A = number of items in a category, B = percentage of items in that category a borderline candidate would perform correctly, and A×B is their product.

| Category | Expert 3 (A / B / A×B) | Expert 4 (A / B / A×B) | Expert 5 (A / B / A×B) |
| --- | --- | --- | --- |
| Essential, Easy | 11 / 60 / 660 | 10 / 70 / 700 | 13 / 75 / 975 |
| Essential, Medium | 1 / 25 / 25 | 3 / 25 / 75 | 1 / 0 / 0 |
| Essential, Hard | 0 / 10 / 0 | 1 / 0 / 0 | 0 / 0 / 0 |
| Questionable, Easy | 0 / 0 / 0 | 0 / 0 / 0 | 0 / 0 / 0 |
| Questionable, Medium | 0 / 0 / 0 | 0 / 0 / 0 | 0 / 0 / 0 |
| Questionable, Hard | 0 / 0 / 0 | 0 / 0 / 0 | 0 / 0 / 0 |
| Mean | 251 | 267 | 35 |

Mean for all experts: 28
Cut score: 12
…
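The Ebel computation implied by the table can be sketched as follows: each expert's expected borderline score is the sum over cells of A × B / 100. The cells below are Expert 3's "Essential" row from the table above; the rows for the other relevance categories are elided in this transcript, so the figure is only a partial illustration.

```python
# (A, B%) pairs for one expert: number of items in a cell and the judged
# percent a borderline candidate would answer correctly in that cell.
# These are Expert 3's Essential/Easy, Medium, and Hard cells.
cells = [(11, 60), (1, 25), (0, 10)]

# Expected borderline raw score: sum of A * B% / 100 over all cells.
expected_score = sum(a * b for a, b in cells) / 100
print(expected_score)  # 6.85
```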
Problems with Ebel
The complex cognitive requirement of classifying items according to two criteria, in relation to an imagined borderline student, may be challenging for the judges.
As it is assumed that some items may have questionable relevance to the construct of interest, the method implicitly throws into doubt the rigor of the test development process and the validity arguments.
Nedelsky method (multiple-choice)
The experts estimate which options of each multiple-choice item a borderline test taker would be able to eliminate.
In a four-option item with three distractors, if a candidate can eliminate all 3 distractors, the chance of getting the item right is 1 in 1 (100%); but if he can only rule out 1 of the distractors, the chance of answering the item correctly is 1 in 3 (33%).
These probabilities are averaged across all items for each judge, and then across all judges, to arrive at a cut score.
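The averaging step can be sketched for a single judge. The item data below are hypothetical four-option items; a full study would repeat this per judge and average the judges' values.

```python
OPTIONS = 4  # four-option multiple-choice items

# Number of distractors one judge believes a borderline test taker
# could eliminate on each item (hypothetical values).
eliminated = [3, 1, 2, 0, 3]

# With e distractors eliminated, (OPTIONS - e) options remain, so the
# chance of a correct guess is 1 / (OPTIONS - e).
probs = [1 / (OPTIONS - e) for e in eliminated]

# This judge's cut score, as a proportion of the maximum raw score.
cut = sum(probs) / len(probs)
print(round(cut, 3))  # 0.617
```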
Problems with Nedelsky method
It assumes that test takers answer multiple-choice items by eliminating the options they think are distractors and then guessing randomly between the remaining options. However, it is highly unlikely that test takers answer items in this way.
The Nedelsky method also tends to produce lower cut scores than other methods, and is therefore likely to increase the number of false positives.
Bookmark method
Essential materials for a Bookmark standard-setting session:
• Directions to Bookmark participants
• Ordered item booklet
• Booklet guideline
• Student exemplar papers
• Scoring guide
Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments
Overview of established cut-scores by every expert repeating of the same procedure as
in the first step
Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is
introduced to them
Basic steps of the procedure
Round III
Round II
Round I
Procedures in Bookmark method
Judges are presented with the necessary materials Then they are asked to keep in mind a borderline
student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly
The bookmarks are discussed in group and finally
the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie
Borderline group method The judges define what borderline candidates are
like and then identify borderline candidates who fit the definition
Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score
The main problem the cut score is dependent upon the group being used in the study
Method of contrasting groupsProcedure includes testing of two groups of examinees
bullThe classification must be done using independent criteria such as teacher judgments
bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions
bull The cut score will be where overlap is observed in the distributions
Competent Non-competent
Which method is the lsquobestrsquo
It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available
However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores
The problem is getting the judgments of a number of people on a large group of individuals
Evaluating standard setting (Kane 1994)
Procedural evidence
bull What procedures were used for the standard-setting to ensure that the process is systematic
bull Were the judges properly trained in the methodology and allowed to express their views freely
Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )
External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible
Training a critical part of standard setting
Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback
Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers
The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity
The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard
setting in order to introduce a common language and a single reporting system into Europe
bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation
bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic
PLDs in CEFR amp in other standard-based systems
The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations
Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization
Benchmarking = the process of rating individual performance samples using the CEFR PLDs
Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels
PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed
Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made
Benchmarking = the typical performances that are identified after standard-setting
Standard-setting = establishing cut scores on tests
CEFR Other standard-based systems
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals
Thank You For Your Attention
- Glen Fulcher (2010) Practical Language Testing Chapter 5 A
- Content of this chapter
- Itrsquos as old as the hills
- Definition of lsquostandardrsquo
- The uses of standards
- Unintended consequences
- Using standards for harmonization amp identity
- Carolingian empire of Charlemagne
- CEFR (Common European Framework of Reference )
- Problems with CEFR
- How many standards can we afford
- Performance level descriptors (PLDs) amp Test scores
- Standard based tests CRT amp scoring rubrics
- Some initial decisions
- Slide 15
- Standard-setting methodologies
- Common process of standard setting
- Test-centered methods
- Angoff method
- Advantages amp disadvantages
- Ebel method
- Ebel method (2)
- Slide 23
- Problems with EBEL
- Nedelsky method (Multiple-choice)
- Problems with Nedelsky method
- Bookmark method
- Slide 28
- Procedures in Bookmark method
- Examinee-centered methods
- Borderline group method
- Method of contrasting groups
- Slide 33
- Which method is the lsquobestrsquo
- Evaluating standard setting (Kane 1994)
- Training a critical part of standard setting
- The special case of the CEFR
- PLDs in CEFR amp in other standard-based systems
- You can always count on uncertainty
- Slide 40
-
Problems with EBELThe complex cognitive requirements of classifying items
according to two criteria in relation to an imagined borderline
student may be challenging for the judges
As it is assumed that some items may have questionable
relevance to the construct of interest it implicitly throws into
doubt the rigor of the test development process and validity
arguments
Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline
test taker would be able to eliminate
In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )
These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score
Problems with Nedelsky method
It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way
Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives
Bookmark method
Directions to Bookmark participants
Ordered item booklet
Booklet guideline
Student exemplar papers
Scoring guide
Essential materials
Standard Setting
Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments
Overview of established cut-scores by every expert repeating of the same procedure as
in the first step
Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is
introduced to them
Basic steps of the procedure
Round III
Round II
Round I
Procedures in Bookmark method
Judges are presented with the necessary materials Then they are asked to keep in mind a borderline
student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly
The bookmarks are discussed in group and finally
the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie
Borderline group method The judges define what borderline candidates are
like and then identify borderline candidates who fit the definition
Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score
The main problem the cut score is dependent upon the group being used in the study
Method of contrasting groupsProcedure includes testing of two groups of examinees
bullThe classification must be done using independent criteria such as teacher judgments
bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions
bull The cut score will be where overlap is observed in the distributions
Competent Non-competent
Which method is the lsquobestrsquo
It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available
However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores
The problem is getting the judgments of a number of people on a large group of individuals
Evaluating standard setting (Kane 1994)
Procedural evidence
bull What procedures were used for the standard-setting to ensure that the process is systematic
bull Were the judges properly trained in the methodology and allowed to express their views freely
Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )
External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible
Training a critical part of standard setting
Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback
Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers
The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity
The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard
setting in order to introduce a common language and a single reporting system into Europe
bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation
bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic
PLDs in CEFR amp in other standard-based systems
The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations
Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization
Benchmarking = the process of rating individual performance samples using the CEFR PLDs
Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels
PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed
Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made
Benchmarking = the typical performances that are identified after standard-setting
Standard-setting = establishing cut scores on tests
CEFR Other standard-based systems
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals
Thank You For Your Attention
- Glen Fulcher (2010) Practical Language Testing Chapter 5 A
- Content of this chapter
- Itrsquos as old as the hills
- Definition of lsquostandardrsquo
- The uses of standards
- Unintended consequences
- Using standards for harmonization amp identity
- Carolingian empire of Charlemagne
- CEFR (Common European Framework of Reference )
- Problems with CEFR
- How many standards can we afford
- Performance level descriptors (PLDs) amp Test scores
- Standard based tests CRT amp scoring rubrics
- Some initial decisions
- Slide 15
- Standard-setting methodologies
- Common process of standard setting
- Test-centered methods
- Angoff method
- Advantages amp disadvantages
- Ebel method
- Ebel method (2)
- Slide 23
- Problems with EBEL
- Nedelsky method (Multiple-choice)
- Problems with Nedelsky method
- Bookmark method
- Slide 28
- Procedures in Bookmark method
- Examinee-centered methods
- Borderline group method
- Method of contrasting groups
- Slide 33
- Which method is the lsquobestrsquo
- Evaluating standard setting (Kane 1994)
- Training a critical part of standard setting
- The special case of the CEFR
- PLDs in CEFR amp in other standard-based systems
- You can always count on uncertainty
- Slide 40
-
Nedelsky method (Multiple-choice)The experts estimate the multiple-choice items a borderline
test taker would be able to eliminate
In a four-option item with three distractors if a candidate can eliminate 3 of the distractors the chances of getting the item right are 1 (100 ) but if he can only rule out 1 of the items the chance of answering the item correctly is 1 in 3 (33 )
These probabilities are averaged across all items for each judge and then across all judges to arrive at a cut score
Problems with Nedelsky method
It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way
Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives
Bookmark method
Directions to Bookmark participants
Ordered item booklet
Booklet guideline
Student exemplar papers
Scoring guide
Essential materials
Standard Setting
Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments
Overview of established cut-scores by every expert repeating of the same procedure as
in the first step
Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is
introduced to them
Basic steps of the procedure
Round III
Round II
Round I
Procedures in Bookmark method
Judges are presented with the necessary materials Then they are asked to keep in mind a borderline
student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly
The bookmarks are discussed in group and finally
the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie
Borderline group method The judges define what borderline candidates are
like and then identify borderline candidates who fit the definition
Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score
The main problem the cut score is dependent upon the group being used in the study
Method of contrasting groupsProcedure includes testing of two groups of examinees
bullThe classification must be done using independent criteria such as teacher judgments
bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions
bull The cut score will be where overlap is observed in the distributions
Competent Non-competent
Which method is the lsquobestrsquo
It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available
However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores
The problem is getting the judgments of a number of people on a large group of individuals
Evaluating standard setting (Kane 1994)
Procedural evidence
bull What procedures were used for the standard-setting to ensure that the process is systematic
bull Were the judges properly trained in the methodology and allowed to express their views freely
Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )
External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible
Training a critical part of standard setting
Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback
Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers
The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity
The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard
setting in order to introduce a common language and a single reporting system into Europe
bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation
bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic
PLDs in CEFR amp in other standard-based systems
The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations
Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization
Benchmarking = the process of rating individual performance samples using the CEFR PLDs
Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels
PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed
Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made
Benchmarking = the typical performances that are identified after standard-setting
Standard-setting = establishing cut scores on tests
CEFR Other standard-based systems
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals
Thank You For Your Attention
- Glen Fulcher (2010) Practical Language Testing Chapter 5 A
- Content of this chapter
- Itrsquos as old as the hills
- Definition of lsquostandardrsquo
- The uses of standards
- Unintended consequences
- Using standards for harmonization amp identity
- Carolingian empire of Charlemagne
- CEFR (Common European Framework of Reference )
- Problems with CEFR
- How many standards can we afford
- Performance level descriptors (PLDs) amp Test scores
- Standard based tests CRT amp scoring rubrics
- Some initial decisions
- Slide 15
- Standard-setting methodologies
- Common process of standard setting
- Test-centered methods
- Angoff method
- Advantages amp disadvantages
- Ebel method
- Ebel method (2)
- Slide 23
- Problems with EBEL
- Nedelsky method (Multiple-choice)
- Problems with Nedelsky method
- Bookmark method
- Slide 28
- Procedures in Bookmark method
- Examinee-centered methods
- Borderline group method
- Method of contrasting groups
- Slide 33
- Which method is the lsquobestrsquo
- Evaluating standard setting (Kane 1994)
- Training a critical part of standard setting
- The special case of the CEFR
- PLDs in CEFR amp in other standard-based systems
- You can always count on uncertainty
- Slide 40
-
Problems with Nedelsky method
It assumes that test takers answer multiple choice items by eliminating the options that they think are distractors and then guessing randomly between the remaining options However it is highly unlikely that test takers answer items in this way
Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives
Bookmark method
Directions to Bookmark participants
Ordered item booklet
Booklet guideline
Student exemplar papers
Scoring guide
Essential materials
Standard Setting
Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments
Overview of established cut-scores by every expert repeating of the same procedure as
in the first step
Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is
introduced to them
Basic steps of the procedure
Round III
Round II
Round I
Procedures in Bookmark method
Judges are presented with the necessary materials Then they are asked to keep in mind a borderline
student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly
The bookmarks are discussed in group and finally
the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard; the test is then administered to the test takers to discover where the cut score should lie.
Borderline group method: The judges define what borderline candidates are like and then identify candidates who fit the definition.
Once the students have been placed into groups, the test can be administered. The median score of the group defined as borderline is used as the cut score.
The main problem: the cut score is dependent upon the particular group used in the study.
Method of contrasting groups
The procedure involves testing two groups of examinees:
- The classification into groups must be done using independent criteria, such as teacher judgments.
- The test is then given and the score distributions are calculated. There are likely to be overlaps in the distributions.
- The cut score is placed where the two distributions overlap.
[Figure: overlapping score distributions for the 'competent' and 'non-competent' groups]
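One way to operationalize "where the distributions overlap" is to scan candidate cut scores and pick the one that minimizes misclassifications. The sketch below uses hypothetical score lists; the error-minimizing rule is an assumption for illustration, not the only defensible choice. It also shows the point made later in the chapter: this method lets you count false positives and false negatives directly.

```python
# Hypothetical scores for two groups classified by independent criteria
# (e.g. teacher judgments) before the test was given.
competent = [52, 56, 59, 62, 65]
non_competent = [40, 45, 48, 51, 53, 54, 57]

def best_cut(competent, non_competent):
    """Scan candidate cut scores across the score range and return the
    first one minimizing total misclassifications: a false positive is
    a non-competent examinee at or above the cut, a false negative is
    a competent examinee below it."""
    def errors(cut):
        fp = sum(s >= cut for s in non_competent)
        fn = sum(s < cut for s in competent)
        return fp + fn, fp, fn

    candidates = range(min(non_competent), max(competent) + 2)
    cut = min(candidates, key=lambda c: errors(c)[0])
    _, fp, fn = errors(cut)
    return cut, fp, fn

cut, fp, fn = best_cut(competent, non_competent)
print(cut, fp, fn)  # 55 1 1
```

With these data the cut lands inside the overlap region, leaving one likely false positive and one likely false negative, which is exactly the decision-error information the other methods cannot provide.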
Which method is the 'best'?
It depends on what kind of judgments you can get for your standard-setting study and on the quality of the judges you have available.
However, the contrasting-groups approach is recommended where possible, because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores.
The practical problem is obtaining judgments from a number of people on a large group of individuals.
Evaluating standard setting (Kane 1994)
Procedural evidence:
- What procedures were used for the standard setting to ensure that the process was systematic?
- Were the judges properly trained in the methodology and allowed to express their views freely?
Internal evidence:
- Deals with the consistency of the results arising from the procedure.
- Also estimates the extent of agreement between judges (e.g. Cohen's kappa).
External evidence:
- Correlation of the scores of learners in a borderline group study with some other test of the same construct.
- A high correlation suggests the established cut scores are defensible.
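For the internal-evidence check, Cohen's kappa corrects raw agreement for the agreement two judges would reach by chance. A minimal sketch, with hypothetical pass/fail classifications:

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e), where p_o is the
    observed proportion of agreement and p_e the agreement expected
    by chance from each judge's marginal category proportions."""
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    p_e = sum((ratings_a.count(c) / n) * (ratings_b.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical classifications of ten scripts by two judges.
judge_a = ["pass", "pass", "fail", "pass", "fail",
           "pass", "fail", "fail", "pass", "pass"]
judge_b = ["pass", "pass", "fail", "fail", "fail",
           "pass", "fail", "pass", "pass", "pass"]
print(round(cohens_kappa(judge_a, judge_b), 2))  # 0.58
```

Here the judges agree on 8 of 10 scripts (raw agreement 0.80), but kappa is only 0.58 once chance agreement is discounted, which is why kappa, not raw agreement, is the usual internal-evidence statistic.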
Training: a critical part of standard setting
Training activities include familiarization with the PLDs and the test, looking at the scoring keys, making practice judgments, and getting feedback.
Different views may lead to disagreements among the judges. Training should not be designed to eliminate these variations but to allow free discussion among judges; if the judges do not converge, the researchers should accept that outcome.
The training process should not force agreement ('cloning'), because removing the judges' individuality and inducing agreement is a threat to validity.
The special case of the CEFR
- The CEFR Manual contains performance level descriptors for standard setting, intended to introduce a common language and a single reporting system into Europe.
- It recommends five processes to 'relate' language examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR): familiarization, specification, standardization training/benchmarking, standard setting, and validation.
- Familiarization, standard setting, and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe; the other two processes, however, are problematic.
PLDs in the CEFR vs. other standards-based systems

In the CEFR:
- The use of PLDs is institutionalized, and their meaning is generalized across nations.
- Standardization facilitates 'the implementation of a common understanding of the CEFR', and training is cloning rather than familiarization.
- Benchmarking = the process of rating individual performance samples using the CEFR PLDs.
- Standard setting = 'mapping' the existing cut scores from tests onto CEFR levels.

In other standards-based systems:
- PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed.
- Standardization and training ensure that everyone understands the standard-setting method, yet judgments are freely made.
- Benchmarking = the typical performances identified after standard setting.
- Standard setting = establishing cut scores on tests.
You can always count on uncertainty
Standards-based testing can be positive if people are able to reach a consensus rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices; they are open to change, and even rejection, in the service of language education.
Standards-based testing fails when it is used as a policy tool to gain control of educational systems, with the intention of imposing a single acceptable teaching and assessment discourse upon professionals.
Thank You For Your Attention
- Glen Fulcher (2010) Practical Language Testing Chapter 5 A
- Content of this chapter
- Itrsquos as old as the hills
- Definition of lsquostandardrsquo
- The uses of standards
- Unintended consequences
- Using standards for harmonization amp identity
- Carolingian empire of Charlemagne
- CEFR (Common European Framework of Reference )
- Problems with CEFR
- How many standards can we afford
- Performance level descriptors (PLDs) amp Test scores
- Standard based tests CRT amp scoring rubrics
- Some initial decisions
- Slide 15
- Standard-setting methodologies
- Common process of standard setting
- Test-centered methods
- Angoff method
- Advantages amp disadvantages
- Ebel method
- Ebel method (2)
- Slide 23
- Problems with EBEL
- Nedelsky method (Multiple-choice)
- Problems with Nedelsky method
- Bookmark method
- Slide 28
- Procedures in Bookmark method
- Examinee-centered methods
- Borderline group method
- Method of contrasting groups
- Slide 33
- Which method is the lsquobestrsquo
- Evaluating standard setting (Kane 1994)
- Training a critical part of standard setting
- The special case of the CEFR
- PLDs in CEFR amp in other standard-based systems
- You can always count on uncertainty
- Slide 40
-
Bookmark method
Directions to Bookmark participants
Ordered item booklet
Booklet guideline
Student exemplar papers
Scoring guide
Essential materials
Standard Setting
Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments
Overview of established cut-scores by every expert repeating of the same procedure as
in the first step
Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is
introduced to them
Basic steps of the procedure
Round III
Round II
Round I
Procedures in Bookmark method
Judges are presented with the necessary materials Then they are asked to keep in mind a borderline
student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly
The bookmarks are discussed in group and finally
the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie
Borderline group method The judges define what borderline candidates are
like and then identify borderline candidates who fit the definition
Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score
The main problem the cut score is dependent upon the group being used in the study
Method of contrasting groupsProcedure includes testing of two groups of examinees
bullThe classification must be done using independent criteria such as teacher judgments
bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions
bull The cut score will be where overlap is observed in the distributions
Competent Non-competent
Which method is the lsquobestrsquo
It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available
However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores
The problem is getting the judgments of a number of people on a large group of individuals
Evaluating standard setting (Kane 1994)
Procedural evidence
bull What procedures were used for the standard-setting to ensure that the process is systematic
bull Were the judges properly trained in the methodology and allowed to express their views freely
Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )
External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible
Training a critical part of standard setting
Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback
Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers
The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity
The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard
setting in order to introduce a common language and a single reporting system into Europe
bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation
bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic
PLDs in CEFR amp in other standard-based systems
The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations
Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization
Benchmarking = the process of rating individual performance samples using the CEFR PLDs
Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels
PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed
Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made
Benchmarking = the typical performances that are identified after standard-setting
Standard-setting = establishing cut scores on tests
CEFR Other standard-based systems
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals
Thank You For Your Attention
- Glen Fulcher (2010) Practical Language Testing Chapter 5 A
- Content of this chapter
- Itrsquos as old as the hills
- Definition of lsquostandardrsquo
- The uses of standards
- Unintended consequences
- Using standards for harmonization amp identity
- Carolingian empire of Charlemagne
- CEFR (Common European Framework of Reference )
- Problems with CEFR
- How many standards can we afford
- Performance level descriptors (PLDs) amp Test scores
- Standard based tests CRT amp scoring rubrics
- Some initial decisions
- Slide 15
- Standard-setting methodologies
- Common process of standard setting
- Test-centered methods
- Angoff method
- Advantages amp disadvantages
- Ebel method
- Ebel method (2)
- Slide 23
- Problems with EBEL
- Nedelsky method (Multiple-choice)
- Problems with Nedelsky method
- Bookmark method
- Slide 28
- Procedures in Bookmark method
- Examinee-centered methods
- Borderline group method
- Method of contrasting groups
- Slide 33
- Which method is the lsquobestrsquo
- Evaluating standard setting (Kane 1994)
- Training a critical part of standard setting
- The special case of the CEFR
- PLDs in CEFR amp in other standard-based systems
- You can always count on uncertainty
- Slide 40
-
Standard Setting
Presentation of the percentage ofstudents falling into each performance level and each median cut-score from Round 2 After discussion individual judgments
Overview of established cut-scores by every expert repeating of the same procedure as
in the first step
Experts are informed about the essential number of cut-scores to establish Experts work insmall groups all the essential material is
introduced to them
Basic steps of the procedure
Round III
Round II
Round I
Procedures in Bookmark method
Judges are presented with the necessary materials Then they are asked to keep in mind a borderline
student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly
The bookmarks are discussed in group and finally
the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie
Borderline group method The judges define what borderline candidates are
like and then identify borderline candidates who fit the definition
Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score
The main problem the cut score is dependent upon the group being used in the study
Method of contrasting groupsProcedure includes testing of two groups of examinees
bullThe classification must be done using independent criteria such as teacher judgments
bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions
bull The cut score will be where overlap is observed in the distributions
Competent Non-competent
Which method is the lsquobestrsquo
It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available
However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores
The problem is getting the judgments of a number of people on a large group of individuals
Evaluating standard setting (Kane 1994)
Procedural evidence
bull What procedures were used for the standard-setting to ensure that the process is systematic
bull Were the judges properly trained in the methodology and allowed to express their views freely
Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )
External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible
Training a critical part of standard setting
Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback
Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers
The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity
The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard
setting in order to introduce a common language and a single reporting system into Europe
bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation
bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic
PLDs in CEFR amp in other standard-based systems
The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations
Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization
Benchmarking = the process of rating individual performance samples using the CEFR PLDs
Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels
PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed
Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made
Benchmarking = the typical performances that are identified after standard-setting
Standard-setting = establishing cut scores on tests
CEFR Other standard-based systems
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals
Thank You For Your Attention
- Glen Fulcher (2010) Practical Language Testing Chapter 5 A
- Content of this chapter
- Itrsquos as old as the hills
- Definition of lsquostandardrsquo
- The uses of standards
- Unintended consequences
- Using standards for harmonization amp identity
- Carolingian empire of Charlemagne
- CEFR (Common European Framework of Reference )
- Problems with CEFR
- How many standards can we afford
- Performance level descriptors (PLDs) amp Test scores
- Standard based tests CRT amp scoring rubrics
- Some initial decisions
- Slide 15
- Standard-setting methodologies
- Common process of standard setting
- Test-centered methods
- Angoff method
- Advantages amp disadvantages
- Ebel method
- Ebel method (2)
- Slide 23
- Problems with EBEL
- Nedelsky method (Multiple-choice)
- Problems with Nedelsky method
- Bookmark method
- Slide 28
- Procedures in Bookmark method
- Examinee-centered methods
- Borderline group method
- Method of contrasting groups
- Slide 33
- Which method is the lsquobestrsquo
- Evaluating standard setting (Kane 1994)
- Training a critical part of standard setting
- The special case of the CEFR
- PLDs in CEFR amp in other standard-based systems
- You can always count on uncertainty
- Slide 40
-
Procedures in Bookmark method
Judges are presented with the necessary materials Then they are asked to keep in mind a borderline
student and place a lsquobookmarkrsquo in the book between two items such that the candidate is more likely to be able to answer the items below correctly and the items above incorrectly
The bookmarks are discussed in group and finally
the median of the bookmarks for each cut point is taken as that grouprsquos recommendation for that cut-point
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie
Borderline group method The judges define what borderline candidates are
like and then identify borderline candidates who fit the definition
Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score
The main problem the cut score is dependent upon the group being used in the study
Method of contrasting groupsProcedure includes testing of two groups of examinees
bullThe classification must be done using independent criteria such as teacher judgments
bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions
bull The cut score will be where overlap is observed in the distributions
Competent Non-competent
Which method is the lsquobestrsquo
It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available
However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores
The problem is getting the judgments of a number of people on a large group of individuals
Evaluating standard setting (Kane 1994)
Procedural evidence
bull What procedures were used for the standard-setting to ensure that the process is systematic
bull Were the judges properly trained in the methodology and allowed to express their views freely
Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )
External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible
Training a critical part of standard setting
Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback
Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers
The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity
The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard
setting in order to introduce a common language and a single reporting system into Europe
bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation
bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic
PLDs in CEFR amp in other standard-based systems
The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations
Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization
Benchmarking = the process of rating individual performance samples using the CEFR PLDs
Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels
PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed
Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made
Benchmarking = the typical performances that are identified after standard-setting
Standard-setting = establishing cut scores on tests
CEFR Other standard-based systems
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals
Thank You For Your Attention
- Glen Fulcher (2010) Practical Language Testing Chapter 5 A
- Content of this chapter
- Itrsquos as old as the hills
- Definition of lsquostandardrsquo
- The uses of standards
- Unintended consequences
- Using standards for harmonization amp identity
- Carolingian empire of Charlemagne
- CEFR (Common European Framework of Reference )
- Problems with CEFR
- How many standards can we afford
- Performance level descriptors (PLDs) amp Test scores
- Standard based tests CRT amp scoring rubrics
- Some initial decisions
- Slide 15
- Standard-setting methodologies
- Common process of standard setting
- Test-centered methods
- Angoff method
- Advantages amp disadvantages
- Ebel method
- Ebel method (2)
- Slide 23
- Problems with EBEL
- Nedelsky method (Multiple-choice)
- Problems with Nedelsky method
- Bookmark method
- Slide 28
- Procedures in Bookmark method
- Examinee-centered methods
- Borderline group method
- Method of contrasting groups
- Slide 33
- Which method is the lsquobestrsquo
- Evaluating standard setting (Kane 1994)
- Training a critical part of standard setting
- The special case of the CEFR
- PLDs in CEFR amp in other standard-based systems
- You can always count on uncertainty
- Slide 40
-
Examinee-centered methods
The judges make decisions about whether individual test takers are likely to be just below a particular standard the test is then administered to the test takers to discover where the cut score should lie
Borderline group method The judges define what borderline candidates are
like and then identify borderline candidates who fit the definition
Once the students have been placed into groups the test can be administered The median score for a group defined as borderline is used as the cut score
The main problem the cut score is dependent upon the group being used in the study
Method of contrasting groupsProcedure includes testing of two groups of examinees
bullThe classification must be done using independent criteria such as teacher judgments
bullThe test is then given and the score distributions are calculated There are likely to be overlaps in the distributions
bull The cut score will be where overlap is observed in the distributions
Competent Non-competent
Which method is the lsquobestrsquo
It depends on what kind of judgments you can get for your standard-setting study and the quality of the judges that you have available
However using the contrasting group approach is recommended if itrsquos possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores
The problem is getting the judgments of a number of people on a large group of individuals
Evaluating standard setting (Kane 1994)
Procedural evidence
bull What procedures were used for the standard-setting to ensure that the process is systematic
bull Were the judges properly trained in the methodology and allowed to express their views freely
Internal evidence bullDeals with the consistency of results arising from the procedurebullIt also estimates the extent of agreement between judges (Cohenrsquos kappa )
External evidence bullCorrelation of scores of learners in a borderline group study with some other test of the same constructbullHigh correlation = the established cut scores are defensible
Training a critical part of standard setting
Training activities include familiarization with the PLDs and the test looking at the scoring keys making practice judgments and getting feedback
Different views may lead to disagreements among the judges Training should not be designed to eliminate these variations but to allow free discussion among judges If the judges do not converge the outcome should be accepted by the researchers
The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity
The special case of the CEFR bull The CEFR Manual contains performance level descriptors for standard
setting in order to introduce a common language and a single reporting system into Europe
bull It recommends five processes to lsquorelatersquo Language Examinations to the Common European Framework of Reference (CEFR)for Languages Learning Teaching Assessment These processes are Familiarization specification standardization trainingbenchmarking standard-setting and validation
bull Familiarization standard-setting and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe however the other two sections are problematic
PLDs in CEFR amp in other standard-based systems
The use of PLDs in the CEFR is institutionalized amp their meaning is generalized across nations
Standardization facilitates lsquothe implementation of a common understanding of CEFR and training is cloning rather than familiarization
Benchmarking = the process of rating individual performance samples using the CEFR PLDs
Standard-setting = lsquomappingrsquo the existing cut scores from tests onto CEFR levels
PLDs are evaluated in terms of their usefulness and meaningfulness they can be discarded or changed
Standardization amp training ensure that everyone understands the standard-setting method yet judgments are freely made
Benchmarking = the typical performances that are identified after standard-setting
Standard-setting = establishing cut scores on tests
CEFR Other standard-based systems
You can always count on uncertainty
Standards-based testing can be positive if people can reach a consensus rather than being forced to see the world through a single lens Used in this way standards are never fixed monolithic edifices They are open to change and even rejection in the service of language education
Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals
Thank You For Your Attention
- Glen Fulcher (2010) Practical Language Testing, Chapter 5: Aligning Tests to Standards
- Content of this chapter
- It's as old as the hills
- Definition of 'standard'
- The uses of standards
- Unintended consequences
- Using standards for harmonization & identity
- Carolingian empire of Charlemagne
- CEFR (Common European Framework of Reference)
- Problems with CEFR
- How many standards can we afford?
- Performance level descriptors (PLDs) & test scores
- Standards-based tests, CRT & scoring rubrics
- Some initial decisions
- Slide 15
- Standard-setting methodologies
- Common process of standard setting
- Test-centered methods
- Angoff method
- Advantages & disadvantages
- Ebel method
- Ebel method (2)
- Slide 23
- Problems with Ebel
- Nedelsky method (multiple-choice)
- Problems with Nedelsky method
- Bookmark method
- Slide 28
- Procedures in Bookmark method
- Examinee-centered methods
- Borderline group method
- Method of contrasting groups
- Slide 33
- Which method is the 'best'?
- Evaluating standard setting (Kane 1994)
- Training: a critical part of standard setting
- The special case of the CEFR
- PLDs in CEFR & in other standards-based systems
- You can always count on uncertainty
- Slide 40