
Transcript of Role of Statistics in Developing Standardized Examinations in the US by Mohammad Hafidz Omar, Ph.D.

Seminar

Dept of Mathematical Sciences

King Fahd University of Petroleum and Minerals

Presenter: Dr. Mohammad H. Omar, Dept of Mathematical Sciences, KFUPM

Title: The Role of Statistics in Developing Standardized Examinations in the US

Audience: All KFUPM community members are cordially invited

Date: Tuesday, Apr 19, 2005

Time: 12:30 PM

Location: Building 5, Smart Classroom # 201

Abstract

Prior to joining KFUPM, the presenter spent more than 6 years working in educational testing organizations in the United States and attended many conferences on educational assessment issues organized by associations such as the National Council on Measurement in Education and the American Educational Research Association. In this presentation, the presenter plans to share with the audience the role of statistical indices and procedures in informing the process of developing standardized examinations (such as the ACT, SAT, TOEFL, and state-mandated exams). Definitions of standardized and non-standardized examinations will be provided in this talk. Issues related to item- and examination-level test construction and analysis, and how statistics help address some of these issues, will also be discussed.

Tea and Coffee will be provided

Role of Statistics in Developing Standardized Examinations in the US

by

Mohammad Hafidz Omar, Ph.D.

April 19, 2005

Map of Talk

• What is a standardized test?
• Why standardize tests?
• Who builds standardized tests in the United States?
• Steps to building a standardized test
• Test questions & some statistics used to describe them
• Statistics used for describing exam scores
• Research studies in educational testing that use advanced statistical procedures

What is a “standardized Examination”?

• A standardized test: a test in which the conditions of administration and the scoring procedures are designed to be the same in all uses of the test
– Conditions of administration:
• 1) physical test setting
• 2) directions for examinees
• 3) test materials
• 4) administration time
– Scoring procedures:
• 1) derivation of scores
• 2) transformation of raw scores

Why standardize tests?

• Statistical reason:
– Reduction of unwanted variations in
• Administration conditions
• Scoring practices

• Practical reason:
– Appeal to many test users
– Same treatment and conditions for all students taking the tests (fairness)

Who builds standardized tests in the United States?

• Testing Organizations
– Educational Testing Service (ETS)
– American College Testing (ACT)
– National Board of Medical Examiners (NBME)
– Iowa Testing Programs (ITP)
– Center for Educational Testing and Evaluation (CETE)

• State Departments of Education
– New Mexico State Department of Education
• Build tests themselves or
• Contract out the job to testing organizations

• Large School Districts
– Wichita Public School District

a) Administration conditions

• Design-of-experiment concept of controlling for unnecessary factors

• Apply the same treatment conditions for all test takers

• 1) physical test setting (group vs individual testing, etc)

• 2) directions for examinees

• 3) test materials

• 4) administration time

b) Scoring Procedures

• Same scoring process
– Scoring rubric for open-ended items

• Same score units and same measurements for everybody
– Raw test scores (X)
– Scale scores

• Same transformation of raw scores
– Raw scores X → [equating process] → scale scores h(X)
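As an illustration of the equating step (a minimal sketch of one common method, mean-sigma linear equating, not necessarily the method used by any particular testing program), the following Python snippet maps raw scores from a hypothetical new form onto the scale of a reference form:

```python
import numpy as np

def linear_equate(x_new, x_ref):
    """Mean-sigma linear equating: map raw scores on a new form onto the
    scale of a reference form by matching means and standard deviations:
    h(X) = mu_ref + (sd_ref / sd_new) * (X - mu_new)."""
    mu_new, sd_new = x_new.mean(), x_new.std(ddof=1)
    mu_ref, sd_ref = x_ref.mean(), x_ref.std(ddof=1)
    return mu_ref + (sd_ref / sd_new) * (x_new - mu_new)

# Hypothetical raw scores from two forms given to randomly equivalent groups
rng = np.random.default_rng(0)
x_ref = rng.binomial(n=50, p=0.65, size=1000)   # reference form
x_new = rng.binomial(n=50, p=0.60, size=1000)   # slightly harder new form
print(linear_equate(x_new, x_ref)[:5])
```

Other designs (e.g. equipercentile equating, anchor-item designs) replace this simple mean/SD matching with more elaborate functions h(X).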

Overview of Typical Standardized Examination Building Process

• Costly process

• Important quality control procedures at each phase

• Process takes time (months to years)

1) Creating Test specifications
2) Fresh Item Development
3) Field-Test Development
4) Operational (Live) Test Development

1) Creating Test specifications

• Purpose:
– To operationalize the intended purpose of testing

• A team of content experts and stakeholders
– discusses the specifications against the intended purpose

• Serves as a guideline for building examinations
– How many items should be written in each content/skill category?
– Which content/skill areas are more important than others?

• A 2-way table of specifications typically contains
– content areas (domains) versus
– learning objectives
– with a % of importance associated with each cell
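For concreteness, a hypothetical 2-way table of specifications for a mathematics exam could look like the following (content areas as rows, learning objectives as columns; all percentages are illustrative, not from the talk):

Content area       | Recall | Application | Problem solving | Total
-------------------|--------|-------------|-----------------|------
Algebra            |  10%   |     15%     |       10%       |  35%
Geometry           |   5%   |     15%     |       10%       |  30%
Data & Probability |   5%   |     15%     |       15%       |  35%
-------------------|--------|-------------|-----------------|------
Total              |  20%   |     45%     |       35%       | 100%

The cell percentages then dictate how many items to write for each content-by-objective combination.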

2) Fresh Item Development

• Purpose:
– building quality items to meet test specifications

• Writing items to meet test specifications
– Q: Minimum # of items to write?
– Which cells will need more items?
– Item review (content & bias review)

• Design-of-experiment stage
– Design of test (easy items first, then a mixture, to increase motivation)
– Design of testing event (what time of year, sample, etc)

• Data collection stage:
– Pilot-testing of items
– Scoring of items & pilot-test exams

• Analysis stage:
– analyzing test items

• Data interpretation & decision-making stage:
– Item review with the aid of item statistics
• Content review
• Bias review
– Quality control step: (1) keep good-quality items, (2) revise items with minor problems & re-pilot, or (3) scrap bad items

3) Field-Test Development

• Purpose:
– building quality exam scales to measure the construct (structure) of the test as intended by the test specifications

• Design-of-experiment stage
– Designing field-test booklets to meet specifications
• Use only good items from the previous stage (items with known descriptive statistics)
– Design of testing event

• Data collection:
– Field-testing of test booklets
– Scoring of items and field-test exams

• Analyses:
– analyzing examination booklets (for scale reliability and validity)

• Interpreting results: item & test review
– Do tests meet the minimum statistical requirements (e.g. r_xx' > 0.90)?

– If not, what can be done differently?

4) Operational (Live) Test Development

• Purpose:
– To measure student abilities as intended by the purpose of the test

• Design-of-experiment stage
– Design of operational test
• Use only good field-test items and field-test item sets
• Assembling operational exam booklets
– Design of pilot tests (e.g. some state-mandated programs)
• New items & some of the revised items
– Design of field test (e.g. GRE experimental section)
• Good items that have been piloted before
• How many sections? How many students per section?
– Design of additional research studies
• e.g. different forms of the test (paper-&-pencil vs computer version)
– Design of testing events

• Data collection:
– First operational testing of students with the final version of the examinations
– Scoring of items and exams

• Analyses of operational examinations

• Research studies to establish reporting scales

Different types of Exam item format

• Machine-scorable formats
– Multiple-choice questions
– True-false
– Multiple true-false
– Multiple-mark questions (Pomplun & Omar, 1997)
– aka multiple-answer multiple-choice questions
– Likert-type items (agree/disagree continuum)

• Manual (human) scoring formats
– Short answers
– Open-ended test items
• Requires a scoring rubric to score papers

Statistical considerations in Examination construction

• Overall design of tests
– to achieve reliable (consistent) and valid results

• Designing testing events
– to collect reliable and valid data (correct pilot sample, correct time of the year, etc)
– e.g. SAT: Spring/Summer student population difference

• Appropriate & correct statistical analyses of examination data

• Quality control of test items and exams

Analyses & Interpretation: Descriptive statistics for distractors (Distractor Analysis)

• Applies to multiple-choice, true-false, and multiple true-false formats only

• Statistics:
– Proportion endorsing each distractor
– Informs the exam authors which distractor(s)
• are not functioning, or
• are counter-intuitively more attractive than the intended right answer (high-ability examinees choosing a wrong answer)
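A minimal Python sketch of such a distractor analysis, using made-up responses and total scores (the option labels, data, and the `distractor_analysis` helper are all hypothetical):

```python
import numpy as np

def distractor_analysis(responses, total_scores, key, options=("A", "B", "C", "D")):
    """For one multiple-choice item, report the proportion endorsing each
    option and the mean total score of the examinees who chose it.
    A distractor endorsed by almost no one is not functioning; a distractor
    whose endorsers out-score the keyed answer's endorsers needs review."""
    responses = np.asarray(responses)
    total_scores = np.asarray(total_scores, dtype=float)
    for opt in options:
        mask = responses == opt
        prop = mask.mean()
        mean_total = total_scores[mask].mean() if mask.any() else float("nan")
        flag = " <- key" if opt == key else ""
        print(f"{opt}: proportion={prop:.2f}, mean total={mean_total:.1f}{flag}")

# Hypothetical data: 8 examinees' chosen options and their total test scores
distractor_analysis(
    responses=["A", "B", "B", "C", "B", "D", "B", "A"],
    total_scores=[12, 25, 30, 18, 28, 10, 27, 15],
    key="B",
)
```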

Analyses and Interpretation: Item-Level Statistics

• Difficulty of items
– Statistics:
• Proportion correct {p-value} – mc, t/f, m-t/f, mm, short answer
• Item mean – mm, open-ended items
– Describes how difficult an item is

• Discrimination
– Statistics:
• Discrimination index: high vs low examinee difference in p-value
– An index describing sensitivity to instruction
• Item-total correlations: correlation of item (dichotomously or polychotomously scored) with the total score
– pt-biserials: correlation between the total score & the dichotomous (right/wrong) item being examined
– Biserials: same as pt-biserials except that the dichotomous item is now assumed to come from a normal distribution of student ability in responding to the item
– Polyserials: same as biserials except that the item is polychotomously scored
– Describes how an item relates (thus, contributes) to the total score
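A short Python sketch of the two workhorse indices above: the p-value (proportion correct) for difficulty, and an item-total correlation for discrimination, computed here as a corrected (rest-score) point-biserial so that the item does not inflate its own correlation. The 0/1 score matrix is hypothetical:

```python
import numpy as np

def item_statistics(scores):
    """Classical item statistics for a 0/1 score matrix
    (rows = examinees, columns = items)."""
    scores = np.asarray(scores, dtype=float)
    total = scores.sum(axis=1)
    p_values = scores.mean(axis=0)            # difficulty: proportion correct
    pt_biserials = np.empty(scores.shape[1])
    for j in range(scores.shape[1]):
        rest = total - scores[:, j]           # total score without this item
        pt_biserials[j] = np.corrcoef(scores[:, j], rest)[0, 1]
    return p_values, pt_biserials

# Hypothetical data: 6 examinees x 4 items
X = np.array([[1, 1, 0, 1],
              [1, 0, 0, 1],
              [1, 1, 1, 1],
              [0, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 0, 0]])
p, r = item_statistics(X)
print("p-values:", p.round(2))
print("item-rest correlations:", r.round(2))
```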

Examination-Level Statistics

• Overall difficulty of exams/scales
– Statistics: test mean, average item difficulty

• Overall dispersion of exam/scale scores
– Statistics: test variability – standard deviation, variance, range, etc

• Test speededness
– Statistics: 1) percent of students attempting the last few questions
– 2) percentage of examinees finishing the test within the allotted time period
– Non-speeded test: percentage is more than 95%

• Consistency of the scale/exam scores
– Statistics:
• Scale reliability indices
– KR-20: for dichotomously scored items
– Coefficient alpha: for dichotomously and polychotomously scored items
• Standard error of measurement indices

• Validity measures of scale/exam scores
– Intercorrelation matrix
• High correlation with similar measures
• Low correlation with dissimilar measures
– Structural analyses (factor analyses, etc)
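A minimal sketch of the reliability indices named above. With 0/1 item scores, coefficient alpha reduces to KR-20, so one function covers both; the response matrix is hypothetical:

```python
import numpy as np

def coefficient_alpha(scores):
    """Cronbach's coefficient alpha for an examinee-by-item score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score).
    With dichotomous (0/1) items this equals KR-20."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical 0/1 responses: 6 examinees x 5 items
X = np.array([[1, 1, 1, 1, 0],
              [1, 1, 0, 1, 0],
              [1, 1, 1, 1, 1],
              [0, 1, 0, 0, 0],
              [1, 0, 0, 1, 0],
              [0, 0, 0, 0, 0]])
print(f"KR-20 / alpha = {coefficient_alpha(X):.3f}")
```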

Statistical procedures describing Validity of examination scores for their intended use

• Is the reality of the exam for the students the same as the authors' exam specifications?
– Construct validity: analyses of exam structure (intercorrelation matrix, factor analyses, etc)
• Can the exam measure the intended learning factors (constructs)?
• Answer: with factor analysis (a data-reduction method)
– Predictive validity: predictive power of exam scores for explaining important variables
• e.g. Can exam scores explain (or predict) success in college?
• Regression analyses
– Differential Item Functioning: statistical bias in test items
• Are test items fair for all subgroups (female, Hispanic, Black, etc) of examinees taking the test?
• Mantel-Haenszel chi-squared statistic (see the sketch below)
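A self-contained sketch of the Mantel-Haenszel chi-squared statistic from its standard formula: examinees are stratified by matched total score, and within each stratum a 2x2 table crosses group (reference vs focal) against item outcome (right vs wrong). All counts below are made up:

```python
import numpy as np

def mantel_haenszel_chi2(tables):
    """MH chi-squared (with continuity correction) across score strata.
    Each table: rows = (reference, focal) group, cols = (right, wrong)."""
    A = E = V = 0.0
    for t in tables:
        t = np.asarray(t, dtype=float)
        n_ref, n_foc = t[0].sum(), t[1].sum()         # group sizes in stratum
        m_right, m_wrong = t[:, 0].sum(), t[:, 1].sum()
        N = t.sum()
        A += t[0, 0]                                   # reference-group correct
        E += n_ref * m_right / N                       # expected under no DIF
        V += n_ref * n_foc * m_right * m_wrong / (N**2 * (N - 1))
    return (abs(A - E) - 0.5) ** 2 / V                 # ~ chi-squared, 1 df

# Hypothetical 2x2 tables for one item in three total-score strata
tables = [
    [[20, 10], [15, 15]],   # low scorers
    [[30,  5], [22, 13]],   # middle scorers
    [[25,  2], [20,  7]],   # high scorers
]
print(f"MH chi-squared = {mantel_haenszel_chi2(tables):.2f}")  # vs 3.84 at alpha = .05
```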

Some research areas in Educational Testing that involve further statistical analyses

• Reliability Theory
– How consistent is a set of examination scores? The signal-to-(signal+noise) ratio in educational measurement, σ²_T / (σ²_T + σ²_E)

• Generalizability Theory
– Describing & controlling for more than one source of error variance

• Differential Item Functioning
– Pair-wise differences (F vs M, B vs W) in student performance on items
– Type I error rate control issue (many items & comparisons inflate false detection rates)

Some research areas in Educational Testing that involve further statistical analyses (continued)

• Test Equating
– Two or more forms of the exam: are they interchangeable?
– If scores on form X are regressed on scores from form Y, will the scores from either test edition be interchangeable? Different regression functions

• Item Response Theory
– Theory relating students' unobserved ability to their responses to items
– Probability of responding correctly to test items at each level of ability (item characteristic curves); see the sketch after this list
– Can put items (not tests) on the same common scale

• Vertical Scaling
– How does student performance from different school grade groups compare?
– Are their means increasing rapidly, slowly, etc?
– Are their variances constant, increasing, or decreasing?
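To make the item characteristic curve idea concrete, here is a sketch of the three-parameter logistic (3PL) model, one standard IRT form: the probability of a correct response as a function of ability theta, with discrimination a, difficulty b, and pseudo-guessing c (all parameter values are hypothetical):

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """3PL item characteristic curve: P(correct | theta).
    The constant 1.7 scales the logistic to approximate the normal ogive."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

# Hypothetical item: moderate discrimination, average difficulty, 20% guessing
theta = np.linspace(-3, 3, 7)
print(icc_3pl(theta, a=1.2, b=0.0, c=0.2).round(3))
```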

Some research areas in Educational Testing that involve further statistical analyses (continued)

• Item Banking
– Are the same items from different administrations significantly different in their statistical properties?
– Need Item Response Theory to calibrate all items so that there's one common scale
– Advantage: can easily build test forms with similar test difficulty

• Computerized Tests
– Are score results taken on computers interchangeable with those on paper-and-pencil editions? (e.g. http://ftp.ets.org/pub/gre/002.pdf)
– Is the measure of student performance free from, or tainted by, their level of computer anxiety?

• Computer Adaptive Testing
– increases measurement precision (test information function) by allowing students to take only items that are at their own ability level (see the sketch below)
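A companion sketch for the test information idea behind adaptive testing: under the 2PL model, an item's Fisher information peaks where ability is near the item's difficulty, which is why a CAT repeatedly administers the most informative item at the examinee's current ability estimate (the item pool below is hypothetical):

```python
import numpy as np

def item_information_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta:
    I(theta) = (D*a)^2 * P * (1 - P), maximal near theta = b."""
    D = 1.7
    p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * p * (1.0 - p)

# For an examinee currently estimated at theta = 0.5, the item whose
# difficulty b is closest to 0.5 carries the most information
pool = [(1.0, -1.0), (1.2, 0.5), (0.8, 2.0)]   # hypothetical (a, b) pairs
info = [item_information_2pl(0.5, a, b) for a, b in pool]
print(np.round(info, 3))
```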