Threats to the Validity of Measures of Achievement Gains Laura Hamilton and Daniel McCaffrey, RAND...
Threats to the Validity of Measures of Achievement Gains
Laura Hamilton and Daniel McCaffrey, RAND Corporation
Daniel Koretz, Harvard University
November 8, 2005
2
Growth Measures are Becoming More Common in State Accountability Systems
NCLB is primarily not a growth-based approach to accountability, other than through safe harbor
Many states supplement NCLB with growth-based measures
  California’s Academic Performance Index
  Massachusetts Performance and Improvement ratings
U.S. Department of Education has recently expressed willingness to explore growth measures
3
Today’s Presentation Examines Threats to Validity of Growth Measures
Background: How growth is measured
Framework for validating measures of change
Threats to validity
  Dimensionality
  Score inflation
Implications
4
Growth Metrics Come in Several Forms
Cohort to cohort (CTC)
  E.g., the average for this year’s fifth graders compared to last year’s fifth graders
Quasi-longitudinal
  E.g., the average for this year’s fifth graders compared to last year’s fourth graders
True longitudinal or individual growth (IG)
  E.g., the average of the individual gains for this year’s fifth graders
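The three metrics can be contrasted on a toy data set. The sketch below is illustrative only: the student IDs, scores, and function names are invented for the example and do not come from the presentation.

```python
from statistics import mean

# Hypothetical scale scores keyed by student ID; every value is invented.
grade4_last_year = {"a": 200, "b": 210, "c": 190, "d": 220}
grade5_last_year = {"e": 215, "f": 225, "g": 205}
# Student "d" left the school and student "h" arrived between the two years.
grade5_this_year = {"a": 214, "b": 221, "c": 207, "h": 230}


def cohort_to_cohort(this_year_g5, last_year_g5):
    """CTC: this year's fifth-grade mean minus last year's fifth-grade mean."""
    return mean(this_year_g5.values()) - mean(last_year_g5.values())


def quasi_longitudinal(this_year_g5, last_year_g4):
    """Quasi-longitudinal: this year's fifth-grade mean minus last year's
    fourth-grade mean, without matching individual students."""
    return mean(this_year_g5.values()) - mean(last_year_g4.values())


def individual_growth(this_year_g5, last_year_g4):
    """IG: mean of per-student gains, using only students tested in both years."""
    matched = sorted(set(this_year_g5) & set(last_year_g4))
    return mean(this_year_g5[s] - last_year_g4[s] for s in matched)


ctc = cohort_to_cohort(grade5_this_year, grade5_last_year)
ql = quasi_longitudinal(grade5_this_year, grade4_last_year)
ig = individual_growth(grade5_this_year, grade4_last_year)
print(ctc, ql, ig)  # 3 13 14
```

The quasi-longitudinal and IG figures differ here only because the two groups of students are not identical: one student left and another arrived.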
5
Individual Growth Models are Generally Preferred
Address problems stemming from changes in student populations over time
  Can yield biased estimates if students with incomplete data are different from other students
Provide better information to inform decisions about individual students or groups of students
  CTC changes provide little information for stable schools
6
All Growth Models Require Assumptions about Consistency of Constructs Measured
Users of information from growth models assume construct remains constant
  For CTC models, nature of achievement and test content in a single grade should not change
  For IG models, nature of achievement and constructs measured should not change as students progress through school
Assumption of consistency is violated to varying degrees depending on features of models, tests, curriculum
7
Consistency is One Aspect of Validity
Validity applies to inferences, not just to tests
Growth modeling raises concerns about validity of inferences about change
  Need to understand what users infer from change scores
  These inferences might vary by group (e.g., parents, school administrators)
  Match between what is inferred and what is actually measured is critical to validity
8
Framework for Validating Measures of Change
Validation of change scores has focused mainly on comparing trends between scores on two tests or on correlations between alternate measures
These traditional approaches do not address degree of match between tests or nonuniformity of changes within a test
Koretz, McCaffrey, and Hamilton (2001) developed a framework for validating tests under high-stakes conditions, with a focus on measuring change
9
Framework Addresses Nonuniformity of Gains Within a Test
Test scores and inferences are considered in terms of specific performance elements
  Substantive elements represent the domain of interest
  Non-substantive elements are irrelevant to the domain of interest
Performance elements are associated with weights
  Weights are typically not explicit
  Some may be unintentional
Validity requires close match between test weights and inference weights
10
A Simple Linear Model for Test Scores
If we assume performance elements are additive, then a student’s score in year $t$ is
\[ s_t = \sum_{j=1}^{J} \pi_{jt}\,\theta_{jt} \]
where $\theta_{jt}$ denotes the student’s performance on element $j$ in year $t$ and $\pi_{jt}$ is the test weight
The inference about a score assumes it is also a weighted sum of elements but might use different weights $w_{jt}$
\[ s_t = \sum_{j=1}^{J} w_{jt}\,\theta_{jt} \]
Some weights can be zero
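As a concrete illustration of the additive model, the sketch below computes one student's score under a set of test weights and under a user's inference weights. The three elements, all numeric values, and the names `weighted_score`, `test_weights`, and `inference_weights` are hypothetical and chosen only for the example.

```python
def weighted_score(weights, performance):
    """Additive model: sum over elements j of weight_j * performance_j."""
    if len(weights) != len(performance):
        raise ValueError("need exactly one weight per performance element")
    return sum(w * p for w, p in zip(weights, performance))


# Hypothetical performance on three elements in one year (0-100 scale):
# computation, problem solving, familiarity with the item format.
performance = [60, 40, 90]

# Hypothetical test weights: the format element is non-substantive but
# still contributes to the observed score.
test_weights = [0.5, 0.25, 0.25]

# Hypothetical inference weights: the user cares equally about the two
# substantive elements and gives the format element zero weight.
inference_weights = [0.5, 0.5, 0.0]

observed = weighted_score(test_weights, performance)       # 30 + 10 + 22.5
inferred = weighted_score(inference_weights, performance)  # 30 + 20 + 0
print(observed, inferred)  # 62.5 50.0
```

The same performance profile yields different numbers under the two weightings, which is the mismatch the framework is concerned with.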
11
Several Factors Undermine Validity of Inferences About Change
Changing nature of sample in CTC models
  Differences in characteristics of students included at different time points undermine comparability
  We do not address this problem here
Dimensionality: Changes in performance elements and their weights
Score inflation: Special case of dimensionality problem stemming from increases in scores that do not match increases in achievement
12
Dimensionality
Tests typically assess multiple performance elements
  Test specifications or maps to standards provide explicit information about performance elements
  But implicit and unintended elements are also likely to affect performance
We use the term “dimensionality” broadly to cover all types of performance elements
Users’ inferences are also likely to be multidimensional
Empirical unidimensionality is not sufficient to conclude dimensionality is not a problem
13
Dimensionality Affects Inferences about Influences on Achievement
Analyses of NELS:88 math and science assessments examine relationships among achievement, student background, and school and classroom experiences using subscales of achievement measure
For example, gender differences in science depend on what is measured
  Difference is larger on items that require out-of-school knowledge or spatial reasoning
  Focus on total score or on publisher-developed test specifications masks this difference
Similar findings for relationships with other student characteristics and school experiences
14
Dimensionality is Relevant to Value-Added Modeling
Subscales from a single mathematics achievement test produce dramatically different results
  Study used Procedures and Problem Solving subscores from the Stanford Achievement Test
  Variation within teachers across subscores was as large as or larger than variation across teachers
Results suggest that decisions about teacher or school effectiveness depend strongly on outcome measure
Changes in weights given to subscores could affect estimates of teacher or school effectiveness
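The dependence on subscore weights can be sketched with a small example. The teacher labels, effect values, and function names below are invented for illustration and are not results from the study described above; the composite is assumed to be a simple convex combination of the two subscore effects.

```python
# Hypothetical teacher effects on two subscores (invented values).
teacher_effects = {
    "T1": {"procedures": 0.40, "problem_solving": -0.20},
    "T2": {"procedures": -0.10, "problem_solving": 0.30},
    "T3": {"procedures": 0.10, "problem_solving": 0.10},
}


def composite_effect(effects, w_procedures):
    """Effect on a composite score that puts weight w_procedures on the
    Procedures subscore and the remaining weight on Problem Solving."""
    if not 0.0 <= w_procedures <= 1.0:
        raise ValueError("weight must lie between 0 and 1")
    return (w_procedures * effects["procedures"]
            + (1.0 - w_procedures) * effects["problem_solving"])


def ranking(w_procedures):
    """Teachers ordered from largest to smallest composite effect."""
    return sorted(
        teacher_effects,
        key=lambda t: composite_effect(teacher_effects[t], w_procedures),
        reverse=True,
    )


print(ranking(0.0))  # all weight on Problem Solving
print(ranking(1.0))  # all weight on Procedures
```

In this toy case the teacher ranked first under one weighting is ranked last under the other, even though no teacher's subscore effects have changed.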
15
The Effects of Different Weightings of Computation and Problem Solving Scores on Teacher Effects
[Figure: teacher effects (vertical axis “Teacher Effect Scale”, ticks from -0.5 to 1.0) plotted against a horizontal axis with ticks from 0 to 1]
16
Threats Stem from Changing Performance Weights or Mismatch with Inference Weights
Many performance elements are likely to be inadvertent and non-substantive; most measures of change will not be fully aligned with users’ inferences
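\[ s_t = \sum_{j=1}^{J} w_j\,\theta_{jt} + \sum_{j=1}^{J} (\pi_{jt} - w_j)\,\theta_{jt} \]
where $\theta_{jt}$ is performance on element $j$ in year $t$, $\pi_{jt}$ is the test weight, and $w_j$ is the inference weight, here held constant over time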
17
Threats Stem from Changing Performance Weights or Mismatch with Inference Weights
Sensitivity of test items to instruction is likely to vary across grades and across performance elements within the test, resulting in changing weights and/or incorrect inferences about educator effectiveness
When tests measure multiple elements, weights that change over time can contribute to gain scores independent of any gains on the performance elements
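\[
s_t - s_{t-1} = \sum_{j=1}^{J} w_j\,(\theta_{jt} - \theta_{j,t-1})
  + \sum_{j=1}^{J} (\pi_{jt} - w_j)\,(\theta_{jt} - \theta_{j,t-1})
  + \sum_{j=1}^{J} (\pi_{jt} - \pi_{j,t-1})\,\theta_{j,t-1}
\]
with $\theta_{jt}$, $\pi_{jt}$, and $w_j$ denoting performance, test weight, and inference weight for element $j$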
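A small numerical check can make the point about changing weights concrete. The sketch below splits a gain score into three additive parts: gains weighted by the inference weights, gains weighted by the gap between test and inference weights, and the change in test weights applied to last year's performance. The function names and all values are hypothetical; in the example, performance does not change at all, yet the score rises because the test weights shift.

```python
def score(weights, performance):
    """Additive score: sum of weight_j * performance_j over elements j."""
    return sum(w * p for w, p in zip(weights, performance))


def decompose_gain(pi_prev, pi_curr, theta_prev, theta_curr, w):
    """Split the gain score into three additive parts.

    pi_prev, pi_curr: test weights last year and this year
    theta_prev, theta_curr: performance on each element in the two years
    w: inference weights, assumed constant across years
    """
    d_theta = [c - p for c, p in zip(theta_curr, theta_prev)]
    intended = sum(wj * d for wj, d in zip(w, d_theta))
    mismatch = sum((pc - wj) * d for pc, wj, d in zip(pi_curr, w, d_theta))
    reweighting = sum(
        (pc - pp) * tp for pc, pp, tp in zip(pi_curr, pi_prev, theta_prev)
    )
    return intended, mismatch, reweighting


# Hypothetical two-element example: performance is identical in both years.
theta_prev = [40, 80]
theta_curr = [40, 80]
pi_prev = [0.5, 0.5]    # last year's test weights
pi_curr = [0.25, 0.75]  # this year's test leans on the stronger element
w = [0.5, 0.5]          # inference weights

parts = decompose_gain(pi_prev, pi_curr, theta_prev, theta_curr, w)
gain = score(pi_curr, theta_curr) - score(pi_prev, theta_prev)
print(parts, gain)  # (0.0, 0.0, 10.0) 10.0
```

The three parts always sum to the gain score; here the whole 10-point gain comes from the reweighting term.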
18
Implications for CTC and IG Models Vary
Most CTC models use the same test or parallel test forms from one year to the next
  Test weights and inference weights will tend to remain reasonably constant over time
  But performance elements might differ in their sensitivity to instruction
IG models face additional problem of changes in dimensionality and instructional sensitivity across grades
  Problem is likely to be most severe for far-apart grade levels and for subjects in which the curriculum is not cumulative
19
Score Inflation
Score inflation refers to increases in test scores that are not matched by increases in the underlying achievement construct the test was intended to measure
Score inflation represents a special case of dimensionality-related problems
20
Score Inflation is Common in High-Stakes Testing Contexts
Analyses of high-stakes test scores show gains in those scores are not matched by gains on other tests of the same content
Discrepancies in trends on high- and low-stakes tests suggest gains on high-stakes tests do not accurately reflect gains in the underlying achievement the test was intended to measure
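One simple way to express such a discrepancy, offered here as an assumption and not as the method used in the studies cited, is to put the gain on each test in baseline standard-deviation units and take the difference. All scores and names below are invented.

```python
from statistics import mean, pstdev


def standardized_gain(before, after):
    """Mean gain divided by the baseline (population) standard deviation."""
    return (mean(after) - mean(before)) / pstdev(before)


# Hypothetical scores for the same group on a high-stakes test and on a
# low-stakes audit test of the same content, one year apart.
high_stakes_before = [45, 55, 45, 55]
high_stakes_after = [50, 60, 50, 60]
audit_before = [45, 55, 45, 55]
audit_after = [46, 56, 46, 56]

gain_high = standardized_gain(high_stakes_before, high_stakes_after)
gain_audit = standardized_gain(audit_before, audit_after)

# A large positive gap is consistent with score inflation, although it can
# also reflect differences in content or motivation between the two tests.
gap = gain_high - gain_audit
print(gain_high, gain_audit, gap)
```

With these invented numbers the high-stakes gain is about one baseline standard deviation and the audit gain about a fifth of one.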
21
Example of Score Inflation
Mathematics test scores
Source: Koretz, Linn, Dunbar, & Shepard, 1991
22
Variation in Teachers’ Responses to Tests Leads to Variation in Inflation
Teachers respond to high-stakes testing in ways that are intended to maximize score increases
  Placing more emphasis on tested topics than on untested topics, even when the latter are relevant to users’ inferences
  Focusing on “bubble kids” (those just below the cut score)
  Coaching on item styles, prompts, or rubrics (aspects of the test that are incidental to the domain being tested)
Many of these actions inflate scores by producing test-score gains that are larger than the gains in the broader achievement domain
23
Recent Surveys Suggest Teachers’ Practices are Influenced by Tests
Data from surveys of teachers in California, Georgia, and Pennsylvania
Most teachers report increased focus on standards and on content emphasized on tests
More than half of elementary teachers report increasing time spent on test-taking strategies
Approximately 25% of teachers say they focus more on students near the “proficient” cut score
Responses tend to be stronger in math than in science
24
Score Inflation Exacerbates Inconsistencies in Test and Inference Weights
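\[
s_t - s_{t-1} = \sum_{j=1}^{J} w_j\,(\theta_{jt} - \theta_{j,t-1})
  + \sum_{j=1}^{J} (\pi_{jt} - w_j)\,(\theta_{jt} - \theta_{j,t-1})
  + \sum_{j=1}^{J} (\pi_{jt} - \pi_{j,t-1})\,\theta_{j,t-1}
\]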
25
Threats Stemming from Score Inflation
Problems arising from inflation are similar to those arising from dimensionality
  Occurs when students make substantial gains on elements that might or might not have large inference weights, but fail to make gains on other elements that have high inference weights
Threatens the validity of inferences about gains in achievement when achievement is measured using high-stakes tests
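The mechanism can be illustrated with a toy calculation in which gains are concentrated on elements the test rewards but the user's inference weights only partly. The element labels, weights, and scores below are hypothetical, and `weighted_gain` is a name chosen for this sketch.

```python
def weighted_gain(weights, before, after):
    """Gain in a weighted-sum score when the weights stay fixed."""
    return sum(w * (a - b) for w, b, a in zip(weights, before, after))


# Hypothetical elements: tested topics, untested topics, item-format familiarity.
before = [50, 50, 50]
after = [60, 50, 70]  # gains only on tested topics and on item format

test_weights = [0.75, 0.0, 0.25]      # untested topics carry no test weight
inference_weights = [0.5, 0.5, 0.0]   # users care about both sets of topics

test_gain = weighted_gain(test_weights, before, after)           # 7.5 + 0 + 5
inferred_gain = weighted_gain(inference_weights, before, after)  # 5 + 0 + 0
inflation = test_gain - inferred_gain
print(test_gain, inferred_gain, inflation)  # 12.5 5.0 7.5
```

The test-score gain is more than twice the gain in the broader domain as the user weights it, because untested topics did not improve and item-format familiarity did.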
26
Implications for CTC and IG Models
Most research on score inflation has focused on CTC measures
  Evidence suggests score inflation is large in the first few years of test implementation but eventually plateaus
  Even if inflation lessens over time, inferences about change should be limited to tested material; change scores provide no information about untested material
IG models can be affected by variation in inflation across grades; plateau effects might never occur
27
Improving the Validity of Inferences about Change
Users of test-score information need to recognize that measuring change is not necessarily the same as measuring growth
Test developers should make their measures as resistant to inflation as possible
Future research should address dimensionality and score inflation in the context of CTC and true longitudinal (IG) measures
28
Summary
Test scores and inferences depend on multiple performance elements
Valid inferences require consistency between inference and test weights
Inconsistency implies that changes in scores could be unrelated to the performance elements of interest
  Score inflation
CTC susceptible to errors from growth on non-substantive or restricted set of elements
  Effects likely to plateau
IG susceptible to changes in elements or content across grades
  Can have big impact on growth and related measures