Threats to the Validity of Measures of Achievement Gains Laura Hamilton and Daniel McCaffrey, RAND...

Threats to the Validity of Measures of Achievement Gains

Laura Hamilton and Daniel McCaffrey, RAND Corporation

Daniel Koretz, Harvard University

November 8, 2005

2

Growth Measures are Becoming More Common in State Accountability Systems

NCLB is primarily not a growth-based approach to accountability, other than through safe harbor

Many states supplement NCLB with growth-based measures

  California’s Academic Performance Index

  Massachusetts Performance and Improvement ratings

U.S. Department of Education has recently expressed willingness to explore growth measures

3

Today’s Presentation Examines Threats to Validity of Growth Measures

Background: How growth is measured

Framework for validating measures of change

Threats to validity

  Dimensionality

  Score inflation

Implications

4

Growth Metrics Come in Several Forms

Cohort to cohort (CTC)

  E.g., the average for this year’s fifth graders compared to last year’s fifth graders

Quasi-longitudinal

  E.g., the average for this year’s fifth graders compared to last year’s fourth graders

True longitudinal or individual growth (IG)

  E.g., the average of the individual gains for this year’s fifth graders
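The three metrics can be contrasted with a small sketch. The student IDs, scores, and function names below are hypothetical and chosen only for illustration; the point is that the same data yield different "growth" figures depending on which students are compared.

```python
from statistics import mean

# Hypothetical scale scores keyed by student ID (illustrative values only).
grade5_this_year = {"a": 520, "b": 545, "c": 560}
grade5_last_year = {"x": 510, "y": 530, "z": 550}  # last year's fifth graders
grade4_last_year = {"a": 500, "b": 515, "d": 540}  # "c" arrived later, "d" left


def cohort_to_cohort(current, same_grade_last_year):
    """CTC: this year's mean minus last year's mean for the same grade."""
    return mean(current.values()) - mean(same_grade_last_year.values())


def quasi_longitudinal(current, lower_grade_last_year):
    """Grade means one year and one grade apart; students are not matched."""
    return mean(current.values()) - mean(lower_grade_last_year.values())


def individual_growth(current, lower_grade_last_year):
    """IG: mean of within-student gains for students tested in both years."""
    matched = current.keys() & lower_grade_last_year.keys()
    return mean(current[s] - lower_grade_last_year[s] for s in matched)


ctc = cohort_to_cohort(grade5_this_year, grade5_last_year)
quasi = quasi_longitudinal(grade5_this_year, grade4_last_year)
ig = individual_growth(grade5_this_year, grade4_last_year)
print(round(ctc, 2), round(quasi, 2), ig)
```

In this sketch the IG figure uses only students "a" and "b", who were tested in both years, while the quasi-longitudinal figure also reflects the students who entered or left.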

5

Individual Growth Models are Generally Preferred

Address problems stemming from changes in student populations over time

  Can yield biased estimates if students with incomplete data are different from other students

Provide better information to inform decisions about individual students or groups of students

  CTC change provides little information for stable schools

6

All Growth Models Require Assumptions about Consistency of Constructs Measured

Users of information from growth models assume construct remains constant

  For CTC models, nature of achievement and test content in a single grade should not change

  For IG models, nature of achievement and constructs measured should not change as students progress through school

Assumption of consistency is violated to varying degrees depending on features of models, tests, curriculum

7

Consistency is One Aspect of Validity

Validity applies to inferences, not just to tests

Growth modeling raises concerns about validity of inferences about change

  Need to understand what users infer from change scores

  These inferences might vary by group (e.g., parents, school administrators)

  Match between what is inferred and what is actually measured is critical to validity

8

Framework for Validating Measures of Change

Validation of change scores has focused mainly on comparing trends between scores on two tests or on correlations between alternate measures

These traditional approaches do not address degree of match between tests or nonuniformity of changes within a test

Koretz, McCaffrey, and Hamilton (2001) developed a framework for validating tests under high-stakes conditions, with a focus on measuring change

9

Framework Addresses Nonuniformity of Gains Within a Test

Test scores and inferences are considered in terms of specific performance elements

  Substantive elements represent the domain of interest

  Non-substantive elements are irrelevant to the domain of interest

Performance elements are associated with weights

  Weights are typically not explicit

  Some may be unintentional

Validity requires close match between test weights and inference weights

10

A Simple Linear Model for Test Scores

If we assume performance elements are additive, then a student’s score in year t is

$$ s_t = \sum_{j=1}^{J} \omega_{jt}\,\theta_{jt} $$

where $\theta_{jt}$ denotes the student’s performance on element j in year t and $\omega_{jt}$ is the test weight

The inference about a score assumes it is also a weighted sum of elements but might use different weights $w_j$

$$ \tilde{s}_t = \sum_{j=1}^{J} w_j\,\theta_{jt} $$

Some weights can be zero
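A minimal numeric sketch of the additive model follows. The three performance elements, their values, and both sets of weights are hypothetical; the sketch only shows how the same performance can produce a test score that differs from the score users have in mind when the two sets of weights differ.

```python
def weighted_score(weights, performance):
    """Return the weighted sum of performance elements."""
    if len(weights) != len(performance):
        raise ValueError("need one weight per performance element")
    return sum(w * p for w, p in zip(weights, performance))


# Hypothetical elements: computation, problem solving, familiarity with item format.
theta = [0.8, 0.5, 0.9]              # student's performance on each element
test_weights = [0.5, 0.3, 0.2]       # assumed weights the test actually applies
inference_weights = [0.5, 0.5, 0.0]  # assumed weights behind the user's inference

test_score = weighted_score(test_weights, theta)
inferred_score = weighted_score(inference_weights, theta)
print(round(test_score, 2), round(inferred_score, 2))
```

Here the third element is non-substantive: it carries a zero inference weight but still contributes to the test score.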

11

Several Factors Undermine Validity of Inferences About Change

Changing nature of sample in CTC models

  Differences in characteristics of students included at different time points undermine comparability

  We do not address this problem here

Dimensionality: Changes in performance elements and their weights

Score inflation: Special case of dimensionality problem stemming from increases in scores that do not match increases in achievement

12

Dimensionality

Tests typically assess multiple performance elements

  Test specifications or maps to standards provide explicit information about performance elements

But implicit and unintended elements are also likely to affect performance

We use the term “dimensionality” broadly to cover all types of performance elements

Users’ inferences are also likely to be multidimensional

Empirical unidimensionality is not sufficient to conclude dimensionality is not a problem

13

Dimensionality Affects Inferences about Influences on Achievement

Analyses of NELS:88 math and science assessments examine relationships among achievement, student background, and school and classroom experiences using subscales of achievement measure

For example, gender differences in science depend on what is measured

  Difference is larger on items that require out-of-school knowledge or spatial reasoning

  Focus on total score or on publisher-developed test specifications masks this difference

Similar findings for relationships with other student characteristics and school experiences

14

Dimensionality is Relevant to Value-Added Modeling

Subscales from a single mathematics achievement test produce dramatically different results

  Study used Procedures and Problem Solving subscores from the Stanford Achievement Test

  Variation within teachers across subscores was as large as or larger than variation across teachers

Results suggest that decisions about teacher or school effectiveness depend strongly on outcome measure

Changes in weights given to subscores could affect estimates of teacher or school effectiveness

15

The Effects of Different Weightings of Computation and Problem Solving Scores on Teacher Effects

[Figure: teacher effects (vertical axis, teacher effect scale, -0.5 to 1.0) plotted against the weighting of the two subscores (horizontal axis, 0 to 1)]
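The dependence of teacher effects on subscore weighting can be sketched with a weighted composite. The teacher labels, effect values, and function names are hypothetical and are not taken from the study; the sketch assumes the composite is a simple convex combination of the two subscore effects.

```python
def composite_effects(effects, computation_weight):
    """Blend each teacher's two subscore effects with the given weight.

    effects maps a teacher label to (computation_effect, problem_solving_effect).
    """
    if not 0.0 <= computation_weight <= 1.0:
        raise ValueError("weight must lie between 0 and 1")
    return {
        teacher: computation_weight * comp + (1.0 - computation_weight) * ps
        for teacher, (comp, ps) in effects.items()
    }


def rank_teachers(effects, computation_weight):
    """Return teacher labels ordered from highest to lowest composite effect."""
    composite = composite_effects(effects, computation_weight)
    return sorted(composite, key=composite.get, reverse=True)


# Hypothetical estimated effects on the two subscores (illustrative values only).
effects = {"A": (0.4, -0.2), "B": (0.0, 0.3), "C": (0.1, 0.1)}

for weight in (0.0, 0.5, 1.0):
    print(weight, rank_teachers(effects, weight))
```

With these made-up values, teacher A ranks last when only problem solving counts and first when only computation counts, which is the kind of reversal that a change in subscore weights can produce.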

16

Threats Stem from Changing Performance Weights or Mismatch with Inference Weights

Many performance elements are likely to be inadvertent and non-substantive; most measures of change will not be fully aligned with users’ inferences

With test weights $\omega_{jt}$, inference weights $w_j$, and performance $\theta_{jt}$ on element j in year t:

$$ s_t = \sum_{j=1}^{J} w_j\,\theta_{jt} + \sum_{j=1}^{J} (\omega_{jt} - w_j)\,\theta_{jt} $$

17

Threats Stem from Changing Performance Weights or Mismatch with Inference Weights

Sensitivity of test items to instruction is likely to vary across grades and across performance elements within the test, resulting in changing weights and/or incorrect inferences about educator effectiveness

When tests measure multiple elements, weights that change over time can contribute to gain scores independent of any gains on the performance elements

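$$ s_t - s_{t-1} = \sum_{j=1}^{J} w_j\,(\theta_{jt} - \theta_{j,t-1}) + \sum_{j=1}^{J} (\omega_{jt} - w_j)\,(\theta_{jt} - \theta_{j,t-1}) + \sum_{j=1}^{J} (\omega_{jt} - \omega_{j,t-1})\,\theta_{j,t-1} $$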
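The claim that changing weights can produce a gain without any gain in performance can be checked numerically. The sketch below assumes the additive model, with a score equal to the sum of test weight times performance over elements, and splits the year-to-year gain into three parts: gains weighted as the inference intends, gains weighted by the test-inference mismatch, and the part due only to a change in test weights. All names and numbers are hypothetical.

```python
def decompose_gain(theta_prev, theta_curr, omega_prev, omega_curr, w):
    """Split a gain score into (intended, mismatch, reweighting) parts.

    theta_*: performance on each element in the previous and current year.
    omega_*: test weights in the previous and current year.
    w:       inference weights, assumed constant over time in this sketch.
    """
    gains = [curr - prev for prev, curr in zip(theta_prev, theta_curr)]
    intended = sum(wj * g for wj, g in zip(w, gains))
    mismatch = sum((oc - wj) * g for oc, wj, g in zip(omega_curr, w, gains))
    reweighting = sum(
        (oc - op) * tp for oc, op, tp in zip(omega_curr, omega_prev, theta_prev)
    )
    return intended, mismatch, reweighting


# Hypothetical case: performance is unchanged, but the test shifts weight
# toward the element on which the student is stronger.
theta_prev = [0.6, 0.4]
theta_curr = [0.6, 0.4]
omega_prev = [0.5, 0.5]
omega_curr = [0.7, 0.3]
w = [0.5, 0.5]

parts = decompose_gain(theta_prev, theta_curr, omega_prev, omega_curr, w)
score_prev = sum(o * t for o, t in zip(omega_prev, theta_prev))
score_curr = sum(o * t for o, t in zip(omega_curr, theta_curr))
print([round(p, 3) for p in parts], round(score_curr - score_prev, 3))
```

In this example the first two parts are zero and the entire gain comes from the reweighting term, even though performance on every element is flat.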

18

Implications for CTC and IG Models Vary

Most CTC models use the same test or parallel test forms from one year to the next

  Test weights and inference weights will tend to remain reasonably constant over time

  But performance elements might differ in their sensitivity to instruction

IG models face additional problem of changes in dimensionality and instructional sensitivity across grades

  Problem is likely to be most severe for far-apart grade levels and for subjects in which the curriculum is not cumulative

19

Score Inflation

Score inflation refers to increases in test scores that are not matched by increases in the underlying achievement construct the test was intended to measure

Score inflation represents a special case of dimensionality-related problems

20

Score Inflation is Common in High-Stakes Testing Contexts

Analyses of high-stakes test scores show gains in those scores are not matched by gains on other tests of the same content

Discrepancies in trends on high- and low-stakes tests suggest gains on high-stakes tests do not accurately reflect gains in the underlying achievement the test was intended to measure

21

Example of Score Inflation

Mathematics test scores

Source: Koretz, Linn, Dunbar, & Shepard, 1991

22

Variation in Teachers’ Responses to Tests Leads to Variation in Inflation

Teachers respond to high-stakes testing in ways that are intended to maximize score increases

  Placing more emphasis on tested topics than on untested topics, even when the latter are relevant to users’ inferences

  Focusing on “bubble kids” (those just below the cut score)

  Coaching on item styles, prompts, or rubrics (aspects of the test that are incidental to the domain being tested)

Many of these actions inflate scores by producing test-score gains that are larger than the gains in the broader achievement domain

23

Recent Surveys Suggest Teachers’ Practices are Influenced by Tests

Data from surveys of teachers in California, Georgia, and Pennsylvania

Most teachers report increased focus on standards and on content emphasized on tests

More than half of elementary teachers report increasing time spent on test-taking strategies

Approximately 25% of teachers say they focus more on students near the “proficient” cut score

Responses tend to be stronger in math than in science

24

Score Inflation Exacerbates Inconsistencies in Test and Inference Weights

$$ s_t - s_{t-1} = \sum_{j=1}^{J} w_j\,(\theta_{jt} - \theta_{j,t-1}) + \sum_{j=1}^{J} (\omega_{jt} - w_j)\,(\theta_{jt} - \theta_{j,t-1}) + \sum_{j=1}^{J} (\omega_{jt} - \omega_{j,t-1})\,\theta_{j,t-1} $$

25

Threats Stemming from Score Inflation

Problems arising from inflation are similar to those arising from dimensionality

  Occurs when students make substantial gains on elements that might or might not have large inference weights, but fail to make gains on other elements that have high inference weights

Threatens the validity of inferences about gains in achievement when achievement is measured using high-stakes tests
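One way to make this concrete is to treat inflation as the gap between the gain the test reports and the gain the user's inference would credit. The sketch below assumes the additive model with weights held constant over time; the elements, gains, weights, and the function name are hypothetical.

```python
def score_inflation(test_weights, inference_weights, gains):
    """Return (test_gain, inferred_gain, inflation) for per-element gains."""
    test_gain = sum(o * g for o, g in zip(test_weights, gains))
    inferred_gain = sum(w * g for w, g in zip(inference_weights, gains))
    return test_gain, inferred_gain, test_gain - inferred_gain


# Hypothetical elements: computation, problem solving, familiarity with item format.
gains = [0.05, 0.0, 0.30]            # large gain only on the coached element
test_weights = [0.5, 0.3, 0.2]       # assumed weights the test applies
inference_weights = [0.5, 0.5, 0.0]  # assumed weights behind users' inferences

test_gain, inferred_gain, inflation = score_inflation(
    test_weights, inference_weights, gains
)
print(round(test_gain, 3), round(inferred_gain, 3), round(inflation, 3))
```

In this made-up case most of the reported gain comes from an element with zero inference weight, while problem solving, which carries a high inference weight, does not improve.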

26

Implications for CTC and IG Models

Most research on score inflation has focused on CTC measures

  Evidence suggests score inflation is large in the first few years of test implementation but eventually plateaus

Even if inflation lessens over time, inferences about change should be limited to tested material; change scores provide no information about untested material

IG models can be affected by variation in inflation across grades; plateau effects might never occur

27

Improving the Validity of Inferences about Change

Users of test-score information need to recognize that measuring change is not necessarily the same as measuring growth

Test developers should make their measures as resistant to inflation as possible

Future research should address dimensionality and score inflation in the context of CTC and IG (true longitudinal) measures

28

Summary

Test scores and inferences depend on multiple performance elements

Valid inferences require consistency between inference and test weights

Inconsistency implies that changes in scores could be unrelated to the performance elements of interest

  Score inflation

CTC susceptible to errors from growth on non-substantive or restricted set of elements

  Effects likely to plateau

IG susceptible to changes in elements or content across grades

  Can have big impact on growth and related measures
