Consortium Roles in State Assessment and Accountability Systems

2016 National Conference on Student Assessment

June 2016

Presentation Overview

• Gail Pagano, Smarter Balanced Deputy Director of Member Services

– Member/Consortium collaboration and respective roles in accountability

• Liru Zhang, Delaware Department of Education

– State perspective on accountability when using a consortium assessment

• Marty McCall, Smarter Balanced Director of Psychometrics

– Assessment services provided to members by the Consortium

• Joe Willhoft, Smarter Balanced Technical Advisor

– What’s on the horizon for Smarter Balanced?

Member/Consortium Collaboration and Respective Roles in Accountability

Gail Pagano, Deputy Director of Member Services

Current Smarter Balanced Membership

About Smarter Balanced

• Smarter Balanced is a member-led organization that operates under transparent, consensus-based governance.

• Governing members include 15 states, the US Virgin Islands and the Bureau of Indian Education. Each Governing Member has an active MOU with the Regents of the University of California and may vote on all policies and other matters of significant importance.

• Affiliate members are actively engaged in Smarter Balanced discussions and activities and adhere to the adopted policies and principles; however, they do not have an MOU with UCLA and may not cast a vote on policies and other matters.

• During the development phase of the assessment system, members worked collaboratively with Smarter Balanced staff to develop a comprehensive assessment system that includes three components: summative, interim, and formative.

• This work included thousands of educators across the country who contributed to every aspect of assessment development.

The Peer Review Process

• Information gathering

• Smarter Balanced staff attended CCSSO-sponsored meetings

• Collaborative effort

• Created a project plan

• Member steering committee

• Customized the ED-provided Map of Critical Elements

– Identified Consortium vs member responsibilities

• Contracted with an independent consultant to draft the critical element indexes

• Discussed and agreed to a common critical element index format

• Used a rolling release of evidence submission indexes to members to ensure a comprehensive review

Example of Smarter Balanced Customized Matrix

Evidence Development and Submission

• Consortium-provided evidence to assert the technical quality of the assessment

– Created a master documents list

• Evidence collected during the development phase (2011 – 2014)

• Evidence based on the analysis of data from the first operational year (2014-2015)

• 2014-15 Technical Report

– Drafted Consortium-provided evidence indexes with guidance for member-provided evidence

• Member-provided evidence

– Points to Consortium-provided evidence

– Asserts adherence or non-adherence to Smarter Balanced procedures and protocols (e.g., use of a proprietary TDS vs the Smarter Balanced open source solution).

– Additional evidence of technical quality that falls outside the Consortium’s responsibility

• Consortium-provided evidence submitted to USED simultaneously with member-provided submission

Consortium Support for Member Use of Growth Models for Accountability

• Survey to gather information about member understanding and use of the various growth models

• In-person and WebEx presentations about options for measuring achievement growth using Smarter Balanced summative test scores

• Growth Measure Evaluation Tool

• White paper: A Guide to Understanding and Selecting Measures of Growth for Smarter Balanced Members

• Follow-up presentations

– Interpreting Growth Indicators for Sub-groups

– Reporting Growth Results: Principles and Discussion

Linking Distinct Tests

Liru Zhang, Delaware Department of Education

Presented at the National Conference on Student Assessment, Philadelphia, PA, June 2016

Linking Test Scores from Distinct Tests

In the transition to next-generation assessments, many states have faced the challenge of reestablishing their accountability systems. Newly implemented assessments differ from previous statewide assessments in the content standards measured and in test mode. With such significant changes, test scores derived from the two distinct tests cannot be compared directly; however, stakeholders expect some continuity to allow comparisons of, for example, state test reports, achievement levels, and achievement gaps.

In this presentation, the general procedure for linking scores from distinct tests, common issues, and some limitations of statistical linkage are briefly discussed with empirical examples (e.g., linking test scores of a state assessment to the Smarter Balanced assessment).

Why Link Distinct Tests?

Establishing the relationship between test scores on the previous assessment and on the currently implemented one is common practice in statewide assessment programs.

The objective of linking test scores from distinct tests is to maintain the continuity of the assessment program. The linking results provide information to support:

• the transition of the school accountability system

• the evaluation of teacher effectiveness

• communication with teachers, parents, and policy makers about student achievement

• longitudinal comparisons

Background of Linking

Linking test scores from distinct assessments through statistical procedures must satisfy certain requirements to support interpretable and valid comparisons (Mislevy, 1992; Linn, 1993; Ercikan, 1998; Johnson et al., 1998, 2002). The accuracy of such linkage depends strongly on the context of the assessments, the groups used for calculating statistics, and the time of administering the tests (Linn, 1993).

It is important to explore the similarity of the test constructs between the two assessments, and related psychometric issues, both in the linking design and for meaningful interpretation and appropriate use of linking results.

Validity of Linking

Linking test scores from different assessments through statistical procedures must satisfy certain requirements to support interpretable and valid comparisons.

The accuracy of this kind of linkage depends heavily on the context of the assessments, the groups used for calculating statistics, and the time of administering the tests (Linn, 1993; Mislevy, 1992).

More importantly, the two linked tests must measure similar constructs; otherwise, scaling is merely a mathematical operation applied to two sets of data to match test score distributions (Dorans et al., 1997; Dorans, 1999).

General Procedure for Linking

To link test scores between distinct tests, the general procedure includes:

• Linking design (e.g., common group, common items)

• Linking method (e.g., equipercentile, regression)

• Content linkage (e.g., three-tier content linkage)

• Statistical linkage

• Examination of group invariance for subgroups (e.g., gender, race/ethnicity)

• Validation

Three-Tier Content Link

1. Compare the content standards measured

2. Compare the test specifications (e.g., test length, proportions of content domains, testing conditions)

3. Compare item types and scoring rubrics (e.g., MC, CR; holistic rubrics, specific rubrics)

Identify similarities, dissimilarities, and significant differences in test content between the two tests.

It is important to recognize the differences in test purpose, intended use of test scores, and testing conditions between the two tests.

An Outline of Content Link

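| Category | State Summative: Reading | State Summative: Mathematics | Smarter: ELA/Literacy | Smarter: Mathematics |
|---|---|---|---|---|
| Standards | State Prioritized Content Standards | State Prioritized Content Standards | Common Core State Standards | Common Core State Standards |
| Grades | 3-10 | 3-10 | 3-8, 11 | 3-8, 11 |
| Sub-Test | Reading Comprehension | Mathematical reasoning | Reading, Writing, Listening, Research | Four claims |
| Component | CAT | CAT | CAT and Performance Tasks | CAT and Performance Tasks |
| Item Format | MC and Technology-Enhanced Items | MC and Technology-Enhanced Items | Selected-response, Constructed-response, and essay | Selected-response, Constructed-response |
| Model | Rasch | Rasch | Two-Parameter Model | Two-Parameter Model |
| Scale | Vertical scale (approximately 300-1300) | Vertical scale (approximately 300-1300) | Vertical scale (2000-2800) | Vertical scale (2000-2800) |
| Reporting Scores | Two scale scores | Two scale scores | A composite score and four claim scores | A composite score and three claim scores |
| Proficiency | Four achievement levels | Four achievement levels | Four achievement levels | Four achievement levels |
| Test Schedule | Spring with two opportunities | Spring with two opportunities | Spring with one opportunity | Spring with one opportunity |
| Test Year | Spring 2014 | Spring 2014 | Spring 2015 | Spring 2015 |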

Linking Design and Method (1)

The statistical linkage can be performed with the common-group and/or the common-item designs. If two tests are given in different years, there is no common group (or common students). In most cases, there are no common items between the two tests.

The non-equivalent group design is employed to link test scores between the state assessment and Smarter via the unsmoothed equipercentile procedure (Kolen and Brennan, 1995, 2004).

Some considerations with the current example:

1. Link score point to score point, or link only the cut scores

2. Use the Smarter ELA score or the reading score for linking

3. Adjust samples as needed

4. Other approaches, such as moderation or mean linking
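The simpler alternatives named in the last consideration can be sketched in a few lines. This is a minimal illustration on synthetic data, not the analysis from the presentation: the function names (`mean_linking`, `linear_linking`), the score distributions, and the sample sizes are all assumptions made for the example.

```python
import numpy as np

def mean_linking(x, x_scores, y_scores):
    """Shift X scores by the difference between the two group means."""
    return np.asarray(x, dtype=float) + (np.mean(y_scores) - np.mean(x_scores))

def linear_linking(x, x_scores, y_scores):
    """Match both the mean and the standard deviation of the Y scores."""
    mx, my = np.mean(x_scores), np.mean(y_scores)
    sx, sy = np.std(x_scores), np.std(y_scores)
    return my + (sy / sx) * (np.asarray(x, dtype=float) - mx)

# Synthetic scores (assumed for illustration only)
rng = np.random.default_rng(0)
state_scores = rng.normal(800, 80, 5000)     # hypothetical state scale
smarter_scores = rng.normal(2450, 90, 5000)  # hypothetical Smarter scale

linked_mean = mean_linking(state_scores, state_scores, smarter_scores)
linked_linear = linear_linking(state_scores, state_scores, smarter_scores)
print(linked_mean.mean(), linked_linear.mean(), linked_linear.std())
```

Mean linking aligns only the averages, while linear linking also aligns the spread; neither captures a nonlinear relationship between the two scales.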

Equipercentile Procedure

Among a range of perspectives and statistical methods, equipercentile is a commonly used procedure in practice. Kolen and Brennan (1995, 2004) note several advantages of the equipercentile method: linked scores stay within the range of observed scores; the relationship between the linked tests is not assumed to be linear; the cumulative distribution function of the transformed X-scores approximates that of the Y-scores; and the moments of the transformed scores are approximately the same as those of Y.

However, one challenge in the linking process arises when no student earned a particular score in a distribution, so that the corresponding percentile rank no longer maps to a unique score. Such gaps in the score range are relatively wide, especially at the two ends of the distribution. Thus, subjective choices and rounding are required, which may contribute to linking error.

Smoothed vs. unsmoothed approach
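A minimal sketch of unsmoothed equipercentile linking may help make the procedure concrete. It is not the presenter's implementation. The mid-percentile-rank definition and the linear interpolation performed by `np.quantile` are two of the subjective choices mentioned above, and the function name, the synthetic score distributions, and the cut score of 850 are assumptions for illustration.

```python
import numpy as np

def equipercentile_link(x_values, x_scores, y_scores):
    """Map X scores to the Y scale by matching percentile ranks (unsmoothed)."""
    x_sorted = np.sort(np.asarray(x_scores, dtype=float))
    x_values = np.asarray(x_values, dtype=float)
    # Mid-percentile rank: proportion below x plus half the proportion at x
    below = np.searchsorted(x_sorted, x_values, side="left")
    at_or_below = np.searchsorted(x_sorted, x_values, side="right")
    p = (below + at_or_below) / (2.0 * x_sorted.size)
    # Y score at the same rank; np.quantile interpolates linearly between
    # observed Y scores, which is one possible way to handle gaps
    return np.quantile(np.asarray(y_scores, dtype=float), p)

# Synthetic scores (assumed for illustration only)
rng = np.random.default_rng(0)
state_scores = rng.normal(800, 80, 5000)
smarter_scores = rng.normal(2450, 90, 5000)

# Score-point-to-score-point linkage for every observed state score
linked_all = equipercentile_link(state_scores, state_scores, smarter_scores)

# Linking a single cut score (850 is a hypothetical value)
state_cut = 850.0
linked_cut = equipercentile_link([state_cut], state_scores, smarter_scores)[0]
print(round(float(linked_cut), 1))
```

Because the linked value is read from the observed Y distribution, it cannot fall outside the observed Y range, and the mean and spread of the linked scores track those of Y closely.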

Linking Design and Method (2)

Four samples are identified to match students who took the 2014 state assessment at grades 3, 5, 7, and 10 with students who took the 2015 Smarter assessment at grades 4, 6, 8, and 11 for the common-group design.

Linear regression, both simple and multiple, is applied to predict student performance on the Smarter assessment from test scores on the corresponding state assessment.

Dependent variable – Smarter scale score

Independent variables – State test scale score

Dummy variables, such as gender, race, SWD, SES, and ELL

Issues similar to those discussed earlier remain. This linkage is, by nature, a prediction of student performance.
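The regression setup above can be sketched with ordinary least squares. This is an illustration under stated assumptions: the data are synthetic, only two of the dummy variables (SWD and ELL) are included, and the coefficients used to generate the data are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Synthetic matched sample (assumed): prior-year state score plus 0/1 dummies
state_score = rng.normal(800, 80, n)
swd = rng.integers(0, 2, n)  # hypothetical students-with-disabilities flag
ell = rng.integers(0, 2, n)  # hypothetical English-language-learner flag

# Invented relationship, used only to generate example outcomes
smarter_score = (1600 + 1.05 * state_score - 15 * swd - 10 * ell
                 + rng.normal(0, 30, n))

# Multiple regression: intercept, state scale score, and dummy variables
X = np.column_stack([np.ones(n), state_score, swd, ell])
coef, *_ = np.linalg.lstsq(X, smarter_score, rcond=None)

predicted = X @ coef
residuals = smarter_score - predicted
r_squared = 1 - residuals.var() / smarter_score.var()
print(coef.round(3), round(float(r_squared), 3))
```

Dropping the dummy columns from `X` gives the simple-regression variant. In both cases the fitted line predicts a Smarter score from a state score, so it is not symmetric in the way an equating function is.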

Linking Invariance

To inspect the property of invariance and identify possible sampling errors, independent linking functions obtained from subgroups or from a sample are compared with the linking function from the population.

1. The standard error of linking should be estimated (e.g., Petersen, Kolen & Hoover, 1993).

2. Two statistics, Root Mean Square Difference (RMSD) and Root Expected Mean Square Difference (REMSD) (Dorans and Holland, 2000), are used to summarize the differences between the transformation functions obtained from the total group and from subgroups.

3. For the current examples, the results of using Smarter reading scores and ELA/Literacy scores for linking can be compared. The residuals from the regression analysis should be examined.
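The two invariance statistics can be sketched as follows. RMSD is computed at each score point as the weighted root mean square of subgroup-minus-total linking differences, divided by the total-group standard deviation of Y; REMSD averages the squared differences over the total-group X distribution before taking the root. For brevity the sketch uses linear linking functions rather than equipercentile ones, and the data, the two subgroups, and all names are assumptions for illustration.

```python
import numpy as np

def make_linear_link(x_scores, y_scores):
    """Return a linear linking function from the X scale to the Y scale."""
    mx, sx = np.mean(x_scores), np.std(x_scores)
    my, sy = np.mean(y_scores), np.std(y_scores)
    return lambda x: my + (sy / sx) * (np.asarray(x, dtype=float) - mx)

def rmsd(x_points, link_total, links_sub, weights, sd_y):
    """RMSD at each score point, in total-group Y standard deviation units."""
    diffs = np.array([link(x_points) - link_total(x_points) for link in links_sub])
    return np.sqrt(np.sum(weights[:, None] * diffs ** 2, axis=0)) / sd_y

def remsd(x_total, link_total, links_sub, weights, sd_y):
    """REMSD: squared differences averaged over the total-group X scores."""
    diffs = np.array([link(x_total) - link_total(x_total) for link in links_sub])
    return np.sqrt(np.sum(weights * np.mean(diffs ** 2, axis=1))) / sd_y

# Synthetic non-equivalent groups with one binary subgroup flag (assumed)
rng = np.random.default_rng(2)
n = 4000
group_x = rng.integers(0, 2, n)
group_y = rng.integers(0, 2, n)
x = rng.normal(800, 80, n) + 10 * group_x
y = rng.normal(2450, 90, n) + 12 * group_y

link_total = make_linear_link(x, y)
links_sub = [make_linear_link(x[group_x == g], y[group_y == g]) for g in (0, 1)]
weights = np.array([np.mean(group_x == g) for g in (0, 1)])
sd_y = np.std(y)

x_grid = np.linspace(600, 1000, 9)
rmsd_by_score = rmsd(x_grid, link_total, links_sub, weights, sd_y)
remsd_value = remsd(x, link_total, links_sub, weights, sd_y)
print(rmsd_by_score.round(4), round(float(remsd_value), 4))
```

Values near zero indicate that the subgroup linking functions agree with the total-group function; what counts as negligible is a judgment the analyst has to set.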

Validation of Linking

To set a reasonable baseline for the new high-stakes accountability system based on linking results, a validation should be considered.

The established relationships between the two tests could be applied to a similar population from different years to validate the linking results.

If applicable, the linking functions obtained from different methods can be compared for evidence from cross-validation.
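Both validation ideas can be sketched on synthetic data: comparing linked cut scores across two methods, and applying the linked cuts to a later cohort to see whether the percentage at or above each cut stays close to the baseline. The cohorts, the cut scores, and the function names below are hypothetical and serve only to illustrate the checks.

```python
import numpy as np

def linear_link(x, x_scores, y_scores):
    """Linear linking: match the mean and standard deviation of Y."""
    mx, sx = np.mean(x_scores), np.std(x_scores)
    my, sy = np.mean(y_scores), np.std(y_scores)
    return my + (sy / sx) * (np.asarray(x, dtype=float) - mx)

def equipercentile_link(x, x_scores, y_scores):
    """Unsmoothed equipercentile linking with mid-percentile ranks."""
    x_sorted = np.sort(np.asarray(x_scores, dtype=float))
    x = np.asarray(x, dtype=float)
    below = np.searchsorted(x_sorted, x, side="left")
    at_or_below = np.searchsorted(x_sorted, x, side="right")
    p = (below + at_or_below) / (2.0 * x_sorted.size)
    return np.quantile(np.asarray(y_scores, dtype=float), p)

# Synthetic cohorts (assumed): baseline state test, first and second Smarter years
rng = np.random.default_rng(3)
state_2014 = rng.normal(800, 80, 5000)
smarter_2015 = rng.normal(2450, 90, 5000)
smarter_2016 = rng.normal(2455, 90, 5000)

state_cuts = np.array([700.0, 800.0, 880.0])  # hypothetical cut scores

# Cross-method check: do two linking methods give similar linked cuts?
cuts_linear = linear_link(state_cuts, state_2014, smarter_2015)
cuts_equi = equipercentile_link(state_cuts, state_2014, smarter_2015)
method_gap = np.abs(cuts_linear - cuts_equi)

# Cross-year check: percent at or above each cut, baseline vs. later cohort
pct_state = np.array([(state_2014 >= c).mean() for c in state_cuts])
pct_2016 = np.array([(smarter_2016 >= c).mean() for c in cuts_equi])
impact_shift = pct_2016 - pct_state
print(method_gap.round(2), impact_shift.round(3))
```

Large gaps between methods, or large shifts in the percentages that real achievement change cannot explain, would suggest that the linking results need closer review before they are used as an accountability baseline.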

Assessment Services Provided to Members by the Consortium

Marty McCall, Director of Psychometrics

Role of a consortium

• Provide a high quality test

– Error and precision appropriate for a variety of accountability models

– Scale stability so that interpretation is consistent for a variety of growth models

• Document technical quality for members

• Serve as a forum for members as they craft accountability systems

– Provide expertise

– Host venues for discussion

NOT the role of a consortium

• Impose a single accountability system

• Require the same policy rules for all members

• Provide information to member stakeholders without member assent

Advantages to being in a consortium

• Pooled expertise

– Shared expertise of member staffs

– Deep national expert pool

• TAC

• Content experts

• Researchers in special needs populations

• CRESST

• Contracted expertise

Advantages to being in a consortium

• Sheer quantity

– Number and quality of items and tasks

– More supports than states can provide in isolation

– Scale and complexity of standard setting

– Big samples

• Large sample for norms

• Sufficient special population numbers for analysis

• Capacity for field testing

• Can do more detailed analysis of technical quality especially by subgroups

Consortium role: Provide a high quality test

• Choice of psychometric model

• Test Design

– Blueprints

– Algorithms

• Scale construction and maintenance

• Item and task development

• Test construction and packaging

Consortium role: Maintain a high quality test

• Documents technical quality

– Reliability

– Accuracy of achievement level assignment

– Technical quality for subgroups

• Field Testing and analysis

• Scale maintenance

• Item pool maintenance

– Refreshing pools, item release and retirement

– Ongoing pool analysis

Serve as a forum for members as they craft accountability systems

• Member meeting topics

– Discuss ways to deal with a variety of policy issues

– Share information about methods, policy, legal constraints

• TAC meetings, webinars, and white papers

– TAC meeting discussions open to members

– Series on growth models: Joseph Martineau, Kadriye Ercikan, Martha Thurlow

– Joseph Martineau paper: a taxonomy of growth models and tools for selection

What’s on the Horizon for Smarter Balanced?

Joe Willhoft, Ph.D.

Five Themes

• Smarter Business Model

• Membership

• Common vs. Customized

• Exchange of data

• Innovation

Smarter Balanced Business Model

• A unique model for K-12 accountability testing

• SB is an affiliation of members (mostly states)

– Not a 501(c)(3)

– A common MOU between each member and UCLA

• Each member selects own service provider(s)

• Defines “this, not that” to distinguish SB services from member responsibilities

• Managing and evolving the business model without having to renegotiate the MOU

Membership: 2010 vs. 2016

2010 Governing (N=17): CT, HI, ID, KS, ME, MI, MO, MT, NV, NM, NC, OR, UT, VT, WA, WV, WI

2010 Advisory (N=14): AL, CO, DE, GA, IA, KY, NH, NJ, ND, OH, OK, PA, SC, SD

2016 Members (N=16): CA, CT, DE, HI, ID, MI, MT, NV, NH, ND, OR, SD, USVI, VT, WA, WV

2016 Affiliates (N=2): IA, NC

Membership

• 2010: 31 states – 17 Governing; 14 Advisory

• 2016: 18 states/entities – 16 Member; 2 Affiliate

• 30% of US states; about 20% of enrollment 3-8, 11

• Members = Revenue

• Business Model impact on state procurement

Common vs. Customized Tests

• Common items, blueprints, scoring rules

• Members have options:

– Summative + Interim & Formatives or Summative only

– 3-8 and 11 or 3-8 only

– Use SB score reports or Member-designed reports

– Test delivery platforms

• Manage boundaries around options to safeguard validity and meaning of SB scale & scores

Exchange of Data

• Members have own privacy and data exchange rules:

– Can individually identifiable data leave the state?

– Can discoverable data leave the state?

– Can any data leave the state?

• Many analysis issues: item field testing; item exposure and drift; use of tools and supports; etc.

Innovation

• USED Grant supported multiple innovations:

– Expansion of supports for SWD and ELs

– Formative tools for classroom teachers

– Integration of adaptive testing with performance items & tasks

– Permissions for authentic text in Reading

– Development of an alignment framework

• Innovation is always in jeopardy

• Entropy usually wins; without attention and resources, it always wins

Discussion
