Consortium Roles in State
Assessment and
Accountability Systems
2016 National Conference on Student Assessment
June 2016
Presentation Overview
• Gail Pagano, Smarter Balanced Deputy Director of Member
Services
– Member/Consortium collaboration and respective roles in accountability
• Liru Zhang, Delaware Department of Education
– State perspective on accountability when using a consortium
assessment
• Marty McCall, Smarter Balanced Director of Psychometrics
– Assessment services provided to members by the Consortium
• Joe Willhoft, Smarter Balanced Technical Advisor
– What’s on the horizon for Smarter Balanced?
Member/Consortium
Collaboration and Respective
Roles in Accountability
Gail Pagano, Deputy Director of Member Services
Current Smarter Balanced Membership
About Smarter Balanced
• Smarter Balanced is a member-led organization that operates under transparent, consensus-based governance.
• Governing members include 15 states, the US Virgin Islands, and the Bureau of Indian Education. Each Governing Member has an active MOU with the Regents of the University of California and may vote on all policies and other matters of significant importance.
• Affiliate members are actively engaged in Smarter Balanced discussions and activities and adhere to the adopted policies and principles; however, they do not have an MOU with UCLA and may not vote on policies and other matters.
• During the development phase of the assessment system, members worked collaboratively with Smarter Balanced staff to develop a comprehensive assessment system that includes three components: summative, interim, and formative.
• This work included thousands of educators across the country who contributed to every aspect of assessment development.
The Peer Review Process
• Information gathering
• Smarter Balanced staff attended CCSSO-sponsored meetings
• Collaborative effort
• Created a project plan
• Member steering committee
• Customized the ED-provided Map of Critical Elements
– Identified Consortium vs member responsibilities
• Contracted with an independent consultant to draft the critical element indexes
• Discussed and agreed to a common critical element index format
• Used a rolling release of evidence submission indexes to members to ensure a comprehensive review
Example of Smarter Balanced
Customized Matrix
Evidence Development and Submission
• Consortium-provided evidence to assert the technical quality of the assessment
– Created a master documents list
• Evidence collected during the development phase (2011 – 2014)
• Evidence based on the analysis of data from the first operational year (2014-2015)
• 2014-15 Technical Report
– Drafted Consortium-provided evidence indexes with guidance for member-provided evidence
• Member-provided evidence
– Points to Consortium-provided evidence
– Asserts adherence or non-adherence to Smarter Balanced procedures and protocols (e.g., use of a proprietary TDS vs the Smarter Balanced open source solution).
– Additional evidence of technical quality that falls outside the Consortium’s responsibility
• Consortium-provided evidence submitted to USED simultaneously with member-provided submission
Consortium Support for Member Use of
Growth Models for Accountability
• Survey to gather information about member understanding
and use of the various growth models
• In-person and WebEx presentations about options for
measuring achievement growth using Smarter Balanced
Summative Test Scores
• Growth Measure Evaluation Tool
• White paper: A Guide to Understanding and Selecting
Measures of Growth for Smarter Balanced Members
• Follow-up presentations
– Interpreting Growth Indicators for Sub-groups
– Reporting Growth Results: Principles and Discussion
Linking Distinct Tests
By
Liru Zhang
Delaware Department of Education
Presented at the National Conference on Student Assessment
Philadelphia, PA,
June, 2016
Linking Test Scores from Distinct Tests
In the transition to next-generation assessments, many states have
faced the challenge of reestablishing their accountability systems. Newly
implemented assessments differ from previous statewide assessments in
the content standards measured and in test mode. With such significant
changes, test scores derived from the two distinct tests cannot be
compared directly; however, stakeholders expect some continuity so that
comparisons of, for example, state test reports,
achievement levels, and achievement gaps remain possible.
In this presentation, the general procedure for linking scores from
distinct tests, common issues, and some limitations of statistical linkage
are briefly discussed with empirical examples (e.g., linking test scores
of a state assessment to the Smarter Balanced assessment).
Why Link Distinct Tests?
Establishing the relationship between test scores on the
previous assessment and those on the currently implemented one is a
common practice in statewide assessment programs.
The objective of linking test scores from distinct tests is to
maintain the continuity of the assessment program. The linking
results provide information to help:
• transition the school accountability system
• evaluate teachers’ effectiveness
• inform teachers, parents, and policy makers about student
achievement
• make longitudinal comparisons
Background of Linking
Linking test scores from distinct assessments through statistical
procedures must satisfy certain requirements to support interpretable
and valid comparisons (Mislevy, 1992; Linn, 1993; Ercikan, 1998;
Johnson et al., 1998, 2002). The accuracy of this kind of linkage
depends strongly on the context of the assessments, the groups used for
calculating statistics, and the time at which the tests are administered (Linn,
1993).
It is important to explore the similarity of the test constructs between the two
assessments, along with related psychometric issues, both in the linking design and
for meaningful interpretation and appropriate use of the linking results.
Validity of Linking
Linking test scores from different assessments through statistical procedures must satisfy certain requirements to support interpretable and valid comparisons.
The accuracy of this kind of linkage depends heavily on the context of the assessments, the groups used for calculating statistics, and the time at which the tests are administered (Linn, 1993; Mislevy, 1992).
More importantly, the two linked tests must measure similar constructs; otherwise, scaling is merely a mathematical operation applied to two sets of data to match test score distributions (Dorans et al., 1997; Dorans, 1999).
General Procedure for Linking
To link test scores between distinct tests, the general procedure includes:
• Linking design (e.g., common group, common items)
• Linking method (e.g., equipercentile, regression)
• Content linkage (e.g., three-tier content linkage)
• Statistical linkage
• Examination of group invariance for subgroups (e.g., gender, race/ethnicity)
• Validation
Three-Tier Content Link
Three-Tier Content Link:
1. Compare the content standards measured
2. Compare the test specifications (e.g., test length, proportions of
content domains, testing conditions)
3. Compare item types and scoring rubrics (e.g., MC, CR; holistic
rubrics, specific rubrics)
Identify similarities, dissimilarities, and significant differences in test
content between the two tests.
It is important to recognize the differences in test purpose, intended
use of test scores, and testing conditions between the two tests.
An Outline of Content Link
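Category | State Summative: Reading | State Summative: Mathematics | Smarter: ELA/Literacy | Smarter: Mathematics
Standards | State Prioritized Content Standards | State Prioritized Content Standards | Common Core State Standards | Common Core State Standards
Grades | 3-10 | 3-10 | 3-8, 11 | 3-8, 11
Sub-Test | Reading Comprehension | Mathematical reasoning | Reading, Writing, Listening, Research | Four claims
Component | CAT | CAT | CAT and Performance Tasks | CAT and Performance Tasks
Item Format | MC and Technology-Enhanced Items | MC and Technology-Enhanced Items | Selected-response, Constructed-response, and essay | Selected-response, Constructed-response
Model | Rasch | Rasch | Two-Parameter Model | Two-Parameter Model
Scale | Vertical scale (Approximately 300-1300) | Vertical scale (Approximately 300-1300) | Vertical scale (2000-2800) | Vertical scale (2000-2800)
Reporting Scores | Two scale scores | Two scale scores | A composite score and four claim scores | A composite score and three claim scores
Proficiency | Four achievement levels | Four achievement levels | Four achievement levels | Four achievement levels
Test Schedule | Spring with two opportunities | Spring with two opportunities | Spring with one opportunity | Spring with one opportunity
Test Year | 2014 Spring | 2014 Spring | 2015 Spring | 2015 Spring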
Linking Design and Method (1)
The statistical linkage can be performed with the common-group and/or
the common-item designs. If two tests are given in different years, there
is no common group (or common students). In most cases, there are no
common items between the two tests.
The non-equivalent group design is employed to link test scores
between the state assessment and Smarter via the unsmoothed
equipercentile procedure (Kolen and Brennan, 1995, 2004).
Some considerations with the current example:
1. Link score point to score point, or link only the cut scores
2. Use the Smarter ELA score or the reading score for linking
3. Adjust the sample as needed
4. Other approaches, such as moderation or mean linking
Equipercentile Procedure
Among a range of perspectives and statistical methods, the equipercentile
procedure is a commonly used one in practice. Kolen and Brennan (1995,
2004) note several advantages of the equipercentile method: linked scores
fall within the range of observed scores; the relationship between the linked
tests is not assumed to be linear; the cumulative distribution
function of the transformed X-scores approximates that of the Y-scores; and the
moments of the transformed scores are approximately the same as those of Y.
However, one challenge in the linking process arises when no student earned
a particular score in a distribution, so that the score corresponding to a given
percentile rank is no longer unique. The range of such scores can be relatively
wide, especially at the two ends of the distribution. A subjective choice must
then be made, and that choice, along with rounding, might contribute to linking error.
Smoothed vs. unsmoothed approach
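To make the equipercentile idea concrete, here is a minimal sketch of an unsmoothed linkage in Python. It is an illustration under stated assumptions, not the procedure used in the Delaware example: the function names (`percentile_rank`, `equipercentile_link`) are hypothetical, percentile ranks use the midpoint convention, and the inverse is taken by linear interpolation over observed Y score points only. The interpolation step is one of the "subjective choices" mentioned above: where no student earned a Y score, the sketch simply interpolates across the gap.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(scores_sorted, x):
    """Midpoint percentile rank: percent scoring below x plus half the percent at x."""
    n = len(scores_sorted)
    below = bisect_left(scores_sorted, x)
    at = bisect_right(scores_sorted, x) - below
    return 100.0 * (below + 0.5 * at) / n


def equipercentile_link(x_scores, y_scores):
    """Map each observed X score to the Y-scale value with the same percentile rank."""
    xs = sorted(x_scores)
    ys = sorted(y_scores)
    y_points = sorted(set(ys))
    # Percentile ranks of observed Y score points (strictly increasing).
    y_ranks = [percentile_rank(ys, y) for y in y_points]

    table = {}
    for x in sorted(set(xs)):
        p = percentile_rank(xs, x)
        if p <= y_ranks[0]:
            table[x] = float(y_points[0])      # clamp at the lowest observed Y
        elif p >= y_ranks[-1]:
            table[x] = float(y_points[-1])     # clamp at the highest observed Y
        else:
            i = bisect_right(y_ranks, p)       # y_ranks[i-1] <= p < y_ranks[i]
            lo_p, hi_p = y_ranks[i - 1], y_ranks[i]
            lo_y, hi_y = y_points[i - 1], y_points[i]
            # Linear interpolation between neighbouring observed Y score points.
            table[x] = lo_y + (hi_y - lo_y) * (p - lo_p) / (hi_p - lo_p)
    return table


if __name__ == "__main__":
    # Made-up scores for illustration only; note the gap between 2450 and 2600.
    old_test = [310, 320, 320, 335, 350, 350, 365, 400]
    new_test = [2400, 2420, 2450, 2450, 2600, 2610, 2630, 2700]
    for x, y in equipercentile_link(old_test, new_test).items():
        print(x, round(y))
```

Rounding the linked values to reportable scale scores, as in the final loop, is a separate decision that adds its own error, which is the point raised on the slide.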
Linking Design and Method (2)
Four samples are identified by matching students who took the 2014 state
assessment at grades 3, 5, 7, and 10 with students who took the 2015
Smarter assessment at grades 4, 6, 8, and 11, for the common-group
design.
A linear regression approach (simple and multiple regression) is applied
to predict student performance on the Smarter
assessment from test scores on the corresponding state
assessment.
• Dependent variable: Smarter scale score
• Independent variable: state test scale score
• Dummy variables, such as gender, race, SWD, SES, and ELL
Issues similar to those discussed earlier remain. By nature, this linkage
is a prediction of student performance.
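The regression setup above can be sketched in a few lines of Python. This is only an illustration: the helper names (`fit_ols`, `predict`), the single ELL dummy variable, and all of the numbers are assumptions made up for the example, and an operational analysis would normally use a statistics package and include the full set of dummy variables.

```python
def fit_ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows: one list of predictor values per student (no intercept column).
    Returns [intercept, b1, b2, ...].
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(xi[i] * xi[j] for xi in X) for j in range(k)]
        + [sum(xi[i] * yi for xi, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        if abs(A[pivot][c]) < 1e-12:
            raise ValueError("predictors are collinear")
        A[c], A[pivot] = A[pivot], A[c]
        p = A[c][c]
        A[c] = [v / p for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]


def predict(coefs, row):
    """Predicted Smarter scale score for one student's predictor values."""
    return coefs[0] + sum(b * float(v) for b, v in zip(coefs[1:], row))


if __name__ == "__main__":
    # Made-up matched records: [state scale score, ELL dummy (1 = ELL)].
    predictors = [[520, 0], [610, 0], [700, 1], [790, 0], [660, 1], [580, 1]]
    smarter = [2410, 2485, 2540, 2630, 2515, 2440]  # made-up Smarter scores
    coefs = fit_ols(predictors, smarter)
    print("coefficients:", [round(c, 3) for c in coefs])
    print("predicted for state score 650, non-ELL:", round(predict(coefs, [650, 0])))
```

Because the output is a predicted score, the resulting relationship is not symmetric: regressing state scores on Smarter scores would give a different function, which is one reason this linkage is described as prediction.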
Linking Invariance
To inspect the property of invariance and identify possible sampling
errors, independent linking functions obtained from subgroups or from a
sample are compared with the linking functions from the population.
1. The standard error of linking should be estimated (e.g., Petersen,
Kolen, & Hoover, 1993).
2. Two statistics, the Root Mean Square Difference (RMSD) and the Root
Expected Mean Square Difference (REMSD) of Dorans and Holland
(2000), are used to summarize the differences between the
transformation functions obtained from the total group and from
subgroups.
3. For the current example, the linking results based on Smarter reading
scores and on ELA/Literacy scores can be compared. The residuals
from the regression analysis should be examined.
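As a rough illustration of the two invariance statistics, the sketch below computes RMSD at each score point and a single REMSD value. It follows one common reading of the Dorans and Holland definitions: squared differences between each subgroup linking function and the total-group function are weighted by subgroup proportion and standardized by the total-group standard deviation of Y, and REMSD averages over the total-group distribution of X. The function names, argument layout, and numbers are hypothetical.

```python
from math import sqrt


def rmsd_by_score(total_link, subgroup_links, weights, sd_y):
    """RMSD(x) at each score x, expressed in standard deviation units of Y.

    total_link: {x: linked Y} for the total group.
    subgroup_links: {group: {x: linked Y}} for each subgroup.
    weights: {group: proportion of the total group}, summing to 1.
    """
    return {
        x: sqrt(sum(w * (subgroup_links[g][x] - total_link[x]) ** 2
                    for g, w in weights.items())) / sd_y
        for x in total_link
    }


def remsd(total_link, subgroup_links, weights, x_probs, sd_y):
    """REMSD: the same squared differences, averaged over the X distribution.

    x_probs: {x: relative frequency of score x in the total group}.
    """
    expected = sum(
        x_probs[x] * sum(w * (subgroup_links[g][x] - total_link[x]) ** 2
                         for g, w in weights.items())
        for x in total_link
    )
    return sqrt(expected) / sd_y


if __name__ == "__main__":
    # Made-up linking functions for two subgroups and the total group.
    total = {300: 2400.0, 350: 2500.0, 400: 2600.0}
    groups = {
        "A": {300: 2404.0, 350: 2501.0, 400: 2598.0},
        "B": {300: 2396.0, 350: 2499.0, 400: 2602.0},
    }
    w = {"A": 0.5, "B": 0.5}
    px = {300: 0.25, 350: 0.5, 400: 0.25}
    print(rmsd_by_score(total, groups, w, sd_y=80.0))
    print(remsd(total, groups, w, px, sd_y=80.0))
```

Values near zero indicate that the subgroup linking functions are close to the total-group function; how small is small enough is a judgment that should be made against the score precision needed for the intended use.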
Validation of Linking
To set a reasonable baseline for the new high-stakes
accountability system based on linking results, a validation
should be considered.
The established relationships between the two tests could
be applied to the similar population from different years to
validate the linking results.
If applicable, the linking functions obtained from different
methods can be compared for evidence from cross-
validation.
Assessment Services Provided
to Members by the Consortium
Marty McCall, Director of Psychometrics
Role of a consortium
• Provide a high quality test
– Error and precision appropriate for a variety of
accountability models
– Scale stability so that interpretation is consistent
for a variety of growth models
• Document technical quality for members
• Serve as a forum for members as they craft
accountability systems
– Provide expertise
– Host venues for discussion
NOT the role of a consortium
• Impose a single accountability system
• Require the same policy rules for all members
• Provide information to member stakeholders without member assent
Advantages to being in a consortium
• Pooled expertise
– Shared expertise of member staffs
– Deep national expert pool
• TAC
• Content experts
• Researchers in special needs populations
• CRESST
• Contracted expertise
Advantages to being in a consortium
• Sheer quantity
– Number and quality of items and tasks
– More supports than states can provide in isolation
– Scale and complexity of standard setting
– Big samples
• Large sample for norms
• Sufficient special population numbers for analysis
• Capacity for field testing
• Can do more detailed analysis of technical quality especially by subgroups
Consortium role:
Provide a high quality test
• Choice of psychometric model
• Test Design
– Blueprints
– Algorithms
• Scale construction and maintenance
• Item and task development
• Test construction and packaging
Consortium role:
Maintain a high quality test
• Documents technical quality
– Reliability
– Accuracy of achievement level assignment
– Technical quality for subgroups
• Field Testing and analysis
• Scale maintenance
• Item pool maintenance
– Refreshing pools, item release and retirement
– Ongoing pool analysis
Serve as a forum for members as they
craft accountability systems
• Member meeting topics
– Discuss ways to deal with a variety of policy issues
– Share information about methods, policy, legal
constraints
• TAC meetings, webinars, and white papers
– TAC meeting discussions open to members
– Series on growth models: Joseph Martineau, Kadriye
Ercikan, Martha Thurlow
– Joseph Martineau paper: a taxonomy of growth
models and tools for selection
What’s on the Horizon for
Smarter Balanced?
Joe Willhoft, Ph.D.
Five Themes
• Smarter Business Model
• Membership
• Common vs. Customized
• Exchange of data
• Innovation
Smarter Balanced Business Model
• A unique model for K-12 accountability testing
• SB is an affiliation of members (mostly states)
– Not a 501(c)3
– A common MOU between each member and UCLA
• Each member selects own service provider(s)
• Defines “this, not that” to distinguish SB services from member responsibilities
• Managing and evolving the business model without having to renegotiate the MOU
Membership: 2010 vs. 2016
2010 Membership
Governing (N=17): CT, HI, ID, KS, ME, MI, MO, MT, NV, NM, NC, OR, UT, VT, WA, WV, WI
Advisory (N=14): AL, CO, DE, GA, IA, KY, NH, NJ, ND, OH, OK, PA, SC, SD
2016 Membership
Members (N=16): CA, CT, DE, HI, ID, MI, MT, NV, NH, ND, OR, SD, USVI, VT, WA, WV
Affiliates (N=2): IA, NC
Membership
• 2010: 31 states – 17 Governing; 14 Advisory
• 2016: 18 states/entities – 16 Member; 2 Affiliate
• 30% of US states; about 20% of enrollment 3-8, 11
• Members = Revenue
• Business Model impact on state procurement
Common vs. Customized Tests
• Common items, blueprints, scoring rules
• Members have options:
– Summative + Interim & Formatives or Summative only
– 3-8 and 11 or 3-8 only
– Use SB score reports or Member-designed reports
– Test delivery platforms
• Manage boundaries around options to safeguard validity and meaning of SB scale & scores
Exchange of Data
• Members have their own privacy and data exchange rules:
– Can individually identifiable data leave the state?
– Can discoverable data leave the state?
– Can any data leave the state?
• Many analysis issues: item field testing; item exposure and drift; use of tools and supports; etc.
Innovation
• USED Grant supported multiple innovations:
– Expansion of supports for SWD and ELs
– Formative tools for classroom teachers
– Integration of adaptive testing with performance items & tasks
– Permissions for authentic text in Reading
– Development of an alignment framework
• Innovation is always in jeopardy
• Entropy usually wins; without attention and resources, it always wins
Discussion