DeAnn Huinker, Daniel A. Sass, & Cindy M. Walker University of Wisconsin-Milwaukee

Measuring Mathematical Knowledge for Teaching: Measurement and Modeling Issues in Constructing and Using Teacher Assessments


Introduction

We have been using the Learning Mathematics for Teaching (LMT) assessments to evaluate the impact of the MMP on teacher content knowledge
Appreciative of the strong theoretical foundation; however, several pragmatic challenges exist
Purpose of this presentation is to share our experiences, challenges, and concerns

Item Response Theory (IRT) 101

Mathematical function that relates item parameters (i.e., difficulty and discrimination) to examinee characteristics
  IRT ability and item parameter interpretation
IRT parameter estimates are invariant up to a linear transformation (i.e., indeterminacy of scale)
Several competing models to choose from
How does IRT differ from classical test theory (CTT)?
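To make the "invariant up to a linear transformation" point concrete, here is a minimal sketch using the 2-PL item response function. The parameter values and the rescaling constants `A` and `B` are hypothetical and chosen only for illustration; they do not come from the LMT item pool.

```python
import math

def p_correct_2pl(theta, a, b):
    """Probability of a correct response under the 2-PL model.

    theta: examinee ability, a: item discrimination, b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical examinee and item (illustration only).
theta, a, b = 0.5, 1.2, -0.3

# Linear rescaling of the ability metric: theta* = A * theta + B.
A, B = 2.0, 1.0
theta_star = A * theta + B
b_star = A * b + B   # difficulty moves with the ability scale
a_star = a / A       # discrimination shrinks by the same factor

p_original = p_correct_2pl(theta, a, b)
p_rescaled = p_correct_2pl(theta_star, a_star, b_star)

# Both parameterizations give the same response probability,
# which is why the scale has to be fixed by convention.
print(p_original, p_rescaled)
```

Because a*(theta - b) is unchanged by the rescaling, the data cannot tell the two metrics apart. This is the indeterminacy that makes equating necessary when the same items are calibrated more than once.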

Issue 1: Lack of Item Equating

Multiple sets of item parameters, which can occur due to scaling the same items from
1) Different test compositions
2) Different groups of examinees
3) Both
Which set of item parameters should be used?
  Will repeated measures be used?
  Need to generalize to the population?
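One common way to place two sets of parameter estimates for the same items on a single scale is mean-sigma linking. The slides do not say which equating method, if any, would be used with the LMT items, so the sketch below is only an illustration of the general idea. All numbers are hypothetical, and the second calibration is constructed as an exact linear transformation of the first so the recovered constants are easy to check.

```python
from statistics import mean, pstdev

def mean_sigma_constants(b_from, b_to):
    """Linking constants (A, B) that map the b_from scale onto the b_to scale."""
    A = pstdev(b_to) / pstdev(b_from)
    B = mean(b_to) - A * mean(b_from)
    return A, B

def rescale_items(a_from, b_from, A, B):
    """Apply the linear transformation to discriminations and difficulties."""
    a_linked = [a / A for a in a_from]
    b_linked = [A * b + B for b in b_from]
    return a_linked, b_linked

# Hypothetical estimates for the same four items from calibration 1.
a_cal1 = [0.9, 1.3, 1.1, 0.7]
b_cal1 = [-1.0, -0.2, 0.4, 1.1]

# Hypothetical calibration 2: same items, different metric
# (built here as b2 = 1.5 * b1 + 0.3 for demonstration).
b_cal2 = [1.5 * b + 0.3 for b in b_cal1]

A, B = mean_sigma_constants(b_cal1, b_cal2)
a_linked, b_linked = rescale_items(a_cal1, b_cal1, A, B)
print(A, B)
```

With real data the two calibrations differ by estimation error as well as by scale, so the linked difficulties will only approximately match the target values.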

Issue 2: Scale Development

In using the LMT measures, projects must decide whether to use established LMT scales or to construct their own assessments by choosing problems from the item pool
Which method is best and when?
  Content validity issue
  Need to generalize ability estimates
  Test length
  Matching ability distribution to maximize test information
  Equating concern
Should the pre- and post-test measures?
IRT vs. CTT

Issue 2: Scale Development

How do researchers decide which items to use in constructing assessments?
We have found that the LMT item pool often contains too few items to create the preferred level of match to project goals and state standards for student learning
  Need to match item characteristics to expected ability distribution
  In some content areas there are too few items and/or item characteristics are not ideal

Issue 3: Model Selection

What IRT model should be selected and how does it influence score interpretation?
  One issue when modeling dichotomous data using IRT is selecting the most appropriate or best fitting model (i.e., 1-, 2-, or 3-PL)
  Why not use polytomous models?
  To date, items are scored using either CTT (i.e., summing the number of correct items) or the 2-PL model.
Comparability of models
Role of item discrimination parameter
Score interpretation for CTT and 2-PL
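The role of the discrimination parameter in score interpretation can be illustrated by scoring two response patterns that have the same number-correct score. This is a minimal sketch: the item parameters are hypothetical, and the ability estimate is found by a simple grid search over the likelihood rather than by whatever estimation routine a given IRT package would use.

```python
import math

def p_2pl(theta, a, b):
    """2-PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    total = 0.0
    for u, (a, b) in zip(responses, items):
        p = p_2pl(theta, a, b)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def ml_theta(responses, items):
    """Maximum-likelihood ability estimate on a grid from -4 to 4."""
    grid = [i / 100.0 for i in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

# Hypothetical items as (a, b): equal difficulty, unequal discrimination.
items = [(0.5, 0.0), (1.0, 0.0), (2.0, 0.0)]

pattern_low = [1, 0, 0]   # correct only on the least discriminating item
pattern_high = [0, 0, 1]  # correct only on the most discriminating item

# CTT scoring: both patterns have the same sum score.
sum_low, sum_high = sum(pattern_low), sum(pattern_high)

# 2-PL scoring: the patterns receive different ability estimates.
theta_low = ml_theta(pattern_low, items)
theta_high = ml_theta(pattern_high, items)
print(sum_low, sum_high, theta_low, theta_high)
```

Under the 1-PL model all discriminations are equal, so examinees with the same sum score receive the same ability estimate. Under the 2-PL model a correct answer counts for more on a highly discriminating item, so the ordering of examinees can differ from the CTT ordering.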

Table 1

Data taken from Mathematical Explorations for Elementary Teachers Course

            N     Mean       SD        t       DF    Sig.
Pre         25    -.7997     .73647
Post        25    .2248      .58137
Post-Pre    25    1.02452    .75264    6.806   24    .0001
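The t statistic in the Post-Pre row can be reproduced from the summary values alone, assuming it comes from a paired-samples t-test on the difference scores (the slide does not name the test, but the 24 degrees of freedom with N = 25 are consistent with that). A minimal check:

```python
import math

# Summary statistics from the Post-Pre row of Table 1.
n = 25
mean_diff = 1.02452
sd_diff = 0.75264

# Paired-samples t: mean difference divided by its standard error.
standard_error = sd_diff / math.sqrt(n)
t_stat = mean_diff / standard_error
df = n - 1

# Standardized mean difference for paired data (an addition, not on the slide).
effect_size = mean_diff / sd_diff

print(round(t_stat, 3), df)  # → 6.806 24
```

The effect size is included only as a convenience; it is not reported in the original table.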

Conclusions

There are two primary issues related to analyzing data from the Michigan measures that need to be addressed and improved on.
1) Item equating to ensure the items are on the same measurement scale.
   Benefit of invariance property (i.e., test length and item selection)
2) The second issue is which IRT model is more appropriate for the data and the degree to which fitting different models affects score interpretation.

Questions and Concerns

How have you addressed some of these issues?
What are some issues that you have encountered when using this measure?
Related measurement questions?
