Newton-Raphson Estimate Procedure in Test Assembly

[email protected]

ESTIMATION TO EXAMINEE PARAMETER USING NEWTON-RAPHSON METHOD: AN APPLICATION FOR LANGUAGE TESTING

Widiatmoko
E.: [email protected]

Center for Languages Teacher Training and Development, MOE

Presented at the Annual Conference on Linguistics
Atma Jaya Catholic University Jakarta
15-16 February 2006

Why?

Curriculum/Syllabus, Learning Process, Evaluation

Materials

Method, Approach, Technique

Evaluation

Steps in the Construction of Calibrated Items

► Outlining Topics
► Domain Specification
► Constructing Items
► Reviewing and Revising Items
► Items Try-out
► Selecting Items
► Good Items
► Items Calibration (Estimation)
► Item Banking

Test & Items

Test:
► questions to measure an examinee’s trait in a situation
► the examiner focuses on what the examinees are like in their norm groups
► when the test is easy, the examinee appears to have higher ability, and vice versa (Hambleton, Swaminathan, & Rogers, 1991; Naga, 1992)

Items:
► their statistics are subject to change, or inconsistent, depending upon the traits of the examinee group
► designed based on the aforementioned judgment
► difficulty (i.e., proportion of examinees passing the item) and discrimination (i.e., item-total test biserial or point biserial) are group-dependent, implying that the values of these statistics depend on the examinee group in which they are acquired (Magnusson, 1967; Hambleton, 1989)
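The group dependence of these classical statistics can be illustrated with a small sketch. The two examinee groups and their scores below are hypothetical, invented only for illustration, and the function names `item_difficulty` and `point_biserial` are not from the slides. The same item yields a different difficulty and a different point biserial in each group.

```python
import math

def item_difficulty(item_scores):
    """Classical difficulty: proportion of examinees passing the item."""
    return sum(item_scores) / len(item_scores)

def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between a 0/1 item and the total test score.

    Assumes the item has at least one pass and one fail, and that the
    total scores are not all equal.
    """
    n = len(item_scores)
    p = sum(item_scores) / n
    mean_total = sum(total_scores) / n
    sd_total = math.sqrt(sum((t - mean_total) ** 2 for t in total_scores) / n)
    passed = [t for u, t in zip(item_scores, total_scores) if u == 1]
    mean_passed = sum(passed) / len(passed)
    return (mean_passed - mean_total) / sd_total * math.sqrt(p / (1.0 - p))

# Hypothetical data: the same item administered to two different groups.
strong_item = [1, 1, 1, 0, 1]
strong_total = [9, 8, 7, 4, 6]
weak_item = [0, 1, 0, 0, 1]
weak_total = [3, 6, 2, 4, 5]

for label, item, total in (("strong", strong_item, strong_total),
                           ("weak", weak_item, weak_total)):
    print(label,
          round(item_difficulty(item), 2),
          round(point_biserial(item, total), 2))
```

Because both statistics are computed from the observed group, neither can be carried over to a new group of examinees without recalculation.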

IRT

► IRT is used by nearly all of the largest test publishers, many state departments of education, and industrial and professional organizations (Hambleton & Murray, 1983; Hambleton, 1989)

IRT: Local independence

► item scores obtained from a homogeneous subpopulation of examinees are independent (Naga, 1992 cited in Widiatmoko, 2005)
► responses to any two items are uncorrelated in a homogeneous subpopulation with a particular level of θ (Hulin, Drasgow, & Parsons, 1983)
► within any group of examinees all characterized by the same values θ1, θ2, ..., θk, the (conditional) distributions of the item scores are all independent of each other (Lord & Novick, 1968; McDonald, 1999)
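In symbols, local independence for a single trait is commonly written as below. The notation is an assumption introduced here, not taken from the slides: U_i is the scored response to item i, u_i its observed value, and n the number of items.

```latex
P(U_1 = u_1, U_2 = u_2, \ldots, U_n = u_n \mid \theta)
  = \prod_{i=1}^{n} P(U_i = u_i \mid \theta)
```

The probability of a whole response pattern at a fixed θ is therefore the product of the individual item probabilities.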

IRT: Parameter invariance

► parameters characterising an item do not depend on the trait distribution of the examinees and the parameter characterising an examinee does not depend on the set of test items (Hambleton, Swaminathan, & Rogers, 1991)

IRT: Unidimensionality

► presence of a dominant component or factor influencing test performance (Hambleton, Swaminathan, & Rogers, 1991)
► an item measures one trait or characteristic over the examinees (Traub, 1983; Naga, 1992)

Language tests

► designed using the concept of IRT in discrete-points test paradigm (Weir, 1990)
► discrete-points test focusing on one point of grammar at a time; at only one element of a particular component of a grammar; only one skill at a time and one aspect of a skill (Oller, 1979)

Question formulated

Is the test characteristic curve generated by the Newton-Raphson method satisfied in the one-parameter logistic model?

Theory: Parameter estimation

► determining the value of an examinee’s trait with adequate precision and classifying an examinee into trait categories with small probabilities of misclassification (Lord & Novick, 1968)
► incorporates item parameter and examinee parameter
► basic consideration: parameter estimates are chosen by selecting the values that make an observed data set appear most likely in light of a particular model (Hulin, Drasgow, & Parsons, 1983)
► item parameter: estimation of item difficulty and discrimination
► examinee parameter: estimation of examinee’s trait
► concerns item banking
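The "most likely" principle in the basic consideration above can be sketched as a plain grid search over θ. This is only an illustration under stated assumptions: a 1PL response function, local independence, and hypothetical item difficulties and responses. The names `p_correct` and `log_likelihood` are invented for the example.

```python
import math

def p_correct(theta, b):
    """1PL probability of a correct response for trait theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def log_likelihood(theta, responses, difficulties):
    """Log-likelihood of a 0/1 response pattern, assuming local independence."""
    total = 0.0
    for u, b in zip(responses, difficulties):
        p = p_correct(theta, b)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

# Hypothetical item difficulties and one examinee's scored responses.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
responses = [1, 1, 1, 0, 0]

# Evaluate theta from -4.00 to 4.00 in steps of 0.01 and keep the value
# that makes the observed responses most likely.
grid = [g / 100.0 for g in range(-400, 401)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, responses, difficulties))
print(theta_hat)
```

A grid search is slow but transparent; iterative procedures such as Newton-Raphson reach the same maximum in far fewer evaluations.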

IRT models

► 1PL, 2PL, 3PL, and 4PL models
► systematic procedure for considering and quantifying the probability or improbability of individual item and examinee’s response patterns in a set of test data (Henning, 1987)
► appropriate for dichotomous data
► distinction among the models: the numbers of parameters

Model parameters

► 1st: scale of examinee’s trait and item difficulty
► 2nd: continuous estimate of discriminability
► 3rd: index of pseudo chance-level (guessing)
► 4th: index of carelessness by the high achiever (Hambleton, 1989)
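How the four parameters enter a logistic item response function can be sketched as follows. The symbols `b`, `a`, `c`, and `d` (difficulty, discrimination, lower asymptote, upper asymptote) follow common usage but are assumptions here, as is the omission of the scaling constant 1.7 that some texts include. With the defaults the function reduces to the 1PL model.

```python
import math

def irf(theta, b, a=1.0, c=0.0, d=1.0):
    """Probability of a correct response under a logistic model.

    b: item difficulty (1st parameter)
    a: discrimination (2nd parameter)
    c: pseudo chance-level, lower asymptote (3rd parameter)
    d: upper asymptote, below 1 when high achievers are careless (4th parameter)
    """
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

# Same examinee (theta = 0) and item difficulty (b = 0) under each model.
print(irf(0.0, 0.0))                       # 1PL
print(irf(0.0, 0.0, a=2.0))                # 2PL
print(irf(0.0, 0.0, a=2.0, c=0.2))         # 3PL
print(irf(0.0, 0.0, a=2.0, c=0.2, d=0.9))  # 4PL
```

Writing the models as one function makes the distinction among them visible: each model frees one more argument.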

1PL model

► widely used
► probabilistic, where the examinees and items are not only graded for trait and difficulty, but also judged according to the probability or likelihood of their response patterns given the observed examinee’s trait and item difficulty (Henning, 1987)
► assumption: all items are equally discriminating
► application is for relatively easy tests
► much smaller sample sizes are required if the main purpose is to estimate θ (Hulin et al. cited in Crocker & Algina, 1986)

N-R Methods

► finds zeros of the first derivative of the function to be maximized (Krass, 2005)
► obtains results where the drift of parameter estimates is arrested and the parameters are estimated more accurately than with the joint maximum likelihood procedure (Swaminathan, Hambleton, Sireci, Xing, & Rizavi, 2003)
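A minimal sketch of the Newton-Raphson update for one examinee's θ under the 1PL model, with item difficulties treated as known. It is not the procedure used in the study: the starting value, tolerance, iteration cap, function name `newton_raphson_theta`, and the data are all assumptions made for illustration. Each step moves θ by the ratio of the first to the second derivative of the log-likelihood.

```python
import math

def newton_raphson_theta(responses, difficulties, theta=0.0,
                         tol=1e-3, max_iter=50):
    """Estimate theta under the 1PL model; returns (theta, iterations used)."""
    raw_score = sum(responses)
    if raw_score == 0 or raw_score == len(responses):
        # The likelihood has no finite maximum for these patterns.
        raise ValueError("all-correct or all-incorrect response pattern")
    for iteration in range(1, max_iter + 1):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        first = raw_score - sum(probs)               # d logL / d theta
        second = -sum(p * (1.0 - p) for p in probs)  # d^2 logL / d theta^2
        step = first / second
        theta -= step
        if abs(step) < tol:
            return theta, iteration
    raise RuntimeError("Newton-Raphson did not converge")

# Hypothetical item difficulties and one examinee's scored responses.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
responses = [1, 1, 1, 0, 0]

theta_hat, n_iter = newton_raphson_theta(responses, difficulties)
print(round(theta_hat, 3), n_iter)
```

The function also reports how many iterations were needed, which is the kind of per-examinee count a study of convergence would tabulate.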

Methodology

► a survey
► purposive random sampling
► the population: 45 items along with their b_i and 2000 examinees responding to the items
► 40 b_i selected randomly as the subpopulation of items
► only examinees who respond to the items both correctly and incorrectly are purposively selected as data
► 70 examinees responding to the items selected randomly as the subpopulation of examinees
► the research analysis units: 40 b_i and 70 examinees responding to the items
► the values of examinees’ latent traits analyzed

Analysis

► initial θ of examinees’ latent traits extend from -1.735 to +3.664
► the first iteration includes the examinees 10, 20, 30, and 50
► the second iteration includes the examinees 5, 9, and 39
► the third iteration includes the examinees 1, 2, 4, 7, 8, 11, 12, 13, 14, 15, 17, 18, 19, 21, 22, 24, 25, 27, 28, 31, 32, 34, 35, 37, 41, 42, 43, 44, 49, 51, 53, 54, 55, 57, 59, 61, 63, 64, 65, 67, 68, and 69
► the fourth iteration includes the examinees 3, 6, 16, 23, 26, 29, 36, 38, 45, 46, 47, 56, 58, and 66
► the fifth iteration includes the examinees 33, 40, 48, 52, 62, and 70
► the sixth iteration includes the examinee 60
► it results in the examinees’ traits θ varying from -1.735 to +2.912

Test Characteristic Curve for 1PL

[Figure: test characteristic curve. Horizontal axis: Examinees Latent Trait, from -1.74 to 2.91. Vertical axis: Probability of Examinees Latent Trait, from 0.0 to 1.0.]

Conclusion

► The 1PL model is not sufficiently satisfied. Hypothetically, this may be due to the number of examinees, the method employed, the model chosen, the test length, and other factors

Recommendation & Implication

► continuous study
► in language testing, it is recommended to employ several methods of estimation for widely ranged test items using the 2PL model, 3PL model, and other models
► computer programs for the sake of accurate and quick iteration
► item banking

Questions

C U Next Year
