Conceptual Issues in Response-Time Modeling
Wim J. van der Linden
CTB/McGraw-Hill
Outline
• Traditions of RT modeling
• RTs fixed or random?
• Item completion, responses, and RTs
• RT and speed
• Speed and ability
Outline Cont’d
• RT and item difficulty
• Dependences between responses and RTs
• Hierarchical model of responses and RTs
• Applications to testing problems
Traditions of RT Modeling
• Four different traditions
– No model
– Distinct models for RTs
– Response models with RT parameters
– RT models in mathematical psychology
• Alternative
– Hierarchical model of responses and RTs
• Test design
– Fixed tests
– Adaptive tests
– Test accommodations
• And many more
Fixed or Random RTs
• Some models treat RTs as fixed quantities:
– Roskam (1987, 1997); Thurstone (1937)
• RTs treated as random in psychology
• Random responses but fixed RTs seems contradictory
• Conclusion 1: Just as responses, RTs on test items should be treated as realizations of random variables
Item Completion, Response, and RT
• Rasch (1960) models for misreadings and reading speed
– Poisson-gamma framework
– Same notation and terminology for parameters in both types of models
“To which extent the two difficulty parameters … and the two ability parameters … run parallel is a question to be answered by empirical results, and at present we shall leave it open.” (Rasch, 1960, p. 42)
Item Completion, Response, and RT Cont’d
• Notion of equivalent scores of speed tests (Gulliksen, 1960; Woodbury, 1951, 1963):
– Total time on a fixed number of items
– Number of items correct in a fixed time interval
• Three types of variables required to describe test behavior:
– Tij: response time (person j and item i)
– Uij: response
Item Completion, Response, and RT Cont’d
• Three sets of variables (cont’d)
– Dij: item completion (design variable)
• Uij and Dij have different distributions
– Same holds for their sums
• NU: number-correct score
• ND: number of items completed
• Equivalence only when Pr{Uij = 1 | Dij = 1} = 1 for all items and persons
Item Completion, Response, and RT Cont’d
• Distinction between speed and power tests makes no sense; all tests are hybrids
• Conclusion 2: Tij, Uij, and Dij are random variables with different distributions. The same holds for their sums: total time (T), number correct (NU), and number completed (ND). Except for discreteness, T and ND are inversely related. (We’ll assume T and NU to be independent!)
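The inverse relation between total time and number completed can be illustrated with a small simulation. This is an illustrative sketch, not code from the talk: the lognormal form of the RTs, the time intensities, the speed values, and the 30-minute limit are all assumptions made for the example.

```python
import math
import random

random.seed(1)

def simulate_rts(tau, betas, sigma=0.3):
    """RTs under a lognormal model: ln t = beta - tau + N(0, sigma^2)."""
    return [math.exp(b - tau + random.gauss(0, sigma)) for b in betas]

betas = [4.0] * 50  # hypothetical time intensities (log seconds) for 50 items
slow = simulate_rts(tau=0.0, betas=betas)
fast = simulate_rts(tau=1.0, betas=betas)

# Total time T on the fixed set of items
T_slow, T_fast = sum(slow), sum(fast)

def n_completed(rts, limit):
    """Number of items finished before the time limit runs out (ND)."""
    t, n = 0.0, 0
    for rt in rts:
        t += rt
        if t > limit:
            break
        n += 1
    return n

limit = 30 * 60  # a hypothetical 30-minute limit, in seconds
ND_slow, ND_fast = n_completed(slow, limit), n_completed(fast, limit)
# The faster test taker needs less total time and completes more items.
print(T_slow > T_fast, ND_fast > ND_slow)
```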
RT and Speed
• Speed and time are not equivalent notions
• Generally, speed is a rate of change of some measure with respect to time, e.g.,
Speed of motion = Distance traveled / Time
RT and Speed Cont’d
• For achievement testing, an appropriate notion of speed is cognitive speed:

Speed = Amount of cognitive labor / Time

• Fundamental equation:

t_ij = β*_i / τ*_j

with β*_i the amount of labor required (“time intensity”) by item i, τ*_j the speed of person j, and t_ij the response time of person j on item i
RT and Speed Cont’d
• Lognormal RT model (van der Linden, 2006)
– Log transformation to remove skewness from RT distributions
– Addition of random term
ln t_ij = ln β*_i − ln τ*_j ≡ β_i − τ_j

ln t_ij = β_i − τ_j + ε_i,  ε_i ~ N(0, σ_i²)
RT and Speed Cont’d
• Lognormal RT model:
f(t_ij) = (α_i / (t_ij √(2π))) exp{ −(1/2) [α_i (ln t_ij − (β_i − τ_j))]² }

with τ_j the speed of person j, β_i the time intensity of item i, and α_i the discrimination of item i
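The lognormal density f(t) = α/(t√(2π)) exp{−½[α(ln t − (β − τ))]²} can be evaluated directly. A minimal numeric sketch; the parameter values below are hypothetical, not from the talk.

```python
import math

def lognormal_rt_density(t, alpha, beta, tau):
    """f(t) = alpha / (t*sqrt(2*pi)) * exp(-0.5*(alpha*(ln t - (beta - tau)))**2)."""
    z = alpha * (math.log(t) - (beta - tau))
    return alpha / (t * math.sqrt(2.0 * math.pi)) * math.exp(-0.5 * z * z)

# Under this model, ln t is normal with mean beta - tau, so the median RT
# is exp(beta - tau): higher speed tau shortens RTs, higher time
# intensity beta lengthens them, for every item.
alpha, beta, tau = 2.0, 4.0, 0.5   # hypothetical values
median_t = math.exp(beta - tau)    # median RT in seconds
print(round(median_t, 1))
```

Note how τ enters with a negative sign: a one-unit increase in speed shortens every log RT by one unit, regardless of the item.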
RT and Speed Cont’d
• Conclusion 3: RT and speed are different concepts related through a fundamental equation. RT models with a speed parameter should also have an item parameter for their amount of cognitive labor (or time intensity)
Speed and Ability
• Speed-accuracy tradeoff in psychology is the same as a speed-ability tradeoff in achievement testing
– Negative within-person correlation between τ and θ
– Change of speed required for tradeoff to become manifest
• Traditional IRT view of a person’s ability is of θ as a scale point, not as a function θ = θ(τ)
– Effective ability level
Speed and Ability Cont’d
• At group level, any correlation between ability and speed may occur
• Basic assumption: constancy of speed during the test
– Constant speed implies constant ability (ceteris paribus)
• In practice, speed and ability always fluctuate somewhat, but fluctuations should be minor and unsystematic
Speed and Ability Cont’d
• Conclusion 4: Speed and ability are related through a distinct function θ = θ(τ) for each test taker. The function itself need not be incorporated into the response and RT models. But these models do require (fixed) parameters for the effective ability and speed of the test takers.
RT and Item Difficulty
• Descriptive research and speed-accuracy tradeoff suggest correlation between RT and item difficulty
– Item difficulty parameter in RT model?
– Counterexample
• Item parameters in response and RT models are for different item effects (on probability of correct response and time, respectively)
RT and Item Difficulty Cont’d
• Latent vs. manifest effect parameters
– Danger of reification of latent effects
• Conclusion 5: RT models require item parameters for their time intensity but difficulty parameters belong in response models
Dependences between Responses and RTs
• Descriptive vs. experimental studies
• However, these studies necessarily involve data aggregation across items and/or persons
– Spurious correlations due to hidden sources of covariation (item and person parameters)
• Marginal vs. conditional independence between responses (spurious correlation, Simpson’s paradox, etc.)
Dependences between Responses and RTs Cont’d
• Conclusion 6: Regular test behavior is characterized by three different types of conditional (or “local”) independence, namely between
– responses on different items
– RTs on different items
– responses and RTs on the same item
Dependences between Responses and RTs Cont’d
• For these conditional independencies to hold for an entire test, constant speed is a necessary condition
• Empirical results
Hierarchical Model of Responses and RTs
• Distinct models for responses and RTs for a fixed person and item
– Regular IRT model
– E.g., lognormal model for RTs
– Models should have
• parameters for effective ability and speed
• parameters for item difficulty and time intensity
• conditional independence
Hierarchical Model of Responses and RTs Cont’d

• Second-level models for dependences between
– ability and speed across persons
– difficulty and time intensity across items
• Multivariate normal distributions (possibly after parameter transformation)
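The two-level structure can be sketched as a small data generator: person parameters drawn from a bivariate normal at the second level, and a response plus an RT generated at the first level. All distributional choices and numeric values below are illustrative assumptions, not the authors' calibrated models.

```python
import math
import random

random.seed(7)

def draw_person(mu_theta=0.0, mu_tau=0.0, sd=1.0, rho=0.4):
    """Draw correlated (ability, speed) from a bivariate normal
    via a Cholesky-style construction."""
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    theta = mu_theta + sd * z1
    tau = mu_tau + sd * (rho * z1 + math.sqrt(1 - rho**2) * z2)
    return theta, tau

def generate(theta, tau, b, beta, alpha=2.0):
    """One response (Rasch-type model) and one RT (lognormal model),
    conditionally independent given the person parameters."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    u = 1 if random.random() < p else 0
    t = math.exp(beta - tau + random.gauss(0, 1.0 / alpha))
    return u, t

theta, tau = draw_person()
u, t = generate(theta, tau, b=0.0, beta=4.0)
print(u in (0, 1), t > 0)
```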
Hierarchical Model of Responses and RTs Cont’d

• Bayesian treatment of modeling framework
– Parameter estimation and model fit analysis with MCMC (Gibbs sampler)
– Plug-and-play approach
– Calibration of items with respect to RT parameters is straightforward
– R package available upon request (Fox, Klein Entink, & van der Linden, 2007; Klein Entink, Fox, & van der Linden, 2009)
Applications to Testing Problems
• Test design
• Adaptive testing
– Item selection
– Differential speededness
• Detection of cheating
– Item memorization and preknowledge
– Collusion
Applications to Testing Problems Cont’d
• Use of RTs as collateral information in parameter estimation
• Cognitive research on problem solving
• Etc.
No RT Model
• Descriptive studies in educational testing
– Correlation between responses and RTs
– Regression of RT on item and person attributes
• Word counts, IRT item parameters, etc.
• Number-correct scores; ability estimates
• Experimental studies in psychology
– Manipulation of task or conditions
• Experimental reaction-time research (cont’d)
– Speed-accuracy tradeoff (Luce, 1986)
– Plot of proportion of correct responses against RT
No RT Model Cont’d
[Figure: plot of proportion of correct responses (p) against RT (t)]
• Problems
– Spurious correlations between observed RTs
– Speed-accuracy tradeoff is not a between-person phenomenon
No RT Model Cont’d
Stroop Test
Green
Stroop Test Cont’d
Blue
Spurious Relations
Subject 1: 22, 19, 40, 43, 27, 27, 45, 23, 14, … Subject 2: 26, 38, 101, 57, 37, 21, 116, 44, 10, …
• RTs of two arbitrary students on a quantitative reasoning test
Spurious Relations Cont’d

Subject 1: 22, 19, 40, 43, 27, 27, 45, 23, 14, …
Subject 2: 26, 38, 101, 57, 37, 21, 116, 44, 10, …
r = .89

Subject 1: 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, …
Subject 2: 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, …
r = .20

• RTs of two arbitrary students on a quantitative reasoning test
• Responses of same students
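The spurious RT correlation can be reproduced from the values shown above. A sketch using only the nine RT pairs displayed; the slides' r = .89 was presumably computed on the full, elided vectors, so the value here differs slightly.

```python
import math

# RTs of the two students on the first nine items (from the slide)
s1 = [22, 19, 40, 43, 27, 27, 45, 23, 14]
s2 = [26, 38, 101, 57, 37, 21, 116, 44, 10]

def pearson(x, y):
    """Plain Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    varx = sum((a - mx) ** 2 for a in x)
    vary = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(varx * vary)

# The high value is driven by the items' common time intensities (both
# students slow down on the same items), not by any dependence between
# the two test takers.
r = pearson(s1, s2)
print(round(r, 2))  # → 0.85
```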
Distinct Models for RTs

• Rasch’s (1960) models for reading speed
• Exponential models
– Oosterloo (1975); Scheiblechner (1979)
• Gamma models
– Maris (1993); Pieters & van der Ven (1982)
• Weibull models
– Tatsuoka & Tatsuoka (1980)
Rasch’s Models

• Poisson distribution of the number of reading errors a_ij in a text of N words:

Pr{a_ij} = (λ_ij N)^(a_ij) e^(−λ_ij N) / a_ij!,  with λ_ij = δ_i / ξ_j

• Gamma distribution of the reading time t_ij for a text of N words:

p(t_ij) = λ_ij (λ_ij t_ij)^(N−1) e^(−λ_ij t_ij) / (N−1)!,  with λ_ij = ξ_j / δ_i
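Both models can be simulated with the standard library alone. A sketch, with δ read as a text-difficulty parameter and ξ as a person parameter; the numeric values are hypothetical.

```python
import math
import random

random.seed(3)

def poisson_errors(N, delta, xi):
    """Number of misreadings in N words: Poisson with mean N*delta/xi,
    sampled by inverting the CDF."""
    lam = N * delta / xi
    u, k, p = random.random(), 0, math.exp(-lam)
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

def gamma_reading_time(N, delta, xi):
    """Reading time for N words: sum of N exponential word times with
    rate xi/delta, i.e., a gamma variate with shape N."""
    rate = xi / delta
    return sum(random.expovariate(rate) for _ in range(N))

errors = poisson_errors(N=100, delta=0.2, xi=10.0)
reading_time = gamma_reading_time(N=100, delta=0.2, xi=10.0)
print(errors >= 0, reading_time > 0)
```

Note how the same two parameters appear in both models, but in inverted roles; whether they "run parallel" is exactly the question Rasch left open in the quotation earlier.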
Response Models with RT Parameters

• This type of model mostly motivated by attempts to build the speed-accuracy tradeoff into the response model
• Response surface in Thurstone (1937)
• Logistic models
– Roskam (1987, 1997); Verhelst, Verstralen & Janssen (1997)
• We also have RT models that incorporate response parameters
– E.g., lognormal models by Gaviria (2005) and Thissen (1982)
Response Models with RT Parameters
Thurstone’s Response Surface
Roskam’s Model (1997)
p_i(θ_j) = {1 + exp[−(θ_j + ln t_ij − b_i)]}^(−1)

with θ_j the ability, b_i the item difficulty, and the RT term ln t_ij supplying the “speed-accuracy tradeoff”
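In Roskam's model the success probability p = 1/(1 + exp(−(θ + ln t − b))) increases with the time spent on the item. A quick numeric check, with hypothetical values for θ, b, and the two RTs:

```python
import math

def roskam_p(theta, t, b):
    """p_i(theta_j) = 1 / (1 + exp(-(theta_j + ln t_ij - b_i)))."""
    return 1.0 / (1.0 + math.exp(-(theta + math.log(t) - b)))

# For a fixed person and item, spending more time raises the success
# probability -- the tradeoff built directly into the response model.
p_short = roskam_p(theta=0.0, t=10.0, b=3.0)
p_long = roskam_p(theta=0.0, t=60.0, b=3.0)
print(p_short < p_long)
```

Because ln t enters the logit directly, allowing more time mechanically raises the success probability for every test taker.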
RT Models in Mathematical Psychology

• Models for underlying psychological processes
– Diffusion models
– Models for sequential and parallel processing
• Experimental data
– Standardized task
– Assumption of exchangeable subjects
• No subject or item parameters
RT and Speed

Item 1: 229 + 375  (Time: 9 sec)
Item 2: 229 + 375 + 58 + 39  (Time: 12 sec)
Speed: ?  ?
Speed-Ability Tradeoff

[Figure: ability plotted against speed; the within-person relation is a decreasing curve]

Speed-Ability Tradeoff Cont’d

[Figure: within-person curves for a lower-ability and a higher-ability test taker]

Speed-Ability Tradeoff Cont’d

[Figure: a test taker’s effective speed τ and effective ability θ = θ(τ) marked as a point on the within-person curve]

Speed-Ability Tradeoff Cont’d

[Figures: group-level scatter plots of ability against speed, showing that different correlations between the two may occur across test takers]
RT and Item Difficulty

Item 1: 229 + 375
Item 2: 229 + 375 + 58 + 39
[Diagram: first level — person (ability) and item (difficulty) parameters determine the response; person (speed) and item (time intensity) parameters determine the RT]

[Diagram: the same first-level structure with a second level added — a distribution of the person parameters and a distribution of the item parameters]
Test Design
• So far, issues of test speededness have been dealt with intuitively, with post hoc evaluation of time limits
• Alternatively, the time parameters of the items can be used to assemble a test to have a prespecified level of speededness
– Example for LSAT
New Test Equally Speeded as Reference Test
[Figure: total-time distributions of the new test and the reference test at τ = 0, showing equal speededness]
Adaptive Testing
• Application 1: use responses and RTs during the test to select the next item
– Posterior predictive density of responses on candidate item given previous responses and RTs
– Example for LSAT (simulation)
• Application 2: select items to prevent speededness of test
– Example for ASVAB
Response and RTs in Adaptive Testing
[Diagram: hierarchical framework for adaptive testing — response Uij depends on person j and item parameters ai, bi, ci; RT Tij depends on person j and the item’s RT parameters (βi, αi); second-level distributions for the population of persons and the item domain]
Mean Square Error in Ability Estimates

[Figure: MSE of the ability estimates as a function of θ for adaptive tests of n = 10 and n = 20 items; curves for item selection without RTs and with RTs under ρ = .2 and ρ = .8]
Time Used to Complete Test (Without Constraint)

[Figure: distributions of time used, plotted against speed for θ = −2 and θ = 2; time limit of 39 min]

Time Used to Complete Test (With Constraint)

[Figure: same distributions under the constraint; time limit of 39 min]

Time Used to Complete Test (With Constraint)

[Figure: time limit of 34 min]

Time Used to Complete Test (With Constraint)

[Figure: time limit of 29 min]
Detection of Cheating
• Item memorization and preknowledge
– Check actual RTs on suspicious item against expectation based on (i) its time parameters and (ii) estimation of speed on other items
– Bayesian residuals
– Case study for GMAT
Detection of Cheating Cont’d
• Types of collusion
– Sign language
– Intra/internet
– Wireless communication
• Collusion between test takers may manifest itself as correlation between their response times (RTs)
Detection of Cheating Cont’d
• However, observed RTs always correlate because the time intensity of the items varies from one item to the next (see earlier example of spurious correlation)!
• Therefore, correlation between RTs of pairs of test takers should be analyzed under a model for their bivariate distribution
Detection of Cheating Cont’d
• Bivariate lognormal model for RTs by test takers j and k on item i
f(t_ij, t_ik) = (α_i² / (2π t_ij t_ik √(1 − ρ_jk²))) exp{ −(1 / (2(1 − ρ_jk²))) (ε_ij² − 2ρ_jk ε_ij ε_ik + ε_ik²) }

with ε_ij = α_i [ln t_ij − (β_i − τ_j)]
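The practical point is that the residuals ε_ij = α_i[ln t_ij − (β_i − τ_j)] already remove the item effects, so for non-colluding test takers the residuals should not correlate even when the raw RTs do. A sketch; the α_i, β_i, RTs, and τ = 0 below are invented for illustration.

```python
import math

def standardized_residual(t, alpha, beta, tau):
    """eps = alpha * (ln t - (beta - tau)) -- item effects removed."""
    return alpha * (math.log(t) - (beta - tau))

# Two test takers on the same three items: raw RTs track the varying
# time intensities beta_i, the residuals need not.
alphas = [2.0, 2.0, 2.0]
betas = [3.0, 4.5, 3.8]      # hypothetical time intensities
rts_j = [18.0, 95.0, 40.0]   # test taker j
rts_k = [22.0, 80.0, 50.0]   # test taker k

eps_j = [standardized_residual(t, a, b, tau=0.0)
         for t, a, b in zip(rts_j, alphas, betas)]
eps_k = [standardized_residual(t, a, b, tau=0.0)
         for t, a, b in zip(rts_k, alphas, betas)]
print(all(math.isfinite(e) for e in eps_j + eps_k))
```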
Detection of Cheating Cont’d
• Example for test of quantitative reasoning
Case Study for GMAT Cont’d
• Example 1: RT patterns with 15 flagged items
– Test taker spent most time on Items 1–18 and then rushed through 19–27
– No cheating but serious time management problem
– Observe RT on Item 2, which is quite time intensive, but the residual RT is barely aberrant!
RT Pattern with 15 Flagged Items
[Figure: residual log RTs and observed RTs in minutes for Items 1–27]
Case Study for GMAT Cont’d
• Example 2: Observed vs. residual RTs (no flagged items!)
– This case illustrates the need of RT modeling and analysis of residual RTs
– Observed RTs suggest the same time management problem as in the preceding example, but the pattern almost disappears for the residual RTs
Observed vs. Residual RTs
[Figure: residual log RTs and observed RTs in minutes for Items 1–27]
Case Study for GMAT Cont’d
• Example 3: Suspicious item
– Large negative residual (−4.66) for Item 14
– RT of 12.3 seconds (expected RT under the model was 88.9 seconds!)
– Test taker had correct response but very low estimated ability relative to item difficulty
– Four other test takers with same behavior on same item!
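The size of the flag can be checked from the numbers reported above. Only the observed and expected RTs are from the case study; turning the log-time difference into the reported standardized residual of −4.66 would additionally require the item's discrimination, which is not given here.

```python
import math

# Observed vs. model-expected RT on the flagged item (seconds, from the slide)
observed, expected = 12.3, 88.9

# Raw residual on the log-time scale used by the lognormal model
log_residual = math.log(observed) - math.log(expected)
print(round(log_residual, 2))  # about -1.98 in log seconds
```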
Suspicious Item
[Figure: residual log RTs and observed RTs in minutes for Items 1–25; Item 14 shows the large negative residual]