Applied Psych Test Design: Part C - Use of Rasch scaling technology
Transcript of Applied Psych Test Design: Part C - Use of Rasch scaling technology
The Art and Science of Test Development—Part C
Test and item development: Use of Rasch scaling technology
The basic structure and content of this presentation are based extensively on the test development procedures developed by Dr. Richard Woodcock
Kevin S. McGrew, PhD.
Educational Psychologist
Research Director, Woodcock-Muñoz Foundation
Part A: Planning, development frameworks & domain/test specification blueprints
Part B: Test and Item Development
Part C: Use of Rasch Technology
Part D: Develop norm (standardization) plan
Part E: Calculate norms and derived scores
Part F: Psychometric/technical and statistical analysis: Internal
Part G: Psychometric/technical and statistical analysis: External
The Art and Science of Test Development
The above titled topic is presented in a series of sequential PowerPoint modules. It is strongly recommended that the modules (A-G) be viewed in sequence.
The current module is designated by red bold font lettering
Important note: For the on-line public versions of this PPT module, certain items, information, etc. are obscured for test security or proprietary reasons…sorry
Use Rasch (IRT) scaling to evaluate the complete pool of items and to develop the Norming and Publication tests
Structural (Internal) Stage of Test Development
Purpose: Examine the internal relations among the measures used to operationalize the theoretical construct domain (i.e., intelligence or cognitive abilities)
Questions asked: Do the observed measures “behave” in a manner consistent with the theoretical domain definition of intelligence?
Methods and concepts: Internal domain studies; item/subscale intercorrelations; item response theory (IRT)
Characteristics of a strong test validity program:
• Moderate item internal consistency
• Items/measures are representative of the empirical domain
• Items fit the theoretical structure
Theoretical Domain = Cattell-Horn-Carroll (CHC) theory of cognitive abilities – Gv domain & 3 selected narrow Gv abilities
Gv
Item Scale Development via Rasch technology
Measurement or empirical domain
Rasch scale and evaluate the complete pool of items to develop Norming and Publication tests
Low ability/easy items
High ability/difficult items
Recall that Block Rotation items have 2 possible correct answers. Therefore there is a scoring question:
• Should items be scaled as 0/1 (both answers correct to receive 1)?
• Should items be scaled as 0/1/2?
Item data can be Rasch-scaled with both scoring systems, and the one that provides the best reliability, etc., can then be selected.
We decided to go with the 0/1/2 scoring system
Important understanding regarding 0/1 and multiple point (0/1/2) scoring systems when using Rasch/IRT
Dichotomous (0/1) item scoring: 0 → 1 (one “step”)
Multiple point (0/1/2) item scoring: 0 → 1 → 2 (two “steps”)
Therefore, think of 2-step items as two 0/1 items
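The two-step idea can be made concrete with the Rasch partial credit model, where a 0/1/2 item carries two step difficulties. A minimal sketch; the step values here are hypothetical, not taken from the actual test:

```python
import math

def pcm_probs(theta, steps):
    """Category probabilities for a Rasch partial-credit item.
    steps: step difficulties [d1, d2, ...]; categories run 0..len(steps)."""
    cumulative = [0.0]                      # category 0 has an empty sum
    running = 0.0
    for d in steps:
        running += theta - d                # add (ability - step difficulty)
        cumulative.append(running)
    denom = sum(math.exp(c) for c in cumulative)
    return [math.exp(c) / denom for c in cumulative]

# A 0/1/2 item behaves like two chained 0/1 "steps" (hypothetical step values):
probs = pcm_probs(0.0, [-1.0, 1.0])         # easy first step, harder second
print([round(p, 3) for p in probs])         # [0.212, 0.576, 0.212]
```

With ability midway between the two steps, the middle category (a score of 1) is the most probable outcome, which is exactly the “two 0/1 items” intuition above.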
Think of the items as now having been placed in their proper position on an equal-interval ruler or yardstick; each item is a “tick” mark along the latent trait scale
Rasch IRT “norms” (calibrates) the scale !
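The calibration can be sketched with the dichotomous Rasch model plus the logit-to-W transform. The scaling factor 9.1024 W units per logit (with the scale centered at 500) is the commonly reported W transform; treat it, and the example values, as illustrative assumptions:

```python
import math

def rasch_p(theta, b):
    """P(correct) under the dichotomous Rasch model: person ability theta,
    item difficulty b, both in logits."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def logit_to_w(logit):
    """Map a logit onto the W-scale: centered at 500, 9.1024 W units
    per logit (commonly reported W transform; an assumption here)."""
    return 500.0 + 9.1024 * logit

# When ability equals difficulty, P(correct) is exactly .50
print(rasch_p(0.0, 0.0))    # 0.5
print(logit_to_w(0.0))      # 500.0
```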
A major advantage/feature of a large Rasch IRT-scaled item pool……..
Once you have a large Rasch IRT-scaled item pool, you can develop different and customized scales that place people on the same underlying scale
• CAT (computer adaptive testing)
• Different and unique forms of the test
Norming test, Publication test, and possible special Research Edition tests: all three tests have items on the same scale (W-scale)
Although each test has a different number of items, the obtained person-ability W-scores are equivalent; they differ only in degree of precision (reliability)
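The precision difference can be illustrated from the test information function: the more items near a person’s ability, the smaller the standard error of measurement (SEM). A sketch with hypothetical item difficulties, again assuming 9.1024 W units per logit:

```python
import math

def rasch_p(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def sem_w(theta, difficulties):
    """Approximate standard error of a Rasch ability estimate, in W units,
    from the test information function (sum of p*(1-p) over items).
    Difficulties are in logits; 9.1024 W units per logit is assumed."""
    info = sum(rasch_p(theta, b) * (1.0 - rasch_p(theta, b)) for b in difficulties)
    return 9.1024 / math.sqrt(info)

# Two hypothetical forms drawn from the same calibrated pool: more items
# near the person's ability means more information and a smaller SEM.
short_form = [-2.0, -1.0, 0.0, 1.0, 2.0]              # 5 items
long_form = [x / 2.0 for x in range(-8, 9)]           # 17 items, same span
print(sem_w(0.0, short_form) > sem_w(0.0, long_form)) # True
```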
The average difference or “gap” between items on the respective scales is called “item density”
W-scale is equal interval metric
Easy
Hard
Items are assigned W-difficulties
People are assigned W-ability scores
2 major Rasch results: Rasch puts person ability and item difficulty on the same scale (W scale)
• Person W-ability scores
• Item W-difficulties
Select and order items for the Publication test based on inspection of Rasch results
Block Rotation Norming test (n = 44 items; n = 4,722 norm subjects)
Block Rotation Publication test (n = 37 items; n = 4,722 norm subjects)
Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects)
Measure order and fit statistics table: used to select items with specified item density
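Selecting items to achieve a specified item density can be sketched as a greedy pass over the calibrated pool, keeping only items whose W-difficulties sit at least a minimum gap apart. The pool values below are hypothetical:

```python
def select_by_density(difficulties, min_gap):
    """Greedy pass over a calibrated item pool (W-difficulties): keep an
    item only if it is at least min_gap W units above the last item kept."""
    chosen = []
    for d in sorted(difficulties):
        if not chosen or d - chosen[-1] >= min_gap:
            chosen.append(d)
    return chosen

# Hypothetical pool with some near-duplicate difficulties
pool = [430, 431, 436, 437, 444, 452, 453, 460, 471, 480]
print(select_by_density(pool, 5))   # [430, 436, 444, 452, 460, 471, 480]
```

Near-duplicate items (e.g., 430 vs. 431) add little information at that point on the ruler, so only one of each cluster survives.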
Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects)
Distribution of Block Rotation W-ability scores in the norm sample:
• Complete range (including extremes) of Block Rotation W-scores is 432-546
• Majority of the Block Rotation norm sample obtained W-scores from 480-520
Recall the Block Rotation scoring system is 0/1/2: items have “steps” (0 → 1 → 2, two “steps”), i.e., multiple-point (0/1/2) item scoring
Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects)
Item map with “steps” displayed for items; the blue area represents the majority of norm-sample subjects’ Block Rotation W-scores
Item 1 (0/1/2) step structure: two “steps”
Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects)
Item map with “steps” displayed for items; the blue area represents the majority of norm-sample subjects’ Block Rotation W-scores
• Very good test scale coverage for the majority of the population
• Excellent “bottom” or “floor” for the test scale
• Adequate “top” or “ceiling” for the test scale
Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects)
Item map with “steps” displayed for items; the red area represents the complete range (including extremes) of sample Block Rotation W-scores
Good test scale coverage for the complete range of the population
BLKROT: Floor (RS = 1) & ceiling (RS = max) plot
[Figure: reference W +/- 3 SDs (W-scale, approx. 430-550) plotted against chronological age in months (0-300)]
Block Rotation Rasch floor/ceiling results confirmed by formal +/- 3 SD floor/ceiling analysis (24-300 months of age)
Block Rotation Rasch floor/ceiling results confirmed by formal +/- 3 SD floor/ceiling analysis (300-1200 months of age)
BLKROT: Floor (RS = 1) & ceiling (RS = max) plot
[Figure: reference W +/- 3 SDs (W-scale, approx. 430-550) plotted against chronological age in months (300-1200)]
2 major Rasch results: person W-ability scores and item W-difficulties
Block Rotation Norming test (n = 44 items; n = 4,722 norm subjects); Block Rotation Publication test (n = 37 items; n = 4,722 norm subjects)
The program generates the final raw score (RS) to W-ability scoring table
Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects)
Raw score to W-score “scoring table”
Note: Total raw score points is 74 for 37 items. These are 2-step items: 37 items × 2 steps = 74 total possible points
Block Rotation Norming Test (n = 44 items): 44 items × 2 steps = raw scores from 0 to 88 on the Rasch-based scoring table (the equal-interval Visualization-Vz measurement “ruler” or “yardstick”)

Raw Score   W-score
88          545.7
87          539.0
…           …
1           437.8
0           431.6
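A raw-score-to-W table of this kind can be generated by inverting the test characteristic curve (the expected raw score as a function of ability). A sketch with three hypothetical 2-step items, treating each step as a dichotomous sub-item and solving by bisection; the extreme scores of 0 and max, which real norming programs handle by extrapolation, are skipped here:

```python
import math

def expected_score(theta, items):
    """Expected raw score at ability theta; each item is a list of step
    difficulties (logits), each step treated as a dichotomous sub-item."""
    return sum(1.0 / (1.0 + math.exp(-(theta - d)))
               for steps in items for d in steps)

def score_table(items):
    """Map each interior raw score to a W value by bisecting the test
    characteristic curve (extreme scores 0 and max are skipped; real
    programs handle them by extrapolation)."""
    max_rs = sum(len(steps) for steps in items)
    table = {}
    for rs in range(1, max_rs):
        lo, hi = -6.0, 6.0
        for _ in range(60):                 # bisection to high precision
            mid = (lo + hi) / 2.0
            if expected_score(mid, items) < rs:
                lo = mid
            else:
                hi = mid
        table[rs] = 500.0 + 9.1024 * (lo + hi) / 2.0
    return table

# Three hypothetical 2-step items: raw scores run 0..6, table covers 1..5
tbl = score_table([[-1.5, -0.5], [-0.5, 0.5], [0.5, 1.5]])
print({rs: round(w, 1) for rs, w in tbl.items()})
```

Because the hypothetical items are symmetric about zero logits, the middle raw score maps to W = 500, the center of the scale.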
Block Rotation Norming test (n = 44 items):

Raw Score   W-score
88          545.7
87          539.0
…           …
1           437.8
0           431.6

Block Rotation Publication test (n = 37 items):

Raw Score   W-score
74          545.7
73          539.0
…           …
1           437.8
0           431.6

The Block Rotation Norming and Publication tests, although they have different numbers of items (and total raw scores), are on the same underlying measurement scale (ruler)
2 major Rasch results: person W-ability scores and item W-difficulties
The program generates the final raw score to W-ability scoring table for the Block Rotation Norming test (n = 44 items; n = 4,722 norm subjects) and the Block Rotation Publication test (n = 37 items)
Result: All norm subjects with Block Rotation scores (n = 4,722) now have scores on the equal-interval W-scale. These Block Rotation W-scores are then used for developing test “norms” and completing technical manual analyses and validity research
[Figure: graphic display of the distribution of Block Rotation person abilities, W-scores 432-546]
These Block Rotation W-scores are then used for developing test “norms” and validity research
Block Rotation Summary: Final Rasch for Publication test, graphic item map (n = 37 norming items, 0-74 RS points; n = 4,722 norm subjects)
[Figure: Publication test W-score scale]
Recall the early warning to expect the unexpected and the non-linear “art and science” of test development
Last minute question raised (prior to formal production) of Block Rotation test:
Should the blocks be shaded/colored instead of being black and white?
Would adding shading/color change the nature of the task?
What to do?
Answer: Do a study; gather some empirical data to help make the decision. The question should be answered empirically; you should not assume that colorizing items will make no difference
Special Block Rotation no-color vs color group administration study completed
Sample size plan: approximately 300+ subjects in 3 groups spanning the complete range of Block Rotation ability
• 2nd-4th graders: approx. 100+
• 7th-11th graders: approx. 100+
• College students: approx. 100+
Final total sample was 380 subjects
Group administration version of test
Two forms of the test constructed from the complete set of ordered (scaled) items:
• White version: even items
• Colored version: odd items
Analyses – Rasch analysis and comparison of respective item difficulties and mean score comparison between versions
Conclusion – adding color did NOT change the psychometric characteristics of the items/test – therefore print the final test with colored items
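One way to run such an invariance comparison is to center each calibration of the same items and flag any item whose relative W-difficulty shifts beyond a tolerance. The difficulties below are hypothetical, not the actual study data:

```python
def invariance_check(cal_a, cal_b, tol=3.0):
    """Compare two calibrations of the same items (e.g., no-color vs. color
    administration) after centering each set; return indices of items whose
    relative W-difficulty shifted by more than tol W units."""
    mean_a = sum(cal_a) / len(cal_a)
    mean_b = sum(cal_b) / len(cal_b)
    return [i for i, (a, b) in enumerate(zip(cal_a, cal_b))
            if abs((a - mean_a) - (b - mean_b)) > tol]

# Hypothetical W-difficulties under the two conditions
no_color = [470.0, 485.0, 500.0, 515.0, 530.0]
color = [471.0, 484.0, 501.0, 514.0, 531.0]
print(invariance_check(no_color, color))   # [] -> no items flagged
```

An empty flag list (plus comparable mean scores) is the pattern consistent with the study’s conclusion that color did not change the items’ psychometric characteristics.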
Two sample items
Final Block Rotation Publication Test constructed: n = 37 (0/1/2) items; raw scores from 0-74
Rasch (IRT) is a magnificent tool for evaluating and constructing tests with flexibility during the entire process. Embrace IRT methods (vs. CTT methods) in applied test development.
Important to remember: you are calibrating the scale, not norming the test, during this phase. Samples with rectangular distributions of ability are critical.
Carefully inspect the Rasch results (especially the measure order table) and determine whether you have enough easy and difficult items or need more items at certain places along the scale. Then use “linking/anchor” technology to add in new items.
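Linking/anchor technology can be sketched as computing a shift constant from items common to the master pool and a new calibration, then applying it to the newly written items. All values below are hypothetical and assumed to be in W units:

```python
def link_constant(anchor_master, anchor_new):
    """Mean difficulty difference over anchor items: adding this constant
    shifts a new calibration onto the master-pool W scale."""
    diffs = [m - n for m, n in zip(anchor_master, anchor_new)]
    return sum(diffs) / len(diffs)

def link_new_items(new_item_difficulties, constant):
    """Place newly written items on the master scale via the link constant."""
    return [d + constant for d in new_item_difficulties]

# Hypothetical anchors: master-pool values vs. the new run's values (W units)
master = [480.0, 500.0, 520.0]
new_run = [460.0, 479.0, 501.0]
c = link_constant(master, new_run)
print(c)                                  # 20.0
print(link_new_items([470.0, 490.0], c))  # [490.0, 510.0]
```

A simple mean shift suffices under the Rasch model because the logit (and hence W) metric is determined up to an additive constant; more elaborate equating is needed only if the anchors misbehave.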
Item fit is a relative matter involving “reasonably acceptable approximate fit.” Don’t blindly follow black-and-white item fit rules from textbooks and articles; the “real world” of test development is not an ivory tower exercise. Follow the three basic Rasch assumptions (unidimensionality; equal discrimination; local independence) “within reason” (Woodcock).
Many tests claim to use the Rasch model (Rasch “name dropping”) but use it only for item analyses and do not harness the advantages of the underlying Rasch ability scale (e.g., the W-scale) for improved test construction and score interpretation procedures (e.g., RPIs).
Maintaining a master item pool supports:
• Norming-calibration tests
• Linking/equating (alternate forms) tests
• Adding new items to the master item pool (use of anchor items from the master item pool)
• Checking for possible item bias (DIF: differential item functioning)
• Creating and using shortened special-purpose versions of tests (norming tests; research edition tests; tests for special populations)
• Flagging potentially poor examiners via empirical “person fit” statistics reports
• Computer adaptive testing (CAT)
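As an illustration of the last point, a CAT repeatedly draws the unused item closest to the current ability estimate (where a Rasch item is most informative) and updates the estimate after each response. A deliberately crude sketch; real CATs re-estimate ability by maximum likelihood or Bayesian methods, and the pool difficulties here are hypothetical logits:

```python
def next_item(theta, pool, used):
    """Pick the unused item whose difficulty is closest to the current
    ability estimate (a Rasch item is most informative when b = theta)."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return min(candidates, key=lambda i: abs(pool[i] - theta))

def update_theta(theta, correct, step=0.5):
    """Deliberately crude fixed-step update; real CATs re-estimate ability
    after each response (ML/EAP)."""
    return theta + step if correct else theta - step

# Tiny simulation against a hypothetical calibrated pool (logits)
pool = [-2.0, -1.0, 0.0, 1.0, 2.0]
theta, used = 0.0, set()
for correct in [True, True, False]:       # hypothetical response pattern
    item = next_item(theta, pool, used)
    used.add(item)
    theta = update_theta(theta, correct)
print(round(theta, 2), sorted(used))      # 0.5 [2, 3, 4]
```

This is only possible because the pool was Rasch-calibrated first: every item already sits on the common W ruler, so examinees taking different item sequences still land on the same scale.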
End of Part C
Additional steps in test development process will be presented in subsequent modules as they are developed