Transcript of Class 4: Psychometric Characteristics Part I: Sources of Error, Variability, Reliability, Interpretability
1
Class 4
Psychometric Characteristics Part I: Sources of Error, Variability, Reliability, Interpretability
October 12, 2006
Anita L. Stewart, Institute for Health & Aging
University of California, San Francisco
2
Overview of Class 4
Concepts of error
Basic psychometric characteristics
– Variability
– Reliability
– Interpretability
3
Components of an Individual’s Observed Item Score
(NOTE: Simplistic view)
Observed item score = true score + error
4
Components of Variability in Item Scores of a Group of Individuals
Total variance = true score variance + error variance
(Total variance is the variation across all individuals' observed item scores)
5
Combining Items into Multi-Item Scales
When items are combined into a scale score, error cancels out to some extent
– Error variance is reduced as more items are combined
– As random error is reduced, the proportion of "true score" variance increases
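To make the error-cancellation idea concrete, here is a small simulation sketch (not from the slides; the true-score and error distributions are made up) showing the true-score share of total variance rising as more items are averaged:

```python
# Sketch: averaging more items shrinks error variance, so the
# true-score share of total variance grows. All numbers are invented.
import random
import statistics

random.seed(0)
N_PEOPLE = 2000
true_scores = [random.gauss(50, 10) for _ in range(N_PEOPLE)]

def scale_scores(k):
    """Average k items, each = true score + independent random error (SD 8)."""
    return [statistics.mean(t + random.gauss(0, 8) for _ in range(k))
            for t in true_scores]

true_var = statistics.variance(true_scores)
for k in (1, 4, 16):
    total_var = statistics.variance(scale_scores(k))
    print(f"{k:2d} items: total variance {total_var:6.1f} "
          f"(true-score share ~{true_var / total_var:.2f})")
```

With 1 item most of the observed variance is error; by 16 items the error contribution is small, which is exactly why multi-item scales are preferred.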
6
Sources of Error
Subjects
Observers or interviewers
Measure or instrument
7
Measuring Weight in Pounds of Children: Weight without shoes
An individual's observed score is a linear combination of many sources of variation
8
Measuring Weight in Pounds of Children: Weight without shoes
Observed weight = true weight + weight of clothes + amount of water drunk in the past 30 min + imprecision of the person weighing the children + miscalibration of the scale
9
Measuring Weight in Pounds of Children: Weight without shoes
Observed weight (83 lbs) = true weight (80 lbs) + amount of water in the past 30 min (+.25 lb) + weight of clothes (+.75 lb) + imprecision of the person weighing (+1 lb) + scale miscalibration (+1 lb)
83 = 80 + .25 + .75 + 1 + 1
10
Sources of Error
Weight of clothes – subject source of error
Person weighing child is not precise – observer source of error
Scale is miscalibrated – instrument source of error
11
Measuring Depressive Symptoms in Asian and Latino Men
Observed depression score = "true" depression + difficulty choosing one number on the 1-6 response choice scale + unwillingness to tell the interviewer + measure not culturally sensitive
12
Measuring Depressive Symptoms in Asian and Latino Men
Observed depression score (13) = "true" depression (16) + difficulty choosing one number on the 1-6 response scale (+2) + unwillingness to tell the interviewer (-3) + measure not culturally sensitive (-2)
13 = 16 + 2 - 3 - 2
13
Return to Components of an Individual’s Observed Item Score
Observed item score = true score + error
14
Components of an Individual’s Observed Item Score
Observed item score = true score + error (random + systematic)
15
Sources of Error in Measuring Weight
Weight of clothes – subject source of random error
Scale is miscalibrated – instrument source of systematic error
Person weighing child is not precise – observer source of random error
16
Sources of Error in Measuring Depression
Hard to choose one number on the 1-6 response scale – subject source of random error
Unwillingness to tell the interviewer – subject source of systematic error (underreporting true depression)
Instrument is not culturally sensitive (missing some components) – instrument source of systematic error
17
Memory Errors – From Cognitive Psychology
Error remembering “when” and “how often” something occurred within some time frame
Memory and emotion – we tend to remember:
– positive more than negative experiences
– emotionally intense more than neutral experiences
Memory for threatening, sensitive events is more error prone than non-threatening events
AA Stone et al. (eds), The Science of Self-Report, London: Lawrence Erlbaum, 2000.
18
Overview
Concepts of error
Basic psychometric characteristics
– Variability
– Reliability
– Interpretability
19
Variability
Good variability
– All (or nearly all) scale levels are represented
– Distribution approximates a bell-shaped normal
Variability is a function of the sample
– Need to understand the variability of the measure of interest in a sample similar to the one you are studying
Review criteria
– Adequate variability in a range that is relevant to your study
20
Common Indicators of Variability
Range of scores (possible, observed)
Mean, median, mode
Standard deviation (standard error)
Skewness
% at floor (lowest score)
% at ceiling (highest score)
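These indicators are straightforward to compute; a sketch with made-up scores on a hypothetical scale with a possible range of 0-30:

```python
# Sketch: the slide's variability indicators for invented scale scores
# on a hypothetical 0-30 measure.
import statistics

scores = [0, 1, 2, 2, 3, 4, 5, 5, 5, 7, 9, 12, 15, 23]
lo, hi = 0, 30                      # possible range

print("observed range:", min(scores), "-", max(scores))
print("mean:", round(statistics.mean(scores), 2))
print("median:", statistics.median(scores))
print("mode:", statistics.mode(scores))
print("SD:", round(statistics.stdev(scores), 2))
print("% at floor:", round(100 * scores.count(lo) / len(scores), 1))
print("% at ceiling:", round(100 * scores.count(hi) / len(scores), 1))
```

Note how the observed range (0-23) falls short of the possible range (0-30), the same pattern the next slide's CES-D example describes.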
21
Range of Scores
Especially important for multi-item measures
Possible and observed ranges can differ. Example:
– CES-D possible range is 0-30
– Wong et al. study of mothers of young children: observed range was 0-23
» missing the entire high end of the distribution (none had high levels of depression)
22
Mean, Median, Mode
Mean - average
Median - midpoint
Mode - most frequent score
In normally distributed measures, these are all the same
In non-normal distributions, they will vary
23
Mean and Standard Deviation
Most information on variability comes from the mean and standard deviation
– Can envision how scores are distributed across the possible range
24
Normal Distributions(Or Approximately Normal)
Mean and SD tell the entire story of the distribution
± 1 SD on each side of the mean covers about 68% of the scores
25
Skewness
Positive skew - scores bunched at the low end, long tail to the right
Negative skew - opposite pattern
Coefficient ranges from -infinity to +infinity
– the closer to zero, the more normal
Can test whether the skewness coefficient is significantly different from zero
– the test thus depends on sample size
Coefficients beyond ±2.0 are cause for concern
26
Skewed Distributions
Mean and SD are not as useful
– SD often extends beyond the maximum or minimum possible score
27
Ceiling and Floor Effects: Similar to Skewness Information
Ceiling effects: a substantial number of people get the highest possible score
Floor effects: the opposite
Not very meaningful for continuous scales
– there will usually be very few at either end
More helpful for single-item measures or coarse scales with only a few levels
28
… to what extent did health problems limit you in everyday physical activities (such as walking and climbing stairs)?
[Bar chart of response percentages, Not at all through Extremely: 49% not limited at all (can't improve)]
30
SF-36 Variability Information in Patients with Chronic Conditions (N=3,445)
                 Physical   Role-      Mental   Vitality
                 function   physical   health   (energy)
Possible range   0-100      0-100      0-100    0-100
Mean             80         75         71       54
SD               27         41         21       22
Skewness         -.99       -.26       -.83     -.24
% floor          <1         24         <1       <1
% ceiling        19         37         4        <1
McHorney C et al. Med Care. 1994;32:40-66.
32
Reasons for Poor Variability
Low variability in the construct being measured in that "sample" (true low variation)
Items not adequately tapping the construct
– If only one item, especially hard
Items not detecting important differences in the construct at one or the other end of the continuum
Solution if one is in the process of developing measures: add items
33
Advantages of multi-item scales revisited
Using multi-item scales minimizes the likelihood of ceiling/floor effects
When items are skewed, the multi-item scale "normalizes" the skew
34
Percent with Highest (Best) Score:MOS 5-Item Mental Health Index
Items (6-point scale, all of the time to none of the time):
– Very nervous person - 34% none of the time
– Felt calm and peaceful - 4% all of the time
– Felt downhearted and blue - 33% none of the time
– Happy person - 10% all of the time
– So down in the dumps nothing could cheer you up - 63% none of the time
Summated 5-item scale (0-100 scale)
– Only 5% had the highest score
Stewart A. et al., MOS book, 1992
35
Overview
Concepts of error
Basic psychometric characteristics
– Variability
– Reliability
– Interpretability
36
Reliability
Extent to which an observed score is free of random error
– Produces the same score each time it is administered (all else being equal)
Population-specific; reliability increases with:
– sample size
– variability in scores (dispersion)
– a person's level on the scale
37
Components of Variability in Item Scores of a Group of Individuals
Total variance = true score variance + error variance
(Total variance is the variation across all individuals' observed item scores)
38
Reliability Depends on True Score Variance
Reliability is a group-level statistic
Reliability = 1 - (error variance / total variance)
Reliability is the proportion of variance due to true score:
Reliability = true score variance / total variance
39
Reliability Depends on True Score Variance
Reliability = (total variance - error variance) / total variance
            = proportion of variance due to true score
e.g., .70 = 100% - 30%
40
Reliability Depends on True Score Variance
A reliability of .70 means 30% of the variance in the observed score is explained by error
Reliability = (total variance - error variance) / total variance
            = proportion of variance due to true score
41
Importance of Reliability
Necessary for validity (but not sufficient)
– Low reliability attenuates correlations with other variables (harder to detect true correlations among variables)
– May conclude that two variables are not related when they are
Greater reliability, greater power
– Thus the more reliable your scales, the smaller the sample size you need to detect an association
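The attenuation effect can be quantified with the correction-for-attenuation formula from classical test theory (standard, though not stated on the slide): the observed correlation equals the true correlation shrunk by the square roots of the two measures' reliabilities.

```python
# The classical attenuation formula: observed r = true r * sqrt(rel_x * rel_y).
# The true correlation of 0.50 is an invented illustration.
import math

def attenuated_r(true_r, rel_x, rel_y):
    """Expected observed correlation given the two measures' reliabilities."""
    return true_r * math.sqrt(rel_x * rel_y)

true_r = 0.50
for rel in (1.0, 0.9, 0.7, 0.5):
    print(f"reliability {rel:.1f} on both measures -> "
          f"observed r = {attenuated_r(true_r, rel, rel):.2f}")
```

With reliability .50 on both measures, a true correlation of .50 shows up as only .25, which is why unreliable scales make true associations easy to miss.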
42
Reliability Coefficient
Typically ranges from .00 to 1.00
Higher scores indicate better reliability
43
How Do You Know if a Scale or Measure Has Adequate Reliability?
Adequacy of reliability is judged according to standard criteria
– Criteria depend on the type of coefficient
44
Types of Reliability Tests
Internal-consistency
Test-retest
Inter-rater
Intra-rater
45
Internal Consistency Reliability: Cronbach’s Alpha
Requires multiple items, all supposedly measuring the same construct
Reflects the extent to which all items measure the same construct (same latent variable)
46
Internal-Consistency Reliability
For multi-item scales
Cronbach's alpha
– for ordinal scales
Kuder-Richardson 20 (KR-20)
– for dichotomous items
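Cronbach's alpha can be computed directly from its standard formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores); a sketch with made-up item responses:

```python
# Sketch: Cronbach's alpha from scratch on invented data
# (rows = respondents, columns = items of one scale).
import statistics

items = [  # 6 respondents x 4 items
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [1, 2, 1, 1],
    [4, 3, 4, 4],
    [2, 3, 2, 2],
]

k = len(items[0])
item_vars = [statistics.variance([row[i] for row in items]) for i in range(k)]
total_var = statistics.variance([sum(row) for row in items])
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")
```

When the items covary strongly (as in this invented example), total-score variance far exceeds the sum of item variances and alpha approaches 1.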
47
Minimum Standardsfor Internal Consistency Reliability
For group comparisons (e.g., regression, correlational analyses)
– .70 or above is the minimum (Nunnally, 1978)
– .80 is optimal
– above .90 is unnecessary
For individual assessment (e.g., treatment decisions)
– .90 or above (.95 is preferred) (Nunnally, 1978)
48
Internal-Consistency Reliability Can be Spurious
Based only on those who answered all questions in the measure
– If many people have trouble with the items and skip some, they are not included in the test of reliability
49
Internal-Consistency Reliability is a Function of Number of Items in Scale
Increases with the number of items
Very large scales (20 or more items) can have high reliability without other good scaling properties
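The Spearman-Brown prophecy formula (standard in classical test theory, though not named on the slide) shows this item-count effect directly; here the single-item reliability of .30 is an invented starting point:

```python
# Spearman-Brown: reliability of k parallel items, each with the same
# single-item reliability. The 0.30 starting value is made up.
def spearman_brown(rel_one_item, k):
    """Predicted reliability of a scale built from k parallel items."""
    return k * rel_one_item / (1 + (k - 1) * rel_one_item)

for k in (1, 5, 10, 20):
    print(f"{k:2d} items -> reliability {spearman_brown(0.30, k):.2f}")
```

Even weak items reach alpha-like reliability near .90 by 20 items, which is exactly why a high coefficient on a very long scale is not by itself evidence of good scaling.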
50
Example: 20-item Beck Depression Inventory (BDI)
BDI 1978 version (past week)
– reliability .86
– yet 3 items correlated < .30 with the other items in the scale
Beck AT et al. J Clin Psychol. 1984;40:1365-1367
51
Test-Retest Reliability
Repeat the assessment on individuals who are not expected to change
Time between assessments should be:
– Short enough so no change occurs
– Long enough so subjects don't recall their first response
Coefficient is a correlation between the two measurements
For single-item measures, the only way to test reliability
52
Appropriate Test-Retest Coefficients by Type of Measure
Continuous scales (ratio or interval scales, multi-item Likert scales):
– Pearson
Ordinal or non-normally distributed scales:
– Spearman
– Kendall's tau
Dichotomous (categorical) measures:
– Phi
– Kappa
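For a continuous scale, the test-retest coefficient is simply a Pearson correlation between the two administrations; a from-scratch sketch with made-up scores:

```python
# Sketch: Pearson test-retest correlation on invented scores from two
# administrations of the same continuous scale.
import math
import statistics

time1 = [12, 15, 9, 20, 18, 11, 14, 16]
time2 = [13, 14, 10, 19, 17, 12, 15, 15]

def pearson(x, y):
    """Pearson correlation computed from deviations about the means."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))

print(f"test-retest r = {pearson(time1, time2):.2f}")
```

As the next slides stress, the size of this coefficient (ideally >.70), not its statistical significance, is what indicates adequate reliability.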
53
Minimum Standards for Test-Retest Reliability
Significance of a test-retest correlation has NOTHING to do with the adequacy of the reliability
Criteria: similar to those for internal consistency
– >.70 is desirable
– >.80 is optimal
54
Observer or Rater Reliability
Inter-rater reliability (across two or more raters)
– Consistency (correlation) between two or more observers rating the same subjects (at one point in time)
Intra-rater reliability (within one rater)
– A test-retest within one observer
– Correlation among repeated values obtained by the same observer (over time)
55
Observer or Rater Reliability
Sometimes Pearson correlations are used - correlate one observer with another
– Assesses association only
.65 to .95 are typical correlations
>.85 is considered acceptable (McDowell and Newell)
56
Association vs. Agreement When Correlating Two Times or Ratings
Association is the degree to which one score linearly predicts the other score
Agreement is the extent to which the same score is obtained on the second measurement (retest, second observer)
Can have high correlation and poor agreement
– If the second score is consistently higher for all subjects, one can obtain a high correlation
– A second test of mean differences is needed
57
Hypothetical Scores on 4 Subjects by 2 Observers
[Chart: scores (1-7) for subjects S1-S4 as rated by the two observers]
58
Example of Association and Agreement
Scores by observer 1 are exactly 2 points above scores by observer 2
– Correlation (association) would be perfect (r = 1.0)
– Agreement is poor (no agreement on any score; a difference of 2 between the scores on each subject)
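The slide's 2-point-offset example is easy to verify in code; the four observer-2 scores are made up:

```python
# Sketch of the slide's point: observer 1 scores exactly 2 points above
# observer 2, so linear association is perfect but agreement is zero.
import math
import statistics

observer2 = [3, 5, 2, 6]                 # invented ratings
observer1 = [s + 2 for s in observer2]   # consistently 2 points higher

def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))

r = pearson(observer1, observer2)
exact_agreement = sum(a == b for a, b in zip(observer1, observer2)) / len(observer1)
print(f"association r = {r:.2f}, exact agreement = {exact_agreement:.0%}")
```

A correlation alone cannot catch the systematic offset; a test of mean differences (e.g., a paired t-test) is also needed.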
59
Intraclass Correlation Coefficient for Testing Inter-rater Reliability (Kappa)
Coefficient indicates the level of agreement of two or more judges, exceeding that which would be expected by chance
Appropriate for dichotomous (categorical) scales and ordinal scales
Several forms of kappa:
– e.g., Cohen's kappa is for 2 judges and a dichotomous scale
Sensitive to the number of observations and the distribution of the data
60
Interpreting Kappa: Level of Reliability
<0.00        Poor
.00 - .20    Slight
.21 - .40    Fair
.41 - .60    Moderate
.61 - .80    Substantial
.81 - 1.00   Almost perfect
.60 or higher is acceptable (Landis, 1977)
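A minimal from-scratch sketch of Cohen's kappa for two judges rating a dichotomous item; the ratings are invented. Kappa is observed agreement minus chance agreement, divided by the maximum possible improvement over chance:

```python
# Sketch: Cohen's kappa for two judges, dichotomous ratings (invented data).
def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal proportions
    expected = 0.0
    for category in set(rater_a) | set(rater_b):
        expected += (rater_a.count(category) / n) * (rater_b.count(category) / n)
    return (observed - expected) / (1 - expected)

a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
b = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
print(f"kappa = {cohens_kappa(a, b):.2f}")
```

Here raw agreement is 80%, but chance agreement from the marginals is 52%, so kappa lands in the "moderate" band of the table above 0.60 is not reached despite high raw agreement.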
61
Reliable Scale?
NO! There is no such thing as a "reliable" scale
We accumulate "evidence" of reliability in a variety of populations in which the scale has been tested
62
Reliability Often Poorer in Lower SES Groups
More random error due to:
– Reading problems
– Difficulty understanding complex questions
– Unfamiliarity with questionnaires and surveys
63
Advantages of multi-item scales revisited
Using multi-item scales improves reliability
Random error is “canceled out” across multiple items
64
Overview
Concepts of error
Basic psychometric characteristics
– Variability
– Reliability
– Interpretability
65
Interpretability of Scale Scores: What does a Score Mean?
Meaning of scores
What are the endpoints?
Direction of scoring - what does a high score mean?
Compared to norms - is the score average, low, or high?
Single items are more easily interpretable; multi-item scale scores have no inherent meaning
66
Endpoints
What are the minimum and maximum possible scores?
– To enable interpretation of the mean score
Endpoints of summated scales depend on the number of items & number of response choices
– 5 items, 4 response choices = 5 - 20
– 3 items, 5 response choices = 3 - 15
67
Direction of Scoring
What does a high score mean?
Where in the range does the mean score lie?
– Toward the top, the bottom?
– In the middle?
68
Descriptive Statistics for 3193 Women
           M (SD)       Min    Max
Age        46.2 (2.7)   44.0   52.9
Activity    7.7 (1.8)    3.0   14.0
Stress      8.6 (2.9)    4.0   19.0
Avis NE et al. Med Care, 2003;41:1262-1276
69
Sample Results: Mean Scores in a Sample of Older Adults
                       Mean
Physical functioning   45.0
Sleep                  28.1
Disability             35.7
70
Example of Table Labeling Scores: Making it Easier to Interpret
                       Mean*
Physical functioning   45.0
Sleep                  28.1
Disability             35.7
* All scores 0-100
71
Example of Table Labeling Scores: Making it Easier to Interpret
                           Mean*
Physical functioning (+)   45.0
Sleep (-)                  28.1
Disability (-)             35.7
* All scores 0-100; (+) indicates a higher score is better health; (-) indicates a lower score is better health
72
Solutions
Can include (+) or (-) in the label
– Or label the scale so that a higher score means more of the "label"
Can easily put the score range next to each label if ranges differ within one table
73
Mean Has to be Interpreted Within the Possible Range
                                      M      SD
Parents' harsh discipline practices*
  Interviewers' ratings of mother     2.55   .74
  Husbands' reports of wife           5.32   3.30
*Note: a high score indicates harsher practices
74
Mean Has to be Interpreted Within the Possible Range
                                          M      SD
Parents' harsh discipline practices*
  Interviewers' ratings of mother (1-5)   2.55   .74
  Husbands' reports of wife (1-7)         5.32   3.30
*Note: a high score indicates harsher practices
75
Mean Has to be Interpreted Within the Possible Range
                                          M      SD
Parents' harsh discipline practices*
  Interviewers' ratings of mother (1-5)   2.55   .74
  Husbands' reports of wife (1-7)         5.32   3.30
[Number lines: 2.55 marked on the interviewer's 1-5 range; 5.32 marked on the husband's 1-7 range]
*Note: a high score indicates harsher practices
76
Mean Has to be Interpreted Within the Possible Range: Adding SD Information
                                          M      SD
Parents' harsh discipline practices*
  Interviewers' ratings of mother (1-5)   2.55   .74
  Husbands' reports of wife (1-7)         5.32   3.30
[Number lines: each mean plotted on its possible range, with bars showing ± 1 SD]
*Note: a high score indicates harsher practices
77
Transforming a Summated Scale to 0-100 Scale
Works with any ordinal or summated scale
Transforms it so 0 is the lowest possible score and 100 is the highest possible
Eases interpretation across numerous scales

0-100 score = 100 x (observed score - minimum possible score) / (maximum possible score - minimum possible score)
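The slide's transformation as a small function; the example scale range of 5-20 (i.e., 5 items with 4 response choices, from the earlier endpoints slide) is used for illustration:

```python
# The slide's 0-100 linear transformation of a summated scale score.
def to_0_100(score, min_possible, max_possible):
    """Rescale a summated score so min_possible -> 0 and max_possible -> 100."""
    return 100 * (score - min_possible) / (max_possible - min_possible)

# Example: a 5-item scale with 4 response choices has possible range 5-20
for raw in (5, 12, 20):
    print(f"raw {raw:2d} -> {to_0_100(raw, 5, 20):.1f}")
```

The endpoints map to 0 and 100 exactly, so means from scales with different raw ranges become directly comparable in one table.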
78
Homework for Next Class
Complete the rows in the matrix for your two measures
– Rows 13-18: Nature of samples on which it has been tested, data quality
– Rows 19-26: Variability, reliability, interpretability
79
Next Class (Class 5)
Guest lecture: Steve Gregorich
Factor analysis
80
Two Readings for Next Week
Selected by Steve Gregorich
– Kline
– Mulaik
Suggest reading them ahead to be able to ask questions