Reliability, Validity, & Scaling

Reliability
• Repeatedly measure unchanged things.
• Do you get the same measurements?
• Charles Spearman, Classical Measurement Theory.
• If perfectly reliable, then the correlation between true scores and measurements = +1.
• r < 1 because of random error.
• Random error is symmetrically distributed about 0.

True Scores and Measurements
• Reliability is the squared correlation between true scores and measurement scores.

• Reliability is the proportion of the variance in the measurement scores that is due to differences in the true scores rather than due to random error.

$$\sigma_M^2 = \sigma_T^2 + \sigma_E^2 \qquad r_{XX} = \frac{\sigma_T^2}{\sigma_M^2} = r_{TM}^2$$

• Systematic error
  – not random
  – measuring something else, in addition to the construct of interest
• Reliability cannot be known, but it can be estimated.
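The variance decomposition above can be checked with a small simulation. This is only a sketch under assumed values: the true-score variance (100), the error variance (25), the sample size, and the helper name `pearson_r` are all illustrative choices, not part of the original slides.

```python
import random
import statistics

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000

# Assumed population values: var(T) = 100, var(E) = 25, error centered on 0.
true_scores = [rng.gauss(50, 10) for _ in range(n)]
errors = [rng.gauss(0, 5) for _ in range(n)]
measured = [t + e for t, e in zip(true_scores, errors)]

# Reliability as a proportion of variance: var(T) / var(M).
var_ratio = statistics.pvariance(true_scores) / statistics.pvariance(measured)

# Reliability as the squared correlation between true and measured scores.
r_tm_squared = pearson_r(true_scores, measured) ** 2

# Value implied by the assumed variances: 100 / (100 + 25).
theoretical = 100 / (100 + 25)

print(round(var_ratio, 3), round(r_tm_squared, 3), theoretical)
```

In real data the true scores are unobserved, which is why reliability has to be estimated by the indirect methods that follow.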

Test-Retest Reliability
• Measure subjects at two points in time.
• Correlate (r) the two sets of measurements.
• r of .7 is OK for research instruments.
• Need it higher for practical applications and important decisions.
• M and SD should not vary much from Time 1 to Time 2, usually.
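A minimal sketch of a test-retest check, using made-up scores for eight subjects (the data and the helper name `pearson_r` are hypothetical). It reports r along with the mean and SD at each time point.

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical scores for the same eight subjects at two points in time.
time1 = [12, 15, 9, 20, 17, 11, 14, 18]
time2 = [13, 14, 10, 19, 18, 10, 15, 17]

retest_r = pearson_r(time1, time2)
mean1, mean2 = statistics.fmean(time1), statistics.fmean(time2)
sd1, sd2 = statistics.stdev(time1), statistics.stdev(time2)

print(f"test-retest r = {retest_r:.3f}")
print(f"Time 1: M = {mean1:.2f}, SD = {sd1:.2f}")
print(f"Time 2: M = {mean2:.2f}, SD = {sd2:.2f}")
```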

Alternate/Parallel Forms
• Estimate reliability with r between forms.
• M and SD should be the same for both forms.
• Pattern of correlations with other variables should be the same for both forms.

Split-Half Reliability

• Divide items into two random halves.
• Score each half.
• Correlate the half scores.
• Get the half-test reliability coefficient, r_hh.
• Correct with Spearman-Brown:

$$r_{sb} = \frac{2\,r_{hh}}{1 + r_{hh}}$$
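The split-half steps can be sketched as follows. The item data are simulated (each item is an assumed common trait plus random noise), and the names `pearson_r` and `spearman_brown` are illustrative helpers, not part of any particular package.

```python
import random
import statistics

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman_brown(r_hh):
    """Step a half-test reliability up to full-test length."""
    return 2 * r_hh / (1 + r_hh)

rng = random.Random(1)
n_people, n_items = 500, 10

# Simulated responses: item score = trait + noise (assumed noise SD of 1.5).
data = []
for _ in range(n_people):
    trait = rng.gauss(0, 1)
    data.append([trait + rng.gauss(0, 1.5) for _ in range(n_items)])

# Divide the items into two random halves.
items = list(range(n_items))
rng.shuffle(items)
half_a, half_b = items[:n_items // 2], items[n_items // 2:]

# Score each half, correlate the half scores, then apply the correction.
score_a = [sum(row[i] for i in half_a) for row in data]
score_b = [sum(row[i] for i in half_b) for row in data]
r_hh = pearson_r(score_a, score_b)
r_sb = spearman_brown(r_hh)

print(f"r_hh = {r_hh:.3f}, r_sb = {r_sb:.3f}")
```

A different random split would give a somewhat different r_sb.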

Cronbach’s Coefficient Alpha

• The obtained value of r_sb depends on how you split the items into halves.
• Find r_sb for all possible pairs of split halves.
• Compute the mean of these.
• But you don’t really compute it this way.
• This is a lower bound for the true reliability.
• That is, it underestimates true reliability.
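In practice alpha is usually computed from the item variances and the variance of the total score, alpha = k/(k − 1) × (1 − Σ item variances / total variance). The sketch below applies that standard formula to simulated data; the function name `cronbach_alpha` and the simulation settings are assumptions made for illustration.

```python
import random
import statistics

def cronbach_alpha(rows):
    """Coefficient alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_vars = [statistics.variance([row[i] for row in rows]) for i in range(k)]
    total_var = statistics.variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

rng = random.Random(2)

# Simulated responses: 500 people, 10 items, item = trait + noise (assumed).
rows = []
for _ in range(500):
    trait = rng.gauss(0, 1)
    rows.append([trait + rng.gauss(0, 1.5) for _ in range(10)])

alpha = cronbach_alpha(rows)
print(f"alpha = {alpha:.3f}")
```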

Maximized Lambda4

• This is the best estimator of reliability.
• Compute r_sb for all possible pairs of split halves.
• The largest r_sb = the estimated reliability.
• If there are more than a few items, this is unreasonably tedious.
• But there are ways to estimate it.
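For a small number of items the exhaustive search is feasible. The sketch below follows the description on this slide: it computes the Spearman-Brown corrected coefficient for every equal-sized split of eight simulated items and takes the largest. Note the assumption: Guttman's λ4 is often computed per split as 4·cov(A, B)/var(total) rather than with the Spearman-Brown correction, so treat this as an illustration of the search, not a definitive implementation. The helper names and the simulated data are hypothetical.

```python
import itertools
import random
import statistics

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman_brown(r_hh):
    """Step a half-test reliability up to full-test length."""
    return 2 * r_hh / (1 + r_hh)

def all_split_half_rsb(rows):
    """r_sb for every split of an even number of items into equal halves."""
    k = len(rows[0])
    values = []
    # Keep item 0 in half A so each split is counted once, not twice.
    for rest_of_a in itertools.combinations(range(1, k), k // 2 - 1):
        half_a = (0,) + rest_of_a
        half_b = [i for i in range(k) if i not in half_a]
        score_a = [sum(row[i] for i in half_a) for row in rows]
        score_b = [sum(row[i] for i in half_b) for row in rows]
        values.append(spearman_brown(pearson_r(score_a, score_b)))
    return values

rng = random.Random(4)

# Simulated responses: 300 people, 8 items, item = trait + noise (assumed).
rows = []
for _ in range(300):
    trait = rng.gauss(0, 1)
    rows.append([trait + rng.gauss(0, 1.5) for _ in range(8)])

rsb_values = all_split_half_rsb(rows)
max_rsb = max(rsb_values)
mean_rsb = statistics.fmean(rsb_values)

print(f"{len(rsb_values)} splits: mean r_sb = {mean_rsb:.3f}, max r_sb = {max_rsb:.3f}")
```

The number of splits grows very quickly with the number of items, which is why the exhaustive search is only practical for short scales.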

Construct Validity
• To what extent are we really measuring/manipulating the construct of interest?

• Face Validity – do others agree that it sounds valid?

Content Validity
• Detail the population of things (behaviors, attitudes, etc.) that are of interest.
• Consider your operationalization of the construct (the details of how you proposed to measure it) as a sample of that population.
• Is your sample representative of the population? Ask experts.

Criterion-Related Validity
• Established by demonstrating that your operationalization has the expected pattern of correlations with other variables.

• Concurrent Validity – demonstrate the expected correlation with other variables measured at the same time.

• Predictive Validity – demonstrate the expected correlation with other variables measured later in time.

• Convergent Validity – demonstrate the expected correlation with measures of other constructs.

• Discriminant Validity – demonstrate the expected lack of correlation with measures of other constructs.

Scaling
• Scaling = construction of instruments for measuring abstract constructs.
• I shall discuss the creation of a Likert scale, my favorite type of scale.

Likert Scales
• Define the Concept
• Generate Potential Items
  – About 100 statements.
  – On some, agreement indicates being high on the measured attribute.
  – On others, agreement indicates being low on the measured attribute.

Likert Response Scale
– Use a multi-point response scale like this:

1. People should make certain that their actions never intentionally harm others even to a small degree.

Strongly Disagree   Disagree   Neutral   Agree   Strongly Agree

Evaluate the Potential Items
• Get judges to evaluate each item on a 5-point scale
  – 1 – Agreement = very low on attribute
  – 2 – Agreement = low on attribute
  – 3 – Agreement tells you nothing
  – 4 – Agreement = high on attribute
  – 5 – Agreement = very high on attribute

• Select items with very high or very low means and little variability among the judges.
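The selection rule can be sketched as a simple filter on the judges' ratings. The cutoffs used here (mean of 2 or lower, or 4 or higher, and SD of 1 or lower) and the ratings themselves are hypothetical; the slides do not give specific thresholds.

```python
import statistics

def select_items(ratings, low=2.0, high=4.0, max_sd=1.0):
    """Keep items whose judge ratings have an extreme mean and little spread."""
    keep = []
    for item, scores in ratings.items():
        mean = statistics.fmean(scores)
        sd = statistics.stdev(scores)
        if (mean <= low or mean >= high) and sd <= max_sd:
            keep.append(item)
    return keep

# Hypothetical ratings from four judges on the 1-5 scale described above.
judge_ratings = {
    "A": [5, 5, 4, 5],  # high mean, judges agree
    "B": [3, 2, 4, 3],  # mean near 3: agreement tells you little
    "C": [1, 1, 2, 1],  # low mean, judges agree
    "D": [5, 1, 5, 1],  # judges disagree sharply
    "E": [5, 5, 2, 5],  # high mean, but too much variability
}

selected = select_items(judge_ratings)
print(selected)
```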

Alternate Method of Item Evaluation

• Ask some judges to respond to the items in the way they think someone high in the attribute would respond.

• Ask other judges to respond as would one low in the attribute.

• Prefer items that best discriminate between these two groups

• Also ask judges to identify items that are unclear or confusing.

Pilot Test the Items

• Administer to a sample of persons from the population of interest

• Conduct an item analysis (more on this later)

• Prefer items which have high item-total correlations

• Consider conducting a factor analysis (more on this later)

Administer the Final Scale

• On each item, the response which indicates the least amount of the attribute is scored as 1.
• The next-least-amount response is scored as 2.
• And so on.
• Respondent’s total score = sum of item scores or mean of item scores.
• Dealing with nonresponses on some items.
• Reflecting items (reverse scoring).
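A minimal scoring sketch, assuming a 5-point scale coded 1 to 5. Reflecting an item maps a response x to (number of points + 1) − x. Handling nonresponses by averaging only the answered items, with a minimum number of answered items required, is one common option and is an assumption here, not a rule stated in the slides. The function and parameter names are hypothetical.

```python
def reflect(response, n_points=5):
    """Reverse-score a response on a 1..n_points scale."""
    return (n_points + 1) - response

def score_respondent(responses, reverse_items=(), n_points=5, min_answered=1):
    """Mean item score for one respondent; None marks a nonresponse.

    reverse_items holds the 0-based positions of items to reflect.
    Returns None when fewer than min_answered items were answered.
    """
    scored = []
    for index, response in enumerate(responses):
        if response is None:
            continue  # skip nonresponses rather than treating them as 0
        if index in reverse_items:
            response = reflect(response, n_points)
        scored.append(response)
    if len(scored) < min_answered:
        return None
    return sum(scored) / len(scored)

# Hypothetical respondent: the second item is reverse-keyed, the third is unanswered.
answers = [5, 1, None, 4]
mean_score = score_respondent(answers, reverse_items={1}, min_answered=3)
print(mean_score)
```

Using the mean rather than the sum keeps scores comparable across respondents who skipped different numbers of items.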

Item Analysis
• You believe the scale is unidimensional.
• Each item measures the same thing.
• Item scores should be well correlated.
• Evaluate this belief with an item analysis.
  – Is the scale internally consistent?
  – If so, it is also reliable.
  – Are there items that do not correlate well with the others?

Item Analysis of Idealism Scale

• Bring KJ-Idealism.sav into PASW.
• Available at http://core.ecu.edu/psyc/wuenschk/SPSS/SPSS-Data.htm



• Click Analyze, Scale, Reliability Analysis.

• Select all ten items and scoot them to the Items box on the right.

• Click the Statistics box.

• Check “Scale if item deleted” and then click Continue.

• Back on the initial window, click OK.
• Look at the output.
• The Cronbach alpha is .744, which is acceptable.

Reliability Statistics

Cronbach's Alpha    N of Items
.744                10

Item-Total Statistics

Item   Scale Mean if   Scale Variance if   Corrected Item-Total   Cronbach's Alpha
       Item Deleted    Item Deleted        Correlation            if Item Deleted
Q1     32.42           23.453              .444                   .718
Q2     32.79           22.702              .441                   .717
Q3     32.79           21.122              .604                   .690
Q4     32.33           22.436              .532                   .705
Q5     32.33           22.277              .623                   .695
Q6     32.07           24.807              .337                   .733
Q7     34.29           24.152              .247                   .749
Q8     32.49           24.332              .308                   .736
Q9     33.38           22.063              .406                   .725
Q10    33.43           24.650              .201                   .755
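The last two columns of this kind of output can be reproduced by hand. The sketch below computes the corrected item-total correlation (each item against the sum of the other items) and alpha with that item deleted. It runs on simulated data, not on KJ-Idealism.sav: five items share an assumed common trait and a sixth is pure noise, so the sixth should stand out as troublesome. The function names are hypothetical.

```python
import random
import statistics

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def cronbach_alpha(rows):
    """Coefficient alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_vars = [statistics.variance([row[i] for row in rows]) for i in range(k)]
    total_var = statistics.variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def item_analysis(rows):
    """Per item: (corrected item-total r, alpha if that item is deleted)."""
    results = []
    for i in range(len(rows[0])):
        item = [row[i] for row in rows]
        rest_rows = [row[:i] + row[i + 1:] for row in rows]
        rest_total = [sum(rest) for rest in rest_rows]
        results.append((pearson_r(item, rest_total), cronbach_alpha(rest_rows)))
    return results

rng = random.Random(3)

# Simulated data: items 1-5 = trait + noise; item 6 is unrelated noise (assumed).
rows = []
for _ in range(400):
    trait = rng.gauss(0, 1)
    good_items = [trait + rng.gauss(0, 1) for _ in range(5)]
    rows.append(good_items + [rng.gauss(0, 1.4)])

alpha_all = cronbach_alpha(rows)
stats = item_analysis(rows)

print(f"alpha (all items) = {alpha_all:.3f}")
for number, (r_it, alpha_del) in enumerate(stats, start=1):
    print(f"Q{number}: corrected item-total r = {r_it:.3f}, alpha if deleted = {alpha_del:.3f}")
```

An item whose alpha-if-deleted exceeds the overall alpha is a candidate for revision or removal.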

Troublesome Items

• Items 7 and 10 are troublesome.
• Deleting them would increase alpha.
• But not by much, so I retained them.
• Item 7 stats are especially distressing:
• “Deciding whether or not to perform an act by balancing the positive consequences of the act against the negative consequences of the act is immoral.”

What Next?

• I should attempt to rewrite item 7 to make it more clear that it applies to ethical decisions, not other cost-benefit analysis.

• But this is not my scale.
• And who has the time?

Scale Might Not Be Unidimensional

• If the items are measuring two or more different things, alpha may well be low.

• You need to split the scale into two or more subscales.

• Factor analysis can be helpful here (but no promises).
