Reliability Psych 395 - DeShon. How Do We Judge Psychological Measures? Two concepts: Reliability...

Reliability

Psych 395 - DeShon

How Do We Judge Psychological Measures?

Two concepts: Reliability and Validity

Reliability: How consistent is the assessment over time, items, or raters? How reproducible are the measurements? How much measurement error is involved?

Validity: How well does an assessment measure what it is supposed to measure? How accurate is our assessment?

– An assessment is valid if it measures what it purports to measure.

Correlation Review

Some Data from the 2005 Baseball Season

Team          Payroll ($ millions, MLB rank)   Win %    ERA    Attendance (millions)
Yankees       208.31 (1st)                     58.6%    4.52   4.09
Red Sox       123.51 (2nd)                     58.6%    4.74   2.85
White Sox     75.18 (13th)                     61.1%    3.61   2.34
Tigers        69.09 (15th)                     43.8%    4.51   2.02
Devil Rays    29.68 (30th)                     41.4%    5.39   1.14

Questions We Might Ask …

How strongly is payroll associated with winning percentage?

How strongly is payroll associated with making the playoffs?

How can we answer these?

Option 1: Plot the Data

Option 2: Quantify the Association with the Correlation Coefficient

The Correlation Coefficient

Credited to Karl Pearson (1896)

Measures the degree of linear association between two variables.

Ranges from -1.0 to 1.0

Sign refers to direction

– Negative: As X increases, Y decreases

– Positive: As X increases, Y increases

One Formula

Symbolized by r

Covariance of X and Y divided by the product of the SDs of X and Y.

$$r = \frac{\mathrm{cov}_{XY}}{s_X s_Y}$$

Calculation of r for Payroll (X) and Winning Percentage (Y)

covXY = 1.13

sX = 34.23

sY = .07

$$r = \frac{\mathrm{cov}_{XY}}{s_X s_Y} = \frac{1.13}{(34.23)(0.07)} = \frac{1.13}{2.40} = .47$$

Calculation of r for Payroll (X) and Making Post-Season (Y)

Y coded so that 1 = Playoffs, 0 = No

covXY = 8.24

sX = 34.23

sY = .45

$$r = \frac{\mathrm{cov}_{XY}}{s_X s_Y} = \frac{8.24}{(34.23)(0.45)} = \frac{8.24}{15.42} = .53$$
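The two calculations above can be reproduced with a minimal Python sketch. The function names `pearson_r` and `r_from_summary` are illustrative and not part of the lecture; the summary values are the ones given on the slides, and the four-point data set at the end is made up purely to show the raw-data version.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from raw data: cov(X, Y) / (s_X * s_Y), using n - 1 throughout."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov_xy / (s_x * s_y)

def r_from_summary(cov_xy, s_x, s_y):
    """Pearson r from summary statistics that have already been computed."""
    return cov_xy / (s_x * s_y)

# Summary statistics taken from the slides.
r_win = r_from_summary(cov_xy=1.13, s_x=34.23, s_y=0.07)       # payroll vs. winning percentage
r_playoffs = r_from_summary(cov_xy=8.24, s_x=34.23, s_y=0.45)  # payroll vs. making the post-season
print(round(r_win, 2), round(r_playoffs, 2))  # 0.47 0.53

# Hypothetical raw data (perfectly linear), only to exercise pearson_r.
r_toy = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
print(round(r_toy, 2))  # 1.0
```

Because the slide values are rounded, the intermediate products differ slightly from the slide (for example 34.23 × 0.45 is about 15.40), but the correlations round to the same two decimals.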

Examples of Correlations

Source: Meyer et al. (2001)

Association                                                                    r
Test Anxiety and Grades                                                        -.17
SAT and Grades in College                                                      .20
GRE Quant. and Graduate School GPA                                             .22
Quality of Marital Relationships and Quality of Parent-Child Relationships     .22
Alcohol and Aggressive Behavior                                                .23
Height and Weight                                                              .44
Gender and Height                                                              .67

Commonly Used Rule of Thumb

+/- .10 is Small

+/- .30 is Medium

+/- .50 is Large

Use these with care. These guidelines only provide a loose framework for thinking about the size of correlations.

Sources: Cohen (1988) and Kline (2004)

[Figure: eleven scatterplots of true scores (vertical axis) against observed scores (horizontal axis), both axes running from -4 to 4, one panel each for r = 0, .10, .20, .30, .40, .50, .60, .70, .80, .90, and 1.0]

Now Back to Reliability

Classical Test Theory

X = T + E

where

X = Observed Score

T = True Score

E = Error score

Consider the Construct of Self-Esteem

Global self-esteem reflects a person’s overall evaluation of value and worth.

William James (1890) argued that self-esteem was the result of an individual’s perceived successes divided by their pretensions

Rosenberg (1965) defined global self-esteem as an individual’s overall judgment of adequacy

We can’t directly observe self-esteem

Measuring Self-Esteem

We can ask people questions that reflect individual differences in self-esteem.

– “I feel that I have a number of good qualities”

– “I see myself as a person with high self-esteem”

We assume that a “hidden” self-esteem variable causes people to respond to these questions.

We do not want to assume that these items are perfect indicators of an individual’s level of self-esteem.

Classical Test Theory

X = T + E

where

X = Observed Score

T = True Score

E = Error score

Classical Test Theory Assumptions

1. True scores and errors are uncorrelated (independent)

2. Errors across people average to zero

3. Across repeated measurements, a person’s average score is approximately equal to his/her true score.

Thinking about Total Variability

If X = T + E, then:

var (X) = var (T) + var (E)
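The variance decomposition follows from the assumption that true scores and errors are uncorrelated. A small simulation can illustrate it. This is only a sketch: the normal distributions, the true-score SD of 10, the error SD of 5, the sample size, and the seed are all arbitrary assumptions, not values from the lecture.

```python
import random

def variance(values):
    """Sample variance (n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

random.seed(395)  # arbitrary seed so the run is repeatable
n = 20000

# Assumed population: T ~ Normal(50, 10), E ~ Normal(0, 5), drawn independently.
true_scores = [random.gauss(50, 10) for _ in range(n)]
errors = [random.gauss(0, 5) for _ in range(n)]
observed = [t + e for t, e in zip(true_scores, errors)]  # X = T + E

var_t = variance(true_scores)
var_e = variance(errors)
var_x = variance(observed)

# The two printed numbers should be close, differing only by sampling noise
# (twice the small sample covariance between T and E).
print(round(var_x, 1), round(var_t + var_e, 1))
```

If T and E were correlated, var(X) would also include the term 2·cov(T, E), which is why the first assumption matters.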

Reliability Coefficients

Reliability coefficients reflect the proportion of true score variance to observed score variance

Therefore reliabilities range from 0.0 (no true score variance) to 1.0 (all true-score variance)

$$r_{xx} = \frac{\mathrm{var}(T)}{\mathrm{var}(X)}$$

Classic Definition of Reliability

The ratio of true score variance to total score variance.

Test 1: Total Variance = 10; True Score Variance = 9.

Test 2: Total Variance = 20; True Score Variance = 15.

Which Test is More Reliable?
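The comparison can be worked through directly from the definition. The sketch below uses the two variance pairs from the slide; the function name `reliability` is illustrative.

```python
def reliability(true_variance, total_variance):
    """Classic definition: true score variance divided by total (observed) variance."""
    return true_variance / total_variance

test_1 = reliability(true_variance=9, total_variance=10)
test_2 = reliability(true_variance=15, total_variance=20)
print(test_1, test_2)  # 0.9 0.75
```

Test 2 has more true score variance in absolute terms (15 versus 9), but it also carries more error variance (5 versus 1), so Test 1 has the higher ratio and is the more reliable test.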

Reliability

More technical: To what extent do observed scores reflect true scores?

How consistent is the assessment?

Three Kinds of Reliability

Internal Consistency (Content)

– Random error affects responses to items on an assessment

Test-Retest (Time)

– The construct stays the same. However, random errors vary from one occasion to the next.

Inter-Rater (Observer Biases)

Internal Consistency

Use a 5-item measure of Self-Esteem.

– 1. I feel that I am a person of worth, at least on an equal basis with others.

– 2. I feel that I have a number of good qualities.

– 3. All in all, I am inclined to feel that I am a failure.

– 4. I am able to do things as well as most other people.

– 5. I feel I do not have much to be proud of.

Response Options (1 = Strongly Disagree to 5 = Strongly Agree)

Internal Consistency

Correlate All Items (N = 450)

Item 1 Item 2 Item 3 Item 4 Item 5

Item 1 -

Item 2 .70 -

Item 3 .38 .45 -

Item 4 .50 .51 .41 -

Item 5 .32 .25 .43 .25 -

Summary Statistics of those Correlations

Average: .42

Standard Deviation: .13

Minimum: .25 (Items 2 & 5 and Items 4 & 5)

Maximum: .70 (Items 1 & 2)

Standardized Alpha = .78

Alpha is an index of how strongly the items on a measure are associated with each other.

Coefficient Alpha

Where you need (1) the # of items (called k) and (2) the average inter-item correlation ($\bar{r}_{ij}$). This formula yields the standardized alpha.

$$\alpha = \frac{k\,\bar{r}_{ij}}{1 + (k - 1)\,\bar{r}_{ij}}$$
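As a check on the formula, the sketch below applies it to the ten inter-item correlations from the 5-item self-esteem matrix shown earlier. The function name `standardized_alpha` is illustrative.

```python
def standardized_alpha(k, mean_r):
    """Standardized alpha from the number of items k and the average inter-item correlation."""
    return (k * mean_r) / (1 + (k - 1) * mean_r)

# Lower triangle of the 5-item correlation matrix, read row by row.
inter_item_r = [
    .70,                 # Item 2 with Item 1
    .38, .45,            # Item 3 with Items 1-2
    .50, .51, .41,       # Item 4 with Items 1-3
    .32, .25, .43, .25,  # Item 5 with Items 1-4
]

k = 5
mean_r = sum(inter_item_r) / len(inter_item_r)
alpha = standardized_alpha(k, mean_r)
print(round(mean_r, 2), round(alpha, 2))  # 0.42 0.78
```

Both values match the summary statistics reported for this measure.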

Coefficient Alpha (α)

Coefficient Alpha versus Split-half reliability Estimates…

Split-Half Reliability – Divide the items on the assessment into 2 halves and then correlate the two halves.

Problem: Estimates fluctuate depending on what items get split into which halves.

Alpha is the average of all possible split-half reliabilities.

Sample Matrix

Item 1 Item 2 Item 3 Item 4 Item 5 Item 6

John 4 3 5 5 3 2

Paul 4 5 5 3 4 4

Ringo 2 2 2 1 2 3

George 4 4 3 2 5 4
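To see how split-half estimates depend on the split, the sketch below correlates two different pairs of half-test totals for the four respondents in the sample matrix. Four people is far too few for a real estimate, so this only illustrates the mechanics; the helper names are illustrative, and no further adjustment is applied because the slides describe simply correlating the two halves.

```python
import math

# Sample matrix from the slide: six item responses per person.
scores = {
    "John":   [4, 3, 5, 5, 3, 2],
    "Paul":   [4, 5, 5, 3, 4, 4],
    "Ringo":  [2, 2, 2, 1, 2, 3],
    "George": [4, 4, 3, 2, 5, 4],
}

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

def split_half_r(half_a, half_b):
    """Correlate each person's total on half_a with their total on half_b (0-based item indices)."""
    totals_a = [sum(row[i] for i in half_a) for row in scores.values()]
    totals_b = [sum(row[i] for i in half_b) for row in scores.values()]
    return pearson_r(totals_a, totals_b)

odd_even = split_half_r([0, 2, 4], [1, 3, 5])    # Items 1, 3, 5 vs. Items 2, 4, 6
first_last = split_half_r([0, 1, 2], [3, 4, 5])  # Items 1-3 vs. Items 4-6
print(round(odd_even, 2), round(first_last, 2))  # 0.97 0.93
```

The same data give two different estimates depending only on which items land in which half.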

Real Results

10 Item Measure of Self-Esteem for 451 women.

Correlate the average of the odd number items with the average of the even number items: r = .79

Correlate average first five items with the average of the last five items: r = .67

Average Inter-Item r = .46

Standardized Alpha = .89

Caveats about Coefficient Alpha ….

Recall – what goes into the Alpha calculation:

– Number of items

– Average inter-item correlation

There are at least two things to think about when considering Coefficient Alpha…

– Length of the Assessment

– Dimensionality

Pay Attention to the Length of the Assessment.

Constant average inter-item correlation (e.g. .420) but increase the number of items….

Items Standardized Alpha

5 .78

6 .81

7 .84

8 .85

9 .87

10 .88

100 .99

Now let’s use something like an average inter-item correlation of .15

Items Standardized Alpha

5 .47

10 .64

15 .73

20 .78

25 .82

30 .84

100 .95

With enough items it is possible to achieve very high alpha coefficients…
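Both tables above can be regenerated from the standardized alpha formula by holding the average inter-item correlation fixed and varying the number of items. This is a sketch; `standardized_alpha` and `alpha_table` are illustrative names.

```python
def standardized_alpha(k, mean_r):
    """Standardized alpha from the number of items k and the average inter-item correlation."""
    return (k * mean_r) / (1 + (k - 1) * mean_r)

# Test lengths used in the two tables on the slides.
lengths_by_mean_r = {
    0.42: [5, 6, 7, 8, 9, 10, 100],
    0.15: [5, 10, 15, 20, 25, 30, 100],
}

alpha_table = {
    mean_r: [round(standardized_alpha(k, mean_r), 2) for k in lengths]
    for mean_r, lengths in lengths_by_mean_r.items()
}

for mean_r, lengths in lengths_by_mean_r.items():
    print(f"average inter-item r = {mean_r}")
    for k, alpha in zip(lengths, alpha_table[mean_r]):
        print(f"  {k:>3} items  alpha = {alpha:.2f}")
```

Even with an average inter-item correlation as low as .15, alpha passes .80 once the scale reaches 25 items, which is the reason to look at length before interpreting a high alpha.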

Dimensionality of the Measure

Let’s Get To The Same Average Inter-Item Correlation in Two Ways

Example from Schmitt (1996)

Item Pool #1 (Average inter-item r = .5)

Item 1. 2. 3. 4. 5. 6.

1. 1.0

2. .8 1.0

3. .8 .8 1.0

4. .3 .3 .3 1.0

5. .3 .3 .3 .8 1.0

6. .3 .3 .3 .8 .8 1.0

Item Pool #2 (Average inter-item r = .5)

Item 1. 2. 3. 4. 5. 6.

1. 1.0

2. .5 1.0

3. .5 .5 1.0

4. .5 .5 .5 1.0

5. .5 .5 .5 .5 1.0

6. .5 .5 .5 .5 .5 1.0

Let’s Calculate Alphas…

Item Pool #1: Average Inter-item r = .50, number of items = 6.

Item Pool #2: Average Inter-item r = .50, number of items = 6.

Standardized Alpha for Item Pool #1 = Standardized Alpha for Item Pool #2 = .86

Same alphas but the underlying correlation matrices are quite different…

Alpha does NOT index unidimensionality
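The sketch below recomputes alpha for the two item pools and also reports the standard deviation of the inter-item correlations, the same summary statistic used earlier for the self-esteem items. The helper names are illustrative; the correlations are read off the two matrices above.

```python
import math

def standardized_alpha(k, mean_r):
    """Standardized alpha from the number of items k and the average inter-item correlation."""
    return (k * mean_r) / (1 + (k - 1) * mean_r)

def mean_and_sd(values):
    """Mean and sample standard deviation of a list of correlations."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return mean, sd

# Pool 1: two clusters (Items 1-3 and Items 4-6).
# 6 within-cluster pairs at .8 and 9 between-cluster pairs at .3.
pool_1 = [.8] * 6 + [.3] * 9
# Pool 2: all 15 pairs correlate .5.
pool_2 = [.5] * 15

k = 6
mean_1, sd_1 = mean_and_sd(pool_1)
mean_2, sd_2 = mean_and_sd(pool_2)
alpha_1 = standardized_alpha(k, mean_1)
alpha_2 = standardized_alpha(k, mean_2)

print(round(mean_1, 2), round(alpha_1, 2), round(sd_1, 2))
print(round(mean_2, 2), round(alpha_2, 2), round(sd_2, 2))
```

The alphas are identical because the formula only sees the average correlation. The spread of the correlations is what differs: it is zero in Pool 2 and substantial in Pool 1, where the items fall into two clusters.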

What is unidimensionality?

Unidimensionality can be rigorously defined as the existence of one latent trait underlying the set of items (Hattie, 1985, p. 152).

Simply put, all of the items forming the instrument measure just one thing.

Turns out that 100% “pure” unidimensionality is hard to achieve for personality and attitude measures.

Try to get items that are as close as possible to a unidimensional set.

A Few Tips

Think about the construct

Pay attention to the number of items on a scale and the average item correlation.

Always look at the inter-item correlation matrix.

Motto: An essential ingredient in the research process is the judgment of the scientist. (Jacob Cohen, 1923-1998).

Question: What is a good alpha level?

Answer: It depends….

Reliability Standards

Reliability Standards

– .7 for research

– .9 for actual decisions

But…

– “Does a .50 reliability coefficient stink? To answer this question, no authoritative source will do. Rather, it is for the user to determine what amount of error variance he or she is willing to tolerate, given the specific circumstances of the study.” Pedhazur and Schmelkin (1991, p. 110)

Test-Retest Reliability

The extent to which scores at one time point do not perfectly correlate with scores at another time point is an indicator of error

Correlation is an estimate of the reliability ratio

This assumes the underlying construct is stable.

Test-Retest Reliability

What Time Interval? Long enough that memory biases are not present but short enough that there is no expectation of true change.

Cattell et al. (1970, p. 320): “When the lapse of time is insufficient for people themselves to change.”

Watson (2004) suggested 2 weeks.

Inter-Rater Reliability

Just like test-retest reliability

Correlation of ratings from 2 or more judges

Correlation is an estimate of the reliability ratio

Question: What is one undesirable consequence of measurement error?

Researchers are often concerned about attenuation in predictor-criterion associations due to measurement error.

Assume that measures of X and Y have alphas of .60 and .70, respectively. An estimate of the upper limit on the observed correlation between X and Y is .65.

Take the square root of the product of the two reliabilities: $\sqrt{(.60)(.70)} = \sqrt{.42} \approx .65$

Reliability of Measure 1    Reliability of Measure 2    Upper Limit on Observed r
.50                         .85                         .65
.60                         .85                         .71
.70                         .85                         .77
.80                         .85                         .82
.90                         .85                         .87
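The upper limits in the table can be reproduced with a few lines of Python. This is a sketch; `max_observed_r` is an illustrative name.

```python
import math

def max_observed_r(r_xx, r_yy):
    """Upper limit on the observed correlation: square root of the product of the two reliabilities."""
    return math.sqrt(r_xx * r_yy)

# Worked example from the slide: alphas of .60 and .70.
print(round(max_observed_r(.60, .70), 2))  # 0.65

# Table rows: reliability of Measure 1 varies, Measure 2 is fixed at .85.
upper_limits = [round(max_observed_r(r_xx, .85), 2) for r_xx in (.50, .60, .70, .80, .90)]
print(upper_limits)  # [0.65, 0.71, 0.77, 0.82, 0.87]
```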

Correcting Correlations for Attenuation

$$r_c = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}$$

rc = correlation corrected for attenuation

rxy = observed correlation between x and y

rxx and ryy = reliability coefficients of x and y

Applying the Formula

Reliability of Measure 1    Reliability of Measure 2    Observed Correlation    Corrected
.50                         .60                         .40                     .73
.60                         .70                         .40                     .62
.70                         .80                         .40                     .53
.80                         .90                         .40                     .47
.90                         .90                         .40                     .44
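The corrected values in the table can be checked with a short sketch of the correction for attenuation (observed correlation divided by the square root of the product of the two reliabilities). The function name `correct_for_attenuation` is illustrative.

```python
import math

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Observed correlation divided by the square root of the product of the reliabilities."""
    return r_xy / math.sqrt(r_xx * r_yy)

# (reliability of Measure 1, reliability of Measure 2) for each table row.
reliability_pairs = [(.50, .60), (.60, .70), (.70, .80), (.80, .90), (.90, .90)]
observed_r = .40

corrected = [
    round(correct_for_attenuation(observed_r, r_xx, r_yy), 2)
    for r_xx, r_yy in reliability_pairs
]
print(corrected)  # [0.73, 0.62, 0.53, 0.47, 0.44]
```

The less reliable the two measures, the larger the gap between the observed .40 and the corrected estimate.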

Standard Error of Measurement

Estimating the precision of individual scores

Standard Error of Measurement = Standard deviation of the error around any individual’s true score

±2 SEM captures 95% of the error

Calculation of the Standard Error of Measurement (SEM)

$$SEM = s_x \sqrt{1 - r_{xx}}$$

$s_x$ = SD of test scores; $r_{xx}$ = test reliability

$$SEM = 10\sqrt{1 - .84} = 4$$

$$SEM = 10\sqrt{1 - .19} = 9$$

good reliability → low SEM

poor reliability → high SEM
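The two SEM calculations can be reproduced with a minimal sketch. The function name `standard_error_of_measurement` is illustrative; the SD of 10 and the reliabilities of .84 and .19 are the values from the slide.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD of test scores times the square root of (1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

sem_good = standard_error_of_measurement(sd=10, reliability=.84)
sem_poor = standard_error_of_measurement(sd=10, reliability=.19)
print(round(sem_good, 2), round(sem_poor, 2))  # 4.0 9.0
```

With the same SD, dropping the reliability from .84 to .19 more than doubles the SEM.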

Standard Error of Measurement Assumptions

• a reliability coefficient based on an appropriate measure

• the sample appropriately represents the population

$$SEM = s_x \sqrt{1 - r_{xx}}$$

Confidence Bands

There are additional complexities involved in setting confidence bands around observed scores but we won’t cover them in PSY 395 (see Nunnally & Bernstein, 1994, p. 259)

SEM Confidence Interval

– 95% Confidence: Z = 1.96 (often rounded to 2)

– 68% Confidence: Z = 1.0

$$CI = \text{Observed Score} \pm Z_{\text{Confidence}} \times SEM$$

Consider 2 Tests

Case 1

The CAT (Creative Analogies Test) has 100 items. Assume the SEM of this test is 10.

Amy scored 75

The 95% Band = Score ± (2 × SEM)

So the 95% Confidence Band around her score is 55 to 95

Case 2

The CAT-2 (Creative Analogies Test V.2) also has 100 items. Assume the SEM of this test is 2

Amy scored 75

95% Confidence Band around her score is 71 to 79

Why? Recall the 95% Band = Score ± (2 × SEM)
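Both cases follow from the same band formula. The sketch below uses Z rounded to 2, as on the slides; `confidence_band` is an illustrative name.

```python
def confidence_band(score, sem, z=2.0):
    """Band around an observed score: score - z * SEM to score + z * SEM."""
    return (score - z * sem, score + z * sem)

cat_band = confidence_band(score=75, sem=10)   # Case 1: CAT, SEM = 10
cat2_band = confidence_band(score=75, sem=2)   # Case 2: CAT-2, SEM = 2
print(cat_band)   # (55.0, 95.0)
print(cat2_band)  # (71.0, 79.0)

# The 68% band uses Z = 1.0 instead.
print(confidence_band(score=75, sem=10, z=1.0))  # (65.0, 85.0)
```

The same observed score of 75 is pinned down to within 4 points on the CAT-2 but only to within 20 points on the CAT.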

Which test should be used to make decisions about Graduate School Admission? Why?

Decisions….Decisions…

                                   TRUTH
Test Decision          Doesn’t Have “it”     Has “it”
Doesn’t Have “it”      Correct               False Negative
Has “it”               False Positive        Correct

Cut Scores

Cut scores are set values that determine who “passes” and who “fails” the test.

– Commonly used for licensure or certification (bar exam, medical licensure, civil service)

What is the impact of the standard error of measurement on interpreting cut scores?

The smaller the SEM the better. Why?
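One way to think about the question is to ask whether the confidence band around a score straddles the cut score. The sketch below is only an illustration: the cut score of 70 is hypothetical, the three-way labelling rule is an assumption rather than a procedure from the lecture, and Amy's score and the two SEMs are borrowed from the CAT examples.

```python
def band_vs_cut(score, sem, cut, z=2.0):
    """Compare the band score ± z * SEM with a cut score (illustrative rule only)."""
    low, high = score - z * sem, score + z * sem
    if low >= cut:
        return "clearly above cut"
    if high < cut:
        return "clearly below cut"
    return "band straddles cut"

hypothetical_cut = 70  # assumed value, not from the lecture

large_sem_result = band_vs_cut(score=75, sem=10, cut=hypothetical_cut)  # band 55 to 95
small_sem_result = band_vs_cut(score=75, sem=2, cut=hypothetical_cut)   # band 71 to 79
print(large_sem_result)  # band straddles cut
print(small_sem_result)  # clearly above cut
```

With the larger SEM, the same observed score cannot be confidently placed on either side of the cut, which makes false positives and false negatives more likely near the cut score.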
