
Reliability & Validity

• More is Better
• Properties of a “good measure”
  – Standardization
  – Reliability
  – Validity
• Reliability
  – Inter-rater
  – Internal
  – External
• Validity
  – Criterion-related
  – Face
  – Content
  – Construct

Another bit of jargon…

In sampling…
• We are trying to represent a population of individuals
• We select participants
• The resulting sample of participants is intended to represent the population

In measurement…
• We are trying to represent a domain of behaviors
• We select items
• The resulting scale/test of items is intended to represent the domain

For both, “more is better” -- more gives greater representation.

Whenever we’ve considered research designs and statistical conclusions, we’ve always been concerned with “sample size”

We know that larger samples (more participants) lead to ...

• more reliable estimates of the mean and std, r, F & χ²

• more reliable statistical conclusions

• quantified as fewer Type I and II errors

The same principle applies to scale construction - “more is better”

• but now it applies to the number of items comprising the scale

• more (good) items leads to a better scale…

• more adequately represent the content/construct domain

• provide a more consistent total score (a respondent can change responses to more items before the total score changes much)

Desirable Properties of Psychological Measures

Interpretability of Individual and Group Scores

Population Norms

Validity

Reliability

Standardization

Reliability (Agreement or Consistency)

Inter-rater or inter-observer reliability
• do multiple observers/coders score an item the same way?
• critical whenever using subjective measures
• dependent upon Standardization

Internal reliability -- do the items measure a central “thing”?
• Cronbach’s alpha: α = .00 – 1.00, higher is better
• more strongly correlated items & more items lead to higher α (a computational sketch follows after this list)

External Reliability – stability of scale/test scores over time

• test-retest reliability – correlate scores from same test given 3-18 weeks apart

• alternate forms reliability – correlate scores from two “versions” of the test
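A minimal computational sketch (in Python with NumPy; not part of the original slides, and the data are made up) of how Cronbach’s alpha can be calculated from a respondents-by-items score matrix:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a (respondents x items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# hypothetical responses: 5 respondents x 4 items
scores = [[3, 4, 3, 4],
          [2, 2, 3, 2],
          [5, 4, 5, 5],
          [1, 2, 1, 2],
          [4, 4, 4, 3]]
print(round(cronbach_alpha(scores), 2))   # higher alpha = more internally consistent
```

Adding items that correlate with the others inflates the total-score variance (through the item covariances) faster than the summed item variances, which is why more good items push α upward.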

Assessing Internal Reliability

Item    corrected item-total r    alpha if item deleted
i1            .1454                      .65
i2            .2002                      .58
i3           -.2133                      .71
i4            .1882                      .59
i5            .1332                      .68
i6            .2112                      .56
i7            .1221                      .60

Coefficient Alpha = .58 -- tells the internal consistency (reliability) for this set of items

Correlation between each item and a total comprised of all the other items

• negative item-total correlations indicate either...
  • a very “poor” item
  • a reverse-keying problem

What the alpha would be if that item were dropped

• drop items whose alpha-if-deleted is larger than the scale’s alpha

Usually do several “passes” rather than dropping several items at once.
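Here is a hedged sketch (Python/NumPy; not from the original slides, with illustrative names and data) of how the corrected item-total correlations and alpha-if-deleted values in tables like these could be computed:

```python
import numpy as np

def cronbach_alpha(items):
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

def item_analysis(items):
    """Corrected item-total r and alpha-if-deleted for each item."""
    items = np.asarray(items, dtype=float)
    for i in range(items.shape[1]):
        rest = np.delete(items, i, axis=1)                 # all the other items
        rest_total = rest.sum(axis=1)
        r = np.corrcoef(items[:, i], rest_total)[0, 1]     # corrected item-total r
        alpha_if_deleted = cronbach_alpha(rest)            # alpha without this item
        print(f"i{i + 1}  item-total r = {r:+.2f}  alpha if deleted = {alpha_if_deleted:.2f}")

# hypothetical data: 6 respondents x 4 items (item 3 mimics a reverse-keyed item)
item_analysis([[3, 4, 1, 4], [2, 2, 5, 2], [5, 4, 2, 5],
               [1, 2, 4, 2], [4, 4, 1, 3], [3, 3, 5, 4]])
```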

Assessing Internal Reliability

Item    corrected item-total r    alpha if item deleted
i1            .0854                      .65
i2            .2002                      .58
i3           -.2133                      .71
i4            .1882                      .59
i5            .0832                      .68
i6            .0712                      .56
i7            .0621                      .60

Coefficient Alpha = .58

Pass #1

All items with “-” item-total correlations are “bad”

• check to see that they have been keyed correctly

• if they have been correctly keyed -- drop them

• i3 would likely be dropped

Assessing Internal Reliability

Item    corrected item-total r    alpha if item deleted
i1            .0812                      .73
i2            .2202                      .68
i4            .1822                      .70
i5            .0877                      .74
i6            .2343                      .64
i7            .0621                      .78

Coefficient Alpha = .71

Pass #2, etc

Look for items with alpha-if-deleted values that are substantially higher than the scale’s alpha value

• don’t drop too many at a time

• probably drop i7

• probably not drop i1 & i5

• recheck on next “pass”

• it is better to drop 1-2 items on each of several “passes”

Validity (Consistent Accuracy)

Criterion-related Validity -- does the test correlate with a “criterion”?
• statistical -- requires a criterion that you “believe in”
• predictive, concurrent, postdictive validity

Face Validity -- do the items come from the “domain of interest”?
• non-statistical -- decision of the “target population”

Content Validity -- do the items come from the “domain of interest”?
• non-statistical -- decision of an “expert in the field”

Construct Validity -- does the test relate to other measures as it should?
• non-statistical -- does the measure match the theory of the construct?
• statistical -- Discriminant validity
  • convergent validity -- correlates (+ or -) with selected tests as it should?
  • divergent validity -- r ≈ 0 with other, different tests as it should?

“Is the test valid?”

Jum Nunnally (one of the founders of modern psychometrics) claimed this was a “silly question”! The point wasn’t that tests shouldn’t be “valid” but that a test’s validity must be assessed relative to…

• the construct it is intended to measure

• the population for which it is intended (e.g., age, level)

• the application for which it is intended (e.g., for classifying folks into categories vs. assigning them quantitative values)

So, the real question is, “Is this test a valid measure of this construct for this population in this application?”

That question can be answered!

Criterion-related Validity

Do the test scores correlate with criterion behavior scores??

When the criterion behavior occurs -- before (postdictive), now (concurrent), or later (predictive) -- relative to the test taken now:

predictive -- test taken now predicts criterion measured later
• want to estimate what will happen before it does
• e.g., your GRE score (taken now) predicts grad school (later)

concurrent -- test taken now “replaces” criterion measured now
• often the goal is to substitute a “shorter” or “cheaper” test
• e.g., the written drivers test replaces the road test

postdictive -- test taken now captures behavior & affect from before
• most of the behavior we study “has already happened”
• e.g., adult memories of childhood feelings or medical history

Conducting a Predictive Validity Study

example -- test designed to identify qualified “front desk personnel” for a major hotel chain

-- 200 applicants and 20 position openings

A “proper” predictive validity study…

• give each applicant the test (and “seal” the results)

• give each applicant a job working at a front desk

• assess work performance after 6 months (the criterion)

• correlate the test (predictor) and work performance (criterion)

Anybody see why the chain might not be willing to apply this design?

Substituting concurrent validity for predictive validity

• assess work performance of all folks currently doing the job

• give them each the test

• correlate the test (predictor) and work performance (criterion)

Problems?

• Not working with the population of interest (applicants)

• Range restriction -- work performance and test score variability are “restricted” by this approach

• current hiring practice probably not “random”

• good workers “move up” -- poor ones “move out”

• Range restriction will artificially lower the validity coefficient (r)

Applicant pool -- target population

Selected (hired) folks

• assuming selection basis is somewhat reasonable/functional

Sample used in concurrent validity study

• worst of those hired have been “released”

• best of those hired have “changed jobs”

What happens to the sample ...

(Scatterplot: criterion -- job performance -- plotted against the predictor -- interview/measure.)

What happens to the validity coefficient -- r

• Applicant pool: r = .75
• Sample used in validity study (hired folks only): r = .20

Face Validity

Does the test “look like” a measure of the construct of interest?

• “looks like” a measure of the desired construct to a member of the target population

• will someone recognize the type of information they are responding to?

• Possible advantage of face validity ..

• If the respondent knows what information we are looking for, they can use that “context” to help interpret the questions and provide more useful, accurate answers

• Possible limitation of face validity …

• if the respondent knows what information we are looking for, they might try to “bend & shape” their answers to what they think we want -- “fake good” or “fake bad”

“Continuum of content expertise” -- target population members, researchers, content experts

• Target population members assess Face Validity
• Content experts assess Content Validity
• Researchers should evaluate the validity evidence provided about the scale, rather than the scale items themselves -- unless they are truly content experts

Content Validity

Does the test contain items from the desired “content domain”?

• Based on assessment by “subject matter experts” (SMEs) in that content domain

• Is especially important when a test is designed to have low face validity

• e.g., tests of “honesty” used for hiring decisions

• Is generally simpler for “achievement tests” than for “psychological constructs” (or other “less concrete” ideas)

• e.g., it is a lot easier for “math experts” to agree whether or not an item should be on an algebra test than it is for “psychological experts” to agree whether or not an item should be on a measure of depression.

• Content validity is not “tested for”. Rather it is “assured” by the informed item selections made by experts in the domain.

Construct Validity

• Does the test correspond with the theory of the construct & does it interrelate with other tests as a measure of this construct should?

• We use the term construct to remind ourselves that many of the terms we use do not have an objective, concrete reality.

• Rather they are “made up” or “constructed” by us in our attempts to organize and make sense of behavior and other psychological processes

• attention to construct validity reminds us that our defense of the constructs we create is really based on the “whole package” of how the measures of different constructs relate to theory and to each other

• So, construct validity “begins” with content validity (are these the right types of items?) and then adds the question, “does this test relate as it should to other tests of similar and different constructs?”

The statistical assessment of Construct Validity …

Discriminant Validity

• Does the test show the “right” pattern of interrelationships with other variables? -- has two parts

• Convergent Validity -- test correlates with other measures of similar constructs

• Divergent Validity -- test isn’t correlated with measures of “other, different constructs”

• e.g., a new measure of depression should …

• have “strong” correlations with other measures of “depression”

• have negative correlations with measures of “happiness”

• have “substantial” correlation with measures of “anxiety”

• have “minimal” correlations with tests of “physical health”, “faking bad”, “self-evaluation”, etc.
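As an illustration of how such a pattern might be inspected in practice, here is a hedged sketch (Python with pandas; not from the slides, and all scale names and data are hypothetical) that builds the correlation matrix of a new scale against comparison measures:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200

# hypothetical latent depression score plus noise, to give the scales a plausible structure
dep = rng.normal(size=n)
data = pd.DataFrame({
    "new_dep":     dep + rng.normal(scale=0.8, size=n),
    "old_dep1":    dep + rng.normal(scale=0.8, size=n),
    "old_dep2":    dep + rng.normal(scale=0.8, size=n),
    "happiness":  -dep + rng.normal(scale=0.8, size=n),
    "phys_health": rng.normal(size=n),            # unrelated construct
})

# correlations of the new scale with every other measure
print(data.corr()["new_dep"].drop("new_dep").round(2))

# convergent validity: expect strong +r with old_dep1/old_dep2 and -r with happiness
# divergent validity:  expect near-zero r with phys_health
```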

Evaluate this measure of depression….

            NewDep   Dep1   Dep2    Anx   Happy  PhyHlth  FakBad
New Dep
Old Dep1      .61
Old Dep2      .49    .76
Anx           .43    .30    .28
Happy        -.59   -.61   -.56   -.75
PhyHlth       .60    .18    .22    .45   -.35
FakBad        .55    .14    .26    .10   -.21    .31

Identify the elements of discriminant validity being tested and state the “conclusion.”

Evaluate this measure of depression….

            NewDep   Dep1   Dep2    Anx   Happy  PhyHlth  FakBad
New Dep
Old Dep1      .61
Old Dep2      .49    .76
Anx           .43    .30    .28
Happy        -.59   -.61   -.56   -.75
PhyHlth       .60    .18    .22    .45   -.35
FakBad        .55    .14    .26    .10   -.21    .31

• convergent validity with Dep1 & Dep2 (but a bit lower than r(Dep1, Dep2) = .76)
• more correlated with Anx than Dep1 or Dep2 are
• correlation with Happy is about the same as for Dep1 & Dep2
• “too” correlated with PhyHlth
• “too” correlated with FakBad

This pattern of results does not show strong discriminant validity !!

Population Norms

In order to interpret a score from an individual or group, you must know what scores are typical for that population

• Requires a large representative sample of the target population

• preferably random, research-selected & stratified

• Requires solid standardization both administrative & scoring

• Requires great inter-rater reliability (if subjective items)

The Result ??

A scoring distribution of the population.

• lets us identify “normal,” “high” and “low” scores

• lets us identify “cutoff scores” to define important populations and subpopulations (e.g., 70 for MMPI & 80 for WISC)
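As a hedged illustration (Python; not from the slides) of how population norms support interpretation, the sketch below converts a raw score to a standardized T-score (mean 50, SD 10) using hypothetical norm-sample values and compares it to a cutoff; the specific numbers are made up.

```python
# hypothetical norm-sample statistics for the raw scale
norm_mean = 24.0    # mean raw score in a large representative norm sample
norm_sd = 6.0       # standard deviation of raw scores in the norm sample

def t_score(raw):
    """Convert a raw score to a T-score (mean 50, SD 10) relative to the norms."""
    return 50 + 10 * (raw - norm_mean) / norm_sd

cutoff = 70                 # e.g., an "elevated" T-score cutoff, as with the MMPI
raw = 38
t = t_score(raw)
print(f"T = {t:.0f}, {'above' if t >= cutoff else 'below'} the cutoff of {cutoff}")
```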

Desirable Properties of Psychological Measures

Standardization -- Administration & Scoring

Reliability -- Inter-rater, Internal Consistency, Test-Retest & Alternate Forms

Validity -- Face, Content, Criterion-Related, Construct

Population Norms -- Scoring Distribution & Cutoffs

Interpretability of Individual and Group Scores