
Assessment Concepts

Dr. Julie Esparza Brown
Sped 512: Diagnostic Assessment
Week 4
Portland State University


Normal Distribution

Symmetrical and unimodal, with no skew and no excess kurtosis. You always know the percent of the distribution that falls in any part of the normal curve.
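For example, about 68% of scores fall within one standard deviation of the mean and about 95% within two. A minimal sketch of how to verify this, assuming Python with SciPy is available:

```python
# Sketch: percent of the normal distribution between any two points,
# expressed in standard-deviation units, using the cumulative
# distribution function of the standard normal curve.
from scipy.stats import norm

def percent_between(lo_sd, hi_sd):
    return (norm.cdf(hi_sd) - norm.cdf(lo_sd)) * 100

print(f"Within 1 SD of the mean: {percent_between(-1, 1):.1f}%")  # ~68.3%
print(f"Within 2 SD of the mean: {percent_between(-2, 2):.1f}%")  # ~95.4%
print(f"Below -1 SD:             {norm.cdf(-1) * 100:.1f}%")      # ~15.9%
```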


Raw Scores

Raw scores convey very little meaning unless they are referenced to some standard.


Percentiles (Relative Standing)

The percent of people in the comparison group who scored at or below the score of interest.

Example: Billy obtained a percentile rank of 42. This means that Billy performed as well as or better than 42% of children his age on the test. Equivalently, 42% of children Billy’s age scored at or below Billy’s score; or, Billy is number 42 in a line of 100 people.
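As a sketch (the comparison-group scores below are hypothetical, not from any norm table), this definition translates directly into code:

```python
# Sketch: percentile rank = percent of the comparison group scoring
# at or below the score of interest. The norm-group scores are hypothetical.
def percentile_rank(score, comparison_scores):
    at_or_below = sum(1 for s in comparison_scores if s <= score)
    return 100 * at_or_below / len(comparison_scores)

norm_group = [55, 60, 62, 65, 68, 70, 72, 75, 80, 85]
print(percentile_rank(68, norm_group))  # 50.0 -- at or above half the group
```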


Advantages of Percentile Ranks

Percentile ranks are one of the best types of scores to report to consumers because they convey a child’s relative standing compared to other children.

Scores indicate how well a student performed compared to the performance of some reference group.

Percentile ranks are ordinal-scale data (values are ordered from worst to best, but the differences between adjacent values are unknown). It is therefore not meaningful to calculate the mean or standard deviation of percentiles.
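A small demonstration of why, as a sketch assuming SciPy (the two percentile ranks are hypothetical): averaging percentile ranks directly gives a different, misleading answer than averaging the underlying equal-interval z scores and converting the result back to a percentile.

```python
# Sketch: why averaging percentile ranks is not meaningful.
from scipy.stats import norm

pr_a, pr_b = 2, 50                     # hypothetical percentile ranks
naive_mean = (pr_a + pr_b) / 2         # 26.0 -- treats ordinal data as interval

# Convert to equal-interval z scores, average, then convert back.
z_mean = (norm.ppf(pr_a / 100) + norm.ppf(pr_b / 100)) / 2
proper = norm.cdf(z_mean) * 100
print(naive_mean, round(proper, 1))    # 26.0 vs ~15.2 -- not the same
```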


Standard Scores (Relative Standing)

Standard scores are scores of relative standing with a fixed, predetermined mean and standard deviation.

Standard Score        Mean    Standard Deviation
Z                        0             1
T                       50            10
IQ                     100            15
SB Subtest              50             8
WISC-III Subtest        10             3
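Because each of these scales is a linear transformation of the same underlying z score, any standard score can be converted to any other scale. A minimal sketch:

```python
# Sketch: converting between standard-score scales via the z score.
def convert(score, old_mean, old_sd, new_mean, new_sd):
    z = (score - old_mean) / old_sd   # relative standing in SD units
    return new_mean + z * new_sd      # same standing on the new scale

print(convert(60, 50, 10, 100, 15))   # T score 60 (+1 SD) -> 115.0 on the IQ scale
print(convert(60, 50, 10, 0, 1))      # ... and 1.0 as a z score
```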


WJ-III Scores

WJ-III uses standard scores with a mean of 100 and a standard deviation of 15.

A person earning a score of 85 would be one standard deviation below the mean.

A person earning a score of 115 would be one standard deviation above the mean.

Standard scores are equal-interval scores so they can be combined (e.g., added or averaged).


Age & Grade Equivalents (Developmental Scale)

There are problems with using these scores:

Identical age equivalents can mean different task performance.


Problems with Grade and Age Equivalent Scores

1. Systematic misinterpretation: a student who earns an AE of 12.0 has answered as many questions correctly as the average for children of age 12. The student has not necessarily performed the way a 12-year-old would.

2. Implication of a false standard of performance: equivalent scores are constructed so that, by definition, 50% of any age or grade group will perform below age or grade level and 50% above it.

3. Tendency for scales to be ordinal, not equal interval: because age and grade equivalent scores are ordinal, they should not be added or multiplied.

Source: Salvia, Ysseldyke & Bolt (2009)


Age & Grade Equivalents (Developmental Scale)

If Maria got an age equivalent of 2-0 on a test, it means:

Maria obtained the same number correct as the estimated mean of children 2 years and 0 months of age.

It does NOT mean:

Maria performed like an average 2 year old on the test.


Age & Grade Equivalents (Developmental Scale)

If John got a grade equivalent of 3.5 on a test, it means:

John obtained the same number correct as the estimated mean of children in the 5th month of 3rd grade.

It does NOT mean:

John is able to do 3.5 grade level work.

Bottom line: do not use age or grade equivalent scores.


Scales of Measurement


Nominal Scale (Name)

A scale of measurement in which there is no inherent relationship among adjacent values.

Each number reflects an arbitrary category label rather than an amount of a variable.

Nominal Scales are used to indicate classification, category, or group.

Examples: football jersey numbers, Group 1 vs. Group 2, diagnostic categories.


Ordinal Scale (Order)

A scale on which values of measurement are ordered from best to worst or from worst to best; on ordinal scales, the differences between adjacent values are unknown.

Ordinal Scales provide order and ranking information (1st, 2nd, 3rd, etc.)

Are used to indicate when one value has more or less of something than another.

The central tendency of an ordinal attribute can be represented by its mode or its median, but the mean cannot be defined.

Examples: rank in high school class, percentile rank, age and grade equivalents, results of a horse race.


Interval Scale (Interval/Distance)

Interval Scales provide distance (interval) information.

Differences have meaning. Equal differences in the numbers correspond to equal differences in the attributes.

Most data in education will be interval scale data.

Examples: IQ scores, test scores, rating scales.


Ratio Scale (Ratio/Absolute 0)

A scale of measurement in which the difference between adjacent values is equal and in which there is a logical and absolute zero.

Ratio Scales provide absolute amount information.

Examples: counts of behavior, income.


Measures of Central Tendency


Mean (Most useful)

Mean – the average of all the scores in the distribution.

Appropriate for Equal Interval and Ratio Scales.

Not appropriate for skewed distributions.


Median (Next most useful)

Median – the middle score of a distribution.

Appropriate for Ordinal, Equal Interval, and Ratio Scales. Most appropriate when the distribution is skewed; 50% of scores are above the median and 50% of scores are below it.

Example: Arrange the scores in order from largest to smallest (or vice versa). If N is odd, the middle score is the median. If N is even, the average of the two middle scores is the median.
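A minimal sketch of that procedure (Python’s built-in statistics.median behaves the same way):

```python
# Sketch: median via the sort-and-take-the-middle procedure above.
def median(scores):
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd N: the middle score
    return (ordered[mid - 1] + ordered[mid]) / 2  # even N: average the two middle

print(median([7, 1, 5]))      # 5
print(median([7, 1, 5, 3]))   # 4.0
```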


Mode (Least useful)

The Mode is the most frequently occurring score.

Appropriate for Nominal, Ordinal, Equal Interval, and Ratio Scales.

Generally used in a very rough sense to get a feel for “the peak of the mountain.”


Measures of Spread or Variability


Standard Deviation

Standard Deviation (S) indicates the spread or variability of a distribution; the square root of the variance.

Appropriate only for equal interval and ratio scales.

Also used as a unit of measurement.
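A minimal sketch (hypothetical scores; this computes the population variance, then takes its square root):

```python
# Sketch: standard deviation = square root of the variance.
import math

def standard_deviation(scores):
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return math.sqrt(variance)

print(round(standard_deviation([85, 100, 115]), 2))  # 12.25
```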


Technical Adequacy of Instruments


The Reliability Coefficient

An index of the extent to which observations can be generalized; the square of the correlation between obtained scores and true scores on a measure.

The proportion of variability in a set of scores that reflects true differences among individuals.

If there is relatively little error, the ratio of true-score variance to obtained-score variance approaches a reliability index of 1.0 (perfect reliability).

If there is a relatively large amount of error, the ratio of true-score variance to obtained-score variance approaches .00 (total unreliability).

We want to use the most reliable tests available.


Standards for Reliability

If test scores are to be used for administrative purposes and are reported for groups of individuals, a reliability of .60 should be the minimum. The relatively low standard is acceptable because group means are not affected by a test’s lack of reliability.

If weekly (or more frequent) testing is used to monitor pupil progress, a reliability of .70 should be the minimum. This relatively low standard is acceptable because random fluctuations can be taken into account when a behavior or skill is measured often.


Standards for Reliability

If the decision being made is a screening decision, there is still a need for higher reliability. For screening devices, a standard of .80 is recommended.

If a test score is to be used to make an important decision concerning an individual student (such as special education placement), the minimum standard should be .90.


Standard Error of Measurement

SEM is another index of test error. It is the standard deviation of the error distribution around a person’s true score; measurement error is the difference between a student’s obtained score and his or her true score. We generally assess a student only once on a norm-referenced test, so we do not know the test taker’s true score or the variance of the measurement error that forms the distribution around that person’s true score.

We estimate the error distribution by calculating the SEM. The general formula: the SEM equals the standard deviation of the obtained scores multiplied by the square root of 1 minus the reliability coefficient, SEM = SD × √(1 − r).

When the SEM is relatively large, the uncertainty about where the student’s true score falls is large; when the SEM is relatively small, the uncertainty is small.
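A minimal sketch of the formula (the SD and reliability values are hypothetical):

```python
# Sketch: SEM = SD of obtained scores * sqrt(1 - reliability coefficient).
import math

def sem(sd, reliability):
    return sd * math.sqrt(1 - reliability)

# For a scale with SD = 15 and a reliability of .90:
print(round(sem(15, 0.90), 2))  # 4.74
```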


Confidence Interval

The range of scores within which a person’s true score will fall with a given probability.

Since we can never know a person’s true score, we can estimate the likelihood that a person’s true score will be found within a specified range of scores called the confidence interval.

Confidence intervals have two components: a score range and a level of confidence.


Score range: the range within which a true score is likely to be found. A range of 80-90 tells us that a person’s true score is likely to be within that range.

Level of confidence: tells us how certain we can be that the true score will be contained within the interval. If a 90% confidence interval for an IQ is 106-112, we can be 90% sure that the true score will be contained within that interval. It also means that there is a 5% chance the true score is higher than 112 and a 5% chance the true score is lower than 106.

To have greater confidence would require a wider confidence interval.

You will have a choice of confidence intervals in Compuscore. You can choose the 90% option, but the default is set at 68%.
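A sketch of how such an interval is built from the SEM (the multipliers are standard normal z values; the obtained score and SEM below are hypothetical):

```python
# Sketch: confidence interval = obtained score +/- z * SEM.
Z = {68: 1.00, 90: 1.645, 95: 1.96}  # standard normal multipliers

def confidence_interval(obtained, sem, level=90):
    half_width = Z[level] * sem
    return obtained - half_width, obtained + half_width

lo, hi = confidence_interval(109, 1.82, level=90)
print(round(lo), round(hi))  # 106 112 -- matches the IQ example above
```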


Validity

“The degree to which evidence and theory support the interpretation of test scores entailed by proposed uses of tests” (APA Standards, 1999, p. 9).

Validity is the most fundamental consideration in evaluating and using tests.


Validity

“A test that leads to valid inferences in general or about most students may not yield valid inferences about a specific student…First, unless a student has been systematically acculturated in the values, behavior, and knowledge found in the public culture of the United States, a test that assumes such cultural information is unlikely to lead to appropriate inferences about that student…


Validity

Second, unless a student has been systematically instructed in the content of an achievement test, a test assuming such academic instruction is unlikely to lead to appropriate inferences about the student’s ability to profit from instruction. It would be inappropriate to administer a standardized test of written language (which counts misspelled words as errors) to a student who has been encouraged to use inventive spelling and reinforced for doing so. It is unlikely that the test results would lead to correct inferences about that student’s ability to profit from systematic instruction in spelling” (Salvia, Ysseldyke, & Bolt, 2009, p. 63).


Types of Validity

Content validity
Criterion-related validity
Construct validity


Content Validity

A measure of the extent to which a test is an adequate measure of the content it is designed to cover. Content validity is established by examining three factors: the appropriateness of the type of items included, the comprehensiveness of the item sample, and the way in which the items assess the content.

It is assessed by a review of the items by trained individuals who make judgments about the relevancy of the items and the unambiguity of their formulation.

This is especially important in achievement testing, and it is an area under debate.

There is an emerging consensus that the methods used to assess student knowledge should closely parallel those used in instruction.


Criterion-related Validity

The extent to which performance on a test predicts performance in a real-life situation.

Usually expressed as a correlation coefficient called a validity coefficient.

Two types of criterion-related validity: concurrent validity and predictive validity.


Concurrent Validity

A measure of how accurately a person’s current test score can be used to estimate a score on a criterion measure.

If the test presents evidence of content validity and elicits test scores that correspond closely to (correlate significantly with) judgments and scores from other achievement tests that are presumed to be valid, we can conclude that there is evidence for the test’s criterion-related validity.


Predictive Criterion-related Validity

A measure of the extent to which a person’s current test scores can be used to estimate accurately what that person’s criterion scores will be at a later time.

Concurrent and predictive validity differ in the time at which scores on the criterion measure are obtained.

If we are developing a test to assess reading readiness, we can ask: Does knowledge of a student’s score on the reading test allow an accurate estimation of the student’s actual readiness for instruction? How do we know that our test really assesses reading readiness?

The first step is to find a valid criterion measure; if an assessment has content validity and corresponds closely to that measure, we can conclude that the test is valid.


Construct Validity

The extent to which a procedure or test measures a theoretical trait or characteristic.

Especially important for measures of process such as intelligence/cognition.

To provide evidence of construct validity, an author must rely on indirect evidence and inference.

To gauge construct validity, a test developer accumulates evidence that the test acts in the way it would if it were a valid measure of the construct.

As the research evidence accumulates, the developer can make a stronger claim to construct validity.


The Bottom Line…

“Test users are expected to ensure that the test is appropriate for the specific students being assessed.”

Salvia, Ysseldyke & Bolt, 2009, p. 71