ITEM ANALYSIS
INTRODUCTION
Education aims at imparting new knowledge, and every educational
institute or system is guided by this goal. To determine whether, and to what
extent, these aims are met, we need careful appraisal and study. This is the
significance of evaluation in any educational system. The realization of goals and
objectives in the educative process depends on the accuracy of the judgements and
inferences made by decision makers at every stage.
TERMINOLOGY
Evaluation: It is a process of determining to what extent the educational objectives
are being realized. – RALPH TYLER
Evaluation includes both quantitative and qualitative means: quantitative description
of pupil achievement, qualitative description of pupil ability, and value
judgements about achievements and abilities.
Achievement test: It is an important tool in school evaluation and has
great significance in measuring instructional progress and the progress of students
in the subject area.
Definition of Achievement test: Any test that measures the attainment or
accomplishments of an individual after a period of training or learning.
- N M DOWNIE
ITEM ANALYSIS
The last step in the construction of a test is appraising the test, or item analysis.
Item analysis describes the statistical analyses which allow measurement of the
effectiveness of individual test items. An understanding of the factors which govern
effectiveness (and a means of measuring them) enables us to create more
effective test questions and judgements. As an educator you should be able to
recognize the most critical pieces of data in an item analysis report and evaluate
whether or not an item needs revision.
ITEM ANALYSIS or APPRAISING THE TEST
It is the procedure used to judge the quality of an item. To ascertain
whether the questions/items do their job effectively, a detailed analysis has
to be done before a meaningful and scientific inference about the test can be made
in terms of its validity, reliability, objectivity and usability.
Item analysis is a post-administration examination of a test
(Remmers, Gage and Rummel, 1967: 267).
Item analysis is a process which examines students' responses to individual test items
(questions) in order to assess the quality of those items and of the test as a whole.
Item analysis is especially valuable in improving items which will be used again in later
tests, but it can also be used to eliminate ambiguous or misleading items in a single
test administration. In addition, item analysis is valuable for increasing instructors'
skills in test construction, and identifying specific areas of course content which need
greater emphasis or clarity.
ITEM STATISTICS: Item statistics are used to assess the performance of individual
test items, on the assumption that the overall quality of a test derives from the
quality of its items.
The item analysis report provides the following item information.
Item Number: This is the question number taken from the student answer
sheet and the key sheet.
Mean and S.D.: The mean is the "average" student response to an item. It is
computed by adding up the number of points earned by all students on the
item and dividing that total by the number of students.
The standard deviation, or S.D., is a measure of the dispersion of student scores on
that item; that is, it indicates how "spread out" the responses were. The item
standard deviation is most meaningful when comparing items which have more than
one correct alternative and when scale scoring is used.
Alternative means: The mean total test score (minus that item) is shown for students
who selected each of the possible response alternatives.
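A minimal sketch (not from the source) of how the report statistics above could be
computed; the score matrix and response choices are hypothetical:

```python
import numpy as np

scores = np.array([
    [1, 1, 0, 1],   # each row: one student's points on items 1-4
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 0, 1, 0],
])
responses = np.array(["A", "B", "A", "C"])  # hypothetical choices on item 1

item_means = scores.mean(axis=0)  # points earned on each item / number of students
item_sds = scores.std(axis=0)     # spread of responses on each item
print("Item means:", item_means)
print("Item SDs:", item_sds)

# Alternative means: mean total score (minus item 1) per response alternative
rest_totals = scores.sum(axis=1) - scores[:, 0]
for alt in np.unique(responses):
    mean_rest = rest_totals[responses == alt].mean()
    print(f"Alternative {alt}: mean total (minus item) = {mean_rest:.2f}")
```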
Items can be analysed qualitatively, in terms of their content and form, and
quantitatively, in terms of their statistical properties.
Qualitative analysis of items
Content validity: It is the degree to which the test contains a representative
sample of the material taught in the course and is determined against the course
content.
Evaluation of items in terms of effective item writing procedures.
Quantitative analysis
Measurement of item difficulty.
Measurement of item discrimination.
Both the reliability and the validity of any test depend ultimately on the
characteristics of its items. High reliability and validity can be built into a test in
advance through item analysis. Tests can be improved through the selection,
substitution or revision of items.
CONCEPTS OF MEASUREMENT:
VALIDITY: It refers to the appropriateness, meaningfulness, and usefulness of
inferences made from test scores. Validity is the judgment made about a test's ability
to measure what it is intended to measure (according to the Standards for
Educational and Psychological Testing, by the American Psychological Association,
American Educational Research Association, and National Council on Measurement in
Education, 1985).
This judgment is based on three categories of evidence: content-related,
criterion-related, and construct-related.
RELIABILITY: The ability of a test to give dependable and consistent scores. A judgement
about reliability can be made based on the extent to which two similar measures
agree.
Jacobs & Chase (1992) recommended a minimum test length of 25 multiple-choice
questions with an item difficulty sufficient to ensure heterogeneous performance of
the group. Reliability is measured on a scale of 0 to 1; for classroom tests a reliability
of 0.7-0.8 is acceptable.
Reliability coefficients theoretically range in value from zero (no reliability) to 1.00
(perfect reliability).
High reliability means that the questions of a test tended to "pull together." Students
who answered a given question correctly were more likely to answer other questions
correctly. If a parallel test were developed by using similar items, the relative scores
of students would show little change.
Low reliability means that the questions tended to be unrelated to each other in
terms of who answered them correctly. The resulting test scores reflect peculiarities
of the items or the testing situation more than students' knowledge of the subject
matter.
Reliability Interpretation
0.90 and above: Excellent reliability; at the level of the best standardized tests.
0.80 - 0.90: Very good for a classroom test.
0.70 - 0.80: Good for a classroom test; in the range of most classroom tests.
0.60 - 0.70: Somewhat low. The test needs to be supplemented by other measures
(e.g., more tests) to determine grades. There are probably some items which could
be improved.
0.50 - 0.60: Suggests need for revision of the test, unless it is quite short (ten or fewer
items). The test definitely needs to be supplemented by other measures (e.g., more
tests) for grading.
0.50 or below: Questionable reliability; the test needs revision.
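The document does not give a formula for this reliability coefficient; one standard
coefficient for dichotomously scored classroom tests is Kuder-Richardson 20 (KR-20).
A minimal sketch, assuming a hypothetical students-by-items matrix of 0/1 scores:

```python
import numpy as np

def kr20(scores: np.ndarray) -> float:
    """KR-20 reliability for a students-by-items matrix of 0/1 scores."""
    k = scores.shape[1]                              # number of items
    p = scores.mean(axis=0)                          # proportion correct per item
    q = 1.0 - p
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_variance)

# Hypothetical data: 5 students, 4 items
scores = np.array([[1, 1, 1, 0],
                   [1, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 1, 1]])
print(f"KR-20 = {kr20(scores):.2f}")  # interpret against the scale above
```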
OBJECTIVITY: A test is said to be objective when the scorer's personal judgement
does not affect the scoring. Objectivity is a prerequisite of reliability and validity.
USABILITY or PRACTICABILITY: It is the overall simplicity of the test for both constructor
and learner. It depends on various factors such as ease of administration, scoring,
interpretation, and economy. It is an important criterion used for assessing the value
of a test.
TYPES OF ITEM ANALYSIS
There are three main types of item analysis: Classical Test Theory, Rasch
Measurement and Item Response Theory.
1. CLASSICAL TEST THEORY: (traditionally the main method used in the
United Kingdom) it utilises two main statistics: facility and discrimination.
Facility is essentially a measure of the difficulty of an item, arrived at by dividing the
mean mark obtained by a sample of candidates by the maximum mark available.
As a whole, a test should aim to have an overall facility of around 0.5; however, it is
acceptable for individual items to have higher or lower facility (ranging from 0.2 to
0.8).
Discrimination measures how performance on one item correlates with performance
on the test as a whole. There should always be some correlation between item and
test performance; it is expected that discrimination will fall in a range between
0.2 and 1.0.
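A minimal sketch of these two Classical Test Theory statistics, assuming a
hypothetical vector of marks on one item and a vector of whole-test totals:

```python
import numpy as np

def facility(item_marks: np.ndarray, max_mark: float) -> float:
    # Facility: mean mark obtained on the item divided by the maximum mark available
    return item_marks.mean() / max_mark

def discrimination(item_marks: np.ndarray, test_totals: np.ndarray) -> float:
    # Discrimination: correlation between item performance and whole-test performance
    return float(np.corrcoef(item_marks, test_totals)[0, 1])

item_marks = np.array([2, 3, 1, 4, 3])        # hypothetical marks on one item (max 4)
test_totals = np.array([40, 55, 30, 70, 60])  # hypothetical whole-test totals
print(f"Facility = {facility(item_marks, 4):.2f}")  # items should fall around 0.2-0.8
print(f"Discrimination = {discrimination(item_marks, test_totals):.2f}")
```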
2. RASCH MEASUREMENT: Rasch measurement is very similar to IRT1 - in that it
considers only one parameter (difficulty) and the ICC is calculated in the same way.
When it comes to utilising these theories to categorise items however, there is a
significant difference. If you have a set of data, and analyse it with IRT1, then you
arrive at an ICC that fits the data observed. If you use Rasch measurement, extreme
data (e.g. questions which are consistently well or poorly answered) is discarded and
the model is fitted to the remaining data.
3. ITEM RESPONSE THEORY
Item Response Theory (IRT) assumes that there is a correlation between the
score gained by a candidate for one item/test (measurable) and their overall ability on
the latent trait which underlies test performance. Critically, the 'characteristics' of an
item are said to be independent of the ability of the candidates who were sampled.
Item Response Theory comes in three forms: IRT1, IRT2, and IRT3, reflecting
the number of parameters considered in each case.
For IRT1, only the difficulty of an item is considered
(difficulty is the level of ability required to be more likely to answer the
question correctly than to answer it wrongly).
For IRT2, difficulty and discrimination are considered
(discrimination is how well the question separates out candidates of similar
abilities).
For IRT3, difficulty, discrimination and chance are considered
(chance is the random factor which enhances a candidate's probability of success
through guessing).
IRT can be used to create a unique plot for each item (the Item Characteristic Curve,
or ICC). The ICC is a plot of the probability that the item will be answered correctly
against ability.
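The document does not give the functional form of the ICC; a common choice is the
three-parameter logistic (3PL) model, which nests the one- and two-parameter forms
as special cases. A minimal sketch with hypothetical parameter values:

```python
import numpy as np

def icc_3pl(theta: np.ndarray, a: float = 1.0, b: float = 0.0,
            c: float = 0.0) -> np.ndarray:
    """Probability of a correct answer under the 3PL model.
    theta: ability; a: discrimination; b: difficulty; c: chance (guessing floor).
    c = 0 reduces to the IRT2 form; a = 1 and c = 0 to the one-parameter
    (IRT1) form."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

abilities = np.linspace(-3, 3, 7)
# Hypothetical item: moderately discriminating, slightly hard,
# with a guessing floor for a 4-option question
print(icc_3pl(abilities, a=1.2, b=0.5, c=0.25))
```

Plotting these probabilities against ability traces out the ICC described in the next
section: raising b shifts the curve right, raising a steepens it, and raising c lifts its
baseline.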
ITEM CHARACTERISTIC CURVES (ICC) AND ITEM RESPONSE THEORY (IRT):
Test developers use the concepts associated with IRT for item analysis. In essence, IRT
relates each test item's performance to a statistical estimate of the test
taker's knowledge or ability on the measured construct.
A basic characteristic of IRT is the ICC.
Item Characteristic Curves (ICC)
An ICC is a graphical representation of the probability of answering an item
correctly as a function of the level of ability on the construct being measured. The
shape of the ICC reflects the influence of three factors:
1) The item's difficulty;
2) The item's discriminatory power;
3) The probability of answering correctly by guessing.
Increasing the difficulty of an item causes the curve to shift right - as candidates need
to be more able to have the same chance of passing.
Increasing the discrimination of an item causes the gradient of the curve to increase.
Candidates below a given ability are less likely to answer correctly, whilst candidates
above a given ability are more likely to answer correctly.
Increasing the chance raises the baseline of the curve.
IRT logically assumes that individuals who have high scores on the test
have greater ability than those who score low on the test. With this in mind it can be
concluded that the greater the slope of the ICC the better the item is at
discriminating between high and low test performers.
Difficulty, on an ICC, is operationally defined by the point at which the curve
indicates a chance probability of 0.5 (a 50-50 chance) for answering the item
correctly. The higher the level of ability needed to obtain a 0.5 probability (curve
shifted to the right) the more difficult the item.
Item Difficulty:
The difficulty level of an item is defined as the proportion or percentage of the
examinees or individuals who answered the item correctly
(Singh, 1986; Remmers, Gage and Rummel, 1967).
According to J.P. Guilford, "the difficulty value of an item is defined as the
proportion or percentage of the examinees who have answered the item correctly"
(Freeman, 1962: 112-113; Sharma, 2000).
The proportion of examinees who answered the item correctly is frequently also
called the p-value. Item difficulty is most commonly measured by calculating the
proportion of test-takers who answer the item correctly:
p = (number of people responding correctly) / (number of people taking the test)
Jacobs & Chase (1992) recommended that most items in a test be
approximately p = 0.5 (a 50% chance of a correct answer) to help ensure that
questions separate learners from non-learners (a good discrimination index). The
upper limit of item difficulty is 1.00 (100% of students answered the question
correctly). The lower limit of item difficulty depends on the number of possible
responses and the probability of guessing the answer (for example, for an item with
4 options, p = 0.25). Thus, most test developers seek to develop tests where the
average difficulty score is about 0.5. Items with difficulty levels between 0 - 0.2 and
0.8 - 1.0 are often discarded because they are too difficult or too easy, respectively;
they do not differentiate the population.
In a test it is customary to arrange items in order of difficulty, so that test
takers begin with relatively easy items and proceed to items of increasing difficulty.
This arrangement gives test takers confidence in approaching the test and also
reduces the likelihood of wasting time on items beyond their ability while neglecting
items they could complete correctly.
Item Discrimination:
Item discrimination refers to the way an item differentiates students who know
the content from those who do not: a measure of how well an item
distinguishes between students who are knowledgeable and those who are not.
Definition: According to Marshall Hales (1972), the discriminating power of an item
may be defined as the extent to which success or failure on that item indicates the
possession of the achievement being measured.
Discrimination can be measured as a point-biserial correlation. If a question
discriminates well, the point-biserial correlation will be highly positive for the correct
answer and negative for the distractors.
The Discrimination Index:
A discrimination index (D) is calculated to measure how well a test item separates
those test takers who show a high degree of a skill, knowledge, attitude, or
personality trait from those who show a low degree. This index compares, for each
test item, the performance of those who scored the best (U, the upper group) with
those who scored the worst (L, the lower group). The steps, with a sketch of the
computation after the list:
1) Rank-order the test scores from lowest to highest.
2) The upper 25-35% and the lower 25-35% form the analysis groups.
3) Calculate the proportion of test-takers passing each item in both groups: U =
(number of uppers who responded correctly) / (total number in the upper group); L =
(number of lowers who responded correctly) / (total number in the lower group).
4) D = U – L
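A minimal sketch of steps 1-4, with hypothetical data and a 27% group fraction (any
value in the 25-35% range above would do):

```python
import numpy as np

def discrimination_index(item_correct: np.ndarray, totals: np.ndarray,
                         fraction: float = 0.27) -> float:
    """D = U - L, where U and L are the proportions answering the item
    correctly in the upper and lower scoring groups."""
    k = max(1, int(len(totals) * fraction))
    order = np.argsort(totals)            # step 1: rank-order by total score
    lower, upper = order[:k], order[-k:]  # step 2: form the analysis groups
    u = item_correct[upper].mean()        # step 3: proportion passing, upper group
    l = item_correct[lower].mean()        #         proportion passing, lower group
    return u - l                          # step 4

item_correct = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1])   # 0/1 on one item
totals = np.array([38, 45, 20, 50, 22, 18, 47, 25, 40, 44])
print(f"D = {discrimination_index(item_correct, totals):.2f}")
```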
Item discrimination indices range from -1.0 to 1.0. The higher the value, the better
the item is able to discriminate between strong and weak students. The logic of the
D statistic is simple: tests are more difficult for those who score poorly (the lower
group), so if an item is measuring the same thing as the test, the item should be
more difficult for the lower group. The D statistic provides a measure of each item's
discriminating power with respect to the upper and lower groups. On the basis of
discriminating power, items are classified into three types (Sharma, 2000: 201).
Positive discrimination: If an item is answered correctly by the superior (upper)
group but not by the inferior (lower) group, the item possesses positive
discrimination.
Negative discrimination: If an item is answered correctly by the inferior (lower)
group but not by the superior (upper) group, the item possesses negative
discrimination.
Zero discrimination: If an item is answered correctly by the same number of
superior and inferior examinees, the item cannot discriminate between superior and
inferior examinees; the discrimination power of the item is zero.
Inter-Item Correlations:
The inter-item correlation matrix is another important component of item analysis.
This matrix displays the correlation of each item with every other item. Usually each
item is coded as dichotomous (incorrect = 0, correct = 1), and the resulting matrix is
composed of phi coefficients, which are interpreted much like Pearson product-
moment correlation coefficients. This matrix provides important information about a
test's internal consistency and what could be done to improve it. Ideally each item
should correlate highly with the other items measuring the same construct.
Items that do not correlate with the other items measuring the same construct can
be dropped without reducing the test’s reliability.
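For 0/1 items the phi coefficient equals the Pearson correlation, so the whole matrix
can be computed in one call. A minimal sketch with a hypothetical score matrix:

```python
import numpy as np

# Hypothetical students-by-items matrix of dichotomous scores (0/1)
scores = np.array([[1, 1, 0, 1],
                   [1, 1, 1, 1],
                   [0, 0, 1, 0],
                   [1, 0, 0, 1],
                   [0, 1, 1, 0]])

# For dichotomous variables, the Pearson correlation reduces to the phi
# coefficient, so the inter-item matrix is simply:
phi_matrix = np.corrcoef(scores, rowvar=False)   # items x items
print(np.round(phi_matrix, 2))
```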
Item-Total Correlations
Point-biserial or item-total correlations assess the usefulness of an item as a
measure of individual differences in knowledge, ability, or a personality characteristic.
Here each dichotomous test item (incorrect = 0; correct = 1) is correlated with the
person's total test score. Interpretation of the item-total correlation is similar to that
of the D statistic. A modest positive correlation indicates two things:
1) The item in question is measuring the same construct as the test;
2) The item is successfully discriminating between those who perform well and those
who perform poorly.
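A minimal sketch of the item-total correlation; correlating each item with the total
minus that item (the "corrected" total) is a common refinement, assumed here, that
keeps the item from inflating its own correlation:

```python
import numpy as np

def item_total_correlation(scores: np.ndarray, item: int) -> float:
    """Point-biserial correlation of one 0/1 item with the rest of the test.
    Equals the Pearson correlation because the item is dichotomous."""
    rest = scores.sum(axis=1) - scores[:, item]   # total score minus the item
    return float(np.corrcoef(scores[:, item], rest)[0, 1])

scores = np.array([[1, 1, 0, 1],    # hypothetical 0/1 scores, 5 students x 4 items
                   [1, 1, 1, 1],
                   [0, 0, 1, 0],
                   [1, 0, 0, 1],
                   [0, 1, 1, 0]])
for i in range(scores.shape[1]):
    print(f"Item {i + 1}: r = {item_total_correlation(scores, i):.2f}")
```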
Distractor Analysis
Distractor difficulties and distractor discriminations measure the
proportion of students selecting each wrong answer and the ability of each wrong
answer to distinguish between strong and weak students. With multiple-choice tests
there is usually one correct answer and a few wrong answers, or distractors. A lot can
be learned from analyzing the frequency with which test-takers choose each distractor.
Effective distractors should appeal to the non-learner, as indicated by a
negative point-biserial correlation. Distractors with a point-biserial correlation of zero
indicate that students did not select them and that they need to be revised or
replaced with a more plausible option for students who did not understand the
content.
Consider that perfect multiple-choice questions should have two distinctive features:
1) People who know the answer pick the correct answer;
2) People who do not know the answer guess among the possible responses. This
means that each distractor should be equally popular. It also means that:
the number of correct answers = those who truly know + some random amount.
To account for this, professors should subtract the randomness factor from each
person's score to get a more accurate view of the person's true knowledge; one
classical way of doing this is sketched below.
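The text does not give a formula for this correction; the classical "correction for
guessing" is R - W/(k-1), which follows directly from the assumption above that
wrong answers come from random guessing among k equally popular options. A
minimal sketch:

```python
def corrected_score(num_right: int, num_wrong: int, options_per_item: int) -> float:
    """Classical correction for guessing: R - W / (k - 1).
    Assumes every wrong answer was a random guess among k equally popular
    options, so for every k - 1 wrong guesses roughly one other guess was
    (undeservedly) right. Omitted items are ignored."""
    return num_right - num_wrong / (options_per_item - 1)

# Hypothetical: 30 right and 12 wrong on 4-option items
print(corrected_score(30, 12, 4))   # 26.0
```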
STEPS IN ITEM ANALYSIS
1. Score the exam and sort the results by score.
2. Rank in order of merit and identify the high and low groups.
Select an equal number of students from each end, e.g. the top 25% (upper 1/4) and
the bottom 25% (lower 1/4).
3. For each item, count the number of students in each group who answered the item
correctly. For alternative-response questions, count the number of
students in each group who chose each alternative.
4. Compare the performance of these two groups on each of the test items.
For any well-written item:
-- a greater proportion of students in the upper group should have selected the correct
answer;
-- a greater proportion of students in the lower group should have selected each of the
distractor (incorrect) answers.
5. Calculate the difficulty index of each question.
Item difficulty index: the percentage of students who get the item correct,
D = (R / N) * 100
R: number of pupils who answered the item correctly
N: total number of students who attempted the item.
Item Difficulty Level or facility level of a test: it is an index of how easy or difficult the
test is from the point of view of the test takers. It is the ratio of the average score of a
sample on the test to the maximum possible score on the test:
Difficulty level = (average score on the test / maximum possible score) * 100
The difficulty index of an item is the percentage of students who answered the item
correctly:
Difficulty index = ((H + L) / N) * 100
H: number of correct answers in the high group
L: number of correct answers in the low group
N: total number of students in both groups.
In the case of objective tests, find the facility value:
Facility value = (number of students answering the question correctly / number of
students who took the test) * 100
Difficulty classification:
High (Difficult): <= 30%
Medium (Moderate): > 30% and < 80%
Low (Easy): >= 80%
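A minimal sketch combining the difficulty-index formula with the classification above
(the counts are hypothetical):

```python
def difficulty_index(r_correct: int, n_total: int) -> float:
    # D = (R / N) * 100: percentage of students answering the item correctly
    return 100.0 * r_correct / n_total

def difficulty_level(d: float) -> str:
    # Classification from the table above
    if d <= 30:
        return "High (Difficult)"
    if d < 80:
        return "Medium (Moderate)"
    return "Low (Easy)"

d = difficulty_index(r_correct=24, n_total=40)
print(f"D = {d:.0f}% -> {difficulty_level(d)}")   # D = 60% -> Medium (Moderate)
```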
6. Item Difficulty Level: Discussion
Is a test that nobody failed too easy?
Is a test on which nobody got 100% too difficult?
Should items that are “too easy” or “too difficult” be thrown out?
7. Estimating Item Discrimination
Generally, students who did well on the exam should select the correct answer to
any given item on the exam.
The Discrimination Index distinguishes, for each item, between the performance of
students who did well on the exam and students who did poorly.
For each item, subtract the number of students in the lower group who answered
correctly (RL) from the number of students in the upper group who answered
correctly (RU), then divide the result by the number of students in one group, N/2:
DI = (RU - RL) / (N / 2)
Formula 2:
DI = (HAQ - LAQ) / HAG
HAQ = number of students in the high-ability group answering the question correctly.
LAQ = number of students in the low-ability group answering the question correctly.
HAG = number of students in the high-ability group.
The Discrimination Index is listed in decimal format and ranges between -1 and 1.
For exams with a normal distribution, a discrimination of 0.3 and above is good; 0.6
and above is very good.
Review decisions by Item Discrimination (D) and Item Difficulty:
                   High      Medium    Low
D <= 0%            Review    Review    Review
0% < D < 30%       Ok        Review    Ok
D >= 30%           Ok        Ok        Ok
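A minimal sketch implementing the table above; the underlying logic is that very hard
or very easy items are not expected to discriminate much, so low positive
discrimination only flags medium-difficulty items for review:

```python
def review_decision(difficulty_pct: float, discrimination_pct: float) -> str:
    """Decision rule from the table above (both inputs in percent)."""
    if discrimination_pct <= 0:
        return "Review"      # negative or zero discrimination: always review
    if discrimination_pct >= 30:
        return "Ok"          # good discrimination at any difficulty
    # Low positive discrimination: acceptable only at the difficulty extremes
    is_medium = 30 < difficulty_pct < 80
    return "Review" if is_medium else "Ok"

print(review_decision(difficulty_pct=55, discrimination_pct=15))  # Review
print(review_decision(difficulty_pct=85, discrimination_pct=15))  # Ok
```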
ADVANTAGES OF ITEM ANALYSIS.
Helps in judging the worth or quality of a test.
Aids in subsequent test revisions.
Increases skill in test construction.
Provides diagnostic value and helps in planning future learning activities.
Provides a basis for discussing test results.
Helps in making decisions about the promotion of students to the next higher grade.
Brings about improvement in teaching methods and techniques.
ITEM REVISION
Items with negative or low positive item discrimination should either
be revised or deleted from the item bank. Distractors whose discrimination is
positive, or negative but too weak given the item difficulty, should be replaced.
For an item to be revised successfully, it is often necessary to have at least
one solid distractor that will not be changed. If all distractors are poor, or none
is particularly strong, delete the item and write a new one. Change only the parts of
the item that caused problems. If an item fails even after revision, it should
be deleted and replaced with a new one.
REVIEW OF LITERATURE
Item analysis for the written test of Taiwanese board certification examination in
anaesthesiology using the Rasch model
AUTHORS: K.-Y. Chang, M.-Y. Tsou
ABSTRACT
Background On the written test of board certification examination for
anaesthesiology, the probability of a question being answered correctly is subject to
two main factors, item difficulty and examinee ability. Thus, item analysis can
provide insight into the appropriateness of a particular test, given the ability of
examinees.
Methods Study subjects were 36 Taiwanese examinees tested with 100 questions
related to anaesthesiology. We used the Rasch model to perform item analysis of
questions answered by each examinee to assess the effects of question difficulty and
examinee ability using a common logit scale. Additionally, we evaluated test
reliability and virtual failure rates under different criteria.
Results The mean examinee ability was higher than the mean item difficulty in this
written test by 1.28 (SD=0.57) logit units, which means that the examinees, on
average, were able to correctly answer 78% of items. The difficulty of items
decreased from 4.25 to −2.43 on the logit scale, corresponding to the probability of
having a correct answer from 5% to 98%. There were 60 items with difficulty lower
than the least able examinee and seven difficult items beyond the most able one.
The agreement of item difficulty between test developers and our Rasch model was
poor (weighted κ=0.23).
Conclusions We demonstrated how to assess the construct validity and reliability of
the written examination in order to provide useful information for future board
certification examinations.
CONCLUSION
Developing, administering, and analyzing a test may seem a monumental task, but a
step-by-step approach simplifies the process. Validating tests using item analysis is
an effective method of assessing outcomes in a classroom. Item analysis is
especially valuable in improving items which will be used again in later tests, but it
can also be used to eliminate ambiguous or misleading items in a single test
administration. In addition, item analysis is valuable for increasing instructors' skills
in test construction, and identifying specific areas of course content which need
greater emphasis or clarity.
BIBLIOGRAPHY
J C Aggarwal. Essentials of Examination System. 1st edition. New Delhi: Vikas
Publishing House Pvt Ltd; 1997. p. 265-72.
J J Guilbert. Handbook for Health Professionals. 6th edition. WHO Offset
Publications; 1992. p. 477-81.
Diane M Billings, Judith A Halstead. Teaching in Nursing: A Guide for Faculty. 1st
edition. USA: W.B. Saunders Company; 1998. p. 396-405.
K P Neeraja. Textbook of Nursing Education. 1st edition. New Delhi: Jaypee
Brothers Medical Publishers (P) Ltd; 2003. p. 415-17.
Arul Jyothi, D L Balaji, Pratiksha Jagran. Curriculum Development. New Delhi: Centum Press.
B Sankaranarayan, B Sindhu. Learning and Teaching Nursing. Kerala, India: Brainfil.
B T Basavanthappa. Nursing Education. New Delhi, India: Jaypee.
Loretta E Heidgerken. Teaching and Learning in Schools of Nursing. 3rd edition. Konark Pvt Ltd.
Vimal G Thakkar. Nursing and Nursing Education. 2nd edition. Mumbai, India: Vora Medical Publications.
http://bja.oxfordjournals.org/content/104/6/717.abstract
http://www.scrolla.hw.ac.uk/focus/ia.html
http://xnet.rrc.mb.ca/tomh/item_analysis.html