ITEM ANALYSIS
INTRODUCTION
Education aims at imparting new knowledge, and every educational
institute or system is guided by this goal. To determine whether, and to what
extent, these aims are met, we need careful appraisal and study. This is the
significance of evaluation in any educational system. The realization of goals and
objectives in the educative process depends on the accuracy of the judgements and
inferences made by decision makers at every stage.
TERMINOLOGY
Evaluation: It is a process of determining to what extent the educational objectives
are being realized. – RALPH TYLER
Evaluation includes both quantitative and qualitative means: quantitative description
of pupil achievement, qualitative description of pupil ability, and value
judgements about achievements and abilities.
Achievement test: It is an important tool in school evaluation and has
great significance in measuring instructional progress and the progress of students
in the subject area.
Definition of Achievement test: Any test that measures the attainment or
accomplishments of an individual after a period of training or learning.
- N M DOWNIE
ITEM ANALYSIS
The last step in the construction of a test is appraising the test, or item analysis.
Item analysis describes the statistical analyses which allow measurement of the
effectiveness of individual test items. An understanding of the factors which govern
effectiveness (and a means of measuring them) enables us to create more
effective test questions and judgements. As an educator you should be able to
recognize the most critical pieces of data in an item analysis report and evaluate
whether or not an item needs revision.
ITEM ANALYSIS or APPRAISING THE TEST
It is the procedure used to judge the quality of an item. To ascertain
whether the questions/items do their job effectively, a detailed analysis has
to be done before a meaningful and scientific inference about the test can be made
in terms of its validity, reliability, objectivity and usability.
Item analysis is a post-administration examination of a test
(Remmers, Gage and Rummel, 1967: 267).
Item analysis is a process which examines students' responses to individual test items
(questions) in order to assess the quality of those items and of the test as a whole.
Item analysis is especially valuable in improving items which will be used again in later
tests, but it can also be used to eliminate ambiguous or misleading items in a single
test administration. In addition, item analysis is valuable for increasing instructors'
skills in test construction, and identifying specific areas of course content which need
greater emphasis or clarity.
ITEM STATISTICS: Item statistics are used to assess the performance of individual
test items, on the assumption that the overall quality of a test derives from the
quality of its items.
The item analysis report provides the following item information.
Item Number: This is the question number taken from the student answer
sheet and the key sheet.
Mean and S.D.: The mean is the "average" student response to an item. It is
computed by adding up the number of points earned by all students on the
item and dividing that total by the number of students.
The standard deviation, or S.D., is a measure of the dispersion of student scores on
that item; that is, it indicates how "spread out" the responses were. The item
standard deviation is most meaningful when comparing items which have more than
one correct alternative and when scale scoring is used.
Alternative means: The mean total test score (minus that item) is shown for students
who selected each of the possible response alternatives.
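A minimal sketch (not from the source) of how the report statistics above could be
computed; the score matrix and response choices are hypothetical:

```python
import numpy as np

scores = np.array([
    [1, 1, 0, 1],   # each row: one student's points on items 1-4
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 0, 1, 0],
])
responses = np.array(["A", "B", "A", "C"])  # hypothetical choices on item 1

item_means = scores.mean(axis=0)  # points earned on each item / number of students
item_sds = scores.std(axis=0)     # spread of responses on each item
print("Item means:", item_means)
print("Item SDs:", item_sds)

# Alternative means: mean total score (minus item 1) per response alternative
rest_totals = scores.sum(axis=1) - scores[:, 0]
for alt in np.unique(responses):
    mean_rest = rest_totals[responses == alt].mean()
    print(f"Alternative {alt}: mean total (minus item) = {mean_rest:.2f}")
```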
Items can be analysed qualitatively, in terms of their content and form, and
quantitatively, in terms of their statistical properties.
Qualitative analysis of items
Content validity: It is the degree to which the test contains a representative
sample of the material taught in the course and is determined against the course
content.
Evaluation of items in terms of effective item writing procedures.
Quantitative analysis
Measurement of item difficulty.
Measurement of item discrimination.
Both the reliability and the validity of any test depend ultimately on the
characteristics of its items. High reliability and validity can be built into a test in
advance through item analysis. Tests can be improved through the selection,
substitution or revision of items.
CONCEPTS OF MEASUREMENT:
VALIDITY: It refers to the appropriateness, meaningfulness, and usefulness of
inferences made from test scores. Validity is the judgment made about a test's ability
to measure what it is intended to measure (according to the Standards for
Educational and Psychological Testing, by the American Psychological Association,
American Educational Research Association, and National Council on Measurement in
Education, 1985).
This judgment is based on three categories of evidence: content-related,
criterion-related, and construct-related.
RELIABILITY: The ability of a test to give dependable and consistent scores. A judgement
about reliability can be made based on the extent to which two similar measures
agree.
Jacobs & Chase (1992) recommended a minimum test length of 25 multiple-choice
questions with an item difficulty sufficient to ensure heterogeneous performance of
the group. Reliability is measured on a scale of 0 to 1; for classroom tests a reliability
of 0.7-0.8 is acceptable.
Reliability coefficients theoretically range in value from zero (no reliability) to 1.00
(perfect reliability).
High reliability means that the questions of a test tended to "pull together." Students
who answered a given question correctly were more likely to answer other questions
correctly. If a parallel test were developed by using similar items, the relative scores
of students would show little change.
Low reliability means that the questions tended to be unrelated to each other in
terms of who answered them correctly. The resulting test scores reflect peculiarities
of the items or the testing situation more than students' knowledge of the subject
matter.
Reliability Interpretation
0.90 and above: Excellent reliability; at the level of the best standardized tests.
0.80 - 0.90: Very good for a classroom test.
0.70 - 0.80: Good for a classroom test; in the range of most classroom tests.
0.60 - 0.70: Somewhat low. The test needs to be supplemented by other measures
(e.g., more tests) to determine grades. There are probably some items which could
be improved.
0.50 - 0.60: Suggests need for revision of the test, unless it is quite short (ten or fewer
items). The test definitely needs to be supplemented by other measures (e.g., more
tests) for grading.
0.50 or below: Questionable reliability; the test needs revision.
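The document does not give a formula for this reliability coefficient; one standard
coefficient for dichotomously scored classroom tests is Kuder-Richardson 20 (KR-20).
A minimal sketch, assuming a hypothetical students-by-items matrix of 0/1 scores:

```python
import numpy as np

def kr20(scores: np.ndarray) -> float:
    """KR-20 reliability for a students-by-items matrix of 0/1 scores."""
    k = scores.shape[1]                              # number of items
    p = scores.mean(axis=0)                          # proportion correct per item
    q = 1.0 - p
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_variance)

# Hypothetical data: 5 students, 4 items
scores = np.array([[1, 1, 1, 0],
                   [1, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 1, 1]])
print(f"KR-20 = {kr20(scores):.2f}")  # interpret against the scale above
```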
OBJECTIVITY: A test is said to be objective when the scorer's personal judgement
does not affect the scoring. Objectivity is a prerequisite of reliability and validity.
USABILITY or PRACTICABILITY: It is the overall simplicity of the test for both constructor
and learner. It depends on various factors such as ease of administration, scoring,
interpretation, and economy. It is an important criterion used for assessing the value
of a test.
TYPES OF ITEM ANALYSIS
There are three main types of item analysis: Classical Test Theory, Rasch
Measurement and Item Response Theory.
1. CLASSICAL TEST THEORY: (traditionally the main method used in the
United Kingdom) it utilises two main statistics: facility and discrimination.
Facility is essentially a measure of the difficulty of an item, arrived at by dividing the
mean mark obtained by a sample of candidates by the maximum mark available.
As a whole, a test should aim to have an overall facility of around 0.5; however, it is
acceptable for individual items to have higher or lower facility (ranging from 0.2 to
0.8).
Discrimination measures how performance on one item correlates with performance
on the test as a whole. There should always be some correlation between item and
test performance; it is expected that discrimination will fall in a range between
0.2 and 1.0.
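A minimal sketch of these two Classical Test Theory statistics, assuming a
hypothetical vector of marks on one item and a vector of whole-test totals:

```python
import numpy as np

def facility(item_marks: np.ndarray, max_mark: float) -> float:
    # Facility: mean mark obtained on the item divided by the maximum mark available
    return item_marks.mean() / max_mark

def discrimination(item_marks: np.ndarray, test_totals: np.ndarray) -> float:
    # Discrimination: correlation between item performance and whole-test performance
    return float(np.corrcoef(item_marks, test_totals)[0, 1])

item_marks = np.array([2, 3, 1, 4, 3])        # hypothetical marks on one item (max 4)
test_totals = np.array([40, 55, 30, 70, 60])  # hypothetical whole-test totals
print(f"Facility = {facility(item_marks, 4):.2f}")  # items should fall around 0.2-0.8
print(f"Discrimination = {discrimination(item_marks, test_totals):.2f}")
```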
2. RASCH MEASUREMENT: Rasch measurement is very similar to IRT1 - in that it
considers only one parameter (difficulty) and the ICC is calculated in the same way.
When it comes to utilising these theories to categorise items however, there is a
significant difference. If you have a set of data, and analyse it with IRT1, then you
arrive at an ICC that fits the data observed. If you use Rasch measurement, extreme
data (e.g. questions which are consistently well or poorly answered) is discarded and
the model is fitted to the remaining data.
3. ITEM RESPONSE THEORY
Item Response Theory (IRT) assumes that there is a correlation between the
score gained by a candidate for one item/test (measurable) and their overall ability on
the latent trait which underlies test performance. Critically, the 'characteristics' of an
item are said to be independent of the ability of the candidates who were sampled.
Item Response Theory comes in three forms: IRT1, IRT2, and IRT3, reflecting
the number of parameters considered in each case.
For IRT1, only the difficulty of an item is considered
(difficulty is the level of ability required to be more likely to answer the
question correctly than to answer it wrongly).
For IRT2, difficulty and discrimination are considered
(discrimination is how well the question separates out candidates of similar
abilities).
For IRT3, difficulty, discrimination and chance are considered
(chance is the random factor which enhances a candidate's probability of success
through guessing).
IRT can be used to create a unique plot for each item (the Item Characteristic Curve,
or ICC). The ICC is a plot of the probability that the item will be answered correctly
against ability.
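The document does not give the functional form of the ICC; a common choice is the
three-parameter logistic (3PL) model, which nests the one- and two-parameter forms
as special cases. A minimal sketch with hypothetical parameter values:

```python
import numpy as np

def icc_3pl(theta: np.ndarray, a: float = 1.0, b: float = 0.0,
            c: float = 0.0) -> np.ndarray:
    """Probability of a correct answer under the 3PL model.
    theta: ability; a: discrimination; b: difficulty; c: chance (guessing floor).
    c = 0 reduces to the IRT2 form; a = 1 and c = 0 to the one-parameter
    (IRT1) form."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

abilities = np.linspace(-3, 3, 7)
# Hypothetical item: moderately discriminating, slightly hard,
# with a guessing floor for a 4-option question
print(icc_3pl(abilities, a=1.2, b=0.5, c=0.25))
```

Plotting these probabilities against ability traces out the ICC described in the next
section: raising b shifts the curve right, raising a steepens it, and raising c lifts its
baseline.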
ITEM CHARACTERISTIC CURVES (ICC) AND ITEM RESPONSE THEORY (IRT):
Test developers use the concepts associated with IRT for item analysis. In essence, IRT
relates each test item's performance to a statistical estimate of the test
taker's knowledge or ability on the measured construct.
A basic characteristic of IRT is the ICC.
Item Characteristic Curves (ICC)
An ICC is a graphical representation of the probability of answering an item
correctly as a function of the level of ability on the construct being measured. The
shape of the ICC reflects the influence of three factors:
1) The item's difficulty;
2) The item's discriminatory power;
3) The probability of answering correctly by guessing.
Increasing the difficulty of an item causes the curve to shift right - as candidates need
to be more able to have the same chance of passing.
Increasing the discrimination of an item causes the gradient of the curve to increase.
Candidates below a given ability are less likely to answer correctly, whilst candidates
above a given ability are more likely to answer correctly.
Increasing the chance raises the baseline of the curve.
IRT logically assumes that individuals who have high scores on the test
have greater ability than those who score low on the test. With this in mind it can be
concluded that the greater the slope of the ICC the better the item is at
discriminating between high and low test performers.
Difficulty, on an ICC, is operationally defined by the point at which the curve
indicates a chance probability of 0.5 (a 50-50 chance) for answering the item
correctly. The higher the level of ability needed to obtain a 0.5 probability (curve
shifted to the right) the more difficult the item.
Item Difficulty:
The difficulty level of an item is defined as the proportion or percentage of the
examinees or individuals who answered the item correctly
(Singh, 1986; Remmers, Gage and Rummel, 1967).
According to J.P. Guilford, "the difficulty value of an item is defined as the
proportion or percentage of the examinees who have answered the item correctly"
(Freeman, 1962: 112-113; Sharma, 2000).
The proportion of examinees who answered the item correctly is frequently also
called the p-value. Item difficulty is most commonly measured by calculating the
proportion of test-takers who answer the item correctly:
p = (number of people responding correctly) / (number of people taking the test)
Jacobs & Chase (1992) recommended that most items in a test be
approximately p = 0.5 (a 50% chance of a correct answer) to help ensure that
questions separate learners from non-learners (a good discrimination index). The
upper limit of item difficulty is 1.00 (100% of students answered the question
correctly). The lower limit of item difficulty depends on the number of possible
responses and the probability of guessing the answer (for example, for an item with
4 options, p = 0.25). Thus, most test developers seek to develop tests where the
average difficulty score is about 0.5. Items with difficulty levels between 0 - 0.2 and
0.8 - 1.0 are often discarded because they are too difficult or too easy, respectively;
they do not differentiate the population.
In a test it is customary to arrange items in order of difficulty, so that test
takers begin with relatively easy items and proceed to items of increasing difficulty.
This arrangement gives test takers confidence in approaching the test and also
reduces the likelihood of wasting time on items beyond their ability while neglecting
items they could complete correctly.
Item Discrimination:
Item discrimination refers to the way an item differentiates students who know
the content from those who do not: a measure of how well an item
distinguishes between students who are knowledgeable and those who are not.
Definition: According to Marshall Hales (1972), the discriminating power of an item
may be defined as the extent to which success or failure on that item indicates the
possession of the achievement being measured.
Discrimination can be measured as a point-biserial correlation. If a question
discriminates well, the point-biserial correlation will be highly positive for the correct
answer and negative for the distractors.
The Discrimination Index:
A discrimination index (D) is calculated to measure how well a test item separates
those test takers who show a high degree of a skill, knowledge, attitude, or
personality trait from those who show a low degree. This index compares, for each
test item, the performance of those who scored the best (U, the upper group) with
those who scored the worst (L, the lower group). The steps, with a sketch of the
computation after the list:
1) Rank-order the test scores from lowest to highest.
2) The upper 25-35% and the lower 25-35% form the analysis groups.
3) Calculate the proportion of test-takers passing each item in both groups: U =
(number of uppers who responded correctly) / (total number in the upper group); L =
(number of lowers who responded correctly) / (total number in the lower group).
4) D = U – L
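A minimal sketch of steps 1-4, with hypothetical data and a 27% group fraction (any
value in the 25-35% range above would do):

```python
import numpy as np

def discrimination_index(item_correct: np.ndarray, totals: np.ndarray,
                         fraction: float = 0.27) -> float:
    """D = U - L, where U and L are the proportions answering the item
    correctly in the upper and lower scoring groups."""
    k = max(1, int(len(totals) * fraction))
    order = np.argsort(totals)            # step 1: rank-order by total score
    lower, upper = order[:k], order[-k:]  # step 2: form the analysis groups
    u = item_correct[upper].mean()        # step 3: proportion passing, upper group
    l = item_correct[lower].mean()        #         proportion passing, lower group
    return u - l                          # step 4

item_correct = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1])   # 0/1 on one item
totals = np.array([38, 45, 20, 50, 22, 18, 47, 25, 40, 44])
print(f"D = {discrimination_index(item_correct, totals):.2f}")
```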
Item discrimination indices range from -1.0 to 1.0. The higher the value, the better
the item is able to discriminate between strong and weak students. The logic of the
D statistic is simple: tests are more difficult for those who score poorly (the lower
group), so if an item is measuring the same thing as the test, the item should be
more difficult for the lower group. The D statistic provides a measure of each item's
discriminating power with respect to the upper and lower groups. On the basis of
discriminating power, items are classified into three types (Sharma, 2000: 201).
Positive discrimination: If an item is answered correctly by the superior (upper)
group but not by the inferior (lower) group, the item possesses positive
discrimination.
Negative discrimination: If an item is answered correctly by the inferior (lower)
group but not by the superior (upper) group, the item possesses negative
discrimination.
Zero discrimination: If an item is answered correctly by the same number of
superior and inferior examinees, the item cannot discriminate between superior and
inferior examinees; the discrimination power of the item is zero.
Inter-Item Correlations:
The inter-item correlation matrix is another important component of item analysis.
This matrix displays the correlation of each item with every other item. Usually each
item is coded as dichotomous (incorrect = 0, correct = 1), and the resulting matrix is
composed of phi coefficients, which are interpreted much like Pearson product-
moment correlation coefficients. This matrix provides important information about a
test's internal consistency and what could be done to improve it. Ideally each item
should correlate highly with the other items measuring the same construct.
Items that do not correlate with the other items measuring the same construct can
be dropped without reducing the test’s reliability.
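For 0/1 items the phi coefficient equals the Pearson correlation, so the whole matrix
can be computed in one call. A minimal sketch with a hypothetical score matrix:

```python
import numpy as np

# Hypothetical students-by-items matrix of dichotomous scores (0/1)
scores = np.array([[1, 1, 0, 1],
                   [1, 1, 1, 1],
                   [0, 0, 1, 0],
                   [1, 0, 0, 1],
                   [0, 1, 1, 0]])

# For dichotomous variables, the Pearson correlation reduces to the phi
# coefficient, so the inter-item matrix is simply:
phi_matrix = np.corrcoef(scores, rowvar=False)   # items x items
print(np.round(phi_matrix, 2))
```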
Item-Total Correlations
Point-biserial or item-total correlations assess the usefulness of an item as a
measure of individual differences in knowledge, ability, or a personality characteristic.
Here each dichotomous test item (incorrect = 0; correct = 1) is correlated with the
person's total test score. Interpretation of the item-total correlation is similar to that
of the D statistic. A modest positive correlation indicates two things:
1) The item in question is measuring the same construct as the test;
2) The item is successfully discriminating between those who perform well and those
who perform poorly.
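A minimal sketch of the item-total correlation; correlating each item with the total
minus that item (the "corrected" total) is a common refinement, assumed here, that
keeps the item from inflating its own correlation:

```python
import numpy as np

def item_total_correlation(scores: np.ndarray, item: int) -> float:
    """Point-biserial correlation of one 0/1 item with the rest of the test.
    Equals the Pearson correlation because the item is dichotomous."""
    rest = scores.sum(axis=1) - scores[:, item]   # total score minus the item
    return float(np.corrcoef(scores[:, item], rest)[0, 1])

scores = np.array([[1, 1, 0, 1],    # hypothetical 0/1 scores, 5 students x 4 items
                   [1, 1, 1, 1],
                   [0, 0, 1, 0],
                   [1, 0, 0, 1],
                   [0, 1, 1, 0]])
for i in range(scores.shape[1]):
    print(f"Item {i + 1}: r = {item_total_correlation(scores, i):.2f}")
```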
Distractor Analysis
Distractor difficulties and distractor discriminations measure the
proportion of students selecting each wrong answer and the ability of each wrong
answer to distinguish between strong and weak students. With multiple-choice tests
there is usually one correct answer and a few wrong answers, or distractors. A lot can
be learned from analyzing the frequency with which test-takers choose each distractor.
Effective distractors should appeal to the non-learner, as indicated by a
negative point-biserial correlation. Distractors with a point-biserial correlation of zero
indicate that students did not select them and that they need to be revised or
replaced with a more plausible option for students who did not understand the
content.
Consider that perfect multiple-choice questions should have two distinctive features:
1) People who know the answer pick the correct answer;
2) People who do not know the answer guess among the possible responses. This
means that each distractor should be equally popular. It also means that:
the number of correct answers = those who truly know + some random amount.
To account for this, professors should subtract the randomness factor from each
person's score to get a more accurate view of the person's true knowledge; one
classical way of doing this is sketched below.
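The text does not give a formula for this correction; the classical "correction for
guessing" is R - W/(k-1), which follows directly from the assumption above that
wrong answers come from random guessing among k equally popular options. A
minimal sketch:

```python
def corrected_score(num_right: int, num_wrong: int, options_per_item: int) -> float:
    """Classical correction for guessing: R - W / (k - 1).
    Assumes every wrong answer was a random guess among k equally popular
    options, so for every k - 1 wrong guesses roughly one other guess was
    (undeservedly) right. Omitted items are ignored."""
    return num_right - num_wrong / (options_per_item - 1)

# Hypothetical: 30 right and 12 wrong on 4-option items
print(corrected_score(30, 12, 4))   # 26.0
```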
STEPS IN ITEM ANALYSIS
1. Score the exam and sort the results by score.
2. Rank in order of merit and identify the high and low groups.
Select an equal number of students from each end, e.g. the top 25% (upper 1/4) and
the bottom 25% (lower 1/4).
3. For each item, count the number of students in each group who answered the item
correctly. For alternative-response questions, count the number of
students in each group who chose each alternative.
4. Compare the performance of these two groups on each of the test items.
For any well-written item:
-- a greater proportion of students in the upper group should have selected the correct
answer;
-- a greater proportion of students in the lower group should have selected each of the
distractor (incorrect) answers.
5. Calculate the difficulty index of each question.
Item difficulty index: the percentage of students who get the item correct,
D = (R / N) * 100
R: number of pupils who answered the item correctly
N: total number of students who attempted the item.
Item Difficulty Level or facility level of a test: it is an index of how easy or difficult the
test is from the point of view of the test takers. It is the ratio of the average score of a
sample on the test to the maximum possible score on the test:
Difficulty level = (average score on the test / maximum possible score) * 100
The difficulty index of an item is the percentage of students who answered the item
correctly:
Difficulty index = ((H + L) / N) * 100
H: number of correct answers in the high group
L: number of correct answers in the low group
N: total number of students in both groups.
In the case of objective tests, find the facility value:
Facility value = (number of students answering the question correctly / number of
students who took the test) * 100
Difficulty classification:
High (Difficult): <= 30%
Medium (Moderate): > 30% and < 80%
Low (Easy): >= 80%
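A minimal sketch combining the difficulty-index formula with the classification above
(the counts are hypothetical):

```python
def difficulty_index(r_correct: int, n_total: int) -> float:
    # D = (R / N) * 100: percentage of students answering the item correctly
    return 100.0 * r_correct / n_total

def difficulty_level(d: float) -> str:
    # Classification from the table above
    if d <= 30:
        return "High (Difficult)"
    if d < 80:
        return "Medium (Moderate)"
    return "Low (Easy)"

d = difficulty_index(r_correct=24, n_total=40)
print(f"D = {d:.0f}% -> {difficulty_level(d)}")   # D = 60% -> Medium (Moderate)
```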
6. Item Difficulty Level: Discussion
Is a test that nobody failed too easy?
Is a test on which nobody got 100% too difficult?
Should items that are “too easy” or “too difficult” be thrown out?
7. Estimating Item Discrimination
Generally, students who did well on the exam should select the correct answer to
any given item on the exam.
The Discrimination Index distinguishes, for each item, between the performance of
students who did well on the exam and students who did poorly.
For each item, subtract the number of students in the lower group who answered
correctly (RL) from the number of students in the upper group who answered
correctly (RU), then divide the result by the number of students in one group, N/2:
DI = (RU - RL) / (N / 2)
Formula 2:
DI = (HAQ - LAQ) / HAG
HAQ = number of students in the high-ability group answering the question correctly.
LAQ = number of students in the low-ability group answering the question correctly.
HAG = number of students in the high-ability group.
The Discrimination Index is listed in decimal format and ranges between -1 and 1.
For exams with a normal distribution, a discrimination of 0.3 and above is good; 0.6
and above is very good.
Review decisions by Item Discrimination (D) and Item Difficulty:
                   High      Medium    Low
D <= 0%            Review    Review    Review
0% < D < 30%       Ok        Review    Ok
D >= 30%           Ok        Ok        Ok
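A minimal sketch implementing the table above; the underlying logic is that very hard
or very easy items are not expected to discriminate much, so low positive
discrimination only flags medium-difficulty items for review:

```python
def review_decision(difficulty_pct: float, discrimination_pct: float) -> str:
    """Decision rule from the table above (both inputs in percent)."""
    if discrimination_pct <= 0:
        return "Review"      # negative or zero discrimination: always review
    if discrimination_pct >= 30:
        return "Ok"          # good discrimination at any difficulty
    # Low positive discrimination: acceptable only at the difficulty extremes
    is_medium = 30 < difficulty_pct < 80
    return "Review" if is_medium else "Ok"

print(review_decision(difficulty_pct=55, discrimination_pct=15))  # Review
print(review_decision(difficulty_pct=85, discrimination_pct=15))  # Ok
```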
ADVANTAGES OF ITEM ANALYSIS.
Helps in judging the worth or quality of a test.
Aids in subsequent test revisions.
Increases skill in test construction.
Provides diagnostic value and helps in planning future learning activities.
Provides a basis for discussing test results.
Helps in making decisions about the promotion of students to the next higher grade.
Brings about improvement in teaching methods and techniques.
ITEM REVISION
Items with negative or low positive item discrimination should either
be revised or deleted from the item bank. Distractors whose discrimination is
positive, or negative but too weak given the item difficulty, should be replaced.
For an item to be revised successfully, it is often necessary to have at least
one solid distractor that will not be changed. If all distractors are poor, or none
is particularly strong, delete the item and write a new one. Change only the parts of
the item that caused problems. If an item fails even after revision, it should
be deleted and replaced with a new one.
REVIEW OF LITERATURE
Item analysis for the written test of Taiwanese board certification examination in
anaesthesiology using the Rasch model
AUTHORS: K.-Y. Chang, M.-Y. Tsou
ABSTRACT
Background On the written test of board certification examination for
anaesthesiology, the probability of a question being answered correctly is subject to
two main factors, item difficulty and examinee ability. Thus, item analysis can
provide insight into the appropriateness of a particular test, given the ability of
examinees.
Methods Study subjects were 36 Taiwanese examinees tested with 100 questions
related to anaesthesiology. We used the Rasch model to perform item analysis of
questions answered by each examinee to assess the effects of question difficulty and
examinee ability using a common logit scale. Additionally, we evaluated test
reliability and virtual failure rates under different criteria.
Results The mean examinee ability was higher than the mean item difficulty in this
written test by 1.28 (SD=0.57) logit units, which means that the examinees, on
average, were able to correctly answer 78% of items. The difficulty of items
decreased from 4.25 to −2.43 on the logit scale, corresponding to the probability of
having a correct answer from 5% to 98%. There were 60 items with difficulty lower
than the least able examinee and seven difficult items beyond the most able one.
The agreement of item difficulty between test developers and our Rasch model was
poor (weighted κ=0.23).
Conclusions We demonstrated how to assess the construct validity and reliability of
the written examination in order to provide useful information for future board
certification examinations.
CONCLUSION
Developing, administering, and analyzing a test may seem a monumental task, but a
step-by-step approach simplifies the process. Validating tests using item analysis is
an effective method of assessing outcomes in a classroom. Item analysis is
especially valuable in improving items which will be used again in later tests, but it
can also be used to eliminate ambiguous or misleading items in a single test
administration. In addition, item analysis is valuable for increasing instructors' skills
in test construction, and identifying specific areas of course content which need
greater emphasis or clarity.
BIBLIOGRAPHY
J C Aggarwal. Essentials of Examination System. 1st edition. New Delhi: Vikas
Publishing House Pvt Ltd; 1997. p. 265-72.
J J Guilbert. Handbook for Health Professionals. 6th edition. WHO Offset
Publications; 1992. p. 477-81.
Diane M Billings, Judith A Halstead. Teaching in Nursing: A Guide for Faculty. 1st
edition. USA: W.B. Saunders Company; 1998. p. 396-405.
K P Neeraja. Textbook of Nursing Education. 1st edition. New Delhi: Jaypee
Brothers Medical Publishers (P) Ltd; 2003. p. 415-17.
Arul Jyothi, D L Balaji, Pratiksha Jagran. Curriculum Development. New Delhi: Centum Press.
B Sankaranarayan, B Sindhu. Learning and Teaching Nursing. Kerala, India: Brainfil.
B T Basavanthappa. Nursing Education. New Delhi, India: Jaypee.
Loretta E Heidgerken. Teaching and Learning in Schools of Nursing. 3rd edition. Konark Pvt Ltd.
Vimal G Thakkar. Nursing and Nursing Education. 2nd edition. Mumbai, India: Vora Medical Publications.
http://bja.oxfordjournals.org/content/104/6/717.abstract
http://www.scrolla.hw.ac.uk/focus/ia.html
http://xnet.rrc.mb.ca/tomh/item_analysis.html