Alice CHUANG, MD
Department of Obstetrics and Gynecology
University of North Carolina-Chapel Hill
Chapel Hill, NC
AOE Basic Teaching Skills Curriculum
April 16, 12:00 PM, Bondurant G010
Fundamentals of Assessment and Grading
APGO Clerkship Directors’ School
Neither I nor my spouse has any financial interests to disclose related to this talk.
Understand reliability and validity
Contrast formative and summative evaluation
Compare and contrast norm-referenced and criterion-referenced assessments
Improve delivery of feedback
Understand the NBME exam
Be familiar with different testing formats, their uses and their limitations
Objectives
Validity: Are we measuring what we think we’re measuring?
Content: Does the instrument measure the depth and breadth of the content of the course? Does it inadvertently measure something else?
Construct: Do the evaluation criteria or grading construct allow for true measurement of the knowledge, skills or attitudes taught in the course? Is any part of the grading construct irrelevant?
Criterion: Does the outcome correlate with true competencies? Does it relate to important current or future events? Is the assessment relevant to future performance?
Terminology
http://pareonline.net/getvn.asp?v=7&n=10
Validity
Content: A summative ob/gyn test which covered only obstetrics
Construct: You allow students to use their textbook for a knowledge-based multiple choice test of foundational information on prenatal care.
Criterion: New Coke v. Old Coke
Examples
Reliability: Are our measurements consistent? The score should be the same no matter when the assessment was taken, who scored it, or when it was scored.
Interrater reliability: Is a student’s score consistent between evaluators?
Intrarater reliability: Is a student’s score consistent with the same rater even if rated under different circumstances?
Scoring rubric: standardized method of grading to increase interrater and intrarater reliability
Terminology
http://pareonline.net/getvn.asp?v=7&n=10
In general, if you repeat the same assessment, will you get the same answer?
Interrater: 3 individuals are asked to go to the beach and estimate how many seagulls they see from 6-7 AM, and they come up with 200, 800 and 1200.
Intrarater: A particular food critic always gives low scores for food quality if the server is female.
Examples:
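As a rough illustration of the seagull example, the disagreement between raters can be put into a single number. The sketch below uses the coefficient of variation (sample standard deviation divided by the mean), which is only one simple measure of spread chosen here for illustration; it is an assumption of this example, not a statistic named in the talk. The second set of counts is hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(estimates):
    """Sample standard deviation divided by the mean (unitless spread)."""
    return stdev(estimates) / mean(estimates)


# The three beach observers from the example above.
seagull_counts = [200, 800, 1200]

# Hypothetical counts from three raters who largely agree.
consistent_counts = [790, 800, 810]

print(round(coefficient_of_variation(seagull_counts), 2))
print(round(coefficient_of_variation(consistent_counts), 4))
```

A value near zero means the raters report nearly the same count; the seagull estimates give a value of roughly 0.69, which signals poor interrater consistency.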
Scale: Poor Candidate = 0 points; Fair Candidate = 1 point; Good Candidate = 2 points; Superior Candidate = 3 points

Singing Skills
  Poor (0): Sings with as much expression as a wet noodle; cannot identify which tune candidate is singing; also cannot identify what the lyrics of the song are, secondary to poor pronunciation
  Fair (1): Minimally expressive, pitch off significantly on occasion, diction unclear at times
  Good (2): Very expressive, sings on pitch most of the time with minor errors, diction clear most of the time
  Superior (3): Artistically expressive, sings on pitch, diction clear

Dancing Skills
  Poor (0): Has 2 left feet, unable to learn new steps and continues to dance like MC Hammer despite different choreography demonstrated
  Fair (1): Missteps despite multiple attempts, no artistic expression in dance moves, unable to learn new choreography after 3 demonstrations
  Good (2): Occasionally missteps, but overall dance steps are accurate, adapts choreography fairly rapidly
  Superior (3): Quick and nimble, dances artistically, able to learn new choreography quickly

Enthusiasm for Show Choir
  Poor (0): Freely admits not knowing what GLEE is
  Fair (1): Endorses enjoyment of GLEE, but unable to identify favorite character
  Good (2): Has watched 70% of GLEE episodes
  Superior (3): Has seen every episode of GLEE, all GLEE albums confirmed in iTunes library, has been to GLEE LIVE each summer
Examples: Show Choir Audition Rubric
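A rubric like the one above can be treated as a small lookup table, which makes it easy to total a candidate's score and to check how often two raters chose the same level. This is a minimal sketch; the category keys, function names, and the two sets of ratings are hypothetical and only mirror the rubric's three rows and four levels.

```python
# Point values for the four rubric levels (from the rubric above).
RUBRIC_LEVELS = {"poor": 0, "fair": 1, "good": 2, "superior": 3}

# Hypothetical short keys for the three rubric rows.
CATEGORIES = ("singing", "dancing", "enthusiasm")


def total_score(ratings):
    """Sum the rubric points for one rater's ratings of one candidate."""
    missing = [c for c in CATEGORIES if c not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    return sum(RUBRIC_LEVELS[ratings[c]] for c in CATEGORIES)


def exact_agreement(rater_a, rater_b):
    """Fraction of categories on which two raters chose the same level."""
    matches = sum(rater_a[c] == rater_b[c] for c in CATEGORIES)
    return matches / len(CATEGORIES)


# Hypothetical ratings of the same audition by two raters.
rater_1 = {"singing": "good", "dancing": "fair", "enthusiasm": "superior"}
rater_2 = {"singing": "good", "dancing": "good", "enthusiasm": "superior"}

print(total_score(rater_1), total_score(rater_2))
print(exact_agreement(rater_1, rater_2))
```

Exact agreement is a deliberately simple check; it does not correct for agreement that would occur by chance.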
Formative: on-going assessment, designed to help improve educational program as well as learner progress
Summative: designed to evaluate student overall performance at end of educational phase and evaluate effectiveness of teaching
Formative v. summative assessments
http://fcit.usf.edu/assessment/basic/basica.html
Formative: short multiple choice exam written in house that is pass/fail; answers are reviewed with class at end of testing session
Summative: NBME exam
Examples
ED30: The directors of all courses and clerkships must design and implement a system of formative and summative evaluation of student achievement in each course and clerkship.
Those responsible for the evaluation of student performance should understand the uses and limitations of various test formats, the purposes and benefits of criterion-referenced vs. norm-referenced grading, reliability and validity issues, formative vs. summative assessment, etc.
Formative v. summative assessments
ED31: Each student should be evaluated early enough during a unit of study to allow time for remediation
ED32: Narrative descriptions of student performance and of non-cognitive achievement should be included as part of evaluations in all required courses and clerkships where teacher-student interaction permits this form of assessment.
Formative v. summative assessments
Formative v. summative assessments
Uses for assessments
Characteristic | Formative | Summative
Purpose | Feedback for learning | Certification/Grading
Breadth of scope | Narrow focus on specific objectives | Broad focus on general goals
Scoring | Explicit feedback | Overall performance
Learner affective response | Little anxiety | Moderate to high anxiety
Target audience | Learner | Society
Characteristics of feedback
Effective Feedback:
• given with the goal of improvement
• timely
• honest
• respectful
• clear
• issue-specific
• objective
• supportive
• motivating
• action-oriented
• solution-oriented
Destructive Feedback:
• unhelpful
• accusatory
• personal
• judgmental
• subjective
It also:
• undermines the self-esteem of the receiver
• leaves the issue unresolved
• leaves the receiver unsure how to proceed
http://www.expressyourselftosuccess.com/the-importance-of-providing-constructive-feedback/
When you…
You give the impression…
I would stop…
I would recommend… instead
Feedback…from APGO/CREOG 2011
Norm-referenced
Purpose is to classify students in order of achievement from low to high
Allows comparisons of students
May not give accurate information regarding student abilities
Half of the students should score above the midpoint score and the other half should score below it
Norm-referenced v. criterion- referenced assessments
Ricketts C. A plea for the proper use of criterion-referenced tests in medical assessment. Med Educ, Vol 43, Issue 12.
Criterion-referenced
Purpose is to evaluate students’ knowledge and skills compared to a pre-determined goal performance level
Gives information about a student’s achievement of certain objectives
Should be possible for everyone to earn a passing score
Norm-referenced v. criterion- referenced assessments
Ricketts C. A plea for the proper use of criterion-referenced tests in medical assessment. Med Educ, Vol 43, Issue 12.
Norm-referenced: Soccer tryouts where 11 players are chosen out of 40
Criterion-referenced: Test for driver’s license
Example
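The difference between the two examples can be sketched in a few lines: norm-referenced selection ranks candidates and keeps a fixed number, while criterion-referenced grading passes everyone who meets a pre-set standard. The names, scores, team size, and passing score below are all hypothetical.

```python
def norm_referenced_selection(scores, n_selected):
    """Rank candidates by score and keep the top n, whatever their scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_selected])


def criterion_referenced_pass(scores, passing_score):
    """Pass every candidate who meets the pre-determined standard."""
    return {name for name, score in scores.items() if score >= passing_score}


# Hypothetical scores for four candidates.
scores = {"Avery": 95, "Blake": 90, "Casey": 85, "Devon": 60}

# Tryout-style: only two spots, so Casey is cut despite a strong score.
print(sorted(norm_referenced_selection(scores, n_selected=2)))

# License-style: everyone at or above 80 passes, however many that is.
print(sorted(criterion_referenced_pass(scores, passing_score=80)))
```

Under the criterion-referenced rule it is possible for every candidate to pass, or for none to; under the norm-referenced rule the number selected is fixed in advance.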
Be sure your assessment is appropriately norm-referenced or criterion-referenced.
Be sure that your assessment is designed with this in mind.
Most assessments in medical education are criterion-referenced.
Norm-referenced tests should emphasize variability; criterion-referenced tests should emphasize accuracy of tested material.
Norm-referenced v. criterion- referenced assessments
Exams
  Developed by committees and content experts
  Same protocol used to build Step 1 and Step 2
In general
  Subject exams provided to:
    all 130 LCME-accredited medical schools in the US
    8 Canadian medical schools
    8 osteopathic medical schools
    22 international medical schools
NBME
Scaled to have a mean of 70 and SD of 8, based on 9000 first-time test takers from 80+ schools who took the exam as an end-of-clerkship exam in 1993-94
Scores do not reflect percentage of questions answered correctly.
NBME
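One common way to place scores on a scale with a chosen mean and SD is a linear transformation of z-scores computed against a reference group. The sketch below assumes that approach purely for illustration; the talk does not state the exact procedure the NBME uses, and the raw-score mean and SD of the reference group shown here are hypothetical.

```python
def scale_score(raw, ref_mean, ref_sd, target_mean=70.0, target_sd=8.0):
    """Map a raw score onto a scale with the given target mean and SD.

    Assumes a simple linear (z-score) transformation against a reference
    group; this is an illustrative assumption, not the NBME's documented
    method.
    """
    z = (raw - ref_mean) / ref_sd
    return target_mean + target_sd * z


# Hypothetical reference group: raw mean 60 items correct, raw SD 10.
print(scale_score(60, ref_mean=60, ref_sd=10))  # at the reference mean
print(scale_score(70, ref_mean=60, ref_sd=10))  # one SD above the mean
print(scale_score(45, ref_mean=60, ref_sd=10))  # 1.5 SD below the mean
```

Under this kind of scaling a reported score of 78 means one SD above the reference-group mean, which is why scaled scores cannot be read as the percentage of questions answered correctly.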
A score of 60 in the fourth quarter means that 2% of the examinees in the fourth quarter scored 60 or below!
NBME: What do those scores mean?
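Percentile ranks by score, 2011-2012
Score | Total year | Q1 | Q2 | Q3 | Q4
93 or above | 98 | 99 | 98 | 97 | 97
92 | 97 | 98 | 98 | 97 | 96
86 | 90 | 93 | 91 | 89 | 88
80 | 75 | 80 | 77 | 73 | 71
78 | 67 | 71 | 69 | 63 | 62
74 | 49 | 54 | 51 | 45 | 44
70 | 29 | 33 | 32 | 26 | 25
62 | 6 | 7 | 6 | 5 | 4
60 | 3 | 4 | 4 | 3 | 2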
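The same score corresponds to a different percentile rank depending on when in the year the exam was taken. A minimal lookup sketch follows, using only the values listed in the table above; the dictionary layout and function name are hypothetical, and scores not listed in the table return `None` rather than an interpolated guess.

```python
# Percent of examinees scoring at or below each listed score, 2011-2012,
# copied from the table above. The key 93 stands for "93 or above".
PERCENTILES = {
    93: {"total": 98, "Q1": 99, "Q2": 98, "Q3": 97, "Q4": 97},
    92: {"total": 97, "Q1": 98, "Q2": 98, "Q3": 97, "Q4": 96},
    86: {"total": 90, "Q1": 93, "Q2": 91, "Q3": 89, "Q4": 88},
    80: {"total": 75, "Q1": 80, "Q2": 77, "Q3": 73, "Q4": 71},
    78: {"total": 67, "Q1": 71, "Q2": 69, "Q3": 63, "Q4": 62},
    74: {"total": 49, "Q1": 54, "Q2": 51, "Q3": 45, "Q4": 44},
    70: {"total": 29, "Q1": 33, "Q2": 32, "Q3": 26, "Q4": 25},
    62: {"total": 6, "Q1": 7, "Q2": 6, "Q3": 5, "Q4": 4},
    60: {"total": 3, "Q1": 4, "Q2": 4, "Q3": 3, "Q4": 2},
}


def percentile_rank(score, period="total"):
    """Return the listed percentile rank, or None if the score is not listed."""
    key = 93 if score >= 93 else score
    row = PERCENTILES.get(key)
    return None if row is None else row[period]


# A score of 70 ranks higher early in the year than late in the year.
print(percentile_rank(70, "Q1"), percentile_rank(70, "Q4"))
```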
NBME: Academic purpose for exam
%
Advanced placement 5
Course/clerkship 95
Year-end 12
Make-up 21
Minimal competence 44
Identify at risk students 23
Practice for USMLE 47
Promotion requirement 37
Review course 1
Student self-assessment 26
Other 4
Total responses: 78
NBME: Weight given the subject exam
Weight given the subject exam | %
1-10% | 4
11-20% | 16
21-30% | 33
31-40% | 39
41-50% | 13
>50% | 0
Total number responding 70
NBME 2008 Clerkship Survey Results
Assessment/Evaluation Method | Ob/gyn (%)
Computer Case Simulations | 0.5
Subject Exam | 30
School’s MCQ Exam | 9
Observation and evaluation by residents | 28
Observation and evaluation by faculty | 26
Oral exam | 14
OSCE | 12
Peer evaluation | 1
Standardized patient exam | 3
Other | 18
Total number responding | 81
2004 and 2009 surveys of performance guidelines across clerkships
Recommend setting an absolute rather than a relative standard for performance
Angoff procedure: item-based; judges estimate the proportion of minimally proficient examinees who would answer each question correctly
Hofstee method: judges determine minimum and maximum scores for passing and minimum and maximum percentages of failures; these are then plotted on a graph of exam score against failure rate
NBME
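Both standard-setting methods named above reduce to short calculations. The sketch below is a simplified illustration under stated assumptions: the Angoff cut score is taken as the sum of the judges' mean per-item estimates, and the Hofstee cut score is found by scanning whole-number cut scores for the point where the observed failure rate comes closest to the judges' line running from (minimum cut, maximum failure rate) to (maximum cut, minimum failure rate). All function names, judge estimates, and exam scores are hypothetical.

```python
def angoff_cut_score(judge_estimates):
    """Sum of mean per-item estimates across judges.

    judge_estimates[j][i] is judge j's estimated probability that a
    minimally proficient examinee answers item i correctly.
    """
    n_judges = len(judge_estimates)
    n_items = len(judge_estimates[0])
    item_means = [
        sum(judge[i] for judge in judge_estimates) / n_judges
        for i in range(n_items)
    ]
    return sum(item_means)


def hofstee_cut_score(scores, min_cut, max_cut, min_fail, max_fail):
    """Cut score where the observed failure rate best matches the judges' line.

    Requires max_cut > min_cut. An examinee fails when score < cut.
    """
    best_gap, best_cut = None, None
    for cut in range(min_cut, max_cut + 1):
        observed = sum(s < cut for s in scores) / len(scores)
        fraction = (cut - min_cut) / (max_cut - min_cut)
        line = max_fail + (min_fail - max_fail) * fraction
        gap = abs(observed - line)
        if best_gap is None or gap < best_gap:
            best_gap, best_cut = gap, cut
    return best_cut


# Two hypothetical judges rating a three-item exam.
print(angoff_cut_score([[0.5, 0.75, 1.0], [0.5, 0.25, 1.0]]))

# Hypothetical class of 50 examinees scoring 50 through 99.
class_scores = list(range(50, 100))
print(hofstee_cut_score(class_scores, min_cut=55, max_cut=75,
                        min_fail=0.10, max_fail=0.50))
```

The Angoff result is an absolute standard set before looking at examinee results, while the Hofstee result is a compromise that also takes the observed score distribution into account.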
Multiple choice exam (MCQ)
Objective structured clinical examination (OSCE)
Oral examination
Direct observation
Simulation
Standardized patient
Patient/procedure log
Medical record reviews
Written essay questions
Testing Formats
Casey et al, To the point: reviews in medical education – the Objective Structured Clinical Examination. AJOG, Jan 2009.
Use distractors which could plausibly represent correct answer
Use a question format, not complete-the-statement format
Emphasize higher-level thinking, not strict memorization
Keep option length consistent within a question
Balance the placement of the correct answer
Use correct grammar
Avoid clues to the correct answer
Highly reliable and valid for assessing knowledge
Testing format: MCQ
http://testing.byu.edu/info/handbooks/14%20Rules%20for%20Writing%20Multiple-Choice%20Questions.pdf
Examinees rotate through circuit of stations (5-10 minutes each)
One-on-one examination (with examiner or trained or simulated patient)
List of criteria for successful completion of each station
Each station tests a specific skill or competency
Good for examining higher-order skills, clinical and technical skills
Requires large amount of resources
Testing format: OSCE
Portfolio based: similar to case-based portion of Oral Boards
Poor inter-rater and intra-rater reliability
Scores higher when scored live versus on video
Teaching students how to do better on oral exam does not improve scores
Practicing oral exams does improve scores
Mock public oral exam improves performance
Limitations
  Halo effect (grade reflects not only performance on exam but also previous experience)
  Subconscious consensus grading: examiners take subconscious cues from each other.
Testing format: Oral Exam
Burch & Seggie, 2008; Kearney et al, 2002; Burchard et al, 2007; Jacobsohn et al, 2006
Is an oral exam justified? Is there an advantage?
Does the material lend itself to open questioning?
How will communication skills and delivery of information be graded? Will only content be graded?
Is the examiner experienced? Will he/she skew grades in any way?
How will you prepare students for the exam?
Is there enough time to examine every student adequately?
How much prompting/assistance is allowed for oral examination?
How much time will you allow for “thinking?”
How will you ensure consistency in these areas for all examinees?
Testing format: Oral Exam
http://testing.byu.edu/info/handbooks/14%20Rules%20for%20Writing%20Multiple-Choice%20Questions.pdf
Formalized criteria
Various observers
True-to-life clinical setting (versus simulated)
Numerical scores
Comment anchored
Improve reliability with multiple perspectives
Consider 360 evaluation (including self, patient and other staff members)
Testing format: Direct observation
Testing format
Characteristic | MCQ | OSCE | Direct obs | Oral exam
Content | +++ | ++ | + | +
Construct | +++ | ++ | + | +
Criterion | + | ++ | + | +
Reliability | +++ | ++ | + | +
Formative | Y | Y | Y | Y
Summative | Y | Y | Y | Y
Norm-referenced | Y | N | N | N
Criterion-referenced | Y | Y | Y | Y
Be sure your assessment
  Provides reliable data
  Provides valid data
  Provides valuable data
  Is feasible
  Can be incorporated into the systems in place (hospital, clinic, curriculum, etc.)
  Is consistent with course objectives
  Utilizes multiple instruments, multiple assessors and multiple points of assessment
  Aligns with pre-specified criteria
  Is fair
General rules of thumb
Lynch and Swing. Key Considerations for Selecting Assessment Instruments and Implementing Assessment Systems. ACGME.
Bond, Linda A. (1996). Norm- and criterion-referenced testing. Practical Assessment, Research & Evaluation, 5(2). Accessed at http://pareonline.net/getvn.asp?v=5&n=2
Burch VC, Seggie JL. Use of a structured interview to assess portfolio-based learning. Med Ed 2008: 42: 894-900.
Burchard K et al. Is it live or is it Memorex? Student oral examinations and the use of video for additional scoring. Am J Surg. 193 (2007), 233-236.
Casey et al, To the point: reviews in medical education – the Objective Structured Clinical Examination. AJOG, Jan 2009.
Jacobsohn E, Kock PA, Avidan M. Poor inter-rater reliability on mock anesthesia oral examinations.
Kearney RA et al. The inter-rater and intra-rater reliability of a new Canadian oral examination format in anesthesia is fair to good. Can J Anesth 2002; 49:3, 232-236.
Lynch and Swing. Key Considerations for Selecting Assessment Instruments and Implementing Assessment Systems. ACGME.
Metheny WP, Espey EL, Bienstock J, et al. To the point: Medical education reviews evaluation in context: Assessing learners, teachers, and training programs. Am J Obstet Gynecol. 2005;192(1):34-37.
Moskal, Barbara M. & Jon A. Leydens (2000). Scoring rubric development: validity and reliability. Practical Assessment, Research & Evaluation, 7(10). Retrieved December 29, 2009 from http://PAREonline.net/getvn.asp?v=7&n=10
Ricketts C. A plea for the proper use of criterion-referenced tests in medical assessment. Med Educ, Vol 43, Issue 12.
References
14 Rules for Writing Multiple Choice Questions. Brigham Young University 2001 Annual Conference. Accessed at http://testing.byu.edu/info/handbooks/14%20Rules%20for%20Writing%20Multiple-Choice%20Questions.pdf
Formative vs. Summative Assessments. Classroom Assessment. Accessed at: http://fcit.usf.edu/assessment/basic/basica.html
NBME 2008 Clinical Clerkship Director Survey Results. Accessed at https://portal.nbme.org/web/medschools/home?p_p_id=62_INSTANCE_dOGM&p_p_action=0&p_p_state=maximized&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=1&_62_INSTANCE_dOGM_struts_action=%2Fjournal_articles%2Fview&_62_INSTANCE_dOGM_keywords=&_62_INSTANCE_dOGM_advancedSearch=false&_62_INSTANCE_dOGM_andOperator=true&_62_INSTANCE_dOGM_groupId=1172&_62_INSTANCE_dOGM_searchArticleId=&_62_INSTANCE_dOGM_version=1.0&_62_INSTANCE_dOGM_name=&_62_INSTANCE_dOGM_description=&_62_INSTANCE_dOGM_content=&_62_INSTANCE_dOGM_type=&_62_INSTANCE_dOGM_structureId=&_62_INSTANCE_dOGM_templateId=&_62_INSTANCE_dOGM_status=approved&_62_INSTANCE_dOGM_articleId=817480
Objective Structured Clinical Examination. Wikipedia. Accessed at http://en.wikipedia.org/wiki/Objective_structured_clinical_examination
Reliability and Validity. Classroom Assessment. Accessed at: http://fcit.usf.edu/assessment/basic/basicc.html
Talk about teaching: Significant issues in Oral Examinations. Contributed by Meryl Carlson, Concordia College, Moorhead, MN. Accessed at http://www.cord.edu/faculty/ulnessd/oral/MCarlson/questions.html
References