How Small Differences in Assessed Clinical Performance Amplify to Large Differences in Grades and Awards: A Cascade With Serious Consequences for Students Underrepresented in Medicine

Arianne Teherani, PhD, Karen E. Hauer, MD, PhD, Alicia Fernandez, MD, Talmadge E. King Jr, MD, and Catherine Lucey, MD

Acad Med. 2018;93:1286–1292. First published online June 19, 2018. doi: 10.1097/ACM.0000000000002323

Academic Medicine, Vol. 93, No. 9 / September 2018

Please see the end of this article for information about the authors.

Correspondence should be addressed to Arianne Teherani, Office of Medical Education, University of California, San Francisco, School of Medicine, 1855 Folsom St., Suite 200, San Francisco, CA 94143; telephone: (415) 509-4988; e-mail: Teherani@ucsf.edu; Twitter: @arianneteherani.

Copyright © 2018 by the Association of American Medical Colleges. Unauthorized reproduction of this article is prohibited.

Supplemental digital content for this article is available at http://links.lww.com/ACADMED/A568.

Abstract

While students entering medical schools are becoming more diverse, trainees in residency programs in competitive specialties and academic medicine faculty have not increased in diversity. As part of an educational continuous quality improvement process at the University of California, San Francisco, School of Medicine, the authors examined data for the classes of 2013–2016 to determine whether differences existed between underrepresented in medicine (UIM) and not-UIM students’ clinical performance (clerkship director ratings and number of clerkship honors grades awarded) and honor society membership—all of which influence residency selection and academic career choices. This analysis demonstrated differences that consistently favored not-UIM students. Whereas the size and magnitude of differences in clerkship director ratings were small, UIM students received approximately half as many honors grades as not-UIM students and were three times less likely to be selected for honor society membership. The authors use these findings to illustrate the amplification cascade, a phenomenon in which small differences in assessed performance lead to larger differences in grades and selection for awards. The amplification cascade raises concerns about opportunities for UIM students to compete successfully for competitive residency programs and potentially enter academic careers. Using a fishbone diagram, a continuous quality improvement root cause analysis tool, the authors contextualize their institutional results. They describe potential causes of group differences, drawing from the education disparities literature, and propose interventions and future research. They also share countermeasures adopted at their institution and encourage other medical schools to consider similar exploration of their institutional data.

Perspective

Increasing racial and ethnic diversity in the U.S. physician workforce is a national priority.1 Improvement in workforce diversity enhances access to health care, reduces health care disparities, and improves quality of care for underserved populations.2 Yet progress toward achieving a diverse medical workforce has been slow. Racial and ethnic groups underrepresented in medicine (UIM) include African Americans and Latinos, each constituting 4% of the physician population, and Native Americans and Alaska Natives, constituting 0.4% of the physician population, compared with 13%, 17%, and 2%, respectively, of the general population.1,3,4 To address the need for a more diverse workforce, medical schools have embraced initiatives such as pipeline programs and holistic admissions processes, which balance consideration of academic and nonacademic predictors of success in medical school and beyond.5,6 These initiatives have been successful at increasing the matriculation rate of UIM students.7 Unfortunately, while medical schools are becoming more diverse, residency programs in competitive specialties have not shown a similar increase in the diversity of trainees.8,9 Additionally, the diversity of faculty in academic medical centers has remained low.10 In essence, while the pipeline of UIM matriculants to medical school is expanding, the pipeline of UIM physicians from medical school to competitive residencies and careers in academic medicine is leaking.

Some UIM students choose to pursue residencies in primary care and careers practicing medicine outside academic medical centers. However, researchers have raised concerns that the cause of this leaking pipeline is differential opportunities, created by medical school assessments, for UIM students compared with their peers from groups not underrepresented in medicine (not-UIM; i.e., from racial and ethnic groups that have comparable or greater numbers in the physician population compared with their numbers in the general population).11 In a survey of residency program directors about metrics used to select applicants for interviews and to rank candidates, respondents demonstrated a preference for those with high scores on standardized examinations.12 This preference exists despite inconsistent evidence about the ability of exams to predict future success as a physician.13 Further, higher United States Medical Licensing Examination (USMLE) Step 1 scores are related to careers in academic medicine, which may partially explain the limited number of minority physicians entering faculty career paths.14 UIM students consistently receive lower scores on major standardized gateway exams, outside and within the medical education pathway, compared with not-UIM students.15–17 Experts believe that these group differences represent the long-term consequences of structural racism, which have created inequities in education, housing, economic, and social opportunities.16,18 Thus, assessment or residency selection strategies that rely on standardized exam results systematically disadvantage UIM students, which may make it more challenging for them to earn entry into competitive specialties and residency programs and, later, faculty careers.

Residency program directors also reported relying on the number of honors grades earned in medical school clerkships and membership in the Alpha Omega Alpha Honor Medical Society (AOA) to select future residents.12 Recent literature documents group differences between UIM and not-UIM students in AOA membership that favor not-UIM, white students, independent of exam scores and honors grades.19 Hence, the post-medical-school pipeline for diversifying residency programs and faculty may be affected by differential opportunities to earn honors grades and become eligible for AOA membership. Yet, to the best of our knowledge, the published literature does not document this type of exploration or discuss the potential causes of observed differences and possible interventions.

In this article, we describe our use of educational continuous quality improvement (ECQI) to evaluate the relationship between medical students’ UIM status and clinical performance (clerkship director ratings and the number of clerkship honors grades awarded) and AOA membership. We used a root cause analysis ECQI framework designed to answer the following three-part question: What happened, why did it happen, and what can be done to prevent it from happening again?20 We describe what happened by presenting our ECQI method and results. To contextualize our findings, we discuss why it happened using a fishbone diagram,21 a root cause analysis tool. We also describe the potential causes of the observed differences, drawing from the literature on educational disparities. Finally, we share ideas on how to prevent inequities by describing possible interventions. Our intent is to prompt others to consider similar exploration of how their institutional procedures may perpetuate existing inequities. This work will, we hope, contribute to a nationwide conversation and further a research agenda on how medical schools can initiate change to equalize opportunities for all learners.

ECQI: Setting and Data Analysis

The University of California, San Francisco, School of Medicine (UCSF) is a research-intensive, urban public medical school. For this ECQI analysis, in 2016 we examined data for all medical students who matriculated at UCSF from 2009 through 2012 (graduating classes of 2013–2016). During the analysis period, UCSF employed holistic review of applications. This holistic review process did not use cutoff scores for grade point averages or Medical College Admission Test (MCAT) scores; instead, committees composed of faculty and students decided which applicants to interview and select on the basis of a balanced evaluation of academic performance, extracurricular activities, personal statements, letters of recommendation, and, finally, interview scores.

A total of 670 UCSF students were included in the analysis. Overall, 360 (53.7%) were women and 177 (26.4%) were UIM. We defined UIM status as described in the Association of American Medical Colleges’ Medical Minority Applicant Registry: “member of a racial or ethnic group historically underrepresented in medicine—African American/Black, Hispanic/Latino, American Indian/Alaska Native or Native Hawaiian/Pacific Islander.”22 UCSF designates applicants as UIM if they select any of the above race and ethnicity categorical values on their medical school application. If the applicant does not select any of these categorical values (e.g., selects another value such as white or Asian), the applicant is considered not-UIM. During the study period, the percentages of male and female medical students enrolled at UCSF were similar in the UIM and not-UIM groups. (Detailed demographics are available in Supplemental Digital Appendix 1 at http://links.lww.com/ACADMED/A568.) UIM students were much more likely than not-UIM students to be the first in their family to attend college.

We compared UIM and not-UIM students’ performance on the following outcome measures:

- high-stakes multiple-choice question (HSMCQ) exams: mean composite score of the MCAT exam’s Physical Sciences, Biological Sciences, and Verbal Reasoning section scores; first-attempt score on the USMLE Step 1 exam (taken before clerkships); and score on the USMLE Step 2 Clinical Knowledge (CK) exam (taken after clerkships);
- clinical performance: clerkship director ratings in eight domains across seven required clerkships and average number of honors grades received across the seven clerkships; and
- AOA status: eligibility and selection.

At the time of the study, per school policy, a maximum of 25% to 30% of students per clerkship were assigned an honors grade. Clerkship grades were awarded based primarily on faculty and resident ratings of students and to a smaller extent on clerkship examination scores. AOA eligibility was defined based on number of weeks of honors in required clerkships. A committee of faculty and residents reviewed the AOA-eligible group and selected up to one-sixth of the class for AOA membership based on service, humanism, and overall contributions to UCSF.

Our main objective was to investigate clinical performance and AOA selection, but we began by examining HSMCQ exam scores to determine whether our UIM students’ performance was similar to what was reported in the literature and to contextualize students’ clinical performance within their overall pre-medical-school and preclinical performance. Similar to existing evidence, we identified differences in HSMCQ exam scores by UIM status at our institution. Thus, when we compared clinical performance and AOA status by UIM status, we conducted two analyses. In the second, we adjusted for USMLE Step 1 score (covariate) because it reflected preclerkship knowledge assessment.23 All data were available from the UCSF Educational Data unit, who merged and deidentified data in preparation for our analysis. The UCSF Committee for Human Research Institutional Review Board approved the research protocol as exempt.
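The covariate adjustment described above can be sketched with a small illustration. The data and model below are entirely hypothetical (they are not UCSF records, and the authors' actual analysis presumably used standard statistical software); the sketch only shows the mechanics of comparing an unadjusted group coefficient with one adjusted for a Step 1 covariate, using plain least squares with no external libraries.

```python
# Hypothetical sketch of covariate adjustment: regress a clerkship
# rating on UIM status alone, then again with a USMLE Step 1 z-score
# added as a covariate, and compare the UIM coefficient.

def ols(X, y):
    """Ordinary least squares: solve (X'X) b = X'y by Gaussian elimination."""
    n, k = len(X), len(X[0])
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         for i in range(k)]
    b = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    for col in range(k):  # forward elimination with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * k
    for r in range(k - 1, -1, -1):  # back substitution
        coef[r] = (b[r] - sum(A[r][c] * coef[c]
                              for c in range(r + 1, k))) / A[r][r]
    return coef

# Invented students: (uim indicator, Step 1 z-score, rating). Ratings
# here are driven entirely by the Step 1 score, so adjustment should
# remove the apparent UIM gap.
data = [(1, -0.5, 3.85), (1, -0.2, 3.94), (1, 0.0, 4.00),
        (0, 0.0, 4.00), (0, 0.4, 4.12), (0, 0.8, 4.24)]

X1 = [[1, u] for u, _, _ in data]      # intercept + UIM only
X2 = [[1, u, s] for u, s, _ in data]   # intercept + UIM + Step 1
y = [r for _, _, r in data]

print("unadjusted UIM coefficient:", round(ols(X1, y)[1], 3))  # -0.19
print("adjusted UIM coefficient:", round(ols(X2, y)[1], 3))    # collapses to ~0
```

In this toy dataset the unadjusted model shows a 0.19-point UIM gap, while the adjusted coefficient is essentially zero, mirroring the pattern the authors report: group differences in ratings that are no longer significant once Step 1 scores are controlled for.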

Below, we highlight our significant findings and describe the strength of those findings as computed by effect size calculations.

ECQI: Comparison of UIM and Not-UIM Student Performance

On each of the HSMCQ exams, the overall mean score was lower for UIM students than for not-UIM students.


The effect sizes for the differences between UIM and not-UIM students were large for MCAT scores and medium for USMLE Step 1 and Step 2 CK scores. (Detailed results are available in Supplemental Digital Appendix 2 at http://links.lww.com/ACADMED/A568).

On average, compared with not-UIM students, UIM students received slightly lower clerkship director ratings (one-tenth of a point on the clerkship assessment scale) and, subsequently, half as many honors grades across all clerkships. Whereas the magnitude of the relationship between UIM status and clerkship director ratings was small, the magnitude of the relationship between UIM status and clerkship honors grades was medium.

When we adjusted the analysis by controlling for USMLE Step 1 scores, the relationships between UIM status and clerkship director ratings and honors grades were no longer significant. However, the relationships between USMLE Step 1 scores and clerkship director ratings and honors grades were significant, and their magnitude was medium. (Detailed results are available in Supplemental Digital Appendix 3 at http://links.lww.com/ACADMED/A568.)

Following the UCSF policies and AOA criteria detailed above, not-UIM students were approximately four times more likely to be eligible for AOA compared with UIM students. Not-UIM students were approximately three times more likely to be selected for AOA compared with UIM students.

The Amplification Cascade

Our single-institution examination of the association of UIM status and clerkship director ratings, honors grades, and AOA selection demonstrated differences that consistently favored not-UIM students. We observed that group differences narrowed between MCAT scores and subsequent USMLE Step 1 and Step 2 CK scores. Furthermore, group differences between performance on HSMCQ exams as a whole and clerkship director ratings narrowed even more. We believe this supports the idea that HSMCQ exam results tend to represent the long-term consequences of structural inequities in education. These favorable summative outcomes demonstrate that students with a wide range of backgrounds and exam scores can succeed at a research-intensive medical school, and they support UCSF’s use of holistic admissions.

UIM status accounted for a small amount of the unadjusted (for USMLE Step 1 score) difference in clerkship director ratings, and the magnitude of the difference was small. Unfortunately, this difference was amplified by institutional grading policies and AOA eligibility criteria. As a consequence, UIM students received approximately half as many honors grades across all clerkships compared with not-UIM students and were three times less likely to be selected for AOA membership. This amplification cascade, in which small differences in assessed performance lead to larger differences in grades and selection for awards, raises questions about policies for grading at UCSF and those driving AOA eligibility and selection nationally.19 The downstream consequences of differences in honors grades and AOA selection on opportunities for UIM students to compete successfully for competitive residency programs and potentially enter academic medicine careers are concerning.24
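The arithmetic behind such amplification can be illustrated with a toy model (all numbers hypothetical, not UCSF data): when two groups' ratings differ by only a small fraction of the rating scale's spread, a normative top-25% honors cutoff converts that small difference into a markedly larger gap in honors rates.

```python
from statistics import NormalDist

# Toy amplification-cascade illustration (hypothetical numbers).
# Ratings for two groups are modeled as normal distributions whose
# means differ by 0.1 points on a scale with standard deviation 0.3.
scale_sd = 0.3
group_a = NormalDist(mu=4.0, sigma=scale_sd)  # reference group
group_b = NormalDist(mu=3.9, sigma=scale_sd)  # rated 0.1 points lower

# Under normative grading, honors goes to roughly the top 25% of the
# class; approximate the cutoff from the reference group's distribution.
cutoff = group_a.inv_cdf(0.75)

honors_a = 1 - group_a.cdf(cutoff)  # fraction of group A above cutoff
honors_b = 1 - group_b.cdf(cutoff)  # fraction of group B above cutoff

print(f"honors rate, group A: {honors_a:.1%}")  # 25.0%
print(f"honors rate, group B: {honors_b:.1%}")
print(f"ratio: {honors_a / honors_b:.2f}")
```

Under these assumed numbers, a one-tenth-of-a-point rating difference (an effect of about a third of a standard deviation) leaves group B with an honors rate around 16%, roughly 1.6 times lower than group A's 25%, before any further amplification through AOA eligibility thresholds.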

Differences between UIM and not-UIM students in clerkship director ratings and honors grades were no longer observed after we controlled for USMLE Step 1 scores, suggesting that student attributes linked to performance on HSMCQ exams are responsible for the differences in clinical performance assessment. These attributes may include medical knowledge, but they may also include other characteristics that predict higher exam scores, such as socioeconomic status and educational advantage.25,26 Socioeconomic status plays a vital role in student performance.25,27 However, the impact of the interplay between the closely related constructs of socioeconomic status and UIM status in medical education remains unclear.28,29 Further examination of currently used measures of socioeconomic status and the coalescence of socioeconomic status and UIM status should be at the forefront of future research on race/ethnicity and medical education.

Below, we return to our ECQI framework to address the two remaining questions in the root cause analysis: Why did it happen, and what can be done to prevent it from happening again?20 We discuss the potential evidence-based reasons for our findings in the next section on causes and consequences and then consider how to prevent future occurrences in the subsequent section on countermeasures.

Causes and Consequences of Differential Attainment

Differential attainment is a concept that encompasses how educational outcomes vary for different demographic groups assessed in the same way.28,30 In exploring possible causes of the group differences in assessed clinical performance that initiated the amplification cascade at UCSF, we employed a continuous quality improvement tool known as a fishbone diagram21 to illustrate the potential causes, effects, and consequences of individual (student or faculty/resident rater), interpersonal (student and faculty/resident rater), and cultural and structural factors that may have an impact on the student’s performance and/or the accuracy of the assessment of the student’s performance. The diagram (Figure 1) recognizes that all students, regardless of UIM status, may be subject to factors that negatively affect their performance or the assessment of their performance. Here, we focus on those factors, drawn from the literature, that have a disproportionate negative impact on UIM students and may contribute to the observed differences in assessed clinical performance.

Personal and interpersonal factors

Although personal and interpersonal factors influence all students’ performance, UIM students shoulder additional burdens as they strive to learn and demonstrate their abilities in the clinical environment.31–35 Microaggressions or overt racism directed at the student or others in the environment precipitate the stress response, resulting in impaired critical thinking, lower speech fluency, and sleep disorders that interfere with long-term memory.36–40 Activation of stereotype threat (i.e., environmental cues that make salient negative stereotypes about an individual’s status as a member of a group) has the potential to lead to diminished working memory capacity and performance overmonitoring, resulting in less engagement, lower cognitive risk taking, and less acceptance of feedback.31,41 Also, UIM students are often keenly attuned to and feel personally responsible for addressing issues of disparate care provided for minority patients, as a manifestation of race-conscious professionalism.35 Further, the difficulty of finding race- and ethnicity-concordant senior trainees and faculty contributes to feelings of isolation and low social capital.42 The performance of a student preoccupied by these issues may underrepresent his or her true abilities. Over the long term, the cumulative impact of these stressful distractions has the potential to interfere with UIM students’ overall competency development despite their capabilities and contributions.

Personal and interpersonal factors involving interactions with faculty and residents may also contribute to UIM students’ lower clerkship performance assessments. Because unconscious bias manifests in multiple aspects of academic medicine, from hiring to health care delivery, it is likely that unconscious bias also interferes with the accurate assessment of UIM learners by some faculty and residents.43,44 Faculty and residents may be sources of microaggressions in their interactions with UIM students and patients. Although these slights are often unintended, they can be impactful. In addition to distracting the student in the midst of important clinical discussions, microaggressions from a supervisor damage the trusting relationship needed to engage in workplace learning.32,40,45 In addition, faculty and residents who espouse a colorblind ideology may fail to recognize opportunities to address issues such as stereotype threat, racism, and the impact of microaggressions on performance.46 Ignorance about race-conscious professionalism may lead a not-UIM faculty member or resident to interpret a UIM student’s advocacy for minority patients as uninformed, disrespectful, or unprofessional. Taken collectively, these issues may result in faculty and resident assessments of a UIM student’s performance that are lower than the student’s true level of performance or ability.

Structural and cultural issues

Structural and cultural issues in the clinical learning environment may contribute to group differences in assessed performance. Rapidly rotating team assignments—particularly when teams lack racial or ethnic diversity other than that provided by the student—may exacerbate UIM students’ sense of isolation and lack of belonging.47,48 UIM students with proficiency in another language may preferentially be assigned to work with patients with low English proficiency49 without institutional recognition that this work often takes more time, particularly if the patients also have low health literacy or lack social support. These structural and cultural issues increase the work of learning and patient care for UIM students to an extent that often is not recognized by faculty and not accounted for in standard assessment strategies.

Moreover, structural issues in the conduct of grading may influence performance assessments. Given the known observed differences between UIM and not-UIM populations in performance on many standardized exams,15–17 institutional policies that afford sizeable weight to standardized knowledge-based clerkship exams (e.g., National Board of Medical Examiners subject exams) in determining honors grades may further perpetuate differences. Unless data exist that correlate clinical performance with exam scores, inclusion of exam results in clerkship grades could be viewed as inequitable. Employing normative grading policies (e.g., awarding honors grades to the top 25% of the class), rather than criterion-based grading policies (e.g., awarding honors grades to all who exceed a specified level of competency), may amplify small differences in assessment scores to large differences in achievement of honors grades—particularly when range restriction exists in the use of clerkship assessment scales.50

Figure 1 Fishbone diagram illustrating the causes, effects, and consequences of lower assessed performance in underrepresented in medicine (UIM) students compared with all students. [Diagram content, with causes arrayed along personal, interpersonal, cultural, and structural axes: student factors (variable interest, competency, and effort; performance anxiety; susceptibility to stereotype threat; exposure to microaggressions and racism; desire for race-conscious professionalism; low social capital/belonging); faculty/resident rater factors (variable teaching and assessment skills; insufficient performance observations; student mistreatment; personal burnout; colorblind ideology; unconscious bias; commission of microaggressions); clinical learning environment (struggling teams; competing demands on team; patient complexity; different clinical sites; health care inequities; rapidly changing team assignments; differential patient assignments); and school policies and procedures (response to allegations of discrimination; management of grade disputes; normative vs criterion-based grading; delegation of grading process to departments; selection and training of faculty; extra weighting of exam scores). Effects: impact on all students, with disproportionate impact on UIM students. Consequences: fewer honors grades and awards, and fewer opportunities in career, specialty, and residency program.]

Countermeasures to Differential Attainment

As we have discussed above, UIM students shoulder substantially more burdens than their not-UIM peers, and their increased cognitive and emotional loads of learning and performing are neither commonly addressed in the educational environment nor captured and considered in assessment and grading. Thus, the paradigm of medical school as a colorblind, equal opportunity educational environment for learning and performing must be challenged. Truly equitable educational environments must take into account the differential experiences of UIM students in contrast to their not-UIM peers.

In an effort to begin to create an equitable educational environment, the leaders of the UCSF Bridges curriculum renewal51,52 (focused on preparing learners for practice in complex systems to improve health care and advance science) began in 2016 to work collaboratively with our institution’s Differences Matter

Initiative53 (focused on efforts to increase equity and inclusion throughout the institution). This collaboration has enabled educators to work with experts in health care disparities and critical race theory as well as to engage a broader community of UIM and not-UIM faculty and staff in identifying potential solutions to the problem and causes of differential attainment identified in our internal study (Figure 1). Achieving a more equitable and inclusive educational environment requires an integrated systems approach to change.54 We have approached change through redesign of the medical school curriculum, learning and assessment

Table 1
Countermeasures Adopted, Starting in 2016, by the University of California, San Francisco, School of Medicine to Begin to Address Differential Attainment by Underrepresented in Medicine (UIM) Students in the Clinical Learning Environment

Element: Curriculum
- Countermeasure: Incorporate a major curricular theme of social justice, beginning with a three-day orientation for students with faculty-facilitated small-group discussions about diversity, inclusion, racism, and privilege in education, health, and society, and continuing throughout the curriculum.
  Rationale: Creating a culture of inclusion requires all students, regardless of background, to develop competency in these issues. In addition, faculty must develop comfort in working with these issues.
- Countermeasure: Design and implement a process to review and revise patient panels, case examples, and standardized patient exercises to remove unintended stereotypes and to increase the number of representative examples from UIM groups.
  Rationale: Students must see a variety of patients representing different aspects of diversity throughout the curriculum and should interact with them in examples of routine care, not merely when issues of diversity are being considered.
- Countermeasure: Incorporate teaching throughout the curriculum about the causes of, impacts of, and solutions to health care inequities.
  Rationale: Students and faculty must understand how issues of individual bias and structural racism affect seeking, delivery, and outcomes of care for minority populations.

Element: Assessment and grading
- Countermeasure: Mandate the use of grading committees in all core clerkships.
  Rationale: Effectively run grading committees can guard against unchecked bias by one individual assigning grades.
- Countermeasure: Restrict the amount of weight afforded to exam scores in grade assignment.
  Rationale: Deemphasizing knowledge-based exam scores allows students to focus their attention on gaining the requisite clinical knowledge and skills to provide optimal patient care.
- Countermeasure: Shift from normative to criterion-based grading for core clerkships and subinternships.
  Rationale: Criterion-based grading provides transparency about the level of performance required to achieve a grade and does not push students to compete against each other. Increasing the number of honors grades obtainable (i.e., removing the limit on the number of honors grades received) allows all those who meet the criterion to receive honors.

Element: Faculty
- Countermeasure: Implement a small-group coaching program with faculty who have release time to engage with students and develop skills in inclusive teaching.
  Rationale: Students develop a supportive mentoring and advising relationship with a faculty member who is aware of and oversees their learning and growth, acts as an advocate for them, and guides them through their medical school journey.
- Countermeasure: Intentionally recruit UIM faculty to the medical school and to educational leadership roles within the school.
  Rationale: UIM students learn about and draw on UIM faculty who serve as role models.
- Countermeasure: Offer new, and adapt all existing, faculty development offerings to include content and skill-building activities focused on inclusive teaching (management of stereotype threat, mitigation of unconscious bias and microaggressions, awareness of race-conscious professionalism, addressing racist comments in the moment); require faculty development for all those with responsibility for assessment and grading.
  Rationale: Faculty learn how issues of individual bias and structural racism may affect student learning and assessment outcomes and how to ensure that teaching and assessment are conducted in an equitable manner.

Element: Program evaluation
- Countermeasure: Include race-, ethnicity-, and gender-based evaluations of assessment and grading in all core clerkship program evaluation processes, including a focus on cultural sensitivity and microaggressions.
  Rationale: Educational continuous quality improvement requires identification and monitoring of all variables that relate to individual and population success and wellness within the medical education environment.
- Countermeasure: Track student satisfaction with education and career opportunities by race, ethnicity, gender, and other vectors of diversity.


methods, and faculty development programs so that all constituents (i.e., faculty, leadership) are equally responsible for (1) understanding the issues of racism, bias, and privilege, and how they manifest in the educational and health care environment; (2) recognizing and addressing microaggressions when they occur in our environments; (3) mentoring effectively across racial, ethnic, and other group differences; and (4) redesigning health care delivery systems and educating individuals to address health care disparities. Table 1 summarizes the countermeasures, the components of our strategic response to the issues identified, which we began implementing in 2016. We are presently investigating the outcomes of these internal strategies as a part of the ECQI process and anticipate that our findings will inform long-term solutions.

Future Research

Given the importance of diversity to quality of care in the U.S. health care system, federal funding should support a research agenda that builds an evidence base for equity in health professions education, not only for UIM learners but also for learners from diverse backgrounds. Future research should examine and seek prevention or intervention strategies for issues such as racism, microaggressions, and stereotype threat triggers that create learning environments unfavorable to optimal performance; for unconscious bias and colorblind ideologies that lead to inaccurate assessments; and for normative ranking and selection strategies that risk amplifying small performance differences of uncertain significance into large consequences that affect careers. Similar to research into prevention of medical errors, research into systems designs that protect vulnerable students from fallible faculty will be critical.

Conclusion

Efforts to diversify the nation’s physician workforce cannot focus only on pipeline programs and holistic admissions processes. Ultimately, mitigating the impact of differential attainment by UIM students on selection for competitive residencies and faculty careers will require national conversations about inclusive learning

environments; equity in performance assessments at medical schools; and equitable, not-colorblind strategies in candidate evaluation and selection at residency programs. Further work is needed to ensure that learning environments are inclusive; instructional methods and assessment tools are equitable; and selection strategies for residency, fellowship, and faculty positions provide fair access to all candidates who are broadly qualified to succeed. A collective commitment by medical schools and teaching hospitals to seek out and address issues of inequity in learning, assessment, advancement, selection, and career opportunities is urgently needed. Successfully addressing these educational challenges will catalyze efforts to create equity in medical education and enable educators to deliver on our social obligation to improve the health of our communities.55

Acknowledgments: The authors wish to thank the University of California, San Francisco, School of Medicine students, staff, and faculty who participated in discussions about and provided feedback on the project and manuscript. They would like to thank Mark Lovett, Elaine Allen, and Judy Quan for their data collection and methodological expertise. They would like to thank Patricia O’Sullivan and Christy Boscardin for their thoughtful feedback on the manuscript.

Funding/Support: None reported.

Other disclosures: None reported.

Ethical approval: The University of California, San Francisco Committee for Human Research Institutional Review Board approved the research protocol as exempt, May 2016, IRB no. 16-19486.

Previous presentations: The results from the analysis were presented institution-wide at the University of California, San Francisco.

A. Teherani is professor, Department of Medicine, education researcher, Center for Faculty Educators, and director of program evaluation, University of California, San Francisco, School of Medicine, San Francisco, California; ORCID: http://orcid.org/0000-0003-2936-9832.

K.E. Hauer is professor, Department of Medicine, and associate dean for assessment, University of California, San Francisco, School of Medicine, San Francisco, California; ORCID: https://orcid.org/0000-0002-8812-4045.

A. Fernandez is professor, Department of Medicine, University of California, San Francisco, School of Medicine, San Francisco, California.

T.E. King Jr is professor, Department of Medicine, and dean, University of California, San Francisco, School of Medicine, San Francisco, California.

C. Lucey is professor of medicine and vice dean for education, University of California, San Francisco, School of Medicine, San Francisco, California.

References

1 Association of American Medical Colleges. Diversity in the Physician Workforce: Facts and Figures 2014. Washington, DC: Association of American Medical Colleges; 2014. https://www.aamc.org/data/workforce/reports/439214/workforcediversity.html. Accessed May 9, 2018.

2 U.S. Department of Health and Human Services. HHS Action Plan to Reduce Racial and Ethnic Disparities: A Nation Free of Disparities in Health and Health Care. Washington, DC: U.S. Department of Health and Human Services; 2011. https://minorityhealth.hhs.gov/npa/files/Plans/HHS/HHS_Plan_complete.pdf. Accessed May 9, 2018.

3 United States Census Bureau. Facts for Features: Hispanic Heritage Month 2014. Release no. CB14-FF.22. https://www.census.gov/newsroom/facts-for-features/2014/cb14-ff22.html. Published September 2014. Accessed May 9, 2018.

4 Association of American Medical Colleges. Diversity in Medical Education: Facts and Figures 2012. Washington, DC: Association of American Medical Colleges; 2012. https://members.aamc.org/eweb/upload/Diversity in Medical Education Facts and Figures 2012.pdf. Accessed May 9, 2018.

5 Andriole DA, McDougle L, Bardo HR, Lipscomb WD, Metz AM, Jeffe DB. Postbaccalaureate premedical programs to promote physician-workforce diversity. J Best Pract Health Prof Divers. 2015;8:1036–1048.

6 Association of American Medical Colleges. Advancing holistic review initiative. https://www.aamc.org/download/358384/data/holisticreviewbrochure.pdf. Published 2014. Accessed May 9, 2018.

7 Urban Universities for Health. Holistic Admissions in the Health Professions. Washington, DC: Urban Universities for Health; 2014. http://urbanuniversitiesforhealth.org/media/documents/Holistic_Admissions_in_the_Health_Professions.pdf. Accessed November 13, 2016. [No longer available.]

8 Day CS, Lage DE, Ahn CS. Diversity based on race, ethnicity, and sex between academic orthopaedic surgery and other specialties. J Bone Joint Surg Am. 2010;92:2328–2335.

9 Schwartz J, Young M. The evolution of racial, ethnic, and gender diversity in U.S. otolaryngology residency programs. Otolaryngology. 2013;149(1):71–76.

10 Xierali IM, Fair MA, Nivet MA. Faculty diversity in U.S. medical schools: Progress and gaps coexist. AAMC Analysis in Brief. 2016;16(6). https://www.aamc.org/download/474172/data/december2016facultydiversityinu.s.medicalschoolsprogressandgaps.pdf. Accessed May 14, 2018.

11 Association of American Medical Colleges. Underrepresented in medicine definition. https://www.aamc.org/initiatives/urm. Accessed May 14, 2018.


12 Green M, Jones P, Thomas JX Jr. Selection criteria for residency: Results of a national program directors survey. Acad Med. 2009;84:362–367.

13 McGaghie WC, Cohen ER, Wayne DB. Are United States Medical Licensing Exam Step 1 and 2 scores valid measures for postgraduate medical residency selection decisions? Acad Med. 2011;86:48–52.

14 Andriole DA, Jeffe DB, Hageman HL, et al. Variables associated with full-time faculty appointment among contemporary U.S. medical school graduates: Implications for academic medicine workforce diversity. Acad Med. 2010;85:1250–1257.

15 Miller C, Stassun K. A test that fails. Nature. 2014;510(7504):303–304.

16 Davis D, Dorsey JK, Franks RD, Sackett PR, Searcy CA, Zhao X. Do racial and ethnic group differences in performance on the MCAT exam reflect test bias? Acad Med. 2013;88:593–602.

17 Kleshinski J, Khuder SA, Shapiro JI, Gold JP. Impact of preadmission variables on USMLE Step 1 and Step 2 performance. Adv Health Sci Educ Theory Pract. 2009;14:69–78.

18 Quinn DM, Cooc N. Science achievement gaps by gender and race/ethnicity in elementary and middle school: Trends and predictors. Educ Res. 2015;44(6):336–346.

19 Boatright D, Ross D, O’Connor P, Moore E, Nunez-Smith M. Racial disparities in medical student membership in the Alpha Omega Alpha Honor Society. JAMA Intern Med. 2017;177:659–665.

20 U.S. Department of Veterans Affairs. VA National Center for Patient Safety: Root cause analysis (RCA). https://www.patientsafety.va.gov/professionals/onthejob/rca.asp. Accessed May 14, 2018.

21 Ishikawa K. Introduction to Quality Control. Tokyo, Japan: Asian Productivity Association Publications; 1991.

22 Association of American Medical Colleges. Medical Minority Applicant Registry (Med-MAR). https://students-residents.aamc.org/choosing-medical-career/article/medical-minority-applicant-registry-med-mar. Accessed May 14, 2018.

23 Grimm L, Yarnold P. Reading and Understanding Multivariate Statistics. Washington, DC: American Psychological Association Press; 2000.

24 Brancati FL, Mead LA, Levine DM, Martin D, Margolis S, Klag MJ. Early predictors of career achievement in academic medicine. JAMA. 1992;267:1372–1376.

25 Woolf K, McManus IC, Potts HW, Dacre J. The mediators of minority ethnic underperformance in final medical school examinations. Br J Educ Psychol. 2013;83(pt 1):135–159.

26 Brosnan C, Southgate E, Outram S, et al. Experiences of medical students who are first in family to attend university. Med Educ. 2016;50:842–851.

27 Rubin M, Denson N, Kilpatrick S, Matthews KE, Stehlik T, Zyngier D. “I am working-class”: Subjective self-definition as a missing measure of social class and socioeconomic status in higher education research. Educ Res. 2014;43(4):196–200.

28 Woolf K, Potts HW, McManus IC. Ethnicity and academic performance in UK trained doctors and medical students: Systematic review and meta-analysis. BMJ. 2011;342:d901.

29 Lumb AB, Vail A. Comparison of academic, application form and social factors in predicting early performance on the medical course. Med Educ. 2004;38:1002–1005.

30 Regan de Bere S, Nunn S, Nasser M. Understanding differential attainment across medical training pathways: A rapid review of the literature. Final report prepared for the General Medical Council. https://www.gmc-uk.org/-/media/documents/GMC_Understanding_Differential_Attainment.pdf_63533431.pdf. Published 2015. Accessed May 14, 2018.

31 Steele CM, Aronson J. Stereotype threat and the intellectual test performance of African Americans. J Pers Soc Psychol. 1995;69: 797–811.

32 White AA 3rd, Stubblefield-Tave B. Some advice for physicians and other clinicians treating minorities, women, and other patients at risk of receiving health care disparities. J Racial Ethn Health Disparities. 2017;4:472–479.

33 Olayiwola JN. Racism in medicine: Shifting the power. Ann Fam Med. 2016;14:267–269.

34 Paul-Emile K, Smith AK, Lo B, Fernández A. Dealing with racist patients. N Engl J Med. 2016;374:708–711.

35 Powers BW, White AA, Oriol NE, Jain SH. Race-conscious professionalism and African American representation in academic medicine. Acad Med. 2016;91:913–915.

36 Mensah MO. Making all lives matter in medicine from the inside out. JAMA Intern Med. 2017;48(6):1267–1278.

37 Burgess DJ, Warren J, Phelan S, Dovidio J, van Ryn M. Stereotype threat and health disparities: What medical educators and future physicians need to know. J Gen Intern Med. 2010;25(suppl 2):S169–S177.

38 Rosenstein JE. Individual threat, group threat, and racial policy: Exploring the relationship between threat and racial attitudes. Soc Sci Res. 2008;37:1130–1146.

39 Slopen N, Williams DR. Discrimination, other psychosocial stressors, and self-reported sleep duration and difficulties. Sleep. 2014;37:147–156.

40 Lucey CR, Navarro R, King TE Jr. Lessons from an educational never event. JAMA Intern Med. 2017;177:1415–1416.

41 Schmader T, Johns M. Converging evidence that stereotype threat reduces working memory capacity. J Pers Soc Psychol. 2003;85:440–452.

42 Woolf K, Rich A, Viney R, Needleman S, Griffin A. Perceived causes of differential attainment in UK postgraduate medical training: A national qualitative study. BMJ Open. 2016;6:e013429.

43 Girod S, Fassiotto M, Grewal D, et al. Reducing implicit gender leadership bias in academic medicine with an educational intervention. Acad Med. 2016;91:1143–1150.

44 Williams DR, Rucker TD. Understanding and addressing racial disparities in health care. Health Care Financ Rev. 2000;21:75–90.

45 Montenegro R. Microaggressions during medical training—Reply. JAMA. 2016;316:1114.

46 Purdie-Vaughns V, Steele CM, Davies PG, Ditlmann R, Crosby JR. Social identity contingencies: How diversity cues signal threat or safety for African Americans in mainstream institutions. J Pers Soc Psychol. 2008;94:615–630.

47 Woolf K, Potts HW, Patel S, McManus IC. The hidden medical school: A longitudinal study of how social networks form, and how they relate to academic performance. Med Teach. 2012;34:577–586.

48 Vaughan S, Sanders T, Crossley N, O’Neill P, Wass V. Bridging the gap: The roles of social capital and ethnicity in medical student achievement. Med Educ. 2015;49:114–123.

49 Fernandez A, Wang F, Braveman M, Finkas LK, Hauer KE. Impact of student ethnicity and primary childhood language on communication skill assessment in a clinical performance examination. J Gen Intern Med. 2007;22:1155–1160.

50 Kuo LE, Hoffman RL, Morris JB, et al. A milestone-based evaluation system—The cure for grade inflation? J Surg Educ. 2015;72:e218–e225.

51 University of California, San Francisco, School of Medicine. Medical education: Bridges curriculum. http://meded.ucsf.edu/bridges. Accessed May 14, 2018.

52 Hauer KE, Boscardin C, Fulton TB, Lucey C, Oza S, Teherani A. Using a curricular vision to define entrustable professional activities for medical student assessment. J Gen Intern Med. 2015;30:1344–1348.

53 University of California, San Francisco, School of Medicine. Differences Matter initiative. http://differencesmatter.ucsf.edu. Accessed May 14, 2018.

54 Smith DG. Building institutional capacity for diversity and inclusion in academic medicine. Acad Med. 2012;87:1511–1515.

55 Frenk J, Chen L, Bhutta ZA, et al. Health professionals for a new century: Transforming education to strengthen health systems in an interdependent world. Lancet. 2010;376:1923–1958.


Differences in Narrative Language in Evaluations of Medical Students by Gender and Under-represented Minority Status

Alexandra E. Rojek, AB1, Raman Khanna, MD, MAS2, Joanne W. L. Yim, PhD3, Rebekah Gardner, MD4, Sarah Lisker, BA1,5, Karen E. Hauer, MD, PhD1, Catherine Lucey, MD1, and Urmimala Sarkar, MD, MPH1,5

1University of California, San Francisco School of Medicine, San Francisco, CA, USA; 2Division of Hospital Medicine, University of California, San Francisco, School of Medicine, San Francisco, CA, USA; 3Health Informatics, UCSF Health, University of California, San Francisco, San Francisco, CA, USA; 4Warren Alpert Medical School of Brown University, Providence, RI, USA; 5UCSF Center for Vulnerable Populations, San Francisco, CA, USA.

BACKGROUND: In varied educational settings, narrative evaluations have revealed systematic and deleterious differences in language describing women and those under-represented in their fields. In medicine, limited qualitative studies show differences in narrative language by gender and under-represented minority (URM) status.
OBJECTIVE: To identify and enumerate text descriptors in a database of medical student evaluations using natural language processing, and identify differences by gender and URM status in descriptions.
DESIGN: An observational study of core clerkship evaluations of third-year medical students, including data on student gender, URM status, clerkship grade, and specialty.
PARTICIPANTS: A total of 87,922 clerkship evaluations from core clinical rotations at two medical schools in different geographic areas.
MAIN MEASURES: We employed natural language processing to identify differences in the text of evaluations for women compared to men and for URM compared to non-URM students.
KEY RESULTS: We found that of the ten most common words, such as "energetic" and "dependable," none differed by gender or URM status. Of the 37 words that differed by gender, 62% represented personal attributes, such as "lovely" appearing more frequently in evaluations of women (p < 0.001), while 19% represented competency-related behaviors, such as "scientific" appearing more frequently in evaluations of men (p < 0.001). Of the 53 words that differed by URM status, 30% represented personal attributes, such as "pleasant" appearing more frequently in evaluations of URM students (p < 0.001), and 28% represented competency-related behaviors, such as "knowledgeable" appearing more frequently in evaluations of non-URM students (p < 0.001).
CONCLUSIONS: Many words and phrases reflected students' personal attributes rather than competency-related behaviors, suggesting a gap in implementing competency-based evaluation of students. We observed a significant difference in narrative evaluations associated with gender and URM status, even

KEY WORDS: medical education; medical education—assessment/evaluation; medical student and residency education.

J Gen Intern Med 34(5):684–91

DOI: 10.1007/s11606-019-04889-9

© Society of General Internal Medicine 2019

INTRODUCTION

Core clerkships are a key foundation of medical education for students, and the assessments that are associated with these clerkships are informed by narrative evaluations completed by supervising physicians during these clerkships. These evaluations form the basis of clerkship grades, with the narrative language from evaluations being quoted in Medical Student Performance Evaluation (MSPE) letters and recommendation letters, which are a core component of residency applications.1

Inherently, however, narrative language is open to bias and the consequences that can arise from it.2

The National Academies of Science, Engineering, and Medicine found that in academic settings, subjective evaluation criteria are often infiltrated with bias that disadvantages women.3 The Association of American Medical Colleges reported that recruitment, evaluation, and promotion processes involve implicit and unconscious bias, inhibiting the development of a diverse medical workforce.4 Research using manual and programmatic approaches to linguistic analyses, such as qualitative coding and automated text analysis, respectively, suggests that narrative evaluations can introduce gender-based stereotypes, including the perception of women as emotional and sensitive5–11 that can be detrimental to the advancement of the individual being evaluated.12 Furthermore, the consequences of subjective assessments may be even more damaging to racial and ethnic minorities that are underrepresented in these fields.13 For example, underrepresented groups in medicine may be even more "othered," or differentiated in a manner that excludes, marginalizes, or subordinates.14–18 This

An earlier version of this work was presented in Denver, Colorado, at the Society of General Internal Medicine's annual meeting in April 2018.

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s11606-019-04889-9) contains supplementary material, which is available to authorized users.

Published online April 16, 2019



phenomenon may be due to an insufficiently diverse physician workforce,19 as well as subject to the reported tendency of supervisors to focus on social and cultural factors rather than competency-related factors.13

In 1999, the Accreditation Council for Graduate Medical Education (ACGME) and the American Board of Medical Specialties endorsed competency-based evaluations in an effort to move towards assessment of specific competence domains based on behavior rather than personal attributes. The ACGME later introduced milestones as a standardized method to operationalize the assessment of students' progress towards achieving these competencies.20 As a result, American medical schools focus on competency-based assessment. We aim to characterize narrative language differences

among medical student evaluations using natural language processing techniques. Our primary aim is to understand whether students are described differently by gender as well as by under-represented minority (URM) status, using metrics commonly employed in natural language processing.

METHODS

Design

This study was approved by the University of California, San Francisco Institutional Review Board (15-18271) and deemed exempt by the Brown University Institutional Review Board. This is a secondary data analysis of narrative evaluations (text) from two medical schools. We applied natural language processing to elucidate differences by gender and URM status.

Data Sources

We included data from all third-year core clerkship evaluations from two medical schools affiliated with public and private academic institutions in large urban settings, with associated information about student demographics, clerkship specialty,

and grade. Data were collected from 2006 to 2015 at school 1 (as identified in Table 1), and from 2011 to 2016 at school 2, to exclude years in which major grading practice changes were implemented. At both of these schools, grading choices in each clerkship were three mutually exclusive options: non-pass, pass, or honors, with no intermediate options. Only complete cases containing student gender, URM status, clerkship specialty, and grade received were used in analyses, with a total of 87,922 evaluations meeting these criteria. Students self-identified their ethnicity, and the medical schools determined which racial/ethnic categories were URM. We used this institutional definition of URM status as Black or African American, Hispanic or Latino, and American Indian or Alaska Native. All other self-identified ethnicities were categorized as non-URM.

Both schools included in this study fully incorporate ACGME recommendations for Core Competencies for medical student training,20 and these recommendations had been implemented before the study period. Additionally, grades for required core clerkships were determined by a combination of clinical ratings with standardized exam scores, where the National Board of Medical Examiners (NBME) exam accounted for no more than 25% of a grade. At each school, no more than 30% of students received honors in a clerkship. Sample evaluation forms from each institution are available in Appendix Figures 1–2 online. Faculty at each institution were similar in composition: in 2015, school 1 had 48% female faculty, while school 2 had 45%. At school 1, 8% of faculty were URM, 88% were non-URM, and 4% were unknown; at school 2, 4% of faculty were URM, 84% were non-URM, and 12% were unknown.

Data were collected with unique identifiers for each narrative evaluation, without linkages between multiple evaluations for a single student across clerkships and time. A breakdown of evaluation composition by grade, gender, URM status, and specialty is shown in Table 1.
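The complete-case inclusion rule described above amounts to a simple filter over evaluation records. A minimal sketch in Python; the field names here are illustrative, not the schools' actual data schema:

```python
# Hypothetical field names standing in for the four required variables:
# student gender, URM status, clerkship specialty, and grade received.
REQUIRED_FIELDS = ("gender", "urm_status", "specialty", "grade")

def complete_cases(records):
    """Keep only evaluations in which every required field is present."""
    return [r for r in records
            if all(r.get(field) is not None for field in REQUIRED_FIELDS)]

evals = [
    {"gender": "F", "urm_status": "URM", "specialty": "IM", "grade": "honors"},
    {"gender": "M", "urm_status": None, "specialty": "IM", "grade": "pass"},
]
print(len(complete_cases(evals)))  # only the first record is complete
```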

Table 1 Dataset Characteristics

Characteristic          All evaluations, N = 87,922 (%)   School 1 (%)   School 2 (%)
Student gender
  Male                  38,952 (44)                       30,431 (43)    8521 (46)
  Female                48,970 (55)                       39,074 (56)    9896 (53)
Student minority status
  Non-URM               65,974 (75)                       51,933 (74)    14,041 (76)
  URM                   21,948 (25)                       17,572 (25)    4376 (23)
Clerkship grade
  Honors                28,883 (32)                       21,905 (31)    6978 (37)
  Pass                  58,748 (66)                       47,332 (68)    11,416 (62)
  Non-pass              291 (0.3)                         268 (0.4)      23 (0.1)
Clerkship specialty
  Internal medicine     18,731 (21)                       13,271 (19)    5460 (29)
  Family medicine       8560 (9)                          7139 (10)      1421 (7)
  Surgery               11,049 (12)                       8338 (12)      2711 (14)
  Pediatrics            17,929 (20)                       13,686 (19)    4243 (23)
  Neurology             6366 (7)                          5877 (8)       489 (2)
  Psychiatry            9041 (10)                         7712 (11)      1329 (7)
  Ob/Gyn                9995 (11)                         7231 (10)      2764 (15)
  Anesthesia            6251 (7)                          6251 (9)       0 (0)

URM, under-represented minority; Ob/Gyn, obstetrics/gynecology



De-identification

The narrative text of evaluations was de-identified in a two-step process. First, a database of names from publicly available US Census and Social Security name records was compiled,21–23 and the text of evaluations was matched against these names, with a second filter of parts-of-speech processing to identify proper nouns. All names identified in this process were replaced with generic fillers. A subset of the narrative evaluations was manually verified for complete de-identification.
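The first step of this process can be sketched as a lookup of capitalized tokens against a name list. The name set and filler token below are invented for illustration; the study's list came from Census and Social Security records, with a separate parts-of-speech pass to catch proper nouns:

```python
import re

# Hypothetical name list; the study compiled its list from publicly
# available US Census and Social Security records (refs 21-23).
KNOWN_NAMES = {"john", "maria", "chen"}

def deidentify(text: str, filler: str = "[NAME]") -> str:
    """Replace tokens matching a name list with a generic filler.

    Sketch of step one only; the second step (parts-of-speech
    filtering for proper nouns) is omitted here.
    """
    def sub(match):
        word = match.group(0)
        # Only capitalized tokens are candidate proper nouns.
        if word[0].isupper() and word.lower() in KNOWN_NAMES:
            return filler
        return word
    return re.sub(r"[A-Za-z]+", sub, text)

print(deidentify("Maria presented the case clearly."))
# -> [NAME] presented the case clearly.
```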

Parsing

We used an openly available, pretrained English-language parser from Google, which uses SyntaxNet,24,25 an open-source neural network framework built on TensorFlow machine learning. This was applied to the narrative evaluations both to assist in de-identification and to perform parts-of-speech tagging and text parsing, which formed the basis of the dataset used in the primary analyses below.

Analysis

First, we compared the distribution of grades, dichotomized to honors versus pass, across gender and URM status, as well as clerkship specialty, using Pearson's chi-squared tests after applying the Benjamini-Hochberg procedure for multiple testing correction. Second, we examined the length of evaluations, quantifying differences in distribution with the Wilcoxon-Mann-Whitney test.

Next, we generated a list of frequently used descriptors,

defining descriptors as adjectives. The ten most frequent termsdid not differ by gender or URM status when stratified bygrade (Appendix Tables 2a and 2b online). To accuratelycharacterize word frequency, we employed a widely usednatural language processing method known as termfrequency-inverse document frequency (TF-IDF),26 which isa measure of the frequency of a term, adjusted for how rarely it

is used. Here, we defined term-frequency as the frequency of agiven word in an evaluation, and inverse document frequencyas the inverse of the total frequency of the word’s usage acrossall evaluations.We then averaged the TF-IDF value for a givenword by gender and URM status. Examining TF-IDF bygender and URM status allowed us to infer the significanceof a word, and whether this word was used with similar weightand meaning across evaluations. For example, the wordBexcellent^ has a highly positive connotation. However, be-cause it appeared so frequently across all evaluations, it cor-responds to a lower TF-IDF score, thus one particular usage ofthe word Bexcellent^ does not confer much meaning. In con-trast, the word Benergetic^ appeared in fewer evaluationsoverall, so has a higher TF-IDF score, making each usage ofBenergetic^ carry more weight.
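As a rough illustration of the descriptor-extraction step, the sketch below substitutes a toy part-of-speech lexicon for the study's neural parser; the tags and example sentences are invented.

```python
from collections import Counter

TOY_POS = {  # invented tag lexicon, standing in for parser output
    "energetic": "ADJ", "excellent": "ADJ", "student": "NOUN",
    "presented": "VERB", "thorough": "ADJ", "the": "DET",
}

def extract_descriptors(evaluations):
    """Collect adjective tokens (the study's definition of a descriptor)
    across all evaluations and count their occurrences."""
    counts = Counter()
    for text in evaluations:
        for tok in text.lower().split():
            if TOY_POS.get(tok) == "ADJ":
                counts[tok] += 1
    return counts

evals = ["The energetic student presented excellent thorough notes",
         "Excellent student"]
print(extract_descriptors(evals).most_common())
```

With a real parser, the same counting step runs over its adjective tags rather than a hand-built lexicon.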

TF-IDF has been shown in other work,27 primarily in the field of information retrieval, to be superior to absolute term frequency because it weights the frequency of terms in a manner that reflects their "importance." As we suspected, the TF-IDF values of the most commonly used words determined by overall frequency were low, suggesting that their wide usage reflects a range of meanings (Appendix Table 3 online). We ranked the descriptors that were used in more than 1% of evaluations (a common threshold in large text datasets) by TF-IDF score, by gender, and by URM status.
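A minimal TF-IDF sketch, in Python for illustration (the study's scripts were written in R, and this uses the common log-scaled inverse document frequency, which may differ from the authors' exact weighting). The three evaluation snippets are invented, not study data.

```python
import math

# Toy corpus of hypothetical evaluation snippets.
evals = [
    "excellent student excellent presenter",
    "energetic and excellent team player",
    "excellent fund of knowledge",
]

def tfidf(word, doc, corpus):
    """TF-IDF of a word in one evaluation, relative to all evaluations.
    Assumes the word appears somewhere in the corpus."""
    tokens = doc.split()
    tf = tokens.count(word) / len(tokens)        # frequency within this evaluation
    df = sum(word in d.split() for d in corpus)  # evaluations containing the word
    idf = math.log(len(corpus) / df)             # rarer words get higher weight
    return tf * idf

# "excellent" appears in every evaluation, so each usage carries no weight:
print(tfidf("excellent", evals[0], evals))  # 0.0, since idf = log(3/3) = 0
# "energetic" appears in only one evaluation, so its usage carries more:
print(tfidf("energetic", evals[1], evals))  # > 0
```

This reproduces the intuition in the text: a ubiquitous positive word like "excellent" scores near zero, while a rarer word like "energetic" scores higher per usage.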

Finally, we reported which descriptors evaluators used differently between groups by gender and URM status, using Pearson's chi-squared tests with Benjamini-Hochberg corrections. We surveyed this study's co-authors, who represent experts in medical education as well as clinical faculty, about the descriptors found to be used differently with statistical significance by gender and URM status, asking whether the descriptors were reflective of "personal attributes" versus "competency-related" terms, or neither. We then categorized each word based on majority vote and present this categorization in Table 4.
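The per-descriptor comparison amounts to a Pearson chi-squared test on a 2x2 table of descriptor presence versus group. A minimal sketch, with invented counts (not study data):

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]],
    where rows are groups and columns are descriptor present/absent,
    using the standard shortcut formula for 2x2 tables."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical example: a descriptor appears in 120 of 1000 evaluations
# of one group versus 60 of 1000 evaluations of the other.
stat = chi2_2x2(120, 880, 60, 940)
print(round(stat, 2))  # about 21.98, well above the 3.84 cutoff for p < 0.05 at 1 df
```

In the study, one such test per descriptor produces the family of p values that the Benjamini-Hochberg procedure then corrects.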

Table 2 Grade Distribution by Gender, URM Status and Specialty

Clerkship           Evaluations of women with honors grades, n (%)   Evaluations of men with honors grades, n (%)   p value
Internal medicine   3503 (33)                                        2790 (33)                                      0.75
Family medicine     1581 (33)                                        1024 (26)                                      < 0.001
Surgery             1829 (30)                                        1627 (32)                                      0.01
Pediatrics          3505 (35)                                        2182 (27)                                      < 0.001
Neurology           1227 (34)                                        872 (30)                                       < 0.001
Psychiatry          1714 (34)                                        1121 (27)                                      < 0.001
Ob/Gyn              2353 (42)                                        1457 (32)                                      < 0.001
Anesthesia          1164 (31)                                        934 (36)                                       < 0.001

Clerkship           Evaluations of URM students with honors grades, n (%)   Evaluations of non-URM students with honors grades, n (%)   p value
Internal medicine   792 (17)                                                5501 (38)                                                   < 0.001
Family medicine     471 (22)                                                2134 (33)                                                   < 0.001
Surgery             414 (15)                                                3042 (36)                                                   < 0.001
Pediatrics          788 (17)                                                4899 (36)                                                   < 0.001
Neurology           243 (15)                                                1856 (38)                                                   < 0.001
Psychiatry          348 (15)                                                2487 (36)                                                   < 0.001
Ob/Gyn              657 (24)                                                3153 (43)                                                   < 0.001
Anesthesia          401 (26)                                                1697 (35)                                                   < 0.001


All analyses were performed with scripts written in R version 3.3.0 (2016-05-03). We considered two-sided p < 0.05 to be significant, after correcting with the Benjamini-Hochberg procedure, a multiple testing correction using false discovery rate estimation.28,29
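The Benjamini-Hochberg procedure is straightforward to sketch (shown in Python for illustration, although the authors worked in R):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Return a reject/accept flag per hypothesis, controlling the
    false discovery rate at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    # Find the largest rank k with p_(k) <= (k/m) * q ...
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k_max = rank
    # ... then reject the k_max smallest p values.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

# Only the smallest p value survives correction in this toy example:
print(benjamini_hochberg([0.001, 0.04, 0.03, 0.8]))
```

Note that a p value such as 0.04, nominally below 0.05, can fail the corrected threshold once many hypotheses are tested, which is the point of applying the procedure to per-word comparisons.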

RESULTS

Grade Distribution

Overall, 32% of evaluations among all students were associated with honors grades, 66% with passing grades, and the remainder with non-pass grades (Table 1). Women received more honors than men overall and were more likely to receive honors in pediatrics, obstetrics/gynecology, neurology, and psychiatry; men were more likely to receive honors in surgery and anesthesia (Table 2). A comparison of non-URM and URM students showed that evaluations of URM students were associated with fewer honors grades than evaluations of non-URM students. When stratifying by clerkship specialty, URM students received fewer honors grades across all specialties. These distributions were comparable at each school included in our dataset (data not shown).

Evaluation Length

We examined evaluation length by gender and URM status, stratified by grade (Appendix Table 1 online). We found that the distributions of evaluation length across groups were similar and, although statistically significant in some instances, did not represent meaningful differences.

Common Descriptors by Gender and URM Status

Among descriptors used in more than 1% of evaluations, we examined the highest-ranking words as measured by TF-IDF by gender and URM status (Table 3). Here, we found that the top ten ranked words were comparable across gender and URM status, suggesting that this measure does not provide sufficient granularity to assess meaningful differences in narrative evaluations.

Differential Usage of Descriptors by Statistical Significance

We found that among all evaluations, there were 37 words whose usage differed between men and women. Sixty-two percent (23/37) of these descriptors represented personal attributes, and of these, 57% (13/23) were used more in evaluations of women. In these evaluations of women, personal attribute descriptors such as "pleasant" were associated with pass grades, while "energetic," "cheerful," and "lovely" were neutral in their grade association. Additionally, personal attribute descriptors such as "wonderful" and "fabulous" that were used more frequently in evaluations of women were also associated with honors grades. In evaluations of men, personal attribute descriptors such as "respectful" or "considerate" were neutral in their association with grade, while "good" was seen more with pass grades, and "humble" was seen more with honors grades.

Table 3 Important and Unique Descriptors, Among Commonly Used Words

Men (TF-IDF)          Women (TF-IDF)         Non-URM (TF-IDF)      URM (TF-IDF)
Energetic (0.72)      Friendly (0.64)        Energetic (0.64)      Friendly (0.76)
Friendly (0.68)       Energetic (0.62)       Friendly (0.61)       Energetic (0.71)
Fine (0.55)           Dependable (0.58)      Fine (0.56)           Dependable (0.56)
Competent (0.53)      Fine (0.56)            Knowledgeable (0.53)  Fine (0.53)
Smart (0.53)          Knowledgeable (0.53)   Dependable (0.52)     Competent (0.53)
Knowledgeable (0.52)  Personable (0.51)      Competent (0.50)      Personable (0.52)
Technical (0.48)      Technical (0.49)       Smart (0.49)          Technical (0.51)
Dependable (0.46)     Competent (0.48)       Technical (0.48)      Knowledgeable (0.50)
Personable (0.45)     Attentive (0.48)       Personable (0.47)     Smart (0.49)
Attentive (0.44)      Smart (0.46)           Attentive (0.46)      Attentive (0.47)

Among commonly used words (defined as appearing in > 1% of evaluations), importance was measured by term frequency-inverse document frequency, a metric weighting term usage in an evaluation relative to usage in all evaluations; values closest to zero indicate that terms are used near-equally across all evaluations and are deemed less unique

Table 4 Categorization of Descriptors by Personal Attribute vs. Competency

a. Personal attribute descriptors: Active, Affable, Assertive, Bright, Caring, Cheerful, Clear, Considerate, Delightful, Earnest, Easy-going, Energetic, Enthusiastic, Fabulous, Humble, Intelligent, Interesting, Lovely, Mature, Modest, Motivated, Nice, Open, Pleasant, Poised, Polite, Relaxed, Reliable, Respectful, Sharp, Social, Sophisticated, Talented, Thoughtful, Warm, Wonderful

b. Competency-related descriptors: Advanced, Basic, Clinical, Compassionate, Complex, Comprehensive, Conscientious, Efficient, Empathic, Excellent, Impressive, Integral, Knowledgeable, Medical, Relevant, Scientific, Smart, Superior, Thorough

The descriptors found in Figures 1 and 2 were categorized by a survey of the co-authors into "personal attribute descriptors" versus "competency-related descriptors" versus neither category. Those that a majority felt belonged in either of the first two groups are presented above, with all remaining descriptors deemed to belong in neither group


Of the 37 descriptors that differed by gender, only 19% (7/37) were words that we assigned as competency-related descriptors, and of these, 57% (4/7) were used more in evaluations of women. The descriptors "efficient," "comprehensive," and "compassionate" were used more often in evaluations of women and were also associated with honors grades; evaluations of men described as "relevant" were also associated with honors grades.

The descriptors associated with significantly different usage between men and women are shown in Figure 1 by their distribution along the x-axis, and the association any given word has with honors or pass grades is indicated by its distribution along the y-axis. In addition, the words represented in Figure 1 (and Figure 2, described below) were found to be of high importance as measured by TF-IDF, with even higher values than the common words reported in Table 3 (data not shown).

Among all evaluations, 53 descriptors differed in their usage between evaluations of URM and non-URM students. Thirty percent (16/53) of descriptors represented personal attributes, and of these, 81% (13/16) were used more often to describe non-URM students. The descriptors "pleasant," "open," and "nice" were used to describe URM students and were associated with passing grades. Many personal attribute descriptors used to describe non-URM students, such as "enthusiastic," "sharp," or "bright," were neutral in their association with grade, while "mature" and "sophisticated" were more frequently associated with honors grades.

Of the 53 descriptors that differed by URM status, only 28% (15/53) were competency-related descriptors, and 100% of these (15/15) were used more in evaluations of non-URM students. The competency-related descriptors "outstanding," "impressive," and "advanced" were more frequently associated with honors, while "superior," "conscientious," and "integral" were neutral in their association with grade. Of note, all of the descriptors (either personal attribute or competency-related) that were used more frequently in evaluations of non-URM students had either neutral associations with grade or were associated with honors grades. The descriptors associated with significantly different usage between URM and non-URM students are shown in Figure 2 by their distribution along the x-axis, along with the association any given word had with honors or pass grades as indicated by its distribution along the y-axis.

DISCUSSION

Fig. 1 Descriptors with statistically significant differences in usage by gender. All words were assessed for differential usage between groups of interest, with statistical significance defined as p < 0.05. Location of a word point on the men-women axis indicates its preferential use in either gender. Distance from the y-axis also indicates increased difference from expected word distribution, noting however that all words shown are statistically significant in their usage by gender. Placement along the pass-honors axis indicates association of a given word with usage in either more honors- or pass-graded evaluations. Red-highlighted words are used more in evaluations of women, while blue-highlighted words are used more in evaluations of men. The categorization of these terms as "personal attribute" versus "competency-related" descriptors can be found in Table 4.

This novel application of natural language processing to what we believe is the largest sample of medical student evaluations analyzed to date reveals how students are described differently by gender and URM status. We found that across student evaluations, common, important words were used with similar frequency across gender and URM status. However, our analysis revealed significant differences in the usage of particular words between genders, as well as by URM status. While terms deemed important by the TF-IDF metric were reflective of personal attributes and competence, and were comparable among genders and URM status, the words with statistically significant differences in usage between these groups reflect personal attributes more so than competencies, as defined in Table 4. Although both competency-related and personal attribute descriptors were used differentially by gender and URM status, personal attribute descriptors dominate the words we found to be used differently between these groups, which we believe is an important signal of how student performance is assessed.

Our study is consistent with previous work examining differences in grading between genders and URM status groups in limited settings. Lee et al. showed that URM students receive lower grades across clerkships,30 and other work has shown conflicting effects of student demographics on clerkship grades, including student age.31 Whether other objective measures of academic performance, such as prior standardized test performance or undergraduate GPAs, contribute to clerkship evaluation grades has also been debated.30,32

However, previous work has been limited in scope, both with respect to clerkship specialties and sample sizes. The breadth of our data allows for identification of infrequent instances of differential descriptors that are concerning when considered in the context of the entire population of medical students.

In prior studies of narrative evaluations, investigators examined differential usage of a pre-determined set of words. Women have been shown to be more likely than men to be associated with words like "compassion," "enthusiasm," and "sensitivity," while other studies have shown that the presence of "standout" words, such as "exceptional" or "outstanding," predicted evaluations of men but not of women.8 Additional research suggests that similar patterns extend beyond the realm of student evaluations.5,33 A strength of natural language processing is that we did not have to pre-specify words that might differ; instead, we were able to extract any differing words without introducing additional analytic bias.

Fig. 2 Descriptors with statistically significant differences in usage by URM status. All words were assessed for differential usage between groups of interest, with statistical significance defined as p < 0.05. Location of a word point on the non-URM-URM axis indicates its preferential use by URM status. Distance from the y-axis also indicates increased difference from expected word distribution, noting however that all words shown are statistically significant in their usage by URM status. Placement along the pass-honors axis indicates association of a given word with usage in either more honors- or pass-graded evaluations. Green-highlighted words are used more in evaluations of URM students, while purple-highlighted words are used more in evaluations of non-URM students. The categorization of these terms as "personal attribute" versus "competency-related" descriptors can be found in Table 4.

Despite the intent of clerkship assessment to address competencies by observing behaviors, the differences we found among URM and non-URM assessments were more reflective of perceived personal attributes and traits. In prior work, Ross et al. found that Black residency applicants were more likely to be described as "competent," whereas White applicants more frequently received standout and ability descriptors, like "exceptional."34 Examining our findings of the variation in descriptors used for URM and non-URM students in the context of the literature is pertinent considering the discrimination and disparities faced by racial and ethnic minorities as trainees and healthcare providers. Disparities in clerkship grades,30 membership in honors societies,35 and promotion36,37 are well documented. Research on performance assessments suggests that small differences in assessment can result in larger differences in grades and awards received, a phenomenon referred to as the "amplification cascade." Teherani et al. illustrate the presence of this phenomenon among URM and non-URM students in undergraduate medical education.38 The amplification cascade holds major implications for residency selection and ultimately career progression that can disproportionately affect students from underrepresented groups.

Our study has limitations. First, although our sample size is large, we analyzed evaluations from two medical schools. Second, baseline measures of academic performance or subsequent markers of career success were unavailable, which limited our ability to extrapolate the effect of these differences in text beyond the grade received in the clerkship. Third, due to data limitations, we were unable to link evaluations of individual students across clerkships to assess patterns in grading behaviors and biases, although all students rotating in core clerkships in the study years were included in the dataset. Fourth, we were unable to assess any interaction between evaluator demographics and narrative language differences, as seen in other studies.8 Fifth, we did not have access to Dean's Letters, also known as the Medical Student Performance Evaluation (MSPE), which are a compilation of comments from individual clerkships and are the direct link between student evaluations and residency applications. Finally, we did not examine whether narrative differences were more pronounced among members of both groups, such as women who also identify as URM.

Despite efforts to standardize medical student evaluations, the differences in narrative language suggest directions for improvement of medical education assessment. At a minimum, our findings raise questions about the inclusion of verbatim commentary from these assessments in MSPE letters used in residency applications, as is the accepted national standard.1 Similarly, our work demonstrates that the competency-based evaluation framework37 ostensibly in use for evaluating medical students remains incompletely implemented. Finally, behavioral science research has uncovered best practices for reducing bias in evaluations, including comparative evaluations,39 structured criteria,40,41 task demonstrations, blinded evaluations,42 and evaluator training featuring de-biasing techniques.43 In the future, it may be possible for language processing tools to provide real-time, data-driven feedback to evaluators to address unconscious bias. Perhaps it is time to rethink narrative clerkship evaluations to better serve all students.

Acknowledgements: The authors would like to thank Roy Cherian, Cassidy Clarity, Gato Gourley, Bonnie Hellevig, Mark Lovett, Kate Radcliffe, and Alvin Rajkomar.

Corresponding Author: Urmimala Sarkar, MD, MPH; Health Informatics, UCSF Health, University of California, San Francisco, San Francisco, CA, USA (e-mail: [email protected]).

Funding Information: Dr. Sarkar is supported by the National Cancer Institute (K24CA212294).

Compliance with Ethical Standards:

Conflict of Interest: The authors declare that they do not have a conflict of interest.

Publisher's Note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

REFERENCES

1. Association of American Medical Colleges. Recommendations for Revising the Medical Student Performance Evaluation (MSPE). May 2017. https://www.aamc.org/download/470400/data/mspe-recommendations.pdf. Accessed December 11, 2018.

2. Biernat M, Tocci MJ, Williams JC. The Language of Performance Evaluations: Gender-Based Shifts in Content and Consistency of Judgment. Social Psychological and Personality Science. 2012;3(2):186–192. https://doi.org/10.1177/1948550611415693

3. Manning CD, Raghavan P, Schütze H. Introduction to Information Retrieval. Cambridge University Press; 2009.

4. Corrice A. Unconscious Bias in Faculty and Leadership Recruitment: A Literature Review. Association of American Medical Colleges. 2009;9(2).

5. Trix F, Psenka C. Exploring the Color of Glass: Letters of Recommendation for Female and Male Medical Faculty. Discourse & Society. 2003;14(2):191–220. https://doi.org/10.1177/0957926503014002277

6. Axelson RD, Solow CM, Ferguson KJ, Cohen MB. Assessing Implicit Gender Bias in Medical Student Performance Evaluations. Eval Health Prof. 2010;33(3):365–385. https://doi.org/10.1177/0163278710375097

7. Galvin SL, Parlier AB, Martino E, Scott KR, Buys E. Gender Bias in Nurse Evaluations of Residents in Obstetrics and Gynecology. Obstet Gynecol. 2015;126 Suppl 4:7S–12S. https://doi.org/10.1097/AOG.0000000000001044

8. Isaac C, Chertoff J, Lee B, Carnes M. Do students' and authors' genders affect evaluations? A linguistic analysis of Medical Student Performance Evaluations. Acad Med. 2011;86(1):59–66. https://doi.org/10.1097/ACM.0b013e318200561d

9. Schmader T, Whitehead J, Wysocki VH. A Linguistic Comparison of Letters of Recommendation for Male and Female Chemistry and Biochemistry Job Applicants. Sex Roles. 2007;57(7–8):509–514. https://doi.org/10.1007/s11199-007-9291-4

10. Magua W, Zhu X, Bhattacharya A, et al. Are Female Applicants Disadvantaged in National Institutes of Health Peer Review? Combining Algorithmic Text Mining and Qualitative Methods to Detect Evaluative Differences in R01 Reviewers' Critiques. Journal of Women's Health. 26(5):560–570. https://doi.org/10.1089/jwh.2016.6021

11. Kaatz A, Magua W, Zimmerman DR, Carnes M. A quantitative linguistic analysis of National Institutes of Health R01 application critiques from investigators at one institution. Acad Med. 2015;90(1):69–75. https://doi.org/10.1097/ACM.0000000000000442

12. Moss-Racusin CA, Dovidio JF, Brescoll VL, Graham MJ, Handelsman J. Science faculty's subtle gender biases favor male students. PNAS. 2012;109(41):16474–16479. https://doi.org/10.1073/pnas.1211286109

13. Wilson KY. An analysis of bias in supervisor narrative comments in performance appraisal. Human Relations. 63(12):1903–1933. https://doi.org/10.1177/0018726710369396


14. Bourdieu P. Distinction: A Social Critique of the Judgement of Taste. Cambridge, MA: Harvard University Press; 1984.

15. Spivak GC. The Rani of Sirmur: An Essay in Reading the Archives. History and Theory. 1985;24(3):247–272.

16. De Beauvoir S. The Second Sex. New York: Knopf; 1952.

17. Weis L. Identity formation and the process of "othering": Unraveling sexual threads. Educational Foundations. 1995;9(1):17–33.

18. Ahmad WIU. Making Black people sick: 'Race', ideology and health research. In: 'Race' and Health in Contemporary Britain. Philadelphia: Open University Press; 1993:12–33.

19. Association of American Medical Colleges. Diversity in the Physician Workforce: Facts & Figures 2014. 2014. http://aamcdiversityfactsandfigures.org/. Accessed December 11, 2018.

20. Holmboe E, Edgar L, Hamstra S. The Milestone Guidebook. Accreditation Council for Graduate Medical Education; 2016.

21. US Census Bureau. Frequently Occurring Surnames from the 1990 Census. https://www.census.gov/topics/population/genealogy/data/1990_census.html. Published September 15, 2014. Accessed December 11, 2018.

22. US Census Bureau. Frequently Occurring Surnames from the Census 2000. https://www.census.gov/topics/population/genealogy/data/2000_surnames.html. Published September 15, 2014. Accessed December 11, 2018.

23. Social Security Administration. Beyond the Top 1000 Names. Popular Baby Names. https://www.ssa.gov/OACT/babynames/limits.html. Published 2017. Accessed December 11, 2018.

24. Presta A, Severyn A, Golding A, et al. SyntaxNet: Neural Models of Syntax. tensorflow; 2018. https://github.com/tensorflow/models. Accessed December 11, 2018.

25. Andor D, Alberti C, Weiss D, et al. Globally Normalized Transition-Based Neural Networks; 2016. http://arxiv.org/abs/1603.06042. Accessed December 11, 2018.

26. Vijila SF, Nirmala K. Quantification of Portrayal Concepts using tf-idf Weighting. IJIST. 2013;3(5). https://doi.org/10.5121/ijist.2013.3501

27. Dertat A. How to Implement a Search Engine Part 3: Ranking tf-idf. 2011. http://www.ardendertat.com/2011/07/17/how-to-implement-a-search-engine-part-3-ranking-tf-idf/. Accessed December 11, 2018.

28. Storey JD, Tibshirani R. Statistical significance for genomewide studies. PNAS. 2003;100(16):9440–9445. https://doi.org/10.1073/pnas.1530509100

29. Noble WS. How does multiple testing correction work? Nat Biotechnol. 2009;27(12):1135–1137. https://doi.org/10.1038/nbt1209-1135

30. Lee KB, Vaishnavi SN, Lau SKM, Andriole DA, Jeffe DB. "Making the grade:" noncognitive predictors of medical students' clinical clerkship grades. J Natl Med Assoc. 2007;99(10):1138–1150.

31. Colbert C, McNeal T, Lezama M, et al. Factors associated with performance in an internal medicine clerkship. Proc (Bayl Univ Med Cent). 2017;30(1):38–40.

32. Ogunyemi D, De Taylor-Harris S. NBME obstetrics and gynecology clerkship final examination scores: predictive value of standardized tests and demographic factors. J Reprod Med. 2004;49(14):978–982.

33. Mueller AS, Jenkins TM, Osborne M, Dayal A, O'Connor DM, Arora VM. Gender Differences in Attending Physicians' Feedback to Residents: A Qualitative Analysis. Journal of Graduate Medical Education. 2017;9(5):577–585. https://doi.org/10.4300/JGME-D-17-00126.1

34. Ross DA, Boatright D, Nunez-Smith M, Jordan A, Chekroud A, Moore EZ. Differences in words used to describe racial and gender groups in Medical Student Performance Evaluations. PLoS ONE. 2017;12(8):e0181659. https://doi.org/10.1371/journal.pone.0181659

35. Boatright D, Ross D, O'Connor P, Moore E, Nunez-Smith M. Racial Disparities in Medical Student Membership in the Alpha Omega Alpha Honor Society. JAMA Intern Med. 2017;177(5):659–665. https://doi.org/10.1001/jamainternmed.2016.9623

36. Jena AB, Khullar D, Ho O, Olenski AR, Blumenthal DM. Sex Differences in Academic Rank in US Medical Schools in 2014. JAMA. 2015;314(11):1149–1158. https://doi.org/10.1001/jama.2015.10680

37. Nunez-Smith M, Ciarleglio MM, Sandoval-Schaefer T, et al. Institutional Variation in the Promotion of Racial/Ethnic Minority Faculty at US Medical Schools. Am J Public Health. 2012;102(5):852–858. https://doi.org/10.2105/AJPH.2011.300552

38. Teherani A, Hauer KE, Fernandez A, King TE Jr, Lucey C. How Small Differences in Assessed Clinical Performance Amplify to Large Differences in Grades and Awards: A Cascade With Serious Consequences for Students Underrepresented in Medicine. Acad Med. 2018;93(9):1286–1292.

39. Bohnet I, van Geen A, Bazerman M. When Performance Trumps Gender Bias: Joint vs. Separate Evaluation. Management Science. 2015;62:1225–1234. https://doi.org/10.1287/mnsc.2015.2186

40. Reskin BF, McBrier DB. Why Not Ascription? Organizations' Employment of Male and Female Managers. American Sociological Review. 2000;65(2):210–233. https://doi.org/10.2307/2657438

41. Gawande A. The Checklist Manifesto: How to Get Things Right. New York: Picador; 2011.

42. Goldin C, Rouse C. Orchestrating Impartiality: The Impact of "Blind" Auditions on Female Musicians. American Economic Review. 2000;90(4):715–741. https://doi.org/10.1257/aer.90.4.715

43. Babcock L, Loewenstein G. Explaining bargaining impasse: the role of self-serving biases. Journal of Economic Perspectives. 1997;11(1):109–126. https://doi.org/10.1257/jep.11.1.109


Let's Face It: We Are Biased, and It Should Not Be That Way

Paul A. Hemmer, MD, MPH1 and Reena Karani, MD, MHPE2

1Department of Medicine, Uniformed Services University of the Health Sciences, Bethesda, MD, USA; 2Departments of Medical Education, Medicine, and Geriatrics and Palliative Medicine, Institute for Medical Education, Icahn School of Medicine at Mount Sinai, New York, NY, USA.

J Gen Intern Med 34(5):649–51

DOI: 10.1007/s11606-019-04923-w

© Society of General Internal Medicine (This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply) 2019

Let's face it: we're all biased. For example, in studies of letters of evaluation and recommendation in academia, men are more often described with words considered "standout"1 (e.g., excellent, superb, exceptional) and/or related to "ability"2 (e.g., talent, intelligent, capacity, analytical), and with an emphasis on "research,"3 while women are more often described using words considered "grindstone"1,2 (e.g., hardworking, conscientious, diligent) and with an emphasis on "teaching";3 other studies have found that women are more commonly described with "communal" words (e.g., compassionate, caring, empathetic).3

In our educational system, descriptive evaluations of trainees have been a hallmark of assessment in the clinical setting. Gingerich et al.4 explored differing perspectives on the role of assessors (e.g., faculty) as trainable, fallible (biased), or meaningfully idiosyncratic. In the discussion of fallibility, the authors note that humans are not (and cannot be) passive observers but activate cognitive processes to help remember and make sense of their observations, although these cognitive processes are the sources of biases.

Two articles in this issue of JGIM address the issue of bias (gender and race) in the assessment of medical students and residents. Rojek and colleagues5 conducted an observational study of over 87,000 core clerkship evaluations of medical students at two geographically different medical schools. They employed natural language processing to identify differences in the text of evaluations of women compared to men and of underrepresented in medicine (URIM) compared with non-URIM students. Their analysis demonstrated significant differences in narrative language associated with gender and URIM status, even among students receiving the same grade. Female students and URIM students were significantly more likely to be described using personal attributes as opposed to competency-based behaviors. Specifically, for words that differed between men and women, 62% represented personal attributes and were found more commonly for women, while 19% represented competency-related behaviors and were more commonly applied to men. For words that differed between URIM and non-URIM students, comments about personal attributes were more common for URIM students, while competency-based comments were more frequent in evaluations of non-URIM students. The study also shed light on the ten most commonly used words, which did not differ between the groups, providing insight that these words may not convey meaningful information about a student. The findings highlight the presence of bias within the descriptive evaluation of medical students, raise questions about the validity of these assessments in the evaluation process, including the Medical Student Performance Evaluation (MSPE), and serve as a call to action for concerted efforts to address and work to eliminate bias in all forms across the learning environment.

Klein et al.6 performed a systematic review of the literature from 1998 to 2018 to examine the presence and influence of gender bias on resident assessment. Five of the nine studies uncovered evidence of gender bias, with women receiving lower ratings in domains of performance and differences in qualitative comments. In the largest study,7 these differences in ratings were equivalent to adding several months of training for women to "catch up" to their male counterparts. In the study8 of the qualitative comments made by faculty in these emergency medicine (EM) programs, the "ideal" trainee had more stereotypically masculine traits; there was discordance in the feedback provided to struggling trainees (men received more consistent feedback while women received more discordant feedback, especially with regard to autonomy and assertiveness); and comments about a tension between autonomy and openness to feedback were found only in narratives of women. In other included studies, qualitative comments women received from faculty more often focused on communal or warmth-based descriptors and less often on agentic (related to agency) or competency-related descriptors. The authors of this review concluded that gender bias poses a potential threat to the integrity of resident assessment, and that there is a need to seek and understand the source(s) of the bias (e.g., in the assessment, the learner, the faculty), as well as the longer-term consequences.

Since conducting their review, two articles have addressed gender bias in GME. In an observational study of one academic internal medicine training program,9 women scored higher in two domains (medical interviewing, and interpersonal and communication skills), although the absolute differences were small (effect sizes of approximately 0.12–0.15). A second qualitative study10 of the narrative comments on resident trainees across nine surgical specialties at one institution found that men were consistently described by more positive comments than women. The authors noted that in their study, "…women were often described as having a potential to succeed while men were simply expected to succeed."10 In addition, some phrases were written only about women, notably in the area of disposition and humanism (e.g., "always smiling"). Finally, while there were no differences in the relative use of communal, grindstone, and ability words, men were more likely to receive "standout" words, with the word "leader" used three times more for men than for women. The differences in the use of standout words, feedback to women about assertiveness, and speaking with confidence are also experienced by medical students.11

Published online April 16, 2019

Similarly, there are differences in how different racial groups among medical trainees are described. In a study of MSPEs12 for 6,000 applicants from 134 medical schools applying to 16 residency programs at one institution, White students were more likely to be described using "standout" or "ability" words, while Black students were more likely to be described as "competent." In another single-institution study representing 4,655 medical students from 123 US medical schools, Black students were less likely to be members of the Alpha Omega Alpha Medical Honor Society than White students, even after controlling for other measures of achievement.13 In a qualitative study of 27 underrepresented (Black, Hispanic, Native American) residents in GME training,14 themes emerged around microaggressions and bias, being tasked to be ambassadors, and challenges of personal and professional identity formation. The practice of routinely citing race (about which we typically don't ask patients) in introductory statements15 ("This is a 65-year-old African-American woman….") reflects the difficulty of shifting views of race from a long-held and perpetuated biologic perspective to that of a social construct.16 With such deeply engrained specious beliefs, it is not surprising, but very concerning, to see these influences in the descriptions of trainees.

While it would be naïve of us to believe we will eliminate all bias in assessment programs, we must commit ourselves to action. We must address the gender and racial bias evident in the descriptive evaluations faculty provide about trainees, since fairness and equity in the process of assessment (fairness to trainees, faculty, and society) are essential for the integrity of our learning environment. What does this mean for our educational processes, programs, and leaders, including those responsible for assessment? The answer: Let's face it. And we have to change it.

Morgan et al.17 outlined a series of steps for addressing gender-based inequity, which are equally applicable to other forms of bias. The papers in this issue help with the first step, which is to be explicit and name the systematic nature of the bias. In leading and participating in educational programs, understanding and highlighting the words people use in their assessments are part of making this more explicit, as is empowering trainees to come forward when they read these words about themselves. Implementing educational interventions that focus on dialogue among participants is another means of making the underlying bias explicit.18 Other steps the authors suggest include advancing efforts to promote equity, incorporating explicit evidence-based training, and increasing transparency on issues such as academic promotion.

We would not recommend eliminating or marginalizing descriptive evaluation of trainees, as that would not address the fundamental underlying issue of bias (e.g., gender, race). We should be guiding faculty on what to say in their comments. We can illustrate the words that are helpful in explaining their observations (including the ten words shown in the Rojek5 study that might be of limited help). But we should not mislead ourselves into thinking that a new form or a new online training program will solve the issue. Forms help communicate goals, but dialogue and conversation lead to understanding. If decisions are to be made by someone other than the author about what comments to include or not include in a final narrative (course, clerkship, rotation, MSPE), we need to ask: Who will make the decision about what to keep, delete, or change? How will the faculty member be provided feedback? How do we know the person editing is not influenced by their own bias(es)? Do we have a responsibility to the readers of the narrative to inform them if changes have been made? Should personal attributes be eliminated from comments, or could they be desirable to help more fully describe an individual? Do we risk swinging the pendulum too far, yielding descriptions that become too formulaic and impersonal?

Collectively, we have to help one another to say what matters. Our best chance of doing this is to find and take the time to talk with one another. Ultimately, it is in our daily dialogue and through naming the problem that we will mitigate bias rather than perpetuate it.

Corresponding Author: Paul A. Hemmer, MD, MPH; Department of Medicine, Uniformed Services University of the Health Sciences, Bethesda, MD, USA (e-mail: [email protected]).

Compliance with Ethical Standards:

Conflict of Interest: The authors do not have any conflicts of interest to declare.

Disclaimer: The opinions and assertions expressed herein are those of the authors and do not necessarily reflect the official policy or position of the Uniformed Services University or the Department of Defense.

REFERENCES

1. Trix F, Psenka C. Exploring the Color of Glass: Letters of Recommendation for Female and Male Medical Faculty. Discourse & Society. 2003;14(2):191–220.

2. Schmader T, Whitehead J, Wysocki VH. A Linguistic Comparison of Letters of Recommendation for Male and Female Chemistry and Biochemistry Job Applicants. Sex Roles. 2007;57(7–8):509–514.

3. Axelson RD, Solow CM, Ferguson KJ, Cohen MB. Assessing implicit gender bias in medical student performance evaluations. Evaluation & the Health Professions. 2010;33(3):365–385.

Hemmer and Karani: Let's Face It: We Are Biased, and It Should Not Be That Way. JGIM.


4. Gingerich A, Kogan J, Yeates P, Govaerts M, Holmboe E. Seeing the 'black box' differently: assessor cognition from three research perspectives. Med Educ. 2014;48(11):1055–1068. https://doi.org/10.1111/medu.12546.

5. Rojek AE, Khanna R, Yim JWL, Gardner R, Lisker S, Hauer KE, Lucey C, Sarkar U. Differences in Narrative Language in Evaluations of Medical Students by Gender and Under-represented Minority Status. J Gen Intern Med. https://doi.org/10.1007/s11606-019-04889-9.

6. Klein R, Julian KA, Snyder ED, Koch J, Ufere NN, Volerman A, Vandenberg AE, Schaeffer S, Palamara K. Gender Bias in Resident Assessment in Graduate Medical Education: Review of the Literature. J Gen Intern Med. https://doi.org/10.1007/s11606-019-04884-0.

7. Dayal A, O'Connor DM, Qadri U, Arora VM. Comparison of Male vs Female Resident Milestone Evaluations by Faculty During Emergency Medicine Residency Training. JAMA Intern Med. 2017;177(5):651–657.

8. Mueller AS, Jenkins TM, Osborne M, Dayal A, O'Connor DM, Arora VM. Gender Differences in Attending Physicians' Feedback to Residents: A Qualitative Analysis. J Grad Med Educ. 2017;9(5):577–585.

9. Sulistio MS, Khera A, Squiers K, Sanghavi M, Ayers CR, Weng W, Kazi S, de Lemos J, Johnson DH, Kirk L. Effects of gender in resident evaluations and certifying examination pass rates. BMC Med Educ. 2019;19(1):10. https://doi.org/10.1186/s12909-018-1440-7.

10. Gerull KM, Loe M, Seiler K, McAllister J, Salles A. Assessing gender bias in qualitative evaluations of surgical residents. Am J Surg. 2019;217(2):306–313.

11. Treadway N. A Student Reflection on Doctoring With Confidence: Mind the Gap. Ann Intern Med. 2018;169(8):564–565. https://doi.org/10.7326/M18-0865.

12. Ross DA, Boatright D, Nunez-Smith M, Jordan A, Chekroud A, Moore EZ. Differences in words used to describe racial and gender groups in Medical Student Performance Evaluations. PLoS One. 2017;12(8):e0181659. https://doi.org/10.1371/journal.pone.0181659.

13. Boatright D, Ross D, O'Connor P, Moore E, Nunez-Smith M. Racial Disparities in Medical Student Membership in the Alpha Omega Alpha Honor Society. JAMA Intern Med. 2017;177(5):659–665. https://doi.org/10.1001/jamainternmed.2016.9623.

14. Osseo-Asare A, Balasuriya L, Huot SJ, Keene D, Berg D, Nunez-Smith M, Genao I, Latimore D, Boatright D. Minority Resident Physicians' Views on the Role of Race/Ethnicity in Their Training Experiences in the Workplace. JAMA Netw Open. 2018;1(5):e182723. https://doi.org/10.1001/jamanetworkopen.2018.2723.

15. Finucane TE. Mention of a Patient's "Race" in Clinical Presentations. Virtual Mentor. 2014;16(6):423–427. https://doi.org/10.1001/virtualmentor.2014.16.6.ecas1-1406.

16. Jones CP. Levels of racism: a theoretic framework and a gardener's tale. Am J Public Health. 2000;90(8):1212–1215.

17. Morgan AU, Chaiyachati KH, Weissman GE, Liao JM. Eliminating Gender-Based Bias in Academic Medicine: More Than Naming the "Elephant in the Room". J Gen Intern Med. 2018;33(6):966–968. https://doi.org/10.1007/s11606-018-4411-0.

18. Bakke K, Sidhar K, Kumagai AK. Exploring Matters of Race through Dialogue in the University of Michigan Medical School's Longitudinal Case Studies Program. Virtual Mentor. 2014;16(6):442–449. https://doi.org/10.1001/virtualmentor.2014.16.6.medu1-1406.

Publisher's Note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
