Comparison of a traditional meta-analysis versus an individual patient data (IPD) meta-analysis to assess the diagnostic accuracy of the Patient Health Questionnaire-9 (PHQ-9)
Brooke Levis
Department of Epidemiology, Biostatistics and Occupational Health
McGill University, Montreal
Submitted November 2014
A thesis submitted to McGill University in partial fulfillment of the requirements of the degree
of: Master of Science – Epidemiology (Thesis)
© Brooke Levis 2014
Table of Contents
Title page
Table of contents
Abstract
Résumé
Acknowledgements
1. Introduction
2. Literature review
2.1. Depression
2.2. Screening
2.2.1. What is screening?
2.2.2. When is screening appropriate?
2.3. Depression screening
2.3.1. What is depression screening?
2.3.2. Controversy about screening
2.4. Diagnostic Accuracy
2.4.1. Diagnostic accuracy of depression screening tools
2.4.2. Limitations in existing evidence
2.4.2.1. Spectrum bias problem
2.4.2.2. Small sample sizes, data-driven cutoffs and selective reporting
2.5. Traditional meta-analyses
2.6. Individual patient data (IPD) meta-analysis
2.6.1. General approach
2.6.2. Advantages compared to traditional meta-analyses
3. Methods
3.1. Measure
3.1.1. Patient Health Questionnaire-9 (PHQ-9)
3.2. Data collection
3.2.1. Source
3.2.2. Inclusion criteria
3.2.3. Author contact
3.2.4. Ethical approval
3.2.5. Transfer of data
3.3. Data preparation
3.3.1. Extraction
3.3.2. Cleaning
3.4. Statistical analyses
4. Results
4.1. Datasets
4.2. Comparison of traditional versus individual patient data (IPD) meta-analysis
4.3. Tables and figures
4.3.1. Table 1. Characteristics of included studies
4.3.2. Table 2. Comparison of meta-analysis results across the two meta-analytic methods
4.3.3. Table 3. Discrepancies in sensitivity and specificity across cutoffs
4.3.4. Figure 1. ROC curves produced by IPD meta-analysis using all data vs. traditional meta-analysis using published data only
5. Discussion
5.1. Main Finding
5.2. Patterns in selective outcome reporting
5.3. Clinical significance and implications
5.4. Implications for research
5.5. Limitations
6. Summary and final conclusion
6.1. Summary
6.2. Conclusion
7. References
8. Appendix
8.1. Patient Health Questionnaire-9 (PHQ-9)
Abstract
Background: Depression accounts for more years lived with disability than any other medical
condition. Major depressive disorder (MDD) may be present in 10-20% of patients in medical
settings. Effective interventions to reduce the burden of depression exist, but most patients with
depression do not receive adequate care. Screening for depression has been recommended to
improve access to depression care. However, studies that have examined the diagnostic accuracy
of depression screening tools typically have used data-driven, exploratory methods to select
optimal cutoffs. Most often, these studies report results from a small range of cutoff points
around whichever cutoff score is most accurate in that given study. When data from these
published studies are combined in meta-analyses, estimates of accuracy for different cutoff
points are often based on data from different studies, rather than having data from all studies for
each possible cutoff point. As a result, traditional meta-analyses may generate exaggerated
estimates of accuracy (i.e., sensitivity and specificity). Individual patient data (IPD) meta-
analyses can address this problem by synthesizing data from all studies for each cutoff.
Objective: To assess the degree to which selective reporting of results from well-performing
cutoff thresholds may bias accuracy estimates in meta-analyses of depression screening tools. To
do this, I examined results from studies of the Patient Health Questionnaire-9 (PHQ-9), a
frequently used depression screening tool, comparing results from a traditional meta-analysis of
published accuracy data to results from an IPD meta-analysis using original patient data from the
same studies.
Methods: Authors of studies included in a recently published meta-analysis on the PHQ-9 were
invited to contribute patient-level data. For each dataset, we extracted the PHQ-9 scores and MDD
diagnoses for each patient. Two sets of statistical analyses were performed: (1) a traditional
meta-analysis where, for each cutoff between 7 and 15, we included data from the studies that
reported accuracy results for the cutoff in the original publication; and (2) an IPD meta-analysis
where, for each cutoff between 7 and 15, we included data from all studies.
Results: We obtained data from 13 of 16 eligible datasets that were included in the original
meta-analysis. Of the 13 studies, 11 (83% of patients) published accuracy results for the
recommended cutoff score of 10 in the original report; accuracy results using traditional meta-
analysis (sensitivity = 0.85, specificity = 0.88) were similar to those using IPD meta-analysis
(sensitivity = 0.87, specificity = 0.88). For other cutoffs, the number of studies that published
accuracy results for the particular cutoff ranged from 3 to 6 (21-46% of all patients) and results
using the two meta-analytic methods were more discrepant. For cutoffs below the standard
cutoff of 10, the traditional meta-analysis tended to underestimate sensitivity, whereas for
cutoffs above 10 it tended to overestimate sensitivity. For instance, for a cutoff of 9, sensitivity was 0.78 in
traditional meta-analysis versus 0.89 using IPD, whereas for a cutoff of 11, sensitivity was 0.92
in traditional meta-analysis versus 0.83 using IPD. For all cutoffs, specificity was similar across
the two meta-analytic methods.
Conclusions: Traditional meta-analyses may exaggerate the diagnostic accuracy of depression
screening tools, especially for cutoffs that are not standard or recommended. IPD meta-analysis
provides a mechanism to obtain unbiased estimates of accuracy.
Résumé
Contexte: La dépression provoque plus d’années vécues avec incapacité que toute autre
condition médicale. Un trouble dépressif majeur peut être présent chez 10-20% des patients en
milieu médical. Des interventions efficaces pour réduire le fardeau de la dépression existent,
mais la majorité des patients souffrant de dépression ne reçoivent pas de soins adéquats. Le
dépistage de la dépression a été recommandé pour améliorer l'accès aux soins de la dépression.
Cependant, les études qui ont examiné la précision du diagnostic des outils de dépistage ont
généralement utilisé des méthodes exploratoires axées sur les données pour sélectionner les
seuils optimaux. Souvent, ces études présentent les résultats pour un petit nombre de seuils
autour de celui qui est le plus précis dans l'étude donnée. Lorsque les données de ces études
publiées sont combinées dans les méta-analyses, les estimations de précision pour différents
seuils sont souvent basées sur des données provenant de différentes études, plutôt que d'utiliser
des données de toutes les études pour chaque seuil possible. En conséquence, les méta-analyses
traditionnelles peuvent générer des estimations de précision exagérées. Des méta-analyses sur
données individuelles peuvent remédier à ce problème en synthétisant les données de toutes les
études pour chaque seuil.
Objectif: Évaluer à quel point la publication sélective des résultats de seuils performants peut
biaiser les estimations de la précision du diagnostic dans les méta-analyses des outils de
dépistage de la dépression. Pour ce faire, j’ai examiné les résultats des études du questionnaire
sur la santé du patient-9 (PHQ-9), un outil de dépistage de la dépression fréquemment utilisé, en
comparant les résultats d’une méta-analyse traditionnelle des données de précision publiées aux
résultats d’une méta-analyse sur données individuelles en utilisant les données originales des
patients des mêmes études.
Méthodes: Les auteurs des études incluses dans une méta-analyse récemment publiée sur le
PHQ-9 ont été invités à fournir des données au niveau du patient. Pour chaque ensemble de
données, nous avons extrait les scores du PHQ et le diagnostic de trouble dépressif majeur pour
chaque patient. Deux séries d'analyses statistiques ont été réalisées: (1) une méta-analyse
traditionnelle où, pour chaque seuil entre 7 et 15, nous avons inclus les données des études qui
ont publié des résultats de précision pour ce seuil dans la publication originale; et (2) une méta-
analyse sur données individuelles où, pour chaque seuil entre 7 et 15, nous avons inclus les
données de toutes les études.
Résultats: Nous avons obtenu les données de 13 des 16 ensembles de données admissibles de la
méta-analyse originale. Pour le seuil recommandé de 10, 11 des 13 études (83% des patients) ont
publié les résultats de précision du diagnostic dans le rapport initial; les résultats de la précision
utilisant une méta-analyse traditionnelle (sensibilité = 0,85, spécificité = 0,88) étaient semblables
à ceux utilisant une méta-analyse sur données individuelles (sensibilité = 0,87, spécificité =
0,88). Pour les autres seuils, le nombre d'études ayant publié des résultats de précision pour le
seuil particulier variait de 3 à 6 (21-46% des patients) et les résultats en utilisant les deux
méthodes de méta-analyse étaient plus discordants. Pour les seuils inférieurs au seuil recommandé
de 10, la méta-analyse traditionnelle avait tendance à sous-estimer la sensibilité, tandis que
pour les seuils supérieurs à 10 elle avait tendance à la surestimer. Pour tous les seuils, la
spécificité était similaire dans les deux méthodes de méta-analyse.
Conclusions: Les méta-analyses traditionnelles peuvent exagérer la précision du diagnostic des
outils de dépistage de la dépression, en particulier pour les seuils qui ne sont pas standard ou
recommandés. Les méta-analyses sur données individuelles fournissent un mécanisme pour
obtenir des estimations de précision non biaisées.
Acknowledgements
I would like to acknowledge the tremendous help, support, and guidance that my
supervisor, Dr. Brett Thombs, has provided me with over the past 5 years, and especially during
my Master's training the past 2 years. Amongst so many other things, you have made me an
extremely critical thinker and have taught me the importance of transparency. You are a fabulous
mentor who shows great interest in all of your trainees. I truly appreciate all of the time and
effort you have put into helping me become an independent epidemiologist, and I am thankful
for all of the opportunities you have provided me with to help me develop my career.
I would also like to acknowledge my co-supervisor, Dr. Andrea Benedetti. Thank you for
all of your help in developing my analysis plan and interpreting the results, and for all of your
support and feedback in preparing my manuscript and thesis for submission.
I would like to express my gratitude to McGill University and to the Department of
Epidemiology, Biostatistics and Occupational Health for providing me with a first-class
education. Thank you as well to the staff at the Student Affairs Office for answering all my
questions and for ensuring I met all deadlines.
I would also like to thank the external reviewer for their time and effort in evaluating my
thesis. I appreciate the time you have taken away from your schedule to review my thesis and
share your expertise.
Thank you as well to all of the investigators who provided me with their patient-level
data from around the globe. This study would not have been possible without the rich data that
you supplied.
I would like to give many thanks to my family and friends for their support, as well as all
my colleagues at the Behavioural Health Research Group. A special thank you to my brother and
colleague Alex; you were extremely helpful in R programming and debugging.
Last, but definitely not least, my most heartfelt thanks go out to my friend, colleague and
roommate, Dr. Linda Kwakkenbos, for her help and support both in and out of the lab. I am
eternally grateful for all of the time you took to answer my many questions, and to proofread just
about anything important, as well as for all the coffee, lunches, and late night work sessions. I
could not have done this without you.
1. Introduction
Depression accounts for more years lived with disability than any other medical condition
(Lopez et al., 2006; Mathers et al., 2006; Moussavi et al., 2007; Whiteford et al., 2013). In
medical settings, major depressive disorder (MDD) is present in 5-10% of primary care patients,
including 10-20% of patients with chronic diseases (Evans et al., 2005). There are effective
interventions available to reduce the burden of depression, but most patients with depression do
not receive adequate care (Duhoux et al., 2009; Duhoux et al., 2013). Routine depression
screening has been recommended to improve access to depression care, but is controversial
(Thombs et al., 2012).
Depression screening refers to the use of depression screening tools to identify patients
who may have depression, but who are not seeking treatment for symptoms and whose
depression is not otherwise recognized by their physicians, so that they can be further assessed
and, if appropriate, treated (Raffle & Gray, 2007; UK National Screening Committee, 2000).
For depression screening to be effective, screening tools must be able to accurately distinguish
between patients with and without MDD. There is concern, however, that the diagnostic
accuracy of commonly-used depression screening tools is poorly understood and that existing
evidence on depression screening may overstate what would occur in actual clinical practice
(Thombs et al., 2011; Thombs et al., 2012).
One concern is that selective reporting of results from cutoff thresholds that perform well
in a given study, but not from cutoff thresholds that perform poorly, may be common and may
further inflate estimates of the diagnostic accuracy of depression screening tools. Studies that
have examined the diagnostic accuracy of depression screening tools typically use data-driven,
exploratory methods to select optimal cutoffs, and tend to report results only from a range of
cutoff points around whatever cutoff score worked best. When data from these published studies
are combined in traditional meta-analyses, which depend on sample-level results available in
published studies or unpublished reports, estimates of accuracy for different cutoff points are
often based on data from different studies, rather than having data from all studies for each
possible cutoff point. The problem with this is illustrated in a recent meta-analysis of the
diagnostic accuracy of the Patient Health Questionnaire-9 (PHQ-9) (Manea et al., 2012), an
instrument commonly used to screen for MDD in medical settings. In that meta-analysis, for
each possible cutoff score on the PHQ-9, the meta-analysis synthesized sample-level data from
all studies that had published data on sensitivity and specificity for that cutoff score. The
limitations of this approach are highlighted by the finding that estimated sensitivity actually
tended to increase as cutoff scores increased, a mathematically impossible result with complete
data: as the cutoff score rises, fewer patients screen positive, so a lower proportion of true
cases scores at or above the threshold.
Individual patient data (IPD) meta-analysis (Riley et al., 2010), which synthesizes actual
line-by-line patient data from primary studies rather than only published summary results, is one
way to address the problem of selective reporting of results from only well-performing cutoffs.
This is because in IPD meta-analyses, data can be synthesized for all possible cutoff thresholds
for all included studies. When implemented effectively, IPD meta-analyses are considered by the
Cochrane Collaboration to be the ‘gold standard’ of evidence synthesis (Stewart et al., 2011).
No studies have systematically evaluated how and to what degree the selective reporting
of data from some thresholds, but not others, may bias accuracy estimates of depression
screening tools. Thus, the objective of this study was to systematically assess the manner and
degree to which selective reporting of results from well-performing cutoff thresholds may bias
accuracy estimates in meta-analyses of depression screening tools. To do this, we compared
results produced by conducting a traditional meta-analysis of published accuracy data to results
produced by conducting an IPD meta-analysis using original patient data from the same studies,
using data from the recent meta-analysis on the diagnostic accuracy of the PHQ-9 (Manea et al.,
2012).
2. Literature Review
2.1. Depression
Depression accounts for more years lived with disability than any other medical condition
(Lopez et al., 2006; Mathers et al., 2006; Moussavi et al., 2007; Whiteford et al., 2013). In
medical settings, major depressive disorder (MDD) is present in 5-10% of primary care patients,
including 10-20% of patients with chronic diseases (Evans et al., 2005). There are effective
interventions available to reduce the burden of depression, but most patients with depression do
not receive adequate care (Duhoux et al., 2009; Duhoux et al., 2013). Screening for depression
has been recommended to address depression in medical settings.
2.2. Screening
2.2.1. What is screening?
Screening is “the presumptive identification of unrecognised disease or defect by the
application of tests, examinations, or other procedures which can be applied rapidly. Screening
tests sort out apparently well persons who probably have a disease from those who probably do
not. A screening test is not intended to be diagnostic. Persons with positive or suspicious
findings must be referred to their physicians for diagnosis and necessary treatment”
(Commission on Chronic Illness, 1957). Essentially, the purpose of screening is to detect
previously unrecognized cases so that they can be referred for diagnosis and treatment.
2.2.2. When is screening appropriate?
According to the World Health Organization (WHO), screening is appropriate when there
is an important health problem that is prevalent in the population and whose presence would not
likely be detected without screening. The screening tools used must perform well, and there must
be effective interventions available to benefit those with the condition. Finally, there needs to be
evidence from randomized controlled trials demonstrating that the benefits of screening
outweigh the harms (Wilson & Jungner, 1968).
2.3. Depression screening
2.3.1. What is depression screening?
Depression screening refers to the use of depression screening tools to identify patients
who may have depression, but who are not seeking treatment for symptoms and whose
depression is not otherwise recognized by their physicians, so that they can be further assessed
and, if appropriate, treated (Raffle & Gray, 2007; UK National Screening Committee, 2000).
Depression screening involves systematically assessing for depression among all individuals in a
given risk group using a standardized test. It does not refer to probing for depression on an
individual level or to using depression symptom questionnaires to monitor treatment and relapse.
For depression screening to be effective, screening tools must be able to accurately distinguish
between patients with and without MDD. Accurate assessment of the performance of depression
screening tools, however, requires a large number of patients with and without MDD in order to
accurately estimate sensitivity (percent of patients with MDD correctly identified as likely
depressed) and specificity (percent of patients without MDD correctly identified as likely not
depressed).
2.3.2. Controversy about screening
There is no universally accepted recommendation for depression screening in medical
settings; different countries and groups have made varying and contradicting recommendations
over the years. The first major recommendation came in 2002, when the United States Preventive
Services Task Force (USPSTF) recommended routine depression screening in primary care settings
where enhanced collaborative care is available, but not in the absence of such programs
(Pignone et al., 2002). In 2009, the USPSTF updated their evidence review
and, based on evidence from 9 RCTs, recommended screening adults for depression in primary
care settings when staff-assisted depression management programs are available (Grade B
recommendation) (U.S. Preventive Services Task Force, 2009). In 2010, the UK National
Institute for Health and Care Excellence (NICE) put out a guideline on depression management
in primary care stating that there is no evidence that depression screening benefits patients.
Rather than routine screening, they recommended that primary care physicians be alert to
possible depression during encounters with patients (National Collaborating Center for Mental
Health, 2010). In 2013, the Canadian Task Force on Preventive Health Care (CTFPHC) similarly
recommended that physicians be alert, but not routinely screen, for depression in primary care
(Joffres et al., 2013). This recommendation actually reversed a 2005 recommendation to screen
for depression in the context of integrated staff-assisted depression management systems
(MacMillan et al., 2005). A major concern raised by the CTFPHC was that published reports of
the diagnostic accuracy of depression screening tools appear to overstate diagnostic accuracy
compared to what would occur in clinical practice.
In addition to the varying recommendations, a recent systematic review attempted to
evaluate whether there is evidence from randomized controlled trials (RCTs) that depression
screening benefits patients in primary care, using an explicit definition of screening (Thombs et
al., 2014). To be included in the review, a trial needed to compare depression outcomes between
screened and non-screened patients and meet the following 3 criteria: (1) randomize patients
prior to screening, (2) exclude patients who had already been diagnosed or who were already
being treated, and (3) provide the same depression treatment services to patients identified as
depressed by the screener and to patients identified without it. Based on
these criteria, no screening trials were identified.
2.4. Diagnostic Accuracy
2.4.1. Diagnostic accuracy of depression screening tools
In studies on the diagnostic accuracy of depression screening tools, patient scores from
self-report depression symptom questionnaires are compared to diagnostic status (MDD versus
no MDD) based on a validated diagnostic interview that was designed to reflect Diagnostic and
Statistical Manual of Mental Disorders (DSM) or International Classification of Disease (ICD)
criteria for MDD. To be effective, depression screening tools must be able to accurately
distinguish between patients with and without MDD. Accurate assessment of the performance of
depression screening tools requires a large number of patients with and without MDD in order to
accurately estimate sensitivity and specificity. Sensitivity is the probability that patients with
MDD will be correctly identified as likely depressed by the screener, while specificity is the
probability that patients without MDD will be correctly identified as likely not depressed by the
screener (Altman & Bland, 1994). Sensitivity and specificity are generally regarded as intrinsic
characteristics of a test and are independent of disease prevalence (Brenner & Gefeller, 1997; Li
& Fine, 2011).
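As a minimal illustration of these two definitions, the following Python sketch computes sensitivity and specificity from a 2x2 cross-tabulation of screening results against interview diagnoses. The counts are invented for illustration and are not drawn from any study cited here.

```python
# Hypothetical 2x2 counts: screening result cross-tabulated against the
# diagnostic interview (invented numbers, purely for illustration).
true_pos, false_neg = 44, 6     # patients WITH MDD: screened positive / negative
true_neg, false_pos = 170, 30   # patients WITHOUT MDD: screened negative / positive

sensitivity = true_pos / (true_pos + false_neg)  # P(screen positive | MDD)
specificity = true_neg / (true_neg + false_pos)  # P(screen negative | no MDD)

print(sensitivity, specificity)  # 0.88 0.85
```

Note that doubling only the non-MDD rows would change prevalence but leave both values unchanged, which is why sensitivity and specificity are treated as prevalence-independent characteristics of the test.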
Studies of diagnostic accuracy typically seek to identify an optimal cutoff score using
receiver operating characteristic (ROC) curve analysis, by which sensitivity and specificity
associated with all possible cutoff scores are calculated and plotted (Hanley & McNeil, 1982;
Swets, 1988). As the required cutoff score for a positive screen increases, sensitivity decreases
(fewer cases reach the necessary threshold), but specificity increases.
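The tradeoff can be seen by sweeping a cutoff over patient-level scores. This sketch uses invented PHQ-9-style scores (not data from any real study) and records one sensitivity and one specificity value per candidate cutoff, as ROC analysis does:

```python
# Invented PHQ-9-style scores, for illustration only.
cases    = [4, 8, 9, 10, 11, 12, 14, 15, 18, 20]  # patients with MDD
noncases = [0, 1, 2, 3, 4, 5, 6, 8, 9, 11]        # patients without MDD

sens_by_cutoff, spec_by_cutoff = [], []
for cutoff in range(7, 16):  # candidate cutoffs for a positive screen
    sens_by_cutoff.append(sum(s >= cutoff for s in cases) / len(cases))
    spec_by_cutoff.append(sum(s < cutoff for s in noncases) / len(noncases))

# With complete data, raising the cutoff can only lower (or hold) sensitivity
# and raise (or hold) specificity, because each step reclassifies some screens
# from positive to negative.
assert all(a >= b for a, b in zip(sens_by_cutoff, sens_by_cutoff[1:]))
assert all(a <= b for a, b in zip(spec_by_cutoff, spec_by_cutoff[1:]))
```

The two assertions hold for any complete dataset, which is what makes the rising-sensitivity pattern discussed later mathematically impossible when all data are used.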
2.4.2. Limitations in existing evidence
There is evidence suggesting that currently reported diagnostic accuracy estimates are
exaggerated due to problems such as spectrum bias and the selective reporting of
data-driven cutoffs.
2.4.2.1. Spectrum bias problem
Screening tools should detect non-recognized cases. Most existing studies of screening
accuracy, however, include large numbers of patients already being treated for depression. Those
patients were recognized as depressed without screening and would not be screened in clinical
practice, since screening is done to identify undetected cases. A recent review found that only
4% of 197 studies on the diagnostic accuracy of depression screening tools appropriately
excluded patients who were already diagnosed or already undergoing depression treatment
(Thombs et al., 2011). Inclusion of these patients in diagnostic accuracy studies would be
expected to inflate estimates of the number of new cases that would be identified through
screening and estimates of accuracy, since already recognized and treated patients are typically
more easily identified than those who would only be detected via screening.
2.4.2.2. Small sample sizes, data-driven cutoffs and selective reporting
Most existing primary studies are limited by small sample sizes and do not generate
precise estimates. As a result, different studies often generate substantially different ‘optimal’
cutoff scores for the same screening tool. For instance, one review examined 9 small studies that
used the Hospital Anxiety and Depression Scale (HADS) depression screener to detect
depression (Meijer et al., 2011). These studies identified optimal cutoff scores ranging from
5 to 11, a range far too wide to be useful in clinical practice.
Selective reporting of results from cutoff thresholds that perform well in a given study,
but not from cutoff thresholds that perform poorly, may be common and may inflate estimates of
the diagnostic accuracy of depression screening tools. Most researchers report accuracy results
for the cutoffs that performed best in the particular studies, and do not report the results for the
cutoffs that did not work well. Because of this, synthesizing results across possible cutoff scores
in traditional meta-analyses, which combine summary results from primary studies, can generate
biased estimates of accuracy. For instance, a recent meta-analysis on the diagnostic accuracy of
the Patient Health Questionnaire-9 (PHQ-9), which is one of the most commonly used
depression screeners, included 18 diagnostic accuracy studies, and, for each possible cutoff, the
authors analyzed all the data available from the publications (Manea et al., 2012). Taken
together, the reported results imply a mathematically impossible pattern: as the cutoff
increased, pooled sensitivity appeared to increase, which cannot occur with complete data,
since raising the cutoff classifies fewer patients as screen-positive. The problem was that,
because some studies only reported a small range of cutoffs around their best-performing
cutoff, the authors were only able to meta-analyze a portion of the 18 studies for each cutoff.
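The mechanism can be reproduced in a few lines. The sketch below uses two invented studies (hypothetical scores, not the Manea et al. data): each "publishes" only cutoffs near its own best-performing threshold, so per-cutoff pooling of published results shows sensitivity rising with the cutoff, while pooling the complete patient-level data shows the mandatory decline:

```python
def sensitivity(case_scores, cutoff):
    """Proportion of true MDD cases scoring at or above the cutoff."""
    return sum(s >= cutoff for s in case_scores) / len(case_scores)

# Invented PHQ-9 scores for true MDD cases in two hypothetical studies.
study_a = [6, 7, 8, 8, 9, 9, 9, 10, 10, 11]         # publishes cutoffs 8-10 only
study_b = [11, 12, 12, 12, 13, 13, 14, 14, 15, 15]  # publishes cutoffs 11-13 only

# Traditional meta-analysis: each cutoff pools only the studies reporting it.
trad_sens_9  = sensitivity(study_a, 9)   # only study A published cutoff 9
trad_sens_12 = sensitivity(study_b, 12)  # only study B published cutoff 12

# IPD meta-analysis: every cutoff uses all patients from both studies.
all_cases = study_a + study_b
ipd_sens_9  = sensitivity(all_cases, 9)
ipd_sens_12 = sensitivity(all_cases, 12)

print(trad_sens_9, trad_sens_12)  # 0.6 0.9  -> appears to RISE with the cutoff
print(ipd_sens_9, ipd_sens_12)    # 0.8 0.45 -> falls with complete data, as it must
```

A real traditional meta-analysis pools several weighted studies per cutoff rather than one, but the one-study "pool" here isolates the selection effect being described.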
Studies in areas outside of mental health have demonstrated that using data-driven cutoff
thresholds to estimate diagnostic accuracy generates overly optimistic estimates, particularly
when sample sizes are small (Ewald, 2006; Goodacre et al., 2005; Leeflang et al., 2008;
Thompson et al., 2006; Whiting et al., 2013). Generally, using the same sample to select an
optimal cutoff score and simultaneously estimate the diagnostic accuracy would be expected to
produce estimates that would not be replicated in actual practice (Rutjes et al., 2006; Whiting et
al., 2004). This suggests a high risk of bias when this is done in studies of depression screening
tools, especially since most existing primary studies have been limited by small sample sizes,
specifically by small numbers of cases of MDD.
2.5. Traditional meta-analyses
Traditional meta-analyses have been used to assess the diagnostic accuracy of depression
screening tools. Meta-analyses can overcome problems associated with small sample sizes in
primary studies by combining data across studies. This can be done without bias, however, only
if all relevant outcome data are reported in primary studies (e.g., accuracy data for all relevant
cutoff scores).
There are only a few examples of meta-analyses of depression screening tool accuracy
(Brennan et al., 2010; Gilbody et al., 2007; Hewitt et al., 2009; Manea et al., 2012; Meader et al.,
2014; Mitchell & Coyne, 2007; Mitchell, 2008; Mitchell et al., 2010; Mitchell et al., 2012;
Vodermaier & Millman, 2011; Wancata et al., 2006; Wittkampf et al., 2007). Existing meta-
analyses have handled the issue of selective threshold reporting in 3 different ways: (1) In some
meta-analyses, the authors have explicitly indicated that they have only included one cutoff from
each study. They have used either the cutoff the primary study authors indicated was optimal or a
cutoff that they deemed was most accurate using some alternative method when the primary
study authors did not clearly indicate an optimal cutoff. For instance, in a systematic review by
Wancata et al. (2006), when the primary study authors did not clearly recommend a particular
cutoff, the meta-analysis authors used the cutoff where sensitivity and specificity were closest
together. (2) In other meta-analyses, the authors have stated that they synthesized standard
cutoffs, but have substituted “optimal cutoffs” when standards were not reported in the original
studies. This substitution method may be indicated explicitly, or may not have been stated but
was reported to us by the authors of the meta-analysis. For instance, in a meta-analysis by
Meader et al. (2011), the authors attempted to synthesize accuracy data for the standard cutoff
score of 8, when available. However, they were only able to include results from this cutoff for
12 of 27 studies (44% of patients) included in the meta-analysis. Results for the standard cutoff
were not reported in the other included studies, presumably because the standard cutoff had not
performed well. For those studies, the authors used accuracy data from other, better performing
cutoffs instead and combined results quantitatively, even though the studies did not use the same
cutoff threshold. (3) In other meta-analyses, the authors have analyzed outcomes at each cutoff
separately and, for each cutoff, included the studies that reported results for that particular cutoff.
For instance, in the meta-analysis by Manea et al. (2012), the authors included 18 primary
studies in their analyses, and for each cutoff of the Patient Health Questionnaire-9 (PHQ-9), they
included the subset of studies that had reported results for the respective cutoff.
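To make the substitution approach, (2) above, concrete, the following sketch (hypothetical scores and cutoffs, not drawn from Meader et al.) pools the standard cutoff where a study reported it and substitutes each remaining study's "optimal" cutoff otherwise, then compares that mixed estimate against the estimate at the true standard cutoff, which only patient-level data can provide:

```python
def sensitivity(case_scores, cutoff):
    """Proportion of true MDD cases scoring at or above the cutoff."""
    return sum(s >= cutoff for s in case_scores) / len(case_scores)

# Invented case scores for three hypothetical studies; standard cutoff is 8.
studies = {
    "A": [5, 6, 7, 8, 8, 9, 10, 10, 11, 12],
    "B": [4, 5, 5, 6, 6, 7, 8, 9, 10, 11],
    "C": [3, 4, 5, 6, 6, 7, 7, 8, 8, 9],
}
# Suppose only study A published results at the standard cutoff of 8, while
# B and C published only their better-performing "optimal" cutoff of 6.
reported_cutoff = {"A": 8, "B": 6, "C": 6}

# Substitution pooling: average whatever cutoff each study reported.
mixed = sum(sensitivity(s, reported_cutoff[k]) for k, s in studies.items()) / 3

# True pooling at the standard cutoff, possible only with patient-level data.
standard = sum(sensitivity(s, 8) for s in studies.values()) / 3

print(mixed, standard)  # the mixed estimate overstates accuracy at cutoff 8
```

The unweighted average stands in for a proper pooled estimate; the point is only that mixing thresholds chosen for good performance inflates apparent accuracy at the standard cutoff.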
2.6. Individual patient data (IPD) meta-analysis
Individual patient data (IPD) meta-analysis (Riley et al., 2010), which synthesizes actual
line-by-line patient data from primary studies rather than only published summary results, is one
way to address the problem of selective reporting of results from only well-performing cutoffs.
This is because in IPD meta-analyses, data can be synthesized for all possible cutoff thresholds
for all included studies.
2.6.1. General approach
The general approach of an IPD meta-analysis does not differ from a traditional meta-
analysis in terms of defining a research question, establishing study inclusion and exclusion
criteria, identifying and screening studies, and analyzing data (Stewart et al., 2011). IPD meta-
analyses are resource intensive in that they require a substantial amount of time to identify and
obtain original data, clarify data-related issues with data providers, and generate a consistent data
format across studies (Ahmed et al., 2012; Riley et al., 2010; Stewart et al., 2011). The quality of
IPD meta-analyses depends on the ability to obtain primary data (Riley et al., 2010).
2.6.2. Advantages compared to traditional meta-analyses
Individual patient data (IPD) meta-analysis (Riley et al., 2010) has the potential to
address shortcomings in existing depression screening research. IPD meta-analyses synthesize
original patient data obtained from researchers responsible for the primary studies, thus allowing
for the analysis of data from all cutoffs for all studies. IPD meta-analyses have particular benefits
when there are limitations in published information or where subgroup analyses are needed, but
cannot be performed from study-level data available in original reports; this is often the case
with depression screening accuracy studies. Because of these advantages, IPD meta-analyses,
when they can be implemented effectively, are considered by the Cochrane Collaboration to be
the ‘gold standard’ of evidence synthesis (Stewart et al., 2011).
In the context of evaluating the diagnostic accuracy of depression screening tools,
assessing all relevant cutoff scores for all studies can eliminate biases that arise in traditional
meta-analyses, where either large numbers of datasets are excluded, presumably due to poor
accuracy at standard cutoffs, or some datasets are used to estimate diagnostic accuracy at some
cutoff scores while other datasets are used at others. In addition, since virtually
all primary studies collect data on current depression treatment (e.g., antidepressant use), IPD
meta-analyses can appropriately exclude already-treated patients. Furthermore, IPD meta-
analyses with large numbers of patients and large numbers of MDD cases can potentially
incorporate individual risk factors for depression (e.g., age, sex, medical comorbidity) and study
variables (e.g., study setting, risk of bias factors) that may influence accuracy and clinical
decision-making, but have not been included in traditional meta-analyses, as has been noted by
the CTFPHC (Joffres et al., 2013).
3. Methods
3.1. Measure
3.1.1. Patient Health Questionnaire-9 (PHQ-9)
The PHQ-9 (Kroenke & Spitzer, 2002) (see Appendix) is a 9-item measure of depressive
symptoms that is commonly used in medical populations (Gilbody et al., 2007; Wittkampf et al.,
2007). The maximum total score is 27, and higher scores represent increased severity of
depressive symptoms. The standard cutoff score to identify possible depression is 10 (Gilbody et
al., 2007; Kroenke et al., 2001; Wittkampf et al., 2007).
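As a concrete illustration of the scoring just described, the following minimal sketch sums the nine item ratings and applies the standard cutoff (function names are illustrative, not taken from any PHQ-9 software):

```python
# Illustrative sketch: scoring a PHQ-9 response. Each of the 9 items is
# rated 0-3, so totals range from 0 to 27; the standard cutoff of 10
# flags possible major depression.

def phq9_total(item_scores):
    """Sum the nine 0-3 item ratings into a total severity score."""
    if len(item_scores) != 9:
        raise ValueError("PHQ-9 requires exactly 9 item scores")
    if any(s not in (0, 1, 2, 3) for s in item_scores):
        raise ValueError("each item must be rated 0, 1, 2, or 3")
    return sum(item_scores)

def screens_positive(item_scores, cutoff=10):
    """Apply a cutoff: total >= cutoff counts as screen-positive."""
    return phq9_total(item_scores) >= cutoff

print(phq9_total([1, 2, 1, 0, 3, 1, 2, 0, 1]))       # 11
print(screens_positive([1, 2, 1, 0, 3, 1, 2, 0, 1])) # True
```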
3.2. Data collection
3.2.1. Source
This study was conducted using the set of studies included in the recent Manea et al.
meta-analysis of the PHQ-9 (Manea et al., 2012). We attempted to obtain the original patient
data from all studies included in the meta-analysis in order to re-analyze the data to compare
results produced by traditional meta-analysis versus results of an IPD meta-analysis.
In the meta-analysis by Manea et al., the authors searched electronic databases (Embase,
MEDLINE and PsycInfo) from 1999 to August 2010 for studies that evaluated and reported the
diagnostic accuracy of the PHQ-9 for diagnosing major depressive disorder. The authors
included studies that (1) defined MDD according to standard classification systems (e.g.,
International Classification of Diseases [ICD] or the Diagnostic and Statistical Manual of Mental
Disorders [DSM]), (2) used a standardized diagnostic interview (e.g., Mini International
Neuropsychiatric Interview [MINI], Structured Clinical Interview for DSM Disorders [SCID],
Composite International Diagnostic Interview [CIDI], Diagnostic Interview Schedule [DIS] or
Revised Clinical Interview Schedule [CIS-R]), and (3) provided sufficient data to calculate
contingency tables.
3.2.2. Inclusion criteria
We included studies that comprised unique patient samples and that published
diagnostic accuracy results for MDD for at least one PHQ-9 cutoff.
3.2.3. Author contact
We contacted the authors of the eligible studies and invited them to contribute de-
identified primary data for an individual patient data (IPD) meta-analysis. When we could not
reach an author, we approached co-authors or others who had worked with them recently.
3.2.4. Ethical approval
Per our approved ethics protocol, when an investigator agreed to contribute primary data,
ethical approval for the inclusion of the dataset was sought from the Research Ethics Committee
of Jewish General Hospital in Montreal. In order to obtain ethical approval, we provided the
Research Ethics Committee with the following documents: (1) a signed copy of a letter of
agreement for participation, (2) a copy of the ethics approval from the original study, (3) a copy
of the consent form used in the original study, and (4) a letter or e-mail from the contributing
author’s research ethics committee stating that IRB approval is not needed for this transfer or, if
necessary, research ethics committee approval of the transfer. In cases where documentation of
the original ethics approval and patient consent forms was not retrievable, ethics approval was
granted if there was other documentation that these documents existed (e.g., publications that
document ethics approval and patient consent).
3.2.5. Transfer of data
Once ethical approval for inclusion of a dataset was obtained, the original patient data
were sought. All data were required to be properly de-identified prior to transfer.
3.3. Data preparation
3.3.1. Extraction
For each included dataset, we extracted each patient's PHQ-9 score and MDD diagnosis,
along with any information pertaining to weighting. We also extracted study country,
setting/population, and the diagnostic standard used from the original publications.
3.3.2. Cleaning
We reviewed all original publications and compared diagnostic accuracy reported in the
original publications to the accuracy we calculated using the raw datasets. When data and
original publications were discrepant, we resolved discrepancies in consultation with the original
investigators. When 2x2 tables in the primary studies could not be reproduced using the data
provided, we corrected them based on the raw data and confirmed the corrections with the authors. For studies where
the original analyses included weights, we replicated the original weighting scheme (Fann et al.,
2005; Lamers et al., 2008; Wittkampf et al., 2009). For studies where the original analyses did
not include weights, but the sample selection method merited weighting, we constructed
appropriate weights (Azah et al., 2005; Yeung et al., 2008). For our analyses, we used the
primary data that was cleaned and verified in collaboration with the primary authors.
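The verification step described above can be sketched as follows. This is a hypothetical illustration, not the actual cleaning code used in the thesis, and the record structure (pairs of PHQ-9 total and MDD status) is assumed for the example:

```python
# Hypothetical sketch of the cleaning check: rebuild each study's 2x2
# diagnostic table from raw patient records and compare it with the
# table reported in the original publication.

def two_by_two(records, cutoff):
    """Cross-tabulate screen result (PHQ-9 >= cutoff) against MDD status.

    records: list of (phq9_total, has_mdd) pairs.
    Returns (TP, FP, FN, TN).
    """
    tp = fp = fn = tn = 0
    for phq9, mdd in records:
        positive = phq9 >= cutoff
        if positive and mdd:
            tp += 1
        elif positive and not mdd:
            fp += 1
        elif not positive and mdd:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def matches_publication(records, cutoff, published):
    """Flag datasets whose recomputed table differs from the published one."""
    return two_by_two(records, cutoff) == tuple(published)

# Toy example: 3 patients, cutoff 10.
data = [(12, True), (8, False), (11, False)]
print(two_by_two(data, 10))  # (1, 1, 0, 1)
```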
3.4. Statistical analyses
Two sets of statistical analyses were performed. First, we performed traditional meta-
analyses where for each PHQ-9 cutoff between 7 and 15 we included data from the studies that
included accuracy results for the respective cutoff in the original publication. For instance, if a
study published results for cutoffs 9 through 13, data from this study were included in the meta-analyses of cutoffs 9 through 13, but not in the meta-analyses of cutoffs 7, 8, 14, or 15.
Second, we performed IPD meta-analyses where for each PHQ-9 cutoff between 7 and 15, we
included data from all studies.
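The difference between the two study-selection rules can be sketched as follows (the studies and published cutoff ranges here are invented for illustration):

```python
# Sketch of how the two analysis sets differ. For each cutoff, the
# traditional meta-analysis keeps only studies whose original publication
# reported that cutoff; the IPD meta-analysis keeps every study, since
# accuracy can be recomputed from raw data at any cutoff.

published = {            # hypothetical: study -> cutoffs it published
    "A": range(9, 14),   # cutoffs 9-13
    "B": [10],           # the standard cutoff only
    "C": range(5, 11),   # cutoffs 5-10
}

def traditional_set(cutoff):
    """Studies usable at this cutoff from published results alone."""
    return sorted(s for s, cuts in published.items() if cutoff in cuts)

def ipd_set(cutoff):
    """Studies usable at this cutoff once raw data are in hand."""
    return sorted(published)

# e.g. traditional_set(13) -> ['A'], but ipd_set(13) -> ['A', 'B', 'C']
```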
For both sets of analyses, bivariate random-effects models were used to identify the
“optimal” cutoff score. Bivariate random-effects meta-analysis models were estimated via
adaptive Gauss-Hermite quadrature, as described by Riley et al. (2008), for each PHQ-9 cutoff
between 7 and 15. This approach models sensitivity and specificity simultaneously, accounting
both for the inherent correlation between them and for the precision of estimates within
studies. A random-effects model was used so that sensitivity and specificity were allowed to
vary across primary studies. This model provided us with overall pooled sensitivity and
specificity for each cutoff for the two sets of analyses. By combining information across a range
of cutoffs, we constructed a pooled ROC curve for each set of analyses to show the differences
in the results produced by the two methods of meta-analysis (i.e., traditional vs. IPD).
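To make the pooling step more concrete, the sketch below pools one proportion (e.g., sensitivity at a single cutoff) across studies. It is a deliberately simplified univariate DerSimonian-Laird model on the logit scale, not the bivariate Gauss-Hermite model of Riley et al. (2008) actually used here (which estimates sensitivity and specificity jointly); the counts are invented:

```python
import math

# Simplified stand-in for the pooling step: a univariate DerSimonian-Laird
# random-effects pool of logit proportions. NOT the bivariate model used
# in the thesis; counts below are invented for illustration.

def logit_pool(events, totals):
    """Random-effects pooled proportion (logit scale, DerSimonian-Laird)."""
    y, v = [], []
    for e, n in zip(events, totals):
        e, n = e + 0.5, n + 1.0            # continuity correction
        y.append(math.log(e / (n - e)))    # per-study logit
        v.append(1.0 / e + 1.0 / (n - e))  # within-study variance
    w = [1.0 / vi for vi in v]             # fixed-effect weights
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))  # heterogeneity Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)                 # between-study variance
    wr = [1.0 / (vi + tau2) for vi in v]   # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(wr, y)) / sum(wr)
    return 1.0 / (1.0 + math.exp(-pooled)) # back-transform to a proportion

# True positives / MDD cases at one cutoff across three toy studies:
print(round(logit_pool([18, 40, 9], [20, 50, 12]), 2))  # 0.8
```

Repeating such a pool of sensitivity and specificity at each cutoff, and plotting the pairs, yields a pooled ROC curve of the kind compared in the Results.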
As we were interested in the differences in results produced by including all cutoffs versus
including only published cutoffs, we used the cleaned raw data for both sets of analyses rather
than using the raw data for the IPD meta-analysis and published summary results for the
traditional meta-analysis. Similarly, since the objective of our study was to examine bias, and not
determine the true diagnostic accuracy of the screening tool, we analyzed the entire IPD dataset
as a whole, and consistent with the original meta-analysis, did not conduct any moderator
analyses. Furthermore, because the focus of this project was to examine the effects of selective
outcome reporting by comparing traditional meta-analysis with IPD meta-analysis, we did not
remove patients who were already diagnosed with depression and/or who were currently being
treated for depression, since we would have only been able to do this in the IPD meta-analysis.
4. Results
4.1. Datasets
The meta-analysis by Manea et al. included 18 original studies. Of these 18 studies, there
were 17 unique patient samples, 16 of which published diagnostic accuracy results for MDD for
at least one PHQ-9 cutoff and were thus eligible for the current study.
Authors were first invited to contribute data between May and July of 2012. Datasets
were obtained between August 2012 and September 2013, and all discrepancies were resolved in
consultation with the original investigators by March of 2014.
Of the 16 eligible datasets, 13 (80% of all eligible patients; 94% of all eligible MDD
cases) were successfully obtained, for a total sample size of 4589 patients. One of the missing
datasets (Adewuya et al., 2006) was reported by the study’s principal investigator to be lost in a
fire (personal communication, Abiodun Adewuya, October 16, 2012). Another missing dataset
(Kroenke et al., 2001) belonged to a researcher who is deceased. The third and final missing
dataset (Watnick et al., 2005) could not be provided by the principal investigator. Characteristics
of the included datasets can be found in Table 1.
4.2. Comparison of traditional versus individual patient data (IPD) meta-analysis
Of the 13 studies, 11 (83% of patients; 70% of MDD cases) published accuracy results
for the recommended PHQ-9 cutoff score of 10 in the original report; accuracy results using
traditional meta-analysis (sensitivity = 0.85, specificity = 0.88) were similar to those using IPD
meta-analysis (sensitivity = 0.87, specificity = 0.88).
For other cutoffs, the number of studies that published accuracy results for the particular
cutoff ranged from 3 to 6 (21-46% of patients; 14-53% of MDD cases) and results using the two
different meta-analytic methods were more discrepant. For instance, for a cutoff of 9, sensitivity
was 0.78 in traditional meta-analysis versus 0.89 using IPD, while for a cutoff of 11, sensitivity
was 0.92 in traditional meta-analysis versus 0.83 using IPD. A comparison of results across the
two meta-analytic methods can be found in Table 2; ROC curves produced by the two meta-
analytic methods are shown in Figure 1.
At PHQ-9 cutoff scores above 10, sensitivity was exaggerated in traditional meta-analysis
compared to IPD, while at cutoffs below 10, sensitivity was underestimated in traditional meta-
analysis compared to IPD. For all cutoffs, specificity was similar across the two meta-analytic
methods. Discrepancies in sensitivity and specificity across cutoffs in relation to the proportion
of data available are shown in Table 3.
4.3. Tables and figures
4.3.1. Table 1. Characteristics of included studies
Study, year | Country | Setting/Population | Diagnostic standard | Published cutoffs | N total analyzed | N MDD cases analyzed
Azah et al., 2005 | Malaysia | Family medicine clinic | CIDI (ICD-10) | 5 to 12 | 180 | 30
de Lima Osorio et al., 2009 | Brazil | Females in primary care | SCID (DSM-IV) | 10 to 21 | 177 | 60
Fann et al., 2005 | United States | Inpatients with head trauma | SCID (DSM-IV) | 10 and 12 | 135 | 45
Gilbody et al., 2007 | United Kingdom | Primary care | SCID (DSM-III-R) | 9 to 13 | 96 | 36
Gjerdingen et al., 2009 | United States | Mothers registering their newborns for well-child visits, medical or paediatric clinics | SCID (DSM-IV) | 10 | 438 | 20
Gräfe et al., 2004 | Germany | Psychosomatic patients and patients at walk-in clinics and family practices | SCID (DSM-IV) | 10 to 14 | 521 | 71
Lamers et al., 2008 | Netherlands | Elderly primary care patients with diabetes mellitus and chronic obstructive pulmonary disease | MINI (DSM-IV) | 6 to 8 | 611 | 277
Lotrakul et al., 2008 | Thailand | Primary care | MINI (DSM-IV) | 6 to 15 | 279 | 19
Stafford et al., 2007 | Australia | Hospital settings (coronary artery disease patients) | MINI (DSM-IV) | 5, 6, and 10 | 193 | 35
Thombs et al., 2008 | United States | Cardiology outpatients | C-DIS (DSM-IV) | 4 to 10 | 1024 | 224
Williams et al., 2005 | United States | Stroke patients | SCID (DSM-IV) | 10 | 316 | 106
Wittkampf et al., 2009 | Netherlands | Primary care | SCID (DSM-IV) | 10 and 15 | 435 | 77
Yeung et al., 2008 | United States | Chinese Americans in primary care | SCID (DSM-IV) | 15 | 184 | 37
4.3.2. Table 2. Comparison of meta-analysis results across the two meta-analytic methods
Columns, left to right: cutoff; then for published data (traditional MA): # of studies, # of patients, # of MDD cases, sensitivity (95% CI), specificity (95% CI); then the same five columns for all data (IPD MA).
7 4 2094 550 0.85 (0.70-0.94) 0.73 (0.62-0.81) 13 4589 1037 0.97 (0.91-0.99) 0.73 (0.67-0.78)
8 4 2094 550 0.79 (0.63-0.89) 0.78 (0.71-0.85) 13 4589 1037 0.93 (0.85-0.97) 0.78 (0.74-0.82)
9 4 1579 309 0.78 (0.56-0.90) 0.82 (0.75-0.88) 13 4589 1037 0.89 (0.79-0.95) 0.83 (0.80-0.86)
10 11 3794 723 0.85 (0.71-0.93) 0.88 (0.85-0.91) 13 4589 1037 0.87 (0.75-0.94) 0.88 (0.85-0.90)
11 5 1253 216 0.92 (0.58-0.99) 0.90 (0.81-0.95) 13 4589 1037 0.83 (0.68-0.92) 0.90 (0.88-0.92)
12 6 1388 261 0.82 (0.65-0.92) 0.92 (0.87-0.96) 13 4589 1037 0.77 (0.63-0.87) 0.92 (0.90-0.94)
13 4 1073 186 0.82 (0.75-0.87) 0.94 (0.84-0.98) 13 4589 1037 0.67 (0.56-0.77) 0.94 (0.92-0.95)
14 3 977 150 0.71 (0.57-0.83) 0.97 (0.87-0.99) 13 4589 1037 0.59 (0.48-0.70) 0.96 (0.94-0.97)
15 4 1075 193 0.61 (0.52-0.70) 0.98 (0.96-0.99) 13 4589 1037 0.52 (0.42-0.62) 0.97 (0.96-0.98)
4.3.3. Table 3. Discrepancies in sensitivity and specificity across cutoffs
Cutoff | % of patients with published cutoffs | % of cases with published cutoffs | Difference in sensitivity (traditional - IPD) | Difference in specificity (traditional - IPD)
7 46 53 -0.12 0
8 46 53 -0.14 0
9 34 30 -0.11 -0.01
10 83 70 -0.02 0
11 27 21 0.09 0
12 30 25 0.05 0
13 23 18 0.15 0
14 21 14 0.12 0.01
15 23 19 0.09 0.01
4.3.4. Figure 1. ROC curves produced by IPD meta-analysis using all data vs. traditional meta-analysis using published data only
Note: Numbers within ROC curves indicate each of the PHQ-9 cutoffs between 7 and 15
5. Discussion
5.1. Main finding
For the PHQ-9 screening tool, cutoff 10 is described as the standard cutoff for identifying
cases of major depression (Gilbody et al., 2007; Kroenke et al., 2001; Wittkampf et al., 2007).
Results for this cutoff were reported in almost all of the studies and accuracy results using
traditional meta-analysis were similar to those using IPD meta-analysis. Results for other cutoffs
were published more haphazardly, and tended to be included when they bordered the particular
study’s optimal cutoff. For the other cutoffs, even those close to 10, less than half of original
studies published accuracy results. While specificity results were similar across the two meta-
analytic methods, sensitivity results were quite discrepant. At cutoff scores above 10, sensitivity
was exaggerated in traditional meta-analysis compared to IPD, while at cutoff scores below 10,
sensitivity was underestimated in traditional meta-analysis compared to IPD. ROC curves using
IPD data were realistic, with specificity increasing and sensitivity decreasing as cutoff scores
increased. ROC curves using the traditional meta-analysis results, however, were implausible,
with sensitivity appearing to increase between cutoffs 9 and 11.
With respect to sensitivity fluctuating more than specificity, the number of MDD cases
decreases substantially as sample sizes decrease (i.e., going from all data to just published data);
thus, estimates of sensitivity can fluctuate greatly. There are substantially more non-cases than
cases, however, so even as sample sizes decrease, estimates of specificity remain stable.
Additionally, while PHQ total scores tend to be normally distributed among MDD cases, PHQ
scores among non-cases tend to be positively skewed, with most of them scoring well below any
of the cutoffs of interest. Thus, as cutoff scores change, most non-cases remain below the cutoff
of interest, while cases can fluctuate on either side of the threshold. Goodacre et al. (2005)
compared studies with data-driven cutoffs to studies where cutoffs were chosen a priori and
found a similar pattern: studies with data-driven cutoffs were more likely to report higher
values of sensitivity than studies with cutoffs chosen a priori; specificity, however, did not
differ significantly between the two groups of studies.
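The distributional argument above can be illustrated with a small simulation. All parameters here are invented for illustration, not taken from the thesis data; the point is only the qualitative pattern:

```python
import random

# Toy simulation of the distributional argument: cases' scores roughly
# normal around the cutoff region, non-cases' scores positively skewed and
# mostly far below it. Moving the cutoff then shifts sensitivity
# noticeably while specificity barely changes.

random.seed(1)
cases = [min(27, max(0, round(random.gauss(13, 4)))) for _ in range(200)]
noncases = [min(27, round(random.expovariate(0.5))) for _ in range(1800)]

def sens_spec(cutoff):
    """Sensitivity and specificity of 'score >= cutoff' in the toy sample."""
    sens = sum(s >= cutoff for s in cases) / len(cases)
    spec = sum(s < cutoff for s in noncases) / len(noncases)
    return round(sens, 2), round(spec, 2)

for c in (8, 10, 12):
    # sensitivity drops steadily across cutoffs; specificity barely moves
    print(c, sens_spec(c))
```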
5.2. Patterns in selective outcome reporting
Generally, the sum of sensitivity and specificity for a study’s optimal cutoff ranged
between 1.6 and 1.8, regardless of which particular cutoff was optimal. What mainly changed
across studies was the “optimal” cutoff rather than overall level of accuracy. Most studies
reported results for the standard cutoff of 10, but aside from that, patterns reflected one of the
following 3 categorizations: (1) some studies (Gjerdingen et al., 2009; Lotrakul et al., 2008;
Williams et al., 2005) published results for the standard cutoff only, or for a range of cutoffs
surrounding the standard cutoff (neutral cutoff reporting); (2) other studies (Azah et al., 2005;
Lamers et al., 2008; Stafford et al., 2007; Thombs et al., 2008) published results for low cutoffs
that tended to go from the optimal cutoff towards the standard (low cutoff reporting); and (3)
other studies (de Lima Osorio et al., 2009; Fann et al., 2005; Gilbody et al., 2007; Gräfe et al.,
2004; Wittkampf et al., 2009) published results for high cutoffs that tended to go from the
optimal cutoff towards the standard (high cutoff reporting). One of the studies (Yeung et al.,
2008) did not fit any of the above 3 categories; however, this was a study in which the original
report had incorrectly presented the data in its 2x2 diagnostic accuracy table.
The studies with the low cutoff reporting trend were studies where the PHQ-9 was
poorly sensitive, which resulted in low optimal cutoffs. Reporting tended to include results for
the optimal cutoff and higher cutoffs up until 10, all of which had lower sensitivity than the
optimal cutoff. In addition, each cutoff between the optimal and 10 had lower sensitivity than the
sensitivity for the respective cutoff computed using IPD meta-analysis. The studies with the high
cutoff reporting trend, on the other hand, were studies where the PHQ-9 was highly sensitive,
which resulted in high optimal cutoffs. Reporting tended to include results for the optimal cutoff
and lower cutoffs until 10, all of which had higher sensitivity than the optimal cutoff. In addition,
each cutoff between the optimal and 10 had higher sensitivity than the sensitivity for the
respective cutoff computed using IPD meta-analysis.
Studies publishing with the neutral reporting trend included results on both sides of the
optimal cutoff and would not be expected to introduce bias into meta-analysis. Studies
publishing with either the low or high cutoff reporting trend, however, could potentially
introduce bias in meta-analysis. This is because in each study, only results for a range of cutoffs
in the better performing half of the entire range of possible cutoffs are published, which may
explain why sensitivity curves can be flattened or even inverted when there is partial reporting.
5.3. Clinical significance and implications
Depression is a chronic and disabling condition that is the leading global cause of years
lived with disability (Lopez et al., 2006; Mathers et al., 2006; Moussavi et al., 2007;
Whiteford et al., 2013). Depression that is not adequately identified and treated is a robust
indicator of poor prognosis among patients with chronic medical comorbidity, of long-term
mental health problems in children and adolescents, and of poor child and family outcomes
among pregnant and postpartum women (Evans et al., 2005; Fergusson & Woodward, 2002;
Shaffer et al., 1996; Weissman et al., 1999; Whitaker et al., 1990; Whitley & Kirmayer, 2008;
Williams et al., 2009; Zelkowitz & Milet, 1996; Zelkowitz & Milet, 2001). Most people with
depression, however, do not receive adequate care (Duhoux et al., 2009; Duhoux et al., 2013).
The development and implementation of effective depression identification and management
programs is an urgent priority, both in Canada (Mental Health Commission, 2012) and
internationally (Ngo et al., 2013). Depression screening has been recommended as a solution, but
guidelines and recommendations are sometimes made without full consideration of evidence or
clinical practice realities (Sniderman & Furberg, 2009).
Studies that have examined the diagnostic accuracy of depression screening tools
typically use data-driven, exploratory methods to select optimal cutoffs, and tend to report results
only from a range of cutoff points around whichever cutoff score worked best. As a result,
traditional meta-analyses, which rely on published data, can grossly exaggerate estimates of
diagnostic accuracy, and will not necessarily allow the determination of the best cutoff for
screening. Exaggeration of screening accuracy may lead to misguided enthusiasm for screening
without any real evidence of benefit, and thus to overdiagnosis. When the potential efficacy of
screening is exaggerated, screening is more likely to be implemented, increasing the number of
patients identified and/or treated without any evidence of improved health.
5.4. Implications for research
Studies on the diagnostic accuracy of screening tests may report results from a single
cutoff or several different possible cutoffs. The STAndards for the Reporting of Diagnostic
accuracy studies (STARD) statement on reporting results from these kinds of studies does not
provide guidance on the range of cutoffs for which results should be reported (Bossuyt et al.,
2003). Researchers should routinely report diagnostic accuracy results for all cutoffs. This is
important because primary studies of diagnostic test accuracy are often conducted in relatively
small samples with small numbers of cases, and meta-analytic syntheses may be needed to
confidently assess test accuracy and to ascertain the most appropriate cutoff score for
determining positive test status. For screening tools that include ordinal cutoffs, as is the case for
depression screening tools, guidelines such as STARD should consider including the reporting of
results for all cutoffs as part of their requirements.
Because research is what leads to recommendations, it is imperative that analyses be
conducted in a way that limits the potential for bias. IPD meta-analysis provides another
mechanism to obtain less biased estimates of accuracy. The quality of IPD meta-analyses,
however, depends on the ability to obtain primary data (Riley et al., 2010).
5.5. Limitations
As we were interested in the differences in results produced by including all cutoffs
versus including only published cutoffs, we used the cleaned raw data for both sets of analyses
rather than using the raw data for the IPD meta-analysis and published summary results for the
traditional meta-analysis. Similarly, since the objective of our study was to examine bias, and not
determine the true diagnostic accuracy of the screening tool, we analyzed the entire IPD
dataset as a whole, and consistent with the original meta-analysis, did not conduct any moderator
analyses.
A downside of the meta-analytic approach used is that the pooled estimates at each cutoff
were highly correlated with each other. This had little impact on the current study, since
we were not seeking to determine the true diagnostic accuracy of the screening tool, but it will
need to be considered when conducting more conventional IPD meta-analyses.
We were unable to acquire data for 3 of the 16 eligible studies included in the original
meta-analysis; however, we did obtain a very high percentage of the data (80% of the eligible
patients and 94% of the eligible cases). Finally, this was only one example with a relatively small
number of included studies, and will need to be replicated in other comparisons.
6. Summary and Final Conclusion
6.1. Summary
When most of the data are published in the original studies, diagnostic accuracy results
using traditional meta-analytic methods and IPD methods are very similar. However, when not
all of the data are reported, there can be considerable discrepancies. For the standard PHQ-9 cutoff
score of 10, most studies published accuracy results in the original report and accuracy results
using traditional meta-analysis were similar to those using IPD meta-analysis. For each of the
other cutoffs, however, less than half of the studies published accuracy results for the particular
cutoff in the original reports and results using the two different meta-analytic methods were more
discrepant.
Some studies are more sensitive than others, meaning that across the spectrum of possible
cutoff scores, the percentage of truly depressed patients who score above each particular cutoff is
higher than average. On the other hand, some studies are not very sensitive, thus the cutoff to
identify a likely depression case must be moved lower than usual in order to catch the same
percentage of truly depressed patients as a more sensitive study. The more sensitive a study is,
the higher the optimal cutoff, while the less sensitive a study is, the lower the optimal cutoff.
Studies with a low cutoff reporting trend tend to be studies where low cutoffs are optimal,
whereas studies with the high cutoff reporting trend tend to be studies where high cutoffs are
optimal. In comparison to the true sensitivity values computed by combining results from all
studies, highly sensitive studies (i.e. high optimal cutoffs) tend to overestimate sensitivity at each
cutoff, while less sensitive studies (i.e. low optimal cutoffs) tend to underestimate sensitivity at
each cutoff.
In IPD meta-analysis, results for all cutoffs for all studies are included, thus the over-
estimation of sensitivities in some studies and the underestimation in other studies balance out
overall. In traditional meta-analyses, however, where only high performing cutoffs tend to be
reported, and only published results can be included in meta-analysis, sensitivity appears to be
underestimated for low cutoffs and overestimated for high cutoffs, giving rise to flattened or
even inverted ROC curves.
6.2. Conclusion
Current meta-analysis methods seem to exaggerate the diagnostic accuracy of depression
screening tools, especially for cutoffs that are not standard or recommended. Sensitivity is more
severely affected by the selective reporting of results than specificity, due to the smaller
proportion of MDD cases than non-cases in most samples and the fact that PHQ scores tend to be
normally distributed among cases, with most of them lying in the range of cutoffs of interest, but
positively skewed among non-cases, with most of them scoring well below any of the cutoffs of
interest. Sensitivity tends to be underestimated for cutoffs below the standard and overestimated
for cutoffs above the standard.
The degree of exaggeration of accuracy estimates depends on the degree of selective
reporting. IPD meta-analysis provides a mechanism to obtain realistic estimates of depression
screening tool accuracy, which currently appears to be substantially exaggerated. Researchers
should routinely report diagnostic accuracy results for all cutoffs. Guidelines such as STARD
should consider including the reporting of results for all cutoffs as part of their requirements.
7. References
Adewuya, A. O., Ola, B. A., & Afolabi, O. O. (2006). Validity of the Patient Health
Questionnaire (PHQ-9) as a screening tool for depression amongst Nigerian university
students. Journal of Affective Disorders, 96(1-2), 89-93.
Ahmed, I., Sutton, A. J., & Riley, R. D. (2012). Assessment of publication bias, selection bias,
and unavailable data in meta-analyses using individual participant data: A database survey.
BMJ (Clinical Research Ed.), 344, d7762.
Altman, D. G., & Bland, J. M. (1994). Diagnostic tests. 1: Sensitivity and specificity. BMJ
(Clinical Research Ed.), 308(6943), 1552.
Azah, N., Shah, M., Juwita, S., Bahri, S., Rushidi, W. M., & Jamil, M. (2005). Validation of the
Malay version brief Patient Health Questionnaire (PHQ-9) among adults attending family
medicine clinics. International Medical Journal, 12, 259-63.
Bossuyt, P. M., Reitsma, J. B., Bruns, D. E., Gatsonis, C. A., Glasziou, P. P., Irwig, L. M., et al.
(2003). Towards complete and accurate reporting of studies of diagnostic accuracy: The
STARD initiative. BMJ (Clinical Research Ed.), 326(7379), 41-4.
Brennan, C., Worrall-Davies, A., McMillan, D., Gilbody, S., & House, A. (2010). The Hospital
Anxiety and Depression Scale: A diagnostic meta-analysis of case-finding ability. Journal
of Psychosomatic Research, 69(4), 371-8.
Brenner, H., & Gefeller, O. (1997). Variation of sensitivity, specificity, likelihood ratios and
predictive values with disease prevalence. Statistics in Medicine, 16(9), 981-91.
Commission on Chronic Illness (1957). Chronic illness in the United States: Volume I.
Prevention of chronic illness. Cambridge, Mass: Harvard University Press. p.45
de Lima Osorio, F., Vilela Mendes, A., Crippa, J. A., & Loureiro, S. R. (2009). Study of the
discriminative validity of the PHQ-9 and PHQ-2 in a sample of Brazilian women in the
context of primary health care. Perspectives in Psychiatric Care, 45(3), 216-27.
Duhoux, A., Fournier, L., Gauvin, L., & Roberge, P. (2013). What is the association between
quality of treatment for depression and patient outcomes? A cohort study of adults
consulting in primary care. Journal of Affective Disorders, 151(1), 265-74.
Duhoux, A., Fournier, L., Nguyen, C. T., Roberge, P., & Beveridge, R. (2009). Guideline
concordance of treatment for depressive disorders in Canada. Social Psychiatry and
Psychiatric Epidemiology, 44(5), 385-92.
Evans, D. L., Charney, D. S., Lewis, L., Golden, R. N., Gorman, J. M., Krishnan, K. R., et al.
(2005). Mood disorders in the medically ill: Scientific review and recommendations.
Biological Psychiatry, 58(3), 175-89.
Ewald, B. (2006). Post hoc choice of cut points introduced bias to diagnostic research. Journal of
Clinical Epidemiology, 59(8), 798-801.
Fann, J. R., Bombardier, C. H., Dikmen, S., Esselman, P., Warms, C. A., Pelzer, E., et al. (2005).
Validity of the Patient Health Questionnaire-9 in assessing depression following traumatic
brain injury. The Journal of Head Trauma Rehabilitation, 20(6), 501-11.
Fergusson, D. M., & Woodward, L. J. (2002). Mental health, educational, and social role
outcomes of adolescents with depression. Archives of General Psychiatry, 59(3), 225-31.
Gilbody, S., Richards, D., Brealey, S., & Hewitt, C. (2007). Screening for depression in medical
settings with the Patient Health Questionnaire (PHQ): A diagnostic meta-analysis. Journal
of General Internal Medicine, 22(11), 1596-602.
Gjerdingen, D., Crow, S., McGovern, P., Miner, M., & Center, B. (2009). Postpartum depression
screening at well-child visits: Validity of a 2-question screen and the PHQ-9. Annals of
Family Medicine, 7(1), 63-70.
Goodacre, S., Sampson, F. C., Sutton, A. J., Mason, S., & Morris, F. (2005). Variation in the
diagnostic performance of D-dimer for suspected deep vein thrombosis. QJM, 98(7), 513-
27.
Gräfe, K., Zipfel, S., Herzog, W., et al. (2004). Screening for psychiatric disorders with the
Patient Health Questionnaire (PHQ). Results from the German validation study.
Diagnostica, 50, 171-81.
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver
operating characteristic (ROC) curve. Radiology, 143(1), 29-36.
Hewitt, C., Gilbody, S., Brealey, S., Paulden, M., Palmer, S., Mann, R., et al. (2009). Methods to
identify postnatal depression in primary care: An integrated evidence synthesis and value of
information analysis. Health Technology Assessment (Winchester, England), 13(36), 1-145,
147-230.
Joffres, M., Jaramillo, A., Dickinson, J., Lewin, G., Pottie, K., Shaw, E., et al. (2013).
Recommendations on screening for depression in adults. CMAJ, 185(9), 775-82.
Kroenke, K., & Spitzer, R. L. (2002). The PHQ-9: A new depression diagnostic and severity
measure. Psychiatric Annals, 32, 1-7.
Kroenke, K., Spitzer, R. L., & Williams, J. B. (2001). The PHQ-9: Validity of a brief depression
severity measure. Journal of General Internal Medicine, 16(9), 606-13.
Lamers, F., Jonkers, C. C., Bosma, H., Penninx, B. W., Knottnerus, J. A., & van Eijk, J. T.
(2008). Summed score of the Patient Health Questionnaire-9 was a reliable and valid
method for depression screening in chronically ill elderly patients. Journal of Clinical
Epidemiology, 61(7), 679-87.
Leeflang, M. M., Moons, K. G., Reitsma, J. B., & Zwinderman, A. H. (2008). Bias in sensitivity
and specificity caused by data-driven selection of optimal cutoff values: Mechanisms,
magnitude, and solutions. Clinical Chemistry, 54(4), 729-37.
Li, J., & Fine, J. P. (2011). Assessing the dependence of sensitivity and specificity on prevalence
in meta-analysis. Biostatistics (Oxford, England), 12(4), 710-22.
Lopez, A. D., Mathers, C. D., Ezzati, M., Jamison, D. T., & Murray, C. J. (2006). Global and
regional burden of disease and risk factors, 2001: Systematic analysis of population health
data. Lancet, 367(9524), 1747-57.
Lotrakul, M., Sumrithe, S., & Saipanish, R. (2008). Reliability and validity of the Thai version of
the PHQ-9. BMC Psychiatry, 8, 46.
MacMillan, H. L., Patterson, C. J., Wathen, C. N., Feightner, J. W., Bessette, P., Elford, R. W., et
al. (2005). Screening for depression in primary care: Recommendation statement from the
Canadian Task Force on Preventive Health Care. CMAJ, 172(1), 33-5.
Manea, L., Gilbody, S., & McMillan, D. (2012). Optimal cut-off score for diagnosing depression
with the Patient Health Questionnaire (PHQ-9): A meta-analysis. CMAJ, 184(3), E191-6.
Mathers, C. D., Lopez, A. D., & Murray, C. J. L. (2006). The burden of disease and mortality by
condition: Data, methods, and results for 2001. In A. D. Lopez, C. D. Mathers, M. Ezzati, D.
T. Jamison & C. J. L. Murray (Eds.), Global burden of disease and risk factors. Washington
(DC): The International Bank for Reconstruction and Development/The World Bank Group.
Meader, N., Mitchell, A. J., Chew-Graham, C., Goldberg, D., Rizzo, M., Bird, V., et al. (2011).
Case identification of depression in patients with chronic physical health problems: A
diagnostic accuracy meta-analysis of 113 studies. The British Journal of General Practice:
The Journal of the Royal College of General Practitioners, 61(593), e808-20.
Meader, N., Moe-Byrne, T., Llewellyn, A., & Mitchell, A. J. (2014). Screening for poststroke
major depression: A meta-analysis of diagnostic validity studies. Journal of Neurology,
Neurosurgery, and Psychiatry, 85(2), 198-206.
Meijer, A., Roseman, M., Milette, K., Coyne, J. C., Stefanek, M. E., Ziegelstein, R. C., et al.
(2011). Depression screening and patient outcomes in cancer: A systematic review. PLoS
ONE, 6(11), e27181.
Mental Health Commission of Canada. (2012). Changing directions, changing lives: The mental
health strategy for Canada. Calgary, AB.
Mitchell, A. J. (2008). Are one or two simple questions sufficient to detect depression in cancer
and palliative care? A Bayesian meta-analysis. British Journal of Cancer, 98(12), 1934-43.
Mitchell, A. J., & Coyne, J. C. (2007). Do ultra-short screening instruments accurately detect
depression in primary care? A pooled analysis and meta-analysis of 22 studies. The British
Journal of General Practice: The Journal of the Royal College of General Practitioners,
57(535), 144-51.
Mitchell, A. J., Meader, N., Davies, E., Clover, K., Carter, G. L., Loscalzo, M. J., et al. (2012).
Meta-analysis of screening and case finding tools for depression in cancer: Evidence based
recommendations for clinical practice on behalf of the Depression in Cancer Care consensus
group. Journal of Affective Disorders, 140(2), 149-60.
Mitchell, A. J., Meader, N., & Symonds, P. (2010). Diagnostic validity of the Hospital Anxiety
and Depression Scale (HADS) in cancer and palliative settings: A meta-analysis. Journal of
Affective Disorders, 126(3), 335-48.
Moussavi, S., Chatterji, S., Verdes, E., Tandon, A., Patel, V., & Ustun, B. (2007). Depression,
chronic diseases, and decrements in health: Results from the world health surveys. Lancet,
370(9590), 851-8.
National Collaborating Centre for Mental Health. (2010). The NICE guideline on the
management and treatment of depression in adults (updated edition). UK: National Institute
for Health and Clinical Excellence.
Ngo, V. K., Rubinstein, A., Ganju, V., Kanellis, P., Loza, N., Rabadan-Diehl, C., & Daar, A. S.
(2013). Grand challenges: Integrating mental health care into the non-communicable disease
agenda. PLOS Medicine, 10(5), e1001443.
Pignone, M. P., Gaynes, B. N., Rushton, J. L., Burchell, C. M., Orleans, C. T., Mulrow, C. D., &
Lohr, K. N. (2002). Screening for depression in adults: A summary of the evidence for the
U.S. Preventive Services Task Force. Annals of Internal Medicine, 136(10), 765-76.
Raffle, A., & Gray, M. (2007). Screening: Evidence and practice. UK: Oxford University Press.
Riley, R. D., Dodd, S. R., Craig, J. V., Thompson, J. R., & Williamson, P. R. (2008). Meta-
analysis of diagnostic test studies using individual patient data and aggregate data. Statistics
in Medicine, 27(29), 6111-36.
Riley, R. D., Lambert, P. C., & Abo-Zaid, G. (2010). Meta-analysis of individual participant
data: Rationale, conduct, and reporting. BMJ (Clinical Research Ed.), 340, c221.
Rutjes, A. W., Reitsma, J. B., Di Nisio, M., Smidt, N., van Rijn, J. C., & Bossuyt, P. M. (2006).
Evidence of bias and variation in diagnostic accuracy studies. CMAJ, 174(4), 469-76.
Shaffer, D., Gould, M. S., Fisher, P., Trautman, P., Moreau, D., Kleinman, M., & Flory, M.
(1996). Psychiatric diagnosis in child and adolescent suicide. Archives of General
Psychiatry, 53(4), 339-48.
Sniderman, A. D., & Furberg, C. D. (2009). Why guideline-making requires reform. JAMA,
301(4), 429-31.
Stafford, L., Berk, M., & Jackson, H. J. (2007). Validity of the Hospital Anxiety and Depression
Scale and Patient Health Questionnaire-9 to screen for depression in patients with coronary
artery disease. General Hospital Psychiatry, 29(5), 417-24.
Stewart, L. A., Tierney, J. F., & Clarke, M. (2011). Chapter 18: Reviews of individual patient
data. In J. P. T. Higgins, & S. Green (Eds.), Cochrane handbook for systematic reviews of
interventions (Version 5.1.0 ed.). Cochrane Collaboration.
Swets, J. A. (1988). Measuring the accuracy of diagnostic systems. Science (New York, N.Y.),
240(4857), 1285-93.
Thombs, B. D., Arthurs, E., El-Baalbaki, G., Meijer, A., Ziegelstein, R. C., & Steele, R. J.
(2011). Risk of bias from inclusion of patients who already have diagnosis of or are
undergoing treatment for depression in diagnostic accuracy studies of screening tools for
depression: Systematic review. BMJ (Clinical Research Ed.), 343, d4825.
Thombs, B. D., Coyne, J. C., Cuijpers, P., de Jonge, P., Gilbody, S., Ioannidis, J. P., et al. (2012).
Rethinking recommendations for screening for depression in primary care. CMAJ, 184(4),
413-8.
Thombs, B. D., Ziegelstein, R. C., Roseman, M., Kloda, L. A., & Ioannidis, J. P. (2014). There
are no randomized controlled trials that support the United States Preventive Services Task
Force guideline on screening for depression in primary care: A systematic review. BMC
Medicine, 12, 13.
Thombs, B. D., Ziegelstein, R. C., & Whooley, M. A. (2008). Optimizing detection of major
depression among patients with coronary artery disease using the Patient Health
Questionnaire: Data from the Heart and Soul Study. Journal of General Internal Medicine,
23(12), 2014-7.
Thompson, I. M., Chi, C., Ankerst, D. P., Goodman, P. J., Tangen, C. M., Lippman, S. M., et al.
(2006). Effect of finasteride on the sensitivity of PSA for detecting prostate cancer. Journal
of the National Cancer Institute, 98(16), 1128-33.
U.S. Preventive Services Task Force. (2009). Screening for depression in adults: U.S. Preventive
Services Task Force recommendation statement. Annals of Internal Medicine, 151(11), 784-
92.
UK National Screening Committee. (2000). Second report of the UK National Screening
Committee. Departments of Health for England, Scotland, Northern Ireland and Wales.
Vodermaier, A., & Millman, R. D. (2011). Accuracy of the Hospital Anxiety and Depression
Scale as a screening tool in cancer patients: A systematic review and meta-analysis.
Supportive Care in Cancer, 19(12), 1899-908.
Wancata, J., Alexandrowicz, R., Marquart, B., Weiss, M., & Friedrich, F. (2006). The criterion
validity of the Geriatric Depression Scale: A systematic review. Acta Psychiatrica
Scandinavica, 114(6), 398-410.
Watnick, S., Wang, P. L., Demadura, T., & Ganzini, L. (2005). Validation of 2 depression
screening tools in dialysis patients. American Journal of Kidney Diseases, 46(5), 919-24.
Weissman, M. M., Wolk, S., Goldstein, R. B., Moreau, D., Adams, P., Greenwald, S., et al.
(1999). Depressed adolescents grown up. JAMA, 281(18), 1707-13.
Whitaker, A., Johnson, J., Shaffer, D., Rapoport, J. L., Kalikow, K., Walsh, B. T., et al. (1990).
Uncommon troubles in young people: Prevalence estimates of selected psychiatric disorders
in a nonreferred adolescent population. Archives of General Psychiatry, 47(5), 487-96.
Whiteford, H. A., Degenhardt, L., Rehm, J., Baxter, A. J., Ferrari, A. J., Erskine, H. E., et al.
(2013). Global burden of disease attributable to mental and substance use disorders:
Findings from the Global Burden of Disease Study 2010. Lancet, 382(9904), 1575-86.
Whiting, P., Rutjes, A. W., Reitsma, J. B., Glas, A. S., Bossuyt, P. M., & Kleijnen, J. (2004).
Sources of variation and bias in studies of diagnostic accuracy: A systematic review. Annals
of Internal Medicine, 140(3), 189-202.
Whiting, P. F., Rutjes, A. W., Westwood, M. E., Mallett, S., & QUADAS-2 Steering Group.
(2013). A systematic review classifies sources of bias and variation in diagnostic test
accuracy studies. Journal of Clinical Epidemiology, 66(10), 1093-104.
Whitley, R., & Kirmayer, L. J. (2008). Perceived stigmatisation of young mothers: An
exploratory study of psychological and social experience. Social Science & Medicine
(1982), 66(2), 339-48.
Williams, L. S., Brizendine, E. J., Plue, L., Bakas, T., Tu, W., Hendrie, H., & Kroenke, K.
(2005). Performance of the PHQ-9 as a screening tool for depression after stroke. Stroke; a
Journal of Cerebral Circulation, 36(3), 635-8.
Williams, S. B., O'Connor, E. A., Eder, M., & Whitlock, E. P. (2009). Screening for child and
adolescent depression in primary care settings: A systematic evidence review for the US
Preventive Services Task Force. Pediatrics, 123(4), e716-35.
Wilson, J. M., & Jungner, G. (1968). Principles and practice of screening for disease. Geneva:
World Health Organization.
Wittkampf, K., van Ravesteijn, H., Baas, K., van de Hoogen, H., Schene, A., Bindels, P., et al.
(2009). The accuracy of Patient Health Questionnaire-9 in detecting depression and
measuring depression severity in high-risk groups in primary care. General Hospital
Psychiatry, 31(5), 451-9.
Wittkampf, K. A., Naeije, L., Schene, A. H., Huyser, J., & van Weert, H. C. (2007). Diagnostic
accuracy of the mood module of the Patient Health Questionnaire: A systematic review.
General Hospital Psychiatry, 29(5), 388-95.
Yeung, A., Fung, F., Yu, S. C., Vorono, S., Ly, M., Wu, S., & Fava, M. (2008). Validation of the
Patient Health Questionnaire-9 for depression screening among Chinese Americans.
Comprehensive Psychiatry, 49(2), 211-7.
Zelkowitz, P., & Milet, T. H. (1996). Postpartum psychiatric disorders: Their relationship to
psychological adjustment and marital satisfaction in the spouses. Journal of Abnormal
Psychology, 105(2), 281-5.
Zelkowitz, P., & Milet, T. H. (2001). The course of postpartum psychiatric disorders in women
and their partners. The Journal of Nervous and Mental Disease, 189(9), 575-82.