Comparison of a traditional meta-analysis versus an individual patient data (IPD) meta-analysis to assess the diagnostic accuracy of the Patient Health Questionnaire-9 (PHQ-9)
Brooke Levis
Department of Epidemiology, Biostatistics and Occupational Health
McGill University, Montreal
Submitted November 2014
A thesis submitted to McGill University in partial fulfillment of the requirements of the degree
of: Master of Science – Epidemiology (Thesis)
© Brooke Levis 2014
Table of Contents
Title page
Table of contents
Abstract
Résumé
Acknowledgements
1. Introduction
2. Literature review
2.1. Depression
2.2. Screening
2.2.1. What is screening?
2.2.2. When is screening appropriate?
2.3. Depression screening
2.3.1. What is depression screening?
2.3.2. Controversy about screening
2.4. Diagnostic Accuracy
2.4.1. Diagnostic accuracy of depression screening tools
2.4.2. Limitations in existing evidence
2.4.2.1. Spectrum bias problem
2.4.2.2. Small sample sizes, data-driven cutoffs and selective reporting
2.5. Traditional meta-analyses
2.6. Individual patient data (IPD) meta-analysis
2.6.1. General approach
2.6.2. Advantages compared to traditional meta-analyses
3. Methods
3.1. Measure
3.1.1. Patient Health Questionnaire-9 (PHQ-9)
3.2. Data collection
3.2.1. Source
3.2.2. Inclusion criteria
3.2.3. Author contact
3.2.4. Ethical approval
3.2.5. Transfer of data
3.3. Data preparation
3.3.1. Extraction
3.3.2. Cleaning
3.4. Statistical analyses
4. Results
4.1. Datasets
4.2. Comparison of traditional versus individual patient data (IPD) meta-analysis
4.3. Tables and figures
4.3.1. Table 1. Characteristics of included studies
4.3.2. Table 2. Comparison of meta-analysis results across the two meta-analytic methods
4.3.3. Table 3. Discrepancies in sensitivity and specificity across cutoffs
4.3.4. Figure 1. ROC curves produced by IPD meta-analysis using all data vs. traditional meta-analysis using published data only
5. Discussion
5.1. Main Finding
5.2. Patterns in selective outcome reporting
5.3. Clinical significance and implications
5.4. Implications for research
5.5. Limitations
6. Summary and final conclusion
6.1. Summary
6.2. Conclusion
7. References
8. Appendix
8.1. Patient Health Questionnaire-9 (PHQ-9)
Abstract
Background: Depression accounts for more years lived with disability than any other medical
condition. Major depressive disorder (MDD) may be present in 10-20% of patients in medical
settings. Effective interventions to reduce the burden of depression exist, but most patients with
depression do not receive adequate care. Screening for depression has been recommended to
improve access to depression care. However, studies that have examined the diagnostic accuracy
of depression screening tools typically have used data-driven, exploratory methods to select
optimal cutoffs. Most often, these studies report results from a small range of cutoff points
around whichever cutoff score is most accurate in that given study. When data from these
published studies are combined in meta-analyses, estimates of accuracy for different cutoff
points are often based on data from different studies, rather than having data from all studies for
each possible cutoff point. As a result, traditional meta-analyses may generate exaggerated
estimates of accuracy (i.e., sensitivity and specificity). Individual patient data (IPD) meta-
analyses can address this problem by synthesizing data from all studies for each cutoff.
Objective: To assess the degree to which selective reporting of results from well-performing
cutoff thresholds may bias accuracy estimates in meta-analyses of depression screening tools. To
do this, I examined results from studies of the Patient Health Questionnaire-9 (PHQ-9), a
frequently used depression screening tool, comparing results from a traditional meta-analysis of
published accuracy data to results from an IPD meta-analysis using original patient data from the
same studies.
Methods: Authors of studies included in a recently published meta-analysis on the PHQ-9 were
invited to contribute patient-level data. For each dataset, we extracted the PHQ-9 scores and MDD
diagnoses for each patient. Two sets of statistical analyses were performed: (1) a traditional
meta-analysis where, for each cutoff between 7 and 15, we included data from the studies that
reported accuracy results for the cutoff in the original publication; and (2) an IPD meta-analysis
where, for each cutoff between 7 and 15, we included data from all studies.
Results: We obtained data from 13 of 16 eligible datasets that were included in the original
meta-analysis. Of the 13 studies, 11 (83% of patients) published accuracy results for the
recommended cutoff score of 10 in the original report; accuracy results using traditional meta-
analysis (sensitivity = 0.85, specificity = 0.88) were similar to those using IPD meta-analysis
(sensitivity = 0.87, specificity = 0.88). For other cutoffs, the number of studies that published
accuracy results for the particular cutoff ranged from 3 to 6 (21-46% of all patients) and results
using the two meta-analytic methods were more discrepant. For cutoffs below the standard
cutoff of 10, the traditional meta-analysis tended to underestimate sensitivity, whereas for
cutoffs above 10 it tended to overestimate sensitivity. For instance, for a cutoff of 9, sensitivity was 0.78 in
traditional meta-analysis versus 0.89 using IPD, whereas for a cutoff of 11, sensitivity was 0.92
in traditional meta-analysis versus 0.83 using IPD. For all cutoffs, specificity was similar across
the two meta-analytic methods.
Conclusions: Traditional meta-analyses may exaggerate the diagnostic accuracy of depression
screening tools, especially for cutoffs that are not standard or recommended. IPD meta-analysis
provides a mechanism to obtain unbiased estimates of accuracy.
Résumé
Contexte: La dépression provoque plus d’années vécues avec incapacité que toute autre
condition médicale. Un trouble dépressif majeur peut être présent chez 10-20% des patients en
milieu médical. Des interventions efficaces pour réduire le fardeau de la dépression existent,
mais la majorité des patients souffrant de dépression ne reçoivent pas de soins adéquats. Le
dépistage de la dépression a été recommandé pour améliorer l'accès aux soins de la dépression.
Cependant, les études qui ont examiné la précision du diagnostic des outils de dépistage ont
généralement utilisé des méthodes exploratoires axées sur les données pour sélectionner les
seuils optimaux. Souvent, ces études présentent les résultats pour un petit nombre de seuils
autour de celui qui est le plus précis dans l'étude donnée. Lorsque les données de ces études
publiées sont combinées dans les méta-analyses, les estimations de précision pour différents
seuils sont souvent basées sur des données provenant de différentes études, plutôt que d'utiliser
des données de toutes les études pour chaque seuil possible. En conséquence, les méta-analyses
traditionnelles peuvent générer des estimations de précision exagérées. Des méta-analyses sur
données individuelles peuvent remédier à ce problème en synthétisant les données de toutes les
études pour chaque seuil.
Objectif: Évaluer à quel point la publication sélective des résultats de seuils performants peut
biaiser les estimations de la précision du diagnostic dans les méta-analyses des outils de
dépistage de la dépression. Pour ce faire, j’ai examiné les résultats des études du questionnaire
sur la santé du patient-9 (PHQ-9), un outil de dépistage de la dépression fréquemment utilisé, en
comparant les résultats d’une méta-analyse traditionnelle des données de précision publiées aux
résultats d’une méta-analyse sur données individuelles en utilisant les données originales des
patients des mêmes études.
Méthodes: Les auteurs des études incluses dans une méta-analyse récemment publiée sur le
PHQ-9 ont été invités à fournir des données au niveau du patient. Pour chaque ensemble de
données, nous avons extrait les scores du PHQ et le diagnostic de trouble dépressif majeur pour
chaque patient. Deux séries d'analyses statistiques ont été réalisées: (1) une méta-analyse
traditionnelle où, pour chaque seuil entre 7 et 15, nous avons inclus les données des études qui
ont publié des résultats de précision pour ce seuil dans la publication originale; et (2) une méta-
analyse sur données individuelles où, pour chaque seuil entre 7 et 15, nous avons inclus les
données de toutes les études.
Résultats: Nous avons obtenu les données de 13 des 16 ensembles de données admissibles de la
méta-analyse originale. Pour le seuil recommandé de 10, 11 des 13 études (83% des patients) ont
publié les résultats de précision du diagnostic dans le rapport initial; les résultats de la précision
utilisant une méta-analyse traditionnelle (sensibilité = 0,85, spécificité = 0,88) étaient semblables
à ceux utilisant une méta-analyse sur données individuelles (sensibilité = 0,87, spécificité =
0,88). Pour les autres seuils, le nombre d'études ayant publié des résultats de précision pour le
seuil particulier variait de 3 à 6 (21-46% des patients) et les résultats en utilisant les deux
méthodes de méta-analyse étaient plus discordants. Pour les seuils inférieurs au seuil recommandé
de 10, la méta-analyse traditionnelle avait tendance à sous-estimer la sensibilité, tandis que
pour les seuils supérieurs à 10 elle avait tendance à la surestimer. Pour tous les seuils, la
spécificité était similaire dans les deux méthodes de méta-analyse.
Conclusions: Les méta-analyses traditionnelles peuvent exagérer la précision du diagnostic des
outils de dépistage de la dépression, en particulier pour les seuils qui ne sont pas standard ou
recommandés. Les méta-analyses sur données individuelles fournissent un mécanisme pour
obtenir des estimations de précision non biaisées.
Acknowledgements
I would like to acknowledge the tremendous help, support, and guidance that my
supervisor, Dr. Brett Thombs, has provided me with over the past 5 years, and especially during
my Master's training the past 2 years. Amongst so many other things, you have made me an
extremely critical thinker and have taught me the importance of transparency. You are a fabulous
mentor who shows great interest in all of your trainees. I truly appreciate all of the time and
effort you have put into helping me become an independent epidemiologist, and I am thankful
for all of the opportunities you have provided me with to help me develop my career.
I would also like to acknowledge my co-supervisor, Dr. Andrea Benedetti. Thank you for
all of your help in developing my analysis plan and interpreting the results, and for all of your
support and feedback in preparing my manuscript and thesis for submission.
I would like to express my gratitude to McGill University and to the Department of
Epidemiology, Biostatistics and Occupational Health for providing me with a first-class
education. Thank you as well to the staff at the Student Affairs Office for answering all my
questions and for ensuring I met all deadlines.
I would also like to thank the external reviewer for their time and effort in evaluating my
thesis. I appreciate the time you have taken away from your schedule to review my thesis and
share your expertise.
Thank you as well to all of the investigators who provided me with their patient-level
data from around the globe. This study would not have been possible without the rich data that
you supplied.
I would like to give many thanks to my family and friends for their support, as well as all
my colleagues at the Behavioural Health Research Group. A special thank you to my brother and
colleague Alex; you were extremely helpful in R programming and debugging.
Last, but definitely not least, my most heartfelt thanks go out to my friend, colleague and
roommate, Dr. Linda Kwakkenbos, for her help and support both in and out of the lab. I am
eternally grateful for all of the time you took to answer my many questions, and to proofread just
about anything important, as well as for all the coffee, lunches, and late night work sessions. I
could not have done this without you.
1. Introduction
Depression accounts for more years lived with disability than any other medical condition
(Lopez et al., 2006; Mathers et al., 2006; Moussavi et al., 2007; Whiteford et al., 2013). In
medical settings, major depressive disorder (MDD) is present in 5-10% of primary care patients,
including 10-20% of patients with chronic diseases (Evans et al., 2005). There are effective
interventions available to reduce the burden of depression, but most patients with depression do
not receive adequate care (Duhoux et al., 2009; Duhoux et al., 2013). Routine depression
screening has been recommended to improve access to depression care, but is controversial
(Thombs et al., 2012).
Depression screening refers to the use of depression screening tools to identify patients
who may have depression, but who are not seeking treatment for symptoms and whose
depression is not otherwise recognized by their physicians, so that they can be further assessed
and, if appropriate, treated (Raffle & Gray, 2007; UK National Screening Committee, 2000).
For depression screening to be effective, screening tools must be able to accurately distinguish
between patients with and without MDD. There is concern, however, that the diagnostic
accuracy of commonly-used depression screening tools is poorly understood and that existing
evidence on depression screening may overstate what would occur in actual clinical practice
(Thombs et al., 2011; Thombs et al., 2012).
One concern is that selective reporting of results from cutoff thresholds that perform well
in a given study, but not from cutoff thresholds that perform poorly, may be common and may
further inflate estimates of the diagnostic accuracy of depression screening tools. Studies that
have examined the diagnostic accuracy of depression screening tools typically use data-driven,
exploratory methods to select optimal cutoffs, and tend to report results only from a range of
cutoff points around whatever cutoff score worked best. When data from these published studies
are combined in traditional meta-analyses, which depend on sample-level results available in
published studies or unpublished reports, estimates of accuracy for different cutoff points are
often based on data from different studies, rather than having data from all studies for each
possible cutoff point. The problem with this is illustrated in a recent meta-analysis of the
diagnostic accuracy of the Patient Health Questionnaire-9 (PHQ-9) (Manea et al., 2012), an
instrument commonly used to screen for MDD in medical settings. In that meta-analysis, for
each possible cutoff score on the PHQ-9, the meta-analysis synthesized sample-level data from
all studies that had published data on sensitivity and specificity for that cutoff score. The
limitations of this approach are highlighted by the finding that estimated sensitivity actually
tended to increase as cutoff scores increased, a mathematically impossible result with complete
data: as the cutoff score rises, fewer patients screen positive, so a lower proportion of true
cases scores at or above the threshold.
Individual patient data (IPD) meta-analysis (Riley et al., 2010), which synthesizes actual
line-by-line patient data from primary studies rather than only published summary results, is one
way to address the problem of selective reporting of results from only well-performing cutoffs.
This is because in IPD meta-analyses, data can be synthesized for all possible cutoff thresholds
for all included studies. When implemented effectively, IPD meta-analyses are considered by the
Cochrane Collaboration to be the ‘gold standard’ of evidence synthesis (Stewart et al., 2011).
No studies have systematically evaluated how and to what degree the selective reporting
of data from some thresholds, but not others, may bias accuracy estimates of depression
screening tools. Thus, the objective of this study was to systematically assess the manner and
degree to which selective reporting of results from well-performing cutoff thresholds may bias
accuracy estimates in meta-analyses of depression screening tools. To do this, we compared
results produced by conducting a traditional meta-analysis of published accuracy data to results
produced by conducting an IPD meta-analysis using original patient data from the same studies,
using data from the recent meta-analysis on the diagnostic accuracy of the PHQ-9 (Manea et al.,
2012).
2. Literature Review
2.1. Depression
Depression accounts for more years lived with disability than any other medical condition
(Lopez et al., 2006; Mathers et al., 2006; Moussavi et al., 2007; Whiteford et al., 2013). In
medical settings, major depressive disorder (MDD) is present in 5-10% of primary care patients,
including 10-20% of patients with chronic diseases (Evans et al., 2005). There are effective
interventions available to reduce the burden of depression, but most patients with depression do
not receive adequate care (Duhoux et al., 2009; Duhoux et al., 2013). Screening for depression
has been recommended to address depression in medical settings.
2.2. Screening
2.2.1. What is screening?
Screening is “the presumptive identification of unrecognised disease or defect by the
application of tests, examinations, or other procedures which can be applied rapidly. Screening
tests sort out apparently well persons who probably have a disease from those who probably do
not. A screening test is not intended to be diagnostic. Persons with positive or suspicious
findings must be referred to their physicians for diagnosis and necessary treatment”
(Commission on Chronic Illness, 1957). Essentially, the purpose of screening is to detect
previously unrecognized cases so that they can be referred for diagnosis and treatment.
2.2.2. When is screening appropriate?
According to the World Health Organization (WHO), screening is appropriate when there
is an important health problem that is prevalent in the population and whose presence would not
likely be detected without screening. The screening tools used must perform well, and there must
be effective interventions available to benefit those with the condition. Finally, there needs to be
evidence from randomized controlled trials demonstrating that the benefits of screening
outweigh the harms (Wilson & Jungner, 1968).
2.3. Depression screening
2.3.1. What is depression screening?
Depression screening refers to the use of depression screening tools to identify patients
who may have depression, but who are not seeking treatment for symptoms and whose
depression is not otherwise recognized by their physicians, so that they can be further assessed
and, if appropriate, treated (Raffle & Gray, 2007; UK National Screening Committee, 2000).
Depression screening involves systematically assessing for depression among all individuals in a
given risk group using a standardized test. It does not refer to probing for depression on an
individual level or to using depression symptom questionnaires to monitor treatment and relapse.
For depression screening to be effective, screening tools must be able to accurately distinguish
between patients with and without MDD. Accurate assessment of the performance of depression
screening tools, however, requires a large number of patients with and without MDD in order to
accurately estimate sensitivity (percent of patients with MDD correctly identified as likely
depressed) and specificity (percent of patients without MDD correctly identified as likely not
depressed).
2.3.2. Controversy about screening
There is no universally accepted recommendation for depression screening in medical
settings; different countries and groups have made varying and contradicting recommendations
over the years. The first major recommendation came in 2002, when the United States Preventive
Services Task Force (USPSTF) recommended routine depression screening in primary care settings
where enhanced collaborative care is available, but not in the absence of such programs
(Pignone et al., 2002). In 2009, the USPSTF updated their evidence review
and, based on evidence from 9 RCTs, recommended screening adults for depression in primary
care settings when staff-assisted depression management programs are available (Grade B
recommendation) (U.S. Preventive Services Task Force, 2009). In 2010, the UK National
Institute for Health and Care Excellence (NICE) put out a guideline on depression management
in primary care stating that there is no evidence that depression screening benefits patients.
Rather than routine screening, they recommended that primary care physicians be alert to
possible depression during encounters with patients (National Collaborating Center for Mental
Health, 2010). In 2013, the Canadian Task Force on Preventive Health Care (CTFPHC) similarly
recommended that physicians be alert, but not routinely screen, for depression in primary care
(Joffres et al., 2013). This recommendation actually reversed a 2005 recommendation to screen
for depression in the context of integrated staff-assisted depression management systems
(MacMillan et al., 2005). A major concern raised by the CTFPHC was that published reports of
the diagnostic accuracy of depression screening tools appear to overstate diagnostic accuracy
compared to what would occur in clinical practice.
In addition to the varying recommendations, a recent systematic review attempted to
evaluate whether there is evidence from randomized controlled trials (RCTs) that depression
screening benefits patients in primary care, using an explicit definition of screening (Thombs et
al., 2014). To be included in the review, a trial needed to compare depression outcomes between
screened and non-screened patients and meet the following 3 criteria: (1) randomize patients
prior to screening, (2) exclude patients who had already been diagnosed or who were already
being treated, and (3) provide the same depression treatment services to patients identified as
depressed by the screener and to patients identified without it. Based on
these criteria, no screening trials were identified.
2.4. Diagnostic Accuracy
2.4.1. Diagnostic accuracy of depression screening tools
In studies on the diagnostic accuracy of depression screening tools, patient scores from
self-report depression symptom questionnaires are compared to diagnostic status (MDD versus
no MDD) based on a validated diagnostic interview that was designed to reflect Diagnostic and
Statistical Manual of Mental Disorders (DSM) or International Classification of Disease (ICD)
criteria for MDD. To be effective, depression screening tools must be able to accurately
distinguish between patients with and without MDD. Accurate assessment of the performance of
depression screening tools requires a large number of patients with and without MDD in order to
accurately estimate sensitivity and specificity. Sensitivity is the probability that patients with
MDD will be correctly identified as likely depressed by the screener, while specificity is the
probability that patients without MDD will be correctly identified as likely not depressed by the
screener (Altman & Bland, 1994). Sensitivity and specificity are generally regarded as intrinsic
characteristics of a test and are independent of disease prevalence (Brenner & Gefeller, 1997; Li
& Fine, 2011).
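As a minimal illustration of these two definitions, the following Python sketch computes sensitivity and specificity from a 2x2 cross-tabulation of screening results against interview diagnoses. The counts are invented for illustration and are not drawn from any study cited here.

```python
# Hypothetical 2x2 counts: screening result cross-tabulated against the
# diagnostic interview (invented numbers, purely for illustration).
true_pos, false_neg = 44, 6     # patients WITH MDD: screened positive / negative
true_neg, false_pos = 170, 30   # patients WITHOUT MDD: screened negative / positive

sensitivity = true_pos / (true_pos + false_neg)  # P(screen positive | MDD)
specificity = true_neg / (true_neg + false_pos)  # P(screen negative | no MDD)

print(sensitivity, specificity)  # 0.88 0.85
```

Note that doubling only the non-MDD rows would change prevalence but leave both values unchanged, which is why sensitivity and specificity are treated as prevalence-independent characteristics of the test.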
Studies of diagnostic accuracy typically seek to identify an optimal cutoff score using
receiver operating characteristic (ROC) curve analysis, by which sensitivity and specificity
associated with all possible cutoff scores are calculated and plotted (Hanley & McNeil, 1982;
Swets, 1988). As the required cutoff score for a positive screen increases, sensitivity decreases
(fewer cases reach the necessary threshold), but specificity increases.
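The tradeoff can be seen by sweeping a cutoff over patient-level scores. This sketch uses invented PHQ-9-style scores (not data from any real study) and records one sensitivity and one specificity value per candidate cutoff, as ROC analysis does:

```python
# Invented PHQ-9-style scores, for illustration only.
cases    = [4, 8, 9, 10, 11, 12, 14, 15, 18, 20]  # patients with MDD
noncases = [0, 1, 2, 3, 4, 5, 6, 8, 9, 11]        # patients without MDD

sens_by_cutoff, spec_by_cutoff = [], []
for cutoff in range(7, 16):  # candidate cutoffs for a positive screen
    sens_by_cutoff.append(sum(s >= cutoff for s in cases) / len(cases))
    spec_by_cutoff.append(sum(s < cutoff for s in noncases) / len(noncases))

# With complete data, raising the cutoff can only lower (or hold) sensitivity
# and raise (or hold) specificity, because each step reclassifies some screens
# from positive to negative.
assert all(a >= b for a, b in zip(sens_by_cutoff, sens_by_cutoff[1:]))
assert all(a <= b for a, b in zip(spec_by_cutoff, spec_by_cutoff[1:]))
```

The two assertions hold for any complete dataset, which is what makes the rising-sensitivity pattern discussed later mathematically impossible when all data are used.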
2.4.2. Limitations in existing evidence
There is evidence suggesting that currently reported diagnostic accuracy estimates are
exaggerated due to problems such as spectrum bias and the selective reporting of
data-driven cutoffs.
2.4.2.1. Spectrum bias problem
Screening tools should detect non-recognized cases. Most existing studies of screening
accuracy, however, include large numbers of patients already being treated for depression. Those
patients were recognized as depressed without screening and would not be screened in clinical
practice, since screening is done to identify undetected cases. A recent review found that only
4% of 197 studies on the diagnostic accuracy of depression screening tools appropriately
excluded patients who were already diagnosed or already undergoing depression treatment
(Thombs et al., 2011). Inclusion of these patients in diagnostic accuracy studies would be
expected to inflate estimates of the number of new cases that would be identified through
screening and estimates of accuracy, since already recognized and treated patients are typically
more easily identified than those who would only be detected via screening.
2.4.2.2. Small sample sizes, data-driven cutoffs and selective reporting
Most existing primary studies are limited by small sample sizes and do not generate
precise estimates. As a result, different studies often generate substantially different ‘optimal’
cutoff scores for the same screening tool. For instance, one review examined 9 small studies that
used the Hospital Anxiety and Depression Scale (HADS) depression screener to detect
depression (Meijer et al., 2011). These studies identified optimal cutoff scores ranging from
5 to 11, a range far too wide to be useful in clinical practice.
Selective reporting of results from cutoff thresholds that perform well in a given study,
but not from cutoff thresholds that perform poorly, may be common and may inflate estimates of
the diagnostic accuracy of depression screening tools. Most researchers report accuracy results
for the cutoffs that performed best in the particular studies, and do not report the results for the
cutoffs that did not work well. Because of this, synthesizing results across possible cutoff scores
in traditional meta-analyses, which combine summary results from primary studies, can generate
biased estimates of accuracy. For instance, a recent meta-analysis on the diagnostic accuracy of
the Patient Health Questionnaire-9 (PHQ-9), which is one of the most commonly used
depression screeners, included 18 diagnostic accuracy studies, and, for each possible cutoff, the
authors analyzed all the data available from the publications (Manea et al., 2012). Taken
together, the reported results imply a mathematically impossible pattern: as the cutoff
increased, pooled sensitivity appeared to increase, which cannot occur with complete data,
since raising the cutoff classifies fewer patients as screen-positive. The problem was that,
because some studies only reported a small range of cutoffs around their best-performing
cutoff, the authors were only able to meta-analyze a portion of the 18 studies for each cutoff.
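The mechanism can be reproduced in a few lines. The sketch below uses two invented studies (hypothetical scores, not the Manea et al. data): each "publishes" only cutoffs near its own best-performing threshold, so per-cutoff pooling of published results shows sensitivity rising with the cutoff, while pooling the complete patient-level data shows the mandatory decline:

```python
def sensitivity(case_scores, cutoff):
    """Proportion of true MDD cases scoring at or above the cutoff."""
    return sum(s >= cutoff for s in case_scores) / len(case_scores)

# Invented PHQ-9 scores for true MDD cases in two hypothetical studies.
study_a = [6, 7, 8, 8, 9, 9, 9, 10, 10, 11]         # publishes cutoffs 8-10 only
study_b = [11, 12, 12, 12, 13, 13, 14, 14, 15, 15]  # publishes cutoffs 11-13 only

# Traditional meta-analysis: each cutoff pools only the studies reporting it.
trad_sens_9  = sensitivity(study_a, 9)   # only study A published cutoff 9
trad_sens_12 = sensitivity(study_b, 12)  # only study B published cutoff 12

# IPD meta-analysis: every cutoff uses all patients from both studies.
all_cases = study_a + study_b
ipd_sens_9  = sensitivity(all_cases, 9)
ipd_sens_12 = sensitivity(all_cases, 12)

print(trad_sens_9, trad_sens_12)  # 0.6 0.9  -> appears to RISE with the cutoff
print(ipd_sens_9, ipd_sens_12)    # 0.8 0.45 -> falls with complete data, as it must
```

A real traditional meta-analysis pools several weighted studies per cutoff rather than one, but the one-study "pool" here isolates the selection effect being described.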
Studies in areas outside of mental health have demonstrated that using data-driven cutoff
thresholds to estimate diagnostic accuracy generates overly optimistic estimates, particularly
when sample sizes are small (Ewald, 2006; Goodacre et al., 2005; Leeflang et al., 2008;
Thompson et al., 2006; Whiting et al., 2013). Generally, using the same sample to select an
optimal cutoff score and simultaneously estimate the diagnostic accuracy would be expected to
produce estimates that would not be replicated in actual practice (Rutjes et al., 2006; Whiting et
al., 2004). This suggests a high risk of bias when this is done in studies of depression screening
tools, especially since most existing primary studies have been limited by small sample sizes,
specifically by small numbers of cases of MDD.
2.5. Traditional meta-analyses
Traditional meta-analyses have been used to assess the diagnostic accuracy of depression
screening tools. Meta-analyses can overcome problems associated with small sample sizes in
primary studies by combining data across studies. This can be done without bias, however, only
if all relevant outcome data are reported in primary studies (e.g., accuracy data for all relevant
cutoff scores).
There are only a few examples of meta-analyses of depression screening tool accuracy
(Brennan et al., 2010; Gilbody et al., 2007; Hewitt et al., 2009; Manea et al., 2012; Meader et al.,
2014; Mitchell & Coyne, 2007; Mitchell, 2008; Mitchell et al., 2010; Mitchell et al., 2012;
Vodermaier & Millman, 2011; Wancata et al., 2006; Wittkampf et al., 2007). Existing meta-
analyses have handled the issue of selective threshold reporting in 3 different ways: (1) In some
meta-analyses, the authors have explicitly indicated that they have only included one cutoff from
each study. They have used either the cutoff the primary study authors indicated was optimal or a
cutoff that they deemed was most accurate using some alternative method when the primary
study authors did not clearly indicate an optimal cutoff. For instance, in a systematic review by
Wancata et al. (2006), when the primary study authors did not clearly recommend a particular
cutoff, the meta-analysis authors used the cutoff where sensitivity and specificity were closest
together. (2) In other meta-analyses, the authors have stated that they synthesized standard
cutoffs, but have substituted “optimal cutoffs” when standards were not reported in the original
studies. This substitution method may be indicated explicitly, or may not have been stated but
was reported to us by the authors of the meta-analysis. For instance, in a meta-analysis by
Meader et al. (2011), the authors attempted to synthesize accuracy data for the standard cutoff
score of 8, when available. However, they were only able to include results from this cutoff for
12 of 27 studies (44% of patients) included in the meta-analysis. Results for the standard cutoff
were not reported in the other included studies, presumably because the standard cutoff had not
performed well. For those studies, the authors used accuracy data from other, better performing
cutoffs instead and combined results quantitatively, even though the studies did not use the same
cutoff threshold. (3) In other meta-analyses, the authors have analyzed outcomes at each cutoff
separately and, for each cutoff, included the studies that reported results for that particular cutoff.
For instance, in the meta-analysis by Manea et al. (2012), the authors included 18 primary
studies in their analyses, and for each cutoff of the Patient Health Questionnaire-9 (PHQ-9), they
included the subset of studies that had reported results for the respective cutoff.
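To make the substitution approach, (2) above, concrete, the following sketch (hypothetical scores and cutoffs, not drawn from Meader et al.) pools the standard cutoff where a study reported it and substitutes each remaining study's "optimal" cutoff otherwise, then compares that mixed estimate against the estimate at the true standard cutoff, which only patient-level data can provide:

```python
def sensitivity(case_scores, cutoff):
    """Proportion of true MDD cases scoring at or above the cutoff."""
    return sum(s >= cutoff for s in case_scores) / len(case_scores)

# Invented case scores for three hypothetical studies; standard cutoff is 8.
studies = {
    "A": [5, 6, 7, 8, 8, 9, 10, 10, 11, 12],
    "B": [4, 5, 5, 6, 6, 7, 8, 9, 10, 11],
    "C": [3, 4, 5, 6, 6, 7, 7, 8, 8, 9],
}
# Suppose only study A published results at the standard cutoff of 8, while
# B and C published only their better-performing "optimal" cutoff of 6.
reported_cutoff = {"A": 8, "B": 6, "C": 6}

# Substitution pooling: average whatever cutoff each study reported.
mixed = sum(sensitivity(s, reported_cutoff[k]) for k, s in studies.items()) / 3

# True pooling at the standard cutoff, possible only with patient-level data.
standard = sum(sensitivity(s, 8) for s in studies.values()) / 3

print(mixed, standard)  # the mixed estimate overstates accuracy at cutoff 8
```

The unweighted average stands in for a proper pooled estimate; the point is only that mixing thresholds chosen for good performance inflates apparent accuracy at the standard cutoff.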
2.6. Individual patient data (IPD) meta-analysis
Individual patient data (IPD) meta-analysis (Riley et al., 2010), which synthesizes actual
line-by-line patient data from primary studies rather than only published summary results, is one
way to address the problem of selective reporting of results from only well-performing cutoffs.
This is because in IPD meta-analyses, data can be synthesized for all possible cutoff thresholds
for all included studies.
2.6.1. General approach
The general approach of an IPD meta-analysis does not differ from a traditional meta-
analysis in terms of defining a research question, establishing study inclusion and exclusion
criteria, identifying and screening studies, and analyzing data (Stewart et al., 2011). IPD meta-
analyses are resource intensive in that they require a substantial amount of time to identify and
obtain original data, clarify data-related issues with data providers, and generate a consistent data
format across studies (Ahmed et al., 2012; Riley et al., 2010; Stewart et al., 2011). The quality of
IPD meta-analyses depends on the ability to obtain primary data (Riley et al., 2010).
2.6.2. Advantages compared to traditional meta-analyses
Individual patient data (IPD) meta-analysis (Riley et al., 2010) has the potential to
address shortcomings in existing depression screening research. IPD meta-analyses synthesize
original patient data obtained from researchers responsible for the primary studies, thus allowing
for the analysis of data from all cutoffs for all studies. IPD meta-analyses have particular benefits
when there are limitations in published information or where subgroup analyses are needed, but
cannot be performed from study-level data available in original reports; this is often the case
with depression screening accuracy studies. Because of these advantages, IPD meta-analyses,
when they can be implemented effectively, are considered by the Cochrane Collaboration to be
the ‘gold standard’ of evidence synthesis (Stewart et al., 2011).
In the context of evaluating the diagnostic accuracy of depression screening tools,
assessing all relevant cutoff scores for all studies can eliminate biases that arise in traditional
meta-analyses, where either large numbers of datasets are excluded, presumably due to poor
accuracy at standard cutoffs, or some datasets are used to estimate diagnostic accuracy at some
cutoff scores while other datasets are used at others. In addition, since virtually
all primary studies collect data on current depression treatment (e.g., antidepressant use), IPD
meta-analyses can appropriately exclude already-treated patients. Furthermore, IPD meta-
analyses with large numbers of patients and large numbers of MDD cases can potentially
incorporate individual risk factors for depression (e.g., age, sex, medical comorbidity) and study
variables (e.g., study setting, risk of bias factors) that may influence accuracy and clinical
decision-making, but have not been included in traditional meta-analyses, as has been noted by
the CTFPHC (Joffres et al., 2013).
3. Methods
3.1. Measure
3.1.1. Patient Health Questionnaire-9 (PHQ-9)
The PHQ-9 (Kroenke & Spitzer, 2002) (see Appendix) is a 9-item measure of depressive
symptoms that is commonly used in medical populations (Gilbody et al., 2007; Wittkampf et al.,
2007). The maximum total score is 27, and higher scores represent increased severity of
depressive symptoms. The standard cutoff score to identify possible depression is 10 (Gilbody et
al., 2007; Kroenke et al., 2001; Wittkampf et al., 2007).
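As a concrete illustration of the scoring just described, the following minimal sketch sums the nine item ratings and applies the standard cutoff (function names are illustrative, not taken from any PHQ-9 software):

```python
# Illustrative sketch: scoring a PHQ-9 response. Each of the 9 items is
# rated 0-3, so totals range from 0 to 27; the standard cutoff of 10
# flags possible major depression.

def phq9_total(item_scores):
    """Sum the nine 0-3 item ratings into a total severity score."""
    if len(item_scores) != 9:
        raise ValueError("PHQ-9 requires exactly 9 item scores")
    if any(s not in (0, 1, 2, 3) for s in item_scores):
        raise ValueError("each item must be rated 0, 1, 2, or 3")
    return sum(item_scores)

def screens_positive(item_scores, cutoff=10):
    """Apply a cutoff: total >= cutoff counts as screen-positive."""
    return phq9_total(item_scores) >= cutoff

print(phq9_total([1, 2, 1, 0, 3, 1, 2, 0, 1]))       # 11
print(screens_positive([1, 2, 1, 0, 3, 1, 2, 0, 1])) # True
```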
3.2. Data collection
3.2.1. Source
This study was conducted using the set of studies included in the recent Manea et al.
meta-analysis of the PHQ-9 (Manea et al., 2012). We attempted to obtain the original patient
data from all studies included in the meta-analysis in order to re-analyze the data to compare
results produced by traditional meta-analysis versus results of an IPD meta-analysis.
In the meta-analysis by Manea et al., the authors searched electronic databases (Embase,
MEDLINE and PsycInfo) from 1999 to August 2010 for studies that evaluated and reported the
diagnostic accuracy of the PHQ-9 for diagnosing major depressive disorder. The authors
included studies that (1) defined MDD according to standard classification systems (e.g.,
International Classification of Diseases [ICD] or the Diagnostic and Statistical Manual of Mental
Disorders [DSM]), (2) used a standardized diagnostic interview (e.g., Mini International
Neuropsychiatric Interview [MINI], Structured Clinical Interview for DSM Disorders [SCID],
Composite International Diagnostic Interview [CIDI], Diagnostic Interview Schedule [DIS] or
Revised Clinical Interview Schedule [CIS-R]), and (3) provided sufficient data to calculate
contingency tables.
3.2.2. Inclusion criteria
We included studies that comprised unique patient samples and that published
diagnostic accuracy results for MDD for at least one PHQ-9 cutoff.
3.2.3. Author contact
We contacted the authors of the eligible studies and invited them to contribute de-
identified primary data for an individual patient data (IPD) meta-analysis. When we could not
reach an author, we approached co-authors or others who had worked with them recently.
3.2.4. Ethical approval
Per our approved ethics protocol, when an investigator agreed to contribute primary data,
ethical approval for the inclusion of the dataset was sought from the Research Ethics Committee
of Jewish General Hospital in Montreal. In order to obtain ethical approval, we provided the
Research Ethics Committee with the following documents: (1) a signed copy of a letter of
agreement for participation, (2) a copy of the ethics approval from the original study, (3) a copy
of the consent form used in the original study, and (4) a letter or e-mail from the contributing
author’s research ethics committee stating that IRB approval is not needed for this transfer or, if
necessary, research ethics committee approval of the transfer. In cases where documentation of
the original ethics approval and patient consent forms was not retrievable, ethics approval was
granted if there was other documentation that these documents existed (e.g., publications that
document ethics approval and patient consent).
3.2.5. Transfer of data
Once ethical approval for inclusion of a dataset was obtained, the original patient data
were sought. All data were required to be properly de-identified prior to transfer.
3.3. Data preparation
3.3.1. Extraction
For each included dataset, we extracted each patient's PHQ-9 score and MDD diagnosis,
along with any information pertaining to weighting. We also extracted study country,
setting/population, and the diagnostic standard used from the original publications.
3.3.2. Cleaning
We reviewed all original publications and compared diagnostic accuracy reported in the
original publications to the accuracy we calculated using the raw datasets. When data and
original publications were discrepant, we resolved discrepancies in consultation with the original
investigators. When 2x2 tables in the primary studies could not be reproduced using the data
provided, we corrected them based on the raw data and confirmed the corrections with the authors. For studies where
the original analyses included weights, we replicated the original weighting scheme (Fann et al.,
2005; Lamers et al., 2008; Wittkampf et al., 2009). For studies where the original analyses did
not include weights, but the sample selection method merited weighting, we constructed
appropriate weights (Azah et al., 2005; Yeung et al., 2008). For our analyses, we used the
primary data that was cleaned and verified in collaboration with the primary authors.
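The verification step described above can be sketched as follows. This is a hypothetical illustration, not the actual cleaning code used in the thesis, and the record structure (pairs of PHQ-9 total and MDD status) is assumed for the example:

```python
# Hypothetical sketch of the cleaning check: rebuild each study's 2x2
# diagnostic table from raw patient records and compare it with the
# table reported in the original publication.

def two_by_two(records, cutoff):
    """Cross-tabulate screen result (PHQ-9 >= cutoff) against MDD status.

    records: list of (phq9_total, has_mdd) pairs.
    Returns (TP, FP, FN, TN).
    """
    tp = fp = fn = tn = 0
    for phq9, mdd in records:
        positive = phq9 >= cutoff
        if positive and mdd:
            tp += 1
        elif positive and not mdd:
            fp += 1
        elif not positive and mdd:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def matches_publication(records, cutoff, published):
    """Flag datasets whose recomputed table differs from the published one."""
    return two_by_two(records, cutoff) == tuple(published)

# Toy example: 3 patients, cutoff 10.
data = [(12, True), (8, False), (11, False)]
print(two_by_two(data, 10))  # (1, 1, 0, 1)
```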
3.4. Statistical analyses
Two sets of statistical analyses were performed. First, we performed traditional meta-
analyses where for each PHQ-9 cutoff between 7 and 15 we included data from the studies that
included accuracy results for the respective cutoff in the original publication. For instance, if a
study published results for cutoffs 9 through 13, data from this study were included in the meta-analyses of cutoffs 9 through 13, but not in the meta-analyses of cutoffs 7, 8, 14, or 15.
Second, we performed IPD meta-analyses where for each PHQ-9 cutoff between 7 and 15, we
included data from all studies.
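The difference between the two study-selection rules can be sketched as follows (the studies and published cutoff ranges here are invented for illustration):

```python
# Sketch of how the two analysis sets differ. For each cutoff, the
# traditional meta-analysis keeps only studies whose original publication
# reported that cutoff; the IPD meta-analysis keeps every study, since
# accuracy can be recomputed from raw data at any cutoff.

published = {            # hypothetical: study -> cutoffs it published
    "A": range(9, 14),   # cutoffs 9-13
    "B": [10],           # the standard cutoff only
    "C": range(5, 11),   # cutoffs 5-10
}

def traditional_set(cutoff):
    """Studies usable at this cutoff from published results alone."""
    return sorted(s for s, cuts in published.items() if cutoff in cuts)

def ipd_set(cutoff):
    """Studies usable at this cutoff once raw data are in hand."""
    return sorted(published)

# e.g. traditional_set(13) -> ['A'], but ipd_set(13) -> ['A', 'B', 'C']
```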
For both sets of analyses, bivariate random-effects models were used to identify the
“optimal” cutoff score. Bivariate random-effects meta-analysis models were estimated via
adaptive Gauss-Hermite quadrature, as described by Riley et al. (2008), for each PHQ-9 cutoff
between 7 and 15. This approach models sensitivity and specificity simultaneously, accounting
both for the inherent correlation between them and for the precision of estimates within
studies. A random-effects model was used so that sensitivity and specificity were allowed to
vary across primary studies. This model provided us with overall pooled sensitivity and
specificity for each cutoff for the two sets of analyses. By combining information across a range
of cutoffs, we constructed a pooled ROC curve for each set of analyses to show the differences
in the results produced by the two methods of meta-analysis (i.e., traditional vs. IPD).
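To make the pooling step more concrete, the sketch below pools one proportion (e.g., sensitivity at a single cutoff) across studies. It is a deliberately simplified univariate DerSimonian-Laird model on the logit scale, not the bivariate Gauss-Hermite model of Riley et al. (2008) actually used here (which estimates sensitivity and specificity jointly); the counts are invented:

```python
import math

# Simplified stand-in for the pooling step: a univariate DerSimonian-Laird
# random-effects pool of logit proportions. NOT the bivariate model used
# in the thesis; counts below are invented for illustration.

def logit_pool(events, totals):
    """Random-effects pooled proportion (logit scale, DerSimonian-Laird)."""
    y, v = [], []
    for e, n in zip(events, totals):
        e, n = e + 0.5, n + 1.0            # continuity correction
        y.append(math.log(e / (n - e)))    # per-study logit
        v.append(1.0 / e + 1.0 / (n - e))  # within-study variance
    w = [1.0 / vi for vi in v]             # fixed-effect weights
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))  # heterogeneity Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)                 # between-study variance
    wr = [1.0 / (vi + tau2) for vi in v]   # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(wr, y)) / sum(wr)
    return 1.0 / (1.0 + math.exp(-pooled)) # back-transform to a proportion

# True positives / MDD cases at one cutoff across three toy studies:
print(round(logit_pool([18, 40, 9], [20, 50, 12]), 2))  # 0.8
```

Repeating such a pool of sensitivity and specificity at each cutoff, and plotting the pairs, yields a pooled ROC curve of the kind compared in the Results.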
As we were interested in the differences in results produced by including all cutoffs versus
including only published cutoffs, we used the cleaned raw data for both sets of analyses rather
than using the raw data for the IPD meta-analysis and published summary results for the
traditional meta-analysis. Similarly, since the objective of our study was to examine bias, and not
determine the true diagnostic accuracy of the screening tool, we analyzed the entire IPD dataset
as a whole, and consistent with the original meta-analysis, did not conduct any moderator
analyses. Furthermore, because the focus of this project was to examine the effects of selective
outcome reporting by comparing traditional meta-analysis with IPD meta-analysis, we did not
remove patients who were already diagnosed with depression and/or who were currently being
treated for depression, since we would have only been able to do this in the IPD meta-analysis.
4. Results
4.1. Datasets
The meta-analysis by Manea et al. included 18 original studies. Of these 18 studies, there
were 17 unique patient samples, 16 of which published diagnostic accuracy results for MDD for
at least one PHQ-9 cutoff and were thus eligible for the current study.
Authors were first invited to contribute data between May and July of 2012. Datasets
were obtained between August 2012 and September 2013, and all discrepancies were resolved in
consultation with the original investigators by March of 2014.
Of the 16 eligible datasets, 13 (80% of all eligible patients; 94% of all eligible MDD
cases) were successfully obtained, for a total sample size of 4589 patients. One of the missing
datasets (Adewuya et al., 2006) was reported by the study’s principal investigator to be lost in a
fire (personal communication, Abiodun Adewuya, October 16, 2012). Another missing dataset
(Kroenke et al., 2001) belonged to a researcher who is deceased. The third and final missing
dataset (Watnick et al., 2005) could not be provided by the principal investigator. Characteristics
of the included datasets can be found in Table 1.
4.2. Comparison of traditional versus individual patient data (IPD) meta-analysis
Of the 13 studies, 11 (83% of patients; 70% of MDD cases) published accuracy results
for the recommended PHQ-9 cutoff score of 10 in the original report; accuracy results using
traditional meta-analysis (sensitivity = 0.85, specificity = 0.88) were similar to those using IPD
meta-analysis (sensitivity = 0.87, specificity = 0.88).
For other cutoffs, the number of studies that published accuracy results for the particular
cutoff ranged from 3 to 6 (21-46% of patients; 14-53% of MDD cases) and results using the two
different meta-analytic methods were more discrepant. For instance, for a cutoff of 9, sensitivity
was 0.78 in traditional meta-analysis versus 0.89 using IPD, while for a cutoff of 11, sensitivity
was 0.92 in traditional meta-analysis versus 0.83 using IPD. A comparison of results across the
two meta-analytic methods can be found in Table 2; ROC curves produced by the two meta-
analytic methods are shown in Figure 1.
At PHQ-9 cutoff scores above 10, sensitivity was exaggerated in traditional meta-analysis
compared to IPD, while at cutoffs below 10, sensitivity was underestimated in traditional meta-
analysis compared to IPD. For all cutoffs, specificity was similar across the two meta-analytic
methods. Discrepancies in sensitivity and specificity across cutoffs in relation to the proportion
of data available are shown in Table 3.
4.3. Tables and figures
4.3.1. Table 1. Characteristics of included studies
Study, year | Country | Setting/Population | Diagnostic standard | Published cutoffs | N total analyzed | N MDD cases analyzed
Azah et al., 2005 | Malaysia | Family medicine clinic | CIDI (ICD-10) | 5 to 12 | 180 | 30
de Lima Osorio et al., 2009 | Brazil | Females in primary care | SCID (DSM-IV) | 10 to 21 | 177 | 60
Fann et al., 2005 | United States | Inpatients with head trauma | SCID (DSM-IV) | 10 and 12 | 135 | 45
Gilbody et al., 2007 | United Kingdom | Primary care | SCID (DSM-III-R) | 9 to 13 | 96 | 36
Gjerdingen et al., 2009 | United States | Mothers registering their newborns for well-child visits, medical or paediatric clinics | SCID (DSM-IV) | 10 | 438 | 20
Gräfe et al., 2004 | Germany | Psychosomatic patients and patients at walk-in clinics and family practices | SCID (DSM-IV) | 10 to 14 | 521 | 71
Lamers et al., 2008 | Netherlands | Elderly primary care patients with diabetes mellitus and chronic obstructive pulmonary disease | MINI (DSM-IV) | 6 to 8 | 611 | 277
Lotrakul et al., 2008 | Thailand | Primary care | MINI (DSM-IV) | 6 to 15 | 279 | 19
Stafford et al., 2007 | Australia | Hospital settings (coronary artery disease patients) | MINI (DSM-IV) | 5, 6, and 10 | 193 | 35
Thombs et al., 2008 | United States | Cardiology outpatients | C-DIS (DSM-IV) | 4 to 10 | 1024 | 224
Williams et al., 2005 | United States | Stroke patients | SCID (DSM-IV) | 10 | 316 | 106
Wittkampf et al., 2009 | Netherlands | Primary care | SCID (DSM-IV) | 10 and 15 | 435 | 77
Yeung et al., 2008 | United States | Chinese Americans in primary care | SCID (DSM-IV) | 15 | 184 | 37
4.3.2. Table 2. Comparison of meta-analysis results across the two meta-analytic methods
Columns, left to right: cutoff; then for published data (traditional MA): # of studies, # of patients, # of MDD cases, sensitivity (95% CI), specificity (95% CI); then the same five columns for all data (IPD MA).
7 4 2094 550 0.85 (0.70-0.94) 0.73 (0.62-0.81) 13 4589 1037 0.97 (0.91-0.99) 0.73 (0.67-0.78)
8 4 2094 550 0.79 (0.63-0.89) 0.78 (0.71-0.85) 13 4589 1037 0.93 (0.85-0.97) 0.78 (0.74-0.82)
9 4 1579 309 0.78 (0.56-0.90) 0.82 (0.75-0.88) 13 4589 1037 0.89 (0.79-0.95) 0.83 (0.80-0.86)
10 11 3794 723 0.85 (0.71-0.93) 0.88 (0.85-0.91) 13 4589 1037 0.87 (0.75-0.94) 0.88 (0.85-0.90)
11 5 1253 216 0.92 (0.58-0.99) 0.90 (0.81-0.95) 13 4589 1037 0.83 (0.68-0.92) 0.90 (0.88-0.92)
12 6 1388 261 0.82 (0.65-0.92) 0.92 (0.87-0.96) 13 4589 1037 0.77 (0.63-0.87) 0.92 (0.90-0.94)
13 4 1073 186 0.82 (0.75-0.87) 0.94 (0.84-0.98) 13 4589 1037 0.67 (0.56-0.77) 0.94 (0.92-0.95)
14 3 977 150 0.71 (0.57-0.83) 0.97 (0.87-0.99) 13 4589 1037 0.59 (0.48-0.70) 0.96 (0.94-0.97)
15 4 1075 193 0.61 (0.52-0.70) 0.98 (0.96-0.99) 13 4589 1037 0.52 (0.42-0.62) 0.97 (0.96-0.98)
4.3.3. Table 3. Discrepancies in sensitivity and specificity across cutoffs
Cutoff | % of patients with published cutoffs | % of cases with published cutoffs | Difference in sensitivity (traditional - IPD) | Difference in specificity (traditional - IPD)
7 46 53 -0.12 0
8 46 53 -0.14 0
9 34 30 -0.11 -0.01
10 83 70 -0.02 0
11 27 21 0.09 0
12 30 25 0.05 0
13 23 18 0.15 0
14 21 14 0.12 0.01
15 23 19 0.09 0.01
4.3.4. Figure 1. ROC curves produced by IPD meta-analysis using all data vs. traditional meta-analysis using published data only
Note: Numbers within ROC curves indicate each of the PHQ-9 cutoffs between 7 and 15
5. Discussion
5.1. Main finding
For the PHQ-9 screening tool, cutoff 10 is described as the standard cutoff for identifying
cases of major depression (Gilbody et al., 2007; Kroenke et al., 2001; Wittkampf et al., 2007).
Results for this cutoff were reported in almost all of the studies and accuracy results using
traditional meta-analysis were similar to those using IPD meta-analysis. Results for other cutoffs
were published more haphazardly, and tended to be included when they bordered the particular
study’s optimal cutoff. For the other cutoffs, even those close to 10, less than half of original
studies published accuracy results. While specificity results were similar across the two meta-
analytic methods, sensitivity results were quite discrepant. At cutoff scores above 10, sensitivity
was exaggerated in traditional meta-analysis compared to IPD, while at cutoff scores below 10,
sensitivity was underestimated in traditional meta-analysis compared to IPD. ROC curves using
IPD data were realistic, with specificity increasing and sensitivity decreasing as cutoff scores
increased. ROC curves using the traditional meta-analysis results, however, were implausible,
with sensitivity appearing to increase between cutoffs 9 and 11.
With respect to sensitivity fluctuating more than specificity, the number of MDD cases
decreases substantially as sample sizes decrease (i.e., going from all data to just published data);
thus, estimates of sensitivity can fluctuate greatly. There are substantially more non-cases than
cases, however, so even as sample sizes decrease, estimates of specificity remain stable.
Additionally, while PHQ total scores tend to be normally distributed among MDD cases, PHQ
scores among non-cases tend to be positively skewed, with most of them scoring well below any
of the cutoffs of interest. Thus, as cutoff scores change, most non-cases remain below the cutoff
of interest, while cases can fluctuate on either side of the threshold. Goodacre et al. (2005)
compared studies with data-driven cutoffs to studies where cutoffs were chosen a priori and
found a similar pattern: studies with data-driven cutoffs were more likely to report higher
values of sensitivity than studies with cutoffs chosen a priori; specificity, however, did not
differ significantly between the two groups of studies.
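The distributional argument above can be illustrated with a small simulation. All parameters here are invented for illustration, not taken from the thesis data; the point is only the qualitative pattern:

```python
import random

# Toy simulation of the distributional argument: cases' scores roughly
# normal around the cutoff region, non-cases' scores positively skewed and
# mostly far below it. Moving the cutoff then shifts sensitivity
# noticeably while specificity barely changes.

random.seed(1)
cases = [min(27, max(0, round(random.gauss(13, 4)))) for _ in range(200)]
noncases = [min(27, round(random.expovariate(0.5))) for _ in range(1800)]

def sens_spec(cutoff):
    """Sensitivity and specificity of 'score >= cutoff' in the toy sample."""
    sens = sum(s >= cutoff for s in cases) / len(cases)
    spec = sum(s < cutoff for s in noncases) / len(noncases)
    return round(sens, 2), round(spec, 2)

for c in (8, 10, 12):
    # sensitivity drops steadily across cutoffs; specificity barely moves
    print(c, sens_spec(c))
```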
5.2. Patterns in selective outcome reporting
Generally, the sum of sensitivity and specificity for a study’s optimal cutoff ranged
between 1.6 and 1.8, regardless of which particular cutoff was optimal. What mainly changed
across studies was the “optimal” cutoff rather than overall level of accuracy. Most studies
reported results for the standard cutoff of 10, but aside from that, patterns reflected one of the
following 3 categorizations: (1) some studies (Gjerdingen et al., 2009; Lotrakul et al., 2008;
Williams et al., 2005) published results for the standard cutoff only, or for a range of cutoffs
surrounding the standard cutoff (neutral cutoff reporting); (2) other studies (Azah et al., 2005;
Lamers et al., 2008; Stafford et al., 2007; Thombs et al., 2008) published results for low cutoffs
that tended to go from the optimal cutoff towards the standard (low cutoff reporting); and (3)
other studies (de Lima Osorio et al., 2009; Fann et al., 2005; Gilbody et al., 2007; Gräfe et al.,
2004; Wittkampf et al., 2009) published results for high cutoffs that tended to go from the
optimal cutoff towards the standard (high cutoff reporting). One of the studies (Yeung et al.,
2008) did not fit any of the above 3 categories; however, this was a study in which the original
report had incorrectly presented the data in its 2x2 diagnostic accuracy table.
The studies with the low cutoff reporting trend were studies where the PHQ-9 was
poorly sensitive, which resulted in low optimal cutoffs. Reporting tended to include results for
the optimal cutoff and higher cutoffs up until 10, all of which had lower sensitivity than the
optimal cutoff. In addition, each cutoff between the optimal and 10 had lower sensitivity than the
sensitivity for the respective cutoff computed using IPD meta-analysis. The studies with the high
cutoff reporting trend, on the other hand, were studies where the PHQ-9 was highly sensitive,
which resulted in high optimal cutoffs. Reporting tended to include results for the optimal cutoff
and lower cutoffs until 10, all of which had higher sensitivity than the optimal cutoff. In addition,
each cutoff between the optimal and 10 had higher sensitivity than the sensitivity for the
respective cutoff computed using IPD meta-analysis.
Studies publishing with the neutral reporting trend included results on both sides of the
optimal cutoff and would not be expected to introduce bias into meta-analysis. Studies
publishing with either the low or high cutoff reporting trend, however, could potentially
introduce bias in meta-analysis. This is because in each study, only results for a range of cutoffs
in the better performing half of the entire range of possible cutoffs are published, which may
explain why sensitivity curves can be flattened or even inverted when there is partial reporting.
5.3. Clinical significance and implications
Depression is a chronic and disabling condition that is the leading global cause of years
lived with disability (Lopez et al., 2006; Mathers et al., 2006; Moussavi et al., 2007;
Whiteford et al., 2013). Depression that is not adequately identified and treated is a robust
indicator of poor prognosis among patients with chronic medical comorbidity, of long-term
mental health problems in children and adolescents, and of poor child and family outcomes
among pregnant and postpartum women (Evans et al., 2005; Fergusson & Woodward, 2002;
Shaffer et al., 1996; Weissman et al., 1999; Whitaker et al., 1990; Whitley & Kirmayer, 2008;
Williams et al., 2009; Zelkowitz & Milet, 1996; Zelkowitz & Milet, 2001). Most people with
depression, however, do not receive adequate care (Duhoux et al., 2009; Duhoux et al., 2013).
The development and implementation of effective depression identification and management
programs is an urgent priority, both in Canada (Mental Health Commission, 2012) and
internationally (Ngo et al., 2013). Depression screening has been recommended as a solution, but
guidelines and recommendations are sometimes made without full consideration of evidence or
clinical practice realities (Sniderman & Furberg, 2009).
Studies that have examined the diagnostic accuracy of depression screening tools
typically use data-driven, exploratory methods to select optimal cutoffs, and tend to report results
only from a range of cutoff points around whichever cutoff score worked best. As a result,
traditional meta-analyses, which rely on published data, can grossly exaggerate estimates of
diagnostic accuracy, and will not necessarily allow the determination of the best cutoff for
screening. Exaggeration of screening accuracy may lead to misguided enthusiasm for screening
without any real evidence of benefit, and thus to overdiagnosis. When the potential efficacy of
screening is exaggerated, screening is more likely to be implemented, increasing the number of
patients identified and/or treated without any evidence of improved health.
5.4. Implications for research
Studies on the diagnostic accuracy of screening tests may report results from a single
cutoff or several different possible cutoffs. The STAndards for the Reporting of Diagnostic
accuracy studies (STARD) statement on reporting results from these kinds of studies does not
provide guidance on the range of cutoffs for which results should be reported (Bossuyt et al.,
2003). Researchers should routinely report diagnostic accuracy results for all cutoffs. This is
important because primary studies of diagnostic test accuracy are often conducted in relatively
small samples with small numbers of cases, and meta-analytic syntheses may be needed to
confidently assess test accuracy and to ascertain the most appropriate cutoff score for
determining positive test status. For screening tools that include ordinal cutoffs, as is the case for
depression screening tools, guidelines such as STARD should consider including the reporting of
results for all cutoffs as part of their requirements.
Because research is what leads to recommendations, it is imperative that analyses be
conducted in a way that limits the potential for bias. IPD meta-analysis provides another
mechanism to obtain less biased estimates of accuracy. The quality of IPD meta-analyses,
however, depends on the ability to obtain primary data (Riley et al., 2010).
5.5. Limitations
As we were interested in the differences in results produced by including all cutoffs
versus including only published cutoffs, we used the cleaned raw data for both sets of analyses
rather than using the raw data for the IPD meta-analysis and published summary results for the
traditional meta-analysis. Similarly, since the objective of our study was to examine bias, and not
determine the true diagnostic accuracy of the screening tool, we analyzed the entire IPD
dataset as a whole, and consistent with the original meta-analysis, did not conduct any moderator
analyses.
A downside of the meta-analytic approach used is that the pooled estimates at each cutoff
were highly correlated with each other. This had little impact on the current study, since
we were not seeking to determine the true diagnostic accuracy of the screening tool, but it will
need to be considered when conducting more conventional IPD meta-analyses.
We were unable to acquire data for 3 of the 16 eligible studies included in the original
meta-analysis; however, we did obtain a very high percentage of the data (80% of the eligible
patients and 94% of the eligible cases). Finally, this was only one example with a relatively small
number of included studies, and will need to be replicated in other comparisons.
6. Summary and Final Conclusion
6.1. Summary
When most of the data are published in the original studies, diagnostic accuracy results
using traditional meta-analytic methods and IPD methods are very similar. However, when not
all of the data are reported, there can be considerable discrepancies. For the standard PHQ-9 cutoff
score of 10, most studies published accuracy results in the original report and accuracy results
using traditional meta-analysis were similar to those using IPD meta-analysis. For each of the
other cutoffs, however, less than half of the studies published accuracy results for the particular
cutoff in the original reports and results using the two different meta-analytic methods were more
discrepant.
Some studies are more sensitive than others, meaning that across the spectrum of possible
cutoff scores, the percentage of truly depressed patients who score above each particular cutoff is
higher than average. On the other hand, some studies are not very sensitive, thus the cutoff to
identify a likely depression case must be moved lower than usual in order to catch the same
percentage of truly depressed patients as a more sensitive study. The more sensitive a study is,
the higher the optimal cutoff, while the less sensitive a study is, the lower the optimal cutoff.
Studies with a low cutoff reporting trend tend to be studies where low cutoffs are optimal,
whereas studies with the high cutoff reporting trend tend to be studies where high cutoffs are
optimal. In comparison to the true sensitivity values computed by combining results from all
studies, highly sensitive studies (i.e. high optimal cutoffs) tend to overestimate sensitivity at each
cutoff, while less sensitive studies (i.e. low optimal cutoffs) tend to underestimate sensitivity at
each cutoff.
In IPD meta-analysis, results for all cutoffs for all studies are included, thus the over-
estimation of sensitivities in some studies and the underestimation in other studies balance out
overall. In traditional meta-analyses, however, where only high performing cutoffs tend to be
reported, and only published results can be included in meta-analysis, sensitivity appears to be
underestimated for low cutoffs and overestimated for high cutoffs, giving rise to flattened or
even inverted ROC curves.
6.2. Conclusion
Current meta-analysis methods seem to exaggerate the diagnostic accuracy of depression
screening tools, especially for cutoffs that are not standard or recommended. Sensitivity is more
severely affected by the selective reporting of results than specificity, due to the smaller
proportion of MDD cases than non-cases in most samples and the fact that PHQ scores tend to be
normally distributed among cases, with most of them lying in the range of cutoffs of interest, but
positively skewed among non-cases, with most of them scoring well below any of the cutoffs of
interest. Sensitivity tends to be underestimated for cutoffs below the standard and overestimated
for cutoffs above the standard.
The degree of exaggeration of accuracy estimates depends on the degree of selective
reporting. IPD meta-analysis provides a mechanism to obtain realistic estimates of depression
screening tool accuracy, which currently appears to be substantially exaggerated. Researchers
should routinely report diagnostic accuracy results for all cutoffs. Guidelines such as STARD
should consider including the reporting of results for all cutoffs as part of their requirements.
7. References
Adewuya, A. O., Ola, B. A., & Afolabi, O. O. (2006). Validity of the Patient Health
Questionnaire (PHQ-9) as a screening tool for depression amongst Nigerian university
students. Journal of Affective Disorders, 96(1-2), 89-93.
Ahmed, I., Sutton, A. J., & Riley, R. D. (2012). Assessment of publication bias, selection bias,
and unavailable data in meta-analyses using individual participant data: A database survey.
BMJ (Clinical Research Ed.), 344, d7762.
Altman, D. G., & Bland, J. M. (1994). Diagnostic tests. 1: Sensitivity and specificity. BMJ
(Clinical Research Ed.), 308(6943), 1552.
Azah, N., Shah, M., Juwita, S., Bahri, S., Rushidi, W. M., & Jamil, M. (2005). Validation of the
Malay version brief Patient Health Questionnaire (PHQ-9) among adults attending family
medicine clinics. International Medical Journal, 12, 259-63.
Bossuyt, P. M., Reitsma, J. B., Bruns, D. E., Gatsonis, C. A., Glasziou, P. P., Irwig, L. M., et al.
(2003). Towards complete and accurate reporting of studies of diagnostic accuracy: The
STARD initiative. BMJ (Clinical Research Ed.), 326(7379), 41-4.
Brennan, C., Worrall-Davies, A., McMillan, D., Gilbody, S., & House, A. (2010). The Hospital
Anxiety and Depression Scale: A diagnostic meta-analysis of case-finding ability. Journal
of Psychosomatic Research, 69(4), 371-8.
Brenner, H., & Gefeller, O. (1997). Variation of sensitivity, specificity, likelihood ratios and
predictive values with disease prevalence. Statistics in Medicine, 16(9), 981-91.
Commission on Chronic Illness (1957). Chronic illness in the United States: Volume I.
Prevention of chronic illness. Cambridge, Mass: Harvard University Press. p.45
de Lima Osorio, F., Vilela Mendes, A., Crippa, J. A., & Loureiro, S. R. (2009). Study of the
discriminative validity of the PHQ-9 and PHQ-2 in a sample of Brazilian women in the
context of primary health care. Perspectives in Psychiatric Care, 45(3), 216-27.
Duhoux, A., Fournier, L., Gauvin, L., & Roberge, P. (2013). What is the association between
quality of treatment for depression and patient outcomes? A cohort study of adults
consulting in primary care. Journal of Affective Disorders, 151(1), 265-74.
Duhoux, A., Fournier, L., Nguyen, C. T., Roberge, P., & Beveridge, R. (2009). Guideline
concordance of treatment for depressive disorders in Canada. Social Psychiatry and
Psychiatric Epidemiology, 44(5), 385-92.
Evans, D. L., Charney, D. S., Lewis, L., Golden, R. N., Gorman, J. M., Krishnan, K. R., et al.
(2005). Mood disorders in the medically ill: Scientific review and recommendations.
Biological Psychiatry, 58(3), 175-89.
Ewald, B. (2006). Post hoc choice of cut points introduced bias to diagnostic research. Journal of
Clinical Epidemiology, 59(8), 798-801.
Fann, J. R., Bombardier, C. H., Dikmen, S., Esselman, P., Warms, C. A., Pelzer, E., et al. (2005).
Validity of the Patient Health Questionnaire-9 in assessing depression following traumatic
brain injury. The Journal of Head Trauma Rehabilitation, 20(6), 501-11.
Fergusson, D. M., & Woodward, L. J. (2002). Mental health, educational, and social role
outcomes of adolescents with depression. Archives of General Psychiatry, 59(3), 225-31.
Gilbody, S., Richards, D., Brealey, S., & Hewitt, C. (2007). Screening for depression in medical
settings with the Patient Health Questionnaire (PHQ): A diagnostic meta-analysis. Journal
of General Internal Medicine, 22(11), 1596-602.
Gjerdingen, D., Crow, S., McGovern, P., Miner, M., & Center, B. (2009). Postpartum depression
screening at well-child visits: Validity of a 2-question screen and the PHQ-9. Annals of
Family Medicine, 7(1), 63-70.
Goodacre, S., Sampson, F. C., Sutton, A. J., Mason, S., & Morris, F. (2005). Variation in the
diagnostic performance of D-dimer for suspected deep vein thrombosis. QJM, 98(7), 513-
27.
Gräfe, K., Zipfel, S., Herzog, W., et al. (2004). Screening for psychiatric disorders with the
Patient Health Questionnaire (PHQ). Results from the German validation study.
Diagnostica, 50, 171-81.
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver
operating characteristic (ROC) curve. Radiology, 143(1), 29-36.
Hewitt, C., Gilbody, S., Brealey, S., Paulden, M., Palmer, S., Mann, R., et al. (2009). Methods to
identify postnatal depression in primary care: An integrated evidence synthesis and value of
information analysis. Health Technology Assessment (Winchester, England), 13(36), 1-145,
147-230.
Joffres, M., Jaramillo, A., Dickinson, J., Lewin, G., Pottie, K., Shaw, E., et al. (2013).
Recommendations on screening for depression in adults. CMAJ, 185(9), 775-82.
Kroenke, K., & Spitzer, R. L. (2002). The PHQ-9: A new depression diagnostic and severity
measure. Psychiatric Annals, 32, 1-7.
Kroenke, K., Spitzer, R. L., & Williams, J. B. (2001). The PHQ-9: Validity of a brief depression
severity measure. Journal of General Internal Medicine, 16(9), 606-13.
Lamers, F., Jonkers, C. C., Bosma, H., Penninx, B. W., Knottnerus, J. A., & van Eijk, J. T.
(2008). Summed score of the Patient Health Questionnaire-9 was a reliable and valid
method for depression screening in chronically ill elderly patients. Journal of Clinical
Epidemiology, 61(7), 679-87.
Leeflang, M. M., Moons, K. G., Reitsma, J. B., & Zwinderman, A. H. (2008). Bias in sensitivity
and specificity caused by data-driven selection of optimal cutoff values: Mechanisms,
magnitude, and solutions. Clinical Chemistry, 54(4), 729-37.
Li, J., & Fine, J. P. (2011). Assessing the dependence of sensitivity and specificity on prevalence
in meta-analysis. Biostatistics (Oxford, England), 12(4), 710-22.
Lopez, A. D., Mathers, C. D., Ezzati, M., Jamison, D. T., & Murray, C. J. (2006). Global and
regional burden of disease and risk factors, 2001: Systematic analysis of population health
data. Lancet, 367(9524), 1747-57.
Lotrakul, M., Sumrithe, S., & Saipanish, R. (2008). Reliability and validity of the Thai version of
the PHQ-9. BMC Psychiatry, 8, 46.
MacMillan, H. L., Patterson, C. J., Wathen, C. N., Feightner, J. W., Bessette, P., Elford, R. W., et
al. (2005). Screening for depression in primary care: Recommendation statement from the
Canadian Task Force on Preventive Health Care. CMAJ, 172(1), 33-5.
Manea, L., Gilbody, S., & McMillan, D. (2012). Optimal cut-off score for diagnosing depression
with the Patient Health Questionnaire (PHQ-9): A meta-analysis. CMAJ, 184(3), E191-6.
Mathers, C. D., Lopez, A. D., & Murray, C. J. L. (2006). The burden of disease and mortality by
condition: Data, methods, and results for 2001. In A. D. Lopez, C. D. Mathers, M. Ezzati, D.
T. Jamison & C. J. L. Murray (Eds.), Global burden of disease and risk factors. Washington
(DC): The International Bank for Reconstruction and Development/The World Bank Group.
Meader, N., Mitchell, A. J., Chew-Graham, C., Goldberg, D., Rizzo, M., Bird, V., et al. (2011).
Case identification of depression in patients with chronic physical health problems: A
diagnostic accuracy meta-analysis of 113 studies. The British Journal of General Practice:
The Journal of the Royal College of General Practitioners, 61(593), e808-20.
Meader, N., Moe-Byrne, T., Llewellyn, A., & Mitchell, A. J. (2014). Screening for poststroke
major depression: A meta-analysis of diagnostic validity studies. Journal of Neurology,
Neurosurgery, and Psychiatry, 85(2), 198-206.
Meijer, A., Roseman, M., Milette, K., Coyne, J. C., Stefanek, M. E., Ziegelstein, R. C., et al.
(2011). Depression screening and patient outcomes in cancer: A systematic review. PLoS
ONE, 6(11), e27181.
Mental Health Commission of Canada. (2012). Changing directions, changing lives: The mental
health strategy for Canada. Calgary, AB.
Mitchell, A. J. (2008). Are one or two simple questions sufficient to detect depression in cancer
and palliative care? A Bayesian meta-analysis. British Journal of Cancer, 98(12), 1934-43.
Mitchell, A. J., & Coyne, J. C. (2007). Do ultra-short screening instruments accurately detect
depression in primary care? A pooled analysis and meta-analysis of 22 studies. The British
Journal of General Practice: The Journal of the Royal College of General Practitioners,
57(535), 144-51.
Mitchell, A. J., Meader, N., Davies, E., Clover, K., Carter, G. L., Loscalzo, M. J., et al. (2012).
Meta-analysis of screening and case finding tools for depression in cancer: Evidence based
recommendations for clinical practice on behalf of the Depression in Cancer Care consensus
group. Journal of Affective Disorders, 140(2), 149-60.
Mitchell, A. J., Meader, N., & Symonds, P. (2010). Diagnostic validity of the Hospital Anxiety
and Depression Scale (HADS) in cancer and palliative settings: A meta-analysis. Journal of
Affective Disorders, 126(3), 335-48.
Moussavi, S., Chatterji, S., Verdes, E., Tandon, A., Patel, V., & Ustun, B. (2007). Depression,
chronic diseases, and decrements in health: Results from the world health surveys. Lancet,
370(9590), 851-8.
National Collaborating Centre for Mental Health. (2010). The NICE guideline on the
management and treatment of depression in adults (updated edition). UK: National Institute
for Health and Clinical Excellence.
Ngo, V. K., Rubinstein, A., Ganju, V., Kanellis, P., Loza, N., Rabadan-Diehl, C., & Daar, A. S.
(2013). Grand challenges: Integrating mental health care into the non-communicable disease
agenda. PLOS Medicine, 10(5), e1001443.
Pignone, M. P., Gaynes, B. N., Rushton, J. L., Burchell, C. M., Orleans, C. T., Mulrow, C. D., &
Lohr, K. N. (2002). Screening for depression in adults: A summary of the evidence for the
U.S. Preventive Services Task Force. Annals of Internal Medicine, 136(10), 765-76.
Raffle, A., & Gray, M. (2007). Screening: Evidence and practice. UK: Oxford University Press.
Riley, R. D., Dodd, S. R., Craig, J. V., Thompson, J. R., & Williamson, P. R. (2008). Meta-
analysis of diagnostic test studies using individual patient data and aggregate data. Statistics
in Medicine, 27(29), 6111-36.
Riley, R. D., Lambert, P. C., & Abo-Zaid, G. (2010). Meta-analysis of individual participant
data: Rationale, conduct, and reporting. BMJ (Clinical Research Ed.), 340, c221.
Rutjes, A. W., Reitsma, J. B., Di Nisio, M., Smidt, N., van Rijn, J. C., & Bossuyt, P. M. (2006).
Evidence of bias and variation in diagnostic accuracy studies. CMAJ, 174(4), 469-76.
Shaffer, D., Gould, M. S., Fisher, P., Trautman, P., Moreau, D., Kleinman, M., & Flory, M.
(1996). Psychiatric diagnosis in child and adolescent suicide. Archives of General
Psychiatry, 53(4), 339-48.
Sniderman, A. D., & Furberg, C. D. (2009). Why guideline-making requires reform. JAMA,
301(4), 429-31.
Stafford, L., Berk, M., & Jackson, H. J. (2007). Validity of the Hospital Anxiety and Depression
Scale and Patient Health Questionnaire-9 to screen for depression in patients with coronary
artery disease. General Hospital Psychiatry, 29(5), 417-24.
Stewart, L. A., Tierney, J. F., & Clarke, M. (2011). Chapter 18: Reviews of individual patient
data. In J. P. T. Higgins, & S. Green (Eds.), Cochrane handbook for systematic reviews of
interventions (Version 5.1.0 ed.). Cochrane Collaboration.
Swets, J. A. (1988). Measuring the accuracy of diagnostic systems. Science (New York, N.Y.),
240(4857), 1285-93.
Thombs, B. D., Arthurs, E., El-Baalbaki, G., Meijer, A., Ziegelstein, R. C., & Steele, R. J.
(2011). Risk of bias from inclusion of patients who already have diagnosis of or are
undergoing treatment for depression in diagnostic accuracy studies of screening tools for
depression: Systematic review. BMJ (Clinical Research Ed.), 343, d4825.
Thombs, B. D., Coyne, J. C., Cuijpers, P., de Jonge, P., Gilbody, S., Ioannidis, J. P., et al. (2012).
Rethinking recommendations for screening for depression in primary care. CMAJ, 184(4),
413-8.
Thombs, B. D., Ziegelstein, R. C., Roseman, M., Kloda, L. A., & Ioannidis, J. P. (2014). There
are no randomized controlled trials that support the United States Preventive Services Task
Force guideline on screening for depression in primary care: A systematic review. BMC
Medicine, 12, 13.
Thombs, B. D., Ziegelstein, R. C., & Whooley, M. A. (2008). Optimizing detection of major
depression among patients with coronary artery disease using the Patient Health
Questionnaire: Data from the Heart and Soul Study. Journal of General Internal Medicine,
23(12), 2014-7.
Thompson, I. M., Chi, C., Ankerst, D. P., Goodman, P. J., Tangen, C. M., Lippman, S. M., et al.
(2006). Effect of finasteride on the sensitivity of PSA for detecting prostate cancer. Journal
of the National Cancer Institute, 98(16), 1128-33.
U.S. Preventive Services Task Force. (2009). Screening for depression in adults: U.S. Preventive
Services Task Force recommendation statement. Annals of Internal Medicine, 151(11), 784-
92.
UK National Screening Committee. (2000). Second report of the UK National Screening
Committee. Departments of Health for England, Scotland, Northern Ireland and Wales.
Vodermaier, A., & Millman, R. D. (2011). Accuracy of the Hospital Anxiety and Depression
Scale as a screening tool in cancer patients: A systematic review and meta-analysis.
Supportive Care in Cancer, 19(12), 1899-908.
Wancata, J., Alexandrowicz, R., Marquart, B., Weiss, M., & Friedrich, F. (2006). The criterion
validity of the Geriatric Depression Scale: A systematic review. Acta Psychiatrica
Scandinavica, 114(6), 398-410.
Watnick, S., Wang, P. L., Demadura, T., & Ganzini, L. (2005). Validation of 2 depression
screening tools in dialysis patients. American Journal of Kidney Diseases, 46(5), 919-24.
Weissman, M. M., Wolk, S., Goldstein, R. B., Moreau, D., Adams, P., Greenwald, S., et al.
(1999). Depressed adolescents grown up. JAMA, 281(18), 1707-13.
Whitaker, A., Johnson, J., Shaffer, D., Rapoport, J. L., Kalikow, K., Walsh, B. T., et al. (1990).
Uncommon troubles in young people: Prevalence estimates of selected psychiatric disorders
in a nonreferred adolescent population. Archives of General Psychiatry, 47(5), 487-96.
Whiteford, H. A., Degenhardt, L., Rehm, J., Baxter, A. J., Ferrari, A. J., Erskine, H. E., et al.
(2013). Global burden of disease attributable to mental and substance use disorders:
Findings from the Global Burden of Disease Study 2010. Lancet, 382(9904), 1575-86.
Whiting, P., Rutjes, A. W., Reitsma, J. B., Glas, A. S., Bossuyt, P. M., & Kleijnen, J. (2004).
Sources of variation and bias in studies of diagnostic accuracy: A systematic review. Annals
of Internal Medicine, 140(3), 189-202.
Whiting, P. F., Rutjes, A. W., Westwood, M. E., Mallett, S., & QUADAS-2 Steering Group.
(2013). A systematic review classifies sources of bias and variation in diagnostic test
accuracy studies. Journal of Clinical Epidemiology, 66(10), 1093-104.
Whitley, R., & Kirmayer, L. J. (2008). Perceived stigmatisation of young mothers: An
exploratory study of psychological and social experience. Social Science & Medicine
(1982), 66(2), 339-48.
Williams, L. S., Brizendine, E. J., Plue, L., Bakas, T., Tu, W., Hendrie, H., & Kroenke, K.
(2005). Performance of the PHQ-9 as a screening tool for depression after stroke. Stroke; a
Journal of Cerebral Circulation, 36(3), 635-8.
Williams, S. B., O'Connor, E. A., Eder, M., & Whitlock, E. P. (2009). Screening for child and
adolescent depression in primary care settings: A systematic evidence review for the US
Preventive Services Task Force. Pediatrics, 123(4), e716-35.
Wilson, J. M., & Jungner, G. (1968). Principles and practice of screening for disease. Geneva:
World Health Organization.
Wittkampf, K., van Ravesteijn, H., Baas, K., van de Hoogen, H., Schene, A., Bindels, P., et al.
(2009). The accuracy of Patient Health Questionnaire-9 in detecting depression and
measuring depression severity in high-risk groups in primary care. General Hospital
Psychiatry, 31(5), 451-9.
Wittkampf, K. A., Naeije, L., Schene, A. H., Huyser, J., & van Weert, H. C. (2007). Diagnostic
accuracy of the mood module of the Patient Health Questionnaire: A systematic review.
General Hospital Psychiatry, 29(5), 388-95.
Yeung, A., Fung, F., Yu, S. C., Vorono, S., Ly, M., Wu, S., & Fava, M. (2008). Validation of the
Patient Health Questionnaire-9 for depression screening among Chinese Americans.
Comprehensive Psychiatry, 49(2), 211-7.
Zelkowitz, P., & Milet, T. H. (1996). Postpartum psychiatric disorders: Their relationship to
psychological adjustment and marital satisfaction in the spouses. Journal of Abnormal
Psychology, 105(2), 281-5.
Zelkowitz, P., & Milet, T. H. (2001). The course of postpartum psychiatric disorders in women
and their partners. The Journal of Nervous and Mental Disease, 189(9), 575-82.