WHO Handbook for Guideline Development, 2nd edition
17. Developing guideline recommendations for tests or diagnostic tools
17.1 Introduction
17.2 Evidence pertaining to diagnostic tests
17.2.1 What is a diagnostic accuracy study?
17.2.2 What is a diagnostic test randomised controlled trial (D-RCT)?
17.2.3 Diagnostic accuracy studies or diagnostic randomised controlled trials?
17.3 Formulating diagnostic test questions
17.4 Systematic reviews of diagnostic accuracy studies
17.5 GRADE evidence profiles for diagnostic tests
17.5.1 What is GRADE for diagnostic tests?
17.5.2 GRADE evidence profiles for diagnostic tests
17.6 Evidence-to-decision frameworks for diagnostic tests
17.6.1 What are evidence-to-decision frameworks for diagnostic tests?
17.6.2 Components of evidence-to-decision frameworks for diagnostic tests
17.6.3 When is evidence from test accuracy studies sufficient to develop a recommendation?
17.7 Useful resources
Acknowledgements
References
17.1 Introduction
The prior chapters of the WHO Handbook for Guideline Development (1) have addressed developing recommendations for interventions. While tests can be considered interventions, research concerning tests typically takes the form of accuracy studies (2,3). This chapter therefore addresses how to develop guideline recommendations for tests from accuracy studies when direct evidence about a test’s effect on patient-important outcomes is lacking. The chapter is applicable to screening, monitoring and diagnostic tests, but for clarity we will refer to tests as “diagnostic tests” and test accuracy studies as “diagnostic accuracy studies”.
17.2 Evidence pertaining to diagnostic tests
17.2.1 What is a diagnostic accuracy study?
A diagnostic accuracy study determines the ability of a particular test (the index test) to correctly classify a patient as having or not having the disease, compared with a reference standard test. The reference standard test (sometimes referred to as the gold standard test) should be the test that best defines a person as diseased or not diseased (with ethical and feasibility considerations in mind). In a diagnostic accuracy study, a defined target population undergoes an index test as well as a reference standard test. The number of true positives, false positives, true negatives and false negatives is then calculated (Figure 1). From these counts, the sensitivity and specificity of a test can be calculated, as can other measures of diagnostic accuracy such as positive (PPV) and negative predictive values (NPV) and diagnostic odds ratios. Note that PPV and NPV, unlike sensitivity and specificity, are affected by the prevalence of a disease.
■ “True positives” refers to the number of people that the index test correctly identified as having the condition (i.e. the number of participants that tested positive with both the index test and reference standard).
■ “False positives” refers to the number of people that the index test incorrectly identified as having the condition (i.e. the number of people that tested positive with the index test but negative with the reference standard).
■ “True negatives” refers to the number of people that the index test correctly identified as not having the condition (a negative result with both the index and the reference standard test).
■ “False negatives” refers to the number of people that the index test incorrectly identified as not having the condition (a negative result on the index test but positive with the reference standard).
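The relationships above can be sketched in code. The illustrative Python functions below (not from the handbook) compute sensitivity, specificity, PPV and NPV from the 2×2 counts, and use Bayes’ theorem to show why PPV and NPV shift with prevalence while sensitivity and specificity do not:

```python
def accuracy_metrics(tp, fp, tn, fn):
    """Standard diagnostic accuracy measures from a 2x2 table."""
    sensitivity = tp / (tp + fn)  # proportion of diseased correctly identified
    specificity = tn / (tn + fp)  # proportion of non-diseased correctly identified
    ppv = tp / (tp + fp)          # probability of disease given a positive result
    npv = tn / (tn + fn)          # probability of no disease given a negative result
    return sensitivity, specificity, ppv, npv

def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV recomputed for a given prevalence (Bayes' theorem):
    unlike sensitivity and specificity, they change as prevalence changes."""
    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    npv = (specificity * (1 - prevalence)) / (
        specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)
    return ppv, npv
```

For example, a test with 90% sensitivity and 90% specificity has a PPV of 90% when prevalence is 50%, but only about 8% when prevalence is 1%.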
It should be noted that, unlike a randomised controlled trial (RCT), a diagnostic accuracy study contains only one participant group and is not randomised: all participants undergo both the index and the reference standard test. (Some diagnostic accuracy studies do not subject all participants to both tests, but such studies are prone to verification bias (4).) Often, however, the participant group may be further stratified into subgroups (e.g. children, HIV-positive patients), which is useful for assessing the sensitivity and specificity of a test in specific populations.
Figure 1. Study design of diagnostic accuracy studies
17.2.2 What is a diagnostic test randomised controlled trial (D-RCT)?
Diagnostic tests can also be assessed as interventions in RCTs where participants are randomised to receive or not to receive a test. Thus, as in all RCTs, there are two or more groups, and patient-important outcomes such as mortality and morbidity are assessed. In this way, the outcomes generated by D-RCTs differ from the outcomes generated in diagnostic accuracy studies (sensitivity and specificity). D-RCTs can randomise participants to a test, such as randomising participants with low back pain to an x-ray or no x-ray (5). Alternatively, D-RCTs can randomise participants to a test-treatment strategy, such as randomising participants with heart failure to B-type natriuretic peptide (BNP)-guided management or to usual management (6). Further details of D-RCTs are beyond the scope of this chapter; more information is available in other sources (7,8).
17.2.3 Diagnostic accuracy studies or diagnostic randomised controlled trials?
Diagnostic accuracy studies generate measures of the accuracy of a test to diagnose a target disease, such as sensitivity and specificity, whereas D-RCTs assess the effectiveness of a test on patient-important outcomes. So what evidence should we use as the basis for guideline recommendations? Ideally, diagnostic test recommendations should be based on D-RCTs as they are the optimal way to assess a diagnostic strategy (9). When such studies are available, evidence synthesis techniques and the Grading of Recommendations Assessment, Development and Evaluation (GRADE) system for interventions should be used to formulate recommendations (see Chapters 7 to 10 of the WHO Handbook for Guideline Development (1)). Unfortunately, D-RCTs are rare, whereas diagnostic accuracy studies are common (2,3). Thus, this chapter focuses on the more common situation of basing guideline recommendations primarily on diagnostic accuracy studies. It should also be noted that even when D-RCTs are available, diagnostic accuracy studies are often useful for developing recommendations; when available, they should also be identified and their data synthesised. Using both D-RCT data and data from diagnostic accuracy studies is advantageous, as health professionals often want information on the accuracy of a test, and the ability to accurately define disease is an important part of many treatment guidelines.
17.3 Formulating diagnostic test questions
The first step in the production of a guideline recommendation is to define the question to be addressed by the recommendation. This question should reflect clinical or public health uncertainty. When developing a recommendation for diagnostic tests, both a PICO (Population, Intervention, Comparator, Outcome) question and a PIRT (Participants, Index Test(s), Reference Standard, Target condition) question should be developed. Structuring questions in the PICO or PIRT format forms the basis of the search strategies for D-RCTs and for diagnostic test accuracy studies, respectively. Furthermore, at this stage, guideline developers should classify their outcomes as either “critical” or “important but not critical” for their guideline. For guidelines on diagnostic tests, guideline developers should consider accuracy (sensitivity and specificity), as well as the potential side effects of tests and other relevant considerations (e.g. time to conduct the test, time to get results, cost of the test). Note that guideline developers should also consider patient-important outcomes such as mortality: these outcomes are typically determined from intervention studies (D-RCTs and observational studies), the methods for which are covered elsewhere in the WHO Handbook for Guideline Development (1).
A PICO question should always be formulated because, ideally, diagnostic test recommendations should be based on D-RCT evidence comparing the effect of a test as an intervention; this helps focus the recommendation on patient-important outcomes rather than only on test accuracy. For instance, a test may be accurate but too invasive, too costly or take too long to generate results to recommend it. Frequently, however, there is no D-RCT evidence to support test recommendations and thus diagnostic accuracy studies are used. This chapter focuses on the production of guideline recommendations from diagnostic accuracy studies and will thus focus on PIRT and on developing guideline recommendations based exclusively on diagnostic accuracy studies.
An example of a diagnostic test question in PICO format is: “In patients with smear-positive TB, does the use of line probe assays to diagnose drug resistance lead to lower mortality compared with conventional culture-based drug-susceptibility testing?”. The corresponding PICO is: P, patients with smear-positive TB; I, line probe assays; C, culture-based drug-susceptibility testing; and O, mortality.
An example of a diagnostic accuracy question in PIRT format is: “In patients with smear-positive tuberculosis, are line probe assays better than culture-based drug-susceptibility tests at diagnosing multidrug-resistant TB (MDR-TB)?”. The corresponding PIRT is: P, patients with smear-positive TB; I, line probe assays; R, culture-based drug-susceptibility testing; and T, MDR-TB.
17.4 Systematic reviews of diagnostic accuracy studies
Upon formulation of the PICO and PIRT questions, a systematic review is performed to identify, appraise and (if appropriate) meta-analyse estimates of sensitivity and specificity. The steps to complete a systematic review of diagnostic accuracy studies follow the same structure as those used to complete a systematic review of intervention studies. Reporting guidelines for systematic reviews are available (10), as is a structured tool to assess the quality of a systematic review (ROBIS) (11), and these are applicable to reviews of diagnostic accuracy studies. Note that the AMSTAR-2 tool used to assess the quality of systematic reviews of interventions is not recommended for diagnostic accuracy systematic reviews (12).
There are important considerations when conducting a systematic bibliographic database search for diagnostic accuracy studies. Firstly, search filters that are often applied to systematic searches are not advised, as they can miss a considerable number of relevant studies (13). Other important considerations include consulting with clinical and methodological topic experts to generate key search terms, involving a medical librarian, identifying PubMed MeSH terms, and comparing one’s draft search strategy to a search strategy used in published systematic reviews. Ideally, no language restrictions should be placed on the search; unpublished (including trial registers) and grey literature, ongoing studies, as well as conference abstracts should be sought; and authors of primary studies should be contacted if questions or concerns arise in data extraction or assessment. As for any systematic review, studies and data should only be included if there is sufficient information to permit assessment of their risk of bias. Importantly, systematic reviews of diagnostic test accuracy should follow the PRISMA checklist for reporting diagnostic test accuracy reviews (14).
17.5 GRADE evidence profiles for diagnostic tests
17.5.1 What is GRADE for diagnostic tests?
GRADE was originally developed to assess evidence pertaining to interventions; subsequently, the GRADE Working Group developed a system for assessing the quality of evidence pertaining to diagnostic test accuracy (9). Broadly, the GRADE approach is a structured way to assess the quality (certainty) of a body of evidence addressing a specific question. In the GRADE approach for interventions, RCTs start as high-quality evidence, whereas observational studies start as low-quality evidence (15); the quality of evidence is then downgraded or upgraded according to specific criteria (see Chapter 9 of the WHO Handbook for Guideline Development (1)). In the GRADE approach for diagnostic accuracy studies, the body of evidence starts as high quality (certainty), although such studies provide only indirect evidence for patient-important outcomes (9). This is different from GRADE for interventions, where observational studies start as low-quality evidence. This chapter will outline guidance for WHO staff using the GRADE system for assessing the quality of evidence pertaining to diagnostic test accuracy.
17.5.2 GRADE evidence profiles for diagnostic tests
GRADE evidence profiles are structured, domain-based tables that outline factors that can affect the quality (certainty) of a body of evidence. Importantly, they are outcome-based: they incorporate all the available evidence for specific outcomes. In the case of diagnostic accuracy studies, these outcomes are sensitivity (true positives and false negatives) and specificity (true negatives and false positives). GRADE evidence profiles for diagnostic accuracy follow the same structure as evidence profiles for interventions: they contain summaries of the diagnostic accuracy estimates and the number of studies and patients contributing to these estimates. They also include five domains for consideration of the certainty of the evidence for each outcome: risk of bias, indirectness, inconsistency, imprecision and publication bias. All the domains except publication bias are assessed individually to determine whether very serious, serious or no serious limitations exist. Publication bias is considered either “not detected” or “highly suspected”. If there are no serious limitations across any of the five domains, the quality of evidence is not downgraded. If serious limitations exist in one of the five domains, the evidence is downgraded by one level, from high to moderate. If very serious limitations exist for one domain, or serious limitations are present in two domains, the evidence is downgraded by two levels, from high to low quality. Similarly, if three domains contain serious limitations (or one domain has very serious limitations and another has serious limitations), the evidence is downgraded by three levels, to very low.
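The downgrading arithmetic described above can be summarised in a short illustrative sketch (the real judgement of “serious” versus “very serious” limitations remains qualitative; only the level-counting is mechanical):

```python
LEVELS = ["high", "moderate", "low", "very low"]

def overall_certainty(serious_domains=0, very_serious_domains=0):
    """Apply the downgrading rules: each domain with serious limitations
    costs one level and each with very serious limitations costs two.
    Diagnostic accuracy evidence starts as high certainty and cannot
    fall below "very low"."""
    drop = serious_domains + 2 * very_serious_domains
    return LEVELS[min(drop, len(LEVELS) - 1)]
```

For example, no serious limitations leaves the evidence high; one serious domain yields moderate; two serious domains (or one very serious domain) yield low; and three or more level drops yield very low.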
Although the five domains that decrease the certainty of the effect estimates from test accuracy studies are the same as those that affect the certainty of intervention studies, they require different operationalisation. Further, GRADE evidence profiles for diagnostic accuracy have two groups of outcomes: true positives and false negatives (sensitivity), and true negatives and false positives (specificity) (Figure 2). The following sections outline how to assess the five domains as they pertain to diagnostic test studies.
17.5.2.1 Risk of bias

To assess the risk of bias (also called study limitations) of individual diagnostic accuracy studies, the QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies) tool (17) should be used. This should be completed by the systematic review authors. The tool assesses the risk of bias of individual studies across four domains: participant selection, index test, reference standard, and flow and timing, for both sensitivity (true positives and false negatives) and specificity (true negatives and false positives). The full QUADAS-2 tool also contains an applicability domain; however, this domain should not be used to downgrade evidence when assessing risk of bias, as applicability is covered by the indirectness domain of the GRADE system when the certainty of the body of evidence is assessed. The QUADAS-2 tool contains a series of signalling questions for each domain, which the appraisers use to judge each domain as at high, low or unclear risk of bias.
Results from the risk of bias assessment (of each included study) with the QUADAS-2 tool can be displayed graphically (Figure 3) (16), and used to assess the overall risk of bias for a body of evidence concerning an outcome.
Figure 2. Example of a GRADE evidence profile for diagnostic tests
Image from: World Health Organization guideline: The use of molecular line probe assays for the detection of resistance to isoniazid and rifampicin. 2016 (16).
There are no specific rules for what proportion of low, high or unclear risk of bias assessments with the QUADAS-2 tool constitutes very serious, serious or no serious limitations for the GRADE assessment of the certainty of the body of evidence for each outcome. This is a judgement that needs to be made in the context of individual situations; however, the reasons for the decision need to be explicit, transparent and clear. In Figure 3, for example, the body of evidence was assessed as having serious limitations because: “The sampling of patients (participants) was often not stated”, reflecting the unclear risk of bias in the majority of included studies (approximately 60%) in the patient selection domain of the QUADAS-2 tool (16). In addition, “it was often not stated whether investigators were blinded between the index and reference standard” (unclear risk of bias in the majority of included studies (approximately 70%) in the index test and reference standard domains of the QUADAS-2 tool) (16).
Figure 3. Example of a display of risk of bias assessments across studies
The proportion of individual studies with low, high or unclear risk of bias across the domains. Image from: World Health Organization. WHO Guideline: The use of molecular line probe assays for the detection of resistance to isoniazid and rifampicin. 2016 (16).
17.5.2.2 Indirectness

The assessment of the indirectness of a body of evidence refers to two concepts.
1. How well does the underlying body of evidence match the clinical or public health question (PIRT)?
2. How certain can the reviewer be that the consequences of using the test will lead to improvement in patient-important outcomes?
These two questions should both be considered in the assessment of indirectness; one or both can contribute to downgrading the evidence for indirectness. For instance, if the underlying body of evidence does not match the clinical or public health PIRT, the evidence should be downgraded, even if there are no concerns regarding question 2 (and vice versa). To address the first question, the Guideline Development Group needs to compare the PIRT question they generated in Section 17.3 to the PIRT of the underlying body of evidence (from their systematic search). Differences in the participants, index and/or reference standard tests and setting between the PIRT developed a priori by the guideline developers and the PIRT of the underlying evidence may mean that the results from the underlying evidence do not translate to the intended patients or populations. For instance, suppose a Guideline Development Group commissioned a systematic review to determine the accuracy of brain natriuretic peptide (BNP) for diagnosing heart failure in patients with signs and symptoms of heart failure presenting to general practice, compared with a reference standard of echocardiography. The Guideline Development Group may consider downgrading if the available evidence only included patients with shortness of breath and no other symptoms of heart failure, did not use echocardiography as a reference standard (being careful not to double count this in the risk of bias assessment), or included patients presenting to the emergency department. The evidence may be additionally downgraded if it was generated in a high-resource setting and the guideline is intended for low-resource settings.
It should be noted that PIRT questions that include two or more index tests present an exceptional situation. In these circumstances, an additional concept pertaining to indirectness should be considered: were the two index tests compared directly in the included studies using the same reference standard? If this is not the case, downgrading should occur.
Furthermore, the outcomes from diagnostic accuracy studies, sensitivity and specificity, are inherently indirect: improved diagnostic accuracy does not always translate into improvements for patients or populations (2). Improved diagnostic accuracy can lead to benefits for patients, but this relies on the following assumptions:
■ A test with an increased number of true positives (TP) will benefit patients if effective treatment for the disease is available, the benefits of treatment outweigh the harms and all patients receive the treatment. Also, benefit will only be seen if effective health services exist which are able to deliver the treatment to patients.
■ A test with fewer false negatives (FN) will benefit patients by minimising delayed diagnoses if the natural history of the disease leads to morbidity and mortality.
■ A test with an increased number of true negatives (TN) may benefit patients by providing reassurance and sparing them unnecessary treatment. However, patients are not uniformly reassured by true negative test results (18): they may seek another diagnosis to explain their symptoms.
■ A test with fewer false positive (FP) results will benefit patients if treatment and/or additional testing lead to adverse effects and/or the diagnostic label leads to patient anxiety.
To assess indirectness, the Guideline Development Group needs to determine how confident they are that the above assumptions are true, ideally with evidence to support their decisions. For instance, evidence to address the TP and FP assumptions should come from RCTs of a treatment (and its adverse effects) of the target disease. Evidence to address the FN assumption can come from prognostic (natural history) studies and/or the control arms of RCTs.
When assessing the indirectness of diagnostic accuracy evidence using the GRADE system, guideline panels need to determine whether very serious, serious or no serious indirectness exists by answering the two questions stated at the beginning of this section. Indirectness with respect to either or both of these questions can lead to downgrading the evidence. As is the case for all components of the GRADE system for diagnostic accuracy studies, indirectness needs to be determined for both sensitivity and specificity. In reality, the first indirectness question (how well does the underlying body of evidence match the PIRT question?) is difficult to apply to sensitivity and specificity separately, and thus sensitivity and specificity are often downgraded (or not downgraded) together. However, this is not the case when addressing the second indirectness question, where sensitivity and specificity are often downgraded independently of each other. For instance, if an effective treatment is not available, a test that produces more true positive results will not necessarily translate to improved patient outcomes. This situation would lead to downgrading sensitivity (true positives and false negatives) but would not necessarily lead to downgrading specificity, as the latter depends on the assumptions (as stated above) surrounding the number of true negatives and false positives.
17.5.2.3 Inconsistency

This domain in the GRADE approach refers to an assessment of the inconsistency of effect estimates (in our case, sensitivity and specificity) across included studies. Substantial heterogeneity among different test accuracy studies is common, expected, and generally not considered to be due to chance alone (19). A number of factors contribute to heterogeneity in meta-analyses of diagnostic accuracy studies, chiefly variation in the threshold used to define a participant as having or not having the disease (often referred to as the “positive threshold”) (20). Given the expected heterogeneity, meta-analyses of diagnostic accuracy should be performed with a random-effects model rather than a fixed-effect model (19). For similar reasons, conventional statistics for investigating heterogeneity, such as the I² statistic (21), are infrequently used and are not endorsed by Cochrane (19).
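To illustrate why between-study variance must be accounted for, the sketch below pools logit-transformed sensitivities with the DerSimonian-Laird random-effects estimator. This univariate approach is for illustration only: Cochrane recommends hierarchical models (e.g. the bivariate model) that pool sensitivity and specificity jointly.

```python
import math

def pool_logit_dl(tp_fn_pairs):
    """Illustrative DerSimonian-Laird random-effects pooling of study
    sensitivities on the logit scale (0.5 continuity correction).
    Input: list of (true positive, false negative) counts per study.
    Returns the pooled sensitivity back-transformed to a proportion."""
    ys, vs = [], []
    for tp, fn in tp_fn_pairs:
        tp, fn = tp + 0.5, fn + 0.5        # continuity correction
        ys.append(math.log(tp / fn))       # logit of sensitivity
        vs.append(1 / tp + 1 / fn)         # approximate within-study variance
    w = [1 / v for v in vs]
    y_fixed = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, ys))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(ys) - 1)) / c)   # between-study variance
    w_re = [1 / (v + tau2) for v in vs]        # random-effects weights
    y_re = sum(wi * yi for wi, yi in zip(w_re, ys)) / sum(w_re)
    return 1 / (1 + math.exp(-y_re))           # back-transform to sensitivity
```

When studies disagree, tau² grows and the random-effects weights flatten, so no single study dominates the pooled estimate; a fixed-effect model would ignore this between-study variation.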
Lastly, as is the case for all five components of the GRADE assessment of the certainty of evidence, sensitivity and specificity should be considered and assessed separately.
For the investigation of inconsistency in diagnostic accuracy meta-anal-yses, Guideline Development Groups should address the following questions.
1. How similar are the point estimates across the primary studies?
2. Do the confidence intervals overlap?
To address these questions, Guideline Development Groups can use forest plots. Consider the forest plots in Figure 4. The sensitivity point estimates are all above 72% with overlapping confidence intervals, apart from one outlying study. Conversely, the point estimates for specificity are scattered between 12% and 88%. In this example, specificity should be downgraded for inconsistency. For sensitivity, if there is an apparent reason for the outlying study (e.g. poor quality as assessed with QUADAS-2, use of a different positive threshold, smaller study, different PIRT or study setting), then sensitivity need not be downgraded.
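The visual check above can be backed by a crude numeric screen: do all study-level confidence intervals share at least one common value? This is a deliberate simplification (real judgements also weigh the spread of point estimates and explanations for outliers), offered only as a sketch:

```python
def intervals_overlap(cis):
    """Return True if every confidence interval (low, high) shares at
    least one common value: the largest lower bound must not exceed
    the smallest upper bound."""
    return max(lo for lo, hi in cis) <= min(hi for lo, hi in cis)
```

For example, sensitivity-like intervals such as (0.72, 0.95), (0.80, 0.98) and (0.75, 0.93) overlap, whereas widely scattered specificity intervals such as (0.10, 0.30) and (0.60, 0.88) do not, flagging possible inconsistency for closer inspection.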
Importantly, inconsistency should only lead to downgrading if it is unexplained. If inconsistency can be explained, for instance by factors explored with sensitivity analyses or meta-regression, then it should not lead to downgrading.
Lastly, it is important to note that differences in PIRT between the systematic review and the guideline outcome should be addressed in the indirectness domain of GRADE and not as inconsistency.
Figure 4. Example of a forest plot of sensitivity and specificity
Reproduced from: Mustafa R et al. Systematic reviews and meta-analyses of the accuracy of HPV tests, visual inspection with acetic acid, cytology, and colposcopy. Int J Gynaecol Obstet. 2016 Mar;132(3):259-65. doi: 10.1016/j.ijgo.2015.07.024 (22). Licensed under the Creative Commons Attribution, Non-Commercial, No Derivative works licence (CC BY-NC-ND 4.0; http://creativecommons.org/licenses/by-nc-nd/4.0/). Note that the I² and Q estimates have been removed from the original figure. Pooled estimates are derived from a hierarchical model.

17.5.2.4 Imprecision

This component guides panels to judge whether the pooled sensitivity and specificity are precise enough to support a recommendation. This assessment concerns the width of the confidence intervals surrounding the pooled sensitivity and the pooled specificity.
Guideline Development Groups can use the following two concepts to guide their judgement on the width of confidence intervals:
1. specify a priori a width of the confidence interval that constitutes imprecision; or
2. consider how a recommendation might change if the sensitivity and specificity were at the upper or lower limit of their respective confidence intervals.
The assessment of the imprecision of a test’s sensitivity and specificity is a judgement that may be influenced by a number of factors. Most significantly, guideline panels should consider the role of the test and the prevalence of the disease. To produce more transparent guidelines, some previous WHO Guideline Development Groups have used explicit thresholds for assessments of imprecision. For instance, a guideline panel addressing a question of the accuracy of line probe assays for detecting rifampicin resistance in patients with signs and symptoms of TB considered estimates to be imprecise if the confidence intervals of the pooled sensitivity were wider than 10% in either direction (16). On the other hand, they considered the pooled specificity to be imprecise if confidence intervals were wider than 5% in either direction (16).
If a threshold is going to be used, it should be specified a priori and be generated in line with the role of the specific test. For instance, different thresholds may be selected for tests that are used to rule out disease (where a high and precise estimate of sensitivity is desirable); conversely, tests used to rule in disease rely on a high level of specificity. Ideally, any threshold that is used should be informed by evidence, particularly regarding the consequences of varying test results on patient-important outcomes. For example, modelling of the number of false positives and follow-up consequences could be used if specificity is lower than a chosen threshold. Furthermore, there are some situations where wide confidence intervals around measures of diagnostic accuracy do not necessarily lead to downgrading of the certainty of the evidence. For instance, for a condition with low prevalence, wide confidence intervals around sensitivity may not lead to real concern about imprecision for TP and FN.
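An a priori threshold like the one used in the line probe assay guideline (16) can be checked mechanically. The function below is an illustrative sketch, with the 10% (sensitivity) and 5% (specificity) half-widths taken from that example:

```python
def is_imprecise(point, ci_low, ci_high, max_half_width):
    """Flag an estimate as imprecise if either side of its confidence
    interval extends further from the point estimate than the a priori
    half-width threshold (all values expressed as proportions)."""
    return (point - ci_low) > max_half_width or (ci_high - point) > max_half_width
```

With these thresholds, a pooled sensitivity of 0.90 (CI 0.84 to 0.94) would not be flagged, whereas a pooled specificity of 0.97 (CI 0.90 to 0.99) would be, since its lower side exceeds the 0.05 half-width.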
Thresholds need not be applied for all guidelines. A Guideline Development Group may also consider how their recommendations might change if they assumed the sensitivity (or specificity) was at the lower limit of the confidence interval. For instance, consider a pooled sensitivity of 90% with a confidence interval of 60% to 98%: although a sensitivity of 90% seems high, if the sensitivity were 60% then guideline developers may not consider this adequate to support the use of the test. If guideline panels choose to assess imprecision this way, it is desirable for the changes in TP, FP, TN and FN to be determined and, importantly, the consequences of these changes modelled: for example, the number of people successfully treated at the lower limit of the confidence interval.
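The modelling described above can be sketched as follows: recompute the expected 2×2 counts per 1000 people tested under the point estimate and under the lower confidence limit, then compare the consequences. The 10% prevalence used below is a hypothetical value for the example:

```python
def counts_per_1000(sensitivity, specificity, prevalence, n=1000):
    """Expected true/false positives and negatives per n people tested,
    given the test's accuracy and the prevalence in the tested population."""
    diseased = n * prevalence
    healthy = n - diseased
    return {
        "TP": sensitivity * diseased,
        "FN": (1 - sensitivity) * diseased,
        "TN": specificity * healthy,
        "FP": (1 - specificity) * healthy,
    }
```

At 10% prevalence, a sensitivity of 90% implies 10 missed cases (FN) per 1000 people tested, but at the lower confidence limit of 60% this rises to 40 missed cases, a difference large enough to change a recommendation.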
17.5.2.5 Publication bias

The final component of the assessment of certainty of the evidence for diagnostic tests is publication bias. This refers to the increased likelihood that studies with statistically significant or favourable results will be published compared with studies with non-significant or unfavourable results (23). This can, in turn, lead to syntheses of incomplete sets of the evidence and produce summary results potentially biased towards favourable treatment effects (23). A substantial amount of research exists on publication bias in systematic reviews of RCTs; however, little research exists for systematic reviews of diagnostic accuracy (19).
Currently there is no widely agreed statistical test to help identify publication bias in diagnostic accuracy studies, and previous WHO guidelines have not assessed it (24,25). Guideline developers should ensure that searches for evidence are comprehensive (including trial registers and grey literature), acknowledge any limitations in searching, and report any evidence suggesting that important information may not have been retrieved.
17.5.2.6 Assessing the certainty of the body of evidence across outcomes
After completing the assessment of the five domains for each critical outcome (for PICO questions) and for sensitivity and specificity (for PIRT questions), the overall certainty of the body of evidence can be determined for each of the PICO and PIRT questions, across outcomes. As is the case for GRADE for interventions, the overall certainty can be assessed as high, moderate, low or very low. Evidence starts as high certainty but is downgraded by one level (e.g. high to moderate) or two levels (e.g. moderate to very low) depending on an aggregate assessment across outcomes. Examples of diagnostic test evidence profiles are available at http://www.who.int/tb/areas-of-work/laboratory/fllpa_online_annexes.pdf?ua=1 and in the GRADE Database of Evidence Tables (https://dbep.gradepro.org/search).
Guideline developers can use GRADEpro software (available at https://gradepro.org/) to present their evidence profiles. This software allows the presentation of each of the five domains for assessing certainty of evidence as well as summary of findings (SOF) tables. GRADEpro also allows guideline developers to view the data in many different forms and to assess multiple outcomes, including sensitivity, specificity, positive and negative predictive values and other measures of diagnostic accuracy; this facilitates formulation of recommendations.
17.6 Evidence-to-decision frameworks for diagnostic tests
17.6.1 What are evidence-to-decision frameworks for diagnostic tests?
Evidence-to-decision (EtD) frameworks were developed to facilitate the process of formulating recommendations based on various considerations in addition to the benefits and harms of an intervention. These considerations affect the direction (for or against the intervention) and the strength (strong or conditional) of a guideline recommendation. The GRADE system includes explicit guidance on how to use EtD frameworks when developing diagnostic test recommendations (26), in addition to specific guidance for interventions (27) (see Chapter 10 of the WHO handbook for guideline development (1)).
Each of the considerations in the EtD framework should be thoughtfully examined at the beginning of any guideline development process, whether for interventions or for diagnostic tests. Once the general scope and key questions are decided upon, guideline developers need to consider what evidence, in addition to diagnostic test accuracy and patient-important health outcomes, is needed to enable the Guideline Development Group to formulate a given recommendation. Additional evidence on, for example, feasibility or acceptability may be needed to inform the recommendation. Guideline developers must specify the perspective they are taking, whether a population or an individual patient perspective, prior to seeking evidence and populating the EtD framework.
Table 1. Components of the GRADE evidence-to-decision framework (each question with its possible judgements)

Is the problem a priority: Don't know / Varies / No / Probably no / Probably yes / Yes
Test accuracy: Don't know / Varies / Very inaccurate / Inaccurate / Accurate / Very accurate
Desirable effects: Don't know / Varies / Trivial / Small / Moderate / Large
Undesirable effects: Don't know / Varies / Large / Moderate / Small / Trivial
Certainty of test accuracy: No studies / Very low / Low / Moderate / High
Certainty of critical or important outcomes: No studies / Very low / Low / Moderate / High
Certainty of the management guided by test results: No studies / Very low / Low / Moderate / High
Certainty of link between test results and management: No studies / Very low / Low / Moderate / High
Overall certainty about effects of test: No studies / Very low / Low / Moderate / High
Values: Important uncertainty or variability / Possibly important uncertainty / Probably no important uncertainty / No important uncertainty
Balance of desirable and undesirable effects: Don't know / Varies / Favours the comparison / Probably favours the comparison / Does not favour either / Probably favours the index test / Favours the index test
Resource requirements: Don't know / Varies / Large costs / Moderate costs / Negligible costs or savings / Moderate savings / Large savings
Certainty of resource requirements: No included studies / Very low / Low / Moderate / High
Cost-effectiveness: No included studies / Varies / Favours the comparison / Probably favours the comparison / Does not favour either / Probably favours the index test / Favours the index test
Equity: Don't know / Reduced / Probably reduced / Probably no impact / Probably increased / Increased
Acceptability: Don't know / Varies / No / Probably no / Probably yes / Yes
Feasibility: Don't know / Varies / No / Probably no / Probably yes / Yes
17.6.2 Components of evidence-to-decision frameworks for diagnostic tests
The assessments that make up the GRADE EtD framework for diagnostic tests are listed in Table 1 and detailed below (26).
17.6.2.1 Is the problem a priority?
The first component of the EtD framework is a judgement regarding the priority of the problem being addressed in the recommendation question. Here Guideline Development Groups must consider evidence regarding the prevalence, incidence, morbidity, mortality and cost of the disease the recommendation is addressing. The burden of under- or over-diagnosing patients can also be highlighted. This component can be omitted if the guideline or recommendation is not prioritizing among treatment or testing options: the problem was clearly a priority or the guideline would not have been undertaken.
17.6.2.2 How accurate is the test?
This component can be answered directly from the pooled sensitivity (TP, FN) and specificity (TN, FP) presented in the GRADE evidence profiles. Guideline developers should decide a priori what they will consider "very accurate", "accurate", "inaccurate" and "very inaccurate". They should also consider the specific prevalence of the disease that will inform this judgement. GRADE evidence profiles allow, and in fact encourage, developers to enter different disease prevalences to see how the number of true positives, false positives, true negatives and false negatives changes per 1000 patients (see Figure 2; pre-test probability is synonymous with disease prevalence).
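The per-1000 recalculation at different pre-test probabilities can be sketched as follows (the 95% sensitivity and 90% specificity are illustrative values, not drawn from any particular evidence profile):

```python
def per_1000(sens, spec, prevalence):
    """Expected counts per 1000 people tested at a given pre-test probability."""
    diseased = 1000 * prevalence
    healthy = 1000 - diseased
    tp = diseased * sens          # true positives
    fp = healthy * (1 - spec)     # false positives
    tn = healthy * spec           # true negatives
    fn = diseased * (1 - sens)    # false negatives
    return tp, fp, tn, fn

# The same test yields very different absolute numbers as prevalence changes.
for prev in (0.01, 0.10, 0.30):
    tp, fp, tn, fn = per_1000(sens=0.95, spec=0.90, prevalence=prev)
    print(f"pre-test probability {prev:.0%}: "
          f"TP={tp:.0f} FP={fp:.0f} TN={tn:.0f} FN={fn:.0f}")
```

At a 1% pre-test probability the false positives dominate the true positives, whereas at 30% the reverse holds, which is why GRADE evidence profiles encourage presenting several prevalences side by side.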
17.6.2.3 How substantial are the desirable and undesirable anticipated effects?
This component asks panels to judge the anticipated benefits and harms from the test in question, including direct effects from the test (e.g. benefits such as faster diagnosis, and harms such as adverse effects from administration of the test). In addition, the possible subsequent effects of the test must be included, for instance the effects of treatment after a positive diagnosis and the effect of no treatment or further testing after a negative test. Evidence should inform these downstream effects after a diagnosis, ideally from systematic reviews of D-RCTs. For instance, an EtD table completed to determine whether human papillomavirus (HPV) tests should be used to screen for cervical intra-epithelial neoplasia 2 (CIN2) (a cervical cancer precursor) included evidence from systematic reviews of benefits (e.g. decreased mortality) and harms (e.g. infection, bleeding) from the three possible treatments for CIN2 (22). If systematic review evidence is not available, the potential subsequent effects can be modelled. For instance, the prevalence of the disease in question, combined with the sensitivity and specificity, can be used to estimate the number of FPs and FNs in a population and to estimate PPV and NPV. These data can assist guideline developers in making a judgement about the undesirable effects of the test.
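The PPV/NPV modelling described above follows directly from Bayes' rule; a minimal sketch (the sensitivity, specificity and prevalence values are illustrative, not from the CIN2 example):

```python
def predictive_values(sens, spec, prevalence):
    """Estimate PPV and NPV from test accuracy and disease prevalence."""
    tp = prevalence * sens              # P(diseased and test positive)
    fp = (1 - prevalence) * (1 - spec)  # P(healthy and test positive)
    tn = (1 - prevalence) * spec        # P(healthy and test negative)
    fn = prevalence * (1 - sens)        # P(diseased and test negative)
    ppv = tp / (tp + fp)  # probability of disease given a positive result
    npv = tn / (tn + fn)  # probability of no disease given a negative result
    return ppv, npv

# A test with 90% sensitivity and 95% specificity: at 2% prevalence most
# positives are false positives; at 20% prevalence most are true positives.
for prev in (0.02, 0.20):
    ppv, npv = predictive_values(sens=0.90, spec=0.95, prevalence=prev)
    print(f"prevalence {prev:.0%}: PPV={ppv:.2f}, NPV={npv:.2f}")
```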
17.6.2.4 What is the overall certainty of the test accuracy evidence?
This component is the judgement of the quality (certainty) of the evidence for diagnostic test accuracy. For this domain of the EtD framework, sensitivity and specificity should be considered as one collective measure of test accuracy.
17.6.2.5 What is the certainty of the evidence for any critical or important outcomes?
Critical or important outcomes in the GRADE approach are those related to harms and benefits for patients (26), and these are specified a priori by guideline developers (26,28). This component prompts guideline developers to assess the quality (certainty) of evidence on the direct benefits and harms of the test (part of the assessment in section 17.6.2.3). Direct benefits can include faster diagnosis, for example, whereas direct harms refer to specific harms from the test, for instance allergic reactions to radioactive contrast dye. Often these outcome data will be found in the diagnostic accuracy studies included in the systematic review; however, they may also come from other primary studies or systematic reviews. Panels will have to make a judgement on the quality of evidence from these additional studies (very low, low, moderate or high).
17.6.2.6 What is the overall certainty of the evidence of effects of the management that is guided by the test results?
Guideline Development Groups are asked to judge the certainty of evidence for two issues.
1. The evidence supporting treatment and management after a positive diagnosis: a) the quality of evidence supporting the treatment of the target disease (this is particularly relevant for persons classified as TP); and b) the quality of evidence for adverse effects of treatment (this is particularly relevant for persons that test FP). This evidence should be addressed in a systematic review of RCTs or observational studies, with a corresponding GRADE evidence profile for interventions.
2. The evidence supporting the natural history (prognosis) of the target condition: improvement or deterioration without treatment or further management is relevant to those that test FN. The evidence on the natural history of the target condition should generally come from the control arms of RCTs or observational studies, and the quality of evidence is judged using the GRADE approach for questions of prognosis (29).
17.6.2.7 How certain is the link between test results and management decisions?
Guideline Development Groups must make a judgement about the likelihood that the appropriate management (such as treatment decisions) will follow from test results. Important features of a test, such as turnaround time and interpretability of results, can pose barriers to patients receiving the appropriate treatment after obtaining a test result. Further, there may be factors external to the test that reduce the likelihood of patients receiving appropriate management after a test, such as out-of-pocket expenses; access to quality, coordinated services; and health literacy, among many others. Guideline Development Groups should consider the literature broadly and seek and include relevant contextual knowledge about the target healthcare settings for potential barriers that prevent appropriate follow-up and management after a test. Although it is highly preferable to consider published research studies to inform this judgement, guideline developers often have to rely on their own experience concerning the likelihood that a test result is managed appropriately, and this can still be considered "high" certainty. For instance, the 2018 American Society of Hematology guideline for management of venous thromboembolism: diagnosis of venous thromboembolism (30) considered that "with [a pulmonary embolus] (PE) diagnosis, positive results will be treated with anticoagulation (regardless of the chances of false positives)" (30). The panel assumed this because the "intervention is relatively simple to apply and few patients would be missed in health care systems that are equipped to offer testing" (26,30). The guideline panel thus considered the certainty of the link between the diagnostic test results and management decisions to be high.
17.6.2.8 What is the overall certainty of the evidence about the effects of the test?
This component requires Guideline Development Groups to make an overall judgement about the certainty (quality) of the evidence, considering the lowest quality of the previous four sections (17.6.2.4 to 17.6.2.7), which reflect the entire test-to-treatment pathway: from the accuracy of the test, to the likelihood that patients who test positive get treated, to the effectiveness of treatment.
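This "lowest rating wins" rule can be expressed as a tiny helper (a sketch; the function name and rating strings are ours, while the ordering is the GRADE scale described in the text):

```python
# GRADE certainty ratings, ordered from lowest to highest
GRADE_LEVELS = ["very low", "low", "moderate", "high"]

def overall_certainty(test_accuracy, direct_outcomes,
                      management_effects, link_to_management):
    """Overall certainty of the effects of the test is the lowest rating
    across the four domains of the test-to-treatment pathway
    (sections 17.6.2.4 to 17.6.2.7)."""
    domains = [test_accuracy, direct_outcomes,
               management_effects, link_to_management]
    return min(domains, key=GRADE_LEVELS.index)

# High-certainty accuracy evidence cannot compensate for low-certainty
# evidence about downstream management:
print(overall_certainty("high", "moderate", "low", "high"))  # low
```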
17.6.2.9 Is there important uncertainty about or variability in how much people value diagnostic accuracy of this test, and the other outcomes associated with the test-to-treat pathway?
This section addresses how much people value the outcomes and how this may affect recommendations. The outcomes of interest include the accuracy (sensitivity and specificity) of a test, but also outcomes related to direct benefits and harms from tests and subsequent management of the disease or condition. For instance, Guideline Development Groups should use evidence from persons affected (or potentially affected) by the target condition and its associated test to assign a value to the usability and resource requirements of a test, time-to-result, benefits and harms from treatment following a positive test, as well as the sensitivity and specificity of the test. As stated in section 17.3, each outcome should be rated as critical or important according to GRADE guidance (26), informed by qualitative studies reflecting patient, healthcare provider and other stakeholders' perspectives. This will help to guide how a test (or a collection of test strategies) should be used in practice. Patient values may lead guideline developers to recommend the test to rule out disease and thus prioritize a highly sensitive test. For example, the 2018 American Society of Hematology guideline prioritized a sensitive initial test because of patient desire to rule out a pulmonary embolism (30). Guideline Development Groups should reflect any uncertainty or variability in how patients and healthcare professionals value the outcomes. The Guideline Development Group will often survey its members to gather their views on the relative value of each outcome (although ideally persons affected by the recommendation will be surveyed). For an example, see https://guidelines.gradepro.org/profile/e9600faf-99bc-4ade-9f2f-70bf6e078f9e (in the evidence-to-decision framework tab).
17.6.2.10 Does the balance between desirable and undesirable effects favour the index test or the comparison?
Guideline Development Groups are prompted to make an overall judgement about the benefits and harms of the test. This assessment is based on the accuracy of the test, direct benefits (e.g. faster diagnosis) and harms (e.g. radiation) from the test, the benefits and harms from management following the test results, and the certainty of the bodies of evidence informing these assessments. For instance, a WHO guideline on the use of line probe assays to diagnose multidrug-resistant TB judged the balance of benefits and harms to favour the line probe assay because of its high sensitivity and specificity and the consequent small numbers of FN and FP results, coupled with documented reductions in diagnostic and treatment delays (16). To aid decision-making, modelling can be performed to determine the number of TP, FP, FN and TN based on different disease prevalences (pre-test probabilities). Similarly, the number or rate of harmful effects of tests can be estimated and compared. In the absence of data to support a decision, the Guideline Development Group should usually still make an assessment; in this situation a judgement that the evidence "probably favours" either the index test or the comparison should be made.
17.6.2.11 Resource requirements
Resource requirements for implementing a new test should be considered before a recommendation is issued. Resource requirements are addressed by three questions.
1. How large are the resource requirements?
2. What is the certainty of the evidence on resource requirements?
3. Does the cost-effectiveness of the intervention favour the intervention or the comparison?
When addressing this component, Guideline Development Groups should consider the resource requirements of both the test and subsequent management (e.g. treatment). Because resource cost and affordability are so contextual, the Guideline Development Group may elect not to consider cost in the recommendation, instead explicitly noting that adoption and adaptation of the guideline to the national or sub-national context will require careful consideration of the cost of the test and subsequent management strategies.
17.6.2.12 Equity, acceptability and feasibility
Guideline developers must consider the impact of the index test on health equity, the acceptability of a new test by relevant stakeholders, and whether the implementation of the test is feasible. Evidence that can be used to address these issues includes qualitative studies on the perspectives of key stakeholders (e.g. patients, health professionals and programme managers). Often, due to the lack of research evidence addressing these components, Guideline Development Groups rely on experience with the test in different settings. For instance, the successful use of liquid-based cytology as a method of screening in the Kingdom of Saudi Arabia was used to support an assessment of the feasibility of implementing a new cervical cancer screening method in the WHO guideline on cervical cancer screening (22,31). Equity, acceptability and feasibility should each be judged individually.
17.6.2.13 Developing a recommendation
A completed EtD framework assists Guideline Development Groups to generate transparent, valid, trustworthy recommendations. Guideline developers need to consider both the direction of the recommendation (whether the test should be recommended or not) and the strength of the recommendation (strong or conditional). How to formulate recommendations, including when to issue a strong or conditional recommendation, is covered in detail in Chapter 10 of the WHO handbook for guideline development (1). There are, however, some issues specific to formulating diagnostic test recommendations.
It is common for a discrepancy to exist between the certainty of the evidence pertaining to test accuracy and the certainty of evidence for patient-important outcomes. This is most commonly due to uncertainty surrounding the link between test and treatments and/or uncertainty surrounding the effect of subsequent management. In these situations, where the certainty of the diagnostic accuracy evidence is moderate or high but the certainty of the evidence on downstream management and/or the link between the test and management is low or very low, this uncertainty should be reflected, and in most situations a conditional, rather than strong, recommendation will be appropriate (26).
17.6.3 When is evidence from test accuracy studies sufficient to develop a recommendation?
A 2017 study reported the results of interviews with 24 international experts in evidence and decisions about healthcare-related tests and diagnostic strategies (32). This study concluded that "test accuracy is rarely, if ever, sufficient to base guideline recommendations" and thus evidence-to-decision frameworks are necessary to help developers consider important issues beyond test accuracy. Diagnostic test experts did, however, note four potential situations in which test accuracy is likely to be sufficient to extrapolate the effects of tests on patient-important outcomes (32):
1. when diagnostic noninferiority is sufficient for a decision;
2. when inferences can be made about the impact on patient-important outcomes;
3. when the accuracy of one test is equivalent to or better than the combined accuracy of two tests; and
4. when the primary goal of the guideline is to establish a diagnosis for a condition or to rule out a condition.
In all of these four scenarios, however, there is an assumed link to patient-important outcomes. As such, even if one of these four situations applies, it is still advisable to consider patient-important outcomes when formulating recommendations.
17.7 Useful resources
■ The Cochrane handbook for systematic reviews of diagnostic test accuracy: http://methods.cochrane.org/sdt/handbook-dta-reviews (33)
■ The GRADEpro website: https://gradepro.org/
■ GRADE database of evidence profiles: https://guidelines.gradepro.org/search
Acknowledgements
This chapter was prepared by Dr Jack W. O’Sullivan and Dr Susan L Norris with peer review by Professor Reem Mustafa, Dr Mariska (MMG) Leeflang and Dr Alexei Korobitsyn.
References
1. World Health Organization. WHO handbook for guideline development, 2nd edition. Geneva; 2014. Available from: http://apps.who.int/iris/bitstream/10665/145714/1/9789241548960_eng.pdf?ua=1, accessed 22 February 2019.
2. Siontis KC, Siontis GCM, Contopoulos-Ioannidis DG, Ioannidis JPA. Diagnostic tests often fail to lead to changes in patient outcomes. J Clin Epidemiol. 2014;67(6):612–21. https://doi.org/10.1016/j.jclinepi.2013.12.008. PMID: 24679598
3. Ferrante Di Ruffano L, Davenport C, Eisinga A, Hyde C, Deeks JJ. A capture-recapture analysis demonstrated that randomized controlled trials evaluating the impact of diagnostic tests on patient outcomes are rare. J Clin Epidemiol. 2012;65(3):282–7. http://dx.doi.org/10.1016/j.jclinepi.2011.07.003. PMID: 22001307
4. O’Sullivan JW, Banerjee A, Heneghan C, Pluddemann A. Verification bias. BMJ Evidence-Based Med. 2018;23(2):54-55. http://ebm.bmj.com/content/23/2/54. PMID: 29595130
5. Kendrick D, Fielding K, Bentley E, Kerslake R, Miller P, Pringle M. Radiography of the lumbar spine in primary care patients with low back pain: randomised controlled trial. BMJ. 2001;322:400–5. https://doi.org/10.1136/bmj.322.7283.400. PMID: 11179160
6. Felker GM, Anstrom KJ, Adams KF, Ezekowitz JA, Fiuzat M, Houston-Miller N, et al. Effect of Natriuretic Peptide–Guided Therapy on Hospitalization or Cardiovascular Mortality in High-Risk Patients With Heart Failure and Reduced Ejection Fraction. JAMA. 2017;318(8):713. http://jama.jamanetwork.com/article.aspx?doi=10.1001/jama.2017.10565. PMID: 28829876
7. Lijmer JG, Bossuyt PMM. Various randomized designs can be used to evaluate medical tests. J Clin Epidemiol. 2009;62(4):364–73. http://dx.doi.org/10.1016/j.jclinepi.2008.06.017. PMID: 18945590
8. Mustafa RA, Wiercioch W, Cheung A, Prediger B, Brozek J, Bossuyt P, et al. Decision making about healthcare-related tests and diagnostic test strategies. Paper 2: a review of methodological and practical challenges. J Clin Epidemiol. 2017;92:18–28. https://doi.org/10.1016/j.jclinepi.2017.09.003. PMID: 28916488
9. Schünemann HJ, Oxman AD, Brozek J, Glasziou P, Jaeschke R, Vist GE, et al. GRADE: grading quality of evidence and strength of recommendations for diagnostic tests and strategies. BMJ. 2008;336(7654):1106–10. https://doi.org/10.1136/bmj.39500.677199.AE. PMID: 18483053
10. Moher D, Liberati A, Tetzlaff J, Altman DG, PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. BMJ. 2009;339:b2535. https://doi.org/10.1136/bmj.b2535. PMID: 19622551
11. Whiting P, Savović J, Higgins JPT, Caldwell DM, Reeves BC, Shea B, et al. ROBIS: A new tool to assess risk of bias in systematic reviews was developed. J Clin Epidemiol. 2016;69:225–34. https://doi.org/10.1016/j.jclinepi.2015.06.005. PMID: 26092286
12. Shea BJ, Reeves BC, Wells G, Thuku M, Hamel C, Moran J, et al. AMSTAR 2: a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. BMJ. 2017;358:j4008. https://doi.org/10.1136/bmj.j4008. PMID: 28935701
13. Leeflang MMG, Scholten RJPM, Rutjes AWS, Reitsma JB, Bossuyt PMM. Use of methodological search filters to identify diagnostic accuracy studies can lead to the omission of relevant studies. J Clin Epidemiol. 2006;59(3):234–40. https://doi.org/10.1016/j.jclinepi.2005.07.014. PMID: 16488353
14. McInnes MDF, Moher D, Thombs BD, McGrath TA, Bossuyt PM, Clifford T, et al. Preferred reporting items for a systematic review and meta-analysis of diagnostic test accuracy studies. JAMA. 2018;319(4):388. https://doi.org/10.1001/jama.2017.19163. PMID: 29362800
15. Guyatt G, Oxman AD, Akl EA, Kunz R, Vist G, Brozek J, et al. GRADE guidelines: 1. Introduction - GRADE evidence profiles and summary of findings tables. J Clin Epidemiol. 2011;64(4):383–94. https://doi.org/10.1016/j.jclinepi.2010.04.026. PMID: 21195583
16. World Health Organization. WHO Guideline: The use of molecular line probe assays for the detection of resistance to isoniazid and rifampicin. 2016. Available from: http://apps.who.int/iris/bitstream/handle/10665/250586/9789241511261-eng.pdf?sequence=1&isAllowed=y, accessed 22 February 2019.
17. Whiting PF, Rutjes AWS, Westwood ME, Mallet S, Deeks JJ, Reitsma JB, et al. QUADAS-2: A Revised Tool for the Quality Assessment of Diagnostic Accuracy Studies. Ann Intern Med. 2011;155(8):529–36. https://doi.org/10.7326/0003-4819-155-8-201110180-00009. PMID: 22007046
18. Rolfe A, Burton C. Reassurance After Diagnostic Testing With a Low Pretest Probability of Serious Disease. JAMA Intern Med. 2013;173(6):407–16. https://doi.org/10.1001/jamainternmed.2013.2762. PMID: 23440131
19. Macaskill P, Gatsonis C, Deeks JJ, Harbord RM, Takwoingi Y. Chapter 10: Analysing and Presenting Results. In: Deeks JJ, Bossuyt PM, Gatsonis C (editors), Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy Version 1.0. The Cochrane Collaboration, 2010. Available from: http://srdta.cochrane.org/, accessed 21 February 2019.
20. Leeflang MMG. Systematic reviews and meta-analyses of diagnostic test accuracy. Clin Microbiol Infect. 2014;20(2):105–13. https://doi.org/10.1111/1469-0691.12474. PMID: 24274632
21. Higgins JPT. Measuring inconsistency in meta-analyses. BMJ. 2003;327(7414):557-60. https://doi.org/10.1136/bmj.327.7414.557. PMID: 12958120
22. Mustafa RA, Santesso N, Khatib R, Mustafa AA, Wiercioch W, Kehar R, et al. Systematic reviews and meta-analyses of the accuracy of HPV tests, visual inspection with acetic acid, cytology, and colposcopy. Int J Gynecol Obstet. 2016;132(3):259–65. http://dx.doi.org/10.1016/j.ijgo.2015.07.024. PMID: 26851054
23. Ahmed I, Sutton AJ, Riley RD. Assessment of publication bias, selection bias, and unavailable data in meta-analyses using individual participant data: a database survey. BMJ. 2012;344:d7762. https://doi.org/10.1136/bmj.d7762. PMID: 22214758
24. Rogozinska E, Khan K. Grading evidence from test accuracy studies: what makes it challenging compared with the grading of effectiveness studies? Evid Based Med. 2017;22(3):81–4. https://doi.org/10.1136/ebmed-2017-110717. PMID: 28600330
25. World Health Organization. WHO recommendations on antenatal care for a positive pregnancy experience. Geneva; 2016. Available from: https://www.who.int/reproductivehealth/publications/maternal_perinatal_health/anc-positive-pregnancy-experience/en/, accessed 20 February 2019.
26. Schünemann HJ, Mustafa R, Brozek J, Santesso N, Alonso-Coello P, Guyatt G, et al. GRADE Guidelines: 16. GRADE evidence to decision frameworks for tests in clinical practice and public health. J Clin Epidemiol. 2016;76:89–98. https://doi.org/10.1016/j.jclinepi.2016.01.032. PMID: 26931285
27. Alonso-Coello P, Schünemann HJ, Moberg J, Brignardello-Petersen R, Akl EA, Davoli M, et al. GRADE Evidence to Decision (EtD) frameworks: A systematic and transparent approach to making well informed healthcare choices. 1: Introduction. Gac Sanit. 2018;32(2):166.e1-166.e10. https://doi.org/10.1016/j.gaceta.2017.02.010. PMID: 28822594
28. Andrews J, Guyatt G, Oxman AD, Alderson P, Dahm P, Falck-Ytter Y, et al. GRADE guidelines: 14. Going from evidence to recommendations: The significance and presentation of recommendations. J Clin Epidemiol. 2013;66(7):719–25. https://doi.org/10.1016/j.jclinepi.2012.03.013. PMID: 23312392
29. Iorio A, Spencer FA, Falavigna M, Alba C, Lang E, Burnand B, et al. Use of GRADE for assessment of evidence about prognosis: rating confidence in estimates of event rates in broad categories of patients. BMJ. 2015;350:h870. https://doi.org/10.1136/bmj.h870. PMID: 25775931
30. Lim W, Le Gal G, Bates SM, Righini M, Haramati LB, Lang E, et al. American Society of Hematology 2018 guidelines for management of venous thromboembolism: diagnosis of venous thromboembolism. Blood Adv. 2018 Nov 27;2(22):3226–56. https://doi.org/10.1182/bloodadvances.2018024828. PMID: 30482764
31. World Health Organization. Guidelines for screening and treatment of precancerous lesions for cervical cancer prevention. Geneva; 2013. Available from: http://www.who.int/reproductivehealth/publications/cancers/screening_and_treatment_of_precancerous_lesions/en/index.html, accessed 20 February 2019.
32. Mustafa RA, Wiercioch W, Ventresca M, Brozek J, Schünemann HJ, Bell H, et al. Decision making about healthcare-related tests and diagnostic test strategies. Paper 5: a qualitative study with experts suggests that test accuracy data alone is rarely sufficient for decision making. J Clin Epidemiol. 2017;92:47–57. https://doi.org/10.1016/j.jclinepi.2017.09.005. PMID: 28917629
33. Reitsma JB, Rutjes AWS, Whiting P, Vlassov VV, Leeflang MMG, Deeks JJ. Chapter 9: Assessing methodological quality. In: Deeks JJ, Bossuyt PM, Gatsonis C (editors), Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy Version 1.0.0. The Cochrane Collaboration, 2009. Available from: https://methods.cochrane.org/sites/methods.cochrane.org.sdt/files/public/uploads/ch09_Oct09.pdf, accessed 20 February 2019.