Grading Strength of Evidence Prepared for: The Agency for Healthcare Research and Quality (AHRQ)...

Grading Strength of Evidence

Prepared for:

The Agency for Healthcare Research and Quality (AHRQ)

Training Modules for Systematic Reviews Methods Guide

www.ahrq.gov

Systematic Review Process Overview

To define what “grading strength of evidence (SOE)” is

To describe why grading SOE is important

To distinguish between grading SOE and rating the quality of individual articles

To list primary and additional domains for grading SOE

To describe options for scoring SOE domains

To describe how to score and present SOE grades

Learning Objectives

Is distinct from rating the quality of individual studies

Is generally used only to assess:
  Major outcomes (benefits and harms)
  Major comparisons, when relevant

Grading Strength of Evidence

To facilitate use of systematic reviews by diverse decisionmakers and stakeholders

To give decisionmakers:
  A comprehensive evaluation of the evidence
  A sense of how much confidence they can place in the evidence

To foster transparency and documentation

Why Grade Strength of Evidence?

1. Scoring four required domains

a. Risk of bias

b. Consistency

c. Directness

d. Precision

2. Considering, and possibly scoring, four additional domains

a. Dose-response association

b. Plausible confounders

c. Strength of association

d. Publication bias

3. Combining scores from required domains into a single strength-of-evidence score, taking scores on additional domains into account as needed

Three Steps to Grading Strength of Evidence
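The three steps can be pictured as filling in one record per outcome. The following is a minimal sketch, assuming the score values this module lists for each domain; the class name `OutcomeAssessment`, the field names, and the lowercase spellings are hypothetical and not part of the EPC guidance. Publication bias is left out of the record because the module says it need not be formally scored.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Allowed scores for the four required domains, as listed in this module.
REQUIRED_DOMAIN_SCORES = {
    "risk_of_bias": {"high", "medium", "low"},
    "consistency": {"consistent", "inconsistent", "unknown or not applicable"},
    "directness": {"direct", "indirect"},
    "precision": {"precise", "imprecise"},
}

# Additional domains are scored only when applicable and helpful.
ADDITIONAL_DOMAIN_SCORES = {
    "dose_response": {"present", "not present", "not applicable or not tested"},
    "plausible_confounding": {"present", "absent"},
    "strength_of_association": {"strong", "weak"},
}

# A single designation is required; ranges such as "low to moderate" are rejected.
GRADES = ("high", "moderate", "low", "insufficient")


@dataclass
class OutcomeAssessment:
    """Hypothetical record of domain scores and the overall grade for one outcome."""
    outcome: str
    required: Dict[str, str]
    additional: Dict[str, str] = field(default_factory=dict)
    grade: Optional[str] = None

    def __post_init__(self):
        # Step 1: every required domain must carry an allowed score.
        for domain, allowed in REQUIRED_DOMAIN_SCORES.items():
            score = self.required.get(domain)
            if score not in allowed:
                raise ValueError(f"{domain}: expected one of {sorted(allowed)}, got {score!r}")
        # Step 2: additional domains are optional, but any score given must be valid.
        for domain, score in self.additional.items():
            allowed = ADDITIONAL_DOMAIN_SCORES.get(domain)
            if allowed is None or score not in allowed:
                raise ValueError(f"invalid additional domain score: {domain}={score!r}")
        # Step 3: the overall grade, once assigned, must be one of the four terms.
        if self.grade is not None and self.grade not in GRADES:
            raise ValueError(f"grade must be one of {GRADES}, got {self.grade!r}")
```

The record only checks that scores are well formed; how the scores are combined into the grade is left to the reviewers.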

Concerns both study design and study conduct for individual studies, rated by usual methods

Assesses the aggregate quality of studies within each major study design and integrates those assessments into an overall risk-of-bias score

Risk-of-bias scores:
  High (lowers strength-of-evidence grade)
  Medium
  Low (raises strength-of-evidence grade)

Four Required Domains: Risk of Bias

Defined as the degree of similarity in the effect sizes of different studies within an evidence base

Consistent evidence bases:
  Have the same direction of effect (same side of “no effect”)
  Have a narrow range of effect sizes

Inconsistent evidence bases:
  Have nonoverlapping confidence intervals
  Have significant unexplained clinical or statistical heterogeneity

Four Required Domains: Consistency

Only three possible scores for consistency:
  Consistent (i.e., no inconsistency)
  Inconsistent
  Unknown or not applicable (single study cannot be assessed)

Meta-analysis:
  Use appropriate tests, such as Cochran’s Q test or the I² statistic

Four Required Domains: Consistency Scores
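For readers who want to see what the two heterogeneity statistics named above measure, here is a minimal sketch using the standard inverse-variance definitions: Q is the weighted sum of squared deviations from the pooled estimate, and I² = max(0, (Q − df) / Q) × 100. The function name and the input format (effect estimates with their standard errors) are assumptions for illustration, and the sketch does not compute the p-value for Q, which would require a chi-square distribution with df degrees of freedom.

```python
def cochran_q_and_i2(effects, std_errors):
    """Return (Q, df, I2 in percent) for a set of study effect estimates."""
    if len(effects) != len(std_errors) or len(effects) < 2:
        raise ValueError("need at least two studies with matching standard errors")
    # Inverse-variance weights: more precise studies count more.
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I2: share of total variation attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i2
```

A single study gives no basis for these statistics, which matches the "unknown or not applicable" score above; the sketch raises an error in that case.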

Defined as whether the evidence being assessed:
  Reflects a single, direct link between the interventions of interest and the ultimate health outcome under consideration
  Relies on multiple links in a causal chain

If multiple links are involved, strength of evidence can be only as strong as the weakest link

Using analytic frameworks* is important

Four Required Domains: Directness

*See the “Analytic Frameworks” module
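The weakest-link principle can be written as a one-line rule. This is a minimal sketch, assuming each link in the causal chain has already been given one of the four grade terms used later in this module; the function name and the ordering list are hypothetical, and the result is only a ceiling on the overall grade, not the grade itself.

```python
# Grades ordered from weakest to strongest.
GRADE_ORDER = ["insufficient", "low", "moderate", "high"]


def chain_strength_ceiling(link_grades):
    """Return the weakest grade among the links of a causal chain."""
    # min() raises ValueError on an empty chain, which is the desired behavior here.
    return min(link_grades, key=GRADE_ORDER.index)
```

For example, a chain whose links are graded high, low, and moderate can support at most a low grade for the overall question.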

Intermediate or surrogate outcomes instead of health or patient-centered outcomes
  Example: laboratory test results or radiographic findings versus patient-reported functional outcomes or death

Indirect comparisons rather than direct, head-to-head comparisons
  Direct (e.g., A vs. B, A vs. C, and B vs. C):
    Head-to-head studies in the evidence base
    Generally assumes use of health outcomes, not surrogate/proxy outcomes
    Better strength of evidence
  Indirect (e.g., A vs. B, B vs. C, but not A vs. C):
    No head-to-head studies that cover all interventions or outcomes of interest
    Problematic situation for all types of comparisons
    Strength-of-evidence grades not as strong as with direct evidence

Four Required Domains: Aspects of Indirectness
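One way to see why indirect evidence is graded less strongly is to work through an adjusted indirect comparison, often called the Bucher method. This technique is not described in this module; it is shown here only as an illustration. The sketch assumes effects on an additive scale (for example, risk differences or log odds ratios), independent sets of studies, and the hypothetical function and parameter names below.

```python
import math


def indirect_estimate(d_ab, se_ab, d_bc, se_bc, z=1.96):
    """Estimate A vs. C from A vs. B and B vs. C when no head-to-head study exists."""
    # Effects add along the chain A -> B -> C.
    d_ac = d_ab + d_bc
    # Variances add too, so the indirect estimate is less precise than either input.
    se_ac = math.sqrt(se_ab ** 2 + se_bc ** 2)
    ci = (d_ac - z * se_ac, d_ac + z * se_ac)
    return d_ac, se_ac, ci
```

Because the standard error of the indirect estimate is always larger than that of either direct comparison, its confidence interval is wider, which also bears on the precision domain.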

Applicability is evaluated separately from directness for the Evidence-based Practice Center (EPC) program.
  For decisionmakers, the applicability of evidence depends on the different interests of diverse groups.
  A PICOS framework (patient populations, interventions, comparators, outcomes, and settings) is used for applicability assessment in the EPC program.

Although the EPC program separates applicability from strength-of-evidence grading, other systems that work with one decisionmaker may incorporate applicability issues into their evaluations of directness.

Related Issue of Applicability*

*See the “Assessing Applicability” module

Only two possible scores for directness:
  Direct: Evidence is based on a single link between the intervention and health outcomes
  Indirect: Evidence relies on:
    Surrogate/proxy outcomes
    More than one body of evidence
    Both situations

Four Required Domains: Directness Scores

Defined as the degree of certainty for estimate of effect with respect to a specific outcome

Is a complicated concept that:
  Asks the question: What can decisionmakers conclude about whether one treatment is, clinically speaking, inferior, superior, or equivalent (neither inferior nor superior) to another?
  Includes considerations of:
    Statistical significance for effect estimates
    Confidence intervals for those effect estimates

Four Required Domains: Precision

Are rated separately for each important outcome or comparison, including for any summary estimate of effect size

Only two scores are possible:
  Precise: estimate allows a clinically useful conclusion
  Imprecise: confidence interval is so wide it could include clinically distinct (even conflicting) conclusions

Four Required Domains: Precision Scores
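The module does not prescribe a numeric rule for precision, so the following is only one possible operationalization, offered as a sketch. It assumes reviewers have chosen a minimally important difference (the hypothetical parameter `mid`) and an effect scale on which positive values favor the treatment; the confidence interval is then checked against the three conclusions named above (inferior, equivalent, superior).

```python
def compatible_conclusions(ci_low, ci_high, mid):
    """List the clinical conclusions that the confidence interval cannot rule out."""
    if ci_low > ci_high or mid <= 0:
        raise ValueError("need ci_low <= ci_high and a positive mid")
    conclusions = []
    if ci_low < -mid:
        conclusions.append("inferior")
    # The equivalence zone is [-mid, +mid]; any overlap keeps "equivalent" possible.
    if ci_low <= mid and ci_high >= -mid:
        conclusions.append("equivalent")
    if ci_high > mid:
        conclusions.append("superior")
    return conclusions


def score_precision(ci_low, ci_high, mid):
    """Score 'precise' only when the interval supports exactly one conclusion."""
    if len(compatible_conclusions(ci_low, ci_high, mid)) == 1:
        return "precise"
    return "imprecise"
```

Under these assumptions, an interval from 1 to 7 with a minimally important difference of 0.5 supports only superiority and is scored precise, while an interval from -8 to 1 with a minimally important difference of 2 is compatible with both inferiority and equivalence and is scored imprecise.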

Four “discretionary” domains:
  Dose-response association
  Plausible confounders
  Strength of association
  Publication bias

Use when they are:
  Applicable
  Helpful in reaching conclusions about overall grades for strength of evidence

Additional Domains

Pattern of a larger effect with greater exposure (dose, duration, adherence) either across or within studies

Rate if studies give levels of exposure

Additional Domains: Dose-Response Association

Three scores are possible for dose-response:
  Present: dose-response pattern observed
    In such a case, Evidence-based Practice Center reviewers may want to upgrade the level of evidence.
  Not present: no dose-response pattern observed
  Not applicable or not tested

Additional Domains: Dose-Response Scores

In an observational study, sometimes plausible confounding factors work in the direction opposite that of the observed effect.
  Had such “effect-weakening” confounders not been present, the effect would have been even larger than the one observed.

In such a case, Evidence-based Practice Center reviewers may want to upgrade the level of evidence.

Consider whether or not plausible confounding exists that would decrease the observed effect.

Additional Domains: Plausible Confounding

Two scores are possible for plausible confounding:
  Present: confounding factors that would decrease the observed effect may be present
  Absent: confounding factors that would decrease the observed effect are not likely to be present

Additional Domains: Plausible Confounding Scores

Magnitude of effect:
  Defined as the likelihood that the observed effect is large enough that it cannot have occurred solely as a result of bias from potential confounding factors

Consider when effect size is particularly large

Additional Domains: Strength of Association

Two scores are possible for strength of association:
  Strong: large effect size that is unlikely to have occurred in the absence of a true effect of the intervention
    In such a case, Evidence-based Practice Center reviewers may want to upgrade the level of evidence.
  Weak: small enough effect size that it could have occurred solely as a result of bias from confounding factors

Additional Domains: Strength of Association Scores

Studies may have been published selectively.
  Example: only a small proportion of relevant trials or other studies has been published.

As a result, estimated effects of an intervention that are based only on published studies may not reflect the true effect.

Publication bias may undermine the overall robustness of a body of evidence.

Additional Domains: Publication Bias

Publication bias scores:
  Need not be formally computed but can influence ratings of required domains
  Possible publication bias should be taken into account when:
    Rating consistency
    Calculating a summary confidence interval for an effect

Add comments on publication bias when circumstances suggest that relevant empirical findings, particularly negative or no-difference findings, have not been published or are not otherwise available.

Additional Domains: Publication Bias Scores

Use two or more reviewers with the appropriate clinical and methodological expertise.

Assess separately:
  Each required domain (or each optional domain, as relevant)
  Each major outcome, including benefits and harms

Resolve differences by consensus or mediation by an additional expert; consensus scores should appear in tables.

Keep records of each reviewer's individual judgments about domains as background documentation.

Procedures for Assessing Domains

Reflect a global assessment that:
  Takes the required domains directly into account
  Incorporates judgments about the additional domains as needed

Aim to:
  Provide “actionable” information for a variety of different users, readers, and stakeholders
  Be transparent in how the strength-of-evidence grades are reached

Strength of Evidence Grades (I)

For each comparison of interest, rate the strength of evidence for:
  Each major benefit (e.g., positive effects on health outcomes such as physical function or quality of life, or effects on laboratory measures or other surrogate variables)
  Each major harm (ranging from rare, serious, or life-threatening adverse events to common but bothersome effects)

For both benefits and harms:
  Focus on the outcomes most relevant to patients, clinicians, and policymakers

Strength of Evidence Grades (II)

High: High confidence that the evidence reflects the true effect. Further research is very unlikely to change our confidence in the estimate of effect.

Moderate: Moderate confidence that the evidence reflects the true effect. Further research may change our confidence in the estimate of effect and may change the estimate.

Low: Low confidence that the evidence reflects the true effect. Further research is likely to change the confidence in the estimate of effect and is likely to change the estimate.

Insufficient: Evidence either is unavailable or does not permit a conclusion.

Strength of Evidence Grades and Definitions

Using the high, moderate, or low strength-of-evidence grade:
  Implies that a body of evidence actually exists
  Is intended to convey how confident reviewers are about decisions that may be made based on evidence graded one way or another
  Requires the use of only one designation, not a range (e.g., not “low to moderate”)

Strength of Evidence Grades: Additional Points (I)

The insufficient strength-of-evidence grade:
  Is applied when:
    Reviewers cannot draw conclusions about an outcome, comparison, or other question
  Is appropriate when:
    No evidence is available at all
    Evidence is too insubstantial to permit conclusions to be drawn (e.g., opposing results from studies with a similar risk of bias; wide and overlapping confidence intervals)

Strength of Evidence Grades: Additional Points (II)

Different approaches may be used to incorporate multiple domains into an overall strength-of-evidence grade:
  GRADE algorithm
  Weighting system of the Evidence-based Practice Center
  Some qualitative approach

Use (at least) two reviewers.

Assess resulting interrater reliability for each domain score, and keep records.

Scoring and Reporting: General Guidance
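The module asks for interrater reliability to be assessed but does not name a statistic. One common choice for two reviewers assigning categorical scores is Cohen's kappa, sketched below under that assumption; the function name and the input format (two equal-length lists of domain scores, one per reviewer) are hypothetical.

```python
from collections import Counter


def cohens_kappa(reviewer_1, reviewer_2):
    """Chance-corrected agreement between two reviewers' categorical scores."""
    if len(reviewer_1) != len(reviewer_2) or not reviewer_1:
        raise ValueError("both reviewers must score the same non-empty set of items")
    n = len(reviewer_1)
    # Observed agreement: share of items scored identically.
    observed = sum(a == b for a, b in zip(reviewer_1, reviewer_2)) / n
    # Expected agreement by chance, from each reviewer's marginal frequencies.
    counts_1, counts_2 = Counter(reviewer_1), Counter(reviewer_2)
    categories = set(counts_1) | set(counts_2)
    expected = sum(counts_1[c] * counts_2[c] for c in categories) / n ** 2
    if expected == 1.0:
        # Both reviewers used a single identical category throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Kappa would be computed separately for each domain (for example, once for all precision scores and once for all consistency scores), and the values kept with the review's records.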

Risk of bias (given design and conduct of available studies) is the essential component in determining the strength-of-evidence grade.
  First, consider which study design is most appropriate to reduce bias for each question.
  Next, consider the risk of bias from available studies.

Guiding Principles: Risk of Bias

Drug comparisons in randomized controlled trials (RCTs), with either placebo or an active comparator as an appropriate design:
  Evidence from well-conducted RCTs will have less risk of bias than evidence based on observational studies.
  For RCTs, reviewers can start with a rating of low for risk of bias and change the assessment if the RCTs have important flaws.
  For observational data, reviewers can start with a rating of high for risk of bias and change the assessment, depending upon how well studies were conducted.

Guiding Principles: Risk of Bias Example
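The starting-point rule in this example can be sketched as a small function. The starting ratings (low for RCTs, high for observational data) come from the slide; the `adjustment` parameter, which moves the rating by whole steps, is a hypothetical device for illustration, since the module leaves the size of any change to reviewer judgment.

```python
# Risk-of-bias levels ordered from least to most risk.
LEVELS = ["low", "medium", "high"]

# Starting ratings by study design, as described in the example above.
STARTING_RATING = {"rct": "low", "observational": "high"}


def risk_of_bias(design, adjustment=0):
    """Start from the design's default rating, then shift it by `adjustment` steps.

    Positive steps mean more risk of bias (e.g., RCTs with important flaws);
    negative steps mean less (e.g., well-conducted observational studies).
    """
    if design not in STARTING_RATING:
        raise ValueError(f"unknown design: {design!r}")
    index = LEVELS.index(STARTING_RATING[design]) + adjustment
    # Clamp so the rating never leaves the low-to-high scale.
    index = max(0, min(len(LEVELS) - 1, index))
    return LEVELS[index]
```

For instance, `risk_of_bias("rct", 1)` and `risk_of_bias("observational", -1)` both arrive at a medium rating from opposite starting points.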

Be explicit about how the evidence grade will be determined.
  A point system for combining ratings of the domains
  A qualitative consideration of the domains

Carefully document procedures.
  Keep records of procedures and results for each review so that they may contribute to the overall expertise of the Evidence-based Practice Center and the science of grading evidence.

Further Guidance: Principles for Scoring

Explain the rationale for the approach used and identify which domains were important in upgrading or downgrading the strength of evidence.

Explain judgments about the degree to which any additional domains altered the overall strength-of-evidence grade.

Provide enough detail within the report to ensure that users can grasp the methods.

Further Guidance: Principles for Reporting (I)

Use the terms high, moderate, low, or insufficient.

Do not use Roman numerals or other symbols.

Use or adapt the illustrative tabular approach to reporting (see the publications listed below for examples).
  Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions. In: Methods Guide for Comparative Effectiveness Reviews. Rockville, MD: Agency for Healthcare Research and Quality, Posted August 2009. Available at: http://effectivehealthcare.ahrq.gov/ehc/products/60/318/2009_0805_grading.pdf.
  Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions - Agency for Healthcare Research and Quality and the Effective Health Care Program. J Clin Epidemiol 2010;63:513-523.

Further Guidance: Principles for Reporting (II)

Grading Strength of Evidence: Presentation of Results (Moderate and High Grades)

| Outcome and strength of evidence (SOE) | Number of studies (subjects) | Domain: risk of bias (design/quality) | Domain: consistency | Domain: directness | Domain: precision | Magnitude of effect: absolute risk difference per 100 patients |
|---|---|---|---|---|---|---|
| Severe diarrhea: moderate SOE | 4 (256) | RCT/Fair | Consistent | Direct | Imprecise | –4 (95% CI –8 to +1) |
| | 14 (28,400) | Cohort/Fair | Consistent | Direct | Precise | –5 (95% CI –8 to –2) |
| Improved quality of life: high SOE | 6 (265) | RCTs/Good | Consistent | Direct | Precise | 5 (95% CI 1 to 7) |

CI = confidence interval; RCT = randomized controlled trial

Grading Strength of Evidence: Presentation of Results (Insufficient and Low Grades)

| Outcome and strength of evidence (SOE) | Number of studies (subjects) | Domain: risk of bias (design/quality) | Domain: consistency | Domain: directness | Domain: precision | Magnitude of effect: absolute risk difference per 100 patients |
|---|---|---|---|---|---|---|
| Mortality: insufficient SOE | 1 (80) | RCT/Fair | Unknown | Direct | Imprecise | –1 (95% CI –4 to +3) |
| | 14 (384) | Retrospective cohort/Fair | Inconsistent | Direct | Imprecise | –7 to +5 (range) |
| Myocardial infarction: low SOE | 7 (625) | Retrospective cohort/Low | Consistent | Direct | Imprecise | –3 (95% CI –5 to –1) |

CI = confidence interval; RCT = randomized controlled trial

The grading system used by the Evidence-based Practice Centers (EPCs) is similar to the GRADE system.

The EPC grading system reflects the needs of AHRQ stakeholders for reviews on a wide variety of topics and not for recommendations or guidelines.

The main differences between the two grading systems:
  The definitions of domains differ slightly; in the EPC system “directness” excludes “applicability,” which is handled separately.
  In the EPC system, observational studies are considered to have less risk of bias for outcomes such as harms, which can raise the initial grade to “moderate.”
  The definition of overall grade differs; the EPC system emphasizes confidence in estimate, whereas the GRADE system emphasizes effect of future research.
  The EPC system permits three different ways to reach an overall strength-of-evidence grade; the GRADE formula has one.

Comparison With the GRADE System

Is a critical last step in analysis and presentation

Is done after the quality of articles is rated by at least two independent reviewers

Helps users of systematic reviews understand the body of evidence and how much confidence they can have in making decisions based on that evidence

Uses scores on four primary (mandatory) domains and four additional (discretionary) domains

Focuses on major outcomes and comparisons

Is denoted in terms of high, moderate, or low strength or insufficient evidence

Presents strength-of-evidence grades in tabular form

Summary: Grading Strength of Evidence

Atkins D, Best D, Briss PA, et al, for the GRADE Working Group. Grading quality of evidence and strength of recommendations. BMJ. 2004;328:1490.

Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions. In: Agency for Healthcare Research and Quality. Methods Guide for Comparative Effectiveness Reviews [posted July 2009]. Rockville, MD. Available at: http://effectivehealthcare.ahrq.gov/healthInfo.cfm?infotype=rr&ProcessID=60.

Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions - Agency for Healthcare Research and Quality and the Effective Health Care Program. J Clin Epidemiol 2010;63:513-523.

References

This presentation was prepared by Kathleen N. Lohr, Ph.D., a Distinguished Fellow at RTI International.

This module is based on an update of chapter 11 in version 1.0 of the Methods Guide for Comparative Effectiveness Reviews (updated chapter available at: http://effectivehealthcare.ahrq.gov/ehc/products/60/318/2009_0805_grading.pdf).

Author
