Post on 27-Jul-2020
Developing Software to Predict Patient Responses to Knee Osteoarthritis Treatments and to Identify Patients for Possible Enrollment in Randomized Controlled Trials
Harry P. Selker, MD, MSPH, Denise H. Daudelin, RN, MPH, Robin Ruthazer, MPH, Manlik Kwong,
BSEE, BSCS, Rebecca C. Lorenzana, BA, Daniel J. Hannon, MS, PhD, John B. Wong, MD, David M.
Kent, MD, CM, MS, Norma Terrin, PhD, Alejandro D. Moreno-Koehler, BS, MPH, Timothy E.
McAlindon, MD, MPH
Original Project Title: A Method for Patient-Centered Enrollment in Comparative Effectiveness Trials: Mathematical Equipoise
PCORI ID: ME-1306-02327
HSRProj ID: 20143597
_______________________________
To cite this document, please use: Selker HP, Daudelin DH, Ruthazer R, et al. 2019. Developing Software to Predict Patient Responses to Knee Osteoarthritis Treatments and to Identify Patients for Possible Enrollment in Randomized Controlled Trials. Washington, DC: Patient-Centered Outcomes Research Institute (PCORI). https://doi.org/10.25302/9.2019.ME.130602327
Table of Contents
ABSTRACT
BACKGROUND
PATIENT AND STAKEHOLDER PARTICIPATION
  Selection of Outcomes
  Modeling Database Creation
  Predictive Model Development and Results
  User Interface Development and Testing
METHODS
  Selection of Data Sets and Description of Outcomes
  Evaluation of Registry Variables
  Creating the Modeling Database
  Creating Predictive Models for Outcomes
  Prototype Decision Support Software Development, Interface Design, and Usability Testing
RESULTS
  Study Design and Database Creation
  Study Sample
  Model Development
  Prototype Decision Support Software Development, Interface Design, and Usability Testing
DISCUSSION
  Study Results in Context
  Uptake of Study Results
  Study Limitations
  Future Research
CONCLUSIONS
REFERENCES
Acknowledgments
APPENDIX
ABSTRACT
Background: Although they represent a standard of evidence, randomized controlled trials (RCTs) often fall short because of insufficient or unrepresentative enrollment, and many needed trials are never conducted. This leaves gaps in evidence to inform patient care decisions and creates a need for a method to facilitate RCTs in usual care settings.
As medical therapies become progressively less satisfactory for patients with osteoarthritis, an average of 680 886 patients receive surgical knee replacement per year in the United States. Yet, there have been no substantial comparative effectiveness RCTs of medical treatment versus surgical total knee replacement (TKR). The question about TKR for knee osteoarthritis is suitable for exploring a method that would facilitate the conduct of comparative effectiveness RCTs by assisting discernment of patient-specific equipoise between treatments.
Clinical equipoise is a prerequisite for enrollment into an RCT; likewise, mathematical equipoise is the use of mathematical models to predict and compare patient-specific outcomes of alternative treatment options that should be considered when enrolling patients into an RCT. When the predictions are similar, suggesting equipoise, then random treatment assignment may be justified, and the patient may feel more comfortable enrolling in the RCT. When the predictions suggest one treatment is better than another, trial enrollment may be inappropriate, but the predictions still can inform clinical decision making.
Objectives: This project aimed to use mathematical equipoise for making patient-specific comparisons of alternative treatment outcomes of TKR versus nonsurgical treatment of knee osteoarthritis as a way to consider enrollment into a comparative effectiveness RCT.
Methods: We first obtained the views of patient stakeholders with knee osteoarthritis to identify key pain and physical function outcomes. After creating a consolidated database from non-RCT sources of knee osteoarthritis outcomes, and adjusting for the inherent differences between the databases, we developed multivariable mathematical models that predict patient-specific pain and physical function outcomes for TKR or nonsurgical treatment. We then developed the Knee Osteoarthritis Mathematical Equipoise Tool (KOMET) user interface based on these models to discern patient-specific equipoise. We pilot tested the interface to assess usability and responsiveness to the needs of patients and physicians and its adequacy for supporting shared decision making, both for RCT enrollment and for treatment.
Results: We incorporated KOMET regression models into prototype KOMET decision support software, which we successfully pilot tested in a range of clinics. Patients found it very helpful in making treatment decisions, but only 7 of the 12 understood the concept of equipoise.
Conclusions: This project demonstrated the use of mathematical equipoise as a method for providing patient-specific decision support for shared patient–physician decision-making for selecting between alternative treatments and considering enrollment into a comparative effectiveness RCT.
Limitations and subpopulation considerations: Although largely accomplishing its intended objectives, as an early stage in the development of mathematical equipoise decision support, this project has limitations related to the available clinical data, the modeling methods and variables, and the prototype software. The next step will be to conduct a larger-scale test, and then to implement it for its intended use—the conduct of a comparative effectiveness trial in usual care settings.
BACKGROUND
Symptomatic knee osteoarthritis has an estimated prevalence of 17% to 34% in US adults1
and is the most frequent cause of dependency in lower limb tasks, especially in elderly patients.2
It has considerable economic and societal costs, including 68 million work-loss days per year, and
is the cause for more than 5% of the annual retirement rate and for hundreds of thousands of
hospital admissions.3-6 For many patients, as osteoarthritis progresses, medical and physical
therapy become less satisfactory, making this the most frequent reason for joint replacement
surgery.4
There are concerted efforts to develop drugs that retard the progression of osteoarthritis,
many through preserving cartilage. Ultimately, effective intervention will require addressing the
multistructure failure inherent to osteoarthritis, which includes periarticular bone as well as soft-
tissue structures within the joint. Meanwhile, total knee replacement (TKR) has become the
ultimate standard for treatment, now completed for an average of 680 886 patients per year in
the United States, with aggregate charges greater than $36 billion.7
Shared patient–clinician decision making is particularly germane to deciding between
medical treatments and surgical knee replacement. Not only do patient preferences have great
relevance, but the availability of treatments, their inconvenience and expense, and the
accumulation of comorbidities over time are all salient.7 Compromising these decisions are gaps in
patient-specific information about alternatives and their effects in different populations.8 At the
time we initiated this project, we found decision aids but no explicit predictive models in the
literature or published randomized controlled trials (RCTs) of medical versus surgical treatment of
knee osteoarthritis. At the time of this writing, a Danish study of 100 patients with knee
osteoarthritis who were eligible for unilateral total knee replacement was the only such known
trial to show that TKR followed by nonsurgical treatment resulted in greater pain relief and
functional improvement after 12 months versus nonsurgical treatment alone.9 However, TKR was
associated with more serious adverse events than nonsurgical treatment, and most patients who
were randomly assigned to nonsurgical treatment alone did not require TKR within the study’s
12-month follow-up.9 Thus, the question is far from settled at this point.
The description and measurement of clinical change in knee osteoarthritis is not
necessarily reliable, undermining comparisons of alternative treatments.10 Moreover, the cross-
sectional US national DECISIONS survey found that more than half of patients discussing knee or
hip surgery underestimated the harm from surgery, and only 28% correctly estimated the amount
of pain relief following surgery.11
As clinical decision support, we previously created and tested predictive instruments
based on multivariable logistic regression models that provide 0% to 100% predictions of medical
diagnoses and outcomes of treatments.12-15 They have been used successfully for short-term
decisions such as whether to hospitalize a patient and/or to treat for acute myocardial infarction.
These emergency decisions are dominated more by physician judgment than are decisions about
longer-term and more complex treatments. Decision support for more complex decisions—for
which shared patient–clinician decision making is central—has been well studied. A 2014
Cochrane systematic review of 115 RCTs found that decision aids increased patient knowledge,
improved accuracy of risk perceptions when expressed in probabilities, enhanced concordance
with patient values when including a values clarification exercise, and reduced decisional conflict
due to feeling uninformed and unclear about personal values.16 Similar decision aid benefits have
been seen for patients with osteoarthritis considering hip or knee arthroplasty.17-21
Accordingly, the objective of this project was to create the Knee Osteoarthritis
Mathematical Equipoise Tool (KOMET), intended to be embedded in electronic health records
(EHRs) as decision support for shared clinical decision making about patients’ choices of
treatment, especially between medical treatment and TKR. Additionally, this shared decision-
making is intended to identify patients for whom, based on their specific characteristics, there is
insufficient evidence to favor 1 of 2 or more treatment alternatives. This situation is referred to as
clinical equipoise, the ethical and scientific basis for enrolling patients in a randomized clinical
trial. Shared patient–clinician decision making is important in this circumstance, when patients’
personal preferences and objectives can dominate what otherwise might appear to be a toss-up
treatment decision.22 By illustrating the generation and use of patient-specific equipoise, KOMET
also is intended to support shared decision making about participation in RCTs, as an example
implementation of mathematical equipoise, for practical, ethical, targeted enrollment into
comparative effectiveness RCTs. If successful, presumably this approach could be used in many
other conditions and clinical decisions.
In developing our cardiac predictive instruments, we were fortunate to have extensive
patient-level data from RCTs. A great advantage of such data is that random assignment of
treatments helps avoid having treatment effects biased by the selection of treatments and their
use among patients. RCT data allow the multivariable regressions to accurately reflect the effect
of a treatment when used in comparable patients; however, RCTs are expensive and time-
consuming, and there are many conditions and treatments for which RCT-generated data are not
available. Moreover, for the circumstances in which we might want to run a new RCT—for which
we would potentially use mathematical equipoise for participant selection—there often will be
few or no RCTs. In this case, to create predictive models, we must use data from observational
studies, registries, EHR-based data warehouses, patient-acquired data feeds, and other sources.
Registries of various patient groups and populations are relatively inexpensive and common, and
EHRs generate increasingly more data available in databases and data warehouses. If these non-
RCT sources could be used for creating predictive models, there would be vast opportunities for
the mathematical equipoise approach to facilitate the conduct of clinical effectiveness RCTs, but
there are protean challenges and limitations to this.
Clinical equipoise—the ethical and scientific basis for randomly assigning patients different
treatments—is considered no longer present after a pivotal clinical trial shows one treatment is
better than alternatives. All patients then must be offered the most-effective known therapy.
Typically, however, this is not an individual patient-centered determination; only group-based
general inclusion and exclusion criteria are available. Mathematical equipoise is intended as a
method by which, for a given condition, only those individuals for whom there still is uncertainty
could be enrolled in a comparative effectiveness trial, while individuals for whom the question is
settled would not be enrolled.22 The objective is to generate RCT evidence based on
individualization of treatments consistent with the principle of equipoise. This ultimately could
allow treatment that accounts for the heterogeneity of treatment effects among different
individuals and groups.
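The idea of patient-specific equipoise can be sketched in a few lines of code. This is only an illustration under stated assumptions, not the KOMET implementation: the class and function names, the use of a single outcome score where lower is better, and the `margin` that defines "similar" predictions are all hypothetical and would need to be chosen clinically.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Model-predicted outcome for one patient under one treatment (hypothetical type)."""
    treatment: str
    score: float  # assumed outcome score where lower is better (eg, predicted pain)


def in_equipoise(a: Prediction, b: Prediction, margin: float) -> bool:
    """True when the two predicted outcomes differ by no more than `margin`.

    `margin` is an assumed, user-supplied threshold; the report does not specify one here.
    """
    return abs(a.score - b.score) <= margin


def summarize(a: Prediction, b: Prediction, margin: float) -> str:
    """Describe whether the predictions suggest equipoise or favor one treatment."""
    if in_equipoise(a, b, margin):
        return "equipoise: RCT enrollment may be considered"
    better = a if a.score < b.score else b
    return f"predictions favor {better.treatment}"
```

For example, with a margin of 5 points, predicted scores of 30 and 33 would be treated as equipoise, whereas scores of 20 and 45 would favor the treatment with the lower predicted score. Either way, the predictions remain available to inform the shared decision.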
If embedded in EHRs and computerized physician ordering systems, potentially,
determination of mathematical equipoise could serve as a practical way in routine clinical care to
detect all eligible patients for possible RCT enrollment. It also could identify those patients not
suitable for enrollment, for whom it could enhance clinical care by indicating the potentially best
treatment. Also, the basis for selection for a clinical study could be transparent to patients and
clinicians in real time to enhance truly informed consent during clinical care.
In this project we sought to create KOMET as an example of mathematical equipoise. To
represent the prevailing circumstances in which this approach would be used, we used patient-
level data from existing non-RCT sources to build predictive models of treatment outcomes; these
models determined the presence or absence of mathematical equipoise to inform decision
making. We sought to illuminate limitations of available data and to explore strategies for
overcoming such limitations to optimize modeling. Success in using non-RCT data in this way
would support the goal of widespread use of the mathematical equipoise method.
We also sought to demonstrate through this project the utility of incorporating
stakeholder input to ensure relevance of the ultimate predictive models to patient–physician
decision making. Although research into the engagement of stakeholders in research is still
evolving in its terminology and frameworks,23,24 the criterion we used for this project—intended
for comparative effectiveness research (CER)—was “individuals, organizations, or communities
that have a direct interest in the process and outcomes of a project, research, or policy endeavor.”
PATIENT AND STAKEHOLDER PARTICIPATION
We engaged stakeholders throughout the entire project to ensure the relevance of the
ultimate models and the decision support to patient–physician decision making. Patient,
researcher, and clinician stakeholders were involved in the selection of study questions, choice of
study outcomes, selection of candidate variables for the modeling database and the predictive
model, and development and testing of the user interface. To foster this, we used the Patient-
Centered Outcomes Research Institute’s (PCORI’s) 6 engagement principles: reciprocal
relationships, co-learning, partnerships, transparency, honesty, and trust, all of which allow for
effective engagement in research.25
We held quarterly in-person meetings to build reciprocal relationships among stakeholders
and the research team, to educate stakeholders about the research methods being used, and to
solicit patient, researcher, and clinician stakeholder input. Participating groups included (1)
patients with or at risk of having knee osteoarthritis, (2) patient advocates for those with arthritis,
(3) clinicians who cared for these patients, and (4) knee osteoarthritis researchers.26
We identified interested patient and advocate stakeholders through discussions with
clinicians, knee osteoarthritis researchers, and the Arthritis Foundation. The patient panel
included 3 women and 4 men representing people at risk for knee osteoarthritis due to existing
osteoarthritis in other joints, people actively considering treatment options for their existing knee
osteoarthritis, and patients who had received TKR for osteoarthritis. We recruited clinician
stakeholders from primary care, orthopedics, and rheumatology. The clinician panel included 2
rheumatologists, 2 primary care physicians, 2 orthopedic surgeons, and 1 physical therapist, some
of whom had a dual role representing researchers.
Selection of Outcomes
We chose the 2 outcome scales on which we built our models, the Western Ontario and
McMaster Universities Arthritis Index (WOMAC) and SF-12 Health Survey (SF-12) physical
component scores, after discussions with clinician and patient stakeholders. Factors considered
included the time frame of the outcome beyond surgery and the meaningfulness to someone
making a decision about surgery, taking into account constraints imposed by our available data
sources. Stakeholders were strongly supportive of using both the pain and functional outcome
scores, as both were part of patients’ decision-making processes.
Modeling Database Creation
We created a modeling database from 4 data sets, matching patients who had surgical
treatment with ones who had nonsurgical treatment. To guide our choice of the variables on
which they would be matched, we gathered input from clinicians on the research team, clinician
and patient stakeholders, and results from prior published literature. Variable choices for these
models were informed by the needs of stakeholders who would use decision support for knee
osteoarthritis, focusing on their views about the representation of pain and functional outcomes.
Predictive Model Development and Results
We provided all stakeholders with an orientation to the modeling process to foster their
ability to provide input on selection of candidate variables for model development. Interaction
terms in the statistical models allow differences in predicted benefit for different patients, so
receiving input on plausible interactions was important. The candidate primary and interaction
variables included in the model selection process were those stakeholders considered important,
plausible, and easily and reliably provided. We considered outcome variables based on
stakeholder ranking of how much the variable would be related to pain and functional outcomes 1
year in the future.
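To make the role of interaction terms concrete, the following sketch shows a linear predictor in which the effect of TKR depends on a baseline covariate. It is an illustration only: the single covariate, the function names, and every coefficient value are made up for the example and are not estimates from the KOMET models.

```python
def predicted_pain(baseline_pain: float, tkr: int,
                   b0: float, b_base: float, b_tkr: float, b_inter: float) -> float:
    """Linear predictor with a treatment-by-baseline interaction term.

    tkr is 1 for TKR and 0 for nonsurgical treatment. All coefficients are
    hypothetical inputs supplied by the caller.
    """
    return b0 + b_base * baseline_pain + b_tkr * tkr + b_inter * tkr * baseline_pain


def predicted_benefit(baseline_pain: float, **coef: float) -> float:
    """Predicted pain without TKR minus predicted pain with TKR (positive favors TKR)."""
    return (predicted_pain(baseline_pain, 0, **coef)
            - predicted_pain(baseline_pain, 1, **coef))


# Made-up coefficients, chosen only to show the mechanics.
COEF = dict(b0=10.0, b_base=0.5, b_tkr=-5.0, b_inter=-0.25)

low = predicted_benefit(20.0, **COEF)   # patient with lower baseline pain
high = predicted_benefit(60.0, **COEF)  # patient with higher baseline pain
```

Without the interaction term (`b_inter = 0`), the predicted benefit would be the same for every patient. With it, the benefit varies with the baseline value, which is why stakeholder input on plausible interactions matters.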
We sought clinician and patient stakeholder input on the clinical significance of the results
of predictive modeling. As the project evolved, the research team and stakeholders concluded
that many of the variables under consideration were too burdensome to collect or too difficult to
ascertain. To accommodate this, we adjusted the models in ways that did not significantly affect their
performance characteristics.
User Interface Development and Testing
Both clinician and patient stakeholders contributed extensively to the design of the user
interface of the decision support application. They reviewed its presentation of outcome
predictions and its usability. Their recommendations led to improvements in the wording and
ordering of the questions, instructions, and display of predicted outcomes.
METHODS
To develop KOMET predictive models for outcomes of TKR and of nonsurgical treatments,
we created a consolidated database with treatment outcomes of knee osteoarthritis from a
variety of clinical study and registry data. We selected model variables based on input from
patients and clinicians about the best capture of important determinants of outcomes and
measurements of the clinical outcomes as well as on variables’ contributions to models’
predictive performance. We incorporated these models into prototype decision support software
and tested them with stakeholders, clinicians, and patients.
Selection of Data Sets and Description of Outcomes
To create the modeling database, we considered a range of knee osteoarthritis databases
(briefly described below) as well as the scales used in these databases: the WOMAC (for pain) and
the SF-12 (for functional status). We selected 3 of the databases (MOST, OAI, and CORP) because
they are large, well established, and publicly available epidemiological studies of knee
osteoarthritis. The 2 additional databases are knee osteoarthritis registries (NEBH and TMC)
that we determined to have adequate cases and the required indexes and that were available from
collaborating organizations.
Multicenter Osteoarthritis Study (MOST)27: MOST is an NIH-sponsored longitudinal,
prospective, observational study of knee osteoarthritis in adults with osteoarthritis or at increased
risk of developing osteoarthritis.27 The database includes a community-based sample of 3026
participants aged 50-79, with preexisting osteoarthritis or those at high risk for osteoarthritis
based on weight, knee symptoms, or a history of knee injuries or operations. Approximately 60%
are women and 15% are African Americans. The cohort was followed for 84 months, and data
were collected through clinical assessments, radiological studies, several measures and
instruments, and telephone interviews. The study focused on mechanical risk factors, causes of
knee symptoms and pain, and the long-term disease trajectory of knee osteoarthritis. Data used
in this article were obtained from the MOST, available for public access at http://most.ucsf.edu.
Osteoarthritis Initiative (OAI)28: The OAI is an NIH-sponsored multicenter, longitudinal,
prospective observational study of osteoarthritis intended as a public domain research resource.
Its database includes clinical evaluation data, radiological (X-ray and MRI) images, and a
biospecimen repository for 4796 men and women aged 45-79 who have, or are at high risk for
developing, symptomatic knee osteoarthritis. Data used in this article were obtained from the OAI
database available for public access at http://www.oai.ucsf.edu/.
Canadian Osteoarthritis Research Program (CORP)29-31: The Women’s College Hospital
CORP data set includes 2200 participants of this prospective, population-based cohort with at
least moderately severe knee osteoarthritis, aged 55 or older. Ultimately, because of the
challenges with this data set, we did not use it for this project.
New England Baptist Hospital (NEBH) Orthopedic Surgery Registry32: The NEBH registry
includes 2462 patients who have undergone TKR there since 2011. Assessments occur prior to
surgery, at 6 weeks, and at 12 months. Data collected include demographics, vital signs, clinical
measures, medications, knee examination, the Knee Society Score (KSS) pain and physical function
score, the SF-12 health status score, surgical complications, and procedure outcomes. The mean
age of patients is 68 years, and 57% are women.
Tufts Medical Center (TMC) Orthopedic Surgery Registry33: The TMC registry includes 535
patients who have received TKR since 2007. Assessments occur prior to surgery and at 6 weeks, 12
months, and 24 months. Data collected include demographics, vital signs, clinical measures,
medications, knee examination, pain and physical function (KSS), health status (SF-12), surgical
complications, and procedure outcomes. The mean age of patients is 62, and 61% are women.
The Western Ontario and McMaster Universities Arthritis (WOMAC) Index34: The WOMAC,
developed in 1982, is widely used in the evaluation of hip and knee osteoarthritis and is available
in more than 100 languages. It is a self-administered questionnaire of 24 items, divided into 3
subscales: (1) pain (5 items) during walking, using stairs, in bed, sitting or lying, and standing
upright; (2) stiffness (2 items) after first waking and later in the day; and (3) physical function (17
items) using stairs, rising from sitting, standing, bending, walking, getting in and out of a car,
shopping, putting on and taking off socks, rising from bed, lying in bed, getting in and out of a
bath, sitting, getting on and off the toilet, heavy domestic duties, and light domestic duties. We
used the knee pain scale as the primary outcome in this project. In its raw form the WOMAC knee
pain scale ranges from 0 to 20. To make it easier to interpret and represent in the final models, we
rescaled it to 0 to 100, with 0 representing absence of pain and 100 representing extreme pain.
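The rescaling can be illustrated as follows. The text states only the two ranges, so the simple linear mapping used here (multiplying the raw score by 100/20) is an assumption, and the function name is hypothetical.

```python
def rescale_womac_pain(raw_score: float) -> float:
    """Map a raw WOMAC pain score (0 to 20) onto a 0 to 100 scale.

    Assumes a linear rescaling, with 0 meaning no pain and 100 meaning extreme pain.
    """
    if not 0 <= raw_score <= 20:
        raise ValueError("raw WOMAC pain score must be between 0 and 20")
    return raw_score * 100 / 20
```

Under this assumption, a raw score of 7 corresponds to a rescaled score of 35.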
SF-12® Health Survey: The SF-12 is a multipurpose short-form generic measure of health
status.35,36 It was developed to be a much shorter, yet valid, alternative to the SF-36® for use in
large surveys of general and specific populations and for large longitudinal studies of health
outcomes. We used its physical functioning summary score as the second predicted outcome for
this project. The SF-12 scores range from 0 to 100, with higher scores indicating better function.37
Evaluation of Registry Variables
We used a consensus process involving clinician investigators and stakeholders to select
variables for model development. First, clinicians were asked to rank variables based on their
impact on (1) predicting prognosis for pain or function, with or without surgery, and/or (2)
predicting assignment to medical or surgical treatment (ie, indications or contraindications for
treatment).
They a priori ranked each variable from A to D:
A. Variables that almost certainly must be included in the model; eg, age
B. Variables that would be desirable to have as established risk factors for the outcome; eg,
body mass index (BMI)
C. Variables that would be desirable to have for exploratory analyses; eg, history of falls
over the past 12 months
D. Variables not likely to be needed; eg, family history of arthritis
Finally, a few variables were ranked by clinicians for importance and ease of collection
using a scale of 1 to 10, with 10 being very important or very hard to collect. We collapsed the
importance rankings into 3 categories: not at all important (1-3), fairly important (4-7), and very
important (8-10). Clinicians ranked most of the variables as easy to collect. We included in the
modeling database the final list of variables deemed as fairly important and very important.
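The collapsing of the 1-to-10 importance rankings into 3 categories, and the resulting inclusion rule, can be written out as a short sketch. The function names are hypothetical, and the sketch covers only the importance ranking, not the ease-of-collection ranking.

```python
def importance_category(score: int) -> str:
    """Collapse a 1-to-10 importance ranking into the 3 categories described in the text."""
    if not 1 <= score <= 10:
        raise ValueError("importance ranking must be between 1 and 10")
    if score <= 3:
        return "not at all important"
    if score <= 7:
        return "fairly important"
    return "very important"


def include_in_modeling_database(score: int) -> bool:
    """Variables ranked fairly important or very important are kept."""
    return importance_category(score) != "not at all important"
```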
Creating the Modeling Database
The database for creating KOMET models included 2 types of registries. Two databases,
MOST and OAI, had data collected on knee osteoarthritis at fixed intervals per their protocols.
During the course of follow-up, some patients had TKR and continued to be followed afterward.
The 2 other registries, NEBH and TMC, were from hospitals that collected baseline and follow-up
data only on their patients who had TKR.
For this project, our target sample was patients who had knee osteoarthritis and had
reached the clinical stage at which they would be deciding whether to have TKR. Lacking a cohort
of such patients randomized to the medical or surgical options, we used data from patients who
had TKR and matched them to patients (knees) who did not have TKR but who had similar
characteristics. Where possible, we matched non-TKR knees to TKR knees within the same
database (OAI, MOST). We matched TKR knees from the NEBH and TMC registries to non-TKR
knees from MOST and OAI based on the best match. In practice, we created a database in which
we used the knee as the unit of analysis, and we conducted matching based on characteristics of
the knee and the patient. Thereby, we created a study sample of patients who would or could be
considering this therapeutic choice.
For the MOST and OAI registries, we identified all knees that underwent TKR and then
designated the data collected at the closest previous visit as the baseline visit for that TKR. We
then extracted baseline data on these TKR knees from the patients’ registry data, including
demographics, knee characteristics, comorbidities, mental and physical function, and other
clinical features. To find non-TKR control knees, we created a sub-database of all knee visits from
all patients, excluding any that occurred after a TKR. We then used a greedy matching computer
algorithm38 to select control knees for each TKR knee (within the same database, OAI or MOST).
It should be noted that the variables used for matching differed among the databases, based on
data availability. As a guide to determine variables to use for matching, we used input from
research team clinicians, stakeholders, and the literature. For matching, we converted continuous
variables to categories. We loosely based categories on Riddle et al., who presented an
algorithm to judge the appropriateness of TKR.39 Our research team considered the factors used
in that algorithm as reasonable factors to match on where possible. Categories were ordered, and
we did not allow matches beyond one category of difference. We did not always require exact
matches because we did not want to lose patients who had TKR from the model-building sample,
and we could statistically adjust for differences between the TKR and non-TKR groups in the
modeling process. Thereby, we matched each TKR knee in OAI with a similar non-TKR knee in OAI
based on values of matching variables at baseline. The same was true for MOST.
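The matching rule described above (ordered categories, no match allowed beyond one category of difference, greedy selection) can be sketched as follows. This is a minimal illustration, not the cited algorithm: the `Knee` class, the summed-gap distance, the without-replacement behavior, and the example knees are all assumptions made for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Knee:
    knee_id: str
    categories: tuple  # ordered category index for each matching variable (hypothetical)

def category_distance(a, b, max_step=1):
    """Sum of per-variable category gaps, or None if any gap exceeds max_step."""
    total = 0
    for x, y in zip(a.categories, b.categories):
        gap = abs(x - y)
        if gap > max_step:
            return None  # match not allowed beyond one category of difference
        total += gap
    return total

def greedy_match(tkr_knees, control_knees, max_step=1):
    """For each TKR knee in turn, take the closest still-unused control knee."""
    available = list(control_knees)
    pairs = []
    for tkr in tkr_knees:
        best, best_d = None, None
        for control in available:
            d = category_distance(tkr, control, max_step)
            if d is not None and (best_d is None or d < best_d):
                best, best_d = control, d
        if best is not None:
            pairs.append((tkr.knee_id, best.knee_id, best_d))
            available.remove(best)  # assumption: controls are not reused here
    return pairs

# Made-up example: three matching variables, each coded as an ordered category index.
tkr = [Knee("T1", (2, 1, 0)), Knee("T2", (0, 0, 1))]
controls = [Knee("C1", (2, 1, 1)), Knee("C2", (2, 1, 0)), Knee("C3", (0, 1, 1))]
pairs = greedy_match(tkr, controls)
print(pairs)
```

Because the selection is greedy, the result depends on the order in which TKR knees are processed; an optimal matching algorithm could pair knees differently.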
Because the TMC and NEBH samples included only TKR subjects, we drew their matched
non-TKR controls from a pooled data set of knee visits from the OAI and MOST registries.
We established exclusion criteria based on discussions with the research team members
and applied them before we performed modeling. We excluded any knee that did not have
follow-up information (9 months to 5 years after the baseline visit or TKR) on the same knee in the
same state (TKR versus non-TKR). If a knee visit was a candidate control but had TKR at some
point between that visit and a follow-up at least 9 months later, we excluded it from the pool of
non-TKR knees used for matching. If a knee had TKR but did not have pre-TKR baseline data within
12 months of the TKR, we excluded that TKR. If a knee had TKR, we excluded the contralateral
knee from the pool of non-TKR knees used as controls. If a patient had TKR on 2 knees, more than
90 days apart, we excluded both knees; with an interval of >90 days, we were concerned that the
1-year evaluation of pain and function for the first knee could still be during the recovery period
of the surgery for the second knee, which would confound the assessment of the outcome. If
bilateral surgery was completed on 2 knees within 90 days of each other, we used the first knee or
randomly chose one if both knees were completed on the same day. There was 1 exception in the
MOST data for which 14 patients were counted twice, including and following each bilateral knee
separately. In the full database, 104 other patients were counted 2 times (92 patients) or 3 times
(12 patients). Overall, 1322 patients contributed data for 1452 matched knees for these analyses.
We did allow single patients to contribute both a control and TKR knee when surgeries were far
enough apart in time to allow full follow-up on each independently. We also allowed OAI and
MOST control knees to be reused for the matching process for TKR knees from the NEBH and TMC
registries. See Appendix A for details and limitations of this approach.
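The bilateral TKR rule above can be written as a small decision function. This is a hedged sketch only: the function name and the tuple layout are hypothetical, and the MOST exception and the other exclusion criteria are not modeled.

```python
import random
from datetime import date

def knees_kept_after_bilateral_tkr(first, second, rng=None):
    """Apply the 90-day rule to a patient with TKR on both knees.

    first, second: (knee_label, surgery_date) tuples (hypothetical layout).
    Returns the list of knee labels retained as TKR knees.
    """
    rng = rng or random.Random()
    (label_a, date_a), (label_b, date_b) = first, second
    gap_days = abs((date_b - date_a).days)
    if gap_days > 90:
        return []  # surgeries more than 90 days apart: exclude both knees
    if gap_days == 0:
        return [rng.choice([label_a, label_b])]  # same day: choose one at random
    return [label_a if date_a < date_b else label_b]  # within 90 days: keep the first knee

far_apart = knees_kept_after_bilateral_tkr(("L", date(2010, 1, 1)), ("R", date(2010, 6, 1)))
close = knees_kept_after_bilateral_tkr(("L", date(2010, 2, 1)), ("R", date(2010, 1, 1)))
same_day = knees_kept_after_bilateral_tkr(
    ("L", date(2010, 1, 1)), ("R", date(2010, 1, 1)), rng=random.Random(0)
)
print(far_apart, close, same_day)
```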
On the matched data set, we compared baseline characteristics between knees with and
without TKR, using chi-square tests and t tests. To account for missing data, we used multiple
imputation, creating 10 imputed data sets for each study source. We also compared baseline
characteristics on imputed data sets as we used these for model development. We adjusted P
values from the analysis of the multiple imputation data set to account for imputation variability.40
We used SAS software for these analyses, using the multiple imputation (MI) procedure to impute
the data and MIANALYZE to process the results of analyses on the imputed data. See Appendix
B.41
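Results from multiply imputed data sets are conventionally combined with Rubin's rules, which add the between-imputation variance to the average within-imputation variance. The sketch below illustrates that pooling step in Python; it is not the SAS code used in the project, and the 10 estimates and variances are made-up values chosen only to mirror the 10 imputed data sets.

```python
from math import sqrt
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine one parameter's estimates from m imputed data sets (Rubin's rules)."""
    m = len(estimates)
    pooled = mean(estimates)           # pooled point estimate
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical estimates of a single coefficient from 10 imputed data sets.
estimates = [5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 5.2, 4.8, 5.1, 4.9]
variances = [1.0] * 10
pooled, total_var = pool_rubin(estimates, variances)
print(pooled, total_var, sqrt(total_var))  # estimate, total variance, pooled standard error
```

The pooled standard error is larger than any single-data-set standard error whenever the estimates differ across imputations, which is how imputation variability enters the adjusted P values.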
Creating Predictive Models for Outcomes
We conducted analyses using SAS for Windows (Version 9.4 TS Level 1M2. Cary, NC: SAS
Institute, 2002-2012) and SAS Enterprise Guide (Version 7.13 HF3. Cary, NC: SAS Institute, 2016).
We developed a multivariable linear regression model to predict the 1-year knee pain
outcome based on the WOMAC score or, when a database lacked WOMAC items, using an
estimated WOMAC score, as described in Appendix C. Our approach was to develop the model
using a set of matched TKR to non-TKR knees from the OAI database and then to validate/test it
on a set of matched TKR to non-TKR knees from the MOST database. We then pooled the OAI and
MOST data sets and built a new model, starting with variables used in the model developed in the
OAI data and tested on the MOST data. We also rederived models on a database that included all
4 data sets (OAI, MOST, NEBH, and TMC). We used a similar variable selection process but with a
more limited set of candidate predictor variables because NEBH and TMC did not capture as many
variables as the OAI and MOST registries. We repeated this entire process for the functional
outcome (SF-12 physical component at 1 year). To create models that could provide predicted
estimates of 1-year knee pain and 1-year function, with and without TKR, for any patient based on
their characteristics, all models included an indicator variable for treatment type. We explored
covariates and interactions of treatment type with covariates in the different phases of the
modeling process. We did not adjust for matching in the linear regression during modeling
because the purpose of matching was to create a reasonably balanced study sample, and
covariates in the models could account for remaining imbalances between groups.42 We describe
further details of our approach in Appendix D.
Prototype Decision Support Software Development, Interface Design, and Usability Testing
The goal of software development and usability testing was to translate the results of the
predictive models into easily understood, patient-specific reports with predictions of 1-year
outcomes that could be produced in real time in the course of clinical care, for shared treatment
decision making and, if appropriate, enrollment into an RCT.
Decision Support Software Development: There were 2 KOMET development tasks, for the
analytics and for the user interface. Analytics development included implementing the predictive
models as reusable, multiplatform software components to generate both the current and 1-year
predicted pain and function outcomes for nonsurgical and surgical treatments. In addition, the
analytics software calculated the respective 95% confidence intervals around each prediction as
the basis for considering the degree of overlap that would suggest near equivalence, or equipoise.
User interface development included creating a web browser–based questionnaire interface to
collect patient demographics, items for computing the WOMAC pain score, the SF-12 physical
functioning scale, and comorbidities. Together, the user interface and analytics component
included methods for data retention and presentation of the predicted outcome results. We then
incorporated the predictive models into the web-based decision support application for iterative
user testing.
Interface Design: The user interface design process involved iterative prototyping of
methods to collect data for the predictive models, displaying the predictions through data tables,
bar charts, data plots, dynamic text descriptions, and printed reports, and determining and
alerting users about mathematical equipoise. We began with image mockups and storyboards,
then used online prototyping tools (www.axshare.com) to establish page layout, content
placement, and workflow. Once we identified key user interface elements, we finalized general
layout and content placement and conducted subsequent user interface design iterations on a live
website. We implemented the analytics components and user interface on a stand-alone web-based
application server using an Apache Tomcat 8 web server (Wakefield, MA: Apache
Software Foundation, 1999-2019).50
Usability Testing: We tested the prototype decision support application and iteratively
redesigned it to address patient and clinician user needs. We conducted initial testing with 12
research institute staff members as well as members of our patient and clinician stakeholder
panels. We tested the final design with 10 patients and 6 physicians in 3 clinical settings during
typical clinic and research-specific visits. Testing included (1) entering demographic data and
completing questionnaires to provide the information needed for the predictive models, (2)
interpreting predictive model results through data displays, and (3) determining user
understanding of the predictive models and mathematical equipoise and clinical trial
randomization through case-based discussions. Usability testing included a “think-aloud” protocol
and a usability testing script, as described in Appendix E.
A research assistant and the project director conducted testing. All sessions were recorded
and transcribed. Testing with research institute staff and stakeholders was conducted virtually or
in a conference room, and testing with patients and clinicians was conducted in the clinic setting.
The IRB determined the project was exempt from IRB review.
RESULTS
Study Design and Database Creation
The final database included 1452 knees (726 with TKR and 726 without) of 1322 patients.
Of patients, 91% (1204) had a single knee included in the database, 8% (106) had 2 knees used or
a single knee used 2 times, and 1% (12) had knees used 3 times. We matched TKR knees from OAI
to control knees from OAI, and we matched TKR knees from MOST to controls from MOST.
Because NEBH and TMC included only TKR knees, we drew the controls for those databases from
non-TKR knees from OAI and MOST. In the final matched database, the relative contributions of
TKR knees were OAI, 252; MOST, 154; NEBH, 248; and TMC, 72. For the control knees,
contributions were OAI, 472, and MOST, 254. Figure 1 and Appendix F: Figures 1a-1d provide
breakdowns of how we selected the final analysis sample from each database in CONSORT-type
figures.
Figure 1. Description of Final Analysis Sample Selection
[Flow diagram with one panel per data source: OAI (May 2014), MOST (January 2015), NEBH (December 2014), and TMC (July 2015).]
Starting samples: OAI, 4796 patients; MOST, 3026 patients; NEBH, 5519 subjects; TMC, 117 subjects.
First exclusion step (did not have osteoarthritis, no follow-up, bilateral TKR >90 days apart, prior to start): OAI, 4379 patients/8713 knees; MOST, 2957 patients/5914 knees; NEBH, 314 subjects/knees; TMC, 97 subjects/knees.
Second exclusion step: TKR on contralateral knee, no pre-TKR visit, no post-TKR visit.
Control pools: OAI, 4049 patients (8095 knees); MOST, 2652 patients (5071 knees). Controls for NEBH and TMC were drawn from the OAI and MOST pools. The OAI TKR sample is shown as 253 patients (278 knees **).
Matching variables in all 4 panels: age (<55, 55-65, >65), gender, SF-12 (<44, 44-56, >56), and Charlson (0, 1, ≥2, missing). Panel-specific matching variables: WOMAC pain + disability (Riddle based) on the incident and contralateral knees; WOMAC knee pain and contralateral knee pain, categorized on a 0-20 or 0-100 scale; location (Riddle category); K-L (Riddle), moderate/severe versus not; and change in WOMAC pain from prior visit (≥2 points versus not).
Matched samples (control knees/TKR knees): OAI, 252/252; MOST, 154/154; NEBH, 248/248; TMC, 72/72.
Study Sample
We compared distributions of variables used for the matching process between TKR and
non-TKR knees for each data source; these results are presented in Appendix F: Table 1a. They
confirmed that the matching algorithm had worked. In each database, characteristics used for
matching were well balanced between the TKR and non-TKR knees. Baseline characteristics
considered for the modeling process, and of interest to clinicians and stakeholders, were
comparable between TKR and non-TKR knees, as presented in Appendix F: Table 1b. This also was
true of the variables used in the final multivariable models using the imputed data, as shown in
Appendix F: Table 1c.
Baseline characteristics and outcomes at follow-up of the matched study sample are
summarized in Table 1. Approximately 40% were men, the mean age was 65, and the mean BMI
was 31. On the 0 to 100 pain scale (100 indicating extreme pain), the mean baseline knee pain
was significantly higher in the TKR group than in the non-TKR group (mean = 45.6 versus 40.5;
P < .01), despite efforts to match on this variable (categorized). Comparisons of mean baseline SF-12
scores between TKR and non-TKR groups showed better physical and mental function in the
non-TKR groups than in the TKR groups, with the difference being significant for physical function
(mean = 37.2 versus 38.6; P = .008). Overall, at follow-up there was less knee pain and better
physical function in the TKR groups than in the non-TKR groups. Irrespective of significance, we
used all variables listed in Table 1 in building the multivariable models of long-term
(approximately 1-year) outcomes.
Table 1. Description of Pooled Study Sample Used for Model Derivation for n = 1452 Matched Knees (Imputed Data)

| Variable | TKR (N = 726), Mean ± SD | Non-TKR (N = 726), Mean ± SD | TKR Minus Non-TKR Delta (∆) [95% CI] | Effect Size^a (∆/SD) |
|---|---|---|---|---|
| Baseline Characteristics | | | | |
| Age | 65.29 ± 8.57 | 64.77 ± 8.57 | 0.52 [–0.36 to 1.40] | 0.03 |
| Male, N (%) | 0.43 ± 0.49 | 0.42 ± 0.49 | 0.00 [–0.05 to 0.06] | 0 |
| Baseline BMI | 31.31 ± 6.49 | 30.97 ± 6.36 | 0.34 [–0.32 to 1.00] | 0.03 |
| Baseline SF-12 physical | 37.16 ± 9.46 | 38.59 ± 10.94 | –1.44 [–2.49 to –0.38] | –0.07 |
| Baseline SF-12 mental | 52.56 ± 11.48 | 53.62 ± 11.82 | –1.07 [–2.27 to 0.13] | –0.05 |
| Baseline WOMAC knee pain (0-100) | 45.59 ± 21.87 | 40.48 ± 21.76 | 5.11 [2.87 to 7.36] | 0.12 |
| Baseline knee pain, contralateral (0-100) | 18.92 ± 21.06 | 19.66 ± 22.05 | –0.74 [–2.96 to 1.48] | –0.02 |
| Baseline hip pain or pain/ache/stiffness | 0.34 ± 0.50 | 0.62 ± 0.51 | –0.27 [–0.33 to –0.22] | –0.27 |
| At least one comorbidity, N (%) | 0.32 ± 0.52 | 0.31 ± 0.56 | 0.01 [–0.05 to 0.07] | 0.01 |
| Narcotics, N (%) | 0.14 ± 0.36 | 0.13 ± 0.36 | 0.00 [–0.03 to 0.04] | 0.01 |
| Follow-up Results | | | | |
| Follow-up SF-12 physical | 44.48 ± 11.88 | 39.81 ± 10.80 | 4.67 [3.51 to 5.84] | 0.21 |
| Follow-up WOMAC knee pain (0-100) | 13.92 ± 19.44 | 29.22 ± 19.45 | –15.30 [–17.30 to –13.30] | –0.39 |

^a Shaded rows indicate variables in which definitions varied between databases such that these variables ultimately were excluded as candidates in the building of final models.
Model Development
We used linear regression to model the 2 outcomes, the WOMAC knee pain scale
(rescaled 0 to 100; see Appendix C: WOMAC Knee Pain, Part II) and the SF-12 physical functioning
component score.
Based on the methods described above, we chose these outcomes (including timing),
prior to building models, following repeated discussions with clinician and patient stakeholders
and the research team. We chose 1 year as the target follow-up time to have a time point beyond
the recovery time from surgery, estimated as up to 9 months. Stakeholders felt benefits of surgery
were stable beyond that time point. To address inconsistencies and gaps, we allowed the use of data
from up to 5 years past baseline when no time point closer to 1 year was available for a knee.
Stakeholders were strongly supportive of using both the pain and functional outcomes in
patients’ decision-making processes, although the outcomes were not of equal importance to all
patient stakeholders. As the project progressed, the team continued to receive more input from
patient and clinician stakeholders, which influenced modeling, an example of which is described
in Appendix G.
Models Built on OAI Database and Tested on MOST Database (Appendix H: Tables 2a-2b):
We tested the models built on the OAI database on the MOST database to check that the
statistical modeling had been effective, as reflected on an independent data set. The first model
built was for WOMAC knee pain at 1 year and used the matched OAI database that included 252
knees that underwent TKR and 252 knees that did not, using all knees for which there were
WOMAC knee pain data available for the 1-year endpoint. The final model, built on the imputed
data sets, included main effects for younger age (defined as less than 60 years old), a
measure of body pain, hip pain (yes versus no), baseline WOMAC knee pain, and treatment (TKR or
not). The body pain measure was based on a homunculus, a diagram of a body on which patients
indicated the locations of their pain, and was calculated as the percentage of sites on the diagram
that had symptoms. The model also included interactions of TKR with both baseline knee
pain and hip pain. The model r-square was 0.36 for WOMAC knee pain. We applied the
coefficients from the OAI model to the imputed MOST data set and compared the resulting fitted
values for 1-year knee pain with the observed 1-year knee pain values. There was a positive
association between observed and fitted values (r-square = 0.32). We conducted a similar analysis
for the 1-year physical functioning outcome. The model for the 1-year functional outcome built on
the OAI data included main effects for gender, age, baseline SF-12 mental and physical
components, homunculus, hip pain, depression score, and baseline knee pain in the contralateral
knee. There was also a main effect for treatment and no significant interactions of treatment with
any other variables. The model indicated that, on average, the 1-year physical function score (SF-12
physical component score) was 3.4 points higher for patients who had TKR than those who had
not. This OAI model had an r-square of 0.42. When we applied this model to the MOST data set,
the fitted values for the physical function outcome were positively associated with the observed
results (some of which were imputed), although the r-square on the MOST data dropped to 0.18.
Although this decline in performance was disappointing, the research team decided to
combine the 2 databases and refit the model on the pooled data, in the hope that
the larger sample size would allow a better model to be constructed.
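The external check described above compares fitted values, computed from the OAI coefficients, with observed 1-year values in MOST. One common way to summarize such a comparison is the squared Pearson correlation between observed and fitted values. The sketch below assumes that definition; the report does not state which r-square formula was used, and the numbers are made up for illustration.

```python
def r_square(observed, fitted):
    """Squared Pearson correlation between observed and fitted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_fit = sum(fitted) / n
    cov = sum((o - mean_obs) * (f - mean_fit) for o, f in zip(observed, fitted))
    ss_obs = sum((o - mean_obs) ** 2 for o in observed)
    ss_fit = sum((f - mean_fit) ** 2 for f in fitted)
    return cov ** 2 / (ss_obs * ss_fit)

# Hypothetical observed 1-year scores and the values fitted by an external model.
observed = [10, 20, 30, 40]
fitted = [12, 18, 33, 37]
r2 = r_square(observed, fitted)
print(round(r2, 4))
```

A lower r-square in the validation data than in the development data, as seen for the function model, indicates that the model explains less of the outcome variation in the new sample.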
Models for 1-year Knee Pain Built on Pooled Databases (Appendix H: Tables 2a-2b): We
built multivariable models on versions of the databases that included imputed values for 1-year
pain outcome. We constructed the 1-year knee pain models on the combined OAI and MOST data
sets (P1 model) and on the combined OAI, MOST, NEBH, and TMC (P2 model) databases. The 2
models included terms for a treatment indicator variable and for baseline knee pain and an
interaction of these 2 and had similar r-square values (0.32), suggesting equivalent performance.
In both models, the expected knee pain at 1 year was less for patients who had TKR than for those
who did not have TKR, with the difference being greater in those who had higher knee pain levels
at the start.
The P1 model also indicated worse knee pain at 1 year with younger age, more knee pain
at baseline in the contralateral knee, more total body pain (on the homunculus), and higher BMI.
There was also an interaction with baseline hip pain for which the benefit of TKR versus non-TKR
in knee pain reduction was greater in patients who had baseline hip pain versus those who did
not.
Some of the variables available in the OAI and MOST data sets were not available in the
other databases (eg, pain indicated on a homunculus, pain in contralateral knee), and some
variables, such as hip pain, had not been collected for the surgery databases (NEBH, TMC) in the
same way as for the OAI and MOST databases. Accordingly, we did not use these variables in
modeling in the larger database. The final P2 model included age as a continuous variable, with
more expected knee pain at younger ages, as was seen in the P1 model. The model also included
baseline SF-12 scores: higher baseline physical and mental component scores were associated
with less expected knee pain at 1 year.
Models for 1-year Physical Function Built on Pooled Databases (Appendix H: Tables 2a-2b):
The model-building process for the 1-year physical functioning models (F1, F2) was similar to that for the
1-year pain models. Again, we built the F1 model on data from OAI and MOST that included many
possible predictor variables. We built the F2 model on a larger database that included the same OAI
and MOST data as well as data from the NEBH and TMC cohorts. This larger data set, however,
included fewer predictor variables common to all 4 data sets. The final physical function models are
presented in Appendix H: Table 2b. Both models had similar r-square values (0.34, 0.35). Both
indicated better 1-year physical function for males, younger patients, higher initial physical and
mental component scores, and lower BMI. The F1 model also included a main effect for baseline
knee pain in the contralateral leg, with more baseline pain being associated with a worse 1-year
physical function outcome. The F2 model also included interaction terms of TKR treatment with
both age and the SF-12 mental score. Results from the model indicate that the estimated benefit in
function at 1 year for patients treated with TKR versus standard of care is greater for younger
patients and for patients with lower baseline mental health scores. The F1 and F2 models are
presented in Appendix H: Tables 2a-2b.
Summary of Multivariable Models (Table 2, Figure 2, and Appendix H: Table 2c): Appendix
H: Table 2c shows a summary of variables included in all 4 final models (P1, P2, F1, F2) and the
distribution of each variable in the pooled databases. In the earlier phases of this project, we hoped
our P1 and F1 models would have better performance because we had a larger pool of variables
(although fewer patients) to use for the modeling process. As the project evolved, the research
team realized that many of the variables under consideration were burdensome to collect and/or
difficult to capture consistently. In the end we decided to use only models P2 and F2—which we
built on the data sets that had more patients (OAI, MOST, NEBH, TMC) but fewer independent
variables—for the development of the software. The coefficients for these models are presented in
Table 2. Although neither model was validated in an independent database, we believe the models
have sufficient performance, based on variables consistent with clinical understanding and
importance such that they are reasonable for use in this demonstration project. Based on the
results of testing our OAI model on the MOST data, we are optimistic the models can be useful in
patients similar to those used to develop the models. These patients, who are presumably at the
point of deciding whether to have TKR, have characteristics similar to those shown in Table 1.
Table 2. Final Models for 1-year Knee Pain (P2) and SF-12 Physical Function (F2)

P2. Knee Pain Model (Higher Scores Mean More Knee Pain): Adjusted r-square = 0.32
F2. Physical Function (SF-12) (Higher Scores Mean Better Function): Adjusted r-square = 0.34

| Term in Model, Status at Baseline | Range in Data Set (5th-95th Percentile) | P2: Beta Coeff (stderr) | P2: P Value^a | F2: Beta Coeff (stderr) | F2: P Value |
|---|---|---|---|---|---|
| Model intercept (constant) | | 31.44 (5.52) | P < .0001 | 17.40 (4.27) | P < .0001 |
| Treatment (1 = TKR, 0 = control) | | –3.33 (2.16) | P = .1246 | 25.41 (4.33) | P < .0001 |
| WOMAC knee pain (base), 100-point scale | 10-80 | 0.49 (0.03) | P < .0001 | | |
| Interaction: treatment × WOMAC knee pain | | –0.33 (0.05) | P < .0001 | | |
| Age (in years) | 51-79 | –0.12 (0.05) | P = .0225 | –0.05 (0.04) | P = .2397 |
| SF-12 mental component (base) | 34-66 | –0.11 (0.05) | P = .033 | 0.19 (0.04) | P < .0001 |
| SF-12 physical component (base) | 23-53 | –0.21 (0.07) | P = .0017 | 0.55 (0.03) | P < .0001 |
| Gender (1 = male, 0 = female) | 42% male | | | 0.99 (0.57) | P = .0873 |
| Body mass index, kg/m2 | 23-41 | | | –0.19 (0.05) | P = .0008 |
| Charlson comorbidity score ≥1 (versus 0) | 31% with at least 1 | | | –2.05 (0.60) | P = .0009 |
| Interaction: treatment × age | | | | –0.15 (0.06) | P = .0084 |
| Interaction: treatment × SF-12 mental score | | | | –0.18 (0.06) | P = .0013 |

^a Beta coefficients, standard errors [stderr], and P values are from combined linear regression models built on an imputed data set.
We used these models to estimate 1-year knee pain and physical functioning for the
treatment each subject actually underwent (TKR or non-TKR) and also for their counterfactual
situation, as if they received the alternative treatment. In other words, we calculated 2 predicted
values for each subject in our database (that we used to make our models). One prediction
assumed subjects received TKR and the other prediction assumed they did not. These data
allowed us to predict the difference in pain and function outcomes for each patient under 2
courses of treatment (TKR versus non-TKR). The distribution of predicted differences in pain and
function with and without TKR is shown in Figure 2. The figure shows that there was a range of
predicted improvement with TKR, and those patients predicted to have benefit in knee pain may
not have been the same as those for whom benefit in physical functioning is predicted. In this
project’s database, 9% of subjects had a predicted gain in function of TKR versus non-TKR of at
least 8 SF-12 physical function points and a predicted reduction in knee pain of at least 20 points
(on WOMAC scale of 0-100). At the other end of the spectrum, 6% had predicted gains in physical
function of fewer than 4 points and reduction of knee pain of fewer than 10 points. Only 2% had
larger gains in physical function and smaller improvements in pain. Figure 2 also shows sample
subjects from each of the 9 combinations of estimated knee pain and physical function change.
Examples of subjects with the most, mid-, and least-estimated reduction of pain as well as their
gain in function with 95% prediction intervals for the estimates are shown in Table 3. Subjects
with higher baseline knee pain had the largest predicted reductions in knee pain with TKR versus
non-TKR. Younger patients with lower SF-12 scores had the largest predicted benefits in physical
function with TKR versus not having TKR. These differences in estimated benefits between
subjects are because of the interaction terms included in the multivariable models.
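As a worked illustration of the counterfactual predictions, the sketch below applies the published Table 2 coefficients to the first sample case in Table 3 (female, age 58, BMI 33.3, no comorbidities, SF-12 mental 38, WOMAC knee pain 65, SF-12 physical 30). The dictionary keys and function names are hypothetical, and mapping "no comorbidities" to a Charlson indicator of 0 is an assumption. Because the published coefficients are rounded, results can differ slightly from the values in Table 3.

```python
# Coefficients transcribed from Table 2 (rounded as published).
P2 = {"intercept": 31.44, "tkr": -3.33, "womac": 0.49, "tkr_x_womac": -0.33,
      "age": -0.12, "sf12_mental": -0.11, "sf12_physical": -0.21}
F2 = {"intercept": 17.40, "tkr": 25.41, "age": -0.05, "sf12_mental": 0.19,
      "sf12_physical": 0.55, "male": 0.99, "bmi": -0.19, "charlson_ge1": -2.05,
      "tkr_x_age": -0.15, "tkr_x_sf12_mental": -0.18}

def predict_pain(p, tkr):
    """Predicted 1-year WOMAC knee pain (0-100); tkr is 1 for TKR, 0 for non-TKR."""
    return (P2["intercept"] + P2["tkr"] * tkr
            + P2["womac"] * p["womac"] + P2["tkr_x_womac"] * tkr * p["womac"]
            + P2["age"] * p["age"] + P2["sf12_mental"] * p["sf12_mental"]
            + P2["sf12_physical"] * p["sf12_physical"])

def predict_function(p, tkr):
    """Predicted 1-year SF-12 physical component score."""
    return (F2["intercept"] + F2["tkr"] * tkr
            + F2["age"] * p["age"] + F2["sf12_mental"] * p["sf12_mental"]
            + F2["sf12_physical"] * p["sf12_physical"] + F2["male"] * p["male"]
            + F2["bmi"] * p["bmi"] + F2["charlson_ge1"] * p["charlson_ge1"]
            + F2["tkr_x_age"] * tkr * p["age"]
            + F2["tkr_x_sf12_mental"] * tkr * p["sf12_mental"])

def band(value, mid_cut, top_cut):
    """Place a predicted benefit into the least/mid/most groups used in Table 3."""
    if value >= top_cut:
        return "most"
    return "mid" if value >= mid_cut else "least"

patient = {"age": 58, "male": 0, "bmi": 33.3, "charlson_ge1": 0,
           "sf12_mental": 38, "womac": 65, "sf12_physical": 30}

pain_no_tkr, pain_tkr = predict_pain(patient, 0), predict_pain(patient, 1)
fn_no_tkr, fn_tkr = predict_function(patient, 0), predict_function(patient, 1)
pain_reduction = pain_no_tkr - pain_tkr
function_gain = fn_tkr - fn_no_tkr
print(round(pain_no_tkr), round(pain_tkr), round(fn_no_tkr), round(fn_tkr))
print(band(pain_reduction, 10, 20), band(function_gain, 4, 8))
```

In these formulas the pain benefit of TKR depends only on baseline knee pain, and the function benefit depends only on age and the SF-12 mental score, because those are the only interaction terms with treatment in P2 and F2.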
We encountered statistical questions regarding the use of the proposed linear model for the 1-year
outcomes, specifically knee pain. The knee pain scores were not normally distributed, and
even after adjustment for covariates the model residuals (the differences between
predicted and observed values) remained skewed. We explored alternative nonlinear
models, which yielded little gain in model performance, and ultimately used the linear form of the model.
See Appendix I.
Figure 2. Mosaic Plot Showing Distribution of Predicted Differences (TKR Versus Non-TKR) for 1-year Knee Pain and SF-12 Physical Function in Pooled Data (n = 1452 Subjects)
Table 3. Estimated Outcomes for a Sample of Cases

Baseline characteristics are followed by the predicted 1-year knee pain and SF-12 function without and with TKR, and the predicted change with TKR compared with non-TKR. Each estimate is shown with its 95% prediction interval.

| Estimated Reduction in Knee Pain | Estimated Improvement in Function | Gender | Age | BMI | Any Comorbidities | SF-12 Mental | WOMAC Knee Pain | SF-12 Physical | Knee Pain (1 Year): Non-TKR | Knee Pain (1 Year): TKR | Knee Pain (1 Year): TKR Minus Non-TKR | SF-12 Function (1 Year): Non-TKR | SF-12 Function (1 Year): TKR | SF-12 Function (1 Year): TKR Minus Non-TKR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Most pain reduc (≥20 pts) | Most gain fcn (≥8 pt improve) | F | 58 | 33.3 | N | 38 | 65 | 30 | 46 (12 to 80) | 21 (80 to –13) | –24.5 (–72.4 to 23.5) | 32 (15 to 49) | 42 (49 to 25) | 9.6 (–14.6 to 33.8) |
| Most pain reduc (≥20 pts) | Mid gain fcn (4-<8 pt improve) | F | 63 | 35.0 | N | 60 | 70 | 41 | 43 (9 to 77) | 17 (77 to –17) | –26.1 (–74.1 to 21.9) | 42 (25 to 59) | 47 (59 to 30) | 4.7 (–19.5 to 28.8) |
| Most pain reduc (≥20 pts) | Least gain fcn (<4 pt improve) | M | 77 | 24.3 | Y | 60 | 55 | 36 | 35 (1 to 69) | 14 (69 to –20) | –21.2 (–69.2 to 26.7) | 39 (22 to 56) | 42 (56 to 25) | 2.7 (–21.5 to 26.9) |
| Mid pain reduc (≥10 to <20 pts) | Most gain fcn (≥8 pt improve) | M | 63 | 28.3 | N | 34 | 45 | 36 | 34 (0 to 68) | 16 (68 to –17) | –18.0 (–65.9 to 30.0) | 36 (19 to 53) | 46 (53 to 29) | 9.6 (–14.6 to 33.8) |
| Mid pain reduc (≥10 to <20 pts) | Mid gain fcn (4-<8 pt improve) | M | 66 | 31.3 | Y | 62 | 25 | 50 | 18 (–16 to 52) | 7 (52 to –27) | –11.5 (–59.4 to 36.5) | 46 (29 to 63) | 50 (63 to 33) | 4.0 (–20.2 to 28.2) |
| Mid pain reduc (≥10 to <20 pts) | Least gain fcn (<4 pt improve) | M | 71 | 35.8 | Y | 59 | 30 | 44 | 22 (–12 to 55) | 8 (55 to –25) | –13.1 (–61.0 to 34.8) | 42 (25 to 59) | 46 (59 to 29) | 3.7 (–20.4 to 27.9) |
| Least pain reduc (<10 pts) | Most gain fcn (≥8 pt improve) | M | 51 | 23.4 | N | 39 | 20 | 49 | 20 (–14 to 54) | 11 (54 to –23) | –9.8 (–57.9 to 38.2) | 46 (29 to 63) | 56 (63 to 39) | 10.5 (–13.7 to 34.8) |
| Least pain reduc (<10 pts) | Mid gain fcn (4-<8 pt improve) | F | 62 | 26.3 | N | 61 | 20 | 44 | 18 (–16 to 52) | 8 (52 to –26) | –9.8 (–57.8 to 38.1) | 45 (28 to 62) | 50 (62 to 33) | 4.7 (–19.5 to 28.8) |
| Least pain reduc (<10 pts) | Least gain fcn (<4 pt improve) | F | 74 | 27.4 | Y | 56 | 5 | 51 | 8 (–26 to 42) | 3 (42 to –31) | –5.0 (–53.0 to 43.1) | 46 (28 to 63) | 49 (63 to 32) | 3.7 (–20.4 to 27.9) |
Prototype Decision Support Software Development, Interface Design, and Usability Testing
The KOMET development process resulted in the creation of 1 web-based application for
clinicians (http://medicalequipoise.com/tkrclinician) and 1 for patients
(http://medicalequipoise.com/tkrpatient). Both applications are built on an analytics
software library that could also be embedded into an EHR system.
The applications underwent user testing to assess the ease of data collection through the
web-based questionnaire and users’ ability to understand the outcome predictions when
presented in data tables, graphs, and as dynamic text. We also tested depictions of prediction
uncertainty and definitions of mathematical equipoise.
All users were able to easily enter demographic data and complete the questionnaire with
only minor questions or comments. We initially presented users with a table and bar graphs
describing current and predicted pain and function outcomes (Appendix J: Figure 1). After initial
testing, we refined the report to provide a dynamic text description (Appendix J: Figure 2). This
change improved users’ ability to identify their current pain and function scores and the predicted
1-year outcome scores with surgical and nonsurgical treatments.
The combined pain and function plot proved to be less intuitive. Many users immediately
understood that the single data point represented both the pain and function outcome
predictions, but others struggled to describe the data represented by the graph (Appendix J:
Figure 3).
User testing led to improvements in the way predicted outcome uncertainty was
communicated. The degree of uncertainty around the predicted pain and function outcomes,
initially represented by whiskers on the bar chart (Appendix J: Figure 1), was not understood by
users. We changed the chart by using shading within the bar that faded at the edges and added a
dynamic text explanation describing the range of possible values (Appendix J: Figure 2). This
improved user understanding. Analogously, for the combined pain and function plot, we changed
the uncertainty around the prediction from a dotted circle around a data point (Appendix J:
Figure 3) to a shaded circle (Appendix J: Figure 4). A limitation of our depiction of the results, not
of the interface per se, is that our methodology made separate statistical models for 1-year knee
pain and physical function; in reality, these 2 outcomes are likely related. Therefore, our
uncertainty regions may still not be accurately capturing, and are likely overestimating, the
joint-prediction areas. The true uncertainty region would be a subset of the circle if pain and
functioning were dependent.
Based on these uncertainty estimates regarding the predictions and based on the
mathematical equipoise approach, we used KOMET to identify patients for whom enrollment in a
randomized clinical trial might be appropriate. For the purpose of demonstration, we defined
mathematical equipoise as a condition when pain and functioning outcome predictions with
nonsurgical care and TKR are relatively close and fall within each other’s zones of
uncertainty (ie, their circles of uncertainty overlap). These circles are created when the pain and
function outcome predictions are presented as point estimates on a 2-dimensional graph with
pain on the vertical axis (y) and function on the horizontal axis (x). The uncertainty circle is defined
by the shaded area extending around each of the point estimates and represents the uncertainty
associated with the predictions. In Appendix J: Figure 1, the blue diamond represents the
outcome prediction point estimate for nonsurgical care and the green circle represents the point
estimate for TKR. The large shaded blue and green overlapping circles are around the 95%
confidence intervals of the pain and function point estimates and represent the uncertainty
associated with the predictions. When we computed the mathematical distance between the
nonsurgical and TKR predictions, the resulting distance between the 2 coordinates on the pain and
function graph was 43. Empirically, based on the input of rheumatologists, orthopedists, and
primary care clinician stakeholders, and after reviewing the 95% CI for a sample of cases, we
selected the distance of less than or equal to 20 to flag the possible presence of equipoise during
the usability testing.
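The distance-based flag described above can be illustrated with a short sketch. It assumes that the "mathematical distance" is the Euclidean distance between the 2 (function, pain) points, which the text does not state explicitly. The class name, function names, and example scores are hypothetical and do not represent the actual KOMET implementation; only the threshold of 20 comes from the text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """Predicted 1-year outcome for one treatment (hypothetical structure)."""
    function: float  # horizontal axis (x): predicted physical function score
    pain: float      # vertical axis (y): predicted pain score


# Threshold reported in the text for flagging possible equipoise.
EQUIPOISE_MAX_DISTANCE = 20.0


def prediction_distance(a: Prediction, b: Prediction) -> float:
    """Distance between two predictions; Euclidean distance is an assumption."""
    return math.hypot(a.function - b.function, a.pain - b.pain)


def flag_equipoise(nonsurgical: Prediction, tkr: Prediction,
                   max_distance: float = EQUIPOISE_MAX_DISTANCE) -> bool:
    """Return True when the two predictions are within the distance threshold."""
    return prediction_distance(nonsurgical, tkr) <= max_distance


# Made-up scores for illustration only; they are not study data.
far_nonsurgical = Prediction(function=40.0, pain=60.0)
far_tkr = Prediction(function=70.0, pain=20.0)
near_nonsurgical = Prediction(function=45.0, pain=30.0)
near_tkr = Prediction(function=51.0, pain=22.0)

far_flag = flag_equipoise(far_nonsurgical, far_tkr)     # predictions far apart
near_flag = flag_equipoise(near_nonsurgical, near_tkr)  # predictions close together
```

In this sketch, the first pair of predictions would not trigger the equipoise alert and the second pair would. A production version would also need to account for the scales of the 2 outcome scores, which this sketch treats as directly comparable.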
When mathematical equipoise was present, an alert appeared on the user interface’s
results page and a patient contact and screening form was made available to the clinician. The
form could be used to begin the clinical trial recruitment process.
We asked participants about usefulness of the information for decision making. Each
stated that the tool was helpful or somewhat helpful. All wanted to discuss the results with their
physicians.
We used the combined pain and function predicted outcomes plot to discuss the idea of
equipoise in the context of random assignment of treatments in an RCT. We showed users 3
sample graphs depicting the predicted outcomes of the 2 potential treatments, with small,
moderate, and large amounts of overlap between the 2 circles that depicted uncertainty around
the predicted point value (see Appendix L). We assumed that if patients perceived greater overlap
in the predicted outcomes between the 2 treatments, then they would be more likely to consider
being randomized to 1 of the treatments. Only 7 of 12 users shown these scenarios responded
that they understood the concept of equipoise. Because of their personal preferences, some users
rejected the option of surgery despite predictions suggesting dramatic reductions in pain and
improvement in physical functioning. Other patients indicated they would consider randomization
only if the burden of surgery promised a far better outcome than nonsurgical treatment. Overall,
users did not respond to the depiction of the circles of uncertainty scenarios as we expected.
We conducted patient and clinician user testing to understand KOMET use in the clinical
setting during regularly scheduled clinic visits and research-specific visits. There were significant
challenges in allocating adequate time for the patient to complete the decision support tool and
for the clinician and patient to discuss the results and implications for decision making. We
determined that future dissemination should include patients completing the tool prior to their
visit to allow the patient and clinician more time during the visit to discuss the results, the
patient’s priorities and choices, and treatment options.
Through the efforts of our research team, stakeholders, and design consultants, we were
able to develop a software program that users found helpful in shared clinical decision making.
Although the final prototype seemed attractive and easy to use, there will need to be further
refinements for routine clinical care use and enrollment into RCTs.
DISCUSSION
Study Results in Context
In deciding between treatment options and deciding whether to participate in a clinical
study, the patient is the ultimate decision maker. Ideally, these determinations will be made with
ample consultation and support by relevant clinicians. In this context, methods to share
information and support a shared conversation about these decisions can be very helpful.
Decision aids explicitly intended for shared patient–clinician decision making have been shown to
improve patient knowledge, patients’ satisfaction with decision making, and agreement between
choices for treatment and their health outcome preferences, among other positive effects.43 The
same kind of shared decision making is justified in the decision as to whether to participate in an
RCT. In developing KOMET, we sought to develop decision support that could support both a
clinical decision and a decision to participate in an RCT—in this case, the decision between
surgical (TKR) and nonsurgical treatment of knee osteoarthritis.
There are 2 general contexts for the results of this project: (1) the development of
mathematical equipoise as a basis for decision support and (2) the state of evidence for treatment
decisions for knee osteoarthritis. We discussed the latter of these 2 in the introduction; although
knee replacement surgery for osteoarthritis is very common in this country, until a relatively small
RCT was conducted as this project was being completed there were no RCT data to directly inform
this treatment choice.9 Although KOMET does not provide new data, it presents those data
available at the time of its development in a potentially helpful way. More to the general point of
this project, KOMET is intended to help generate the needed RCT data for knee osteoarthritis to
add to extant evidence. Thus, the main context for the results of this project is for the
development of the mathematical equipoise method.
Mathematical equipoise is based on the use of mathematical models that serve as clinical
predictive instruments to predict patient-specific outcomes of treatment options, which then can
be compared. By doing so, in a sense we are discerning patient-specific equipoise. When the
predictions are not discernibly or importantly different, which can suggest equipoise between
options, enrollment in an RCT that compares the treatments can be considered. When the
predictions suggest one treatment is likely to have better outcomes, trial enrollment would
not be appropriate. When this is the case, however, this identification of a potentially superior
treatment can inform patient–clinician decision making, thereby constituting an approach for
enrolling RCT participants that also supports clinical decision making for those not to be enrolled
in an RCT.
Our original examples of this approach used predictive model outcomes of acute
myocardial infarction that were built using RCT data, which are ideal sources of data for making
predictive models. However, for many treatments there are no prior RCTs—and indeed these are
the very conditions for which RCTs and, in particular, clinical effectiveness trials are needed.
For the widespread use of mathematical equipoise to help fill in gaps in RCTs, predictive
models for these conditions will need to be built on data from clinical registries, EHRs, and other
non-RCT data. Therefore, the purpose of this project was to further develop the method by
applying it to an important clinical treatment question for which there were essentially no prior
RCTs. With more than 680 886 total knee replacements done each year in the United States for
knee osteoarthritis,44 we considered this an important question for patients and society, and a
good opportunity to test whether this approach could work in this challenging but common
situation.
To do this, we created a consolidated database from non-RCT sources on knee
osteoarthritis outcomes on which we created predictive models of the outcomes of surgical knee
replacement and nonsurgical treatments. The choices for variables for these models were
informed by the needs of stakeholders who would use decision support for knee osteoarthritis,
with focus on their views on the representation of pain and functional outcomes. We then
developed multivariable mathematical models that predict patient-specific outcomes of surgical
and nonsurgical treatment, using statistical and analytic methods to adjust, to the extent possible,
for the inherent biases in the databases. We also performed a variety of analyses to understand
how to best model and represent the predicted outcomes. We incorporated these models into a
stakeholder-informed prototype decision support software for potential incorporation into EHRs.
KOMET exemplifies a tool that can provide shared decision support for both RCT enrollment
and clinical care while remaining responsive to the perspectives and needs of patients and clinicians.
We believe the impact of such a method on the field of CER could be substantial. The
impact of CER is based on evidence generation, which leads to evidence synthesis, interpretation,
application, dissemination, implementation in widespread practice, and then feedback for the
generation of new evidence. This entire chain rests on having unbiased, generalizable, ideally RCT,
evidence. Were there such a method for patient-centered enrollment into RCTs that could be
incorporated into EHRs, far more targeted comparative effectiveness trials could be conducted,
more diverse clinical sites could be included, and more representative patients could be enrolled.
This would lead to results that are applicable to more special groups and to more care settings.
This would also facilitate clinicians’ and the public’s understanding of, and enrolling into, clinical
trials, which could help improve the public’s engagement in the biomedical research enterprise.
Additionally, clinical trial duration, a scientifically and financially important component of drug
development pipeline time, could be much shorter. If, instead of only 10% of eligible patients being
enrolled, the enrollment pace were increased by up to 10-fold, trials would finish much faster. All these
advantages should result in better clinical trials and greater impact on the public’s health.
If successful in demonstrating that this method has applicability to the many important
conditions for which RCTs have not yet been conducted, providing on-site real-time decision
support for trial enrollment, it could transform how comparative effectiveness research could be
conducted across the spectrum of health care. This would address the failure of current clinical
trial approaches to enroll sufficient numbers of patients, facilitate the need to identify and
ethically handle treatment of all potential subjects, and engage a conversation between clinician
and patients based on data specific to that patient. Thereby, it could have use in broad areas of
clinical care and could help enable the great promise of CER in improving clinical care.
Uptake of Study Results
In this project, based on input from multiple stakeholders and potential users, we
implemented prototype KOMET software and tested it in clinical settings. Although it is not ready
for widespread implementation in its user interface, content, and connectability to EHRs, it did
function as intended and thus is an important step in the ultimate goal of clinical use. We believe
its promise is sufficient to warrant further development with the explicit intent of being a tool
that can be implemented in clinical settings and linked to EHRs, to serve both treatment and RCT
enrollment purposes. We will seek further opportunities to move toward that
goal.
Study Limitations
Although largely accomplishing its intended objectives, as an early stage in the
development of mathematical equipoise for shared clinical decision making, this project has a
number of limitations related to the available data, the modeling methods, the model variables,
and the prototype software.
An important limitation of our approach is that the models were created on potentially
biased data. Although we sought data from studies that had both surgical and medical treatment
of knee osteoarthritis, 2 of our studies fit that requirement while 2 other registries were of only
one treatment (surgery). Both types of sources provide challenges for creating comparable
patients who underwent the 2 treatments, which is needed to make accurate models of the 2
treatments. In contrast, our prior clinical predictive models, including the first examples of
mathematical equipoise, used data from RCTs. This allowed for representation of the alternative
treatment effects based on comparable samples undergoing the treatment choices, providing
confidence that the effects and outcomes represented by the multivariable models would reflect
the actual treatment effects and not differences in the underlying characteristics of patients
receiving the alternative treatments. However, for mathematical equipoise to serve its intended
purpose of facilitating RCTs of treatments for which none have yet been conducted, its models will
need to be made on non-RCT data. Therefore, for this project we intentionally chose a condition
for which our only data were from registries that posed challenges for making models that could
be based on comparable patients receiving the 2 treatments. We undertook many checks to
maximize the comparability and to accurately represent effects despite the likely biased samples.
For example, we chose to use matching for our study design but acknowledge that while 1:1
matching improves control of confounding and enabled us to create a hypothetical sample of non-
TKR patients who could be considered as TKR eligible, this approach does not use data from all
available subjects and may therefore have the cost of less precision. While we believe KOMET
models have very good performance, despite the challenge of the available data, additional
sources of data for this approach should be developed.
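As a rough illustration of the precision trade-off noted above, the following sketch performs greedy 1:1 nearest-neighbor matching without replacement on a single score. It is a simplified assumption for illustration only and is not the matching procedure used for KOMET; the function name, the caliper, and the example scores are hypothetical.

```python
from typing import Dict, List, Tuple


def greedy_match(treated: Dict[str, float],
                 controls: Dict[str, float],
                 caliper: float) -> List[Tuple[str, str]]:
    """Greedy 1:1 nearest-neighbor matching without replacement.

    Each treated subject, in input order, is paired with the closest
    still-available control, provided the score difference is within
    the caliper. Controls that are never chosen are left unmatched.
    """
    available = dict(controls)  # copy so the caller's dict is not modified
    pairs: List[Tuple[str, str]] = []
    for t_id, t_score in treated.items():
        if not available:
            break
        # Closest remaining control for this treated subject.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control can be used only once
    return pairs


# Made-up matching scores for illustration only; they are not study data.
tkr_patients = {"T1": 30, "T2": 55}
non_tkr_patients = {"C1": 28, "C2": 50, "C3": 90, "C4": 31}

pairs = greedy_match(tkr_patients, non_tkr_patients, caliper=10)
matched_controls = {c_id for _, c_id in pairs}
unused = sorted(set(non_tkr_patients) - matched_controls)
```

Here 2 of the 4 hypothetical non-TKR patients end up in `unused` and would contribute nothing to model fitting, which is the loss of precision that 1:1 matching can entail.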
The modeling methods also have limitations. Although multivariable regression methods such as those we
used have advantages over some more computer-intensive methods like those used for machine
learning, including clearly interpretable variable coefficients and greater
resistance to overfitting than some computer-driven methods,45 larger databases on which more
corrections might be made (eg, via the use of propensity scores) and newer computer methods
might improve the models that can be created. Indeed, we believe this approach will
benefit greatly from such advances in modeling.
Another limitation is that neither model was validated in an independent database; we
simply did not have sufficient data to support model development and to still have enough for a
test database. However, we believe the performance of the models and their variables are
consistent with clinical understanding and importance and are reasonable for use in this
demonstration project. Nonetheless, testing on an independent data set will be an important
future priority.
Beyond the methods, the modeling variables we used have limitations. While in general,
based on the collection of important variables in the available databases and based on published
clinical evidence and input by stakeholders, we believe we used very credible variables to
represent independent and dependent (treatment outcome) variables, there is one about which
we have reservations. The functional outcome we used, based on our wish to capture a holistic
physical function of the patient, was based on the SF-12 functional scale. In looking at the results
of KOMET predictions, we noticed that pain is often substantially changed by surgery but function
tends to have a relatively modest improvement. In discussing this with patients, we wondered if
we would have better captured their meaningful knee functional improvement if we had used a
more knee-specific function rather than overall physical functioning. We hoped to address this
limitation by performing further analyses of the treatment outcomes for the subsample of
patients in our databases for whom we have a knee-specific functional scale, the KSS, as a
treatment outcome. The results of these exploratory analyses, presented in Appendix M, suggest
that the WOMAC knee pain tracks well with other measures of knee pain and symptoms and, in
particular, Knee injury and Osteoarthritis Outcome Score (KOOS) knee pain. The SF-12 physical
component score, while positively correlated, does not track as strongly with other knee-related
quality of life and function variables. These results are somewhat to be expected in that, while
there may be overlap in physical function and knee-related function, they are not the same thing.
Our meetings with stakeholders suggest both overall and knee-related function are important,
and we have come to believe future work to develop predictions of the more specific knee-related
function would be useful to both patient and clinical stakeholders.
Certainly, as a prototype the KOMET software has limitations. The creation of full-featured,
user-friendly, robust software is beyond the scope of this project. Our prototype has significant
distances to go in these and other dimensions before it could be used in routine care.
Nonetheless, we believe it is quite attractive and functional and, in the context of its intended role
in this project, a successful product of this project.
Finally, in putting the use of KOMET in the context of clinical decision making, this
approach does not consider how the patient might feel about the outcome states (pain and
function). This would involve translating the WOMAC and SF-12 scores into familiar terms for
patients and making sure the idea of overlap of predictions that suggests equivalence are all
understood. Also, it would include ensuring that these features are readily incorporated into
patients’ understanding and decision making for their own and shared consideration. Beyond
these user issues, as additional information, patients would have to know about the downsides
and potential complications of surgery, delays of surgery, and adverse consequences of other
treatments. Thus, while KOMET provides an important foundation for the shared decision-making
process, to provide complete and optimal support additional work is needed in a number of
dimensions.
Future Research
The limitations listed above suggest areas for future research. Approaches must be
developed that lessen the biases inherent in clinical registry data. Although having more data,
such as might be obtained from EHR data warehouses and other wide sources of clinical data, will
not eliminate biases, finding ways to mitigate the biases using selection and sampling methods
and other approaches will be extremely important for work on mathematical equipoise, as well as
for many other efforts to harvest clinically important insights from clinical data. Beyond
developing such methods, validation of these approaches will be crucial.
In future efforts of this type, we would like to have more complete accounting for ancillary
issues and complications. For example, in the one RCT done to date,9 serious adverse events were
more common in the TKR group than in the nonsurgical treatment group (8 versus 1 involving the
index knee [P = .05] and 24 versus 6 overall [P = .005]), with the 2 most common serious adverse
events involving the index knee being deep venous thrombosis (in 3 patients) and stiffness
requiring brisement force (in 3 patients). Unfortunately, we did not have access to such
information in the databases available to us. In this Danish study, 9 adverse events that occurred
before the 12-month follow-up were identified in hospital records, by self-report at follow-up
visits, and by the physiotherapist and were then categorized. In future work exploring
mathematical equipoise, in an analogous way, we intend to methodically collect such data.
Modeling clinical outcomes based on data is evolving rapidly, and increasingly
sophisticated computer-based methods, such as artificial intelligence and machine learning, are
being applied to analysis of clinical data. Although computer-based algorithms have a tendency to
overfit,45 compromising their generalizability to new populations, methods are advancing and an
investigation of best methods is certainly warranted.
As indicated above, we believe the functional outcome we used for physical function
might benefit from being a more knee-specific outcome variable.46,47 There are examples in other
diseases in which, for specific conditions, disease-specific outcomes are more useful than more
general functional outcomes, such as we used.48,49 We believe additional research that uses a
more specific functional outcome would be worth conducting.
In terms of the prototype software, it is clear more research is needed for this and similar
decision support that provides full-featured, user-friendly, interoperable, robust software. Badly
needed will be attractive and functional software for this and similar purposes.
Finally, in developing and testing such decision support software, we will need to further
investigate how to better understand, make clear, and use the patient-specific determination of
equipoise that could be the basis of a comparative effectiveness RCT. We believe the method has
important advantages for such studies, but before it can be widely deployed and used it must be
fully understood and transparent to all stakeholders. We look forward to advancing this work.
CONCLUSIONS
This project demonstrated the use of predictive instruments and mathematical equipoise
as a way to discern patient-specific equipoise and thereby as a method for providing patient-
specific decision support for shared patient–physician decision making for the selection between
alternative treatments and as the basis for enrollment into comparative effectiveness trials. Based
on its predictive models, KOMET provides individualized predictions of pain and functional
outcomes of medical and surgical treatment of knee osteoarthritis designed to be embedded in
EHRs. It can help identify patients for whom one or the other treatment seems likely to yield
better outcomes based on their specific characteristics as well as patients for whom there is
insufficient evidence to favor one treatment. This still can be part of a shared decision-making
process that incorporates the patient’s preferences and priorities for the outcomes the models
predict (ie, pain and function but not others), and, by identifying potential clinical equipoise, it
also can support enrollment into an RCT.
The next step will be to conduct a larger-scale test and then to implement it for its
intended use—the conduct of a comparative effectiveness trial in usual care settings in which
KOMET would support patient–clinician shared decision making about treatment selection for
knee osteoarthritis.
REFERENCES
1. Lawrence RC, Felson DT, Helmick CG, et al. Estimates of the prevalence of arthritis and other rheumatic conditions in the United States. Part II. Arthritis Rheum. 2008;58(1):26-35.
2. Guccione AA, Felson DT, Anderson JJ, et al. The effects of specific medical conditions on the functional limitations of elders in the Framingham Study. Am J Public Health. 1994;84(3):351-358.
3. Mankin HJ. Clinical features of osteoarthritis. In: Kelly WN HE, Ruddy S, Sledge CB, eds. Textbook of Rheumatology. 4th ed. Philadelphia, PA: W.B. Saunders Co; 1993:1374-1384.
4. The Incidence and Prevalence Database for Procedures. Sunnyvale, CA: Timely Data Resources; 1995.
5. Kosorok MR, Omenn GS, Diehr P, Koepsell TD, Patrick DL. Restricted activity days among older adults. Am J Public Health. 1992;82(9):1263-1267.
6. Kramer JS, Yelin EH, Epstein WV. Social and economic impacts of four musculoskeletal conditions. A study using national community-based data. Arthritis Rheum. 1983;26(7):901-907.
7. Selten EM, Vriezekolk JE, Geenen R, et al. Reasons for treatment choices in knee and hip osteoarthritis: a qualitative study. Arthritis Care Res. 2016;68(9):1260-1267.
8. Weng HH, Kaplan RM, Boscardin WJ, et al. Development of a decision aid to address racial disparities in utilization of knee replacement surgery. Arthritis Rheum. 2007;57(4):568-575.
9. Skou ST, Roos EM, Laursen MB, et al. A randomized, controlled trial of total knee replacement. New Engl J Med. 2015;373(17):1597-1606.
10. Eyles JP, Mills K, Lucas BR, et al. Can we predict those with osteoarthritis who will worsen following a chronic disease management program? Arthritis Care Res. 2016;68(9):1268-1277.
11. Fagerlin A, Sepucha KR, Couper MP, Levin CA, Singer E, Zikmund-Fisher BJ. Patients' knowledge about 9 common health conditions: the DECISIONS survey. Med Decis Making. 2010;30(suppl 5):35S-52S.
12. Kent DM, Ruthazer R, Griffith JL, et al. A percutaneous coronary intervention-thrombolytic predictive instrument to assist choosing between immediate thrombolytic therapy versus delayed primary percutaneous coronary intervention for acute myocardial infarction. Am J Cardiol. 2008;101(6):790-795.
13. Selker HP, Beshansky JR, Griffith JL, et al. Use of the acute cardiac ischemia time-insensitive predictive instrument (ACI-TIPI) to assist with triage of patients with chest pain or other symptoms suggestive of acute cardiac ischemia. A multicenter, controlled clinical trial. Ann Intern Med. 1998;129(11):845-855.
14. Selker HP, Griffith JL, Beshansky JR, et al. Patient-specific predictions of outcomes in myocardial infarction for real-time emergency use: a thrombolytic predictive instrument. Ann Intern Med. 1997;127(7):538-556.
15. Selker HP, Beshansky JR, Griffith JL, Investigators TPIT. Use of the electrocardiograph-based thrombolytic predictive instrument to assist thrombolytic and reperfusion therapy for acute myocardial infarction. A multicenter, randomized, controlled, clinical effectiveness trial. Ann Intern Med. 2002;137(2):87-95.
16. Stacey D, Légaré F, Col NF, et al. Decision aids for people facing health treatment or screening decisions. Cochrane Database Syst Rev. 2014;(1):CD001431. doi: 10.1002/14651858.CD001431.pub4.
17. Stacey D, Taljaard M, Dervin G, et al. Impact of patient decision aids on appropriate and timely access to hip or knee arthroplasty for osteoarthritis: a randomized controlled trial. Osteoarthritis Cartilage. 2016;24(1):99-107.
18. Bozic KJ, Belkora J, Chan V, et al. Shared decision making in patients with osteoarthritis of the hip and knee: results of a randomized controlled trial. J Bone Joint Surg. 2013;95(18):1633-1639.
19. de Achaval S FL, Volk RJ, Cox V, Suarez-Almazor ME. Impact of educational and patient decision aids on decisional conflict associated with total knee arthroplasty. Arthritis Care Res. 2012;64(2):229-237.
20. Stacey D, Hawker GA, Dervin G, et al. Decision aid for patients considering total knee arthroplasty with preference report for surgeons: a pilot randomized controlled trial. BMC Musculoskelet Disord. 2014;15(54):1-10.
21. Hip and knee osteoarthritis toolkit. Dartmouth-Hitchcock Center for Shared Decision Making website. http://med.dartmouth-hitchcock.org/csdm_toolkits/hip_and_knee_osteoarthritis_toolkit.html. Published 2017. Accessed January 7, 2018.
22. Selker HP, Ruthazer R, Terrin N, Griffith JL, Concannon T, Kent DM. Random treatment assignment using mathematical equipoise for comparative effectiveness trials. J Clin Transl Sci. 2011;4(1):10-16.
23. Forsythe LP, Ellis LE, Edmundson L, et al. Patient and stakeholder engagement in the PCORI pilot projects: description and lessons learned. Int J Gen Med. 2016;31(1):13-21.
24. Deverka PA, Lavallee DC, Desai PJ, et al. Stakeholder participation in comparative effectiveness research: defining a framework for effective engagement. J Comp Eff Res. 2012;1(2):181-194.
25. PCORI engagement rubric. PCORI website. http://www.pcori.org/sites/default/files/Engagement-Rubric.pdf. Published 2014. Accessed October 27, 2016.
26. Concannon TW, Meissner P, Grunbaum JA, et al. A new taxonomy for stakeholder engagement in patient-centered outcomes research. Int J Gen Med. 2012;27(8):985-991.
27. Multicenter Osteoarthritis Study (MOST) database. San Francisco, CA: University of California; 2009. http://most.ucsf.edu. Accessed May 15, 2014.
28. Osteoarthritis Initiative (OAI) database. Bethesda, MD: National Institutes of Health; 2013. https://nda.nih.gov/oai/. Specific data sets: V 0.2.2, 1.2.1, 2.2.2, 3.2.1, 4.2.1, 5.2.1, 6.2.2, 7.2.1, 8.2.1, 9.2.1, 24, 25, and 9. Accessed June 25, 2014.
29. Hawker GA, Wright JG, Coyte PC, et al. Differences between men and women in the rate of use of hip and knee arthroplasty. New Engl J Med. 2000;342(14):1016-1022.
30. Hawker GA, Wright JG, Coyte PC, et al. Determining the need for hip and knee arthroplasty: the role of clinical severity and patients' preferences. Med Care. 2001;39(3):206-216.
31. Hawker GA, Wright JG, Glazier RH, et al. The effect of education and income on need and willingness to undergo total joint arthroplasty. Arthritis Rheum. 2002;46(12):3331-3339.
32. New England Baptist Hospital (NEBH) orthopedic surgery registry. Boston, MA: New England Baptist Hospital; 2018. https://www.nebh.org/health-professionals/research/orthopedic-registry/. Accessed February 22, 2017.
33. Tufts Medical Center (TMC) orthopedic surgery registry. Boston, MA: Tufts University School of Medicine.
34. WOMAC osteoarthritis index. http://www.womac.org/womac/index.htm. Accessed February 22, 2017.
35. SF-12 health survey. http://www.outcomes-trust.org/instruments.htm#SF-12. Accessed February 22, 2017.
36. Ware J Jr, Kosinski M, Keller SD. A 12-Item Short-form health survey: construction of scales and preliminary tests of reliability and validity. Med Care. 1996;34(3):220-233.
37. Lacson E Jr, Xu J, Lin SF, Dean SG, Lazarus JM, Hakim RM. A comparison of SF-36 and SF-12 composite scores and subsequent hospitalization and mortality risks in long-term dialysis patients. Clin J Am Soc Nephrol. 2010;5(2):252-260.
38. Kosanke JB, Bergstralh E. GMATCH macro for greedy matching. http://bioinformaticstools.mayo.edu/research/gmatch/. Accessed February 22, 2017.
39. Riddle DL, Kong X, Jiranek WA. Factors associated with rapid progression to knee arthroplasty: complete analysis of three-year data from the osteoarthritis initiative. Joint Bone Spine. 2012;79(3):298-303.
40. Gantz MG. Creating RTF tables with univariate analyses of multiply imputed data. Poster presented at: Southeast SAS Users Group (SESUG) Conference; October 8-10, 2006; Atlanta, GA.
41. SAS (for Windows) [computer program]. Version 9.4 TS Level 1M2. Cary, NC: SAS Institute; 2002-2012.
42. Stuart EA. Matching methods for causal inference: a review and a look forward. Stat Sci. 2010;25(1):1-21.
43. Stacey D, Légaré F, Lewis K, et al. Decision aids for people facing health treatment or screening decisions. Cochrane Database Syst Rev. 2017;4:CD001431. doi: 10.1002/14651858.CD001431.pub5.
44. HCUPnet. Healthcare Cost and Utilization Project. US Department of Health & Human Services/Agency for Healthcare Research and Quality. https://hcupnet.ahrq.gov/#setup. Published 2014. Accessed April 25, 2017.
45. Selker HP, Griffith JL, Patil S, Long WJ, D'Agostino RB. A comparison of performance of mathematical predictive methods for medical diagnosis: identifying acute cardiac ischemia among emergency department patients. J Investig Med. 1995;43(5):468-476.
46. Brazier JE, Harper R, Munro J, Walters SJ, Snaith ML. Generic and condition-specific outcome measures for people with osteoarthritis of the knee. Rheumatology. 1999;38(9):870-877.
47. Bombardier C, Melfi CA, Paul J, et al. Comparison of a generic and a disease-specific measure of pain and physical function after knee replacement surgery. Med Care. 1995;33(suppl 4):AS131-AS144.
48. Binkley JM, Stratford PW, Lott SA, Riddle DL. The lower extremity functional scale (LEFS): scale development, measurement properties, and clinical application. North American Orthopaedic Rehabilitation Research Network. Phys Ther. 1999;79(4):371-383.
49. Patrick DL, Deyo RA. Generic and disease-specific measures in assessing health status and quality of life. Med Care. 1989;27(suppl 3):S217-S232.
50. Apache.org Tomcat [computer program]. Version 8. Wakefield, MA: Apache Software Foundation; 1999-2019.
Acknowledgments
The authors wish to thank our patient and clinician stakeholders for their valuable
contributions and guidance: Debra Band-Entrup, Kathie Bernstein, Melvin Bernstein, Jaclyn Chu,
Deane Felter, William Harvey, Helen Herzer, Cristina MacDonald, Vincent MacDonald, Susan Nesci,
John Richmond, Kimberly Schelling, Eric Smith, and Steven Vlad. The authors thank Kaila Dion and
Rajeev Chorghade for support with scale development and data management, and Ben Hannon
for user interface design. We acknowledge Brendan Harrison, Nikolai Klebanov, and Esha
Sondhi for data collection, and Mary Pevear and Gary Schneider for their assistance with
Orthopedic Surgery Registries.
The OAI is a public–private partnership comprising 5 contracts (N01-AR-2-2258, N01-AR-2-
2259, N01-AR-2-2260, N01-AR-2-2261, and N01-AR-2-2262) funded by the National Institutes of
Health, a branch of the Department of Health and Human Services, and conducted by the OAI
study investigators. Private funding partners include Merck Research Laboratories, Novartis
Pharmaceuticals Corporation, GlaxoSmithKline, and Pfizer Inc. Private sector funding for the OAI is
managed by the Foundation for the National Institutes of Health. This manuscript was prepared
using an OAI public use data set and does not necessarily reflect the opinions or views of the OAI
investigators, the NIH, or the private funding partners.
MOST comprises 4 cooperative grants (Felson—AG18820, Torner—AG18832, Lewis—
AG18947, and Nevitt—AG19069) funded by the National Institutes of Health, a branch of the
Department of Health and Human Services, and conducted by MOST study investigators. This
manuscript was prepared using MOST data and does not necessarily reflect the opinions or views
of MOST investigators. Recommended additional documentation describing various aspects of the
design and methods of MOST is available by request sent to MOSTOnline@psg.ucsf.edu and
should be paraphrased and referenced as appropriate.
Data were provided from the Ontario Hip and Knee Osteoarthritis Cohort conducted by
the Canadian Osteoarthritis Research Program, led by Dr Gillian Hawker. Data provided from CORP
are made possible through grants by the Canadian Institutes of Health Research and the Arthritis
Society.
Research reported in this report was funded through a PCORI Award (ME-1306-02327).
The views, statements, and opinions presented in this report are solely the responsibility of the
authors and do not necessarily represent the views of PCORI, its Board of Governors, or its
Methodology Committee.
APPENDIX A: Matching
The goal of this project was to make prediction models using available non-RCT data
sources. To accomplish this we used multiple registries to create a modeling database that
included matched sets of paired knees (one with and one without TKR) that were similar in all
respects except for the surgical procedure. Our process for creating a database of matched TKR to
non-TKR knees has limitations which should be kept in mind when using the resulting models and
predictions.
First, we only matched subjects based on data available within each study. While a TKR:
non-TKR knee dyad may have come from 2 patients with the same gender, similar age and
baseline knee pain, etc., other characteristics that were not part of the matching process may
have differed. Second, we allowed for non-exact matches because we wanted to include knees
that had TKR in our analysis even if we could not find a perfect match. We planned to adjust for
covariates in the modeling process to account for remaining residual imbalances between the TKR
and non-TKR groups. Third, we excluded subjects who did not have 1-year pain outcome data
from the matching process. Non-TKR knees that had a TKR during the follow-up period were
excluded from the matching process if 1-year follow-up in the non-TKR state was not available.
The predicted outcomes for non-TKR are based on the assumption that the knee did not have a
TKR within a year. Fourth, we also excluded knees with TKR that did not have 1-year follow-up
data. There could be several reasons for lack of follow-up data, some of which may not lead to
bias (study ended before follow-up could be done) while other reasons could lead to biased
predictions. For example, if a patient had TKR and major surgical complications led to death, then
the 1-year outcome data would not be available, and exclusion of these bad outcomes would lead
to overly favorable predictions.
Lastly, our ‘baseline’ data may not truly capture status at the time a patient decided
whether or not to have TKR. The NEBH and TMC databases of surgical cases did capture baseline
information in a timely manner. However, the OAI and MOST databases were registries of subjects
with knee osteoarthritis with timed data collection points (that included questions about whether
or not a TKR took place since the last timed measure). Evaluating data at the knee-visit level
allowed us to find subjects who had TKR, and we could then look back in time to find the nearest
assessment. For some patients that may have been within a month or two of the surgery, while
for others, it may have been within a year. Since follow-up for TKR subjects started at the time of
surgery, this also meant that the time between when baseline measures were done and the 1-
year follow-up was longer for TKR subjects than non-TKR subjects. If one believes knee pain and
function worsen over time in subjects that decide to get TKR, then our tool may underestimate
the benefit of TKR as a result of our not having a true baseline assessment. However, our final
regression models were built using data from all four databases, in which over 40% of knees that had
TKR were from the surgical databases, lessening the impact of varying elapsed times between the
baseline assessment and actual TKR surgery.
The matching was done using SAS software41 and the SAS Macro %GMATCH for greedy
matching38 downloaded in February 2014 from:
http://www.mayo.edu/research/documents/gmatchsas/doc-10027248.
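The greedy matching idea can be illustrated with a minimal Python sketch. This is not the %GMATCH macro and does not reproduce its options; the `Knee` fields, the distance weights, and the `max_distance` caliper are hypothetical names chosen for illustration. Each TKR knee is paired, in turn, with the closest unused non-TKR knee of the same sex, so matches need not be exact.

```python
from dataclasses import dataclass


@dataclass
class Knee:
    knee_id: str
    sex: str
    age: float
    pain: float  # baseline knee pain on a 0-100 scale (hypothetical field)


def distance(a, b, age_weight=1.0, pain_weight=1.0):
    """Weighted absolute difference on the (assumed) matching variables."""
    return age_weight * abs(a.age - b.age) + pain_weight * abs(a.pain - b.pain)


def greedy_match(cases, controls, max_distance=float("inf")):
    """Pair each TKR knee with the nearest unused non-TKR knee of the same sex."""
    available = list(controls)
    pairs = []
    for case in cases:
        candidates = [c for c in available if c.sex == case.sex]
        if not candidates:
            continue  # no control left for this case
        best = min(candidates, key=lambda c: distance(case, c))
        if distance(case, best) <= max_distance:
            pairs.append((case.knee_id, best.knee_id))
            available.remove(best)  # each control is used at most once
    return pairs
```

Because the choice made for an early case is never revisited, the result depends on the order of the cases; that is the defining tradeoff of greedy (as opposed to optimal) matching.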
APPENDIX B: Missing Data
We addressed the challenge of missing data by using a multiple imputation methodology
and by restricting the number of variables examined. For the multiple imputation, we used the
SAS procedures PROC MI to do the imputation and PROC MIANALYZE to combine results from
analyses run on the different imputed datasets.41 For each matched dataset (OAI, MOST, NEBH,
TMC) we created ten imputed datasets. We imputed main effects only and later calculated
interactions from the imputed data for the component main effects. We did discuss the
alternative of creating interactions first, then imputing data for the missing interactions. We opted
against that approach since we were concerned we could be generating main effects and
interactions at a subject level that did not correspond to each other. We used available data,
including 1-year outcome data, for each dataset when creating the imputed data. This means that
the variables used for the imputation for each study differed based on availability, but maximized the
information we had for each study.
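As a rough illustration of the two steps described above, the Python sketch below derives an interaction from already-imputed main effects and then pools one coefficient across imputed datasets using the standard combining rules (Rubin's rules). It is a sketch under stated assumptions, not the SAS code used in the project; the variable names `tkr` and `baseline_pain` and the numbers in the usage example are hypothetical.

```python
from statistics import mean, variance


def add_interaction(row):
    """Compute the interaction after imputation, from imputed main effects."""
    out = dict(row)
    out["tkr_x_baseline_pain"] = row["tkr"] * row["baseline_pain"]
    return out


def pool_rubin(estimates, std_errors):
    """Combine one coefficient across m imputed datasets (Rubin's rules).

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])   # average sampling variance
    between = variance(estimates)                   # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5


# Hypothetical coefficient estimates from three imputed datasets.
estimate, std_error = pool_rubin([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

The between-imputation term is what lets the pooled standard error reflect the amount of missing data in addition to ordinary sampling variation.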
APPENDIX C: WOMAC Pain Score
I. WOMAC Knee Pain Score agreement with other measures of Pain and Function
WOMAC Knee pain was selected as our primary outcome in collaboration with our
stakeholders and study team. To confirm the importance, and better understand the meaning of
the outcome, we reviewed correlations and scatter plots of WOMAC knee pain (WOMKP) with the
following variables: baseline SF-12 physical score (HSPSS), the Physical activity scale for the Elderly
(PASE), the WOMAC disability scales (WOMADL), the KOOS sport and recreational activity scale
(KOOSFSR), the KOOS quality of life (KOOSQOL) and the KGLRS scale assessing the effects of knee
pain and arthritis on function. We used the matched set of TKR and non-TKR knees in the OAI
database for these evaluations. For some scales, higher scores represent worse outcomes
(WOMKP, WOMADL, KGLRS) and for others, higher scores represent better outcomes (PASE,
KOOS, SF-12). Most patterns we found were as expected. For both Control and TKR subjects,
worse baseline WOMAC Knee pain (XWOMKP) was significantly (p<.0001) associated with worse
scores for physical function (ADL, KOOS, and KGLRS) (Table 1).
Figure 1. Distributions of WOMAC and Estimated WOMAC
II. Estimation of a WOMAC knee pain score using results from the KSS
Both the OAI and MOST data sources had data for WOMAC knee pain but the NEBH and
TMC data sources did not. In order to create a common WOMAC or WOMAC-like knee pain
variable across data sources, we constructed a new variable based on the KSS data available in the
NEBH and TMC datasets. Different versions of the KSS were used for NEBH versus TMC so we used
different approaches to estimate a WOMAC score for each. We examined how the WOMAC scale
was constructed and used that information to estimate a WOMAC score from KSS data. Figure 1
shows the distributions of the WOMAC and resulting estimated WOMAC from each of the four
studies stratified by timing (baseline versus follow-up).
Creation of a WOMAC Pain score from the NEBH version of KSS
We used the OAI database to establish and explore relationships between the WOMAC
pain score, based on five components, and the estimated WOMAC score based on fewer
components that would be captured in a KSS.
The KSS captures data on walking, stairs, and rest. Keeping similar weighting as the
WOMAC, we developed the following mapping of the KSS to make an estimated WOMAC
instrument.
Table 2. NEBH WOMAC/KSS Mapping Schemes
WOMAC Scale | KSS Scale (NEBH) | Walking: KSS score | Walking: Estimated WOMAC component score | Stairs: KSS score | Stairs: Estimated WOMAC component score | Rest: KSS score | Rest: Estimated WOMAC component score
None (0) | None | 35 | 0 | 15 | 0 | 0 | 0
Mild (1) | Mild/Occasional | 30 | 1 | 10 | 1 | -5 | 3
Moderate (2) | Moderate | 15 | 2 | 5 | 2 | -10 | 6
Severe (3), Extreme (4) | Severe | 0 | 3.5 | 0 | 3.5 | -15 | 10.5
WOMAC is a sum of 5 pain scores on a 0 to 4 point scale (Stairs + Walking + In Bed + Sit/Lie
down + Standing), where high numbers mean high pain. The KSS is a sum of 3 pain scores, each
with their own scale (Stairs + Walking + Rest), where low numbers mean high pain.
Both scales give more ‘weight’ for rest pain than stair or walking pain. The KSS weights
stair pain as worse than walking pain, unless it is severe, in which case they are the same. WOMAC weights
stair pain and walking pain the same and distinguishes different types of rest pain.
Our estimated WOMAC pain score weights stair and walking pain the same, as with the
WOMAC. The range is 0 to 17.5 (rather than 0 to 20). It assumes that the (“KSS Pain at Rest” x 3) is
the same as (WOMAC Pain in Bed + WOMAC Pain Sitting/Lying down + WOMAC Pain Standing).
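The NEBH mapping in Table 2 amounts to three lookups and a sum. The Python sketch below encodes the table values; the function names are hypothetical, and the linear rescaling to 0 to 100 is an assumption about how the re-scaling described later in this appendix could be carried out.

```python
# KSS score -> estimated WOMAC component score, taken from Table 2.
WALKING = {35: 0.0, 30: 1.0, 15: 2.0, 0: 3.5}
STAIRS = {15: 0.0, 10: 1.0, 5: 2.0, 0: 3.5}
REST = {0: 0.0, -5: 3.0, -10: 6.0, -15: 10.5}  # rest counts for three WOMAC items

MAX_ESTIMATED_WOMAC = 17.5  # 3.5 + 3.5 + 10.5


def estimated_womac_pain(kss_walking, kss_stairs, kss_rest):
    """Sum the three estimated WOMAC components (range 0 to 17.5)."""
    return WALKING[kss_walking] + STAIRS[kss_stairs] + REST[kss_rest]


def rescale_0_100(score, max_score=MAX_ESTIMATED_WOMAC):
    """Assumed linear rescaling so that the worst possible pain equals 100."""
    return 100.0 * score / max_score


# Mild walking, stair, and rest pain: 1 + 1 + 3 = 5 estimated WOMAC points.
mild_everywhere = estimated_womac_pain(30, 10, -5)
```

Note that the direction flips in the lookup: a high KSS score means little pain, while a high estimated WOMAC score means more pain.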
Creation of a WOMAC Pain score from the TMC version of KSS
The Tufts database used an older version of the KSS and presented the biggest challenge
for the pain score outcome. We asked members of the research team to use the questions
collected on the Tufts KSS and review the WOMAC scale and weightings and try to score items
that they believed would approximate the WOMAC. We then calculated scores and plotted
distributions at baseline and follow-up. After reviewing the distribution of scores at baseline and
follow-up the team decided to use the version shown in Table 3 which looked most like what was
seen in the NEBH, the OAI, and MOST databases.
Table 3. TMC WOMAC/KSS Mapping Schemes
KSS Pain | KSS Walk (I. Walking) | Estimated WOMAC component score | KSS Stairs (Stairs) | Estimated WOMAC component score
None | Unlimited, > 10 Blocks, 5 - 10 Blocks | 0 | Normal Up & Down, Normal Up; Down with Rail; Up & Down with Rail | 0
None | < 5 blocks, Housebound, Unable | 0.5 | Up with Rail; Unable | 0.5
Mild or Occasional | Unlimited, > 10 Blocks, 5 - 10 Blocks | 1 | Normal Up & Down, Normal Up; Down with Rail; Up & Down with Rail | 1
Mild or Occasional | < 5 blocks, Housebound, Unable | 1.5 | Up with Rail; Unable | 1.5
Mild or Occasional, Stairs Only | Unlimited, > 10 Blocks, 5 - 10 Blocks | 2 | Normal Up & Down, Normal Up; Down with Rail; Up & Down with Rail | 2
Mild or Occasional, Stairs Only | < 5 blocks, Housebound, Unable | 2.5 | Up with Rail; Unable | 2.5
Mild or Occasional, Walking & Stairs | Unlimited, > 10 Blocks, 5 - 10 Blocks | 2 | Normal Up & Down, Normal Up; Down with Rail; Up & Down with Rail | 2
Mild or Occasional, Walking & Stairs | < 5 blocks, Housebound, Unable | 2.5 | Up with Rail; Unable | 2.5
Moderate Occasional | Unlimited, > 10 Blocks, 5 - 10 Blocks | 3 | Normal Up & Down, Normal Up; Down with Rail; Up & Down with Rail | 3
Moderate Occasional | < 5 blocks, Housebound, Unable | 3.5 | Up with Rail; Unable | 3.5
Moderate Continual | Unlimited, > 10 Blocks, 5 - 10 Blocks | 4 | Normal Up & Down, Normal Up; Down with Rail; Up & Down with Rail | 4
Moderate Continual | < 5 blocks, Housebound, Unable | 4.5 | Up with Rail; Unable | 4.5
Severe | -- | 5 | -- | 5
In summary, we re-scaled the estimated WOMAC pain scores from 0 to 100 in our
matched datasets. The distributions of scores pre- and post-TKR were similar to those observed in
the MOST and OAI databases. Also, we looked at follow-up scores and controlled for baseline
scores, which were based on the same scale within studies. We reviewed pain scores between
controls and matched TKR subjects within each study and did not see any gross inconsistencies.
While we are aware that our methods are not validated, we believe they are adequate and
reasonable based on patient and clinician stakeholder and research team input.
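Read row by row, the TMC mapping in Table 3 follows a simple pattern: the KSS pain category sets a base component score, and a limited walking or stair response adds 0.5, except that Severe pain always scores 5. The Python sketch below encodes that reading of the table for a single component. The function name and the boolean `limited` flag are hypothetical simplifications, and how the components were combined into a total is not shown here because the text does not specify it.

```python
# Base component score by KSS pain category, as read from Table 3.
BASE_SCORE = {
    "None": 0.0,
    "Mild or Occasional": 1.0,
    "Mild or Occasional, Stairs Only": 2.0,
    "Mild or Occasional, Walking & Stairs": 2.0,
    "Moderate Occasional": 3.0,
    "Moderate Continual": 4.0,
}


def tmc_component_score(kss_pain, limited):
    """Estimated WOMAC component score for walking or for stairs.

    `limited` is True for the more restricted response group in Table 3
    (walking: "< 5 blocks, Housebound, Unable"; stairs: "Up with Rail; Unable").
    """
    if kss_pain == "Severe":
        return 5.0  # Table 3 assigns 5 regardless of walking or stair ability
    return BASE_SCORE[kss_pain] + (0.5 if limited else 0.0)
```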
APPENDIX D: The process for creating predictive models for the pain and function outcomes
Model Development: Initial variable selection was done (development and updating
phases) using a stepwise selection process in one dataset with one observation per knee created
from averaging 10 multiply-imputed copies of the OAI database (and later the pooled OAI and
MOST databases). The candidate variables and interactions included in the selection process were
chosen from those available in the OAI and MOST database and that stakeholders and the project
team considered important and plausible. The selected variables then were used to make a
separate model for each of the individual imputed datasets, and we combined the results from
these 10 models to get the parameter estimates and associated standard errors that accounted
for both variation in the data and the amount of missingness. If variables in the model were no
longer significant at the p < 0.10 level, then they were removed using a backward selection
process.
Model Validation: Performance of the linear regression models derived on the OAI
matched database was tested on the MOST dataset. We looked at scatter plots of predicted
outcomes (based on the equation from the OAI model) versus true 1-year outcomes in the MOST
database, and also r-square values. We did this for the pooled data and stratified by treatment
status.
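The validation check described above can be sketched in Python as follows. This is an illustration only: the text does not say how the r-square values were computed, so the sketch assumes the squared Pearson correlation between predicted and observed 1-year outcomes, and the row layout `(treatment, predicted, observed)` is hypothetical.

```python
def r_squared(predicted, observed):
    """Squared Pearson correlation between predicted and observed outcomes."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov * cov / (var_p * var_o)


def r_squared_by_treatment(rows):
    """Return r-squared for the pooled data and for each treatment group.

    `rows` is a list of (treatment, predicted, observed) tuples.
    """
    results = {"pooled": r_squared([r[1] for r in rows], [r[2] for r in rows])}
    for group in sorted({r[0] for r in rows}):
        subset = [r for r in rows if r[0] == group]
        results[group] = r_squared([r[1] for r in subset], [r[2] for r in subset])
    return results
```

Reporting the stratified values alongside the pooled one matters because a model can separate TKR from non-TKR knees well overall while predicting poorly within either group.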
Model Updating: The MOST and OAI databases were then pooled together (10
imputed copies of each) and the beta coefficients from the validated model were re-estimated on
the larger pooled database. We then re-explored adding additional interactions of variables with
the treatment indicator variable and removing variables based on significance, clinical, or
pragmatic reasons. These final models were called the P1 and F1 models for pain and function.
Re-derivation: A project objective was to make patient-specific predictions accounting for
their individual characteristics. For a statistical model, this implies interactions of variables
representing patient characteristics with those representing the treatment. To better screen for
interactions, we used a larger dataset that included OAI and MOST data, and also the NEBH and
TMC matched datasets. One tradeoff with using these latter data sources is that they had fewer
candidate variables to draw upon. The model development process we used for the pooled data
from the four data sources was similar to that used for the model development. The final models
created from the pooled data from the four data sources were called the P2 and F2 models.
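The role of treatment interactions can be seen in a small Python sketch. The coefficients below are hypothetical and are not the P1, P2, F1, or F2 models; the point is only that, once a patient characteristic is interacted with the treatment indicator, the predicted difference between TKR and no TKR varies from patient to patient instead of being a single constant.

```python
# Hypothetical coefficients for illustration only; not the fitted models.
COEF = {
    "intercept": 5.0,
    "baseline_pain": 0.6,
    "age": 0.05,
    "tkr": -2.0,
    "tkr_x_baseline_pain": -0.4,  # interaction of treatment with baseline pain
}


def predict_pain(baseline_pain, age, tkr, coef=COEF):
    """Predicted 1-year pain from a linear model with a treatment interaction.

    `tkr` is 1 for knee replacement and 0 for nonsurgical care.
    """
    main_effects = (coef["intercept"]
                    + coef["baseline_pain"] * baseline_pain
                    + coef["age"] * age)
    treatment_effect = coef["tkr"] + coef["tkr_x_baseline_pain"] * baseline_pain
    return main_effects + tkr * treatment_effect


def predicted_benefit(baseline_pain, age, coef=COEF):
    """Difference in predicted pain, TKR minus no TKR (negative means less pain)."""
    return predict_pain(baseline_pain, age, 1, coef) - predict_pain(baseline_pain, age, 0, coef)
```

With these made-up numbers, a patient with baseline pain of 50 has a larger predicted reduction from TKR than a patient with baseline pain of 20, which is the kind of patient-specific contrast the interactions are meant to capture.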
APPENDIX E: User Interface Testing
User interface testing included three components:
1. Entering demographic data and completing questionnaires to provide the information
needed for the predictive models.
2. Interpreting predictive model results through data displays including:
a. Data Table - Identifying current pain and function scale results and identifying
predicted pain and function outcomes with surgical or nonsurgical treatment.
b. Bar Charts – Understanding the meaning of scales, identifying current and
predicted outcomes, and understanding uncertainty.
c. Combined pain and function plot – Understanding the meaning of a data point,
identifying current and predicted outcomes, and comprehending the concept of
uncertainty around the predictions.
3. Determining user understanding of the predictive models, mathematical equipoise and
clinical trial randomization through case-based discussions.
Usability testing included a “think aloud” protocol and usability testing script. The former
involved asking users to accomplish a series of tasks including completing questionnaires and
describing their interpretation of the predicted outcomes. This allowed us to understand
participants’ thoughts as they used the application and to identify aspects of the tool that were
unclear. Any questions or problems were noted to improve future versions of the application. The
latter allowed determination of users’ understanding and interpretation of the current and
predicted outcome results page.
The usability testing was conducted in two stages. Both versions of the Usability Testing
Plan are provided on the following pages.
Version 1 - Usability and Cognitive Testing Plan
This version of the usability testing plan was used for the initial testing conducted with a
diverse group of research institute staff and members of our patient and clinician stakeholder
panels.
Introduction: Imagine that you (or someone you know) has knee osteoarthritis and is
contemplating different options, including surgical and nonsurgical treatments. Now imagine that
your clinician asks you to complete these questions in advance of the appointment [or possibly,
you will complete the questions together with your clinician during that appointment]. Next, you
and your clinician are going to make some decisions about the next steps for how to treat your
knee osteoarthritis.
Please complete this online questionnaire, which will take about 15 minutes. We expect
the total time will be about 1 hour for both completing the questionnaire and answering some
questions about using this application.
Note: When you see the term “physical function,” it refers to how your health affects your
ability to perform activities that you might do during a typical day.
Part 1: Questionnaire Usability: Please follow the prompts and instructions on the screen.
The bar at the top will show your progress, and you can use the arrows to go backwards and
forwards. Some of the pages require you to scroll down. You need to complete all of the questions
on a page before you can go onto the next page. If you have any questions while completing the
forms, please ask me. I will record your questions, so we can improve future versions of this
questionnaire. Please think aloud while you complete the questionnaire.
A. If you need help on this page, what would you do?
Part 2: Usability of Numeric Predictions: The predictive model built into the application
used information that you entered to calculate the current level of knee pain and physical
functioning. It also calculated four predictions about the level of knee pain and physical
functioning at 1 year. [This includes knee pain without surgery, knee pain with surgery, physical
function without surgery, and physical function with surgery].
[The patient is directed to the screen displaying the bar graphs for current and predicted
pain and function with and without surgical treatment.]
A. Expectations
Physical Function:
I have a few questions before we look at all of your results. [Research Assistant
turns the computer away and writes out the current scores on a piece of paper along with
the scale and later the predicted scores.]
o What bothers you more, knee pain or physical function? [Ask about whichever one
bothers the patient more first.]
o Your current physical function score is __ on a scale of 0 to 100 where 0 is poor
function and 100 is excellent function.
o In a year, would you expect your physical function with usual care to be higher or
lower than your current physical function?
o What activities would you hope to be able to do after a year of usual care?
o If you were to have knee replacement surgery, would you expect your physical
function to be higher or lower in a year?
o What activities would you hope to be able to do a year after knee replacement
surgery?
Knee Pain:
o Your current knee pain score is ___ on a scale of 0 to 100 where 0 is no pain and
100 is extreme pain. In a year, would you expect your knee pain with usual care to
be higher or lower than your current knee pain? If you were to have knee
replacement surgery, would you expect your knee pain to be higher or lower in a
year than your current knee pain?
B. Usability of Graphical Depiction of Predictions
Interpret the Bar graphs and Legend:
o Just looking at the bar graph for current pain, where on the bar graph would knee
pain be worse?
o Where would it be better?
o Would your friends (or the people that you know) understand where severe knee
pain and where mild knee pain are on the graph?
o Brief description of uncertainty: Just like with a weather forecast, there is some
uncertainty for any prediction. When you hear that there is a 60% chance of
showers, you know there is some uncertainty around that number. The most likely
prediction (the average) is shown by the height of the color in the bar. 95 out of
100 people who answered like you would have knee pain predictions within the
colored area. Please take a look at this graph.
[On the top next to the pain tab, click on the function tab.]
o On the current function graph, where on the graph would physical function be
worse?
o On the current function graph, where on the graph would physical function be
better?
o Would your friends (or the people that you know) understand where poor function
and where excellent function are on the graph?
o What does the colored area around the value marked in the bar graph mean to
you?
o Does it convey anything to you about uncertainty or error?
o [If the user does not think that the colored area conveys uncertainty, then ask this
question.] If the colored area around the value in the bar graph does not convey
uncertainty, what would convey uncertainty to you for this bar chart?
o Given the bar graphs, do they help you interpret the results? [We will show the bar
graph for pain and the bar graph for function separately.]
Interpreting combined pain and function plot:
o [The Research Assistant first moves to the Combined Knee Pain and Function
Graph, and walks through the different layers that we can add on the graph, using
the text below.]
o The purple star on this graph displays your current pain level with your current
physical function level.
o The blue diamond shows the predictions for your knee pain and physical function
at one year with usual care. Just like the line on the bar graph, the blue colored
cloud represents the uncertainty around the predictions for knee pain and physical
function at one year with usual care.
o The green circle shows your predicted knee pain and physical function for one year
after knee replacement. Just like the line on the bar graph, the green colored cloud
represents the uncertainty around the knee pain and physical function prediction
one year after knee replacement surgery.
o Brief description of uncertainty: There is some uncertainty for any (model)
prediction. If we give you a prediction of 32, there is some probability that the
actual outcome will be higher or lower than that. On the graph, we are showing
knee pain and physical function predicted together. 9 out of 10 people like you
would fall within the colored area around the prediction. Please take a look at this
diagram/graph.
o The overlap between the dotted blue and green clouds shows that some people
with either usual care or knee replacement have the same predicted knee pain and
physical function outcomes.
o If the current value (purple star) is within the blue dotted cloud around the usual
care prediction, then this means that there is a chance that with usual care after
one year this individual will still be at the same level of knee pain and physical
function.
o Can you point to your current level of pain and function?
o Where is your prediction if you do have surgery?
o Where is your prediction if you have usual care?
o What are your likely outcomes if you do have surgery? Would you get better,
worse, or do about the same?
o Given the uncertainty oval, how confident are you about that?
o [If the current score is within the uncertainty shape, ask this question.] What does
it mean that your current score is within the dashed shape? [We are looking for an
articulation of whether the graph is unclear and whether more explanation is
needed. E.g. Can the user realize that his/her current score is in the surgical circle?
Perhaps, it could come out that the scores would be no better off with surgery.]
o What are your likely outcomes if you do not have surgery? Would you get better,
worse, or do about the same?
o Given the uncertainty shape, how confident are you about that?
o [If the current score is within the uncertainty circle, ask this question.] What does it
mean that your current score is within the dashed shape? [We are looking for an
articulation of whether the graph is unclear and whether more explanation is
needed. E.g. Can the user realize that his/her current score is in the surgical circle?
Perhaps, it could come out that the scores would be no better off with surgery.]
o How do you interpret the colored shape around the prediction?
o [Researcher points outside of the shape.] Is this a likely outcome? [No, would be
the correct answer. It is possible but not likely.]
o You picked the (surgical/nonsurgical) treatment option before, would you still
choose (surgery/ nonsurgical treatment)? OR (You were not sure which treatment
option that you would choose before, are you still undecided?)
C. Current Score
Function:
o [Please look at the table on the left and tell me] what is your current physical
function score?
o Would you say that your current score suggests your physical function is excellent,
very good, good, fair, or poor?
Pain:
o What is your current knee pain score?
o Would you say that your knee pain score suggests that your knee pain is none, mild,
moderate, severe, or extreme?
D. Predicted Scores
Physical function:
o What is your predicted 1-year physical function score with usual care?
o What is your predicted 1-year physical function score with surgery?
o Compared to your current physical function score, would you say that you would
be better, worse, or the same in your physical functioning in one year with usual
care?
o Compared to your current physical function score, would you say that you would
be better, worse, or the same in your physical functioning in one year with
surgery?
Knee Pain:
o What is your predicted 1-year knee pain score with usual care?
o What is your predicted 1-year knee pain score with surgery?
o Compared to your current knee pain score, would you say that your knee pain
would be better-off, worse-off, or the same with usual care?
o Compared to your current knee pain score, would you say that your knee pain
would be better-off, worse-off, or the same with surgery?
Interpretation:
o Based on this information about surgical or nonsurgical treatment options, does
one option look better than the other for improving knee pain?
o Does one option look better than the other for improving your physical function?
o Is the improvement what you would have expected?
o Which treatment option would you choose given all four predictions?
o What do you understand about the improvement that makes you select it?
[The patient is directed to the screen displaying the bar graphs of pain and
function.]
Part 3: Equipoise
1. Please look at the three scenarios below depicted from the results page.
o Let’s imagine that a clinician is telling you about a research study with two options
(surgical or nonsurgical treatment of knee OA). If you choose to participate in the
study, you will be randomized to the surgical or nonsurgical treatment with equal
chances of each treatment being the one you receive.
o Looking at figure 1, would you choose randomization?
o Looking at figure 2, would you choose randomization?
o Looking at figure 3, would you choose randomization?
o What’s the amount of overlap that suggests your future pain and function will be
about the same with or without surgery?
[Figures showing three scenarios: Small Overlap, Moderate Overlap, Large Overlap]
Additional possible questions to add about the knee pain and physical function graphs
to test for understanding:
Order:
o Does the order of Knee Pain results, then Physical Function results, and ultimately the
results with Knee Pain and Physical Function together provide a logical and helpful
organization of the information?
Concluding Questions:
o If you were unable to make a decision about treatment options after viewing this tool,
what further information would you need?
o Did this questionnaire aid your decision making process in choosing between
nonsurgical and surgical treatment for knee OA?
o Are there any other parts of this results page that you found unclear or confusing that
you have not already mentioned?
o Are there any further changes that you would recommend for this questionnaire?
Version Two - Usability and Cognitive Testing Plan
This version of the usability testing plan was used for the final testing conducted with
patients in the Rheumatology, Orthopedic and Primary care clinics.
Introduction: Imagine that you (or someone you know) has knee osteoarthritis and is contemplating
different options, including surgical and nonsurgical treatments. Now imagine that your clinician
asks you to complete these questions in advance of the appointment [or possibly, you will
complete the questions together with your clinician during that appointment]. Next, you and your
clinician are going to make some decisions about the next steps for how to treat your knee
osteoarthritis.
Note: When you see the term “physical function,” it refers to how your health affects your
ability to perform activities that you might do during a typical day.
Part 1: Questionnaire Usability: Please follow the prompts and instructions on the screen. The bar at the top will show
your progress, and you can use the arrows to go backwards and forwards. Some of the pages
require you to scroll down. You need to complete all of the questions on a page before you can go
onto the next page. If you have any questions while completing the forms, please ask me. I will
record your questions, so we can improve future versions of this questionnaire.
Part 2: Usability of Graphical Depiction of Predictions
Interpret the Bar graphs and Legend:
o Tell me what you think this is telling you.
o What does the colored area around the value marked in the bar graph mean to you?
o (Based on the information displayed, what does this tell you about the level of knee
pain at one year with usual care and with knee replacement?)
o (Based on the information displayed, what does this tell you about the level of
physical function at one year with usual care and with knee replacement?)
Interpreting combined pain and function plots:
o What does this tell you?
o (Based on the information displayed, what does this tell you about the outcomes at
one year with usual care and with knee replacement?)
o What are your likely outcomes if you do have surgery? Would you get better, worse,
or do about the same?
o Given the uncertainty oval, how confident are you about that?
o What are your likely outcomes if you do not have surgery? Would you get better,
worse, or do about the same?
o (Based on the information displayed, what does this tell you about the outcomes at
one year with usual care and with knee replacement?)
o Given the uncertainty shape, how confident are you about that?
o How do you interpret the colored shape around the prediction?
Concluding Questions:
o Are there any other parts of this results page that you found unclear or confusing that
you have not already mentioned?
o Only ask if the patient is eligible for randomization, as determined by the model (shown by
the orange bar in the application itself). Imagine that there was a research study in
which you were eligible to participate. The study is testing two different treatments for
knee osteoarthritis: nonsurgical treatment or knee replacement. Considering that the
model predicts that you would benefit from either knee replacement surgery or
nonsurgical treatment, would you be willing to be randomized in a study in which you
would have an equal chance of being assigned to either total knee replacement or
nonsurgical treatment? If no, why not? If yes, why would you choose to participate?
APPENDIX F

Figure 1a. Consort Diagram - OAI

OAI [May 2014]: 4796 Patients
  EXCLUDED (n pts):
    Patients not in OAI 'incidence' or 'progression' cohort: 122
    Patients with no follow-up or outcome data: 254
    Patients with TKR in both knees >90 days apart: 39
    Patients with TKR prior to OAI entry: 2
4379 Patients, 8713 Knees
  EXCLUDED (n knees):
    Knees (non-TKR) from patients who had TKR on contralateral knee: 2
    Knees with TKR that had no pre-TKR visit within one year of the TKR: 2
    Knees with TKR that had no post-TKR visit 6-60 months post-surgery: 4
CONTROL (non-TKR) Sample: 4049 Patients, 8095 Knees
TKR Sample: 253 Patients, 278 Knees**
  **Note: For patients with bilateral TKR (or TKR on both knees close in time), only 1
  knee was used. If bilateral, choice was random; if two close in time, data
  from first TKR was used.
MATCH TKR SAMPLE WITH CONTROL SAMPLE [KNEE VISITS]
  Matching variables:
    Age (<55, 55-65, >65)
    Gender
    WOMAC Pain + Disability (Riddle based) on incident knee
    WOMAC Pain + Disability (Riddle based) on contralateral knee
    Location (Riddle category)
    K-L (Riddle): moderate/severe vs. not
    SF-12 (<44, 44-56, >56)
    Charlson (0, 1, >2, missing)
    Change in WOMAC Pain from prior visit (>=2 points vs. not)
  EXCLUDED: 1 TKR Patient/Knee that could not be matched to Control knee-visit
CONTROL: 252 Knees; TKR: 252 Knees
Figure 1b. Consort Diagram - MOST

MOST [January
3026 Patients
  EXCLUDED (n):
    Patients with TKR in both knees >90 days apart: 69
2957 Patients, 5914 Knees
  EXCLUDED (n):
    Knees with TKR that had no pre-TKR visit within one year of the TKR, OR knees with TKR that had no post-TKR visit 6-60 months post-surgery: 221
    No Post-Surgery WOMAC: 133
    Knees (non-TKR) from patients who had TKR on contralateral knee: 115
CONTROL (non-TKR) Sample: 2652 Patients, 5071 Knees
TKR Sample: 2652 Patients, 5071 Knees**
  **Note: For patients with bilateral TKR (or TKR on both knees close in time), only 1 knee
  was used. If bilateral, choice was random; if two close in time, data from first TKR was used.
MATCH TKR SAMPLE WITH CONTROL SAMPLE [KNEE VISITS]
  Matching variables:
    Age (<55, 55-65, >65)
    Gender
    WOMAC Knee pain [0-20 scale] (0-3, 4-9, 10-20, missing)
    WOMAC contralateral knee pain (0-2, 3-8, 9-20, missing)
    SF-12 (<44, 44-56, >56)
    Charlson (0, 1, >2, missing)
    Change in WOMAC Pain from prior visit (>=2 points vs. not)
  EXCLUDED (n): 1 TKR Patient/Knee that could not be matched to Control knee-visit: 0
CONTROL: 154 Knees; TKR: 154 Knees
Figure 1c. Consort Diagram - NEBH

NEBH (December 2014): 5519 Subjects
  EXCLUDED n = 5205:
    Knee OA diagnosis
    Prior Surgery
    No Follow-up information/Coding Issues
314 Subjects
CONTROL (non-TKR) Sample: 2652 Patients (5071 Knees); 4049 Patients (8095 Knees)
TKR Sample: 314 Subjects, 314 Knees**
  **Note: For patients with bilateral TKR (or TKR on both knees close in time), only 1 knee
  was used. If bilateral, choice was random; if two close in time, data from first TKR was used.
MATCH TKR SAMPLE WITH CONTROL SAMPLE [KNEE VISITS]
  Matching variables:
    Age (<55, 55-65, >65)
    Gender
    WOMAC Knee pain [0-100] (11-50, 51-75, 75-100, missing)
    WOMAC contralateral knee pain [0-100] (11-50, 51-75, 75-100, missing)
    SF-12 (<44, 44-56, >56)
    Charlson (0, 1, >2, missing)
  EXCLUDED n = 66: Missing Follow-up information
CONTROL: 248 Knees; TKR: 248 Knees
Figure 1d. Consort Diagram - TMC

Tufts Medical Center [July 2015]: 535 Subjects
  EXCLUDED n = 99 pts:
    Patients with TKR in both knees >90 days apart
    Patients with TKR revision as their primary procedure
    Patients with TKR due to rheumatoid arthritis
    Patients without TKR
436 Subjects
  EXCLUDED n = 319 pts:
    Patients with TKR that had no post-TKR visit 6-60 months post-surgery
    Patients with TKR in both knees without an available one-year follow-up for the first TKR before the second TKR
    Patients who did not clearly meet the inclusion criteria (2)
    Patients with unavailable medical records
    Duplicate or ambiguously identified patients or those with inconsistent or missing data
    Patients for whom, due to limited resources, we were unable to collect additional data to supplement the original registry data
117 Subjects
  EXCLUDED n = 21 pts:
    Issues with TKA classification and miss-coded
96 Subjects
CONTROL (non-TKR) Sample: 2652 Patients (5071 Knees) [MOST]; 4049 Patients (8095 Knees) [OAI]
TKR Sample: 96 Subjects, 96 Knees**
  **Note: For patients with bilateral TKR (or TKR on both knees close in time), only 1
  knee was used. If bilateral, choice was random; if two close in time, data from first TKR was used.
MATCH TKR SAMPLE WITH CONTROL SAMPLE [KNEE VISITS]
  Matching variables:
    Age (<55, 55-65, >65)
    Gender
    WOMAC Knee pain [0-100] (11-50, 51-75, 75-100, missing)
    SF-12 (<44, 44-56, >56)
    Charlson (0, 1, >2, missing)
  EXCLUDED n = 25 pts: Missing Follow-up
CONTROL: 72 Knees; TKR: 72 Knees
Table 1a.i - Comparison of Distribution of Variables used for Matching – OAI database
Variable   TKR   Non-TKR
MATCHING VARIABLES
Age Riddle category   N=252   N=252
  1. <55   7.9% (20)   7.5% (19)
  2. 55-65   32.5% (82)   32.9% (83)
  3. >65   59.5% (150)   59.5% (150)
Gender (%, fraction male)   42.9% (108/252)   43.7% (110/252)
WOMAC Riddle based_sum pain+dis   N=252   N=252
  1. slight: p+d <=11   8.3% (21)   9.1% (23)
  2. mod: p+d 12-22   18.3% (46)   19.4% (49)
  3. intense: p+d 23-33   25.0% (63)   24.6% (62)
  4. severe: p+d >33   46.0% (116)   44.4% (112)
  99. N/A   2.4% (6)   2.4% (6)
WOMAC cat on contralat   N=252   N=252
  1. slight: p+d <=11   54.8% (138)   54.8% (138)
  2. mod: p+d 12-22   18.7% (47)   18.7% (47)
  3. intense: p+d 23-33   12.3% (31)   12.3% (31)
  4. severe: p+d >33   12.7% (32)   12.7% (32)
  99. N/A   1.6% (4)   1.6% (4)
Location Riddle category   N=252   N=252
  0. Unicomp tibio-fem   1.2% (3)   2.8% (7)
  1. Unicomp+ patello-fem   47.6% (120)   46.4% (117)
  2. Tricompart   1.2% (3)   0.8% (2)
  99. N/A   50.0% (126)   50.0% (126)
K-L Riddle category   N=252   N=252
  0. not mod/sev   20.2% (51)   21.8% (55)
  1. yes mod/sev   29.8% (75)   28.2% (71)
  99. N/A   50.0% (126)   50.0% (126)
Base SF-12 Physical groups   N=252   N=252
  1. 0 to <43   50.4% (127)   49.2% (124)
  2. >=43 to <56   23.4% (59)   24.6% (62)
  3. >=56   2.0% (5)   2.0% (5)
  99. N/A   24.2% (61)   24.2% (61)
Base Charlson Comorbidity score   N=252   N=252
  0. none   34.1% (86)   34.9% (88)
  1. one   11.1% (28)   10.3% (26)
  2. two or more   5.6% (14)   5.6% (14)
  99. N/A   49.2% (124)   49.2% (124)
Delta WOMAC pain category (on 20pt scale)   N=252   N=252
  0. no more than 1 point increase   26.6% (67)   27.0% (68)
  1. increase of at least 2 points   64.7% (163)   64.3% (162)
  99. N/A   8.7% (22)   8.7% (22)
Table 1a.ii - Comparison of Distribution of Variables used for Matching – MOST database
Variable   TKR   Non-TKR
MATCHING VARIABLES
Age (Riddle category)   N=154   N=154
  01. <55   8.4% (13)   8.4% (13)
  02. 55-65   40.3% (62)   42.9% (66)
  03. >65   51.3% (79)   48.7% (75)
Gender (%, fraction male)   31.2% (48/154)   30.5% (47/154)
Base WOMAC knee pain groups   N=154   N=154
  01. 0 to 3   5.8% (9)   5.8% (9)
  02. 4 to 9   39.6% (61)   40.3% (62)
  03. 10-20 (max)   25.3% (39)   24.7% (38)
  99. N/A   29.2% (45)   29.2% (45)
Base WOMAC contra knee pain groups   N=154   N=154
  01. 0 to 2   26.0% (40)   26.0% (40)
  02. 3 to 8   35.7% (55)   35.7% (55)
  03. 9-20 (max)   9.1% (14)   9.1% (14)
  99. N/A   29.2% (45)   29.2% (45)
Base WOMAC disability score groups   N=154   N=154
  01. 0 to 17   7.8% (12)   8.4% (13)
  02. 18 to 32   39.0% (60)   40.9% (63)
  03. >=33   21.4% (33)   18.8% (29)
  99. N/A   31.8% (49)   31.8% (49)
Base SF-12 Physical groups   N=154   N=154
  01. 0 to 43   51.3% (79)   49.4% (76)
  02. 44 to 56   13.6% (21)   16.2% (25)
  03. >=57   0.6% (1)   . (.)
  99. N/A   34.4% (53)   34.4% (53)
Base Charlson Comorbidity score   N=154   N=154
  00. None   61.7% (95)   61.0% (94)
  01. One   26.0% (40)   24.7% (38)
  02. Two or more   12.3% (19)   14.3% (22)
Delta WOMAC pain groups   N=154   N=154
  00. Negative/Decrease   7.1% (11)   8.4% (13)
  01. No change or 1 point increase   7.1% (11)   6.5% (10)
  02. Increase of 2 or more points   18.2% (28)   17.5% (27)
  99. N/A   67.5% (104)   67.5% (104)
Table 1a.iii - Comparison of Distribution of Variables used for Matching – NEBH database
Variable   TKR   Non-TKR
MATCHING VARIABLES
Age Riddle category   N=248   N=248
  01. <55   18.1% (45)   18.5% (46)
  02. 55-65   40.7% (101)   40.7% (101)
  03. >65   41.1% (102)   40.7% (101)
Gender (%, fraction male)   41.9% (104/248)   40.3% (100/248)
Base WOMAC knee pain groups   N=248   N=248
  01. 0 to 10   0.4% (1)   0.8% (2)
  02. 11 to 50   37.1% (92)   39.5% (98)
  03. 51 to 75   44.8% (111)   47.2% (117)
  04. 75 to 100 (max)   14.5% (36)   9.3% (23)
  99. N/A   3.2% (8)   3.2% (8)
Base WOMAC contralat knee pain groups   N=248   N=248
  01. 0 to 10   35.5% (88)   35.5% (88)
  02. 11 to 50   51.2% (127)   50.8% (126)
  03. 51 to 75   8.1% (20)   8.5% (21)
  04. 75 to 100 (max)   2.0% (5)   2.0% (5)
  99. N/A   3.2% (8)   3.2% (8)
Base SF-12 Physical groups   N=248   N=248
  01. 0 to 43   69.0% (171)   69.4% (172)
  02. 44 to 56   25.4% (63)   25.0% (62)
  03. >=57   1.2% (3)   1.2% (3)
  99. N/A   4.4% (11)   4.4% (11)
Base Charlson Comorbidity   N=248   N=248
  00. None   75.4% (187)   75.4% (187)
  01. One   16.9% (42)   16.9% (42)
  02. Two or more   6.5% (16)   6.5% (16)
  99. N/A   1.2% (3)   1.2% (3)
Table 1a.iv - Comparison of Distribution of Variables used for Matching – TUFTS database
Variable   TKR (72)   Non-TKR (72)
MATCHING VARIABLES
Age Riddle category   N=72   N=72
  01. <55   26.4% (19)   26.4% (19)
  02. 55-65   33.3% (24)   34.7% (25)
  03. >65   40.3% (29)   38.9% (28)
Gender (%, fraction male)   69.4% (50/72)   69.4% (50/72)
Base WOMAC knee pain groups   N=72   N=72
  02. 11 to 50   50.0% (36)   55.6% (40)
  03. 51 to 75   40.3% (29)   31.9% (23)
  04. 75 to 100 (max)   1.4% (1)   2.8% (2)
  99. N/A   8.3% (6)   9.7% (7)
Base SF-12 Physical groups   N=72   N=72
  01. 0 to 43   47.2% (34)   47.2% (34)
  02. 44 to 56   5.6% (4)   5.6% (4)
  99. N/A   47.2% (34)   47.2% (34)
Base Charlson Comorbidity   N=72   N=72
  00. None   48.6% (35)   48.6% (35)
  01. One   19.4% (14)   19.4% (14)
  02. Two or more   30.6% (22)   30.6% (22)
  99. N/A   1.4% (1)   1.4% (1)
Table 1b.i - Comparisons of subject characteristics in raw data – OAI
Variable   TKR (n=252)   Non-TKR (n=252)   TKR minus non-TKR Delta (∆) and [95% CI]   Effect Size (∆/SD)
Values are mean +/- standard deviation (SD) or %(n)
Age in years   67.88 +/- 8.64 (252)   66.82 +/- 8.36 (252)   1.06 [-0.43 to 2.55]   0.12
BMI   29.80 +/- 4.60 (184)   30.39 +/- 4.66 (179)   -0.59 [-1.55 to 0.36]   -0.13
SF-12 Physical   38.75 +/- 9.17 (191)   39.82 +/- 9.31 (191)   -1.07 [-2.93 to 0.79]   -0.12
SF-12 Mental   55.57 +/- 8.64 (191)   54.39 +/- 9.18 (191)   1.18 [-0.61 to 2.98]   0.13
Depression Scale   7.47 +/- 7.13 (233)   7.99 +/- 6.98 (231)   -0.52 [-1.81 to 0.77]   -0.07
Back Pain   62.2% +/- 48.6% (188)   66.8% +/- 47.2% (193)   (4.6%) [(14.3%) to 5.0%]   -0.10
ADL/Disability   24.49 +/- 11.70 (247)   23.27 +/- 11.72 (247)   1.21 [-0.86 to 3.28]   0.10
WOMAC Total score   35.66 +/- 15.99 (246)   33.40 +/- 16.20 (246)   2.26 [-0.59 to 5.11]   0.14
WOMAC Pain subscores, 0=none, 4=extreme
  WOMAC Pain- Walking   1.77 +/- 0.98 (252)   1.46 +/- 0.90 (250)   0.32 [0.15 to 0.48]   0.34
  WOMAC Pain- Stairs   2.31 +/- 0.97 (248)   1.96 +/- 0.92 (248)   0.35 [0.18 to 0.52]   0.37
  WOMAC Pain- In Bed   1.14 +/- 1.14 (251)   1.05 +/- 1.11 (250)   0.09 [-0.11 to 0.29]   0.08
  WOMAC Pain- Sit/Lie down   0.96 +/- 0.94 (251)   1.12 +/- 0.97 (250)   -0.17 [-0.34 to -0.00]   -0.18
  WOMAC Pain- Standing   1.42 +/- 0.94 (250)   1.30 +/- 0.99 (250)   0.13 [-0.04 to 0.30]   0.13
WOMAC Knee pain (0-100)   38.00 +/- 18.92 (251)   34.41 +/- 19.70 (250)   3.59 [0.20 to 6.98]   0.19
Contralateral Knee Pain (0-100)   15.05 +/- 17.30 (252)   15.67 +/- 16.59 (251)   -0.62 [-3.59 to 2.35]   -0.04
Hip Pain or Pain/Ache/Stiff   56.0% +/- 49.8% (191)   60.1% +/- 49.1% (193)   (4.1%) [(14.0%) to 5.8%]   -0.08
Homunculus (0 to 100%)   24.63 +/- 12.82 (155)   26.47 +/- 14.42 (164)   -1.84 [-4.85 to 1.17]   -0.13
Narcotics   14.3% +/- 35.1% (252)   9.6% +/- 29.5% (251)   4.7% [(1.0%) to 10.4%]   0.15
Charlson (approximate)   N=128   N=128
  0   67.2% (86)   68.8% (88)
  1   21.9% (28)   20.3% (26)
  2   7.0% (9)   9.4% (12)
  3   3.9% (5)   1.6% (2)
Baseline Charlson_approx   0.48 +/- 0.79 (128)   0.44 +/- 0.73 (128)   0.04 [-0.15 to 0.23]   0.05
Follow-up (FU)
FU SF-12 Physical   44.55 +/- 9.53 (129)   41.29 +/- 10.10 (171)   3.26 [1.00 to 5.52]   0.33
FU WOMAC Knee Pain (0-100)   11.78 +/- 15.69 (252)   26.62 +/- 20.33 (252)   -14.84 [-18.01 to -11.66]   -0.82
Time from Baseline to FU (months)
  Median <q1-q3> (n)   22.6 <13.1-24.1> (252)   12.0 <11.1-12.7> (252)
  Mean +/-sd (n)   19.76 +/- 5.60 (252)   12.20 +/- 2.81 (252)   7.56 [6.78 to 8.33]   1.71
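The Effect Size column is labeled ∆/SD. As a rough illustration of how such a value can be reproduced from the group summaries, the sketch below divides the difference in means by a pooled SD. The pooling formula (root mean square of the two group SDs) and the function name `effect_size` are assumptions made for this example; the report does not state which SD was used.

```python
import math

def effect_size(mean_tkr, sd_tkr, mean_control, sd_control):
    """Difference in means divided by a pooled SD.

    Assumption: the pooled SD is the root mean square of the two group SDs.
    """
    pooled_sd = math.sqrt((sd_tkr ** 2 + sd_control ** 2) / 2.0)
    return (mean_tkr - mean_control) / pooled_sd

# Inputs taken from the Age and BMI rows of Table 1b.i (OAI).
age_es = effect_size(67.88, 8.64, 66.82, 8.36)
bmi_es = effect_size(29.80, 4.60, 30.39, 4.66)
print(round(age_es, 2), round(bmi_es, 2))
```

Under this assumption the Age and BMI rows round to the tabulated values; other rows may differ slightly because the table inputs are themselves rounded.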
Table 1b.ii - Comparisons of subject characteristics in raw data – MOST
Variable   TKR (n=154)   Non-TKR (n=154)   TKR minus non-TKR Delta (∆) and [95% CI]   Effect Size (∆/SD)
Values are mean +/- standard deviation (SD) or %(n)
Age in years   65.49 +/- 6.88 (154)   65.21 +/- 7.47 (154)   0.28 [-1.33 to 1.89]   0.04
BMI   32.19 +/- 5.69 (106)   31.77 +/- 6.15 (107)   0.42 [-1.18 to 2.02]   0.07
SF-12 Physical   36.73 +/- 8.26 (101)   37.25 +/- 8.08 (101)   -0.53 [-2.79 to 1.74]   -0.06
SF-12 Mental   55.75 +/- 9.41 (101)   53.38 +/- 9.65 (101)   2.37 [-0.28 to 5.01]   0.25
Depression Scale   8.13 +/- 7.42 (97)   10.70 +/- 9.02 (96)   -2.56 [-4.91 to -0.22]   -0.31
Back Pain   71.1% +/- 45.5% (97)   81.3% +/- 39.2% (96)   (10.1%) [(22.2%) to 2.0%]   -0.24
ADL/Disability   28.08 +/- 10.44 (105)   26.84 +/- 10.85 (105)   1.24 [-1.66 to 4.13]   0.12
WOMAC Total Score   40.07 +/- 14.43 (105)   37.93 +/- 15.02 (105)   2.14 [-1.86 to 6.15]   0.15
WOMAC Pain subscores, 0=none, 4=extreme
  WOMAC Pain- Walking   1.78 +/- 0.77 (109)   1.50 +/- 0.82 (109)   0.28 [0.07 to 0.50]   0.36
  WOMAC Pain- Stairs   2.51 +/- 0.87 (109)   2.18 +/- 0.94 (109)   0.33 [0.09 to 0.57]   0.36
  WOMAC Pain- In Bed   1.27 +/- 0.96 (109)   1.26 +/- 0.98 (109)   0.01 [-0.25 to 0.27]   0.01
  WOMAC Pain- Sit/Lie down   0.81 +/- 0.87 (109)   0.90 +/- 0.80 (109)   -0.09 [-0.31 to 0.13]   -0.11
  WOMAC Pain- Standing   1.50 +/- 0.82 (109)   1.54 +/- 0.90 (109)   -0.05 [-0.28 to 0.18]   -0.05
WOMAC Knee pain (0-100)   40.28 +/- 17.52 (109)   38.39 +/- 17.72 (109)   1.88 [-2.82 to 6.58]   0.11
Contralateral Knee Pain (0-100)   22.32 +/- 18.97 (109)   23.30 +/- 19.09 (109)   -0.99 [-6.07 to 4.09]   -0.05
Hip Pain or Pain/Ache/Stiff   46.4% +/- 50.0% (153)   61.4% +/- 48.8% (153)   (15.0%) [(26.2%) to (3.9%)]   -0.30
Homunculus (0 to 100%)   17.34 +/- 17.25 (154)   20.93 +/- 19.01 (154)   -3.59 [-7.66 to 0.48]   -0.20
Narcotics   20.8% +/- 40.7% (106)   14.0% +/- 34.9% (107)   6.7% [(3.5%) to 17.0%]   0.18
Charlson (approximate)   N=154   N=154
  0   61.7% (95)   61.0% (94)
  1   26.0% (40)   24.7% (38)
  2   11.0% (17)   9.1% (14)
  3   1.3% (2)   2.6% (4)
  4   . (.)   0.6% (1)
  5   . (.)   0.6% (1)
  7   . (.)   1.3% (2)
Baseline Charlson_approx   0.52 +/- 0.74 (154)   0.66 +/- 1.15 (154)   -0.14 [-0.35 to 0.08]   -0.14
Follow-up (FU)
FU SF-12 Physical   40.03 +/- 11.66 (133)   40.69 +/- 10.72 (137)   -0.67 [-3.35 to 2.02]   -0.06
FU WOMAC Knee Pain (0-100)   14.01 +/- 16.35 (154)   23.93 +/- 18.13 (154)   -9.92 [-13.79 to -6.05]   -0.05
Time from Baseline to FU (months)
  Median <q1-q3> (n)   36 <17-39> (154)   25 <16-37> (152)   -0.57
  Mean +/-sd (n)   30.05 +/- 11.63 (154)   26.72 +/- 10.64 (152)   3.33 [0.82 to 5.84]   0.3
Table 1b.iii - Comparisons of subject characteristics in raw data – NEBH
Variable   TKR (n=248)   Non-TKR (n=248)   TKR minus non-TKR Delta (∆) and [95% CI]   Effect Size (∆/SD)
Values are mean +/- standard deviation (SD) or %(n)
Age in years   63.33 +/- 8.52 (248)   63.28 +/- 8.85 (248)   0.05 [-1.48 to 1.58]   0.01
BMI   31.50 +/- 6.71 (244)   31.18 +/- 5.00 (233)   0.32 [-0.75 to 1.39]   0.05
SF-12 Physical   37.33 +/- 8.92 (237)   38.06 +/- 9.17 (237)   -0.72 [-2.36 to 0.91]   -0.08
SF-12 Mental   48.23 +/- 8.05 (237)   53.08 +/- 10.57 (237)   -4.85 [-6.54 to -3.15]   -0.52
Depression Scale   10.06 +/- 9.43 (238)
Back Pain   75.6% +/- 43.0% (234)
ADL/Disability   29.07 +/- 14.46 (229)
WOMAC Total Score   43.05 +/- 19.85 (228)
WOMAC Pain subscores, 0=none, 4=extreme
  WOMAC Pain- Walking   1.96 +/- 0.96 (241)
  WOMAC Pain- Stairs   2.58 +/- 0.91 (236)
  WOMAC Pain- In Bed   1.70 +/- 1.30 (241)
  WOMAC Pain- Sit/Lie down   1.46 +/- 1.09 (240)
  WOMAC Pain- Standing   1.90 +/- 1.02 (240)
WOMAC Knee pain (0-100)   56.07 +/- 19.02 (240)   48.41 +/- 22.67 (240)   7.67 [3.91 to 11.42]   0.37
Contralateral Knee Pain (0-100)   21.11 +/- 22.23 (240)   23.99 +/- 21.84 (240)   -2.89 [-6.84 to 1.06]   -0.13
Hip Pain or Pain/Ache/Stiff   12.5% +/- 33.1% (248)   64.7% +/- 47.9% (238)   (52.2%) [(59.5%) to (44.9%)]   -1.27
Homunculus (0 to 100%)   30.22 +/- 17.05 (211)
Narcotics   8.5% +/- 27.9% (248)   15.4% +/- 36.1% (241)   (6.9%) [(12.6%) to (1.2%)]   -0.21
Charlson (approximate)   N=245   N=245
  0   76.3% (187)   76.3% (187)
  1   17.1% (42)   17.1% (42)
  2   6.1% (15)   2.4% (6)
  3   0.4% (1)   2.0% (5)
  4   . (.)   1.2% (3)
  5   . (.)   0.8% (2)
Baseline Charlson_approx   0.31 +/- 0.60 (245)   0.37 +/- 0.85 (245)   -0.07 [-0.20 to 0.07]   -0.09
Follow-up (FU)
FU SF-12 Physical   48.92 +/- 9.45 (228)   40.37 +/- 10.00 (152)   8.55 [6.56 to 10.54]   0.88
FU WOMAC Knee Pain (0-100)   15.33 +/- 18.87 (248)   33.53 +/- 23.83 (247)   -18.20 [-21.99 to -14.40]   -0.85
Time from Baseline to FU (months)
  Median <q1-q3> (n)   12 <12-12> (248)   12.2 <11.46-22> (244)
  Mean +/-sd (n)   12.77 +/- 3.33 (248)   16.90 +/- 9.29 (244)   -4.12 [-5.35 to -2.89]   -0.59
Table 1b.iv- Comparisons of subject characteristics in raw data – TUFTS
Variable   TKR (n=72)   Non-TKR (n=72)   TKR minus non-TKR Delta (∆) and [95% CI]   Effect Size (∆/SD)
Values are mean +/- standard deviation (SD) or %(n)
Age in years   62.6 +/- 10.0 (72)   61.8 +/- 8.1 (72)   0.75 [-2.23 to 3.73]   0.08
BMI   35.8 +/- 19.3 (65)   31.2 +/- 4.5 (49)   4.57 [-1.01 to 10.15]   0.31
SF-12 Physical   31.5 +/- 7.6 (38)   35.5 +/- 6.7 (38)   -4.05 [-7.32 to -0.78]   -0.57
SF-12 Mental   48.5 +/- 12.2 (38)   51.7 +/- 11.8 (38)   -3.23 [-8.72 to 2.27]   -0.27
Depression Scale   not avail   11.0 +/- 8.9 (47)
Back Pain   not avail   81.3% +/- 39.4% (48)
ADL/Disability   not avail   31.1 +/- 9.9 (57)
WOMAC Total Score   not avail   44.6 +/- 12.8 (57)
WOMAC Pain subscores, 0=none, 4=extreme
  WOMAC Pain- Walking   not avail   1.9 +/- 0.8 (65)
  WOMAC Pain- Stairs   not avail   2.5 +/- 0.8 (62)
  WOMAC Pain- In Bed   not avail   1.6 +/- 0.9 (65)
  WOMAC Pain- Sit/Lie down   not avail   1.5 +/- 0.8 (65)
  WOMAC Pain- Standing   not avail   1.8 +/- 0.7 (65)
WOMAC Knee pain (0-100)   48.0 +/- 13.1 (66)   47.2 +/- 14.5 (65)   0.80 [-3.98 to 5.58]   0.06
Contralateral Knee Pain (0-100)   not avail   36.0 +/- 21.4 (65)
Hip Pain or Pain/Ache/Stiff   9.9% +/- 30.0% (71)   60.0% +/- 49.4% (65)   (50.1%) [(63.9%) to (36.4%)]   -1.24
Homunculus (0 to 100%)   13.0 +/- 7.7 (23)   23.5 +/- 21.5 (64)   -10.49 [-19.61 to -1.36]   -0.56
Narcotics   16.7% +/- 37.5% (72)   19.3% (11/57)   (2.6%) [(16.2%) to 10.9%]   -0.07
Charlson (approximate)   N=71   N=71
  0   49.3% (35)   49.3% (35)
  1   19.7% (14)   19.7% (14)
  2   22.5% (16)   14.1% (10)
  3   2.8% (2)   8.5% (6)
  4   1.4% (1)   1.4% (1)
  5   2.8% (2)   7.0% (5)
  6   1.4% (1)   . (.)
Baseline Charlson_approx   1.01 +/- 1.34 (71)   1.14 +/- 1.50 (71)   -0.13 [-0.60 to 0.34]   -0.09
Follow-up (FU)
FU SF-12 Physical   39.2 +/- 9.1 (52)   37.2 +/- 7.8 (42)   1.91 [-1.60 to 5.42]   0.22
FU WOMAC Knee Pain (0-100)   16.4 +/- 15.3 (72)   35.0 +/- 21.0 (72)   -18.65 [-24.71 to -12.58]   -1.01
Time from Baseline to FU (months)
  Median <q1-q3> (n)   14 <13-16> (70)   15.36 <12.12-29.5> (72)
  Mean +/-sd (n)   15.84 +/- 12.00 (70)   21.28 +/- 11.54 (72)   -5.43 [-9.34 to -1.53]   -0.46
Table 1c.i - Comparisons of subject characteristics in imputed data – OAI
Variable   TKR (n=252)   Non-TKR (n=252)   TKR minus non-TKR Delta (∆) and [95% CI]   Effect Size (∆/SD)
Values are mean +/- standard deviation (SD)
Age 67.88 ± 8.50 66.82 ± 8.50 1.06 [-0.42 , 2.54] 0.06
Male, N(%) 0.43 ± 0.50 0.44 ± 0.50 -0.01 [-0.09 , 0.08] -0.01
Baseline BMI 29.84 ± 5.84 30.51 ± 5.69 -0.67 [-1.68 , 0.33] -0.06
Baseline SF-12 Physical 38.68 ± 10.51 39.77 ± 10.64 -1.10 [-2.94 , 0.75] -0.05
Baseline SF-12 Mental 55.60 ± 9.81 54.28 ± 9.61 1.31 [-0.38 , 3.01] 0.07
Baseline WOMAC Knee pain (0-100) 37.97 ± 19.33 34.40 ± 19.38 3.56 [0.18 , 6.94] 0.09
Baseline Knee Pain in Contralateral (0-100) 15.05 ± 16.95 15.70 ± 16.99 -0.65 [-3.62 , 2.31] -0.02
Baseline Hip Pain or Pain/Ache/Stiff 0.56 ± 0.54 0.60 ± 0.56 -0.04 [-0.14 , 0.06] -0.04
At least one comorbidity, N (%) 0.30 ± 0.61 0.27 ± 0.70 0.03 [-0.08 , 0.14] 0.02
Narcotics, N (%) 0.14 ± 0.32 0.10 ± 0.33 0.05 [-0.01 , 0.10] 0.07
Follow-up SF-12 Physical 44.20 ± 13.65 41.17 ± 10.24 3.03 [0.92 , 5.13] 0.13
Follow-up WOMAC Knee Pain (0-100) 11.78 ± 18.15 26.62 ± 18.15 -14.84 [-18.01 , -11.67] -0.41
*Shaded rows indicate variables where definitions varied between databases so that these variables ultimately were excluded as candidates in the building of final models.
Table 1c.ii - Comparisons of subject characteristics in imputed data – MOST
Variable   TKR (n=154)   Non-TKR (n=154)   TKR minus non-TKR Delta (∆) and [95% CI]   Effect Size (∆/SD)
Values are mean +/- standard deviation (SD)
Age 65.49 ± 7.18 65.21 ± 7.18 0.28 [-1.32 , 1.88] 0.02
Male, N(%) 0.31 ± 0.46 0.31 ± 0.46 0.01 [-0.10 , 0.11] 0.01
Baseline BMI 32.33 ± 9.48 31.44 ± 7.18 0.89 [-0.98 , 2.77] 0.05
Baseline SF-12 Physical 36.80 ± 7.83 38.62 ± 10.43 -1.82 [-3.87 , 0.24] -0.1
Baseline SF-12 Mental 55.99 ± 11.00 54.23 ± 11.80 1.76 [-0.79 , 4.31] 0.08
Baseline WOMAC Knee pain (0-100) 40.23 ± 24.82 35.33 ± 22.24 4.90 [-0.36 , 10.17] 0.1
Baseline Knee Pain in Contralateral (0-100) 21.46 ± 20.06 19.38 ± 24.10 2.08 [-2.87 , 7.03] 0.05
Baseline Hip Pain or Pain/Ache/Stiff 0.46 ± 0.50 0.61 ± 0.50 -0.15 [-0.26 , -0.04] -0.15
At least one comorbidity, N (%) 0.38 ± 0.49 0.39 ± 0.49 -0.01 [-0.12 , 0.10] -0.01
Narcotics, N (%) 0.20 ± 0.44 0.13 ± 0.39 0.07 [-0.03 , 0.16] 0.08
Follow-up SF-12 Physical 40.01 ± 12.11 40.10 ± 11.99 -0.09 [-2.78 , 2.60] 0
Follow-up WOMAC Knee Pain (0-100) 14.01 ± 17.26 23.93 ± 17.26 -9.92 [-13.77 , -6.06] -0.29
*Shaded rows indicate variables where definitions varied between databases so that these variables ultimately were excluded as candidates in the building of final models.
Table 1c.iii - Comparisons of subject characteristics in imputed data – NEBH
Variable   TKR (n=248)   Non-TKR (n=248)   TKR minus non-TKR Delta (∆) and [95% CI]   Effect Size (∆/SD)
Values are mean +/- standard deviation (SD)
Age 63.33 ± 8.69 63.28 ± 8.69 0.05 [-1.48 , 1.58] 0
Male, N(%) 0.42 ± 0.49 0.40 ± 0.49 0.02 [-0.07 , 0.10] 0.02
Baseline BMI 31.56 ± 5.86 31.14 ± 6.07 0.41 [-0.64 , 1.46] 0.03
Baseline SF-12 Physical 37.33 ± 9.22 38.20 ± 9.13 -0.87 [-2.49 , 0.74] -0.05
Baseline SF-12 Mental 48.22 ± 9.60 53.17 ± 9.72 -4.95 [-6.65 , -3.25] -0.26
Baseline WOMAC Knee pain (0-100) 56.11 ± 21.64 48.11 ± 21.16 8.00 [4.23 , 11.77] 0.19
Baseline Knee Pain in Contralateral (0-100) 21.27 ± 22.18 23.86 ± 22.18 -2.58 [-6.49 , 1.32] -0.06
Baseline Hip Pain or Pain/Ache/Stiff 0.13 ± 0.41 0.65 ± 0.42 -0.52 [-0.60 , -0.45] -0.63
At least one comorbidity, N (%) 0.24 ± 0.43 0.23 ± 0.42 0.00 [-0.07 , 0.08] 0
Narcotics, N (%) 0.08 ± 0.32 0.15 ± 0.33 -0.07 [-0.13 , -0.01] -0.1
Follow-up SF-12 Physical 48.94 ± 10.42 39.32 ± 10.37 9.63 [7.80 , 11.46] 0.46
Follow-up WOMAC Knee Pain (0-100) 15.33 ± 21.49 33.48 ± 21.53 -18.14 [-21.93 , -14.36] -0.42
*Shaded rows indicate variables where definitions varied between databases so that these variables ultimately were excluded as candidates in the building of final models.
Table 1c.iv - Comparisons of subject characteristics in imputed data – TMC
Variable   TKR (n=72)   Non-TKR (n=72)   TKR minus non-TKR Delta (∆) and [95% CI]   Effect Size (∆/SD)
Values are mean +/- standard deviation (SD)
Age 62.57 ± 9.06 61.82 ± 9.06 0.75 [-2.21 , 3.71] 0.04
Male, N(%) 0.69 ± 0.46 0.69 ± 0.46 -0.00 [-0.15 , 0.15] 0
Baseline BMI 33.42 ± 6.77 30.99 ± 7.28 2.43 [0.13 , 4.73] 0.17
Baseline SF-12 Physical 32.00 ± 10.14 35.74 ± 12.93 -3.74 [-7.54 , 0.05] -0.16
Baseline SF-12 Mental 49.48 ± 14.97 51.58 ± 14.97 -2.10 [-6.99 , 2.79] -0.07
Baseline WOMAC Knee pain (0-100) 47.48 ± 14.48 46.43 ± 14.43 1.05 [-3.67 , 5.77] 0.04
Baseline Knee Pain in Contralateral (0-100) 21.46 ± 13.72 19.38 ± 16.48 2.08 [-2.87 , 7.03] 0.07
Baseline Hip Pain or Pain/Ache/Stiff 0.10 ± 0.41 0.60 ± 0.45 -0.50 [-0.64 , -0.36] -0.59
At least one comorbidity, N (%) 0.51 ± 0.51 0.51 ± 0.51 0.00 [-0.16 , 0.17] 0
Narcotics, N (%) 0.17 ± 0.38 0.19 ± 0.40 -0.02 [-0.15 , 0.11] -0.03
Follow-up SF-12 Physical 39.66 ± 10.09 36.10 ± 10.04 3.56 [0.27 , 6.85] 0.18
Follow-up WOMAC Knee Pain (0-100) 16.35 ± 18.40 35.00 ± 18.40 -18.65 [-24.66 , -12.63] -0.51
*Shaded rows indicate variables where definitions varied between databases so that these variables ultimately were excluded as candidates in the building of final models.
APPENDIX G: Example of stakeholder influence on the model development process
Stakeholders were strongly supportive of using both the pain and functional outcomes in
patients’ decision-making processes, although they were not of equal importance to all patient
stakeholders. As the project progressed, the team continued to get more input from patient and
clinician stakeholders, which influenced modeling. For example, the initial model built on the
pooled OAI and MOST dataset predicted 1-year physical function with inclusion of a depression
score, which was a variable included in the OAI and MOST data sets. We used the larger pooled
dataset to see if any variables alone, or interacting with a treatment variable, should be added or
removed to improve the model, and had the model re-reviewed by clinician and patient
stakeholders and the team developing the user interface. In this case, the depression score had a
p-value between .05 and .10 in the model of the 1-year SF-12 outcome after correction for
imputation error. The depression scores available in our database came from a multi-item
questionnaire. The research team was concerned that the additional items needed to compute the
depression score might be burdensome for the patient and/or clinician to collect, and we thus
considered removing the variable from the model. We then compared model performance
(r-square, calibration) with and without the variable. Although performance was slightly worse
without the variable, the decline in r-square of <1% was considered insufficient to justify the data
collection burden of retaining it, and the team decided to use the simpler version of the model.
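The with-and-without comparison described above can be sketched as follows. This is only an illustration on synthetic data: the variable names, the simulated effect sizes, and the helper `r_squared` are assumptions for the example, and the sketch ignores multiple imputation and calibration, which the project also considered.

```python
import numpy as np

def r_squared(X, y):
    """R-square of an ordinary least-squares fit of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in data (hypothetical; not the study data).
rng = np.random.default_rng(0)
n = 500
baseline_pcs = rng.normal(38, 9, n)    # baseline SF-12 physical score
treatment = rng.integers(0, 2, n)      # 1 = TKR, 0 = control
depression = rng.normal(8, 7, n)       # multi-item depression score
followup_pcs = (20 + 0.55 * baseline_pcs + 3.0 * treatment
                - 0.05 * depression + rng.normal(0, 8, n))

r2_full = r_squared(np.column_stack([baseline_pcs, treatment, depression]), followup_pcs)
r2_reduced = r_squared(np.column_stack([baseline_pcs, treatment]), followup_pcs)
r2_drop = r2_full - r2_reduced
print(f"full={r2_full:.3f} reduced={r2_reduced:.3f} drop={r2_drop:.4f}")
```

A small drop in r-square when the variable is removed is the kind of evidence that, in the case described, supported choosing the simpler model.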
APPENDIX H
Table 2a. Models of 1-year Knee pain (scored as 0 = no pain, 100= extreme pain)
Model columns, in order: Initial Model built on OAI and tested on MOST; P1 model (from pooled MOST+OAI); P2 model (from 4-source data)
R-square 0.36(OAI), 0.32 (MOST) 0.32 0.32
Beta Coeff(stderr) p-value Beta Coeff(stderr) p-value Beta Coeff(stderr) p-value
Model intercept (constant) 1.95 (2.60) p=0.4550 -2.99(4.04) p=0.4597 31.44(5.52) p=<.0001
Treatment (1=TKR, 0=control) -4.59 (3.41) p=0.1781 -5.00(2.77) p=0.0718 -3.33(2.16) p=0.1246
WOMAC knee pain (base) 0.49 (0.05) p=<.0001 0.42(0.04) p=<.0001 0.49(0.03) p=<.0001
Interaction: Treatment * WOMAC knee pain -0.21 (0.07) p=0.0044 -0.18(0.06) p=0.0026 -0.33(0.05) p=<.0001
Age (in years) -0.12(0.05) p=0.0225
SF-12 mental component [base] -0.11(0.05) p=0.033
SF-12 physical component [base] -0.21(0.07) p=0.0017
Age, dichotomized (less than 60 years old: 1=yes,0=no)) 4.20 (1.79) p=0.0186 4.44(1.41) p=0.0017
WOMAC [base] contralateral knee pain 0.13(0.04) p=0.0562
Homunculus % 0.26 (0.08) p=0.0008 0.11(0.05) p=0.0155
Body mass index [base], kg/m2 0.22(0.12) p=0.0628
Hip Pain (1=yes, 0=no) -0.31 (2.48) p=0.8992 2.00(1.80) p=0.2694
Interaction: Treatment*Hip Pain -5.69 (2.96) p=0.0545 -3.82(2.29) p=0.0948
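To show how a linear model of this form yields a pair of predictions for one patient, the sketch below evaluates the P2 pain model with and without TKR. The coefficient-to-column assignment is an assumption: for rows listing three values the third is taken as P2, and the single-value Age and SF-12 rows are attributed to P2 because Table 2c lists those variables as included in P2. The patient values and the function name `predict_pain` are hypothetical.

```python
# Coefficients read from Table 2a, assumed to belong to the P2 model.
P2_PAIN = {
    "intercept": 31.44,
    "treatment": -3.33,               # 1 = TKR, 0 = control
    "womac_pain": 0.49,               # baseline WOMAC knee pain (0-100)
    "treatment_x_womac_pain": -0.33,  # interaction term
    "age_years": -0.12,
    "sf12_mental": -0.11,
    "sf12_physical": -0.21,
}

def predict_pain(tkr, womac_pain, age_years, sf12_mental, sf12_physical, coef=P2_PAIN):
    """Predicted 1-year WOMAC knee pain (0 = no pain, 100 = extreme pain)."""
    t = 1 if tkr else 0
    return (coef["intercept"]
            + coef["treatment"] * t
            + coef["womac_pain"] * womac_pain
            + coef["treatment_x_womac_pain"] * t * womac_pain
            + coef["age_years"] * age_years
            + coef["sf12_mental"] * sf12_mental
            + coef["sf12_physical"] * sf12_physical)

# Hypothetical patient, for illustration only.
patient = dict(womac_pain=50.0, age_years=65.0, sf12_mental=50.0, sf12_physical=35.0)
pain_nonsurgical = predict_pain(False, **patient)
pain_tkr = predict_pain(True, **patient)
print(round(pain_nonsurgical, 2), round(pain_tkr, 2))
```

Because of the treatment-by-baseline-pain interaction, the predicted difference between TKR and nonsurgical care grows with baseline pain rather than being a fixed offset.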
Table 2b. Models of 1-year Physical Function (Physical Component Score of SF-12)
Model columns, in order: Initial Model built on OAI and tested on MOST; F1 model (from pooled MOST+OAI); F2 model (from 4-source data)
Adjusted R-square (range from 10 imputed datasets) 0.42(OAI), 0.18 (MOST) 0.35 0.34
Beta Coeff(stderr) p-value Beta Coeff(stderr) p-value
Model Intercept (constant) 5.49 (8.70) p=0.5337 25.84(5.15) p=<.0001 17.40(4.27) p=<.0001
Treatment (1=TKR, 0=control) 3.43 (1.00) p=0.0017 2.58(0.74) p=0.0008 25.41(4.33) p=<.0001
Gender (1=male, 0=female) 1.25 (0.95) p=0.1961 1.60(0.75) p=0.037 0.99(0.57) p=0.0873
Age (in years) -0.11 (0.04) p=0.0080 -0.13(0.04) p=0.0017 -0.05(0.04) p=0.2397
SF-12 mental component [base] 0.32 (0.10) p=0.0058 0.12(0.05) p=0.0196 0.19(0.04) p=<.0001
SF-12 physical component [base] 0.65 (0.06) p=<.0001 0.57(0.04) p=<.0001 0.55(0.03) p=<.0001
Body mass index [base], kg/m2 -0.16(0.08) p=0.0664 -0.19(0.05) p=0.0008
Charlson Comorbidity Score >=1 (vs. 0) -2.05(0.60) p=0.0009
Interaction: hadTKR*age -0.15(0.06) p=0.0084
Interaction: hadTKR*SF-12 mental score -0.18(0.06) p=0.0013
WOMAC [base] contralat knee pain -0.05 (0.03) p=0.1335 -0.07(0.02) p=0.0033
Homunculus (% of sites positive) -0.19 (0.06) p=0.0059
Hip Pain (1=yes, 0=no) 3.90 (1.03) p=0.0003
Table 2c. Summary of variables used in multivariable models built on pooled data sources
Columns, in order: Label; Summary of Data (% or 5th to 95th percentile from imputed data) [for variables used for models] in the Pooled (2-source) database (OAI and MOST) and in the Pooled (4-source) database (OAI/MOST/TUFTS/NEBH); Included in model built on pooled 2-source (P1/F1) or pooled 4-source database (P2/F2): P1, F1, P2 ++, F2 ++
Treatment (1=TKR, 0=control) (50% -matched set) (50% -matched set) Yes Yes Yes Yes
Gender (1=male, 0=female) 39% male 42% male no Yes no Yes
Age (in years) 53 to 79 51 to 79 (min-max: 40 to 88) (dichot) Yes Yes Yes
SF-12 mental component [base] 37 to 67 (high is good) 34 to 66 no Yes Yes Yes
SF-12 physical component [base] 24 to 53 (high is good) 23 to 53 no Yes Yes Yes
WOMAC [base] contralat knee pain (100 pt. scale) 0 (no pain) to 51 not available Yes Yes no no
Body mass index [base], kg/m2 22 to 41 23 to 41 Yes Yes no Yes
WOMAC knee pain (base) 5 to 70 (high is bad) 10 to 80 Yes no Yes no
Homunculus, % sites with symptom 0% (no sites with pain) to 53% not available Yes no no no
Hip Pain (1=yes, 0=no)* 56% 48% Yes no no no
Less than 60 years old (1=yes,0=no) 21% 27% Yes no no no
Charlson comorbidity score >=1 (1=yes, 0=no) 32% 31% no no no Yes
* OAI definition: Right or Left hip pain, aching or stiffness: any, past 12 months (includes pain in groin and in front and sides of upper thigh)
*Variables in gray were used in P1 and/or F1 but were NOT used in P2/F2 models
++ Model used for interface
APPENDIX I: Exploration of Linear and Alternative Regression Models for WOMAC knee pain
One challenge encountered in this project was related to the distribution of the 1-year
WOMAC scores. We realized that the follow-up WOMAC Pain scores at 1 year (or the closest visit to
1 year) were right-skewed, with most subjects having low scores (less pain). In addition to looking at
linear regression models as planned, we also explored using generalized linear mixed models
assuming the outcome followed either a negative binomial distribution (with the outcome rounded to
integer values of 0 to 100) or a beta distribution (outcome rescaled to 0.01 to 0.99, with 0 equal
to 0.01 and 1 equal to 0.99).
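A beta distribution is defined only on the open interval (0, 1), which is why the 0-100 score needs rescaling before such a model can be fit. The sketch below shows one way to do this, assuming the score is first divided by 100 and then mapped linearly so that 0 becomes 0.01 and 1 becomes 0.99. The linear form of the mapping and the function name `rescale_for_beta` are assumptions for illustration.

```python
def rescale_for_beta(womac_pain_0_100):
    """Map a 0-100 WOMAC pain score into [0.01, 0.99] for beta regression.

    Assumption: a linear map of the 0-1 proportion, with 0 -> 0.01 and 1 -> 0.99.
    """
    proportion = womac_pain_0_100 / 100.0
    return 0.01 + 0.98 * proportion

rescaled = [rescale_for_beta(score) for score in (0.0, 50.0, 100.0)]
print(rescaled)
```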
We then ranked the true 1-year WOMAC pain outcome by quintile, and compared
observed mean WOMAC pain and predicted WOMAC pain (from each model) to see if we might
improve our predictions using a non-linear regression model.
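The quintile comparison can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, and splitting the ranked subjects into five groups of (nearly) equal size is an assumption about how ties and group boundaries were handled.

```python
from statistics import mean


def quintile_comparison(observed, predicted, n_groups=5):
    """Rank subjects by observed outcome and compare group means.

    Returns one (mean observed, mean predicted) pair per group, ordered
    from the lowest to the highest observed scores.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if len(observed) < n_groups:
        raise ValueError("need at least one subject per group")

    # Sort subject pairs by the observed (true) 1-year outcome.
    pairs = sorted(zip(observed, predicted), key=lambda pair: pair[0])
    n = len(pairs)

    rows = []
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        rows.append((mean([obs for obs, _ in group]),
                     mean([pred for _, pred in group])))
    return rows
```

Running the function once per candidate model, with the same observed scores and that model's predictions, gives one table per model; a model predicts well across the range when the two means are close in every quintile.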
The next page shows some preliminary models that were run on the pooled MOST and
OAI databases and plots of predicted values. After reviewing these results, we opted to continue
using linear regression for this project, as neither of the alternatives we explored appeared much
better than the simpler and pre-planned approach using linear regression.
APPENDIX J
Figure 1. Early version of the physical function predicted outcome results page
Figure 2. Final version of the physical function predicted outcome results page
Figure 3. Early combined pain and function predicted outcome results page
Figure 4. Final combined pain and function predicted outcome results page
APPENDIX K: Mathematical Equipoise between Pain and Function Predictions with Nonsurgical Care or Total Knee Replacement
Mathematical equipoise is defined within KOMET as the condition in which the pain and
functioning outcome predictions with nonsurgical care and with TKR are relatively close and fall
within each other’s “zones of uncertainty,” i.e., their circles of uncertainty overlap. These circles
are illustrated on the graph below. The blue diamond represents the outcome prediction point
estimate for nonsurgical care and the green circle represents the point estimate for TKR. The large
shaded blue and green circles extending around the point estimates represent the uncertainty
associated with the predictions. We computed the mathematical distance between the nonsurgical
and TKR predictions as the distance between the two coordinates on the pain and function graph.
Figure 1. Depiction of “zone of uncertainty” or mathematical equipoise
Equation for Calculating the Distance between Predictions with nonsurgical care and with knee replacement
The equation used to calculate the distance between the predictions for pain and function
is below.
$$d_1 = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
where the coordinate for pain and function predictions with nonsurgical care is represented as
$(x_1, y_1)$ and the coordinate for pain and function predictions with TKR as $(x_2, y_2)$.
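As an illustration, the distance calculation and an overlap check can be sketched in Python. The distance function follows the equation above. The overlap test is an assumption: this appendix does not state how the radius of each uncertainty circle is derived, so the radii are hypothetical inputs, and two circles are treated as overlapping when the distance between their centers is less than the sum of their radii.

```python
import math


def prediction_distance(nonsurgical, tkr):
    """Euclidean distance between two (pain, function) predictions."""
    x1, y1 = nonsurgical
    x2, y2 = tkr
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)


def circles_overlap(nonsurgical, tkr, radius_nonsurgical, radius_tkr):
    """Return True when the two uncertainty circles overlap.

    The radii are hypothetical parameters; how KOMET sizes the circles
    is not specified in this appendix.
    """
    distance = prediction_distance(nonsurgical, tkr)
    return distance < radius_nonsurgical + radius_tkr
```

For example, predictions at (0, 0) and (3, 4) are 5 units apart, so circles of radius 3 around each would overlap while circles of radius 2 would not.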
APPENDIX L: Three Sample Graphs with Small, Moderate, and Large Amounts of “Uncertainty Circle” Overlap
Small Overlap
Moderate Overlap
Large Overlap
APPENDIX M: Consistency of knee pain and function outcomes used for models with other measures of knee pain and function
The team felt it important to evaluate other outcomes to look for consistency of effect.
These were exploratory analyses done after the models for pain and function were finalized. The
evaluations were done using the OAI database.
For pain, the study outcome was WOMAC knee pain. For the consistency of effect
evaluation, we also looked at the KOOS knee pain and KOOS symptom scales.
For function, the study outcome was the SF-12 physical function score. For the consistency of
effect evaluation, we also looked at the KOOS function, sports, recreation (FSR) scale, the KOOS
quality of life (QOL) scale, and the KGLRS scale. The KGLRS is another quality of life index that asks
responders to ‘consider all the ways that knee pain and knee arthritis affect you’ and to rate on a
10-point scale how they ‘are doing,’ ranging from very good to very poor. For the purposes of these
evaluations, all of these scales/instruments were rescaled to 0 to 100, where low values
indicated poorer function and/or higher pain and high values indicated good function and/or
lower pain.
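The rescaling step can be illustrated with a small Python sketch. The function name and parameters are hypothetical, and the native range and direction of each instrument must be supplied by the user; the report does not list them here.

```python
def rescale_to_100(value, scale_min, scale_max, higher_is_better=True):
    """Rescale a score to 0-100 so that high always means better.

    scale_min and scale_max are the instrument's native limits
    (assumed known). Scales where high is bad are reversed.
    """
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    if not scale_min <= value <= scale_max:
        raise ValueError("value lies outside the scale's range")

    fraction = (value - scale_min) / (scale_max - scale_min)
    if not higher_is_better:
        # Reverse the direction, e.g. for a pain scale where high is bad.
        fraction = 1.0 - fraction
    return 100.0 * fraction
```

For instance, a pain score of 30 on a 0 to 100 scale where high is bad would become 70 after reversal, so that higher values indicate less pain on every scale.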
The results of these exploratory analyses suggest that WOMAC knee pain tracks well
with other measures of knee pain and symptoms, in particular KOOS knee pain. The SF-12
physical function score, while positively correlated, does not track as strongly with other
knee-related quality of life and function variables. This is somewhat to be expected: although
overall physical function and knee-related function overlap, they are not the same thing. Our
stakeholders suggest that both overall and knee-related function are important, and we have
come to believe that future work to develop predictions of the more specific knee-related function
would be useful to both patient and clinical stakeholders.
I. SUBJECT PLOTS: For illustrative purposes, we show baseline (pre) and 1-year follow-up (pos) raw (knee) pain and
function scores for a random sample of subjects. The header for each panel in each figure indicates whether the subject got a total knee
replacement (TKR). If the different scales are all capturing the same information, the lines within each panel should overlap. The
panel on the left shows the different pain scales (WOMAC knee pain (KP), KOOS KP, and KOOS Symptom). The panel on the right shows
the different function scales (SF-12 physical component score, KOOS FSR, KOOS QOL, KGLRS). In general, the lines were reasonably
parallel and going in the same direction, although there was variability.
II. DISTRIBUTIONS: The distributions of scores at baseline (PRE), at the approximate 1-year follow-up (POS), and of the change
from baseline (POS minus PRE; DEL) were plotted for each scale. Different colors show the distributions for the subjects who got
TKR (red) and those who did not get TKR (blue). The distributions of the 3 pain scores are on the left panel,
and those of the 4 function scores are on the right panel.
Consistency of the scores would best be illustrated by finding similarities of the distributions between ROWs of the figures, while
there still may be differences between columns. This is shown clearly in the plot of pain scores on the left. For PRE, the
distributions are reasonably symmetric and centered near a value of 60. For POS, the scores are higher (better) and skewed to the right,
especially for the TKR (red) subjects. The delta scores for all 3 pain measures are symmetric, but there is more separation
between the TKR and non-TKR (red and blue, respectively) subjects, with the TKR subjects having greater improvements captured by all
three scales.
III. CORRELATION: We next looked at consistency of scores at the subject level using simple bivariate scatter plots. If
scores for any two scales were the same, all points on the scatter plot would fall along a diagonal line, and the
corresponding correlation coefficient would be 1.0. Again, the panel on the left shows the 3 bivariate scatter plots for the 3 pain scores,
and the panel on the right shows the 6 bivariate plots for the 4 function scores. The red dots and red smoothed line are the data for the
subjects with TKR, and the blue dots and blue smoothed line are for the subjects who did not have TKR. All correlations were positive,
and nearly all had an associated p-value <0.05.
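A correlation coefficient for one pair of scales can be computed as in the sketch below. The report does not state which correlation coefficient was used; the Pearson coefficient shown here is an assumption, and the function name is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")

    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a list is constant")
    return sxy / math.sqrt(sxx * syy)
```

Two scales that give identical scores for every subject yield a coefficient of 1.0, matching the diagonal-line case described above.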
IV. AGREEMENT: The last evaluation categorized the change from baseline to follow-up as an improvement of over 8
points, a worsening of over 8 points, or a change of no more than +/- 8 points. This was done for each subject for each scale. Again,
bivariate tables were constructed to look at agreement between the change categories. The choice of a change of 8 points on a 100-point
scale was based on the KOOS User’s Guide 1.1, updated August 2012 (http://www.koos.nu/), which notes “The Minimal Important
Change (MIC) is currently suggested to be 8-10” with an acknowledgment that there are limitations to this suggestion. We evaluated
“agreement” with a kappa statistic. A kappa of 1 indicates perfect agreement. For the
pain scales, the WOMAC knee pain (KP) and KOOS KP had the highest kappa (consistent with the largest correlation seen in part III).
Kappas were lower for the function scales than for the pain scales. The SF-12 agreed more with the KOOS scales than with the KGLRS. Among the function
measures, the kappa was highest for the 2 KOOS scales (KOOS FSR and KOOS QOL). These results are shown on the following page.
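The categorization and agreement calculation can be sketched in Python as follows. The function names and category labels are hypothetical. Scores are assumed to be on the rescaled 0 to 100 scale where higher is better, so a positive change is an improvement. The report does not say which form of kappa was used; the unweighted Cohen's kappa below is an assumption.

```python
from collections import Counter


def change_category(pre, post, threshold=8.0):
    """Classify a change from baseline using the 8-point threshold."""
    delta = post - pre
    if delta > threshold:
        return "improved"
    if delta < -threshold:
        return "worsened"
    return "no change"


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two lists of category labels."""
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b):
        raise ValueError("need two non-empty lists of equal length")

    # Observed proportion of subjects placed in the same category.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Agreement expected by chance from the marginal category counts.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category occurs")
    return (observed - expected) / (1.0 - expected)
```

Applying `change_category` to each subject's scores on two scales and passing the two resulting lists to `cohens_kappa` gives the agreement between that pair of scales.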
A. AGREEMENT: Pain Scales
B. AGREEMENT: Function Scales
B. AGREEMENT: Function Scales (continued)
V. Summary of Scales
VI. Screenshots of components of KOOS and KGLRS Scales from OAI database
KOOS Knee Pain KOOS Function, Sports, Recreational Activities
VII. Screenshots of components of KOOS and KGLRS Scales from OAI database (continued)
KOOS SYMPTOMS KOOS QOL and KGLRS
Copyright© 2019. Tufts Medical Center. All Rights Reserved.
Disclaimer:
The [views, statements, opinions] presented in this report are solely the responsibility of the author(s) and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute® (PCORI®), its Board of Governors or Methodology Committee.
Acknowledgement:
Research reported in this report was [partially] funded through a Patient-Centered Outcomes Research Institute® (PCORI®) Award (#ME-1306-02327). Further information available at:
https://www.pcori.org/research-results/2013/developing-software-predict-patient-responses-knee-osteoarthritis-treatments