Developing Software to Predict Patient Responses to Knee Osteoarthritis Treatments and to Identify Patients for Possible Enrollment in Randomized Controlled Trials

Harry P. Selker, MD, MSPH, Denise H. Daudelin, RN, MPH, Robin Ruthazer, MPH, Manlik Kwong, BSEE, BSCS, Rebecca C. Lorenzana, BA, Daniel J. Hannon, MS, PhD, John B. Wong, MD, David M. Kent, MD, CM, MS, Norma Terrin, PhD, Alejandro D. Moreno-Koehler, BS, MPH, Timothy E. McAlindon, MD, MPH

Original Project Title: A Method for Patient-Centered Enrollment in Comparative Effectiveness Trials: Mathematical Equipoise

PCORI ID: ME-1306-02327

HSRProj ID: 20143597

To cite this document, please use: Selker HP, Daudelin DH, Ruthazer R, et al. 2019. Developing Software to Predict Patient Responses to Knee Osteoarthritis Treatments and to Identify Patients for Possible Enrollment in Randomized Controlled Trials. Washington, DC: Patient-Centered Outcomes Research Institute (PCORI). https://doi.org/10.25302/9.2019.ME.130602327

Table of Contents

ABSTRACT

BACKGROUND

PATIENT AND STAKEHOLDER PARTICIPATION

Selection of Outcomes

Modeling Database Creation

Predictive Model Development and Results

User Interface Development and Testing

METHODS

Selection of Data Sets and Description of Outcomes

Evaluation of Registry Variables

Creating the Modeling Database

Creating Predictive Models for Outcomes

Prototype Decision Support Software Development, Interface Design, and Usability Testing

RESULTS

Study Design and Database Creation

Study Sample

Model Development

Prototype Decision Support Software Development, Interface Design, and Usability Testing

DISCUSSION

Study Results in Context

Uptake of Study Results

Study Limitations

Future Research

CONCLUSIONS

REFERENCES

Acknowledgments

APPENDIX

ABSTRACT

Background: Although they represent a standard of evidence, randomized controlled trials (RCTs) often fall short because of insufficient or unrepresentative enrollment, and many needed trials are never conducted. This leaves gaps in evidence to inform patient care decisions and creates a need for a method to facilitate RCTs in usual care settings.

As osteoarthritis progresses and medical therapies become less satisfactory for patients, an average of 680,886 patients receive surgical knee replacement per year in the United States. Yet there have been no substantial comparative effectiveness RCTs of medical treatment versus surgical total knee replacement (TKR). The question about TKR for knee osteoarthritis is therefore suitable for exploring a method that would facilitate comparative effectiveness RCTs by helping to discern patient-specific equipoise between treatments.

Clinical equipoise is a prerequisite for enrollment into an RCT; likewise, mathematical equipoise is the use of mathematical models to predict and compare patient-specific outcomes of alternative treatment options that should be considered when enrolling patients into an RCT. When the predictions are similar, suggesting equipoise, then random treatment assignment may be justified, and the patient may feel more comfortable enrolling in the RCT. When the predictions suggest one treatment is better than another, trial enrollment may be inappropriate, but the predictions still can inform clinical decision making.

Objectives: This project aimed to use mathematical equipoise for making patient-specific comparisons of alternative treatment outcomes of TKR versus nonsurgical treatment of knee osteoarthritis as a way to consider enrollment into a comparative effectiveness RCT.

Methods: We first obtained the views of patient stakeholders with knee osteoarthritis to identify key pain and physical function outcomes. After creating a consolidated database from non-RCT sources of knee osteoarthritis outcomes, and adjusting for the inherent differences between the databases, we developed multivariable mathematical models that predict patient-specific pain and physical function outcomes for TKR or nonsurgical treatment. We then developed the Knee Osteoarthritis Mathematical Equipoise Tool (KOMET) user interface based on these models to discern patient-specific equipoise. We pilot tested the interface to assess usability and responsiveness to the needs of patients and physicians and its adequacy for supporting shared decision making, both for RCT enrollment and for treatment.

Results: We incorporated KOMET regression models into prototype KOMET decision support software, which we successfully pilot tested in a range of clinics. Patients found it very helpful in making treatment decisions, but only 7 of the 12 understood the concept of equipoise.

Conclusions: This project demonstrated the use of mathematical equipoise as a method for providing patient-specific decision support for shared patient–physician decision-making for selecting between alternative treatments and considering enrollment into a comparative effectiveness RCT.

Limitations and subpopulation considerations: Although largely accomplishing its intended objectives, as an early stage in the development of mathematical equipoise decision support, this project has limitations related to the available clinical data, the modeling methods and variables, and the prototype software. The next step will be to conduct a larger-scale test, and then to implement it for its intended use—the conduct of a comparative effectiveness trial in usual care settings.

BACKGROUND

Symptomatic knee osteoarthritis has an estimated prevalence of 17% to 34% in US adults1

and is the most frequent cause of dependency in lower limb tasks, especially in elderly patients.2

It has considerable economic and societal costs, including 68 million work-loss days per year, and

is the cause for more than 5% of the annual retirement rate and for hundreds of thousands of

hospital admissions.3-6 For many patients, as osteoarthritis progresses, medical and physical

therapy become less satisfactory, making this the most frequent reason for joint replacement

surgery.4

There are concerted efforts to develop drugs that retard the progression of osteoarthritis,

many through preserving cartilage. Ultimately, effective intervention will require addressing the

multistructure failure inherent to osteoarthritis, which includes periarticular bone as well as soft-

tissue structures within the joint. Meanwhile, total knee replacement (TKR) has become the

ultimate standard for treatment, now completed for an average of 680,886 patients per year in

the United States, with aggregate charges greater than $36 billion.7

Shared patient–clinician decision making is particularly germane to deciding between

medical treatments and surgical knee replacement. Not only do patient preferences have great

relevance, but the availability of treatments, their inconvenience and expense, and the

accumulation of comorbidities over time are all salient.7 Compromising these decisions are gaps in

patient-specific information about alternatives and their effects in different populations.8 At the

time we initiated this project, we found decision aids but no explicit predictive models in the

literature or published randomized controlled trials (RCTs) of medical versus surgical treatment of

knee osteoarthritis. At the time of this writing, a Danish study of 100 patients with knee

osteoarthritis who were eligible for unilateral total knee replacement was the only such known

trial to show that TKR followed by nonsurgical treatment resulted in greater pain relief and

functional improvement after 12 months versus nonsurgical treatment alone.9 However, TKR was

associated with more serious adverse events than nonsurgical treatment, and most patients who

were randomly assigned to nonsurgical treatment alone did require TKR within the study’s 12-

month follow-up.9 Thus, the question is far from settled at this point.

The description and measurement of clinical change in knee osteoarthritis is not

necessarily reliable, undermining comparisons of alternative treatments.10 Moreover, the cross-

sectional US national DECISIONS survey found that more than half of patients discussing knee or

hip surgery underestimated the harm from surgery, and only 28% correctly estimated the amount

of pain relief following surgery.11

As clinical decision support, we previously created and tested predictive instruments

based on multivariable logistic regression models that provide 0% to 100% predictions of medical

diagnoses and outcomes of treatments.12-15 They have been used successfully for short-term

decisions such as whether to hospitalize a patient and/or to treat for acute myocardial infarction.

These emergency decisions are dominated more by physician judgment than are decisions about

longer-term and more complex treatments. Decision support for more complex decisions—for

which shared patient–clinician decision making is central—has been well studied. A 2014

Cochrane systematic review of 115 RCTs found that decision aids increased patient knowledge,

improved accuracy of risk perceptions when expressed in probabilities, enhanced concordance

with patient values when including a values clarification exercise, and reduced decisional conflict

due to feeling uninformed and unclear about personal values.16 Similar decision aid benefits have

been seen for patients with osteoarthritis considering hip or knee arthroplasty.17-21

Accordingly, the objective of this project was to create the Knee Osteoarthritis

Mathematical Equipoise Tool (KOMET), intended to be embedded in electronic health records

(EHRs) as decision support for shared clinical decision making about patients’ choices of

treatment, especially between medical treatment and TKR. Additionally, this shared decision-

making is intended to identify patients for whom, based on their specific characteristics, there is

insufficient evidence to favor 1 of 2 or more treatment alternatives. This situation is referred to as

clinical equipoise, the ethical and scientific basis for enrolling patients in a randomized clinical

trial. Shared patient–clinician decision making is important in this circumstance, when patients’

personal preferences and objectives can dominate what otherwise might appear to be a toss-up

treatment decision.22 By illustrating the generation and use of patient-specific equipoise, KOMET

also is intended to support shared decision making about participation in RCTs, as an example

implementation of mathematical equipoise, for practical, ethical, targeted enrollment into

comparative effectiveness RCTs. If successful, presumably this approach could be used in many

other conditions and clinical decisions.

In developing our cardiac predictive instruments, we were fortunate to have extensive

patient-level data from RCTs. A great advantage of such data is that random assignment of

treatments helps avoid having treatment effects biased by the selection of treatments and their

use among patients. RCT data allow the multivariable regressions to accurately reflect the effect

of a treatment when used in comparable patients; however, RCTs are expensive and time-

consuming, and there are many conditions and treatments for which RCT-generated data are not

available. Moreover, for the circumstances in which we might want to run a new RCT—for which

we would potentially use mathematical equipoise for participant selection—there often will be

few or no RCTs. In this case, to create predictive models, we must use data from observational

studies, registries, EHR-based data warehouses, patient-acquired data feeds, and other sources.

Registries of various patient groups and populations are relatively inexpensive and common, and

EHRs generate increasingly more data available in databases and data warehouses. If these non-

RCT sources could be used for creating predictive models, there would be vast opportunities for

the mathematical equipoise approach to facilitate the conduct of clinical effectiveness RCTs, but

there are protean challenges and limitations to this.

Clinical equipoise—the ethical and scientific basis for randomly assigning patients different

treatments—is considered no longer present after a pivotal clinical trial shows one treatment is

better than alternatives. All patients then must be offered the most-effective known therapy.

Typically, however, this is not an individual patient-centered determination; only group-based

general inclusion and exclusion criteria are available. Mathematical equipoise is intended as a

method by which, for a given condition, only those individuals for whom there still is uncertainty

could be enrolled in a comparative effectiveness trial, while individuals for whom the question is

settled would not be enrolled.22 The objective is to generate RCT evidence based on

individualization of treatments consistent with the principle of equipoise. This ultimately could

allow treatment that accounts for the heterogeneity of treatment effects among different

individuals and groups.
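As a rough illustration of this idea, and not the KOMET implementation, the comparison of patient-specific predictions can be sketched as follows. The 10-point margin, the names `Prediction` and `in_equipoise`, and the example scores are all hypothetical; no numerical threshold is stated in this section.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Predicted 1-year outcome for one treatment (0-100 pain scale, lower is better)."""
    treatment: str
    pain_score: float


def in_equipoise(a: Prediction, b: Prediction, margin: float = 10.0) -> bool:
    """Return True when the two predicted outcomes differ by less than `margin`.

    `margin` is an assumed, illustrative threshold, not a value from the report.
    """
    return abs(a.pain_score - b.pain_score) < margin


# Hypothetical predictions for two patients.
patient_1 = (Prediction("TKR", 22.0), Prediction("nonsurgical", 27.0))
patient_2 = (Prediction("TKR", 15.0), Prediction("nonsurgical", 45.0))

for label, (tkr, non) in (("patient 1", patient_1), ("patient 2", patient_2)):
    if in_equipoise(tkr, non):
        print(f"{label}: predictions similar; RCT enrollment could be discussed")
    else:
        better = min((tkr, non), key=lambda p: p.pain_score)
        print(f"{label}: predictions favor {better.treatment}")
```

In this sketch the first patient falls within the assumed margin and could be considered for enrollment, while the second would instead receive the predictions as input to a treatment decision.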

If embedded in EHRs and computerized physician ordering systems, potentially,

determination of mathematical equipoise could serve as a practical way in routine clinical care to

detect all eligible patients for possible RCT enrollment. It also could identify those patients not

suitable for enrollment, for whom it could enhance clinical care by indicating the potentially best

treatment. Also, the basis for selection for a clinical study could be transparent to patients and

clinicians in real time to enhance truly informed consent during clinical care.

In this project we sought to create KOMET as an example of mathematical equipoise. To

represent the prevailing circumstances in which this approach would be used, we used patient-

level data from existing non-RCT sources to build predictive models of treatment outcomes; these

models determined the presence or absence of mathematical equipoise to inform decision

making. We sought to illuminate limitations of available data and to explore strategies for

overcoming such limitations to optimize modeling. Success in using non-RCT data in this way

would support the goal of widespread use of the mathematical equipoise method.

We also sought to demonstrate through this project the utility of incorporating

stakeholder input to ensure relevance of the ultimate predictive models to patient–physician

decision making. Although research into the engagement of stakeholders in research is still

evolving in its terminology and frameworks,23,24 the criterion we used for this project—intended

for comparative effectiveness research (CER)—was “individuals, organizations, or communities

that have a direct interest in the process and outcomes of a project, research, or policy endeavor.”

PATIENT AND STAKEHOLDER PARTICIPATION

We engaged stakeholders throughout the entire project to ensure the relevance of the

ultimate models and the decision support to patient–physician decision making. Patient,

researcher, and clinician stakeholders were involved in the selection of study questions, choice of

study outcomes, selection of candidate variables for the modeling database and the predictive

model, and development and testing of the user interface. To foster this, we used the Patient-

Centered Outcomes Research Institute’s (PCORI’s) 6 engagement principles: reciprocal

relationships, co-learning, partnerships, transparency, honesty, and trust, all of which allow for

effective engagement in research.25

We held quarterly in-person meetings to build reciprocal relationships among stakeholders

and the research team, to educate stakeholders about the research methods being used, and to

solicit patient, researcher, and clinician stakeholder input. Participating groups included (1)

patients with or at risk of having knee osteoarthritis, (2) patient advocates for those with arthritis,

(3) clinicians who cared for these patients, and (4) knee osteoarthritis researchers.26

We identified interested patient and advocate stakeholders through discussions with

clinicians, knee osteoarthritis researchers, and the Arthritis Foundation. The patient panel

included 3 women and 4 men representing people at risk for knee osteoarthritis due to existing

osteoarthritis in other joints, people actively considering treatment options for their existing knee

osteoarthritis, and patients who had received TKR for osteoarthritis. We recruited clinician

stakeholders from primary care, orthopedics, and rheumatology. The clinician panel included 2

rheumatologists, 2 primary care physicians, 2 orthopedic surgeons, and 1 physical therapist, some

of whom had a dual role representing researchers.

Selection of Outcomes

We chose the 2 outcome scales on which we built our models, the Western Ontario and

McMaster Universities Arthritis Index (WOMAC) and SF-12 Health Survey (SF-12) physical

component scores, after discussions with clinician and patient stakeholders. Factors considered

included the time frame of the outcome beyond surgery and the meaningfulness to someone

making a decision about surgery, taking into account constraints imposed by our available data

sources. Stakeholders were strongly supportive of using both the pain and functional outcome

scores, as both were part of patients’ decision-making processes.

Modeling Database Creation

We created a modeling database from 4 data sets, matching patients who had surgical

treatment with ones who had nonsurgical treatment. To guide our choice of the variables on

which they would be matched, we gathered input from clinicians on the research team, clinician

and patient stakeholders, and results from prior published literature. Variable choices for these

models were informed by the needs of stakeholders who would use decision support for knee

osteoarthritis, focusing on their views about the representation of pain and functional outcomes.

Predictive Model Development and Results

We provided all stakeholders with an orientation to the modeling process to foster their

ability to provide input on selection of candidate variables for model development. Interaction

terms in the statistical models allow differences in predicted benefit for different patients, so

receiving input on plausible interactions was important. The candidate primary and interaction

variables included in the model selection process were those stakeholders considered important,

plausible, and easily and reliably provided. We considered outcome variables based on

stakeholder ranking of how much the variable would be related to pain and functional outcomes 1

year in the future.
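To make the role of an interaction term concrete, the sketch below uses a linear predictor with a treatment-by-baseline-pain interaction. Every coefficient and variable name is invented for illustration; none is a KOMET estimate.

```python
def predicted_pain(baseline_pain: float, tkr: int) -> float:
    """Illustrative linear predictor of 1-year pain on a 0-100 scale.

    All coefficients are hypothetical. `tkr` is 1 for surgery and 0 for
    nonsurgical treatment.
    """
    intercept = 10.0
    b_baseline = 0.6       # main effect of baseline pain
    b_tkr = -2.0           # main effect of surgery
    b_interaction = -0.4   # surgery effect grows with baseline pain
    return (intercept
            + b_baseline * baseline_pain
            + b_tkr * tkr
            + b_interaction * baseline_pain * tkr)


def predicted_benefit(baseline_pain: float) -> float:
    """Predicted pain reduction from TKR relative to nonsurgical treatment."""
    return predicted_pain(baseline_pain, tkr=0) - predicted_pain(baseline_pain, tkr=1)


# Two hypothetical patients with different baseline pain.
for pain in (20.0, 70.0):
    print(f"baseline pain {pain}: predicted benefit {predicted_benefit(pain):.1f}")
```

Without the interaction coefficient, the predicted benefit would be the same for every patient; with it, the benefit depends on the patient's baseline value, which is why stakeholder input on plausible interactions mattered.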

We sought clinician and patient stakeholder input on the clinical significance of the results

of predictive modeling. As the project evolved, the research team and stakeholders concluded

that many of the variables under consideration were too burdensome to collect or too difficult to

ascertain. To accommodate this, we adjusted the models in ways that did not have a significant impact on

performance characteristics.

User Interface Development and Testing

Both clinician and patient stakeholders contributed extensively to the design of the user

interface of the decision support application. They reviewed its presentation of outcome

predictions and its usability. Their recommendations led to improvements in the wording and

ordering of the questions, instructions, and display of predicted outcomes.

METHODS

To develop KOMET predictive models for outcomes of TKR and of nonsurgical treatments,

we created a consolidated database with treatment outcomes of knee osteoarthritis from a

variety of clinical study and registry data. We selected model variables based on input from

patients and clinicians about the best capture of important determinants of outcomes and

measurements of the clinical outcomes as well as on variables’ contributions to models’

predictive performance. We incorporated these models into prototype decision support software

and tested them with stakeholders, clinicians, and patients.

Selection of Data Sets and Description of Outcomes

To create the modeling database, we considered a range of knee osteoarthritis databases

(briefly described below) as well as the scales used in these databases: the WOMAC (for pain) and

the SF-12 (for functional status). We selected 3 of the databases (MOST, OAI, and CORP) because

they are large, well established, and publicly available epidemiological studies of knee

osteoarthritis. The 2 additional databases are knee osteoarthritis registries (NEBH and TMC)

determined to have adequate cases and the required indexes, and that were available from

collaborating organizations.

Multicenter Osteoarthritis Study (MOST)27: MOST is an NIH-sponsored longitudinal,

prospective, observational study of knee osteoarthritis in adults with osteoarthritis or at increased

risk of developing osteoarthritis.27 The database includes a community-based sample of 3026

participants aged 50-79, with preexisting osteoarthritis or those at high risk for osteoarthritis

based on weight, knee symptoms, or a history of knee injuries or operations. Approximately 60%

are women and 15% are African Americans. The cohort was followed for 84 months, and data

were collected through clinical assessments, radiological studies, several measures and

instruments, and telephone interviews. The study focused on mechanical risk factors, causes of

knee symptoms and pain, and the long-term disease trajectory of knee osteoarthritis. Data used

in this article were obtained from the MOST database, available for public access at http://most.ucsf.edu.

Osteoarthritis Initiative (OAI)28: The OAI is an NIH-sponsored multicenter, longitudinal,

prospective observational study of osteoarthritis intended as a public domain research resource.

Its database includes clinical evaluation data, radiological (X-ray and MRI) images, and a

biospecimen repository for 4796 men and women aged 45-79 who have, or are at high risk for

developing, symptomatic knee osteoarthritis. Data used in this article were obtained from the OAI

database available for public access at http://www.oai.ucsf.edu/.

Canadian Osteoarthritis Research Program (CORP)29-31: The Women’s College Hospital

CORP data set includes 2200 participants of this prospective, population-based cohort with at

least moderately severe knee osteoarthritis, aged 55 or older. Ultimately, because of the

challenges with this data set, we did not use it for this project.

New England Baptist Hospital (NEBH) Orthopedic Surgery Registry32: The NEBH registry

includes 2462 patients who have undergone TKR there since 2011. Assessments occur prior to

surgery, at 6 weeks, and at 12 months. Data collected include demographics, vital signs, clinical

measures, medications, knee examination, the Knee Society Score (KSS) pain and physical function

score, the SF-12 health status score, surgical complications, and procedure outcomes. The mean

age of patients is 68 years, and 57% are women.

Tufts Medical Center (TMC) Orthopedic Surgery Registry33: The TMC registry includes 535

patients who have received TKR since 2007. Assessments occur prior to surgery and at 6 weeks, 12

months, and 24 months. Data collected include demographics, vital signs, clinical measures,

medications, knee examination, pain and physical function (KSS), health status (SF-12), surgical

complications, and procedure outcomes. The mean age of patients is 62, and 61% are women.

The Western Ontario and McMaster Universities Arthritis (WOMAC) Index34: The WOMAC,

developed in 1982, is widely used in the evaluation of hip and knee osteoarthritis and is available

in more than 100 languages. It is a self-administered questionnaire of 24 items, divided into 3

subscales: (1) pain (5 items) during walking, using stairs, in bed, sitting or lying, and standing

upright; (2) stiffness (2 items) after first waking and later in the day; and (3) physical function (15

items) using stairs, rising from sitting, standing, bending, walking, getting in and out of a car,

shopping, putting on and taking off socks, rising from bed, lying in bed, getting in and out of a

bath, sitting, getting on and off the toilet, heavy domestic duties, and light domestic duties. We

used the knee pain scale as the primary outcome in this project. In its raw form the WOMAC knee

pain scale ranges from 0 to 20. To make it easier to interpret and represent in the final models, we

rescaled it to 0 to 100, with 0 representing absence of pain and 100 representing extreme pain.
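The rescaling can be written out as a small sketch. It assumes a simple linear transformation (multiplying the raw score by 5), which is the natural reading of the text but is not spelled out there; the function name is hypothetical.

```python
def rescale_womac_pain(raw_score: float) -> float:
    """Rescale a raw WOMAC pain score (0-20) to the 0-100 range.

    Assumes a linear rescaling; 0 still means no pain and the maximum
    raw score maps to 100 (extreme pain).
    """
    if not 0 <= raw_score <= 20:
        raise ValueError("raw WOMAC pain score must be between 0 and 20")
    return raw_score * 100 / 20


# A hypothetical raw score of 8 out of 20.
print(rescale_womac_pain(8))
```

Under this assumption each raw point corresponds to 5 points on the rescaled measure.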

SF-12® Health Survey: The SF-12 is a multipurpose short-form generic measure of health

status.35,36 It was developed to be a much shorter, yet valid, alternative to the SF-36® for use in

large surveys of general and specific populations and for large longitudinal studies of health

outcomes. We used its physical functioning summary score as the second predicted outcome for

this project. The SF-12 scores range from 0 to 100, with higher scores indicating better function.37

Evaluation of Registry Variables

We used a consensus process involving clinician investigators and stakeholders to select

variables for model development. First, clinicians were asked to rank variables based on their

impact on (1) predicting prognosis for pain or function, with or without surgery, and/or (2)

predicting assignment to medical or surgical treatment (ie, indications or contraindications for

treatment).

They ranked each variable a priori from A to D:

A. Variables that almost certainly must be included in the model; eg, age

B. Variables that would be desirable to have because they are established risk factors for the outcome; eg,

body mass index (BMI)

C. Variables that would be desirable to have for exploratory analyses; eg, history of falls

over the past 12 months

D. Variables not likely to be needed; eg, family history of arthritis

Finally, a few variables were ranked by clinicians for importance and ease of collection

using a scale of 1 to 10, with 10 being very important or very hard to collect. We collapsed the

importance rankings into 3 categories: not at all important (1-3), fairly important (4-7), and very

important (8-10). Clinicians ranked most of the variables as easy to collect. We included in the

modeling database the final list of variables deemed fairly important or very important.
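The collapsing of the 1-to-10 importance ratings into 3 categories, and the resulting inclusion rule, can be sketched as follows. The variable names and ratings in the example are hypothetical, and the sketch uses a single rating per variable because the text does not describe how ratings from several clinicians were combined.

```python
def importance_category(rating: int) -> str:
    """Map a 1-10 importance rating to the 3 collapsed categories."""
    if not 1 <= rating <= 10:
        raise ValueError("rating must be between 1 and 10")
    if rating <= 3:
        return "not at all important"
    if rating <= 7:
        return "fairly important"
    return "very important"


# Hypothetical ratings, one per candidate variable.
ratings = {"age": 9, "body mass index": 6, "family history of arthritis": 2}

# Keep variables rated fairly important or very important.
kept = [name for name, rating in ratings.items()
        if importance_category(rating) != "not at all important"]
print(kept)
```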

Creating the Modeling Database

The database for creating KOMET models included 2 types of registries. Two databases,

MOST and OAI, had data collected on knee osteoarthritis at fixed intervals per their protocols.

During the course of follow-up, some patients had TKR and continued to be followed afterward.

The 2 other registries, NEBH and TMC, were from hospitals that collected baseline and follow-up

data only on their patients who had TKR.

For this project, our target sample was patients who had knee osteoarthritis and had

reached the clinical stage at which they would be deciding whether to have TKR. Lacking a cohort

of such patients randomized to the medical or surgical options, we used data from patients who

had TKR and matched them to patients (knees) who did not have TKR but who had similar

characteristics. Where possible, we matched non-TKR knees to TKR knees within the same

database (OAI, MOST). We matched TKR knees from the NEBH and TMC registries to non-TKR

knees from MOST and OAI based on the best match. In practice, we created a database in which

we used the knee as the unit of analysis, and we conducted matching based on characteristics of

the knee and the patient. Thereby, we created a study sample of patients who would or could be

considering this therapeutic choice.

For the MOST and OAI registries, we identified all knees that underwent TKR and then

designated the data collected at the closest previous visit as the baseline visit for that TKR. We

then extracted baseline data on these TKR knees from the patients’ registry data, including

demographics, knee characteristics, comorbidities, mental and physical function, and other

clinical features. To find non-TKR control knees, we created a sub-database of all knee visits from

all patients, excluding any that occurred after a TKR. We then used a greedy matching computer

algorithm38 to select control knees for each TKR knee (within the same database, OAI or MOST).

The variables used for matching differed among the databases, based on

data availability. As a guide to determine variables to use for matching, we used input from

research team clinicians, stakeholders, and the literature. For matching, we converted continuous

variables to categories. We loosely based the categories on Riddle et al, who presented an

algorithm to judge the appropriateness of TKR.39 Our research team considered the factors used

in that algorithm as reasonable factors to match on where possible. Categories were ordered, and

we did not allow matches beyond one category of difference. We did not always require exact

matches because we did not want to lose patients who had TKR from the model-building sample,


and we could statistically adjust for differences between the TKR and non-TKR groups in the

modeling process. Thereby, we matched each TKR knee in OAI with a similar non-TKR knee in OAI

based on values of matching variables at baseline. The same was true for MOST.
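The matching rule described above (ordered categories, with no match allowed beyond one category of difference) can be illustrated with a minimal Python sketch. This is not the cited greedy matching algorithm or the SAS code used in the project. The variable names, the distance definition (sum of category differences), the toy records, and the choice to use each control only once are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Knee = Dict[str, int]  # variable name -> ordered category code (0, 1, 2, ...)


def category_distance(tkr: Knee, control: Knee, variables: List[str]) -> Optional[int]:
    """Return the summed category difference, or None if any variable
    differs by more than one category (match not allowed)."""
    total = 0
    for var in variables:
        diff = abs(tkr[var] - control[var])
        if diff > 1:
            return None
        total += diff
    return total


def greedy_match(tkr_knees: List[Knee], controls: List[Knee],
                 variables: List[str]) -> List[Tuple[int, Optional[int]]]:
    """Pair each TKR knee, in order, with the closest unused control.
    Returns (tkr_index, control_index or None) pairs."""
    available = set(range(len(controls)))
    pairs: List[Tuple[int, Optional[int]]] = []
    for i, tkr in enumerate(tkr_knees):
        best_j: Optional[int] = None
        best_d: Optional[int] = None
        for j in sorted(available):
            d = category_distance(tkr, controls[j], variables)
            if d is not None and (best_d is None or d < best_d):
                best_j, best_d = j, d
        if best_j is not None:
            available.remove(best_j)  # each control is used once in this sketch
        pairs.append((i, best_j))
    return pairs


# Toy data (hypothetical): category codes for age group and baseline pain group.
tkr_knees = [{"age": 1, "pain": 2}, {"age": 2, "pain": 1}]
controls = [{"age": 2, "pain": 1}, {"age": 1, "pain": 2}, {"age": 0, "pain": 0}]
pairs = greedy_match(tkr_knees, controls, ["age", "pain"])
print(pairs)
```

A greedy pass of this kind takes the best available control for each TKR knee in turn and never revisits earlier choices, so the result can depend on the order in which TKR knees are processed.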

Because the TMC and NEBH samples included only TKR subjects, we drew their matched

non-TKR controls from a pooled data set of knee visits from the OAI and MOST registries.

We established exclusion criteria based on discussions with the research team members

and applied them before we performed modeling. We excluded any knee that did not have

follow-up information (9 months to 5 years after the baseline visit or TKR) on the same knee in the

same state (TKR versus non-TKR). If a knee visit was a candidate control but had TKR at some

point between that visit and a follow-up at least 9 months later, we excluded it from the pool of

non-TKR knees used for matching. If a knee had TKR but did not have pre-TKR baseline data within

12 months of the TKR, we excluded that TKR. If a knee had TKR, we excluded the contralateral

knee from the pool of non-TKR knees used as controls. If a patient had TKR on 2 knees, more than

90 days apart, we excluded both knees; with an interval of >90 days, we were concerned that the

1-year evaluation of pain and function for the first knee could still be during the recovery period

of the surgery for the second knee, which would confound the assessment of the outcome. If

bilateral surgery was completed on 2 knees within 90 days of each other, we used the first knee or

randomly chose one if both knees were completed on the same day. There was 1 exception in the

MOST data, in which 14 patients were counted twice, with each bilateral knee included and followed

separately. In the full database, 104 other patients were counted 2 times (92 patients) or 3 times

(12 patients). Overall, 1322 patients contributed data for 1452 matched knees for these analyses.

We did allow single patients to contribute both a control and TKR knee when surgeries were far

enough apart in time to allow full follow-up on each independently. We also allowed OAI and

MOST control knees to be reused for the matching process for TKR knees from the NEBH and TMC

registries. See Appendix A for details and limitations of this approach.
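Two of the exclusion rules above lend themselves to a short sketch: the handling of patients with TKR on both knees, and the requirement for pre-TKR baseline data within 12 months of surgery. The function names, the "a"/"b" knee labels, and the use of 365 days for "12 months" are assumptions for illustration only; the project's actual implementation is not described at this level of detail.

```python
import random
from datetime import date
from typing import List, Optional


def bilateral_knees_to_keep(knee_a_tkr: date, knee_b_tkr: date,
                            rng: Optional[random.Random] = None) -> List[str]:
    """Apply the bilateral-TKR rule to one patient with TKR on both knees.

    More than 90 days apart: exclude both knees.
    Within 90 days: keep the knee operated on first.
    Same day: keep one knee chosen at random.
    """
    gap_days = abs((knee_b_tkr - knee_a_tkr).days)
    if gap_days > 90:
        return []
    if gap_days == 0:
        rng = rng or random.Random()
        return [rng.choice(["a", "b"])]
    return ["a"] if knee_a_tkr < knee_b_tkr else ["b"]


def has_recent_baseline(baseline_visit: date, tkr_date: date,
                        max_days: int = 365) -> bool:
    """True if the baseline visit precedes the TKR by at most max_days
    (365 days is an assumed stand-in for '12 months')."""
    gap = (tkr_date - baseline_visit).days
    return 0 <= gap <= max_days


print(bilateral_knees_to_keep(date(2010, 1, 1), date(2010, 6, 1)))   # far apart
print(bilateral_knees_to_keep(date(2010, 1, 1), date(2010, 2, 1)))   # within 90 days
print(has_recent_baseline(date(2009, 6, 1), date(2010, 1, 1)))
```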

On the matched data set, we compared baseline characteristics between knees with and

without TKR, using chi-square tests and t tests. To account for missing data, we used multiple

imputation, creating 10 imputed data sets for each study source. We also compared baseline

characteristics on imputed data sets as we used these for model development. We adjusted P values

from the analysis of the multiple imputation data set to account for imputation variability.40

We used SAS software for these analyses, using the multiple imputation (MI) procedure to impute

the data and MIANALYZE to process the results of analyses on the imputed data. See Appendix

B.41
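As a rough illustration of what pooling results across imputed data sets involves, the sketch below applies the standard combining rules for multiple imputation (Rubin's rules) to one coefficient. It is written in Python rather than SAS, it is not the project's code, and the estimates and standard errors are made-up numbers (three imputations are shown for brevity; the project created 10).

```python
import math
from statistics import mean, variance
from typing import List, Tuple


def pool_rubin(estimates: List[float], std_errors: List[float]) -> Tuple[float, float]:
    """Combine one parameter's estimates from m imputed data sets.

    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])   # average within-imputation variance
    between = variance(estimates)                   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical coefficient estimates from 3 imputed data sets.
pooled_estimate, pooled_se = pool_rubin([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(pooled_estimate, pooled_se)
```

The between-imputation term is what widens the standard error, and hence adjusts the P value, to reflect uncertainty about the missing values.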

Creating Predictive Models for Outcomes

We conducted analyses using SAS for Windows (Version 9.4 TS Level 1M2. Cary, NC: SAS

Institute, 2002-2012) and SAS Enterprise Guide (Version 7.13 HF3. Cary, NC: SAS Institute, 2016).

We developed a multivariable linear regression model to predict the 1-year knee pain

outcome based on the WOMAC score or, when a database lacked WOMAC items, using an

estimated WOMAC score, as described in Appendix C. Our approach was to develop the model

using a set of matched TKR to non-TKR knees from the OAI database and then to validate/test it

on a set of matched TKR to non-TKR knees from the MOST database. We then pooled the OAI and

MOST data sets and built a new model, starting with variables used in the model developed in the

OAI data and tested on the MOST data. We also rederived models on a database that included all

4 data sets (OAI, MOST, NEBH, and TMC). We used a similar variable selection process but with a

more limited set of candidate predictor variables because NEBH and TMC did not capture as many

variables as the OAI and MOST registries. We repeated this entire process for the functional

outcome (SF-12 physical component at 1 year). To create models that could provide predicted

estimates of 1-year knee pain and 1-year function, with and without TKR, for any patient based on

their characteristics, all models included an indicator variable for treatment type. We explored

covariates and interactions of treatment type with covariates in the different phases of the

modeling process. We did not adjust for matching in the linear regression during modeling

because the purpose of matching was to create a reasonably balanced study sample, and

covariates in the models could account for remaining imbalances between groups.42 We describe

further details of our approach in Appendix D.
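The model structure described above, a linear regression with a treatment indicator and treatment-by-covariate interactions, can be written generically as follows. The notation is introduced here for illustration and does not come from the report: T is the treatment indicator (1 = TKR, 0 = non-TKR), the x_j are baseline covariates, and I is the set of covariates that interact with treatment.

```latex
\hat{y}(T, x) = \beta_0 + \beta_T\,T + \sum_{j} \beta_j x_j + \sum_{k \in I} \gamma_k\,T\,x_k ,
\qquad
\hat{y}(1, x) - \hat{y}(0, x) = \beta_T + \sum_{k \in I} \gamma_k x_k .
```

Under this form, the predicted difference between treatments for a given patient depends only on the treatment main effect and the interaction terms, which is why the interactions determine how the estimated benefit varies from patient to patient.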


Prototype Decision Support Software Development, Interface Design, and Usability Testing

The goal of software development and usability testing was to translate the results of the

predictive models into easily understood, patient-specific reports with predictions of 1-year

outcomes that could be produced in real time in the course of clinical care, for shared treatment

decision making and, if appropriate, enrollment into an RCT.

Decision Support Software Development: There were 2 KOMET development tasks, for the

analytics and for the user interface. Analytics development included implementing the predictive

models as reusable, multiplatform software components to generate both the current and 1-year

predicted pain and function outcomes for nonsurgical and surgical treatments. In addition, the

analytics software calculated the respective 95% confidence intervals around each prediction as

the basis for considering the degree of overlap that would suggest near equivalence, or equipoise.

User interface development included creating a web browser–based questionnaire interface to

collect patient demographics, items for computing the WOMAC pain score, the SF-12 physical

functioning scale, and comorbidities. Together, the user interface and analytics component

included methods for data retention and presentation of the predicted outcome results. We then

incorporated the predictive models into the web-based decision support application for iterative

user testing.

Interface Design: The user interface design process involved iterative prototyping of

methods to collect data for the predictive models, displaying the predictions through data tables,

bar charts, data plots, dynamic text descriptions, and printed reports, and determining and

alerting users about mathematical equipoise. We began with image mockups and storyboards,

then used online prototyping tools (www.axshare.com) to establish page layout, content

placement, and workflow. Once we identified key user interface elements, we finalized general

layout and content placement and conducted subsequent user interface design iterations on a live

website. We implemented the analytics components and user interface on a stand-alone web-based

application server using an Apache Tomcat 8 web server (Wakefield, MA: Apache

Software Foundation, 1999-2019).50

Usability Testing: We tested the prototype decision support application and iteratively


redesigned it to address patient and clinician user needs. We conducted initial testing with 12

research institute staff members as well as members of our patient and clinician stakeholder

panels. We tested the final design with 10 patients and 6 physicians in 3 clinical settings during

typical clinic and research-specific visits. Testing included (1) entering demographic data and

completing questionnaires to provide the information needed for the predictive models, (2)

interpreting predictive model results through data displays, and (3) determining user

understanding of the predictive models and mathematical equipoise and clinical trial

randomization through case-based discussions. Usability testing included a “think-aloud” protocol

and a usability testing script, as described in Appendix E.

A research assistant and the project director conducted testing. All sessions were recorded

and transcribed. Testing with research institute staff and stakeholders was conducted virtually or

in a conference room, and testing with patients and clinicians was conducted in the clinic setting.

The IRB determined the project was exempt from IRB review.


RESULTS

Study Design and Database Creation

The final database included 1452 knees (726 with TKR and 726 without) of 1322 patients.

Of patients, 91% (1204) had a single knee included in the database, 8% (106) had 2 knees used or

a single knee used 2 times, and 1% (12) had knees used 3 times. We matched TKR knees from OAI

to control knees from OAI, and we matched TKR knees from MOST to controls from MOST.

Because NEBH and TMC included only TKR knees, we drew the controls for those databases from

non-TKR knees from OAI and MOST. In the final matched database, the relative contributions of

TKR knees were OAI, 252; MOST, 154; NEBH, 248; and TMC, 72. For the control knees,

contributions were OAI, 472, and MOST, 254. Figure 1 and Appendix F: Figures 1a-1d provide

breakdowns of how we selected the final analysis sample from each database in CONSORT-type

figures.


Figure 1. Description of Final Analysis Sample Selection

[Flow diagram. Source databases and starting samples: OAI (May 2014), 4796 patients; MOST (January 2015), 3026 patients; NEBH (December 2014), 5519 subjects; TMC (July 2015), 117 subjects. Exclusions at the first step: did not have osteoarthritis, no follow-up, bilateral TKR >90 days apart, prior to start. Exclusions from the TKR samples: TKR on contralateral knee, no pre-TKR visit, no post-TKR visit. Control knee visits for OAI and MOST came from the same database; controls for NEBH and TMC came from the OAI (4049 patients, 8095 knees) and MOST (2652 patients, 5071 knees) control samples. Matching variables included age (<55, 55-65, >65), gender, WOMAC knee pain category, SF-12 (<44, 44-56, >56), and Charlson score (0, 1, ≥2, missing), with additional database-specific variables. Final matched samples (TKR knees/control knees): OAI, 252/252; MOST, 154/154; NEBH, 248/248; TMC, 72/72.]


Study Sample

We compared distributions of variables used for the matching process between TKR and

non-TKR knees for each data source; these results are presented in Appendix F: Table 1a. They

confirmed that the matching algorithm had worked. In each database, characteristics used for

matching were well balanced between the TKR and non-TKR knees. Baseline characteristics

considered for the modeling process, and of interest to clinicians and stakeholders, were

comparable between TKR and non-TKR knees, as presented in Appendix F: Table 1b. This also was

true of the variables used in the final multivariable models using the imputed data, as shown in

Appendix F: Table 1c.

Baseline characteristics and outcomes at follow-up of the matched study sample are

summarized in Table 1. Approximately 40% were men, the mean age was 65, and the mean BMI

was 31. On the 0 to 100 pain scale (100 indicating extreme pain), the mean baseline knee pain

was significantly higher in the TKR group than in the non-TKR group (mean = 45.6 versus 40.5;

P < .01), despite efforts to match on this variable (categorized). Comparisons of mean baseline

SF-12 scores between TKR and non-TKR groups showed better physical and mental function in the

non-TKR groups than in the TKR groups, with the difference being significant for physical function

(mean = 37.2 versus 38.6; P = .008). Overall, at follow-up there was less knee pain and better

physical function in the TKR groups than in the non-TKR groups. Irrespective of significance, we

used all variables listed in Table 1 in building the multivariable models of long-term

(approximately 1-year) outcomes.


Table 1. Description of Pooled Study Sample Used for Model Derivation for n = 1452 Matched Knees (Imputed Data)

Values are mean ± standard deviation (SD). Delta (∆) is TKR minus non-TKR, shown with its [95% CI]. Effect size is ∆/SD.

Variable                                     TKR (N = 726)    Non-TKR (N = 726)   ∆ [95% CI]                     Effect Size (a)

Baseline Characteristics
Age                                          65.29 ± 8.57     64.77 ± 8.57        0.52 [–0.36 to 1.40]           0.03
Male, N (%)                                  0.43 ± 0.49      0.42 ± 0.49         0.00 [–0.05 to 0.06]           0
Baseline BMI                                 31.31 ± 6.49     30.97 ± 6.36        0.34 [–0.32 to 1.00]           0.03
Baseline SF-12 physical                      37.16 ± 9.46     38.59 ± 10.94       –1.44 [–2.49 to –0.38]         –0.07
Baseline SF-12 mental                        52.56 ± 11.48    53.62 ± 11.82       –1.07 [–2.27 to 0.13]          –0.05
Baseline WOMAC knee pain (0-100)             45.59 ± 21.87    40.48 ± 21.76       5.11 [2.87 to 7.36]            0.12
Baseline knee pain, contralateral (0-100)    18.92 ± 21.06    19.66 ± 22.05       –0.74 [–2.96 to 1.48]          –0.02
Baseline hip pain or pain/ache/stiffness     0.34 ± 0.50      0.62 ± 0.51         –0.27 [–0.33 to –0.22]         –0.27
At least one comorbidity, N (%)              0.32 ± 0.52      0.31 ± 0.56         0.01 [–0.05 to 0.07]           0.01
Narcotics, N (%)                             0.14 ± 0.36      0.13 ± 0.36         0.00 [–0.03 to 0.04]           0.01

Follow-up Results
Follow-up SF-12 physical                     44.48 ± 11.88    39.81 ± 10.80       4.67 [3.51 to 5.84]            0.21
Follow-up WOMAC knee pain (0-100)            13.92 ± 19.44    29.22 ± 19.45       –15.30 [–17.30 to –13.30]      –0.39

(a) Shaded rows indicate variables in which definitions varied between databases such that these variables ultimately were excluded as candidates in the building of final models.

Model Development

We used linear regression to model the 2 outcomes, the WOMAC knee pain scale

(rescaled 0 to 100; see Appendix C: WOMAC Knee Pain, Part II) and the SF-12 physical functioning

component score.

Based on the methods described above, we chose these outcomes (including timing),

prior to building models, following repeated discussions with clinician and patient stakeholders

and the research team. We chose 1 year as the target follow-up time to have a time point beyond


the recovery time from surgery, estimated as up to 9 months. Stakeholders felt benefits of surgery

were stable beyond that time point. To address inconsistencies and gaps, we allowed the use of data

from up to 5 years past baseline when no follow-up closer to 1 year was available for a knee.

Stakeholders were strongly supportive of using both the pain and functional outcomes in

patients’ decision-making processes, although the outcomes were not of equal importance to all

patient stakeholders. As the project progressed, the team continued to receive more input from

patient and clinician stakeholders, which influenced modeling, an example of which is described

in Appendix G.

Models Built on OAI Database and Tested on MOST Database (Appendix H: Tables 2a-2b):

We tested the models built on the OAI database on the MOST database to check that the

statistical modeling had been effective, as reflected on an independent data set. The first model

built was for WOMAC knee pain at 1 year and used the matched OAI database that included 252

knees that underwent TKR and 252 knees that did not, using all knees for which there were

WOMAC knee pain data available for the 1-year endpoint. The final model, built on the imputed

data sets, included main effects for younger age (defined as less than 60 years old); a measure of

body pain based on a homunculus (a body diagram on which patients indicated locations of pain,

from which we calculated the percentage of sites with symptoms); hip pain (yes versus no);

baseline WOMAC knee pain; and treatment (TKR or not). The model also included interactions of TKR with both baseline knee

pain and hip pain. The model r-square was 0.36 for WOMAC knee pain. We applied the

coefficients from the OAI model to the imputed MOST data set and compared the resulting fitted

values for 1-year knee pain with the observed 1-year knee pain values. There was a positive

association between observed and fitted values (r-square = 0.32). We conducted a similar analysis

for the 1-year physical functioning outcome. The model for the 1-year functional outcome built on

the OAI data included main effects for gender, age, baseline SF-12 mental and physical

components, homunculus, hip pain, depression score, and baseline knee pain in the contralateral

knee. There was also a main effect for treatment and no significant interactions of treatment with

any other variables. The model indicated that, on average, the 1-year physical function score (SF-

12 physical component score) was 3.4 points higher for patients who had TKR than those who had


not. This OAI model had an r-square of 0.42. When we applied this model to the MOST data set,

the fitted values for the physical function outcome were positively associated with the observed

results (some of which were imputed), although the r-square on the MOST data dropped to 0.18.

Although this decline in performance was disappointing, the research team decided to

combine the 2 databases and refit the model on the pooled data, expecting that

the larger sample size would allow a better model to be constructed.
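The validation step described above, applying coefficients fitted on one data set to a second data set and comparing fitted with observed outcomes, can be sketched as follows. The coefficient, the records, and the function names are hypothetical, and r-square is computed here as the squared Pearson correlation between observed and fitted values, which is one common definition and an assumption about how the reported values were obtained.

```python
from typing import Dict, List


def fitted_values(intercept: float, coefs: Dict[str, float],
                  rows: List[Dict[str, float]]) -> List[float]:
    """Apply fixed (already estimated) coefficients to new records."""
    return [intercept + sum(b * row[name] for name, b in coefs.items())
            for row in rows]


def r_squared(observed: List[float], fitted: List[float]) -> float:
    """Squared Pearson correlation between observed and fitted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_f = sum(fitted) / n
    cov = sum((o - mean_o) * (f - mean_f) for o, f in zip(observed, fitted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_f = sum((f - mean_f) ** 2 for f in fitted)
    return cov * cov / (var_o * var_f)


# Hypothetical "development" coefficients applied to a hypothetical test set.
test_rows = [{"baseline_pain": 20.0}, {"baseline_pain": 40.0}, {"baseline_pain": 60.0}]
observed_pain = [18.0, 33.0, 39.0]
fitted_pain = fitted_values(10.0, {"baseline_pain": 0.5}, test_rows)
validation_r2 = r_squared(observed_pain, fitted_pain)
print(fitted_pain, round(validation_r2, 3))
```

Because the coefficients are held fixed rather than re-estimated, a drop in r-square on the second data set reflects how well the original model transports to new patients.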

Models for 1-year Knee Pain Built on Pooled Databases (Appendix H: Tables 2a-2b): We

built multivariable models on versions of the databases that included imputed values for 1-year

pain outcome. We constructed the 1-year knee pain models on the combined OAI and MOST data

sets (P1 model) and on the combined OAI, MOST, NEBH, and TMC (P2 model) databases. The 2

models included terms for a treatment indicator variable and for baseline knee pain and an

interaction of these 2 and had similar r-square values (0.32), suggesting equivalent performance.

In both models, the expected knee pain at 1 year was less for patients who had TKR than for those

who did not have TKR, with the difference being greater in those who had higher knee pain levels

at the start.

The P1 model also indicated worse knee pain at 1 year with younger age, more knee pain

at baseline in the contralateral knee, more total body pain (on the homunculus), and higher BMI.

There was also an interaction with baseline hip pain for which the benefit of TKR versus non-TKR

in knee pain reduction was greater in patients who had baseline hip pain versus those who did

not.

Some of the variables available in the OAI and MOST data sets were not available in the

other databases (eg, pain indicated on a homunculus, pain in contralateral knee), and some

variables, such as hip pain, had not been collected for the surgery databases (NEBH, TMC) in the

same way as for the OAI and MOST databases. Accordingly, we did not use these variables in

modeling in the larger database. The final P2 model included age as a continuous variable, with

more expected knee pain at younger ages, as was seen in the P1 model. The model also included

baseline SF-12 scores, with higher baseline physical and mental component scores associated with

less expected knee pain at 1 year.

Models for 1-year Physical Function Built on Pooled Databases (Appendix H: Tables 2a-2b):


The model-building process for the 1-year physical functioning models (F1, F2) was similar to the 1-

year pain models. Again, we built the F1 model on data from OAI and MOST that included many

possible predictor variables. We built the F2 model on a larger database that included the same OAI

and MOST data as well as data from the NEBH and TMC cohorts. This larger data set, however,

included fewer predictor variables common to all 4 data sets. The final physical function models are

presented in Appendix H: Table 2b. Both models had similar r-square values (0.34, 0.35). Both

indicated better 1-year physical function for males, younger patients, higher initial physical and

mental component scores, and lower BMI. The F1 model also included a main effect for baseline

knee pain in the contralateral leg, with more baseline pain being associated with a worse 1-year

physical function outcome. The F2 model also included interaction terms of TKR treatment with

both age and the SF-12 mental score. Results from the model indicate that the estimated benefit in

function at 1 year for patients treated with TKR versus standard of care is greater for younger

patients and for patients with lower baseline mental health scores. The F1 and F2 models are

presented in Appendix H: Tables 2a-2b.

Summary of Multivariable Models (Table 2, Figure 2, and Appendix H: Table 2c): Appendix

H: Table 2c shows a summary of variables included in all 4 final models (P1, P2, F1, F2) and the

distribution of each variable in the pooled databases. In the earlier phases of this project, we hoped

our P1 and F1 models would have better performance because we had a larger pool of variables

(although fewer patients) to use for the modeling process. As the project evolved, the research

team realized that many of the variables under consideration were burdensome to collect and/or

difficult to capture consistently. In the end we decided to use only models P2 and F2, which we

built on the data sets that had more patients (OAI, MOST, NEBH, TMC) but fewer independent

variables, for the development of the software. The coefficients for these models are presented in

Table 2. Although neither model was validated in an independent database, we believe the models

have sufficient performance, based on variables consistent with clinical understanding and

importance such that they are reasonable for use in this demonstration project. Based on the

results of testing our OAI model on the MOST data, we are optimistic the models can be useful in

patients similar to those used to develop the models. These patients, who are presumably at the

point of deciding whether to have TKR, have characteristics similar to those shown in Table 1.


Table 2. Final Models for 1-year Knee Pain (P2) and SF-12 Physical Function (F2)

P2, Knee Pain Model (Higher Scores Mean More Knee Pain): adjusted r-square = 0.32
F2, Physical Function (SF-12) Model (Higher Scores Mean Better Function): adjusted r-square = 0.34

Term in Model, Status at Baseline              Range in Data Set           P2 Beta Coeff   P2 P Value (a)   F2 Beta Coeff   F2 P Value
                                               (5th-95th Percentile)       (stderr)                         (stderr)
Model intercept (constant)                     n/a                         31.44 (5.52)    < .0001          17.40 (4.27)    < .0001
Treatment (1 = TKR, 0 = control)               n/a                         –3.33 (2.16)    .1246            25.41 (4.33)    < .0001
WOMAC knee pain (base), 100-point scale        10-80                       0.49 (0.03)     < .0001          n/a             n/a
Interaction: treatment × WOMAC knee pain       n/a                         –0.33 (0.05)    < .0001          n/a             n/a
Age (in years)                                 51-79                       –0.12 (0.05)    .0225            –0.05 (0.04)    .2397
SF-12 mental component (base)                  34-66                       –0.11 (0.05)    .033             0.19 (0.04)     < .0001
SF-12 physical component (base)                23-53                       –0.21 (0.07)    .0017            0.55 (0.03)     < .0001
Gender (1 = male, 0 = female)                  42% male                    n/a             n/a              0.99 (0.57)     .0873
Body mass index, kg/m2                         23-41                       n/a             n/a              –0.19 (0.05)    .0008
Charlson comorbidity score ≥1 (versus 0)       31% with at least 1         n/a             n/a              –2.05 (0.60)    .0009
Interaction: treatment × age                   n/a                         n/a             n/a              –0.15 (0.06)    .0084
Interaction: treatment × SF-12 mental score    n/a                         n/a             n/a              –0.18 (0.06)    .0013

(a) Beta coefficients, standard errors [stderr], and P values are from combined linear regression models built on an imputed data set.


We used these models to estimate 1-year knee pain and physical functioning for the

treatment each subject actually underwent (TKR or non-TKR) and also for their counterfactual

situation, as if they received the alternative treatment. In other words, we calculated 2 predicted

values for each subject in our database (that we used to make our models). One prediction

assumed subjects received TKR and the other prediction assumed they did not. These data

allowed us to predict the difference in pain and function outcomes for each patient under 2

courses of treatment (TKR versus non-TKR). The distribution of predicted differences in pain and

function with and without TKR is shown in Figure 2. The figure shows that there was a range of

predicted improvement with TKR, and those patients predicted to have benefit in knee pain may

not have been the same as those for whom benefit in physical functioning is predicted. In this

project’s database, 9% of subjects had a predicted gain in function of TKR versus non-TKR of at

least 8 SF-12 physical function points and a predicted reduction in knee pain of at least 20 points

(on the WOMAC scale of 0-100). At the other end of the spectrum, 6% had predicted gains in physical

function of fewer than 4 points and reduction of knee pain of fewer than 10 points. Only 2% had

larger gains in physical function and smaller improvements in pain. Figure 2 also shows sample

subjects from each of the 9 combinations of estimated knee pain and physical function change.

Examples of subjects with the most, mid-, and least-estimated reduction of pain as well as their

gain in function with 95% prediction intervals for the estimates are shown in Table 3. Subjects

with higher baseline knee pain had the largest predicted reductions in knee pain with TKR versus

non-TKR. Younger patients with lower SF-12 scores had the largest predicted benefits in physical

function with TKR versus not having TKR. These differences in estimated benefits between

subjects are because of the interaction terms included in the multivariable models.
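The paired predictions described above can be sketched in Python using the published Table 2 coefficients. This is an illustration, not the KOMET analytics library: the class and function names are hypothetical, and the assignment of the gender, BMI, comorbidity, treatment-by-age, and treatment-by-mental-score terms to the F2 model is our reading of Table 2. Because the published coefficients are rounded, the results agree with Table 3 only approximately.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Patient:
    male: bool
    age: float
    bmi: float
    any_comorbidity: bool   # Charlson score >= 1
    sf12_mental: float
    sf12_physical: float
    womac_pain: float       # baseline WOMAC knee pain, 0-100 scale


def predict_pain(p: Patient, tkr: bool) -> float:
    """P2 model: predicted 1-year WOMAC knee pain (higher = more pain)."""
    t = 1.0 if tkr else 0.0
    return (31.44 - 3.33 * t + 0.49 * p.womac_pain - 0.33 * t * p.womac_pain
            - 0.12 * p.age - 0.11 * p.sf12_mental - 0.21 * p.sf12_physical)


def predict_function(p: Patient, tkr: bool) -> float:
    """F2 model: predicted 1-year SF-12 physical score (higher = better)."""
    t = 1.0 if tkr else 0.0
    return (17.40 + 25.41 * t - 0.05 * p.age + 0.19 * p.sf12_mental
            + 0.55 * p.sf12_physical + 0.99 * (1.0 if p.male else 0.0)
            - 0.19 * p.bmi - 2.05 * (1.0 if p.any_comorbidity else 0.0)
            - 0.15 * t * p.age - 0.18 * t * p.sf12_mental)


def band(value: float, mid_cutoff: float, most_cutoff: float) -> str:
    if value >= most_cutoff:
        return "most"
    return "mid" if value >= mid_cutoff else "least"


def classify(p: Patient) -> Tuple[str, str]:
    """(pain-reduction band, function-gain band) for TKR versus non-TKR."""
    pain_reduction = predict_pain(p, False) - predict_pain(p, True)
    function_gain = predict_function(p, True) - predict_function(p, False)
    return band(pain_reduction, 10.0, 20.0), band(function_gain, 4.0, 8.0)


# Baseline values from the first case listed in Table 3.
case = Patient(male=False, age=58, bmi=33.3, any_comorbidity=False,
               sf12_mental=38, sf12_physical=30, womac_pain=65)
print(round(predict_pain(case, False)), round(predict_pain(case, True)))
print(round(predict_function(case, False)), round(predict_function(case, True)))
print(classify(case))
```

For this case the sketch gives predicted knee pain of about 46 without TKR and 21 with TKR, and predicted function of about 32 and 42, in line with the point estimates reported for that case. Prediction intervals are not computed here because they require the coefficient covariance matrix and residual variance, which are not given in Table 2.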

We encountered statistical questions about using the proposed linear model for the 1-year

outcomes, specifically knee pain: the scores are not normally distributed, and even after

adjustment for covariates the model residuals (the differences between predicted and observed

values) remained skewed. We explored alternative nonlinear models, which yielded little gain in

model performance, and ultimately used the linear form of the model.

See Appendix I.


Figure 2. Mosaic Plot Showing Distribution of Predicted Differences (TKR Versus Non-TKR) for 1-year Knee Pain and SF-12 Physical Function in Pooled Data (n = 1452 Subjects)


Table 3. Estimated Outcomes for a Sample of Cases

Cases are grouped by predicted change with TKR compared with non-TKR: pain reduction (most, ≥20 points; mid, ≥10 to <20 points; least, <10 points) and gain in function (most, ≥8 points; mid, 4 to <8 points; least, <4 points). Estimates are shown with 95% prediction intervals in parentheses.

Baseline Characteristics

Pain reduction / function gain   Gender   Age   BMI    Any comorbidities   SF-12 mental   WOMAC knee pain   SF-12 physical
Most / most                      F        58    33.3   N                   38             65                30
Most / mid                       F        63    35.0   N                   60             70                41
Most / least                     M        77    24.3   Y                   60             55                36
Mid / most                       M        63    28.3   N                   34             45                36
Mid / mid                        M        66    31.3   Y                   62             25                50
Mid / least                      M        71    35.8   Y                   59             30                44
Least / most                     M        51    23.4   N                   39             20                49
Least / mid                      F        62    26.3   N                   61             20                44
Least / least                    F        74    27.4   Y                   56             5                 51

Knee Pain (1 Year): Estimate and 95% Prediction Interval

Pain reduction / function gain   Non-TKR           TKR               Estimated reduction in knee pain (TKR minus non-TKR)
Most / most                      46 (12 to 80)     21 (80 to –13)    –24.5 (–72.4 to 23.5)
Most / mid                       43 (9 to 77)      17 (77 to –17)    –26.1 (–74.1 to 21.9)
Most / least                     35 (1 to 69)      14 (69 to –20)    –21.2 (–69.2 to 26.7)
Mid / most                       34 (0 to 68)      16 (68 to –17)    –18.0 (–65.9 to 30.0)
Mid / mid                        18 (–16 to 52)    7 (52 to –27)     –11.5 (–59.4 to 36.5)
Mid / least                      22 (–12 to 55)    8 (55 to –25)     –13.1 (–61.0 to 34.8)
Least / most                     20 (–14 to 54)    11 (54 to –23)    –9.8 (–57.9 to 38.2)
Least / mid                      18 (–16 to 52)    8 (52 to –26)     –9.8 (–57.8 to 38.1)
Least / least                    8 (–26 to 42)     3 (42 to –31)     –5.0 (–53.0 to 43.1)

SF-12 Function (1 Year): Estimate and 95% Prediction Interval

Pain reduction / function gain   Non-TKR           TKR               Estimated improvement in function (TKR minus non-TKR)
Most / most                      32 (15 to 49)     42 (49 to 25)     9.6 (–14.6 to 33.8)
Most / mid                       42 (25 to 59)     47 (59 to 30)     4.7 (–19.5 to 28.8)
Most / least                     39 (22 to 56)     42 (56 to 25)     2.7 (–21.5 to 26.9)
Mid / most                       36 (19 to 53)     46 (53 to 29)     9.6 (–14.6 to 33.8)
Mid / mid                        46 (29 to 63)     50 (63 to 33)     4.0 (–20.2 to 28.2)
Mid / least                      42 (25 to 59)     46 (59 to 29)     3.7 (–20.4 to 27.9)
Least / most                     46 (29 to 63)     56 (63 to 39)     10.5 (–13.7 to 34.8)
Least / mid                      45 (28 to 62)     50 (62 to 33)     4.7 (–19.5 to 28.8)
Least / least                    46 (28 to 63)     49 (63 to 32)     3.7 (–20.4 to 27.9)


Prototype Decision Support Software Development, Interface Design, and Usability Testing

The KOMET development process resulted in the creation of 1 web-based application for

clinicians (http://medicalequipoise.com/tkrclinician) and 1 for patients

(http://medicalequipoise.com/tkrpatient). Both applications are composed of an analytics

software library that also could be embedded into an EHR system.

The applications underwent user testing to assess the ease of data collection through the

web-based questionnaire and users’ ability to understand the outcome predictions when

presented in data tables, graphs, and as dynamic text. We also tested depictions of prediction

uncertainty and definitions of mathematical equipoise.

All users were able to easily enter demographic data and complete the questionnaire with

only minor questions or comments. We initially presented users with a table and bar graphs

describing current and predicted pain and function outcomes (Appendix J: Figure 1). After initial

testing, we refined the report to provide a dynamic text description (Appendix J: Figure 2). This

change improved users’ ability to identify their current pain and function scores and the predicted

1-year outcome scores with surgical and nonsurgical treatments.

The combined pain and function plot proved to be less intuitive. Many users immediately

understood that the single data point represented both the pain and function outcome

predictions, but others struggled to describe the data represented by the graph (Appendix J:

Figure 3).

User testing led to improvements in the way predicted outcome uncertainty was

communicated. The degree of uncertainty around the predicted pain and function outcomes,

initially represented by whiskers on the bar chart (Appendix J: Figure 1), was not understood by

users. We changed the chart by using shading within the bar that faded at the edges and added a

dynamic text explanation describing the range of possible values (Appendix J: Figure 2). This

improved user understanding. Analogously, for the combined pain and function plot, we changed

the uncertainty around the prediction from a dotted circle around a data point (Appendix J:

Figure 3) to a shaded circle (Appendix J: Figure 4). A limitation of our depiction of the results, not

of the interface per se, is that our methodology made separate statistical models for 1-year knee


pain and physical function; in reality, these 2 outcomes are likely related. Therefore, our

uncertainty regions may still not be accurately capturing, and are likely overestimating, the joint-

prediction areas. The true uncertainty region would be a subset of the circle if pain and

functioning were dependent.

Based on these uncertainty estimates regarding the predictions and based on the

mathematical equipoise approach, we used KOMET to identify patients for whom enrollment in a

randomized clinical trial might be appropriate. For the purpose of demonstration, we defined

mathematical equipoise as a condition when pain and functioning outcome predictions with

nonsurgical care and TKR are relatively close and fall within each other’s circles or zones of

uncertainty—ie, their circles of uncertainty overlap. These circles are created when the pain and

function outcome predictions are presented as point estimates on a 2-dimensional graph with

pain on the vertical axis (y) and function on the horizontal axis (x). The uncertainty circle is defined

by the shaded area extending around each of the point estimates and represents the uncertainty

associated with the predictions. In Appendix J: Figure 1, the blue diamond represents the

outcome prediction point estimate for nonsurgical care and the green circle represents the point

estimate for TKR. The large shaded blue and green overlapping circles are around the 95%

confidence intervals of the pain and function point estimates and represent the uncertainty

associated with the predictions. When we computed the mathematical distance between the

nonsurgical and TKR predictions, the resulting distance between the 2 coordinates on the pain and

function graph was 43. Empirically, based on the input of rheumatologists, orthopedists, and

primary care clinician stakeholders, and after reviewing the 95% CI for a sample of cases, we

selected the distance of less than or equal to 20 to flag the possible presence of equipoise during

the usability testing.
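
The flagging rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the KOMET implementation: it assumes the "mathematical distance" is the ordinary straight-line (Euclidean) distance between the 2 points on the pain and function graph, and the names `Prediction`, `prediction_distance`, and `flag_equipoise` are hypothetical. Only the cutoff of 20 comes from the text.

```python
import math
from dataclasses import dataclass

# Cutoff taken from the text: a distance of 20 or less flags possible equipoise.
EQUIPOISE_THRESHOLD = 20.0


@dataclass(frozen=True)
class Prediction:
    """Predicted 1-year outcome for one treatment (hypothetical structure)."""
    function: float  # horizontal axis (x): predicted function score
    pain: float      # vertical axis (y): predicted pain score


def prediction_distance(a: Prediction, b: Prediction) -> float:
    """Straight-line distance between two points on the pain/function plot.

    Assumption: the report does not name the metric; Euclidean is used here.
    """
    return math.hypot(a.function - b.function, a.pain - b.pain)


def flag_equipoise(nonsurgical: Prediction, tkr: Prediction,
                   threshold: float = EQUIPOISE_THRESHOLD) -> bool:
    """Return True when the two predictions are within the distance cutoff."""
    return prediction_distance(nonsurgical, tkr) <= threshold


# Illustrative values only; these are not patient data from the study.
nonsurgical = Prediction(function=40.0, pain=50.0)
tkr = Prediction(function=43.0, pain=54.0)
print(prediction_distance(nonsurgical, tkr))  # 5.0
print(flag_equipoise(nonsurgical, tkr))       # True
```

Under this rule, the example in the text, with a distance of 43, would exceed the cutoff and would not be flagged.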

When mathematical equipoise was present, an alert appeared on the user interface’s

results page and a patient contact and screening form was made available to the clinician. The

form could be used to begin the clinical trial recruitment process.

We asked participants about usefulness of the information for decision making. Each

stated that the tool was helpful or somewhat helpful. All wanted to discuss the results with their

physicians.

We used the combined pain and function predicted outcomes plot to discuss the idea of

equipoise in the context of random assignment of treatments in an RCT. We showed users 3

sample graphs depicting the predicted outcomes of the 2 potential treatments, with small,

moderate, and large amounts of overlap between the 2 circles that depicted uncertainty around

the predicted point value (see Appendix L). We assumed that if patients perceived greater overlap

in the predicted outcomes between the 2 treatments, then they would be more likely to consider

being randomized to 1 of the treatments. Only 7 of 12 users shown these scenarios responded

that they understood the concept of equipoise. Because of their personal preferences, some users

rejected the option of surgery despite predictions suggesting dramatic reductions in pain and

improvement in physical functioning. Other patients indicated they would consider randomization

only if the burden of surgery promised a far better outcome than nonsurgical treatment. Overall,

users did not respond to the depiction of the circles of uncertainty scenarios as we expected.

We conducted patient and clinician user testing to understand KOMET use in the clinical

setting during regularly scheduled clinic visits and research-specific visits. There were significant

challenges in allocating adequate time for the patient to complete the decision support tool and

for the clinician and patient to discuss the results and implications for decision making. We

determined that future dissemination should include patients completing the tool prior to their

visit to allow the patient and clinician more time during the visit to discuss the results, the

patient’s priorities and choices, and treatment options.

Through the efforts of our research team, stakeholders, and design consultants, we were

able to develop a software program that users found helpful in shared clinical decision making.

Although the final prototype seemed attractive and easy to use, there will need to be further

refinements for routine clinical care use and enrollment into RCTs.

DISCUSSION

Study Results in Context

In deciding between treatment options and deciding whether to participate in a clinical

study, the patient is the ultimate decision maker. Ideally, these determinations will be made with

ample consultation and support by relevant clinicians. In this context, methods to share

information and support a shared conversation about these decisions can be very helpful.

Decision aids explicitly intended for shared patient–clinician decision making have been shown to

improve patient knowledge, patients’ satisfaction with decision making, and agreement between

choices for treatment and their health outcome preferences, among other positive effects.43 The

same kind of shared decision making is justified in the decision as to whether to participate in an

RCT. In developing KOMET, we sought to develop decision support that could support both a

clinical decision and a decision to participate in an RCT—in this case, the decision between

surgical (TKR) and nonsurgical treatment of knee osteoarthritis.

There are 2 general contexts for the results of this project: (1) the development of

mathematical equipoise as a basis for decision support and (2) the state of evidence for treatment

decisions for knee osteoarthritis. We discussed the latter of these 2 in the introduction; although

knee replacement surgery for osteoarthritis is very common in this country, until a relatively small

RCT was conducted as this project was being completed there were no RCT data to directly inform

this treatment choice.9 Although KOMET does not provide new data, it presents those data

available at the time of its development in a potentially helpful way. More to the general point of

this project, KOMET is intended to help generate the needed RCT data for knee osteoarthritis to

add to extant evidence. Thus, the main context for the results of this project is for the

development of the mathematical equipoise method.

Mathematical equipoise is based on the use of mathematical models that serve as clinical

predictive instruments to predict patient-specific outcomes of treatment options, which then can

be compared. By doing so, in a sense we are discerning patient-specific equipoise. When the

predictions are not discernibly or importantly different, which can suggest equipoise between

options, enrollment in an RCT that compares the treatments can be considered. When the

predictions suggest one treatment is likely to have better outcomes, then trial enrollment would

not be appropriate. When this is the case, however, this identification of a potentially superior

treatment can inform patient–clinician decision making, thereby constituting an approach for

enrolling RCT participants that also supports clinical decision making for those not to be enrolled

in an RCT.

Our original examples of this approach used predictive model outcomes of acute

myocardial infarction that were built using RCT data, which are ideal sources of data for making

predictive models. However, for many treatments there are no prior RCTs—and indeed these are

the very conditions for which RCTs and, in particular, clinical effectiveness trials are needed.

For the widespread use of mathematical equipoise to help fill in gaps in RCTs, predictive

models for these conditions will need to be built on data from clinical registries, EHRs, and other

non-RCT data. Therefore, the purpose of this project was to further develop the method by

applying it to an important clinical treatment question for which there were essentially no prior

RCTs. With more than 680 886 total knee replacements done each year in the United States for

knee osteoarthritis,44 we considered this an important question for patients and society, and a

good opportunity to test whether this approach could work in this challenging but common

situation.

To do this, we created a consolidated database from non-RCT sources on knee

osteoarthritis outcomes on which we created predictive models of the outcomes of surgical knee

replacement and nonsurgical treatments. The choices for variables for these models were

informed by the needs of stakeholders who would use decision support for knee osteoarthritis,

with focus on their views on the representation of pain and functional outcomes. We then

developed multivariable mathematical models that predict patient-specific outcomes of surgical

and nonsurgical treatment, using statistical and analytic methods to adjust, to the extent possible,

for the inherent biases in the databases. We also performed a variety of analyses to understand

how to best model and represent the predicted outcomes. We incorporated these models into a

stakeholder-informed prototype decision support software for potential incorporation into EHRs.

KOMET exemplifies a tool that can provide shared decision support for RCT enrollment and

clinical care that is responsive to the perspectives and needs of patients and clinicians.

We believe the impact of such a method on the field of CER could be substantial. The

impact of CER is based on evidence generation, which leads to evidence synthesis, interpretation,

application, dissemination, implementation in widespread practice, and then feedback for the

generation of new evidence. This entire chain rests on having unbiased, generalizable, ideally RCT,

evidence. Were there such a method for patient-centered enrollment into RCTs that could be

incorporated into EHRs, far more targeted comparative effectiveness trials could be conducted,

more diverse clinical sites could be included, and more representative patients could be enrolled.

This would lead to results that are applicable to more special groups and to more care settings.

This would also facilitate clinicians’ and the public’s understanding of, and enrolling into, clinical

trials, which could help improve the public’s engagement in the biomedical research enterprise.

Additionally, clinical trial duration, a scientifically and financially important component of drug

development pipeline time, could be much shorter. If, instead of only 10% of eligible patients being

enrolled, the enrollment pace were increased by up to 10-fold, trials would finish much faster. All these

advantages should result in better clinical trials and greater impact on the public’s health.

If this method, which provides on-site real-time decision support for trial enrollment, proves

applicable to the many important conditions for which RCTs have not yet been conducted, it could

transform how comparative effectiveness research is conducted across the spectrum of health

care. This would address the failure of current clinical trial approaches to enroll sufficient

numbers of patients, meet the need to identify and ethically handle treatment of all potential

subjects, and foster a conversation between clinician and patient based on data specific to that

patient. Thereby, it could have use in broad areas of

clinical care and could help enable the great promise of CER in improving clinical care.

Uptake of Study Results

In this project, based on input from multiple stakeholders and potential users, we

implemented prototype KOMET software and tested it in clinical settings. Although it is not ready

for widespread implementation in its user interface, content, and connectability to EHRs, it did

function as intended and thus is an important step in the ultimate goal of clinical use. We believe

its promise is sufficient to warrant further development with the explicit intent of being a tool

that can be implemented in clinical settings and linked to EHRs, to serve both treatment and RCT

enrollment purposes. We will seek further opportunities to advance toward that

goal.

Study Limitations

Although largely accomplishing its intended objectives, as an early stage in the

development of mathematical equipoise for shared clinical decision making, this project has a

number of limitations related to the available data, the modeling methods, the model variables,

and the prototype software.

An important limitation of our approach is that the models were created on potentially

biased data. Although we sought data from studies that had both surgical and medical treatment

of knee osteoarthritis, only 2 of our studies fit that requirement, while 2 other registries included only

one treatment (surgery). Both types of sources pose challenges for creating comparable

patients who underwent the 2 treatments, which is needed to make accurate models of the 2

treatments. In contrast, our prior clinical predictive models, including the first examples of

mathematical equipoise, used data from RCTs. This allowed for representation of the alternative

treatment effects based on comparable samples undergoing the treatment choices, providing

confidence that the effects and outcomes represented by the multivariable models would reflect

the actual treatment effects and not differences in the underlying characteristics of patients

receiving the alternative treatments. However, for mathematical equipoise to serve its intended

purpose of facilitating RCTs of treatments for which none have yet been conducted, its models will

need to be made on non-RCT data. Therefore, for this project we intentionally chose a condition

for which our only data were from registries that posed challenges for making models that could

be based on comparable patients receiving the 2 treatments. We undertook many checks to

maximize the comparability and to accurately represent effects despite the likely biased samples.

For example, we chose to use matching for our study design but acknowledge that while 1:1

matching improves control of confounding and enabled us to create a hypothetical sample of non-

TKR patients who could be considered as TKR eligible, this approach does not use data from all

available subjects and may therefore have the cost of less precision. While we believe KOMET

models have very good performance, despite the challenge of the available data, additional

sources of data for this approach should be developed.
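
To make the precision trade-off of 1:1 matching concrete, the following is a minimal Python sketch of greedy nearest-neighbor matching without replacement. It is an illustration under stated assumptions, not the procedure used to build the KOMET database: it matches on a single hypothetical numeric score per patient, and the function name `greedy_match`, the `caliper` parameter, and all example values are invented for this sketch.

```python
from typing import Dict, List, Tuple


def greedy_match(treated: Dict[str, float],
                 controls: Dict[str, float],
                 caliper: float) -> List[Tuple[str, str]]:
    """Greedy 1:1 nearest-neighbor matching without replacement.

    Each treated patient, in input order, is paired with the closest
    still-unmatched control, provided the score difference is within
    the caliper. Controls left unmatched are dropped from the sample.
    """
    available = dict(controls)  # copy so the caller's data is not modified
    pairs: List[Tuple[str, str]] = []
    for t_id, t_score in treated.items():
        if not available:
            break
        # Closest remaining control on the matching score.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control is used at most once
    return pairs


# Hypothetical scores; these are not data from the study.
tkr_patients = {"T1": 0.30, "T2": 0.62}
non_tkr_patients = {"C1": 0.28, "C2": 0.60, "C3": 0.95, "C4": 0.10}
print(greedy_match(tkr_patients, non_tkr_patients, caliper=0.05))
# [('T1', 'C1'), ('T2', 'C2')]
```

In this toy example, 2 of the 4 non-TKR patients go unused, which illustrates why matching can improve comparability at the cost of precision.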

The modeling methods also have limitations. Although multivariable regression such as we

used has advantages over some more computer-intensive methods like those used for machine

learning, including the clearly interpretable variable coefficients and robustness that is more

resistant to overfitting than some computer-driven methods,45 larger databases on which more

corrections might be made (eg, via the use of propensity scores) and newer computer methods

might advance the level of models that might be created. Indeed, we believe this approach will

benefit greatly from such advances in modeling.

Another limitation is that neither model was validated in an independent database; we

simply did not have sufficient data to support model development and to still have enough for a

test database. However, we believe the performance of the models and their variables are

consistent with clinical understanding and importance and are reasonable for use in this

demonstration project. Nonetheless, testing on an independent data set will be an important

future priority.

Beyond the methods, the modeling variables we used have limitations. While in general,

based on the collection of important variables in the available databases and based on published

clinical evidence and input by stakeholders, we believe we used very credible variables to

represent independent and dependent (treatment outcome) variables, there is one about which

we have reservations. The functional outcome we used, based on our wish to capture a holistic

physical function of the patient, was based on the SF-12 functional scale. In looking at the results

of KOMET predictions, we noticed that pain is often substantially changed by surgery but function

tends to have a relatively modest improvement. In discussing this with patients, we wondered if

we would have better captured their meaningful knee functional improvement if we had used a

more knee-specific function rather than overall physical functioning. We hoped to address this

limitation by performing further analyses of the treatment outcomes for the subsample of

patients in our databases for whom we have a knee-specific functional scale, the KSS, as a

treatment outcome. The results of these exploratory analyses, presented in Appendix M, suggest

that the WOMAC knee pain tracks well with other measures of knee pain and symptoms and, in

particular, Knee injury and Osteoarthritis Outcome Score (KOOS) knee pain. The SF-12 physical

component score, while positively correlated, does not track as strongly with other knee-related

quality of life and function variables. These results are somewhat to be expected in that, while

there may be overlap in physical function and knee-related function, they are not the same thing.

Our meetings with stakeholders suggest both overall and knee-related function are important,

and we have come to believe future work to develop predictions of the more specific knee-related

function would be useful to both patient and clinical stakeholders.

Certainly, as a prototype the KOMET software has limitations. The creation of full-featured,

user-friendly, robust software is beyond the scope of this project. Our prototype has significant

distances to go in these and other dimensions before it could be used in routine care.

Nonetheless, we believe it is quite attractive and functional and, in the context of its intended role

in this project, a successful product of this project.

Finally, in putting the use of KOMET in the context of clinical decision making, this

approach does not consider how the patient might feel about the outcome states (pain and

function). Doing so would involve translating the WOMAC and SF-12 scores into familiar terms for

patients and making sure the idea that overlapping predictions suggest equivalence is

understood. Also, it would include ensuring that these features are readily incorporated into

patients’ understanding and decision making for their own and shared consideration. Beyond

these user issues, as additional information, patients would have to know about the downsides

and potential complications of surgery, delays of surgery, and adverse consequences of other

treatments. Thus, while KOMET provides an important foundation for the shared decision-making

process, to provide complete and optimal support additional work is needed in a number of

dimensions.

Future Research

The limitations listed above suggest areas for future research. Approaches must be

developed that lessen the biases inherent in clinical registry data. Although having more data,

such as might be obtained from EHR data warehouses and other wide sources of clinical data, will

not eliminate biases, finding ways to mitigate the biases using selection and sampling methods

and other approaches will be extremely important for work on mathematical equipoise, as well as

for many other efforts to harvest clinically important insights from clinical data. Beyond

developing such methods, validation of these approaches will be crucial.

In future efforts of this type, we would like to have more complete accounting for ancillary

issues and complications. For example, in the one RCT done to date,9 serious adverse events were

more common in the TKR group than in the nonsurgical treatment group (8 versus 1 involving the

index knee [P = .05] and 24 versus 6 overall [P = .005]), with the 2 most common serious adverse

events involving the index knee being deep venous thrombosis (in 3 patients) and stiffness

requiring brisement force (in 3 patients). Unfortunately, we did not have access to such

information in the databases available to us. In this Danish study,9 adverse events that occurred

before the 12-month follow-up were identified in hospital records, by self-report at follow-up

visits, and by the physiotherapist and were then categorized. In future work exploring

mathematical equipoise, in an analogous way, we intend to methodically collect such data.

Modeling clinical outcomes based on data is evolving rapidly, and increasingly

sophisticated computer-based methods, such as artificial intelligence and machine learning, are

being applied to analysis of clinical data. Although computer-based algorithms have a tendency to

overfit,45 compromising their generalizability to new populations, methods are advancing and an

investigation of best methods is certainly warranted.

As indicated above, we believe the functional outcome we used for physical function

might benefit from being a more knee-specific outcome variable.46,47 There are examples in other

diseases in which, for specific conditions, disease-specific outcomes are more useful than more

general functional outcomes, such as we used.48,49 We believe additional research that uses a

more specific functional outcome would be worth conducting.

In terms of the prototype software, it is clear more research is needed for this and similar

decision support that provides full-featured, user-friendly, interoperable, robust software.

Attractive and functional software for this and similar purposes is badly needed.

Finally, in developing and testing such decision support software, we will need to further

investigate how to better understand, make clear, and use the patient-specific determination of

equipoise that could be the basis of a comparative effectiveness RCT. We believe the method has

important advantages for such studies, but before it can be widely deployed and used it must be

fully understood and transparent to all stakeholders. We look forward to advancing this work.

CONCLUSIONS

This project demonstrated the use of predictive instruments and mathematical equipoise

as a way to discern patient-specific equipoise and thereby as a method for providing patient-

specific decision support for shared patient–physician decision making for the selection between

alternative treatments and as the basis for enrollment into comparative effectiveness trials. Based

on its predictive models, KOMET provides individualized predictions of pain and functional

outcomes of medical and surgical treatment of knee osteoarthritis designed to be embedded in

EHRs. It can help identify patients for whom one or the other treatment seems likely to yield

better outcomes based on their specific characteristics as well as patients for whom there is

insufficient evidence to favor one treatment. This still can be part of a shared decision-making

process that incorporates the patient’s preferences and priorities for the outcomes the models

predict (ie, pain and function but not others), and, by identifying potential clinical equipoise, it

also can support enrollment into an RCT.

The next step will be to conduct a larger-scale test and then to implement it for its

intended use—the conduct of a comparative effectiveness trial in usual care settings in which

KOMET would support patient–clinician shared decision making about treatment selection for

knee osteoarthritis.

REFERENCES

1. Lawrence RC, Felson DT, Helmick CG, et al. Estimates of the prevalence of arthritis and other rheumatic conditions in the United States. Part II. Arthritis Rheum. 2008;58(1):26-35.

2. Guccione AA, Felson DT, Anderson JJ, et al. The effects of specific medical conditions on the functional limitations of elders in the Framingham Study. Am J Public Health. 1994;84(3):351-358.

3. Mankin HJ. Clinical features of osteoarthritis. In: Kelly WN HE, Ruddy S, Sledge CB, eds. Textbook of Rheumatology. 4th ed. Philadelphia, PA: W.B. Saunders Co; 1993:1374-1384.

4. The Incidence and Prevalence Database for Procedures. Sunnyvale, CA: Timely Data Resources; 1995.

5. Kosorok MR, Omenn GS, Diehr P, Koepsell TD, Patrick DL. Restricted activity days among older adults. Am J Public Health. 1992;82(9):1263-1267.

6. Kramer JS, Yelin EH, Epstein WV. Social and economic impacts of four musculoskeletal conditions. A study using national community-based data. Arthritis Rheum. 1983;26(7):901-907.

7. Selten EM, Vriezekolk JE, Geenen R, et al. Reasons for treatment choices in knee and hip osteoarthritis: a qualitative study. Arthritis Care Res. 2016;68(9):1260-1267.

8. Weng HH, Kaplan RM, Boscardin WJ, et al. Development of a decision aid to address racial disparities in utilization of knee replacement surgery. Arthritis Rheum. 2007;57(4):568-575.

9. Skou ST, Roos EM, Laursen MB, et al. A randomized, controlled trial of total knee replacement. New Engl J Med. 2015;373(17):1597-1606.

10. Eyles JP, Mills K, Lucas BR, et al. Can we predict those with osteoarthritis who will worsen following a chronic disease management program? Arthritis Care Res. 2016;68(9):1268-1277.

11. Fagerlin A, Sepucha KR, Couper MP, Levin CA, Singer E, Zikmund-Fisher BJ. Patients' knowledge about 9 common health conditions: the DECISIONS survey. Med Decis Making. 2010;30(suppl 5):35S-52S.

12. Kent DM, Ruthazer R, Griffith JL, et al. A percutaneous coronary intervention-thrombolytic predictive instrument to assist choosing between immediate thrombolytic therapy versus delayed primary percutaneous coronary intervention for acute myocardial infarction. Am J Cardiol. 2008;101(6):790-795.

13. Selker HP, Beshansky JR, Griffith JL, et al. Use of the acute cardiac ischemia time-insensitive predictive instrument (ACI-TIPI) to assist with triage of patients with chest pain or other symptoms suggestive of acute cardiac ischemia. A multicenter, controlled clinical trial. Ann Intern Med. 1998;129(11):845-855.

14. Selker HP, Griffith JL, Beshansky JR, et al. Patient-specific predictions of outcomes in myocardial infarction for real-time emergency use: a thrombolytic predictive instrument. Ann Intern Med. 1997;127(7):538-556.

15. Selker HP, Beshansky JR, Griffith JL, Investigators TPIT. Use of the electrocardiograph-based thrombolytic predictive instrument to assist thrombolytic and reperfusion therapy for acute myocardial infarction. A multicenter, randomized, controlled, clinical effectiveness trial. Ann Intern Med. 2002;137(2):87-95.

16. Stacey D, Légaré F, Col NF, et al. Decision aids for people facing health treatment or screening decisions. Cochrane Database Syst Rev. 2014;(1):CD001431. doi: 10.1002/14651858.CD001431.pub4.

17. Stacey D, Taljaard M, Dervin G, et al. Impact of patient decision aids on appropriate and timely access to hip or knee arthroplasty for osteoarthritis: a randomized controlled trial. Osteoarthritis Cartilage. 2016;24(1):99-107.

18. Bozic KJ, Belkora J, Chan V, et al. Shared decision making in patients with osteoarthritis of the hip and knee: results of a randomized controlled trial. J Bone Joint Surg. 2013;95(18):1633-1639.

19. de Achaval S FL, Volk RJ, Cox V, Suarez-Almazor ME. Impact of educational and patient decision aids on decisional conflict associated with total knee arthroplasty. Arthritis Care Res. 2012;64(2):229-237.

20. Stacey D, Hawker GA, Dervin G, et al. Decision aid for patients considering total knee arthroplasty with preference report for surgeons: a pilot randomized controlled trial. BMC Musculoskelet Disord. 2014;15(54):1-10.

21. Hip and knee osteoarthritis toolkit. Dartmouth-Hitchcock Center for Shared Decision Making website. http://med.dartmouth-hitchcock.org/csdm_toolkits/hip_and_knee_osteoarthritis_toolkit.html. Published 2017. Accessed January 7, 2018.

22. Selker HP, Ruthazer R, Terrin N, Griffith JL, Concannon T, Kent DM. Random treatment assignment using mathematical equipoise for comparative effectiveness trials. J Clin Transl Sci. 2011;4(1):10-16.

23. Forsythe LP, Ellis LE, Edmundson L, et al. Patient and stakeholder engagement in the PCORI pilot projects: description and lessons learned. Int J Gen Med. 2016;31(1):13-21.

24. Deverka PA, Lavallee DC, Desai PJ, et al. Stakeholder participation in comparative effectiveness research: defining a framework for effective engagement. J Comp Eff Res. 2012;1(2):181-194.

25. PCORI engagement rubric. PCORI website. http://www.pcori.org/sites/default/files/Engagement-Rubric.pdf. Published 2014. Accessed October 27, 2016.

26. Concannon TW, Meissner P, Grunbaum JA, et al. A new taxonomy for stakeholder engagement in patient-centered outcomes research. Int J Gen Med. 2012;27(8):985-991.

27. Multicenter Osteoarthritis Study (MOST) database. San Francisco, CA: University of California; 2009. http://most.ucsf.edu. Accessed May 15, 2014.

28. Osteoarthritis Initiative (OAI) database. Bethesda, MD: National Institutes of Health; 2013. https://nda.nih.gov/oai/. Specific data sets: V 0.2.2, 1.2.1, 2.2.2, 3.2.1, 4.2.1, 5.2.1, 6.2.2, 7.2.1, 8.2.1, 9.2.1, 24, 25, and 9. Accessed June 25, 2014.

29. Hawker GA, Wright JG, Coyte PC, et al. Differences between men and women in the rate of use of hip and knee arthroplasty. New Engl J Med. 2000;342(14):1016-1022.

30. Hawker GA, Wright JG, Coyte PC, et al. Determining the need for hip and knee arthroplasty: the role of clinical severity and patients' preferences. Med Care. 2001;39(3):206-216.

31. Hawker GA, Wright JG, Glazier RH, et al. The effect of education and income on need and willingness to undergo total joint arthroplasty. Arthritis Rheum. 2002;46(12):3331-3339.

32. New England Baptist Hospital (NEBH) orthopedic surgery registry. Boston, MA: New England Baptist Hospital; 2018. https://www.nebh.org/health-professionals/research/orthopedic-registry/. Accessed February 22, 2017.

33. Tufts Medical Center (TMC) orthopedic surgery registry. Boston, MA: Tufts University School of Medicine.

34. WOMAC osteoarthritis index. http://www.womac.org/womac/index.htm. Accessed February 22, 2017.

35. SF-12 health survey. http://www.outcomes-trust.org/instruments.htm#SF-12. Accessed February 22, 2017.

36. Ware J Jr, Kosinski M, Keller SD. A 12-Item Short-form health survey: construction of scales and preliminary tests of reliability and validity. Med Care. 1996;34(3):220-233.

37. Lacson E Jr, Xu J, Lin SF, Dean SG, Lazarus JM, Hakim RM. A comparison of SF-36 and SF-12 composite scores and subsequent hospitalization and mortality risks in long-term dialysis patients. Clin J Am Soc Nephrol. 2010;5(2):252-260.

38. Kosanke JB, Bergstralh E. GMATCH macro for greedy matching. http://bioinformaticstools.mayo.edu/research/gmatch/. Accessed February 22, 2017.

39. Riddle DL, Kong X, Jiranek WA. Factors associated with rapid progression to knee arthroplasty: complete analysis of three-year data from the osteoarthritis initiative. Joint Bone Spine. 2012;79(3):298-303.

40. Gantz MG. Creating RTF tables with univariate analyses of multiply imputed data. Poster presented at: Southeast SAS Users Group (SESUG) Conference; October 8-10, 2006; Atlanta, GA.

41. SAS (for Windows) [computer program]. Version 9.4 TS Level 1M2. Cary, NC: SAS Institute; 2002-2012.

42. Stuart EA. Matching methods for causal inference: a review and a look forward. Stat Sci. 2010;25(1):1-21.

43. Stacey D, Légaré F, Lewis K, et al. Decision aids for people facing health treatment or screening decisions. Cochrane Database Syst Rev. 2017;4:CD001431. doi: 10.1002/14651858.CD001431.pub5.

44. HCUPnet. Healthcare Cost and Utilization Project. US Department of Health & Human Services/Agency for Healthcare Research and Quality. https://hcupnet.ahrq.gov/#setup. Published 2014. Accessed April 25, 2017.

45. Selker HP, Griffith JL, Patil S, Long WJ, D'Agostino RB. A comparison of performance of mathematical predictive methods for medical diagnosis: identifying acute cardiac ischemia among emergency department patients. J Investig Med. 1995;43(5):468-476.

46. Brazier JE, Harper R, Munro J, Walters SJ, Snaith ML. Generic and condition-specific outcome measures for people with osteoarthritis of the knee. Rheumatology. 1999;38(9):870-877.

47. Bombardier C, Melfi CA, Paul J, et al. Comparison of a generic and a disease-specific measure of pain and physical function after knee replacement surgery. Med Care. 1995;33(suppl 4):AS131-AS144.

48. Binkley JM, Stratford PW, Lott SA, Riddle DL. The lower extremity functional scale (LEFS): scale development, measurement properties, and clinical application. North American Orthopaedic Rehabilitation Research Network. Phys Ther. 1999;79(4):371-383.

49. Patrick DL, Deyo RA. Generic and disease-specific measures in assessing health status and quality of life. Med Care. 1989;27(suppl 3):S217-S232.

50. Apache.org Tomcat [computer program]. Version 8. Wakefield, MA: Apache Software Foundation; 1999-2019.

Acknowledgments

The authors wish to thank our patient and clinician stakeholders for their valuable

contributions and guidance: Debra Band-Entrup, Kathie Bernstein, Melvin Bernstein, Jaclyn Chu,

Deane Felter, William Harvey, Helen Herzer, Cristina MacDonald, Vincent MacDonald, Susan Nesci,

John Richmond, Kimberly Schelling, Eric Smith, and Steven Vlad. The authors thank Kaila Dion and

Rajeev Chorghade for support with scale development and data management, and Ben Hannon

for user interface design. We acknowledge Brendan Harrison, Nikolai Klebanov, and Esha

Sondhi for data collection, and Mary Pevear and Gary Schneider for their assistance with

Orthopedic Surgery Registries.

The OAI is a public–private partnership comprising 5 contracts (N01-AR-2-2258,

N01-AR-2-2259, N01-AR-2-2260, N01-AR-2-2261, and N01-AR-2-2262) funded by the National Institutes of

Health, a branch of the Department of Health and Human Services, and conducted by the OAI

study investigators. Private funding partners include Merck Research Laboratories, Novartis

Pharmaceuticals Corporation, GlaxoSmithKline, and Pfizer Inc. Private sector funding for the OAI is

managed by the Foundation for the National Institutes of Health. This manuscript was prepared

using an OAI public use data set and does not necessarily reflect the opinions or views of the OAI

investigators, the NIH, or the private funding partners.

MOST comprises 4 cooperative grants (Felson—AG18820, Torner—AG18832, Lewis—

AG18947, and Nevitt—AG19069) funded by the National Institutes of Health, a branch of the

Department of Health and Human Services, and conducted by MOST study investigators. This

manuscript was prepared using MOST data and does not necessarily reflect the opinions or views

of MOST investigators. Recommended additional documentation describing various aspects of the

design and methods of MOST is available by request sent to [email protected] and

should be paraphrased and referenced as appropriate.

Data were provided from the Ontario Hip and Knee Osteoarthritis Cohort conducted by

the Canadian Osteoarthritis Research Program, led by Dr Gillian Hawker. Data provided from CORP

are made possible through grants by the Canadian Institutes of Health Research and the Arthritis

Society.

Research reported in this report was funded through a PCORI Award (ME-1306-02327).

The views, statements, and opinions presented in this report are solely the responsibility of the

authors and do not necessarily represent the views of PCORI, its Board of Governors, or its

Methodology Committee.

APPENDIX A: Matching

The goal of this project was to make prediction models using available non-RCT data

sources. To accomplish this we used multiple registries to create a modeling database that

included matched sets of paired knees (one with and one without TKR) that were similar in all

respects except for the surgical procedure. Our process for creating a database of matched TKR to

non-TKR knees has limitations which should be kept in mind when using the resulting models and

predictions.

First, we only matched subjects based on data available within each study. While a

TKR:non-TKR knee dyad may have come from 2 patients with the same gender, similar age, and

baseline knee pain, etc., other characteristics that were not part of the matching process may

have differed. Second, we allowed for non-exact matches because we wanted to include knees

that had TKR in our analysis even if we could not find a perfect match. We planned to adjust for

covariates in the modeling process to account for remaining residual imbalances between the TKR

and non-TKR groups. Third, we excluded subjects who did not have 1-year pain outcome data

from the matching process. Non-TKR knees that had a TKR during the follow-up period were

excluded from the matching process if 1-year follow-up in the non-TKR state was not available.

The predicted outcomes for non-TKR are based on the assumption that the knee did not have a

TKR within a year. Fourth, we also excluded knees with TKR that did not have 1-year follow-up

data. There could be several reasons for lack of follow-up data, some of which may not lead to

bias (study ended before follow-up could be done) while other reasons could lead to biased

predictions. For example, if a patient had TKR and major surgical complications led to death, then

the 1-year outcome data would not be available, and exclusion of these bad outcomes would lead

to overly favorable predictions.

Lastly, our ‘baseline’ data may not truly capture status at the time a patient decided

whether or not to have TKR. The NEBH and TMC databases of surgical cases did capture baseline

information in a timely manner. However, the OAI and MOST databases were registries of subjects

with knee osteoarthritis with timed data collection points (that included questions about whether

or not a TKR took place since the last timed measure). Evaluating data at the knee-visit level

allowed us to find subjects who had TKR, and we could then look back in time to find the nearest

assessment. For some patients, that may have been within a month or two of the surgery, while

for others, it may have been within a year. Since follow-up for TKR subjects started at the time of

surgery, this also meant that the time between when baseline measures were done and the

1-year follow-up was longer for TKR subjects than for non-TKR subjects. If one believes knee pain and

function worsen over time in subjects that decide to get TKR, then our tool may underestimate

the benefit of TKR as a result of our not having a true baseline assessment. However, our final

regression models were built using data from all four databases, where over 40% of knees that had

TKR were from the surgical databases, lessening the impact of varying elapsed times between the

baseline assessment and actual TKR surgery.

The matching was done using SAS software41 and the SAS Macro %GMATCH for greedy

matching38 downloaded in February 2014 from:

http://www.mayo.edu/research/documents/gmatchsas/doc-10027248.
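
The idea of greedy matching can be illustrated with a short Python sketch. This is not the %GMATCH macro and does not reproduce its options. The variable names (`age`, `pain`), the equal weights, the caliper, and the example knees are all hypothetical, and the distance is assumed to be a weighted sum of absolute differences with one control per case and no reuse of controls.

```python
from typing import Dict, List, Optional, Tuple

Knee = Dict[str, float]

def distance(a: Knee, b: Knee, weights: Dict[str, float]) -> float:
    """Weighted sum of absolute differences over the matching variables."""
    return sum(w * abs(a[k] - b[k]) for k, w in weights.items())

def greedy_match(cases: List[Knee], controls: List[Knee],
                 weights: Dict[str, float],
                 caliper: Optional[float] = None) -> List[Tuple[int, int]]:
    """Pair each case, in order, with its nearest unused control.

    Greedy means a control is never reassigned once taken, so the result
    depends on case order and is not guaranteed to be globally optimal.
    """
    unused = set(range(len(controls)))
    pairs: List[Tuple[int, int]] = []
    for i, case in enumerate(cases):
        if not unused:
            break
        # Nearest remaining control; ties are broken by the lower index.
        j = min(unused, key=lambda c: (distance(case, controls[c], weights), c))
        if caliper is None or distance(case, controls[j], weights) <= caliper:
            pairs.append((i, j))
            unused.remove(j)
    return pairs

# Hypothetical data: two TKR knees and three non-TKR knees.
tkr = [{"age": 65, "pain": 60}, {"age": 70, "pain": 40}]
non_tkr = [{"age": 71, "pain": 42}, {"age": 64, "pain": 58}, {"age": 50, "pain": 10}]
pairs = greedy_match(tkr, non_tkr, weights={"age": 1.0, "pain": 1.0}, caliper=10)
print(pairs)  # [(0, 1), (1, 0)]
```

A caliper that is too tight leaves TKR knees unmatched, which is the tradeoff behind allowing non-exact matches as described above.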

APPENDIX B: Missing Data

We addressed the challenge of missing data by using a multiple imputation methodology

and by restricting the number of variables examined. For the multiple imputation, we used the

SAS procedures PROC MI to do the imputation and PROC MIANALYZE to combine results from

analyses run on the different imputed datasets.41 For each matched dataset (OAI, MOST, NEBH,

TMC) we created ten imputed datasets. We imputed main effects only and later calculated

interactions from the imputed data for the component main effects. We did discuss the

alternative of creating interactions first, then imputing data for the missing interactions. We opted

against that approach since we were concerned we could be generating main effects and

interactions at a subject level that did not correspond to each other. We used available data,

including 1-year outcome data, for each dataset when creating the imputed data. This means that

the variables used for the imputation differed by study based on availability, but this approach maximized the

information we had for each study.
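
Combining results across imputed datasets is commonly done with Rubin's rules, in which the pooled variance adds the average within-imputation variance to an inflated between-imputation variance. The Python sketch below illustrates that calculation for a single coefficient. The function name and the ten coefficient values are hypothetical and are not results from this project.

```python
from statistics import mean, variance
from typing import List, Tuple

def pool_rubin(estimates: List[float], std_errors: List[float]) -> Tuple[float, float]:
    """Combine one coefficient across m imputed datasets with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching lists from at least two imputations")
    pooled = mean(estimates)                       # pooled point estimate
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(estimates)                  # variance across imputations (m - 1 divisor)
    total = within + (1 + 1 / m) * between         # total variance of the pooled estimate
    return pooled, total ** 0.5

# Hypothetical coefficient and standard error from each of ten imputed datasets.
betas = [2.0, 2.2, 1.8, 2.1, 1.9, 2.0, 2.2, 1.8, 2.1, 1.9]
ses = [0.5] * 10
estimate, se = pool_rubin(betas, ses)
print(round(estimate, 3), round(se, 3))
```

The pooled standard error exceeds the per-dataset value of 0.5 because it also reflects how much the estimate varies from one imputed dataset to the next.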

APPENDIX C: WOMAC Pain Score

I. WOMAC Knee Pain Score agreement with other measures of Pain and Function

WOMAC Knee pain was selected as our primary outcome in collaboration with our

stakeholders and study team. To confirm the importance, and better understand the meaning of

the outcome, we reviewed correlations and scatter plots of WOMAC knee pain (WOMKP) with the

following variables: baseline SF-12 physical score (HSPSS), the Physical Activity Scale for the Elderly

(PASE), the WOMAC disability scales (WOMADL), the KOOS sport and recreational activity scale

(KOOSFSR), the KOOS quality of life (KOOSQOL) and the KGLRS scale assessing the effects of knee

pain and arthritis on function. We used the matched set of TKR and non-TKR knees in the OAI

database for these evaluations. For some scales, higher scores represent worse outcomes

(WOMKP, WOMADL, KGLRS) and for others, higher scores represent better outcomes (PASE,

KOOS, SF-12). Most patterns we found were as expected. For both Control and TKR subjects,

worse baseline WOMAC Knee pain (XWOMKP) was significantly (p<.0001) associated with worse

scores for physical function (ADL, KOOS, and KGLRS) (Table 1).

Figure 1. Distributions of WOMAC and Estimated WOMAC

II. Estimation of a WOMAC knee pain score using results from the KSS

Both the OAI and MOST data sources had data for WOMAC knee pain but the NEBH and

TMC data sources did not. In order to create a common WOMAC or WOMAC-like knee pain

variable across data sources, we constructed a new variable based on the KSS data available in the

NEBH and TMC datasets. Different versions of the KSS were used for NEBH versus TMC so we used

different approaches to estimate a WOMAC score for each. We examined how the WOMAC scale

was constructed and used that information to estimate a WOMAC score from KSS data. Figure 1

shows the distributions of the WOMAC and resulting estimated WOMAC from each of the four

studies stratified by timing (baseline versus follow-up).

Creation of a WOMAC Pain score from the NEBH version of KSS

We used the OAI database to establish and explore relationships between the WOMAC

pain score, based on five components, and the estimated WOMAC score based on fewer

components that would be captured in a KSS.

The KSS captures data on walking, stairs, and rest. Keeping similar weighting as the

WOMAC, we developed the following mapping of the KSS to make an estimated WOMAC

instrument.

Table 2. NEBH WOMAC/KSS Mapping Schemes

| WOMAC Scale | KSS Scale (NEBH) | Walking: KSS score | Walking: Estimated WOMAC component score | Stairs: KSS score | Stairs: Estimated WOMAC component score | Rest: KSS score | Rest: Estimated WOMAC component score |
|---|---|---|---|---|---|---|---|
| None (0) | None | 35 | 0 | 15 | 0 | 0 | 0 |
| Mild (1) | Mild/Occasional | 30 | 1 | 10 | 1 | -5 | 3 |
| Moderate (2) | Moderate | 15 | 2 | 5 | 2 | -10 | 6 |
| Severe (3), Extreme (4) | Severe | 0 | 3.5 | 0 | 3.5 | -15 | 10.5 |

WOMAC is a sum of 5 pain scores on a 0 to 4 point scale (Stairs + Walking + In Bed + Sit/Lie

down + Standing), where high numbers mean high pain. The KSS is a sum of 3 pain scores, each

with its own scale (Stairs + Walking + Rest), where low numbers mean high pain.

Both scales give more ‘weight’ for rest pain than stair or walking pain. The KSS weights

stair pain as worse than walking pain unless it is severe, in which case they are the same. WOMAC weights

stair pain and walking pain the same and distinguishes different types of rest pain.

Our estimated WOMAC pain score weights stair and walking pain the same, as with the

WOMAC. The range is 0 to 17.5 (rather than 0 to 20). It assumes that the (“KSS Pain at Rest” x 3) is

the same as (WOMAC Pain in Bed + WOMAC Pain Sitting/Lying down + WOMAC Pain Standing).
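
The mapping in Table 2 amounts to a lookup followed by a sum. The Python sketch below shows one way to write it. The score values are taken from Table 2, while the dictionary and function names are illustrative and not part of the project software.

```python
# KSS component score -> estimated WOMAC component score (values from Table 2).
WALKING = {35: 0.0, 30: 1.0, 15: 2.0, 0: 3.5}
STAIRS = {15: 0.0, 10: 1.0, 5: 2.0, 0: 3.5}
# Rest already stands in for three WOMAC items (in bed, sitting/lying down,
# standing), so each value is three times the single-item score.
REST = {0: 0.0, -5: 3.0, -10: 6.0, -15: 10.5}

def estimated_womac_pain(kss_walking: int, kss_stairs: int, kss_rest: int) -> float:
    """Estimated WOMAC pain on the 0 to 17.5 scale described in the text."""
    return WALKING[kss_walking] + STAIRS[kss_stairs] + REST[kss_rest]

print(estimated_womac_pain(35, 15, 0))   # no pain on any item: 0.0
print(estimated_womac_pain(0, 0, -15))   # severe pain on all three items: 17.5
print(estimated_womac_pain(30, 5, -5))   # mild walking, moderate stairs, mild rest: 6.0
```

A KSS value outside the tabulated categories raises a `KeyError`, which is deliberate in this sketch so that unexpected codes are not silently scored.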

Creation of a WOMAC Pain score from the TMC version of KSS

The Tufts database used an older version of the KSS and presented the biggest challenge

for the pain score outcome. We asked members of the research team to use the questions

collected on the Tufts KSS and review the WOMAC scale and weightings and try to score items

that they believed would approximate the WOMAC. We then calculated scores and plotted

distributions at baseline and follow-up. After reviewing the distribution of scores at baseline and

follow-up the team decided to use the version shown in Table 3 which looked most like what was

seen in the NEBH, the OAI, and MOST databases.

Table 3. TMC WOMAC/KSS Mapping Schemes

| KSS Pain | KSS Walk (I. Walking) | Estimated WOMAC component score | KSS Stairs (Stairs) | Estimated WOMAC component score |
|---|---|---|---|---|
| None | Unlimited, > 10 Blocks, 5 - 10 Blocks | 0 | Normal Up & Down, Normal Up; Down with Rail; Up & Down with Rail | 0 |
| | < 5 blocks, Housebound, Unable | 0.5 | Up with Rail; Unable | 0.5 |
| Mild or Occasional | Unlimited, > 10 Blocks, 5 - 10 Blocks | 1 | | |
| | < 5 blocks, Housebound, Unable | 1.5 | Up with Rail; Unable | 1.5 |
| Mild or Occasional, Stairs Only | | 2 | | |
| | < 5 blocks, Housebound, Unable | 2.5 | Up with Rail; Unable | 2.5 |
| Mild or Occasional, Walking & Stairs | | 2 | | |
| | < 5 blocks, Housebound, Unable | 2.5 | Up with Rail; Unable | 2.5 |
| Moderate Occasional | | 3 | | |
| | < 5 blocks, Housebound, Unable | 3.5 | Up with Rail; Unable | 3.5 |
| Moderate Continual | | 4 | | |
| | < 5 blocks, Housebound, Unable | 4.5 | Up with Rail; Unable | 4.5 |
| Severe | -- | 5 | -- | 5 |

In summary, we rescaled the estimated WOMAC pain scores to a 0 to 100 scale in our

matched datasets. The distributions of scores pre- and post-TKR were similar to those observed in

the MOST and OAI databases. Also, we looked at follow-up scores and controlled for baseline

scores, which were based on the same scale within studies. We reviewed pain scores between

controls and matched TKR subjects within each study and did not see any gross inconsistencies.

While we are aware that our methods are not validated, we believe they are adequate and

reasonable based on patient and clinician stakeholder and research team input.
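
One simple way to put scores with different native ranges on a common 0 to 100 scale is a linear (min-max) rescaling. The Python sketch below assumes that form, which the report does not spell out, and uses the 0 to 17.5 range of the NEBH estimated score as an example. The function name is hypothetical.

```python
def rescale_0_100(score: float, lo: float, hi: float) -> float:
    """Linearly map a score from its native range [lo, hi] onto 0 to 100."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return 100.0 * (score - lo) / (hi - lo)

# Assumed example: an NEBH estimated WOMAC pain score of 6.0 on the 0 to 17.5 range.
print(round(rescale_0_100(6.0, 0.0, 17.5), 1))
# A standard WOMAC pain score of 10 on the 0 to 20 range maps to the midpoint.
print(rescale_0_100(10.0, 0.0, 20.0))  # 50.0
```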

APPENDIX D: The process for creating predictive models for the pain and function outcomes

Model Development

Initial variable selection was done (development and updating

phases) using a stepwise selection process in one dataset with one observation per knee created

from averaging 10 multiply-imputed copies of the OAI database (and later the pooled OAI and

MOST databases). The candidate variables and interactions included in the selection process were

chosen from those available in the OAI and MOST databases that stakeholders and the project

team considered important and plausible. The selected variables then were used to make a

separate model for each of the individual imputed datasets, and we combined the results from

these 10 models to get the parameter estimates and associated standard errors that accounted

for both variation in the data and the amount of missingness. If variables in the model were no

longer significant at the p < 0.10 level, then they were removed using a backward selection

process.

Model Validation

Performance of the linear regression models derived on the OAI

matched database was tested on the MOST dataset. We looked at scatter plots of predicted

outcomes (based on the equation from the OAI model) versus true 1-year outcomes in the MOST

database, and also r-square values. We did this for the pooled data and stratified by treatment

status.
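
External validation of this kind amounts to applying a fixed equation, unchanged, to rows from a second dataset and comparing predicted with observed outcomes. The Python sketch below is a minimal illustration. The coefficients, variable names, and rows are invented, and r-square is computed here as the squared Pearson correlation, which is one common definition and may differ from the calculation used in this project.

```python
from math import sqrt
from typing import Dict, List

def predict(coefs: Dict[str, float], row: Dict[str, float]) -> float:
    """Apply a fixed linear equation: intercept plus the sum of beta * x."""
    return coefs["intercept"] + sum(
        b * row[k] for k, b in coefs.items() if k != "intercept")

def r_squared(observed: List[float], predicted: List[float]) -> float:
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    sxy = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    sxx = sum((o - mo) ** 2 for o in observed)
    syy = sum((p - mp) ** 2 for p in predicted)
    return (sxy / sqrt(sxx * syy)) ** 2

# Hypothetical equation from a derivation dataset ...
coefs = {"intercept": 5.0, "baseline_pain": 0.5, "tkr": -10.0}
# ... applied unchanged to hypothetical rows from a validation dataset.
rows = [
    {"baseline_pain": 40, "tkr": 1, "pain_1yr": 12},
    {"baseline_pain": 60, "tkr": 0, "pain_1yr": 38},
    {"baseline_pain": 20, "tkr": 0, "pain_1yr": 18},
    {"baseline_pain": 80, "tkr": 1, "pain_1yr": 30},
]
preds = [predict(coefs, r) for r in rows]
obs = [r["pain_1yr"] for r in rows]
r2 = r_squared(obs, preds)
print(preds, round(r2, 3))
```

Stratifying by treatment status means repeating the same calculation separately on the rows with `tkr` equal to 1 and to 0.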

Model Updating

The MOST and OAI databases were then pooled together (10

imputed copies of each) and the beta coefficients from the validated model were re-estimated on

the larger pooled database. We then re-explored adding additional interactions of variables with

the treatment indicator variable and removing variables for statistical, clinical, or

pragmatic reasons. These final models were called the P1 and F1 models for pain and function.

Re-derivation

A project objective was to make patient-specific predictions accounting for

their individual characteristics. For a statistical model, this implies interactions of variables

representing patient characteristics with those representing the treatment. To better screen for

interactions, we used a larger dataset that included OAI and MOST data, and also the NEBH and

TMC matched datasets. One tradeoff with using these latter data sources is that they had fewer

candidate variables to draw upon. The model development process we used for the pooled data

from the four data sources was similar to that used for the initial model development. The final models

created from the pooled data from the four data sources were called the P2 and F2 models.
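
To see why treatment-by-characteristic interactions are what make the predicted benefit patient-specific, consider the Python sketch below. All coefficients and variable names are invented for illustration. They are not the P2 or F2 estimates, and the real models include other terms.

```python
from typing import Dict

def predicted_pain(baseline_pain: float, age: float, tkr: int,
                   beta: Dict[str, float]) -> float:
    """Linear prediction of 1-year pain with one treatment interaction."""
    return (beta["intercept"]
            + beta["baseline_pain"] * baseline_pain
            + beta["age"] * age
            + beta["tkr"] * tkr
            + beta["tkr_x_baseline_pain"] * tkr * baseline_pain)

def treatment_effect(baseline_pain: float, age: float,
                     beta: Dict[str, float]) -> float:
    """Predicted pain with TKR minus predicted pain without TKR."""
    return (predicted_pain(baseline_pain, age, 1, beta)
            - predicted_pain(baseline_pain, age, 0, beta))

# Hypothetical coefficients (lower pain scores are better).
beta = {"intercept": 10.0, "baseline_pain": 0.5, "age": 0.25,
        "tkr": -5.0, "tkr_x_baseline_pain": -0.5}

print(treatment_effect(20.0, 60.0, beta))  # -15.0
print(treatment_effect(70.0, 60.0, beta))  # -40.0
```

If the interaction coefficient were zero, both patients would get the same predicted difference of -5, so the model could not distinguish patients likely to benefit more from those likely to benefit less.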

APPENDIX E: User Interface Testing

User interface testing included three components:

1. Entering demographic data and completing questionnaires to provide the information

needed for the predictive models.

2. Interpreting predictive model results through data displays including:

a. Data Table – Identifying current pain and function scale results and identifying

predicted pain and function outcomes with surgical or nonsurgical treatment.

b. Bar Charts – Understanding the meaning of scales, identifying current and

predicted outcomes, and understanding uncertainty.

c. Combined pain and function plot – Understanding the meaning of a data point,

identifying current and predicted outcomes, and comprehending the concept of

uncertainty around the predictions.

3. Determining user understanding of the predictive models, mathematical equipoise and

clinical trial randomization through case-based discussions.

Usability testing included a “think aloud” protocol and usability testing script. The former

involved asking users to accomplish a series of tasks including completing questionnaires and

describing their interpretation of the predicted outcomes. This allowed us to understand

participants’ thoughts as they used the application and to identify aspects of the tool that were

unclear. Any questions or problems were noted to improve future versions of the application. The

latter allowed determination of users’ understanding and interpretation of the current and

predicted outcome results page.

The usability testing was conducted in two stages. Both versions of the Usability Testing

Plan are provided on the following pages.

Version 1 - Usability and Cognitive Testing Plan

This version of the usability testing plan was used for the initial testing conducted with a

diverse group of research institute staff and members of our patient and clinician stakeholder

panels.

Introduction: Imagine that you (or someone you know) has knee osteoarthritis and is

contemplating different options, including surgical and nonsurgical treatments. Now imagine that

your clinician asks you to complete these questions in advance of the appointment [or possibly,

you will complete the questions together with your clinician during that appointment]. Next, you

and your clinician are going to make some decisions about the next steps for how to treat your

knee osteoarthritis.

Please complete this online questionnaire, which will take about 15 minutes. We expect

the total time will be about 1 hour for both completing the questionnaire and answering some

questions about using this application.

Note: When you see the term “physical function,” it refers to how your health affects your

ability to perform activities that you might do during a typical day.

Part 1: Questionnaire Usability: Please follow the prompts and instructions on the screen.

The bar at the top will show your progress, and you can use the arrows to go backwards and

forwards. Some of the pages require you to scroll down. You need to complete all of the questions

on a page before you can go onto the next page. If you have any questions while completing the

forms, please ask me. I will record your questions, so we can improve future versions of this

questionnaire. Please think aloud while you complete the questionnaire.

A. If you need help on this page, what would you do?

Part 2: Usability of Numeric Predictions: The predictive model built into the application

used information that you entered to calculate the current level of knee pain and physical

functioning. It also calculated four predictions about the level of knee pain and physical

functioning at 1 year. [This includes knee pain without surgery, knee pain with surgery, physical

function without surgery, and physical function with surgery].

[The patient is directed to the screen displaying the bar graphs for current and predicted

pain and function with and without surgical treatment.]

A. Expectations

Physical Function:

I have a few questions before we look at all of your results. [Research Assistant

turns the computer away and writes out the current scores on a piece of paper along with

the scale and later the predicted scores.]

o What bothers you more, knee pain or physical function? [Ask about whichever one

bothers the patient more first.]

o Your current physical function score is __ on a scale of 0 to 100 where 0 is poor

function and 100 is excellent function.

o In a year, would you expect your physical function with usual care to be higher or

lower than your current physical function?

o What activities would you hope to be able to do after a year of usual care?

o If you were to have knee replacement surgery, would you expect your physical

function to be higher or lower in a year?

o What activities would you hope to be able to do a year after knee replacement

surgery?

Knee Pain:

o Your current knee pain score is ___ on a scale of 0 to 100 where 0 is no pain and

100 is extreme pain. In a year, would you expect your knee pain with usual care to

be higher or lower than your current knee pain? If you were to have knee

replacement surgery, would you expect your knee pain to be higher or lower in a

year than your current knee pain?

B. Usability of Graphical Depiction of Predictions

Interpret the Bar graphs and Legend:

o Just looking at the bar graph for current pain, where on the bar graph would knee

pain be worse?

o Where would it be better?

o Would your friends (or the people that you know) understand where severe knee

pain and where mild knee pain are on the graph?

o Brief description of uncertainty: Just like with a weather forecast, there is some

uncertainty for any prediction. When you hear that there is a 60% chance of

showers, you know there is some uncertainty around that number. The most likely

prediction (the average) is shown by the height of the color in the bar. 95 out of

100 people who answered like you would have knee pain predictions within the

colored area. Please take a look at this graph.

[On the top next to the pain tab, click on the function tab.]

o On the current function graph, where on the graph would physical function be

worse?

o On the current function graph, where on the graph would physical function be

better?

o Would your friends (or the people that you know) understand where poor function

and where excellent function are on the graph?

o What does the colored area around the value marked in the bar graph mean to

you?

o Does it convey anything to you about uncertainty or error?

o [If the user does not think that the colored area conveys uncertainty, then ask this

question.] If the colored area around the value in the bar graph does not convey

uncertainty, what would convey uncertainty to you for this bar chart?

o Given the bar graphs, do they help you interpret the results? [We will show the bar

graph for pain and the bar graph for function separately.]

Interpreting combined pain and function plot:

o [The Research Assistant first moves to the Combined Knee Pain and Function

Graph, and walks through the different layers that we can add on the graph, using

the text below.]

o The purple star on this graph displays your current pain level with your current

physical function level.

o The blue diamond shows the predictions for your knee pain and physical function

at one year with usual care. Just like the line on the bar graph, the blue colored

cloud represents the uncertainty around the predictions for knee pain and physical

function at one year with usual care.

o The green circle shows your predicted knee pain and physical function for one year

after knee replacement. Just like the line on the bar graph, the green colored cloud

represents the uncertainty around the knee pain and physical function prediction

one year after knee replacement surgery.

o Brief description of uncertainty: There is some uncertainty for any (model)

prediction. If we give you a prediction of 32, there is some probability that the

actual outcome will be higher or lower than that. On the graph, we are showing

knee pain and physical function predicted together. 9 out of 10 people like you

would fall within the colored area around the prediction. Please take a look at this

diagram/graph.

o The overlap between the dotted blue and green clouds shows that some people

with either usual care or knee replacement have the same predicted knee pain and

physical function outcomes.

o If the current value (purple star) is within the blue dotted cloud around the usual

care prediction, then this means that there is a chance that with usual care after

one year this individual will still be at the same level of knee pain and physical

function.

o Can you point to your current level of pain and function?

o Where is your prediction if you do have surgery?

o Where is your prediction if you have usual care?

o What are your likely outcomes if you do have surgery? Would you get better,

worse, or do about the same?

o Given the uncertainty oval, how confident are you about that?

o [If the current score is within the uncertainty shape, ask this question.] What does

it mean that your current score is within the dashed shape? [We are looking for an

articulation of whether the graph is unclear and whether more explanation is

needed. E.g. Can the user realize that his/her current score is in the surgical circle?

Perhaps, it could come out that the scores would be no better off with surgery.]

o What are your likely outcomes if you do not have surgery? Would you get better,

worse, or do about the same?

o Given the uncertainty shape, how confident are you about that?

o [If the current score is within the uncertainty circle, ask this question.] What does it

mean that your current score is within the dashed shape? [We are looking for an

articulation of whether the graph is unclear and whether more explanation is

needed. E.g. Can the user realize that his/her current score is in the surgical circle?

Perhaps, it could come out that the scores would be no better off with surgery.]

o How do you interpret the colored shape around the prediction?


o [Researcher points outside of the shape.] Is this a likely outcome? [No, would be

the correct answer. It is possible but not likely.]

o You picked the (surgical/nonsurgical) treatment option before, would you still

choose (surgery/ nonsurgical treatment)? OR (You were not sure which treatment

option that you would choose before, are you still undecided?)

C. Current Score

Function:

o [Please look at the table on the left and tell me] what is your current physical

function score?

o Would you say that your current score suggests your physical function is excellent,

very good, good, fair, or poor?

Pain:

o What is your current knee pain score?

o Would you say that your knee pain score suggests that your knee pain is none, mild,

moderate, severe, or extreme?

D. Predicted Scores

Physical function:

o What is your predicted 1-year physical function score with usual care?

o What is your predicted 1-year physical function score with surgery?

o Compared to your current physical function score, would you say that you would

be better, worse, or the same in your physical functioning in one year with usual

care?

o Compared to your current physical function score, would you say that you would

be better, worse, or the same in your physical functioning in one year with

surgery?

Knee Pain:

o What is your predicted 1-year knee pain score with usual care?

o What is your predicted 1-year knee pain score with surgery?

o Compared to your current knee pain score, would you say that your knee pain

would be better-off, worse-off, or the same with usual care?

o Compared to your current knee pain score, would you say that your knee pain

would be better-off, worse-off, or the same with surgery?

Interpretation:

o Based on this information about surgical or nonsurgical treatment options, does

one option look better than the other for improving knee pain?

o Does one option look better than the other for improving your physical function?

o Is the improvement what you would have expected?

o Which treatment option would you choose given all four predictions?

o What do you understand about the improvement that makes you select it?

[The patient is directed to the screen displaying the bar graphs of pain and

function.]

Part 3: Equipoise

1. Please look at the three scenarios below depicted from the results page.

o Let’s imagine that a clinician is telling you about a research study with two options

(surgical or nonsurgical treatment of knee OA). If you choose to participate in the

study, you will be randomized to the surgical or nonsurgical treatment with equal

chances of each treatment being the one you receive.

o Looking at figure 1, would you choose randomization?



o What’s the amount of overlap that suggests your future pain and function will be

about the same with or without surgery?

[Figure panels: Small Overlap, Moderate Overlap, Large Overlap]

Additional possible questions to add about the knee pain and physical function graphs

to test for understanding:

Order:

o Does the order of Knee Pain results, then Physical Function results, and ultimately the

results with Knee Pain and Physical Function together provide a logical and helpful

organization of the information?

Concluding Questions:

o If you were unable to make a decision about treatment options after viewing this tool,

what further information would you need?

o Did this questionnaire aid your decision-making process in choosing between

nonsurgical and surgical treatment for knee OA?

o Are there any other parts of this results page that you found unclear or confusing that

you have not already mentioned?

o Are there any further changes that you would recommend for this questionnaire?

Version Two - Usability and Cognitive Testing Plan

This version of the usability testing plan was used for the final testing conducted with

patients in the rheumatology, orthopedic, and primary care clinics.

Introduction: Imagine that you (or someone you know) has knee osteoarthritis and is contemplating

different options, including surgical and nonsurgical treatments. Now imagine that your clinician

asks you to complete these questions in advance of the appointment [or possibly, you will

complete the questions together with your clinician during that appointment]. Next, you and your

clinician are going to make some decisions about the next steps for how to treat your knee

osteoarthritis.

Note: When you see the term “physical function,” it refers to how your health affects your

ability to perform activities that you might do during a typical day.

Part 1: Questionnaire Usability

Please follow the prompts and instructions on the screen. The bar at the top will show

your progress, and you can use the arrows to go backwards and forwards. Some of the pages

require you to scroll down. You need to complete all of the questions on a page before you can go

onto the next page. If you have any questions while completing the forms, please ask me. I will

record your questions, so we can improve future versions of this questionnaire.

Part 2: Usability of Graphical Depiction of Predictions

Interpret the Bar graphs and Legend

o Tell me what you think this is telling you.

o What does the colored area around the value marked in the bar graph mean to you?

o (Based on the information displayed, what does this tell you about the level of knee

pain at one year with usual care and with knee replacement?)

o (Based on the information displayed, what does this tell you about the level of

physical function at one year with usual care and with knee replacement?)


Interpreting combined pain and function plots:

o What does this tell you?

o (Based on the information displayed, what does this tell you about the outcomes at

one year with usual care and with knee replacement?)

o What are your likely outcomes if you do have surgery? Would you get better, worse,

or do about the same?

o Given the uncertainty oval, how confident are you about that?

o What are your likely outcomes if you do not have surgery? Would you get better,

worse, or do about the same?

o (Based on the information displayed, what does this tell you about the outcomes at

one year with usual care and with knee replacement?)

o Given the uncertainty shape, how confident are you about that?


Concluding Questions:

o Are there any other parts of this results page that you found unclear or confusing that

you have not already mentioned?

o Only ask if the patient is eligible for randomization as determined by the model with

the orange bar in the application itself. Imagine that there was a research study in

which you were eligible to participate. The study is testing two different treatments for

knee osteoarthritis: nonsurgical treatment or knee replacement. Considering that the

model predicts that you would benefit from either knee replacement surgery or

nonsurgical treatment, would you be willing to be randomized to a study in which you

would have equal chances of you being assigned to either total knee replacement or

to nonsurgical treatment? If no, why not? If yes, why would you choose to participate?


Figure 1a. Consort Diagram - OAI OAI [May 2014]

4796 Patients

EXCLUDED n pts Patients not in OAI 'incidence' or 'progression' cohort 122

Patient with no follow-up or outcome data 254

Patients with TKR in both knees >90 days apart 39

Patients with TKR prior to OAI entry 2

4379 Patients 8713 Knees

EXCLUDED n Knees (non-TKR) from patients who had TKR on contralateral knee 2

Knees with TKR that had no pre-TKR visit within one year of the TKR 2 Knees with TKR that had no post-TKR visit 6-60 months post-surgery 4

CONTROL (non-TKR) Sample: 4049 Patients, 8095 Knees

TKR Sample: 253 Patients, 278 Knees **

**note: For patients with bilateral TKR (or TKR on both knees close in time), only 1 knee was used. If bilateral, choice was random; if two close in time, data from first TKR was used.

MATCH TKR SAMPLE WITH CONTROL SAMPLE [KNEE VISITS]

Matching Variables: Age (<55, 55-65, >65); Gender; WOMAC Pain + Disability (Riddle based) on incident knee; WOMAC Pain + Disability (Riddle based) on contralateral knee; Location (Riddle Category); K-L (Riddle): moderate/severe vs. not; SF 12 (<44, 44-56, >56); Charlson (0, 1, >2, missing); Change in WOMAC Pain from Prior Visit (>=2 points vs. not)

EXCLUDED: 1 TKR Patient/Knee that could not be matched to Control knee-visit

CONTROL: 252 Knees; TKR: 252 Knees

APPENDIX F


Figure 1b. Consort Diagram - MOST

Most [January

3026 Patients EXCLUDED n

Patients with TKR in both knees >90 days apart 69 2957 Patients 5914 Knees EXCLUDED n

Knees with TKR that had no pre-TKR visit within one year of the

221 OR Knees with TKR that had no post-TKR visit 6-60 months post-

No Post-Surgery WOMAC 133 Knees (non-TKR) from patients who had TKR on contralateral

115

CONTROL (non-TKR) Sample TKR Sample 2652 Patients 2652 Patients 5071 Knees 5071 Knees **note: For patients with bilateral (or TKR on both knees close in time only 1 knee

was used. If bilateral, choice was random; if two close in time, data from first TKR was used. MATCH TKR SAMPLE WITH CONTROL SAMPLE [KNEE VISITS] Matching Variables: Age (<55, 55-65, >65) Gender EXCLUDED n

WOMAC Knee pain [0-20 scale] (0-3, 4-9, 10-20, missing) 1 TKR Patient/Knee that could not be matched to Control knee-

0 WOMAC contralateral knee pain (0-2, 3-8, 9-20, missing)

SF 12 ( <44 , 44-56, >56) Charlson (0, 1, >2, missing) Change in WOMAC Pain from Prior Visit (>=2 points vs. not)

CONTROL CONTROL 154 Knees 154 Knees


Figure 1c. Consort Diagram - NEBH

NEBH

(December 2014) 5519 Subjects EXCLUDED n

Knee OA diagnosis 5205 Prior Surgery No Follow-up information/Coding Issues 314 Subjects

CONTROL (non-TKR) Sample TKR Sample 2652 Patients (5071 Knees)

314 Subjects

4049 Patients (8095 Knees)

314 Knees **note: For patients with bilateral (or TKR on both knees close in time only 1 knee was used. If bilateral, choice was random; if two close in time, data from first TKR was

MATCH TKR SAMPLE WITH CONTROL SAMPLE [KNEE VISITS] Matching Variables: Age (<55, 55-65, >65) Gender EXCLUDED n

WOMAC Knee pain [0-100] (11-50, 51-75, 75-100, missing) Missing Follow-up information 66 WOMAC contralateral knee pain [0-100] (11-50, 51-75, 75-100, missing) SF 12 ( <44 , 44-56, >56) Charlson (0, 1, >2, missing)

CONTROL TKR 248 Knees 248 Knees


Figure 1d. Consort Diagram - TMC Tufts Medical Center [July 2015]

535 Subjects EXCLUDED n= 99 pts

Patients with TKR in both knees > 90 days apart

.

Patients with TKR revision as their primary procedure Patients with TKR due to rheumatoid arthritis Patients without TKR

436 Subjects EXCLUDED n= 319 pts Patients with TKR that had no post-TKR visit 6-60 months post-surgery

Patients with TKR in both knees without an available one year follow-up for the first TKR before the second TKR

Patients who did not clearly meet the inclusion criteria (2)

Patients with unavailable medical records Duplicate or ambiguously identified patients or those with inconsistent or missing data

Patients that, due to limited resources, we were unable to collect additional data to supplement the original registry data

117 Subjects

EXCLUDED n= 21 pts: Issues with TKA classification and miscoded

**note: For patients with bilateral TKR (or TKR on both knees close in time), only 1 knee was used. If bilateral, choice was random; if two close in time, data from first TKR was used.

96 Subjects

CONTROL (non-TKR) Sample TKR Sample 2652 Patients (5071 Knees) [MOST] 96 Subjects

4049 Patients (8095 Knees) [OAI] 96 Knees

MATCH TKR SAMPLE WITH CONTROL SAMPLE [KNEE VISITS] Matching Variables: Age (<55, 55-65, >65) Gender EXCLUDED n= 25 pts WOMAC Knee pain [0-100] (11-50, 51-75, 75-100, missing) Missing Follow-up

SF 12 ( <44 , 44-56, >56) Charlson (0, 1, >2, missing)

CONTROL TKR 72 Knees

72 Knees


Table 1a.i - Comparison of Distribution of Variables used for Matching – OAI database

Variable | TKR | Non-TKR
MATCHING VARIABLES
Age riddle category | N=252 | N=252
  1. <55 | 7.9% (20) | 7.5% (19)
  2. 55-65 | 32.5% (82) | 32.9% (83)
  3. >65 | 59.5% (150) | 59.5% (150)
Gender (%, fraction male) | 42.9% (108/252) | 43.7% (110/252)
WOMAC riddle based_sum pain+dis | N=252 | N=252
  1. slight: p+d <=11 | 8.3% (21) | 9.1% (23)
  2. mod: p+d 12-22 | 18.3% (46) | 19.4% (49)
  3. intense: p+d 23-33 | 25.0% (63) | 24.6% (62)
  4. severe: p+d >33 | 46.0% (116) | 44.4% (112)
  99. N/A | 2.4% (6) | 2.4% (6)
WOMAC cat on contralat | N=252 | N=252
  1. slight: p+d <=11 | 54.8% (138) | 54.8% (138)
  2. mod: p+d 12-22 | 18.7% (47) | 18.7% (47)
  3. intense: p+d 23-33 | 12.3% (31) | 12.3% (31)
  4. severe: p+d >33 | 12.7% (32) | 12.7% (32)
  99. N/A | 1.6% (4) | 1.6% (4)
Location riddle category | N=252 | N=252
  0. Unicomp tibio-fem | 1.2% (3) | 2.8% (7)
  1. Unicomp+ patello-fem | 47.6% (120) | 46.4% (117)
  2. Tricompart | 1.2% (3) | 0.8% (2)
  99. N/A | 50.0% (126) | 50.0% (126)
K-L riddle category | N=252 | N=252
  0. not mod/sev | 20.2% (51) | 21.8% (55)
  1. yes mod/sev | 29.8% (75) | 28.2% (71)
  99. N/A | 50.0% (126) | 50.0% (126)
Base SF-12 Physical groups | N=252 | N=252
  1. 0 to <43 | 50.4% (127) | 49.2% (124)
  2. >=43 to <56 | 23.4% (59) | 24.6% (62)
  3. >=56 | 2.0% (5) | 2.0% (5)
  99. N/A | 24.2% (61) | 24.2% (61)
Base Charlson Comorbidity score | N=252 | N=252
  0. none | 34.1% (86) | 34.9% (88)
  1. one | 11.1% (28) | 10.3% (26)
  2. two or more | 5.6% (14) | 5.6% (14)
  99. N/A | 49.2% (124) | 49.2% (124)
Delta WOMAC pain category (on 20pt scale) | N=252 | N=252
  0. no more than 1 point increase | 26.6% (67) | 27.0% (68)
  1. increase of at least 2 points | 64.7% (163) | 64.3% (162)
  99. N/A | 8.7% (22) | 8.7% (22)


Table 1a.ii - Comparison of Distribution of Variables used for Matching – MOST database

Variable | TKR | Non-TKR
Age (riddle category) | N=154 | N=154
  01. <55 | 8.4% (13) | 8.4% (13)
  02. 55-65 | 40.3% (62) | 42.9% (66)
  03. >65 | 51.3% (79) | 48.7% (75)
Gender (%, fraction male) | 31.2% (48/154) | 30.5% (47/154)
Base WOMAC knee pain groups | N=154 | N=154
  01. 0 to 3 | 5.8% (9) | 5.8% (9)
  02. 4 to 9 | 39.6% (61) | 40.3% (62)
  03. 10-20 (max) | 25.3% (39) | 24.7% (38)
  99. N/A | 29.2% (45) | 29.2% (45)
Base WOMAC contra knee pain groups | N=154 | N=154
  01. 0 to 2 | 26.0% (40) | 26.0% (40)
  02. 3 to 8 | 35.7% (55) | 35.7% (55)
  03. 9-20 (max) | 9.1% (14) | 9.1% (14)
  99. N/A | 29.2% (45) | 29.2% (45)
Base WOMAC disability score groups | N=154 | N=154
  01. 0 to 17 | 7.8% (12) | 8.4% (13)
  02. 18 to 32 | 39.0% (60) | 40.9% (63)
  03. >=33 | 21.4% (33) | 18.8% (29)
  99. N/A | 31.8% (49) | 31.8% (49)
Base SF-12 Physical groups | N=154 | N=154
  01. 0 to 43 | 51.3% (79) | 49.4% (76)
  02. 44 to 56 | 13.6% (21) | 16.2% (25)
  03. >=57 | 0.6% (1) | . (.)
  99. N/A | 34.4% (53) | 34.4% (53)
Base Charlson Comorbidity score | N=154 | N=154
  00. None | 61.7% (95) | 61.0% (94)
  01. One | 26.0% (40) | 24.7% (38)
  02. Two or more | 12.3% (19) | 14.3% (22)
Delta WOMAC pain groups | N=154 | N=154
  00. Negative/Decrease | 7.1% (11) | 8.4% (13)
  01. No change or 1 point increase | 7.1% (11) | 6.5% (10)
  02. Increase of 2 or more points | 18.2% (28) | 17.5% (27)
  99. N/A | 67.5% (104) | 67.5% (104)


Table 1a.iii - Comparison of Distribution of Variables used for Matching – NEBH database

Variable | TKR | Non-TKR
Age riddle category | N=248 | N=248
  01. <55 | 18.1% (45) | 18.5% (46)
  02. 55-65 | 40.7% (101) | 40.7% (101)
  03. >65 | 41.1% (102) | 40.7% (101)
Gender (%, fraction male) | 41.9% (104/248) | 40.3% (100/248)
Base WOMAC knee pain groups | N=248 | N=248
  01. 0 to 10 | 0.4% (1) | 0.8% (2)
  02. 11 to 50 | 37.1% (92) | 39.5% (98)
  03. 51 to 75 | 44.8% (111) | 47.2% (117)
  04. 75 to 100 (max) | 14.5% (36) | 9.3% (23)
  99. N/A | 3.2% (8) | 3.2% (8)
Base WOMAC contralat knee pain groups | N=248 | N=248
  01. 0 to 10 | 35.5% (88) | 35.5% (88)
  02. 11 to 50 | 51.2% (127) | 50.8% (126)
  03. 51 to 75 | 8.1% (20) | 8.5% (21)
  04. 75 to 100 (max) | 2.0% (5) | 2.0% (5)
  99. N/A | 3.2% (8) | 3.2% (8)
Base SF-12 Physical groups | N=248 | N=248
  01. 0 to 43 | 69.0% (171) | 69.4% (172)
  02. 44 to 56 | 25.4% (63) | 25.0% (62)
  03. >=57 | 1.2% (3) | 1.2% (3)
  99. N/A | 4.4% (11) | 4.4% (11)
Base Charlson Comorbidity | N=248 | N=248
  00. None | 75.4% (187) | 75.4% (187)
  01. One | 16.9% (42) | 16.9% (42)
  02. Two or more | 6.5% (16) | 6.5% (16)
  99. N/A | 1.2% (3) | 1.2% (3)


Table 1a.iv - Comparison of Distribution of Variables used for Matching – TUFTS database

Variable | TKR (72) | Non-TKR (72)
MATCHING VARIABLES
Age Riddle category | N=72 | N=72
  01. <55 | 26.4% (19) | 26.4% (19)
  02. 55-65 | 33.3% (24) | 34.7% (25)
  03. >65 | 40.3% (29) | 38.9% (28)
Gender (%, fraction male) | 69.4% (50/72) | 69.4% (50/72)
Base WOMAC knee pain groups | N=72 | N=72
  02. 11 to 50 | 50.0% (36) | 55.6% (40)
  03. 51 to 75 | 40.3% (29) | 31.9% (23)
  04. 75 to 100 (max) | 1.4% (1) | 2.8% (2)
  99. N/A | 8.3% (6) | 9.7% (7)
Base SF-12 Physical groups | N=72 | N=72
  01. 0 to 43 | 47.2% (34) | 47.2% (34)
  02. 44 to 56 | 5.6% (4) | 5.6% (4)
  99. N/A | 47.2% (34) | 47.2% (34)
Base Charlson Comorbidity | N=72 | N=72
  00. None | 48.6% (35) | 48.6% (35)
  01. One | 19.4% (14) | 19.4% (14)
  02. Two or more | 30.6% (22) | 30.6% (22)
  99. N/A | 1.4% (1) | 1.4% (1)


Table 1b.i - Comparisons of subject characteristics in raw data – OAI

Variable TKR (n=252) Non-TKR (n=252) TKR minus non-TKR Delta

(∆) and [95% CI] Effect Size

mean +/- standard deviation (SD) or %(n) (∆/SD) Age in years 67.88 +/- 8.64 ( 252) 66.82 +/- 8.36 ( 252) 1.06[ -0.43 to 2.55] 0.12 BMI 29.80 +/- 4.60 ( 184) 30.39 +/- 4.66 ( 179) -0.59[ -1.55 to 0.36] -0.13 SF-12 Physical 38.75 +/- 9.17 ( 191) 39.82 +/- 9.31 ( 191) -1.07[ -2.93 to 0.79] -0.12 SF-12 Mental 55.57 +/- 8.64 ( 191) 54.39 +/- 9.18 ( 191) 1.18[ -0.61 to 2.98] 0.13 Depression Scale 7.47 +/- 7.13 ( 233) 7.99 +/- 6.98 ( 231) -0.52[ -1.81 to 0.77] -0.07 Back Pain 62.2% +/- 48.6% ( 188) 66.8% +/- 47.2% ( 193) ( 4.6%)[( 14.3%) to 5.0% ] -0.10 ADL/Disability 24.49 +/- 11.70 ( 247) 23.27 +/- 11.72 ( 247) 1.21[ -0.86 to 3.28] 0.10 WOMAC Total score 35.66 +/- 15.99 ( 246) 33.40 +/- 16.20 ( 246) 2.26[ -0.59 to 5.11] 0.14 WOMAC Pain subscores, 0=none, 4=extreme WOMAC Pain- Walking 1.77 +/- 0.98 ( 252) 1.46 +/- 0.90 ( 250) 0.32[ 0.15 to 0.48] 0.34 WOMAC Pain- Stairs 2.31 +/- 0.97 ( 248) 1.96 +/- 0.92 ( 248) 0.35[ 0.18 to 0.52] 0.37 WOMAC Pain- In Bed 1.14 +/- 1.14 ( 251) 1.05 +/- 1.11 ( 250) 0.09[ -0.11 to 0.29] 0.08 WOMAC Pain- Sit/Lie down 0.96 +/- 0.94 ( 251) 1.12 +/- 0.97 ( 250) -0.17[ -0.34 to -0.00] -0.18 WOMAC Pain- Standing 1.42 +/- 0.94 ( 250) 1.30 +/- 0.99 ( 250) 0.13[ -0.04 to 0.30] 0.13 WOMAC Knee pain (0-100) 38.00 +/- 18.92 ( 251) 34.41 +/- 19.70 ( 250) 3.59[ 0.20 to 6.98] 0.19 Contralateral Knee Pain (0-100) 15.05 +/- 17.30 ( 252) 15.67 +/- 16.59 ( 251) -0.62[ -3.59 to 2.35] -0.04 Hip Pain or Pain/Ache/Stiff 56.0% +/- 49.8% ( 191) 60.1% +/- 49.1% ( 193) ( 4.1%)[( 14.0%) to 5.8% ] -0.08 Homunculus (0 to 100%) 24.63 +/- 12.82 ( 155) 26.47 +/- 14.42 ( 164) -1.84[ -4.85 to 1.17] -0.13 Narcotics 14.3% +/- 35.1% ( 252) 9.6% +/- 29.5% ( 251) 4.7% [( 1.0%) to 10.4% ] 0.15 Charlson (approximate) N=128 N=128

0 67.2% ( 86) 68.8% ( 88) 1 21.9% ( 28) 20.3% ( 26) 2 7.0% ( 9) 9.4% ( 12) 3 3.9% ( 5) 1.6% ( 2)

Baseline Charlson_approx 0.48 +/- 0.79 ( 128) 0.44 +/- 0.73 ( 128) 0.04[ -0.15 to 0.23] 0.05 Follow-up (FU) FU SF-12 Physical 44.55 +/- 9.53 ( 129) 41.29 +/- 10.10 ( 171) 3.26[ 1.00 to 5.52] 0.33 FU WOMAC Knee Pain (0-100) 11.78 +/- 15.69 ( 252) 26.62 +/- 20.33 ( 252) -14.84[ -18.01 to -11.66] -0.82 Time from Baseline to FU (months) Median <q1-q3> (n) 22.6<13.1-24.1> ( 252) 12.0<11.1 -12.7>( 252) Mean +/-sd (n) 19.76 +/- 5.60 ( 252) 12.20 +/- 2.81 ( 252) 7.56[ 6.78 to 8.33] 1.71
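The Effect Size column in Tables 1b is labeled ∆/SD, but the table does not state which SD is used. The sketch below is an illustration only: it assumes the SD is the root mean square of the two group SDs, and it approximates the 95% CI with a normal multiplier of 1.96. The function names are hypothetical. Under these assumptions the Age row of Table 1b.i (67.88 +/- 8.64 vs 66.82 +/- 8.36, n=252 each) gives values close to the tabulated ones.

```python
import math


def pooled_sd(sd_a, sd_b):
    """Root mean square of two group SDs (assumed definition of 'SD' in delta/SD)."""
    return math.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)


def effect_size(mean_a, sd_a, mean_b, sd_b):
    """Standardized difference: (mean_a - mean_b) divided by the pooled SD."""
    return (mean_a - mean_b) / pooled_sd(sd_a, sd_b)


def approx_ci95(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Approximate 95% CI for the difference in means (normal approximation)."""
    delta = mean_a - mean_b
    se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return delta - 1.96 * se, delta + 1.96 * se


# Age row of Table 1b.i: TKR 67.88 +/- 8.64 (252), non-TKR 66.82 +/- 8.36 (252)
age_effect = effect_size(67.88, 8.64, 66.82, 8.36)
age_ci = approx_ci95(67.88, 8.64, 252, 66.82, 8.36, 252)
```

Small discrepancies from the published intervals are expected, since the report may have used a t-distribution or a different variance estimate.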


Table 1b.ii - Comparisons of subject characteristics in raw data – MOST

Variable TKR (n=154) Non-TKR (n=154) TKR minus non-TKR Delta


mean +/- standard deviation (SD) or %(n) (∆/SD) Age in years 65.49 +/- 6.88 ( 154) 65.21 +/- 7.47 ( 154) 0.28[ -1.33 to 1.89] 0.04 BMI 32.19 +/- 5.69 ( 106) 31.77 +/- 6.15 ( 107) 0.42[ -1.18 to 2.02] 0.07 SF-12 Physical 36.73 +/- 8.26 ( 101) 37.25 +/- 8.08 ( 101) -0.53[ -2.79 to 1.74] -0.06 SF-12 Mental 55.75 +/- 9.41 ( 101) 53.38 +/- 9.65 ( 101) 2.37[ -0.28 to 5.01] 0.25 Depression Scale 8.13 +/- 7.42 ( 97) 10.70 +/- 9.02 ( 96) -2.56[ -4.91 to -0.22] -0.31 Back Pain 71.1% +/- 45.5% ( 97) 81.3% +/- 39.2% ( 96) ( 10.1%)[(22.2%) to 2.0% ] -0.24 ADL/Disability 28.08 +/- 10.44 ( 105) 26.84 +/- 10.85 ( 105) 1.24[ -1.66 to 4.13] 0.12 WOMAC Total Score 40.07 +/- 14.43 ( 105) 37.93 +/- 15.02 ( 105) 2.14[ -1.86 to 6.15] 0.15 WOMAC Pain subscores, 0=none, 4=extreme WOMAC Pain- Walking 1.78 +/- 0.77 ( 109) 1.50 +/- 0.82 ( 109) 0.28[ 0.07 to 0.50] 0.36 WOMAC Pain- Stairs 2.51 +/- 0.87 ( 109) 2.18 +/- 0.94 ( 109) 0.33[ 0.09 to 0.57] 0.36 WOMAC Pain- In Bed 1.27 +/- 0.96 ( 109) 1.26 +/- 0.98 ( 109) 0.01[ -0.25 to 0.27] 0.01 WOMAC Pain- Sit/Lie down 0.81 +/- 0.87 ( 109) 0.90 +/- 0.80 ( 109) -0.09[ -0.31 to 0.13] -0.11 WOMAC Pain- Standing 1.50 +/- 0.82 ( 109) 1.54 +/- 0.90 ( 109) -0.05[ -0.28 to 0.18] -0.05 WOMAC Knee pain (0-100) 40.28 +/- 17.52 ( 109) 38.39 +/- 17.72 ( 109) 1.88[ -2.82 to 6.58] 0.11 Contralateral Knee Pain (0-100) 22.32 +/- 18.97 ( 109) 23.30 +/- 19.09 ( 109) -0.99[ -6.07 to 4.09] -0.05 Hip Pain or Pain/Ache/Stiff 46.4% +/- 50.0% ( 153) 61.4% +/- 48.8% ( 153) (15.0%)[(26.2%) to (3.9%)] -0.30 Homunculus (0 to 100%) 17.34 +/- 17.25 ( 154) 20.93 +/- 19.01 ( 154) -3.59[ -7.66 to 0.48] -0.20 Narcotics 20.8% +/- 40.7% ( 106) 14.0% +/- 34.9% ( 107) 6.7% [( 3.5%) to 17.0% ] 0.18 Charlson (approximate) N=154 N=154

0 61.7% ( 95) 61.0% ( 94) 1 26.0% ( 40) 24.7% ( 38) 2 11.0% ( 17) 9.1% ( 14) 3 1.3% ( 2) 2.6% ( 4)

4 . ( .) 0.6% ( 1) 5 . ( .) 0.6% ( 1) 7 . ( .) 1.3% ( 2)

Baseline Charlson_approx 0.52 +/- 0.74 ( 154) 0.66 +/- 1.15 ( 154) -0.14[ -0.35 to 0.08] -0.14 Follow-up (FU) FU SF-12 Physical 40.03 +/- 11.66 ( 133) 40.69 +/- 10.72 ( 137) -0.67[ -3.35 to 2.02] -0.06 FU WOMAC Knee Pain (0-100) 14.01 +/- 16.35 ( 154) 23.93 +/- 18.13 ( 154) -9.92[ -13.79 to -6.05] -0.05 Time from Baseline to FU (months) Median <q1-q3> (n) 36 < 17- 39> ( 154) 25 < 16- 37> ( 152) -0.57 Mean +/-sd (n) 30.05 +/- 11.63 ( 154) 26.72 +/- 10.64 ( 152) 3.33[ 0.82 to 5.84] 0.3


Table 1b.iii - Comparisons of subject characteristics in raw data – NEBH

Variable TKR (n=248) Non-TKR (n=248) TKR minus non-TKR Delta (∆) and [95% CI]

Effect Size

mean +/- standard deviation (SD) or %(n) (∆/SD) Age in years 63.33 +/- 8.52 ( 248) 63.28 +/- 8.85 ( 248) 0.05[ -1.48 to 1.58] 0.01 BMI 31.50 +/- 6.71 ( 244) 31.18 +/- 5.00 ( 233) 0.32[ -0.75 to 1.39] 0.05 SF-12 Physical 37.33 +/- 8.92 ( 237) 38.06 +/- 9.17 ( 237) -0.72[ -2.36 to 0.91] -0.08 SF-12 Mental 48.23 +/- 8.05 ( 237) 53.08 +/- 10.57 ( 237) -4.85[ -6.54 to -3.15] -0.52 Depression Scale 10.06 +/- 9.43 ( 238) Back Pain 75.6% +/- 43.0% ( 234) ADL/Disability 29.07 +/- 14.46 ( 229) WOMAC Total Score 43.05 +/- 19.85 ( 228) WOMAC Pain subscores, 0=none, 4=extreme WOMAC Pain- Walking 1.96 +/- 0.96 ( 241) . WOMAC Pain- Stairs 2.58 +/- 0.91 ( 236) WOMAC Pain- In Bed 1.70 +/- 1.30 ( 241) . WOMAC Pain- Sit/Lie down 1.46 +/- 1.09 ( 240) WOMAC Pain- Standing 1.90 +/- 1.02 ( 240) . WOMAC Knee pain (0-100) 56.07 +/- 19.02 ( 240) 48.41 +/- 22.67 ( 240) 7.67[ 3.91 to 11.42] 0.37 Contralateral Knee Pain (0-100) 21.11 +/- 22.23 ( 240) 23.99 +/- 21.84 ( 240) -2.89[ -6.84 to 1.06] -0.13 Hip Pain or Pain/Ache/Stiff 12.5% +/- 33.1% ( 248) 64.7% +/- 47.9% ( 238) (52.2%)[(59.5%) to(44.9%)] -1.27 Homunculus (0 to 100%) 30.22 +/- 17.05 ( 211) . Narcotics 8.5% +/- 27.9% ( 248) 15.4% +/- 36.1% ( 241) ( 6.9%)[( 12.6%) to ( 1.2%)] -0.21 Charlson (approximate) N=245 N=245 .

0 76.3% ( 187) 76.3% ( 187) 1 17.1% ( 42) 17.1% ( 42) . 2 6.1% ( 15) 2.4% ( 6) 3 0.4% ( 1) 2.0% ( 5) .

4 . ( .) 1.2% ( 3) 5 . ( .) 0.8% ( 2) .

Baseline Charlson_approx 0.31 +/- 0.60 ( 245) 0.37 +/- 0.85 ( 245) -0.07[ -0.20 to 0.07] -0.09 Follow-up (FU) FU SF-12 Physical 48.92 +/- 9.45 ( 228) 40.37 +/- 10.00 ( 152) 8.55[ 6.56 to 10.54] 0.88 FU WOMAC Knee Pain (0-100) 15.33 +/- 18.87 ( 248) 33.53 +/- 23.83 ( 247) -18.20[ -21.99 to -14.40] -0.85 Time from Baseline to FU (months) Median <q1-q3> (n) 12 < 12- 12> ( 248) 12.2 < 11.46- 22> ( 244) Mean +/-sd (n) 12.77 +/- 3.33 ( 248) 16.90 +/- 9.29 ( 244) -4.12[ -5.35 to -2.89] -0.59


Table 1b.iv- Comparisons of subject characteristics in raw data – TUFTS

Variable TKR (n=72) Non-TKR (n=72) TKR minus non-TKR Delta (∆) and [95% CI]

Effect Size

mean +/- standard deviation (SD) or %(n) (∆/SD) Age in years 62.6 +/- 10.0 ( 72) 61.8 +/- 8.1 ( 72) 0.75[ -2.23 to 3.73] 0.08 BMI 35.8 +/- 19.3 ( 65) 31.2 +/- 4.5 ( 49) 4.57[ -1.01 to 10.15] 0.31 SF-12 Physical 31.5 +/- 7.6 ( 38) 35.5 +/- 6.7 ( 38) -4.05[ -7.32 to -0.78] -0.57 SF-12 Mental 48.5 +/- 12.2 ( 38) 51.7 +/- 11.8 ( 38) -3.23[ -8.72 to 2.27] -0.27 Depression Scale not avail 11.0 +/- 8.9 ( 47) Back Pain not avail 81.3% +/- 39.4% ( 48) ADL/Disability not avail 31.1 +/- 9.9 ( 57) WOMAC Total Score not avail 44.6 +/- 12.8 ( 57) WOMAC Pain subscores, 0=none, 4=extreme WOMAC Pain- Walking not avail 1.9 +/- 0.8 ( 65) WOMAC Pain- Stairs not avail 2.5 +/- 0.8 ( 62) WOMAC Pain- In Bed not avail 1.6 +/- 0.9 ( 65) WOMAC Pain- Sit/Lie down not avail 1.5 +/- 0.8 ( 65) WOMAC Pain- Standing not avail 1.8 +/- 0.7 ( 65) WOMAC Knee pain (0-100) 48.0 +/- 13.1 ( 66) 47.2 +/- 14.5 ( 65) 0.80[ -3.98 to 5.58] 0.06 Contralateral Knee Pain (0-100) not avail 36.0 +/- 21.4 ( 65) Hip Pain or Pain/Ache/Stiff 9.9% +/- 30.0% ( 71) 60.0% +/- 49.4% ( 65) ( 50.1%)[( 63.9%) to ( 36.4%)] -1.24 Homunculus (0 to 100%) 13.0 +/- 7.7 ( 23) 23.5 +/- 21.5 ( 64) -10.49[ -19.61 to -1.36] -0.56 Narcotics 16.7% +/- 37.5% ( 72) 19.3% (11/57) ( 2.6%)[( 16.2%) to 10.9% ] -0.07 Charlson (approximate) N=71 N=71

0 49.3% ( 35) 49.3% ( 35) 1 19.7% ( 14) 19.7% ( 14) 2 22.5% ( 16) 14.1% ( 10) 3 2.8% ( 2) 8.5% ( 6)

4 1.4% ( 1) 1.4% ( 1) 5 2.8% ( 2) 7.0% ( 5) 6 1.4% ( 1) . ( .)

Baseline Charlson_approx 1.01 +/- 1.34 ( 71) 1.14 +/- 1.50 ( 71) -0.13[ -0.60 to 0.34] -0.09 Follow-up (FU) FU SF-12 Physical 39.2 +/- 9.1 ( 52) 37.2 +/- 7.8 ( 42) 1.91[ -1.60 to 5.42] 0.22 FU WOMAC Knee Pain (0-100) 16.4 +/- 15.3 ( 72) 35.0 +/- 21.0 ( 72) -18.65[ -24.71 to -12.58] -1.01 Time from Baseline to FU (months) Median <q1-q3> (n) 14 < 13- 16> ( 70) 15.36 < 12.12- 29.5> ( 72) Mean +/-sd (n) 15.84 +/- 12.00 ( 70) 21.28 +/- 11.54 ( 72) -5.43[ -9.34 to -1.53] -0.46


Table 1c.i - Comparisons of subject characteristics in imputed data – OAI

Variable TKR

(n=252)

Non-TKR

(n=252) TKR minus non-TKR Delta


mean +/- standard deviation (SD) (∆/SD)

Age 67.88 ± 8.50 66.82 ± 8.50 1.06 [-0.42 , 2.54] 0.06

Male, N(%) 0.43 ± 0.50 0.44 ± 0.50 -0.01 [-0.09 , 0.08] -0.01

Baseline BMI 29.84 ± 5.84 30.51 ± 5.69 -0.67 [-1.68 , 0.33] -0.06

Baseline SF-12 Physical 38.68 ± 10.51 39.77 ± 10.64 -1.10 [-2.94 , 0.75] -0.05

Baseline SF-12 Mental 55.60 ± 9.81 54.28 ± 9.61 1.31 [-0.38 , 3.01] 0.07

Baseline WOMAC Knee pain (0-100) 37.97 ± 19.33 34.40 ± 19.38 3.56 [0.18 , 6.94] 0.09

Baseline Knee Pain in Contralateral (0-100) 15.05 ± 16.95 15.70 ± 16.99 -0.65 [-3.62 , 2.31] -0.02

Baseline Hip Pain or Pain/Ache/Stiff 0.56 ± 0.54 0.60 ± 0.56 -0.04 [-0.14 , 0.06] -0.04

At least one comorbidity, N (%) 0.30 ± 0.61 0.27 ± 0.70 0.03 [-0.08 , 0.14] 0.02

Narcotics, N (%) 0.14 ± 0.32 0.10 ± 0.33 0.05 [-0.01 , 0.10] 0.07

Follow-up SF-12 Physical 44.20 ± 13.65 41.17 ± 10.24 3.03 [0.92 , 5.13] 0.13

Follow-up WOMAC Knee Pain (0-100) 11.78 ± 18.15 26.62 ± 18.15 -14.84 [-18.01 , -11.67] -0.41

*Shaded rows indicate variables where definitions varied between databases so that these variables ultimately were excluded as candidates in the building of final models.


Table 1c.ii - Comparisons of subject characteristics in imputed data – MOST

Variable TKR

(n=154)

Non-TKR

(n=154) TKR minus non-TKR

Delta (∆) and [95% CI] Effect Size


Age 65.49 ± 7.18 65.21 ± 7.18 0.28 [-1.32 , 1.88] 0.02

Male, N(%) 0.31 ± 0.46 0.31 ± 0.46 0.01 [-0.10 , 0.11] 0.01

Baseline BMI 32.33 ± 9.48 31.44 ± 7.18 0.89 [-0.98 , 2.77] 0.05


Baseline SF-12 Mental 55.99 ± 11.00 54.23 ± 11.80 1.76 [-0.79 , 4.31] 0.08

Baseline WOMAC Knee pain (0-100) 40.23 ± 24.82 35.33 ± 22.24 4.90 [-0.36 , 10.17] 0.1

Baseline Knee Pain in Contralateral (0-100) 21.46 ± 20.06 19.38 ± 24.10 2.08 [-2.87 , 7.03] 0.05

Baseline Hip Pain or Pain/Ache/Stiff 0.46 ± 0.50 0.61 ± 0.50 -0.15 [-0.26 , -0.04] -0.15

At least one comorbidity, N (%) 0.38 ± 0.49 0.39 ± 0.49 -0.01 [-0.12 , 0.10] -0.01

Narcotics, N (%) 0.20 ± 0.44 0.13 ± 0.39 0.07 [-0.03 , 0.16] 0.08

Follow-up SF-12 Physical 40.01 ± 12.11 40.10 ± 11.99 -0.09 [-2.78 , 2.60] 0


*Shaded rows indicate variables where definitions varied between databases so that these variables ultimately were excluded as candidates in the building of final models.


Table 1c.iii - Comparisons of subject characteristics in imputed data – NEBH

Variable TKR

(n=248)

Non-TKR




Age 63.33 ± 8.69 63.28 ± 8.69 0.05 [-1.48 , 1.58] 0

Male, N(%) 0.42 ± 0.49 0.40 ± 0.49 0.02 [-0.07 , 0.10] 0.02

Baseline BMI 31.56 ± 5.86 31.14 ± 6.07 0.41 [-0.64 , 1.46] 0.03


Baseline SF-12 Mental 48.22 ± 9.60 53.17 ± 9.72 -4.95 [-6.65 , -3.25] -0.26

Baseline WOMAC Knee pain (0-100) 56.11 ± 21.64 48.11 ± 21.16 8.00 [4.23 , 11.77] 0.19

Baseline Knee Pain in Contralateral (0-100) 21.27 ± 22.18 23.86 ± 22.18 -2.58 [-6.49 , 1.32] -0.06


At least one comorbidity, N (%) 0.24 ± 0.43 0.23 ± 0.42 0.00 [-0.07 , 0.08] 0

Narcotics, N (%) 0.08 ± 0.32 0.15 ± 0.33 -0.07 [-0.13 , -0.01] -0.1





Table 1c.iv - Comparisons of subject characteristics in imputed data – TMC

Variable TKR

(n=72)

Non-TKR




Age 62.57 ± 9.06 61.82 ± 9.06 0.75 [-2.21 , 3.71] 0.04

Male, N(%) 0.69 ± 0.46 0.69 ± 0.46 -0.00 [-0.15 , 0.15] 0

Baseline BMI 33.42 ± 6.77 30.99 ± 7.28 2.43 [0.13 , 4.73] 0.17


Baseline SF-12 Mental 49.48 ± 14.97 51.58 ± 14.97 -2.10 [-6.99 , 2.79] -0.07

Baseline WOMAC Knee pain (0-100) 47.48 ± 14.48 46.43 ± 14.43 1.05 [-3.67 , 5.77] 0.04

Baseline Knee Pain in Contralateral (0-100) 21.46 ± 13.72 19.38 ± 16.48 2.08 [-2.87 , 7.03] 0.07


At least one comorbidity, N (%) 0.51 ± 0.51 0.51 ± 0.51 0.00 [-0.16 , 0.17] 0

Narcotics, N (%) 0.17 ± 0.38 0.19 ± 0.40 -0.02 [-0.15 , 0.11] -0.03





APPENDIX G: Example of stakeholder influence on the model development process

Stakeholders were strongly supportive of using both the pain and functional outcomes in

patients’ decision-making processes, although they were not of equal importance to all patient

stakeholders. As the project progressed, the team continued to get more input from patient and

clinician stakeholders, which influenced modeling. For example, the initial model built on the

pooled OAI and MOST dataset predicted 1-year physical function with inclusion of a depression

score, which was a variable included in the OAI and MOST data sets. We used the larger pooled

dataset to see if any variables alone, or interacting with a treatment variable, should be added or

removed to improve the model, and had the model re-reviewed by clinician and patient

stakeholders and the team developing the user interface. In this case, the depression score had a

p-value between .05 and .10 in the model of the 1-year SF-12 outcome after correction for

imputation error. The depression scores available in our database came from a multi-item

questionnaire. The research team was concerned the additional items needed to compute the

depression score might be burdensome for the patient and/or clinician to collect, and we thus

considered removing the variable from the model. We then compared model performance (r-

square, calibration) with and without the variable, and although performance was slightly worse

without the variable, the decline in r-square of <1% was considered insufficient to justify the data

collection burden of retaining it, and the team decided to use the simpler version of the model.


APPENDIX H

Table 2a. Models of 1-year Knee pain (scored as 0 = no pain, 100= extreme pain)

Columns (left to right): Initial Model built on OAI and tested on MOST | P1. Model (from pooled MOST+OAI) | P2. model (from 4-source data)

R-square | 0.36 (OAI), 0.32 (MOST) | 0.32 | 0.32

Each entry: Beta Coeff (stderr) p-value

Model intercept (constant) 1.95 (2.60) p=0.4550 -2.99(4.04) p=0.4597 31.44(5.52) p=<.0001

Treatment (1=TKR, 0=control) -4.59 (3.41) p=0.1781 -5.00(2.77) p=0.0718 -3.33(2.16) p=0.1246

WOMAC knee pain (base) 0.49 (0.05) p=<.0001 0.42(0.04) p=<.0001 0.49(0.03) p=<.0001

Interaction: Treatment * WOMAC knee pain -0.21 (0.07) p=0.0044 -0.18(0.06) p=0.0026 -0.33(0.05) p=<.0001

Age (in years) -0.12(0.05) p=0.0225

SF-12 mental component [base] -0.11(0.05) p=0.033

SF-12 physical component [base] -0.21(0.07) p=0.0017

Age, dichotomized (less than 60 years old: 1=yes,0=no)) 4.20 (1.79) p=0.0186 4.44(1.41) p=0.0017

WOMAC [base] contralateral knee pain 0.13(0.04) p=0.0562

Homunculus % 0.26 (0.08) p=0.0008 0.11(0.05) p=0.0155

Body mass index [base], kg/m2 0.22(0.12) p=0.0628

Hip Pain (1=yes, 0=no) -0.31 (2.48) p=0.8992 2.00(1.80) p=0.2694

Interaction: Treatment*Hip Pain -5.69 (2.96) p=0.0545 -3.82(2.29) p=0.0948
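As an illustration of how a linear model with a treatment interaction yields two predictions for the same patient, the sketch below evaluates the P2 pain model with and without TKR. The assignment of coefficients to the P2 column is an inference from the flattened table (the third entry in three-entry rows, plus the single-entry rows for age, SF-12 mental, and SF-12 physical, which Table 2c lists as P2 variables); the function name and the example patient are hypothetical and are not taken from the report.

```python
# P2 coefficients as read from Table 2a (column assignment is an assumption).
P2_PAIN = {
    "intercept": 31.44,
    "treatment": -3.33,               # 1 = TKR, 0 = control
    "womac_pain": 0.49,               # baseline WOMAC knee pain, 0-100
    "treatment_x_womac_pain": -0.33,  # interaction term
    "age": -0.12,                     # years
    "sf12_mental": -0.11,             # baseline SF-12 mental component
    "sf12_physical": -0.21,           # baseline SF-12 physical component
}


def predict_pain(tkr, womac_pain, age, sf12_mental, sf12_physical, coef=P2_PAIN):
    """Predicted 1-year WOMAC knee pain (0 = no pain, 100 = extreme pain)."""
    t = 1 if tkr else 0
    return (
        coef["intercept"]
        + coef["treatment"] * t
        + coef["womac_pain"] * womac_pain
        + coef["treatment_x_womac_pain"] * t * womac_pain
        + coef["age"] * age
        + coef["sf12_mental"] * sf12_mental
        + coef["sf12_physical"] * sf12_physical
    )


# Hypothetical patient: baseline pain 50, age 65, SF-12 mental 50, SF-12 physical 35
pain_nonsurgical = predict_pain(False, 50, 65, 50, 35)
pain_tkr = predict_pain(True, 50, 65, 50, 35)
```

Because the interaction coefficient is negative, the predicted benefit of TKR in this sketch grows with baseline pain.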


Table 2b. Models of 1-year Physical Function (Physical Component Score of SF-12)

Columns (left to right): Initial Model built on OAI and tested on MOST | F1. model (from pooled MOST+OAI) | F2. model (from 4-source data)

Adjusted R-square (range from 10 imputed datasets) | 0.42 (OAI), 0.18 (MOST) | 0.35 | 0.34

Each entry: Beta Coeff (stderr) p-value

Model Intercept (constant) 5.49 (8.70) p=0.5337 25.84(5.15) p=<.0001 17.40(4.27) p=<.0001

Treatment (1=TKR, 0=control) 3.43 (1.00) p=0.0017 2.58(0.74) p=0.0008 25.41(4.33) p=<.0001

Gender (1=male, 0=female) 1.25 (0.95) p=0.1961 1.60(0.75) p=0.037 0.99(0.57) p=0.0873

Age (in years) -0.11 (0.04) p=0.0080 -0.13(0.04) p=0.0017 -0.05(0.04) p=0.2397

SF-12 mental component [base] 0.32 (0.10) p=0.0058 0.12(0.05) p=0.0196 0.19(0.04) p=<.0001

SF-12 physical component [base] 0.65 (0.06) p=<.0001 0.57(0.04) p=<.0001 0.55(0.03) p=<.0001

Body mass index [base], kg/m2 -0.16(0.08) p=0.0664 -0.19(0.05) p=0.0008

Charlson Comorbidity Score >=1 (vs. 0) -2.05(0.60) p=0.0009

Interaction: hadTKR*age -0.15(0.06) p=0.0084

Interaction: hadTKR*SF-12 mental score -0.18(0.06) p=0.0013

WOMAC [base] contralat knee pain -0.05 (0.03) p=0.1335 -0.07(0.02) p=0.0033

Homunculus (% of sites positive) -0.19 (0.06) p=0.0059

Hip Pain (1=yes, 0=no) 3.90 (1.03) p=0.0003
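The F2 function model can be evaluated in the same way. In this sketch the F2 coefficients are read as the last entry in each multi-entry row plus the single-entry Charlson and interaction rows, which is an inference from the flattened table and from Table 2c rather than something the table states directly. The function name and the example patient are hypothetical.

```python
# F2 coefficients as read from Table 2b (column assignment is an assumption).
F2_FUNCTION = {
    "intercept": 17.40,
    "treatment": 25.41,            # 1 = TKR, 0 = control
    "male": 0.99,                  # 1 = male, 0 = female
    "age": -0.05,                  # years
    "sf12_mental": 0.19,           # baseline SF-12 mental component
    "sf12_physical": 0.55,         # baseline SF-12 physical component
    "bmi": -0.19,                  # kg/m2
    "charlson_ge1": -2.05,         # Charlson score >= 1 (vs. 0)
    "treatment_x_age": -0.15,
    "treatment_x_sf12_mental": -0.18,
}


def predict_function(tkr, male, age, sf12_mental, sf12_physical, bmi,
                     charlson_ge1, coef=F2_FUNCTION):
    """Predicted 1-year SF-12 Physical Component Score (higher is better)."""
    t = 1 if tkr else 0
    return (
        coef["intercept"]
        + coef["treatment"] * t
        + coef["male"] * (1 if male else 0)
        + coef["age"] * age
        + coef["sf12_mental"] * sf12_mental
        + coef["sf12_physical"] * sf12_physical
        + coef["bmi"] * bmi
        + coef["charlson_ge1"] * (1 if charlson_ge1 else 0)
        + coef["treatment_x_age"] * t * age
        + coef["treatment_x_sf12_mental"] * t * sf12_mental
    )


# Hypothetical patient: female, age 65, SF-12 mental 50, physical 35, BMI 30, no comorbidity
function_nonsurgical = predict_function(False, False, 65, 50, 35, 30, False)
function_tkr = predict_function(True, False, 65, 50, 35, 30, False)
```

The large treatment coefficient is offset by the two negative interaction terms, so the net predicted gain from TKR depends on age and baseline mental score.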


Table 2c. Summary of variables used in multivariable models built on pooled data sources

Summary of Data: % or 5th to 95th percentile from imputed data [for variables used for models]. The last four columns show whether each variable is included in the model built on the pooled 2-source database (P1/F1) or the pooled 4-source database (P2/F2).

Label | Pooled (2-source) database: OAI and MOST | Pooled (4-source) database: OAI/MOST/TUFTS/NEBH | P1 | F1 | P2 ++ | F2 ++
Treatment (1=TKR, 0=control) | (50% -matched set) | (50% -matched set) | Yes | Yes | Yes | Yes
Gender (1=male, 0=female) | 39% male | 42% male | no | Yes | no | Yes
Age (in years) | 53 to 79 | 51 to 79 (min-max: 40 to 88) | (dichot) | Yes | Yes | Yes
SF-12 mental component [base] | 37 to 67 (high is good) | 34 to 66 | no | Yes | Yes | Yes
SF-12 physical component [base] | 24 to 53 (high is good) | 23 to 53 | no | Yes | Yes | Yes
WOMAC [base] contralat knee pain (100 pt. scale) | 0 (no pain) to 51 | not available | Yes | Yes | no | no
Body mass index [base], kg/m2 | 22 to 41 | 23 to 41 | Yes | Yes | no | Yes
WOMAC knee pain (base) | 5 to 70 (high is bad) | 10 to 80 | Yes | no | Yes | no
Homunculus, % sites with symptom | 0% (no sites with pain) to 53% | not available | Yes | no | no | no
Hip Pain (1=yes, 0=no)* | 56% | 48% | Yes | no | no | no
Less than 60 years old (1=yes, 0=no) | 21% | 27% | Yes | no | no | no
Charlson comorbidity score >=1 (1=yes, 0=no) | 32% | 31% | no | no | no | Yes

* OAI definition: Right or Left hip pain, aching or stiffness: any, past 12 months (includes pain in groin and in front and sides of upper thigh)

*Variables in gray were used in P1 and/or F1 but were NOT used in P2/F2 models

++ Model used for interface


APPENDIX I: Exploration of Linear and Alternative Regression Models for WOMAC knee pain

One challenge encountered in this project was related to distribution of the 1-year

WOMAC scores. We realized that the follow-up WOMAC Pain scores at 1-year (or closest visit to 1-

year) were right skewed with most subjects having low scores (less pain). In addition to looking at

linear regression models as planned, we also explored using general linear mixed models

assuming the outcome followed either a negative binomial distribution (with the outcome rounded to

integer values of 0 to 100) or a beta distribution (with the outcome rescaled to the range 0.01 to 0.99,

so that 0 maps to 0.01 and 1 maps to 0.99).

We then ranked the true 1-year WOMAC pain outcome by quintile and compared the observed mean WOMAC pain with the predicted WOMAC pain from each model, to see whether a non-linear regression model might improve our predictions.
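The quintile comparison can be sketched as follows. This is an illustrative outline under stated assumptions, not the project's analysis code: the function name `quintile_summary` is hypothetical, ties are broken by input order, and the quintiles are formed as five groups of nearly equal size.

```python
from statistics import mean


def quintile_summary(observed, predicted):
    """Return (quintile, mean observed, mean predicted) for five groups.

    Subjects are ranked by the observed outcome and split into five
    groups of nearly equal size (an assumption about how quintiles
    were formed).
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    n = len(observed)
    if n < 5:
        raise ValueError("need at least 5 subjects to form quintiles")
    # Indices of subjects sorted from lowest to highest observed score.
    order = sorted(range(n), key=lambda i: observed[i])
    rows = []
    for q in range(5):
        idx = order[q * n // 5:(q + 1) * n // 5]
        rows.append((
            q + 1,
            mean(observed[i] for i in idx),
            mean(predicted[i] for i in idx),
        ))
    return rows
```

A model that fits a skewed outcome well should give predicted means close to the observed means in every quintile, including the lowest and highest groups.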

The next page shows some preliminary models that were run on the pooled MOST and OAI databases, along with plots of predicted values. After reviewing these results, we opted to continue using linear regression for this project, as neither of the alternatives we explored appeared much better than the simpler, pre-planned linear regression approach.

APPENDIX J

Figure 1. Early version of the physical function predicted outcome results page

Figure 2. Final version of the physical function predicted outcome results page

Figure 3. Early combined pain and function predicted outcome results page

Figure 4. Final combined pain and function predicted outcome results page

APPENDIX K: Mathematical Equipoise between Pain and Function Predictions with Nonsurgical Care or Total Knee Replacement

Mathematical equipoise is defined within KOMET as the condition in which the pain and function outcome predictions for nonsurgical care and for TKR are relatively close and fall within each other’s “zones of uncertainty,” i.e., their circles of uncertainty overlap. These circles are illustrated on the graph below. The blue diamond represents the outcome prediction point estimate for nonsurgical care, and the green circle represents the point estimate for TKR. The large shaded blue and green circles extending around each point estimate represent the uncertainty associated with the predictions. We computed the mathematical distance between the nonsurgical and TKR predictions as the distance between the two coordinates on the pain and function graph.

Figure 1. Depiction of “zone of uncertainty” or mathematical equipoise

Equation for Calculating the Distance between Predictions with Nonsurgical Care and with Knee Replacement

The equation used to calculate the distance between the predictions for pain and function is below.

$$d_1 = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$

where $(x_1, y_1)$ is the coordinate for the pain and function predictions with nonsurgical care and $(x_2, y_2)$ is the coordinate for the pain and function predictions with TKR.
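The distance calculation, together with a simple overlap check, might be sketched as below. This is a hedged illustration and not the KOMET implementation: the function names are hypothetical, and the radii of the uncertainty circles are taken as inputs because this appendix does not state how they were derived. The overlap rule (distance smaller than the sum of the two radii) is the standard geometric test for two intersecting circles and is assumed here.

```python
import math


def prediction_distance(nonsurgical, tkr):
    """Euclidean distance between two (pain, function) predictions."""
    x1, y1 = nonsurgical
    x2, y2 = tkr
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)


def circles_overlap(nonsurgical, tkr, radius_nonsurgical, radius_tkr):
    """Return True when the two uncertainty circles intersect.

    Assumption: each circle is centered on its point estimate and the
    radii are supplied by the caller (hypothetical inputs).
    """
    distance = prediction_distance(nonsurgical, tkr)
    return distance < radius_nonsurgical + radius_tkr
```

For example, predictions at (0, 0) and (3, 4) are 5 units apart, so circles of radius 3 around each would overlap while circles of radius 2 would not.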

APPENDIX L Three Sample Graphs with Small, Moderate, and Large Amounts of “Uncertainty Circle” Overlap

Small Overlap

Moderate Overlap

Large Overlap

APPENDIX M Consistency of knee pain and function outcomes used for models with other measures of knee pain and function

The team felt it important to evaluate other outcomes to look for consistency of effect.

These were exploratory analyses done after the models for pain and function were finalized. The

evaluations were done using the OAI database.

For pain, the study outcome was WOMAC knee pain. For the consistency of effect evaluation, we also looked at the KOOS knee pain and KOOS symptom scales.

For function, the study outcome was the SF-12 physical function score. For the consistency of effect evaluation, we also looked at the KOOS function, sports, recreation (FSR) scale, the KOOS quality of life (QOL) scale, and the KGLRS scale. The KGLRS is another quality of life index that asks responders to ‘consider all the ways that knee pain and knee arthritis affect you’ and to rate how they ‘are doing’ on a 10-point scale ranging from very good to very poor. For the purposes of these evaluations, all of these scales/instruments were rescaled to 0 to 100, where low values indicated poorer function and/or higher pain and high values indicated good function and/or lower pain.
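The rescaling to a common 0 to 100 scale can be illustrated with a short sketch. The function name and its parameters are hypothetical, and the raw ranges and directions of the individual instruments must be supplied by the caller; this appendix does not list them, so none are hard-coded here.

```python
def rescale_to_100(value, scale_min, scale_max, higher_is_better):
    """Linearly rescale a raw score to 0-100 so that high values are good.

    Scores from instruments where a high raw value is bad (for example,
    more pain) are reversed so that 100 always means the best status.
    """
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    if not scale_min <= value <= scale_max:
        raise ValueError("value lies outside the instrument's range")
    fraction = (value - scale_min) / (scale_max - scale_min)
    if not higher_is_better:
        fraction = 1.0 - fraction
    return 100.0 * fraction
```

With this convention a positive change from baseline always indicates improvement, which is what allows the scales to be compared directly.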

The results of these exploratory analyses suggest that WOMAC knee pain tracks well with other measures of knee pain and symptoms, in particular KOOS knee pain. The SF-12 physical function score, while positively correlated, does not track as strongly with other knee-related quality of life and function variables. This is somewhat to be expected: overall physical function and knee-related function may overlap, but they are not the same thing. Our stakeholders suggest that both overall and knee-related function are important, and we have come to believe that future work to develop predictions of the more specific knee-related function would be useful to both patient and clinical stakeholders.

I. SUBJECT PLOTS: For illustrative purposes, we show baseline (pre) and 1-year follow-up (pos) raw (knee) pain and function scores for a random sample of subjects. The header for each panel in each figure indicates whether the subject received a total knee replacement (TKR). If the different scales all capture the same information, the lines within each panel should overlap. The panel on the left shows the different pain scales (WOMAC knee pain (KP), KOOS KP, and KOOS Symptom). The panel on the right shows the different function scales (SF-12 physical component score, KOOS FSR, KOOS QOL, KGLRS). In general, the lines were reasonably parallel and moved in the same direction, although there was variability.

II. DISTRIBUTIONS: The distributions of scores at baseline (PRE), at the approximate 1-year follow-up (POS), and of the change from baseline (DEL, computed as POS minus PRE) were plotted for each scale. Different colors show the distributions for subjects who received TKR (red) and subjects who did not (blue). The distributions of the 3 pain scores are in the left panel, and those of the 4 function scores are in the right panel.

Consistency of the scores would best be illustrated by similar distributions across the rows of the figures, while differences between columns may remain. This is shown clearly in the plot of pain scores on the left. For PRE, the distributions are reasonably symmetric and centered near a value of 60. For POS, the scores are higher (better) and skewed to the right, especially for the TKR (red) subjects. The delta scores for all 3 pain measures are symmetric, but there is more separation between the TKR (red) and non-TKR (blue) subjects, with the TKR subjects showing greater improvements on all three scales.

III. CORRELATION: We next looked at the consistency of scores at the subject level using simple bivariate scatter plots. If the scores for any two scales were the same, all points on the scatter plot would fall along a diagonal line, and the corresponding correlation coefficient would be 1.0. Again, the panel on the left shows the 3 bivariate scatter plots for the 3 pain scores, and the panel on the right shows the 6 bivariate plots for the 4 function scores. The red dots and red smoothed line are the data for subjects with TKR, and the blue dots and blue smoothed line are for subjects who did not have TKR. All correlations were positive, and nearly all had an associated p-value <0.05.
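A correlation coefficient for one pair of scales could be computed as in the sketch below. The report does not say which coefficient was used; Pearson's r is assumed here, and the function name is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / math.sqrt(sxx * syy)
```

Two scales that differ only by a positive linear transformation give a coefficient of 1.0, matching the diagonal-line case described above.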

IV. AGREEMENT: The last evaluation categorized each subject's change from baseline to follow-up on each scale as an improvement of more than 8 points, a worsening of more than 8 points, or a change of no more than +/- 8 points. Again, bivariate tables were constructed to look at agreement between the change categories. The choice of 8 points on a 100-point scale was based on the KOOS User’s Guide 1.1, updated August 2012 (http://www.koos.nu/), which notes “The Minimal Important Change (MIC) is currently suggested to be 8-10” with an acknowledgment that there are limitations to this suggestion. We evaluated “agreement” with a kappa statistic; a kappa of 1 indicates perfect agreement. For the pain scales, WOMAC knee pain (KP) and KOOS KP had the highest kappa (consistent with the largest correlation seen in part III). Kappas were lower for the function scales than for the pain scales, with the SF-12 agreeing more with the KOOS scales than with the KGLRS. Among the function measures, the kappa was highest for the 2 KOOS scales (KOOS FSR and KOOS QOL). These results are shown on the following page.
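The categorization and agreement calculation can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the function names and category labels are hypothetical, scores are assumed to be already rescaled so that higher is better, and Cohen's unweighted kappa is assumed because the report says only that a kappa statistic was used.

```python
from collections import Counter

MIC_THRESHOLD = 8  # points on the 0-100 scale, from the KOOS User's Guide


def change_category(pre, pos, threshold=MIC_THRESHOLD):
    """Classify the change from baseline (pos minus pre) into 3 categories."""
    delta = pos - pre
    if delta > threshold:
        return "improved"
    if delta < -threshold:
        return "worsened"
    return "unchanged"


def cohen_kappa(labels_a, labels_b):
    """Cohen's unweighted kappa for two lists of category labels."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty lists of equal length")
    # Observed proportion of subjects placed in the same category.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from the marginal frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)
```

Applying `change_category` to each subject on two scales and passing the two label lists to `cohen_kappa` yields one agreement value per pair of scales.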

A. AGREEMENT: Pain Scales

B. AGREEMENT: Function Scales

B. AGREEMENT: Function Scales (continued)

V. Summary of Scales

VI. Screenshots of components of KOOS and KGLRS Scales from OAI database

KOOS Knee Pain KOOS Function, Sports, Recreational Activities

VII. Screenshots of components of KOOS and KGLRS Scales from OAI database (continued)

KOOS SYMPTOMS KOOS QOL and KGLRS

Copyright© 2019. Tufts Medical Center. All Rights Reserved.

Disclaimer:

The [views, statements, opinions] presented in this report are solely the responsibility of the author(s) and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute® (PCORI®), its Board of Governors or Methodology Committee.

Acknowledgement:

Research reported in this report was [partially] funded through a Patient-Centered Outcomes Research Institute® (PCORI®) Award (#ME-1306-02327). Further information available at:

https://www.pcori.org/research-results/2013/developing-software-predict-patient-responses-knee-osteoarthritis-treatments
