MASTER IN DE ERGOTHERAPEUTISCHE...

Faculty of Medicine and Health Sciences

Updating the Evidence on Functional Capacity Evaluation Methods

A Systematic Review

Noortje SCHALLEY

Master's thesis submitted to obtain the degree of

Master of Science in Occupational Therapy

Promoter: Prof. dr. Dominique Van de Velde

Co-promoter: Sofie Vertriest

Mentor: Lien Van Peteghem

Academic year 2015-2016

MASTER OF SCIENCE IN OCCUPATIONAL THERAPY

Interuniversity master's programme in collaboration with:

UGent, KU Leuven, UHasselt, UAntwerpen,

Vives, HoGent, Arteveldehogeschool, AP Hogeschool Antwerpen, HoWest, Odisee, PXL,

Thomas More

ABSTRACT

Objectives: The purpose of this systematic review is to synthesize the recent evidence

(published since May 2004) on the psychometrics of currently used Functional Capacity

Evaluation (FCE) methods. This way, information from previous systematic reviews on

this topic can be enriched with up-to-date evidence, enabling more objectively

substantiated decisions in everyday practice.

Methods: A systematic literature search was conducted in nine databases. The resulting

articles' titles, abstracts and full texts were screened based on predefined inclusion and

exclusion criteria. Two reviewers independently performed this screening process.

Included studies were appraised based on their methodological quality. Relevant data

were extracted into extraction tables.

Results: The search resulted in 20 eligible studies of varying methodological quality,

on nine different FCE methods. The Baltimore Therapeutic Equipment Work Simulator

showed a moderate predictive validity and good diagnostic abilities. The Ergo-Kit (EK)

showed moderate variability and high inter- and intra-rater reliability. Low

discriminative abilities and high convergent validity were found for the EK. Concurrent

validity of the EK and the ERGOS Work Simulator was low to moderate. Moderate to

high test-retest, inter- and intra-rater reliability was found for the Isernhagen Work Systems

(IWS) FCE. The predictive validity of the IWS was low. The Physical Work

Performance Evaluation (PWPE) showed moderate test-retest reliability and moderate

to high inter-rater reliability. Low internal and external responsiveness were found for

the PWPE, while predictive validity was high. The predictive value of the Short-Form

FCE was also high. Low discriminative and convergent validity were found for the

Work Disability Functional Assessment Battery. The WorkHab showed moderate to

high test-retest, inter- and intra-rater reliability. Its manual handling component's

internal consistency was high.

Conclusions: Well-known FCE methods have been rigorously studied, but some of the

research indicates weaknesses in their reliability and validity. Future research should

address how these weaknesses can be overcome. Newer methods, such as the

Short-Form FCE, need to be further examined on several psychometric properties.

ABSTRACT (Dutch version, translated)

Objectives: The aim of this systematic review is to bring together recent evidence (published since May 2004) on the psychometric properties of Functional Capacity Evaluation methods. In this way, the information from previous systematic reviews on this topic can be enriched with up-to-date evidence, in order to facilitate objectively substantiated decisions in everyday practice.

Methods: A systematic literature search was carried out in nine databases. The titles, abstracts and full texts of the retrieved articles were screened successively on the basis of predefined inclusion and exclusion criteria. Two reviewers carried out this screening process independently of each other. Included studies were appraised on the basis of methodological quality. Relevant data were extracted into extraction tables.

Results: The searches yielded 20 eligible studies of varying methodological quality, on nine different FCE methods. The Baltimore Therapeutic Equipment Work Simulator showed moderate predictive validity and good diagnostic ability. The Ergo-Kit (EK) showed moderate variability and high inter- and intra-rater reliability. In addition, low discriminative ability and good convergent validity were found for the EK. The concurrent validity of the EK and the ERGOS Work Simulator was low to moderate. Moderate to high test-retest reliability and inter- and intra-rater reliability were found for the Isernhagen Work Systems (IWS) FCE. The predictive validity of the IWS was low. The Physical Work Performance Evaluation (PWPE) showed moderate test-retest reliability and moderate to high inter-rater reliability. Low internal and external responsiveness were also found for the PWPE, while its predictive validity was high. The predictive validity of the Short-Form FCE was also high. Low discriminant and convergent validity were found for the Work Disability Functional Assessment Battery. The WorkHab showed moderate to high test-retest reliability and inter- and intra-rater reliability. The internal consistency of the manual handling component of the assessment was high.

Conclusions: Well-known FCE methods have already been studied extensively, but certain studies reveal weaknesses in terms of reliability and validity. Further research should focus on how these weaknesses can be overcome. Newer methods, such as the Short-Form FCE, should be examined further on a range of psychometric properties.

Word count of the master's thesis: 10 303

(excluding table of contents, tables, numerical data, appendices and bibliography)

TABLE OF CONTENTS

PREFACE / ACKNOWLEDGEMENTS

1. Introduction

2. Methods
2.1. Systematic literature search strategy
2.2. Quality assessment
2.3. Synthesis approach

3. Results
3.1. Literature search
3.2. Reviewed studies
3.2.1. Baltimore Therapeutic Equipment (BTE) Work Simulator
3.2.2. Ergo-Kit (EK) Functional Capacity Evaluation
3.2.3. ERGOS Work Simulator (EWS)
3.2.4. Blankenship WorkEval Functional Capacity Evaluation
3.2.5. Isernhagen Work Systems (IWS) Functional Capacity Evaluation / WorkWell Systems Functional Capacity Evaluation
3.2.6. ErgoScience Physical Work Performance Evaluation (PWPE)
3.2.7. Short-Form Functional Capacity Evaluation
3.2.8. Work Disability Functional Assessment Battery (WD-FAB)
3.2.9. WorkHab Functional Capacity Evaluation
3.3. Quality appraisal
3.4. Study outcomes
3.5. Summary of the results and their interpretations
3.5.1. Baltimore Therapeutic Equipment (BTE) Work Simulator
3.5.2. Blankenship WorkEval Functional Capacity Evaluation
3.5.3. Ergo-Kit (EK) Functional Capacity Evaluation
3.5.4. ERGOS Work Simulator (EWS)
3.5.5. Isernhagen Work Systems (IWS) Functional Capacity Evaluation / WorkWell Systems Functional Capacity Evaluation
3.5.6. ErgoScience Physical Work Performance Evaluation (PWPE)
3.5.7. Short-Form Functional Capacity Evaluation
3.5.8. Work Disability Functional Assessment Battery (WD-FAB)
3.5.9. WorkHab Functional Capacity Evaluation

4. Discussion and conclusion
4.1. Results
4.2. Interpreting the results of the reviewed studies
4.3. More than psychometric properties
4.4. Literature search and potentially missed studies
4.5. Choice of the quality assessment method
4.6. Recommendations for future research

REFERENCES

APPENDICES
1. List of figures and tables
2. Permission for consultation

PREFACE / ACKNOWLEDGEMENTS

My gratitude goes out to all the helping hands that contributed to realizing this study.

First of all, and above all, I would like to extend my gratitude to doctor Dominique Van

de Velde, promoter of this thesis. I am very thankful for his guidance and support

throughout the entire writing process; especially during the many consultation meetings,

in which I could always turn to him with my questions and with difficulties I had

encountered. Another great help during these meetings was my mentor Lien Van

Peteghem, who helped me understand the practical aspects, usage and relevance of

Functional Capacity Evaluations. I would also like to thank Sofie Vertriest, co-promoter

of this thesis, who proposed this study topic and followed up on the realization of the

systematic review.

Furthermore, a special thanks goes out to Stijn De Baets for generously providing tips,

materials and feedback; and most importantly, for independently performing the

literature screening process as an external reviewer. I would also like to express my

deep gratitude to the five external readers who took the time to attentively read my work

and provide their very helpful inputs and feedback. Thank you for your time and effort,

professor Ev Innes, professor Haije Wind, professor Michiel Reneman, Dirk Vandamme

and Linda Gabriël.

Finally, special recognition goes out to my family and close friends, for their support,

encouragement and patience during the realization of my thesis.

Without all this direct and indirect support, I would not have been able to realize this

final result.

1. Introduction

A synthesis of research by the Organization for Economic Co-operation and

Development (OECD) in 2011 showed that almost all OECD countries experience

significant social and economic impacts from the high number of workers permanently

leaving the labour market due to health problems or disability (1,2). Additionally,

research shows that people with reduced work capacity or with more time off work are

less likely to remain in employment (1,2). Costs associated with disability benefits

make up a significant proportion of public expenditure across OECD countries,

averaging a total of 1.2% of countries' Gross Domestic Product (GDP) (1). In the

Netherlands, Norway and Sweden the proportion of GDP is even higher at 3.5% (1).

Furthermore, employment rates of people with disability are on average 40% lower than

the overall level (1). These low employment rates are accompanied by high social costs

(1,3,4) due to unemployment benefits, lower incomes and much higher poverty risk (1).

Changes in the labour market due to the Global Financial Crisis (GFC), such as

increased unemployment rates, might lead to a higher number of people relying on

sickness and disability benefits as their main source of income (1). Meanwhile, the

incidence of occupational injuries leading to long-term absenteeism and potentially

leading to job loss is still on the rise in high income countries (5,6). Such developments

have led to governments shifting policy focus to concentrate on tackling rising

unemployment (1). Despite the growing awareness of the abovementioned issues,

Return-To-Work approaches are not always adequate and in several cases result in higher levels

of employee litigation and lower levels of employee morale (7), while the prevalence of

occupational disability and related costs continue to increase (5). Evidence shows,

however, that “incapacity-related spending is much higher than unemployment-related

spending” (1). Furthermore, “except for a few countries, the share of spending on

vocational rehabilitation and employment programs is less than 8%, and in most cases

less than 4%, of total disability-related spending” (1). The discrepancy between the

costs associated with unemployment and those related to incapacity and disability is of

concern. Alongside this is the relatively small investment in vocational rehabilitation

and employment programs to address the high unemployment rates of those with

disabilities and the poor work sustainability of those with reduced work capacity. There

is a need for greater investment in vocational rehabilitation in an effort to increase

employment rates, including employment of people with disabilities limiting their work

capacity. In order to determine a person's work capacity, it is necessary to have

appropriate measurement instruments. Functional Capacity Evaluations (FCE) are such

instruments that could subsequently play a key role in decreasing incapacity-related

spending by determining a person's work capacity and matching this with appropriate

employment; either by matching workers' abilities to job requirements or by identifying

necessary modifications to the work environment or workload (8).

Although experts have difficulties in agreeing upon a single definition of Functional

Capacity Evaluations (FCE) (9,10), there seems to be agreement on the different terms

that comprise FCE (10). The following definition, based on the International

Classification of Functioning, Disability and Health, achieved 63% agreement amongst

FCE experts responding to a Delphi Survey: “A FCE is an evaluation of capacity of

activities that is used to make recommendations for participation in work while

considering the person's body functions and structures, environmental factors, personal

factors and health status” (10). Regardless of the lack of a uniform definition, the

purpose of FCE is considered to be the evaluation of a person's ability to participate in

work (10) by matching his or her capacities with functional job requirements (11). The

underlying assumption here is that a better performance in the FCE is associated with

faster Return-To-Work and lower risk of re-injury or pain exacerbation (11); however, it

is important to note that many other factors which are not measured by FCE, such as

personal causation, also influence Return-To-Work outcomes (12). Furthermore, FCE is

also described as a systematic, comprehensive and multi-faceted measurement tool,

designed to measure a person's current physical abilities in work-related tasks (6,13).

FCEs are used for a variety of reasons, in differing settings. Most commonly they are

used with individuals who have work disabilities (6) in a rehabilitation or clinical

setting to: develop an individually oriented, customized rehabilitation program; adjust

the currently used rehabilitation program; measure changes in physical abilities (pre-

and post-intervention); and to determine functional work abilities and match these with

employment prior to Return-To-Work (6,8,14). FCEs have also become part of medico-legal

assessments to determine whether claimants should receive disability benefits,

based on their assessed functional abilities (8). In conclusion, many (health) disciplines

from various organizations use FCEs as part of practice, including for making

recommendations and decisions that have implications for people with work-related

disabilities, employers, insurers, other health and medical professionals and other

stakeholders. Therefore, it is important that FCE users know whether FCE methods

provide reliable and valid information and subsequently which method is preferable.

Over the past 30 years, several FCE methods have been developed. Matheson provided

one of the earliest examples in 1984, followed by Isernhagen in 1988, who also

suggested that FCE should be a multidisciplinary matter. Since then, at least ten

different types of well-known FCEs have been described (14). Some are still being

researched and adapted, while others have fallen out of favour, their

manufacturers/distributors have ceased operation or they have been superseded by later

versions or computer-based systems (8).

The credibility of these FCE methods was primarily based on the knowledge and

expertise of the developers (6,9). However, there is a growing interest in the psychometric properties of FCEs

psychometric properties because, as for any other test, “a FCE should give reliable and

valid measurements” (15). Consequently, numerous studies have been conducted to

validate FCE methods and to demonstrate their reliability in varying client groups.

These studies have been reviewed over the years, commencing with two comprehensive

reviews of a wide range of work-related assessments by Innes and Straker

(1999a,1999b) (16,17). Since then, other comprehensive systematic reviews addressing

the psychometric properties of FCEs have been conducted by Gouttebarge, Wind,

Kuijer and Frings-Dresen (2004); Wind, Gouttebarge, Kuijer and Frings-Dresen (2005)

and Innes (2006) (6,8,18). These comprehensive reviews synthesize the studies on all

types of psychometric properties of several well-known FCE methods and allow the

comparison of the quality of these methods. There have also been multiple systematic

reviews on more specific topics regarding FCE, for example on the psychometrics of one

specific FCE method, as in the review by Bieniek and Bethge (2014) (19).

Furthermore, there have been multiple systematic reviews on specific psychometrics of

FCE and on the use of FCE in specific target groups, such as the systematic reviews by

van der Meer, Trippolini, van der Palen, Verhoeven and Reneman (2013) (20) and

Kuijer, Gouttebarge, Brouwer, Reneman and Frings-Dresen (2012) (21). These reviews

provide answers on more focused questions regarding FCE.

With the most recent comprehensive review on FCE dating from 2006 (8), there is a

clear need for an updated global review of the existing evidence, especially since, in

recent years, new and promising FCE methods have been developed and known methods

have been refined and more thoroughly researched (8). Another important reason for

updating current comprehensive evidence on FCE validity and reliability, is the

continuing use of FCEs in important decision-making processes regarding occupational

rehabilitation, insurances, disability benefits, and so on. The decision makers should be

objectively informed of the strengths, weaknesses and unknown properties of the

measurement tools they choose or have chosen. This study attempts to enable

such decision makers, for example clinicians in the vocational rehabilitation setting, to

nuance and critically interpret the outcomes that FCE methods provide. The main aim of

performing this systematic review on the psychometric properties of FCE methods,

therefore, is to synthesize the recent evidence (published since May 2004) on the

validity and reliability of currently used methods. This way, information in previous,

similar syntheses can be enriched with up-to-date evidence, providing a more

comprehensive frame of reference. By doing so, it will be more feasible to evaluate and

compare the reliability and validity of FCE methods in order to make more objective

and substantiated decisions, for example in choosing between FCE methods.

The research question of this systematic review is therefore: “What are the (recently

studied) psychometric properties of current Functional Capacity Evaluation methods?”.

2. Methods

2.1. Systematic literature search strategy

For this systematic review, a literature search was conducted to identify relevant studies

from the following electronic databases. These particular databases were chosen based

on convenience and relevance.

Broad database(s)

- Web Of Science

- Trip Database

- Journal Storage (JSTOR)

- The bibliographic database of the Catholic University of Leuven

Healthcare database(s)

- PubMed

- Embase

- Cochrane Library

Discipline specific database(s)

- PEDro

- OTSeeker

Relevant search terms and their synonyms were identified. MeSH-terms were used

when available. Before the actual literature search, scoping searches were performed to

determine which of these search terms would provide the highest number of relevant

results. These were used as key terms (Table 1: Used key words and their synonyms)

and were combined into search phrases using Boolean operators. For example:

(Functional Capacity Evaluation OR FCE) AND (Psychometrics OR Psychometric

properties OR Validity OR Reliability) AND (Return to Work OR Vocational

rehabilitation OR Work OR Job OR Participation in work). These search phrases were

entered into the abovementioned databases between the 12th and the 22nd of November

2015. Filters for publication date and type of study were used when available, in

accordance with the inclusion criteria. This complete process was performed in three

languages (English, Dutch and French) and resulted in a total of 1381 hits. These hits

were screened by applying the following inclusion and exclusion criteria:

Inclusion criteria:

- type of study: RCT (clustered or individual), CCT, meta-analysis or systematic

review,

- type of subjects: healthy subjects or subjects with an occupational disability or

disease (in the process of vocational rehabilitation),

- type of outcome measure: psychometric properties (validity and reliability),

- type of assessment: (components of) Functional Capacity Evaluation methods,

that assess the global physical/functional capacity of the subject,

- publication date: May 2004 to the 12th of November 2015,

- language: English, French or Dutch.

Exclusion criteria:

- studies reporting on FCE methods that are no longer commercially available or

no longer used,

- studies of FCE methods that only measure certain specific functional/physical

capacities, for example the EPIC Lift Capacity Test and the Progressive

Isoinertial Lifting Evaluation (PILE), which only assess lifting,

- studies of FCE methods with a specific target population, for example: the

Whiplash Associated Disorders (WAD) FCE,

- studies that are not relevant to the research question.
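
Some of the criteria above (publication date, language and type of study) can be checked mechanically, whereas relevance and type of assessment require the judgement of a reviewer. The following is a minimal sketch of such a pre-filter, not the procedure used in this review. The `Record` structure and its field names are hypothetical, and "May 2004" is read as 1 May 2004, which is an assumption because only the month is stated.

```python
from dataclasses import dataclass
from datetime import date

# Assumption: "May 2004" is interpreted as 1 May 2004.
START = date(2004, 5, 1)
END = date(2015, 11, 12)
LANGUAGES = {"English", "French", "Dutch"}
STUDY_TYPES = {"RCT", "CCT", "meta-analysis", "systematic review"}


@dataclass
class Record:
    # Hypothetical record structure, for illustration only.
    title: str
    published: date
    language: str
    study_type: str


def passes_filters(record: Record) -> bool:
    """Check only the criteria that can be tested without reading the study."""
    return (
        START <= record.published <= END
        and record.language in LANGUAGES
        and record.study_type in STUDY_TYPES
    )
```

A record that passes this pre-filter would still need to be screened by hand against the remaining inclusion and exclusion criteria.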

In the first phase of the screening process, these criteria were applied to the titles of the

1381 hits to determine whether they would be included or excluded. In the second

phase, the abstracts of the studies included based on title were also screened, based on

the criteria described above. When studies showed questionable relevance in these first two

phases, they were included and more thoroughly screened in the following phase. The

studies that were included based on title and abstract, were screened a third time based

on their full texts. In this final phase of the screening process, references of included

systematic reviews were also screened for relevant studies through citation searching.

To ensure objectivity in the screening process, an external reviewer was asked to

independently complete the same process of screening the studies based on title and

abstract. Percentage of agreement and Cohen's kappa were calculated to compare the

results of the screening process between the two reviewers. Agreement between the two

reviewers on the inclusion of studies based on abstract was high on the first review,

with 98.17% agreement and κ = 0.96 (SE = 0.03; 95% CI = 0.910-1.00), and excellent

after deliberation/discussion, with 100% agreement and κ = 1.00. Disagreements on the

inclusion or exclusion of studies were resolved by a discussion between the two

reviewers. It was not necessary to consult a third reviewer.
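
The two agreement statistics mentioned above can be illustrated with a short sketch. Cohen's kappa corrects the observed agreement p_o for the agreement p_e expected by chance: κ = (p_o - p_e) / (1 - p_e). The decisions in the example are hypothetical and are not the screening data of this review; the function names are likewise chosen for illustration.

```python
from typing import Sequence


def percent_agreement(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Proportion of records on which both reviewers made the same decision."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both reviewers must rate the same, non-empty set of records")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two reviewers making include/exclude decisions."""
    p_o = percent_agreement(a, b)
    n = len(a)
    rate_a = sum(a) / n  # proportion included by reviewer A
    rate_b = sum(b) / n  # proportion included by reviewer B
    # Chance agreement: both include, or both exclude, by coincidence.
    p_e = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_e == 1.0:
        # Kappa is undefined here; returning 1.0 is a convention assumed for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical decisions for ten abstracts (True = include); not the review's data.
reviewer_a = [True] * 4 + [False] * 6
reviewer_b = [True] * 3 + [False] * 7
print(f"agreement = {percent_agreement(reviewer_a, reviewer_b):.2%}")
print(f"kappa = {cohen_kappa(reviewer_a, reviewer_b):.2f}")
```

The sketch does not compute the standard error or confidence interval reported above.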

The complete process of literature search, selection and screening was documented

to guarantee transparency.

Table 1: Used key words and their synonyms

Assessment: Functional Capacity Evaluation(s); FCE; Functional assessment(s)

Study objective: Psychometric properties; Psychometrics; Reliability; Reliable; Repeatable; Validity; Valid

Study topic: Vocational rehabilitation; Return-To-Work; Vocational participation; Work; Job
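
The way key terms were combined into search phrases can be sketched as follows: synonyms within a column are joined with OR, and the columns are joined with AND. This is an illustrative sketch only; the helper names are hypothetical, and database-specific syntax (such as quoting multi-word terms or field tags) is deliberately left out because it is not described in this review.

```python
from typing import Iterable, List


def or_group(terms: Iterable[str]) -> str:
    """Join synonyms with OR and wrap them in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def build_search_phrase(*groups: List[str]) -> str:
    """Join the OR-groups with AND, one group per concept."""
    return " AND ".join(or_group(group) for group in groups)


# Terms taken from the example search phrase in section 2.1.
assessment = ["Functional Capacity Evaluation", "FCE"]
objective = ["Psychometrics", "Psychometric properties", "Validity", "Reliability"]
topic = ["Return to Work", "Vocational rehabilitation", "Work", "Job",
         "Participation in work"]

phrase = build_search_phrase(assessment, objective, topic)
print(phrase)
```

With these three term lists, the function reproduces the example search phrase given in section 2.1.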

2.2. Quality assessment

The methodological quality of the included studies was assessed by using the three-

level quality appraisal scale, developed by Gouttebarge, Wind, Kuijer and Frings-

Dresen (2004) for a previous systematic review on the psychometrics of FCE methods

(6). The scale is mostly based on other studies (16,22–26) and its purpose is to evaluate

the scientific relevance of studies researching psychometric properties. The items

'internal consistency' and 'responsiveness' were added for the current review and based

on the criteria proposed by Terwee, Bot, de Boer, van der Windt, Knol, Dekker, Bouter

and de Vet (2007). Overall, the scale considers five methodological quality appraisal

features (6), namely:

1. Functional Capacity Evaluation: to evaluate if it is clearly mentioned and

whether the full FCE method has been used or, if not, which subtests were used

2. Objective: to evaluate whether the objective of the study is clearly defined

3. Study population: to judge whether the study population is well described

4. Procedure: to evaluate whether the study used a well-defined procedure to

achieve the objective

5. Statistics: to evaluate whether the statistics used were clearly described and

properly used to test the hypothesis of the study

Each study was scored on these five features and a total score was calculated by adding

+ and - scores. A plus (+) adds one point to the score, a minus (-) subtracts one point

and a +/- does not change the overall score. The methodological quality of the studies

was rated as follows (6):

- High: + 4 or 5, indicating a high methodological quality

- Moderate: + 2 or 3, indicating a moderate methodological quality

- Low: + 0 or 1, indicating a low methodological quality
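
The scoring rule above can be written out as a short sketch. The feature keys are hypothetical labels for the five appraisal features. The scale defines 'Low' only for totals of 0 or 1, so treating negative totals as 'Low' as well is an assumption of this sketch.

```python
from typing import Dict

POINTS = {"+": 1, "+/-": 0, "-": -1}
# Hypothetical keys for the five appraisal features.
FEATURES = ("fce", "objective", "population", "procedure", "statistics")


def total_score(ratings: Dict[str, str]) -> int:
    """Sum the points of the five features (+ adds 1, - subtracts 1, +/- adds 0)."""
    missing = [feature for feature in FEATURES if feature not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    return sum(POINTS[ratings[feature]] for feature in FEATURES)


def quality_level(score: int) -> str:
    """Translate a total score into the three-level rating."""
    if score >= 4:
        return "High"
    if score >= 2:
        return "Moderate"
    # Assumption: totals below 0 are not defined in the scale and are treated as Low.
    return "Low"


# Hypothetical study: clear FCE and objective, partial population data,
# inadequate procedure, adequate statistics.
example = {"fce": "+", "objective": "+", "population": "+/-",
           "procedure": "-", "statistics": "+"}
print(total_score(example), quality_level(total_score(example)))
```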

Table 2: 'The methodological quality appraisal' shows a comprehensive description of

the full appraisal. When a study did not present the needed information, it immediately

scored a '-' for this item. To determine whether the study procedures and statistics used

were adequate for the study purpose, the descriptions by Terwee et al. (2007) were used

(27). The outcomes of the reviewed studies are expressed through different statistics,

which all require a specific interpretation. Table 3: 'Interpretation of reliability and

validity outcomes' displays how specific statistics were translated into a 'high',

'moderate' or 'low' level of reliability or validity.

Table 2: The methodological quality appraisal

FCE method
+ It is clearly mentioned in this study whether the full FCE method or which subtests have been used
- It is not clearly mentioned in this study whether the full FCE method or which subtests have been used

Objective
+ The objective of the study is clearly mentioned
- The objective of the study is not clearly mentioned

Population
n: number of subjects/raters, G: gender, A: age, H: health status, W: work status
+ The five Population items (n, G, A, H and W) appear in the article
+/- 3-4 of the five items appear in the article
- 1-2 of the five items appear in the article

Procedure

Intra-rater reliability or test-retest reliability
+ Time interval (days) between test-retest ranges from 7 to 14
+/- Time interval (days) between test-retest ranges from 3 to 6 or from 15 to 21
- Time interval (days) between test-retest is less than 3 or more than 21

Inter-rater reliability
+ Number of raters used is more than 2
+/- Number of raters used is 2 within more than ten measurements
- Number of raters used is 2 within ten measurements or less

Validity
+ The study design is clearly described and appears properly defined to the type of validity that is meant to be measured
+/- The study design satisfies only one of the conditions described above
- The study design is not clearly described and does not appear properly defined to the type of validity that is meant to be measured

Internal consistency
+ The study design is clearly described and appears properly defined to measure internal consistency
- The study design is not clearly described and does not appear properly defined to measure internal consistency

Responsiveness
+ Hypotheses about changes in measures are predefined
- Hypotheses about changes in measures are not predefined

Statistics
+ The statistics used are clearly described and appear properly defined to achieve the objective of the study
- The statistics used are not clearly described and do not appear properly defined to achieve the objective of the study
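
The numerical criteria in Table 2 (population items, test-retest interval and number of raters) lend themselves to a short illustration. The sketch below uses hypothetical function names. Cases that the table does not cover, such as zero reported population items or fewer than two raters, are scored '-' here, which is an assumption in line with the rule that missing information scores a '-'.

```python
def rate_population(items_reported: int) -> str:
    """Rate how many of the five population items (n, G, A, H, W) are reported."""
    if not 0 <= items_reported <= 5:
        raise ValueError("items_reported must be between 0 and 5")
    if items_reported == 5:
        return "+"
    if items_reported >= 3:
        return "+/-"
    return "-"  # 1-2 items per Table 2; 0 items is an assumed extension


def rate_retest_interval(days: int) -> str:
    """Rate the time interval between test and retest, in days."""
    if 7 <= days <= 14:
        return "+"
    if 3 <= days <= 6 or 15 <= days <= 21:
        return "+/-"
    return "-"  # less than 3 or more than 21 days


def rate_inter_rater(raters: int, measurements: int) -> str:
    """Rate an inter-rater design by number of raters and measurements."""
    if raters > 2:
        return "+"
    if raters == 2 and measurements > 10:
        return "+/-"
    return "-"  # 2 raters with ten measurements or less; fewer raters is assumed "-"
```

For example, a retest after 10 days would be rated '+', whereas a retest after 30 days would be rated '-'.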

Table 3: Interpretation of reliability and validity outcomes

Levels of reliability

Pearson product moment coefficient (r), Spearman correlation coefficient (ρ), Kendall-Tau correlation
- High: r/ρ > 0.80
- Moderate: 0.50 ≤ r/ρ ≤ 0.80
- Low: r/ρ < 0.50

Intra-class correlation coefficient (ICC)
- High: ICC > 0.90
- Moderate: 0.75 ≤ ICC ≤ 0.90
- Low: ICC < 0.75

Kappa value (κ)
- High: κ > 0.60
- Moderate: 0.41 ≤ κ ≤ 0.60
- Low: κ ≤ 0.40

Cronbach's alpha (α)
- High: α > 0.80
- Moderate: 0.71 ≤ α ≤ 0.80
- Low: α ≤ 0.70

Percentage of agreement (%)
- High: agreement > 90% and the raters can choose between more than two score levels
- Moderate: agreement > 90% and the raters can choose between two score levels
- Low: agreement < 90% and/or the raters can choose between two score levels

Levels of validity

Face/content validity
- High: The test measures what it is intended to measure and all relevant components are included
- Moderate: The test measures what it is intended to measure but not all relevant components are included
- Low: The test does not measure what it is intended to measure

Criterion-related validity: concurrent and predictive validity
- High: Substantial similarity between the test and the criterion measure (percentage agreement ≥ 90%, κ > 0.60, r > 0.75)
- Moderate: Some similarity between the test and the criterion measure (percentage agreement ≥ 70%, κ ≥ 0.40, r ≥ 0.50)
- Low: Little or no similarity between the test and the criterion measure (percentage agreement < 70%, κ < 0.40, r < 0.50)

Construct validity: convergent and divergent/discriminant validity
- High: Good ability to differentiate between groups or interventions, or good convergence/divergence between similar tests (r ≥ 0.60)
- Moderate: Moderate ability to differentiate between groups or interventions, or moderate convergence/divergence between similar tests (0.30 ≤ r < 0.60)
- Low: Poor ability to differentiate between groups or interventions, or low convergence/divergence between similar tests (r < 0.30)

Other

Responsiveness
- Adequate: Ability to detect clinically important changes over time (Responsiveness Ratio (RR) ≥ 1.96 or the area under the receiver operating characteristics (ROC) curve (AUC) ≥ 0.70)
- Inadequate: No ability to detect clinically important changes over time (Responsiveness Ratio (RR) < 1.96 or the area under the receiver operating characteristics (ROC) curve (AUC) < 0.70)
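
The reliability cut-offs in Table 3 can be applied mechanically. The following is a minimal sketch of such a lookup, not part of the original review: the names `THRESHOLDS`, `classify` and `interpret_reliability` are hypothetical, and values falling in the small gaps the table leaves open (for example a κ between 0.40 and 0.41) are assumed here to count as "Low".

```python
# Reliability cut-offs transcribed from Table 3.
# Each entry is (value must exceed this for "High", lower bound for "Moderate").
THRESHOLDS = {
    "r": (0.80, 0.50),      # Pearson r / Spearman rho
    "icc": (0.90, 0.75),    # intra-class correlation coefficient
    "kappa": (0.60, 0.41),  # kappa value
    "alpha": (0.80, 0.71),  # Cronbach's alpha
}


def classify(value, high_above, moderate_from):
    """Map a coefficient to 'High', 'Moderate' or 'Low'."""
    if value > high_above:
        return "High"
    if value >= moderate_from:
        return "Moderate"
    # Assumption: values in the unspecified gaps of Table 3 fall here.
    return "Low"


def interpret_reliability(kind, value):
    """Interpret a reliability coefficient of the given kind."""
    high_above, moderate_from = THRESHOLDS[kind]
    return classify(value, high_above, moderate_from)


if __name__ == "__main__":
    for kind, value in [("icc", 0.96), ("icc", 0.74), ("kappa", 0.58), ("alpha", 0.92)]:
        print(kind, value, interpret_reliability(kind, value))
```

Keeping the cut-offs in one dictionary makes it easy to see that every coefficient type is interpreted against the same table.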


Studies were not excluded based on their score on the three-level quality appraisal by Gouttebarge et al. (2004), because this appraisal does not take into account all factors that potentially influence methodological quality. To further nuance the studies' scores on the methodological quality appraisal, additional information on methodological aspects that are susceptible to specific types of bias (28) is displayed in Table 4: 'Methodological aspects of the reviewed studies, susceptible to bias'.

2.3. Synthesis approach

Relevant data were extracted from the reviewed studies and gathered into a standardized

data-extraction form. This form was subdivided into extraction tables in which relevant

data were organized. The core findings in each study were expressed by measures of

validity and/or reliability. Where possible, these data were directly extracted from the

original article.

The complete systematic review was reviewed by five external reviewers: three experts

in FCE research with several topic-related publications and two professionals with

extensive practical FCE experience.


3. Results

3.1. Literature search

A total of 1381 hits were retrieved from the nine databases, of which 200 were duplicates. The remaining 1181 references were screened by applying the inclusion and exclusion criteria to their titles, which led to the exclusion of 1073 references (phase 1 screening). The most frequent reason for exclusion was that the title clearly indicated a topic unrelated to FCE and/or vocational rehabilitation. Of the 108 references included based on title, 69 were excluded after applying the inclusion and exclusion criteria to the abstracts (phase 2 screening). The most frequent reasons for exclusion were the publication date of the article (prior to May 2004), the type of article/study (not an RCT, CCT, meta-analysis or systematic review), the study topic (not related to FCE or vocational rehabilitation) and/or the study objective (not aiming to research psychometric properties).

The full texts of the remaining 39 references were retrieved through the bibliographic databases of the Catholic University of Leuven and the University of Ghent or by contacting the author(s). Of these 39 publications, 34 were original articles and five were systematic reviews. Citation searches were performed on the reference lists of these reviews, applying the inclusion and exclusion criteria, which yielded one additional eligible original paper. The other citations were excluded based on title or abstract, because they were duplicates or because of their publication date, study topic or study objective.

The inclusion and exclusion criteria were then applied to the 40 eligible full texts. Twenty full texts were excluded for the following reasons: not containing relevant psychometric data, researching a target group-specific FCE method, or not providing information relevant to the research question of this systematic review. This left 20 original studies to be included in this systematic review. All were written in English, with the exception of one study written in French.
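
The counts reported above can be checked with simple arithmetic. The sketch below is only an illustrative consistency check of the screening flow, using the numbers stated in the text; the function name `screening_flow` and the stage labels are hypothetical.

```python
def screening_flow():
    """Recompute the number of references remaining after each screening step."""
    hits = 1381                                # retrieved from the nine databases
    after_duplicates = hits - 200              # duplicates removed
    after_title = after_duplicates - 1073      # phase 1: title screening
    after_abstract = after_title - 69          # phase 2: abstract screening
    full_texts = after_abstract + 1            # one paper added via citation search
    included = full_texts - 20                 # full-text exclusions
    return {
        "after_duplicates": after_duplicates,
        "after_title": after_title,
        "after_abstract": after_abstract,
        "full_texts": full_texts,
        "included": included,
    }


if __name__ == "__main__":
    for stage, count in screening_flow().items():
        print(stage, count)
```

Each intermediate value matches the figure given in the text (1181, 108, 39, 40 and 20 references respectively).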


Figure 1: PRISMA Flow Diagram

3.2. Reviewed studies

Studies on nine different FCE methods were reviewed (Table 5: Studies sorted by FCE method). These included two studies on the full Baltimore Therapeutic Equipment Work Simulator and one on the full Blankenship FCE. Three studies researched the psychometric properties of subtests of the Ergo-Kit FCE, and another study examined subtests of the Ergo-Kit FCE in comparison to subtests of the ERGOS Work Simulator. Five studies were retrieved on the Isernhagen Work Systems FCE, of which two studied the full assessment and three studied its subtests. Four studies researched the full Physical Work Performance Evaluation and one study was found on the psychometrics of the full Short-Form FCE. Lastly, one study researched the subtests of the Work Disability Functional Assessment Battery and two studies researched the properties of the WorkHab FCE's subtests. Table 6: 'Studies sorted by topic (type of studied psychometric property)' shows which psychometric properties were researched per FCE method.


3.2.1. Baltimore Therapeutic Equipment (BTE) Work Simulator

The systematic literature search retrieved no studies on the reliability of the BTE. Two

studies were found on the predictive validity of the assessment (29,30).

3.2.2. Ergo-Kit (EK) Functional Capacity Evaluation

Two studies were found on the reliability of the EK: one on the intra- and inter-rater

reliability (31) and one on the inter-rater reliability and agreement (32). Two studies

were found on the validity of the EK. One studied the discriminant/divergent and

convergent validity (33) and the other study researched the concurrent validity of the

EK and the ERGOS Work Simulator (34).

3.2.3. ERGOS Work Simulator (EWS)

No studies were found on the reliability of the ERGOS Work Simulator. One study was

found on the concurrent validity of the EWS with the Ergo-Kit FCE (34).

3.2.4. Blankenship WorkEval Functional Capacity Evaluation

No studies were found on the reliability or validity of the Blankenship FCE. One study on the sensitivity and specificity of the assessment was found (35).


3.2.5. Isernhagen Work Systems (IWS) Functional Capacity Evaluation / WorkWell Systems Functional Capacity Evaluation

Three studies were found on the reliability of the IWS: one on the test-retest reliability/reproducibility (36), one on the inter-rater reliability (37) and one on both inter-rater and intra-rater reliability (38). Two studies on the validity of the IWS were found; both examined its predictive validity (10,25).

It is important to note that the name of the Isernhagen Work Systems FCE has recently been officially replaced with the name 'WorkWell Systems (WWS) FCE'. However, the five studies that were retrieved on this FCE method still used the name 'Isernhagen Work Systems FCE'. This name is therefore retained in this systematic review, bearing in mind that both names refer to the same assessment.


3.2.6. ErgoScience Physical Work Performance Evaluation (PWPE)

The systematic literature search retrieved two studies on the reliability of the PWPE:

one on the test-retest reliability/reproducibility (41) and one on the inter-rater reliability

(42). One study on the predictive validity of the PWPE was found (43) and one on the

internal and external responsiveness of the assessment (44).

3.2.7. Short-Form Functional Capacity Evaluation

The systematic literature search retrieved no studies on the reliability of the Short-Form

FCE and one study on its predictive validity (45).

3.2.8. Work Disability Functional Assessment Battery (WD-FAB)

No studies were retrieved on the reliability of the WD-FAB and one study was retrieved

on the discriminant/divergent and convergent validity of the assessment (46).

3.2.9. WorkHab Functional Capacity Evaluation

Two studies were found on the reliability of the WorkHab FCE, one researching test-retest reliability (47) and one researching intra- and inter-rater reliability (48). The study by James, Mackenzie and Capra (2010) also researched the WorkHab's internal consistency.


Table 4: Studies sorted by topic (type of studied psychometric property)

Reliability

Reproducibility / Test-retest reliability
- Agreement: Gouttebarge et al. (2006)
- Reliability: Brassard et al. (2006); Reneman et al. (2004); James, Mackenzie and Capra (2010)

Inter-rater reliability
- Durand et al. (2004); Gouttebarge et al. (2005); Gouttebarge et al. (2006); James, Mackenzie and Capra (2011); Reneman et al. (2005); Trippolini et al. (2014)

Intra-rater reliability
- Gouttebarge et al. (2005); James, Mackenzie and Capra (2011); Trippolini et al. (2014)

Validity

Criterion-related validity
- Concurrent validity: Rustenburg, Kuijer and Frings-Dresen (2004)
- Predictive validity: Branton et al. (2010); Cheng and Cheng (2010); Cheng and Cheng (2011); Gross and Battié (2005); Gross and Battié (2006); Lechner, Page and Sheffield (2008)

Construct validity
- Discriminant/divergent validity: Gouttebarge et al. (2009); Meterko et al. (2015)
- Convergent validity: Gouttebarge et al. (2009); Meterko et al. (2015)

Other

Diagnostic properties
- Sensitivity: Brubaker et al. (2007)
- Specificity: Brubaker et al. (2007)

Responsiveness
- Internal responsiveness: Durand et al. (2008)
- External responsiveness: Durand et al. (2008)

Internal consistency
- James, Mackenzie and Capra (2010)

Note: some studies are mentioned more than once, because they study multiple psychometric properties


Table 5: Studies sorted by FCE method

Baltimore Therapeutic Equipment (BTE) Work Simulator
- Full FCE studied: Cheng and Cheng (2010); Cheng and Cheng (2011)
- Subtest(s) studied: none

Blankenship WorkEval Functional Capacity Evaluation
- Full FCE studied: Brubaker et al. (2007)
- Subtest(s) studied: none

Ergo-Kit (EK) Functional Capacity Evaluation
- Full FCE studied: none
- Subtest(s) studied: Gouttebarge et al. (2005); Gouttebarge et al. (2006); Gouttebarge et al. (2009); Rustenburg, Kuijer and Frings-Dresen (2004)

ERGOS Work Simulator
- Full FCE studied: none
- Subtest(s) studied: Rustenburg, Kuijer and Frings-Dresen (2004)

Isernhagen Work Systems FCE
- Full FCE studied: Reneman et al. (2004); Gross and Battié (2005)
- Subtest(s) studied: Reneman et al. (2005); Trippolini et al. (2014); Gross and Battié (2006)

Physical Work Performance Evaluation (PWPE)
- Full FCE studied: Brassard et al. (2006); Durand et al. (2004); Durand et al. (2008); Lechner, Page and Sheffield (2008)
- Subtest(s) studied: none

Short-Form Functional Capacity Evaluation
- Full FCE studied: Branton et al. (2010)
- Subtest(s) studied: none

Work Disability Functional Assessment Battery (WD-FAB)
- Full FCE studied: none
- Subtest(s) studied: Meterko et al. (2015)

WorkHab Functional Capacity Evaluation
- Full FCE studied: none
- Subtest(s) studied: James, Mackenzie and Capra (2010); James, Mackenzie and Capra (2011)

Note: some studies are mentioned more than once, because they study multiple FCE methods


3.3. Quality appraisal

The overall methodological quality of the reviewed studies was moderate to high, based on the three-level quality appraisal scale by Gouttebarge et al. (2004). Sixteen studies showed high methodological quality (11,29–34,37,39,42–48), three studies showed moderate methodological quality (35,36,41) and only one study was rated as having low methodological quality (40). Results of the methodological quality appraisal are displayed in Table 7. Nearly all studies reported whether a full FCE method or subtests were studied and clearly stated the study objective. A few studies did not (clearly) report data on subjects' health status or work status, while the number of subjects, their gender distribution and mean age were always mentioned. In most cases, scores decreased because of a questionable study design/procedure or the use of inadequate statistics, based on the criteria proposed by Terwee et al. (2007) (27).

Most included studies appeared to be somewhat susceptible to several specific types of bias, or failed to report important methodological aspects of the study (Table 8). Firstly, none of the studies reported that study subjects were randomly sampled. Most studies did not report the subject selection method, or used convenience sampling. However, almost all studies reported the inclusion and exclusion criteria used. Of the studies in which allocation was relevant, only three performed random allocation. The subjects were never reported to be blinded, mostly because blinding of the subjects was not possible. In most cases where rater blinding would have been relevant, it was not performed or not mentioned. Furthermore, many studies had missing data but did not report how these were handled. All studies reported the necessary outcomes, but in some studies it was unclear whether extra results were reported (possibly to embellish poor results). In studies that needed two equivalent groups of subjects, it was mostly not mentioned whether these groups were really comparable or significantly different at baseline. In most of the studies that did report this aspect, groups were equivalent for important characteristics. Lastly, it was sometimes unclear whether studies received funding from (possibly) involved parties. Four studies clearly reported that there was no financial influence, while most others did not mention financial support.


3.4. Study outcomes

Table 6: 'Overview of the study characteristics' shows the main characteristics and outcomes of the reviewed studies, sorted by FCE method. The study methods that were used in the reviewed studies and the corresponding outcomes with their interpretations are displayed more extensively in Table 5: 'Study methods and results'. Some of the reviewed studies also researched aspects of the FCE methods other than the psychometric properties. This information, however, is not included in the results of the present review because it falls outside the scope of the research.


Table 6: Overview of the study characteristics and main outcomes

ICC = Intra-class Correlation Coefficient, % = percentage of agreement, κ = kappa value, r = Pearson Correlation Coefficient, ρ = Spearman's rank correlation, (A)HR = (Adjusted) Hazard Ratio, (A)OR = (Adjusted) Odds Ratio, * = significant, NS = Not Significant.

n = number of subjects or raters, G = gender, A = age, H = health status, W = work status.

Studies on the Baltimore Therapeutic Equipment (BTE) Work Simulator

Cheng and Cheng (2010)
- Full FCE method or subtests studied: Job-specific FCE: varying composition of assessment components, depending on the subject
- Psychometric(s) studied: Predictive validity
- Population: n: 645; G: 390 male, 255 female; A: 41.59±10.49 years; H: Subjects with non-specific low back pain; W: Not working because of low back pain
- Procedure: Retrospective cohort study: ability of the BTE to predict Return-To-Work/employment status
- Outcomes: Recommendations & employment status: κ = 0.435

Cheng and Cheng (2011)
- Full FCE method or subtests studied: Job-specific FCE: varying composition of assessment components, depending on the subject
- Psychometric(s) studied: Predictive validity
- Population: n: 194; A: 43.6±10.11 years; H: Subjects with distal radius fracture; W: Not working
- Procedure: Retrospective cohort study: ability of the BTE to predict Return-To-Work/employment status
- Outcomes: Recommendations & employment status: κ = 0.449. Percentages of correct predictions: 'Previous job': 94.83%; 'Change job': 60.47%; 'Previous job with modification': 9.38%; 'Do not work at the moment': 60.47%


Studies on the Blankenship WorkEval Functional Capacity Evaluation

Brubaker et al. (2007)
- Full FCE method or subtests studied: Studied subtests: repetitive-movement tests, static-strength tests, occasional-material-handling tests, hand tests
- Psychometric(s) studied: Sensitivity; Specificity
- Population: n: 49 subjects; A: 36 years, range 18-65 years; H: Subjects with or without musculoskeletal injury; W: Working, not working or retired
- Procedure: Single-blinded, randomized, post-test only study: comparing results from two groups, one performing at submaximal effort (50%) and one performing at maximal effort (100%)
- Outcomes: Sensitivity = 80.0%; Specificity = 84.2%; Positive likelihood ratio = 5; Negative likelihood ratio = 0.2
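
The likelihood ratios reported for Brubaker et al. (2007) follow from the sensitivity and specificity through the standard definitions LR+ = sensitivity / (1 − specificity) and LR− = (1 − sensitivity) / specificity. The sketch below is an illustrative recomputation only; the function name `likelihood_ratios` is hypothetical and the study itself may have derived the values differently.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (positive LR, negative LR) from sensitivity and specificity given as proportions."""
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


if __name__ == "__main__":
    # Sensitivity 80.0% and specificity 84.2%, as listed in the table entry above.
    lr_pos, lr_neg = likelihood_ratios(0.800, 0.842)
    print(round(lr_pos, 2), round(lr_neg, 2))
```

Rounded to one significant figure, the recomputed values agree with the reported positive likelihood ratio of 5 and negative likelihood ratio of 0.2.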


Studies on the Ergo-Kit (EK) Functional Capacity Evaluation

Gouttebarge, Wind, Kuijer and Sluiter (2005)
- Full FCE method or subtests studied: Subtests studied: Back-torso lift test (BTLT), Shoulder lift test (SLT), Forward manipulation test (FMT), Lower manipulation test crouching (LMTC), Carrying lifting strength test (CLST), Lower lifting strength test (LLST), Upper lifting strength test (ULST)
- Psychometric(s) studied: Intra-rater reliability (1); Inter-rater reliability (2)
- Population: n: 2 raters (27 subjects); A: 40±16 years; H: Subjects without musculoskeletal complaints; W: Working part-time or full-time
- Procedure: Time interval (1): between t1 & t2: 4±1 days, between t2 & t3: 8±2 days. Raters used (2): 2
- Outcomes (1): BTLT: ICC = 0.96 (95% CI = 0.91-0.98); SLT: ICC = 0.93 (95% CI = 0.85-0.97); CLST: ICC = 0.88 (95% CI = 0.75-0.94); LLST: ICC = 0.86 (95% CI = 0.72-0.93); ULST: ICC = 0.84 (95% CI = 0.69-0.92); FMT: ICC = 0.76 (95% CI = 0.46-0.89); LMTC: ICC = 0.55 (95% CI = 0.21-0.77)
- Outcomes (2), after a 4-day time interval: BTLT: ICC = 0.96 (95% CI = 0.91-0.98); LLST: ICC = 0.95 (95% CI = 0.89-0.98); SLT: ICC = 0.94 (95% CI = 0.87-0.97); ULST: ICC = 0.94 (95% CI = 0.88-0.97); CLST: ICC = 0.93 (95% CI = 0.85-0.97); LMTC: ICC = 0.90 (95% CI = 0.78-0.95); FMT: ICC = 0.88 (95% CI = 0.74-0.94)
- Outcomes (2), after an 8-day time interval: BTLT: ICC = 0.96 (95% CI = 0.90-0.98); SLT: ICC = 0.95 (95% CI = 0.88-0.98); LLST: ICC = 0.92 (95% CI = 0.79-0.97); CLST: ICC = 0.91 (95% CI = 0.75-0.97); ULST: ICC = 0.91 (95% CI = 0.67-0.97); LMTC: ICC = 0.62 (95% CI = 0.01-0.85); FMT: ICC = 0.53 (95% CI = -0.27-0.82)

Gouttebarge, Wind, Kuijer, Sluiter and Frings-Dresen (2006)
- Full FCE method or subtests studied: Subtests studied: Back-torso lift test (BTLT), Shoulder lift test (SLT), Carrying lifting strength test (CLST), Lower lifting strength test (LLST), Upper lifting strength test (ULST)
- Psychometric(s) studied: Inter-rater reliability (1); Agreement (2)
- Population: A: 49±8 years; H: Subjects with low back pain; W: Working part-time or full-time
- Procedure: Raters used: 2
- Outcomes (1): BTLT: ICC = 0.97 (95% CI = 0.94-0.99); SLT: ICC = 0.96 (95% CI = 0.91-0.98); CLST: ICC = 0.95 (95% CI = 0.84-0.98); LLST: ICC = 0.94 (95% CI = 0.85-0.97); ULST: ICC = 0.95 (95% CI = 0.89-0.98)
- Outcomes (2): BTLT: SEM = 8.6 (95% CI = X±16.7); SLT: SEM = 5.0 (95% CI = X±9.8); CLST: SEM = 3.4 (95% CI = X±6.6); LLST: SEM = 3.7 (95% CI = X±7.2); ULST: SEM = 8.6 (95% CI = X±3.8)


Gouttebarge, Wind, Kuijer, Sluiter and Frings-Dresen (2009)
- Full FCE method or subtests studied: Subtests studied: Back-torso lift test (BTLT), Shoulder lift test (SLT), Carrying lifting strength test (CLST), Lower lifting strength test (LLST), Upper lifting strength test (ULST)
- Psychometric(s) studied: Construct validity: Discriminant/divergent validity (1); Convergent validity (2)
- Population: n: 72 subjects; A: 41±10 years; H: Subjects with musculoskeletal disorders (MSD); W: Construction workers on sick leave as a result of MSD
- Procedure: A cross-sectional study with within-subject design: comparison between the Ergo-Kit and the Instrument for Disability Risk (IDR) + the Von Korff Questionnaire (VKQ)
- Outcomes (1): Differences in performance in EK lifting tests between low risk group & high risk group (based on IDR score): NS (p>0.05)
- Outcomes (2): VKQ & BTLT: ρ = -0.17; VKQ & SLT: ρ = -0.18; VKQ & CLST: ρ = -0.27* (p<0.05); VKQ & LLST: ρ = -0.23* (p<0.05); VKQ & ULST: ρ = -0.16


Studies on the ERGOS Work Simulator

Rustenburg, Kuijer and Frings-Dresen (2004)
- Full FCE method or subtests studied: ERGOS Work Simulator subtests studied: dynamic lifting tests. The Ergo-Kit (EK) Functional Capacity Evaluation subtests studied: dynamic lifting tests
- Psychometric(s) studied: Concurrent validity
- Population: n: 25 subjects; A: 34.8±9.5 years; H: No musculoskeletal problems; W: Fire-fighters
- Procedure: Balanced study: comparing lifting capacity for the ERGOS Work Simulator and for the Ergo-Kit
- Outcomes: Correlation EK & ERGOS outcomes: sporadic lower lifting ρ = ?; frequent lower lifting ρ = 0.50*; constant lower lifting ρ = 0.50*; sporadic upper lifting ρ = 0.49*; frequent upper lifting ρ = 0.66**; constant upper lifting ρ = 0.56** (* = p<0.05, ** = p<0.01)


Studies on the Isernhagen Work Systems FCE

Gross and Battié (2005)
- Full FCE method or subtests studied: Full FCE method studied
- Psychometric(s) studied: Predictive validity
- Population: n: 54 subjects; A: 41±10.4 years; H: Subjects with compensable back injuries; W: Not working, compensation claimants
- Procedure: Prospective cohort study: ability of the Isernhagen Work Systems FCE to predict timely and sustained Return-To-Work
- Outcomes: Number of failed tasks (1 out of 25): AHR days to benefit suspension = 0.91 (95% CI = 0.87-0.96), AHR days to claim closure = 0.93 (95% CI = 0.89-0.98), AOR recurrence = 0.95 (95% CI = 0.89-1.03). Floor-to-waist lift: AHR days to benefit suspension = 1.55 (95% CI = 1.28-1.98), AHR days to claim closure = 1.42 (95% CI = 1.12-1.80), AOR recurrence = 0.91 (95% CI = 0.63-1.33). Association of FCE performance indicators with future recurrence, work status, pain intensity and disability: NS (p>0.05)


Gross and Battié (2006)
- Full FCE method or subtests studied: Subtests studied: the 15 activities in the upper extremity protocol
- Psychometric(s) studied: Predictive validity
- Population: n: 336 subjects; A: 45±11.2 years; H: Subjects with specific (a) and nonspecific (b) upper extremity conditions; W: Not working, compensation claimants
- Procedure: Longitudinal cohort study: ability of the Isernhagen Work Systems FCE upper extremity protocol to predict timely and sustained recovery
- Outcomes: Lifting performance: AHR days to benefit suspension = 1.51; AHR days to claim closure = 1.23; OR recurrence = 1.17 (95% CI = 0.96-1.43), NS (p>0.05) after controlling for confounding variables

Reneman et al. (2004)
- Full FCE method or subtests studied: Full FCE method studied: modified versions of nine out of the 24 tests
- Psychometric(s) studied: Test-retest reliability/reproducibility
- Population: n: 26 subjects; A: 34.9±12.7 years; H: Healthy subjects; W: ?
- Procedure: Time interval: 2-3 weeks
- Outcomes: Material handling tests: ICC range = 0.68-0.98; Shuttle walk test: ICC = 0.64; five criterion/ceiling tests: κ > 0.60; one criterion/ceiling test: κ = 0.57; 11 other criterion/ceiling tests: κ = ?

Reneman et al. (2005)
- Full FCE method or subtests studied: Subtests studied: lifting tests
- Psychometric(s) studied: Inter-rater reliability
- Population: n: 9 raters (31 subjects); A: (a) 29.5±10.8 years, (b) 39.6±7.1 years; H: Healthy subjects (a), subjects with chronic, non-specific low back pain (b); W: ?
- Procedure: Raters used: 9
- Outcomes (a): CR-10 ratings: ICC = 0.87 (95% CI = 0.69–0.91); categorical ratings: κ = 0.58
- Outcomes (b): CR-10 ratings: ICC = 0.76 (95% CI = 0.69–0.83); categorical ratings: κ = 0.50

Trippolini et al. (2014)
- Full FCE method or subtests studied: Subtests studied: "11 tests", not specified
- Psychometric(s) studied: Intra-rater reliability (1) of the Physical Effort Determination Scale (a) & the Submaximal Effort Determination Scale (b); Inter-rater reliability (2) of the Physical Effort Determination Scale (a) & the Submaximal Effort Determination Scale (b)
- Population: G: ?; A: 35.5 years, range 21-49 years; H: 3 subjects with non-specific low back pain and 1 with non-specific neck pain; W: ?
- Procedure: (1) Raters used: 21; (2) Time interval: 10 months
- Outcomes (1): (a) κ = 0.49 (95% CI = 0.22–0.75); (b) κ = 0.68 (95% CI = 0.60–0.76)
- Outcomes (2): Rating 1: 73% agreement, rating 2: 85% agreement. (a) Rating 1: κ = 0.51 (95% CI = 0.23–0.80), rating 2: κ = 0.7 (95% CI = 0.49–0.94). (b) Rating 1: κ = 0.68 (95% CI = 0.60–0.76), rating 2: κ = 0.77 (95% CI = 0.70–0.84)
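
For the hazard and odds ratios reported in this section, a common reading is that an association is statistically significant at the 5% level when the 95% confidence interval does not contain 1. The sketch below applies that rule to the Gross and Battié (2005) values from the table; it is an illustrative check, not the authors' analysis, and the names `excludes_null` and `gross_battie_2005` are hypothetical.

```python
def excludes_null(ci_low, ci_high, null_value=1.0):
    """True when the confidence interval lies entirely above or below the null value."""
    return ci_high < null_value or ci_low > null_value


# (estimate, CI lower bound, CI upper bound), transcribed from the
# Gross and Battié (2005) entry of Table 6.
gross_battie_2005 = {
    "failed tasks, AHR benefit suspension": (0.91, 0.87, 0.96),
    "failed tasks, AHR claim closure": (0.93, 0.89, 0.98),
    "failed tasks, AOR recurrence": (0.95, 0.89, 1.03),
    "floor-to-waist lift, AHR benefit suspension": (1.55, 1.28, 1.98),
    "floor-to-waist lift, AHR claim closure": (1.42, 1.12, 1.80),
    "floor-to-waist lift, AOR recurrence": (0.91, 0.63, 1.33),
}

if __name__ == "__main__":
    for label, (estimate, low, high) in gross_battie_2005.items():
        verdict = "CI excludes 1" if excludes_null(low, high) else "CI includes 1"
        print(f"{label}: {estimate} ({low}-{high}) -> {verdict}")
```

Read this way, both performance indicators are associated with days to benefit suspension and days to claim closure, while neither interval for recurrence excludes 1.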


Studies on the Physical Work Performance Evaluation (PWPE)

Brassard, Durand, Loisel and Lemaire (2006)
- Full FCE method or subtests studied: Full FCE method studied
- Psychometric(s) studied: Test-retest reliability/reproducibility
- Population: n: 30 subjects; A: 43±7.3 years; H: Healthy subjects; W: Working
- Procedure: Time interval: 6 weeks
- Outcomes: Dynamic force section: ICC = 0.79-0.91; Tolerance at different positions section: κ = 0.05-0.50; Mobility section: κ = 0.34-0.83; Dynamic strength section: κ = 0.49; Mobility section: κ = 0.52; Overall PWPE score: κ = 0.43

Durand, Loisel, Mercier, Stock and Lemaire (2004)
- Full FCE method or subtests studied: Full FCE method studied
- Psychometric(s) studied: Inter-rater reliability
- Population: n: 5 raters (40 subjects); A: 40.9±9.9 years; H: Subjects with back pain; W: Not working or only working light duties because of back pain
- Procedure: Raters used: 5
- Outcomes: Dynamic Strength section: κ = 0.81 (95% CI = 0.65-0.96); Position Tolerance section: κ = 0.72 (95% CI = 0.54-0.90); Mobility section: κ = 0.54 (95% CI = 0.28-0.81); Overall PWPE score: κ = 0.76 (95% CI = 0.58-0.93)

Durand, Brassard, Nha Hong and Loisel (2008)
- Full FCE method or subtests studied: Full FCE method studied
- Psychometric(s) studied: Responsiveness: Internal responsiveness (1); External responsiveness (2)
- Population: n: 57 subjects; G: Experimental group (a): 23 male, 4 female. Control group (b): 24 male, 6 female; A: Experimental group (a): 42±9.4 years. Control group (b): 43±7.4 years; H: Experimental group (a): work-related non-specific low back pain. Control group (b): healthy; W: Experimental group (a): not working because of work-related non-specific low back pain. Control group (b): working
- Procedure: Correlational prospective pre-/post-test study: comparing change scores (1) between group (a) and (b) and comparing change scores (2) in group (a) with concurrent and empirical data
- Outcomes (1): (a) Differences in PWPE score pre-/post-test: dynamic strength* (p = 0.001), postural tolerance (p = 0.0195), NS (p>0.05) for mobility and overall PWPE score. (b) Differences in PWPE score pre-/post-test: overall PWPE score NS (p>0.05) and PWPE sections NS (p>0.05)
- Outcomes (2): Difference between (a) & (b) pre-/post-test: dynamic strength* (p = 0.0008), other PWPE sections & overall PWPE score NS (p>0.05). Difference between concurrent criteria* pre-/post-test (a) (p≤0.05). Change scores & changes in concurrent criteria: perceived disability and postural tolerance Kendall Tau = -0.41* (p = 0.0131), other PWPE sections + overall PWPE score and measures Kendall Tau = NS (p>0.05)

Lechner, Page and Sheffield (2008)
- Full FCE method or subtests studied: Full FCE method studied
- Psychometric(s) studied: Predictive validity
- Population: A: 40.5±10.7 years; H: Subjects with musculoskeletal dysfunction; W: Construction workers on sick leave
- Procedure: Retrospective study: ability of the PWPE to predict Return-To-Work/employment status
- Outcomes: Recommendations & employment status at discharge: κ = 0.74; Recommendations & employment status 3 months after discharge: κ = 0.69; Recommendations & employment status at 6 months after discharge: κ = 0.70


Studies on the Short-Form Functional Capacity Evaluation

Branton et al. (2010)
- Full FCE method or subtests studied: Full FCE method studied
- Psychometric(s) studied: Predictive validity
- Population: A: 44.3±11.1 years; H: Subjects with musculoskeletal injuries; W: Not working, compensation claimants
- Procedure: Prospective cohort study: ability of the Short-Form FCE to predict timely and sustained Return-To-Work
- Outcomes: AHR days to benefit suspension = 5.45 (95% CI = 2.73-10.85); AHR days to claim closure = 5.80 (95% CI = 3.50-9.61); AOR recurrence = 1.31 (95% CI = 0.48-3.60)


Studies on the Work Disability Functional Assessment Battery (WD-FAB)

Meterko et al. (2015)
- Full FCE method or subtests studied: Subscales studied: Physical Functioning scales (PF): Changing and Maintaining Body Position (CMBP), Upper Body Function (UBF), Upper Extremity Fine Motor (UEFM) and Whole Body Mobility (WBM). Behavioural Health scales (BH): Self-Efficacy (SELF-E), Mood and Emotions (MOOD), Behavioural Control (BC) and Social Interactions (SOCIAL)
- Psychometric(s) studied: Construct validity: Discriminant/divergent validity (1); Convergent validity (2)
- Population: n: 973 subjects; A: 56±8.52 years; H: Subjects with physical (a) or mental (b) disability, no further specifics; W: Not working because of disability
- Procedure: Cross-sectional study: comparison between established and new WD-FAB scales (Physical Functioning (PF) & Behavioural Health (BH))
- Outcomes (1): Correlations with the cross-domain measure for PF: CMBP r = 0.12*, UBF r = 0.21*, UEFM r = 0.24*, WBM r = 0.15*. Correlations with the cross-domain measure for BH: SELF-E r = 0.46*, MOOD r = 0.67*, BC r = 0.32*, SOCIAL r = 0.56*
- Outcomes (2): Correlations with the two same-domain measures for PF: CMBP r = 0.42* & r = 0.65*, UBF r = 0.43* & r = 0.69*, UEFM r = 0.23* & r = 0.54*, WBM r = 0.55* & r = 0.70*. Correlations with the scales of the same-domain measure for BH: r range = -0.24 to -0.74, all significant (p<0.05)
- * = (p<0.05)


Studies on the WorkHab Functional Capacity Evaluation

James, Mackenzie and Capra (2010)
- Full FCE method or subtests studied: Subtests studied: tests of the manual handling component
- Psychometric(s) studied: Test-retest reliability/reproducibility (1); Internal consistency (2)
- Population: n: 25 subjects; A: 29±12.0 years; H: Healthy subjects; W: Students and staff members from a university
- Procedure: Time interval: 1 week
- Outcomes (1): Overall manual handling score: ICC = 0.74 (95% CI = 0.42-0.88). Tests of the manual handling score: floor to bench ICC = 0.92, bench to shoulder ICC = 0.90, bench to bench ICC = 0.91
- Outcomes (2): Manual handling score: Cronbach's α = 0.92. Tests of the manual handling score: floor to bench Cronbach's α = 0.86, bench to shoulder Cronbach's α = 0.85, bench to bench Cronbach's α = 0.82

James, Mackenzie and Capra (2011)
- Full FCE method or subtests studied: Subtests studied: tests of the manual handling component
- Psychometric(s) studied: Intra-rater reliability (1); Inter-rater reliability (2)
- Population: G: ?; A: ?; H: Subjects with a work disability, not specified; W: Not working
- Procedure: (1) Time interval: "approximately 2 weeks"; (2) Raters used: 17
- Outcomes (1): Overall ICC 0.97 (ICC range = 0.81–0.97)
- Outcomes (2): ICC 0.90 (ICC range = 0.77–0.91)


Table 7: Results of the methodological quality appraisal

Author(s) and year of publication | FCE method | Objective | Population | Procedure | Statistics | Methodological quality

Studies on the Baltimore Therapeutic Equipment (BTE) Work Simulator
Cheng and Cheng (2010) | + | + | + | + | + | 5 = High
Cheng and Cheng (2011) | + | + | + | + | + | 5 = High

Studies on the Blankenship WorkEval Functional Capacity Evaluation
Brubaker et al. (2007) | + | - | + | + | + | 3 = Moderate

Studies on the Ergo-Kit (EK) Functional Capacity Evaluation
Gouttebarge et al. (2005) | + | + | + | + & +/- | + | 4-5 = High
Gouttebarge et al. (2006) | + | + | + | +/- | + | 4 = High
Gouttebarge et al. (2009) | + | + | + | + & + | + | 5 = High

Studies on the ERGOS Work Simulator (EWS)
Rustenburg, Kuijer and Frings-Dresen (2004) | + | + | + | + | + | 5 = High

Studies on the Isernhagen Work Systems (IWS) Functional Capacity Evaluation
Gross and Battié (2005) | + | + | + | + | + | 5 = High
Gross and Battié (2006) | + | + | + | + | + | 5 = High
Reneman et al. (2004) | + | + | +/- | +/- | + | 3 = Moderate
Reneman et al. (2005) | + | + | +/- | + | + | 4 = High
Trippolini et al. (2014) | - | + | +/- | -* & + | + | 1-2* = Low

* The test-retest time interval was ten months long. However, video recordings of subjects were used to assess intra- and inter-rater reliability, preventing learning effects and carry-over effects in the subjects.

Studies on the Physical Work Performance Evaluation (PWPE)
Brassard et al. (2006) | + | + | + | - | + | 3 = Moderate
Durand et al. (2004) | + | + | + | + | + | 5 = High
Durand et al. (2008) | + | + | + | + | +/- | 4 = High
Lechner, Page and Sheffield (2008) | + | + | + | + | + | 5 = High

Studies on the Short-Form Functional Capacity Evaluation
Branton et al. (2010) | + | + | + | + | + | 5 = High

Studies on the Work Disability Functional Assessment Battery (WD-FAB)
Meterko et al. (2015) | + | + | +/- | + | + | 4 = High

Studies on the WorkHab Functional Capacity Evaluation
James, Mackenzie and Capra (2010) | + | + | + | + & + | + | 5 = High
James, Mackenzie and Capra (2011) | + | + | - | + & + | + | 4 = High


Table 4: Methodological aspects of the reviewed studies, susceptible to bias

NA = Not Applicable, NM = Not Mentioned, U=Unclear

Methodological aspect and the associated type of potential bias

Study columns, from left to right: Branton et al. (2010); Brassard et al. (2006); Brubaker et al. (2007); Cheng and Cheng (2010); Cheng and Cheng (2011); Durand et al. (2004); Durand et al. (2008); Gouttebarge et al. (2005); Gouttebarge et al. (2006); Gouttebarge et al. (2009); Gross and Battié (2005); Gross and Battié (2006); James, Mackenzie and Capra (2010); James, Mackenzie and Capra (2011); Lechner, Page and Sheffield (2008); Meterko et al. (2015); Reneman et al. (2004); Reneman et al. (2005); Rustenburg, Kuijer and Frings-Dresen (2004); Trippolini et al. (2014)

Selection bias

Was subject selection random? NM No No NM No NM U NM NM U NM NM No No No U No No No No

Were inclusion/exclusion criteria used? NM Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes NM NM Yes NM NM Yes NM NM

Allocation bias

Was allocation random? Yes NA Yes NA NA NA NA NA NA NA NA NA NA NA NA Yes NA NA NM NA

Performance bias

Were subjects blinded to allocation? NM NA No NA NA NA NA NA NA NA NA NA NA NA NA NM NA NA NA NA

Detection bias

Were raters blinded to allocation? No NA Yes NA NA NA NM NA NA NM NA NA NA NA NA NM NA NA NA NA

Attrition bias

Were there missing data/drop-outs/withdrawals/…? NM No Yes Yes Yes NM Yes No Yes NM Yes Yes NM Yes NM Yes Yes Yes Yes NM


Reporting bias

Were all outcomes stated to be measured actually reported? Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes U Yes Yes Yes Yes

Were there extra results measured post-hoc? No No No U No No No No No No No No No No No U U No No No

Confounders

Were there no significant differences between groups at baseline? U NA Yes NA NA NA NA NA NA NA NA NA NA NA NA NM NA NA NM NA

Analysis

Were the data for all subjects included in the final analysis, even those who withdrew from the study? NM Yes NM NM NM NM NM Yes NM NM No NM NM NM NM NM NM U No NM

Funding bias

Was the study funded by a (possibly) involved party? U U NM NM NM U NM No No No U U U U Yes U NM No NM U

Note: Scores are only based on the methodological aspects of the part of the study in which inter-rater reliability is examined


Table 5: Study methods and results

Study; FCE method; Psychometric property/properties studied; Study methods and results

Cheng and Cheng (2010)

Baltimore Therapeutic Equipment (BTE) Work Simulator

Predictive validity

FCE data on subjects were extracted from the clinical databases of multiple rehabilitation centres. Based on their performance in the BTE, compared to the physical work demands of the current job in terms of duration, frequency and intensity of tasks and activities, subjects were given recommendations on Return-To-Work. Three months after the evaluation, all subjects were contacted by telephone to find out their current employment status.

Kappa coefficients were calculated to evaluate the strength of agreement and

the McNemar-Bowker test was used to test the symmetry between the Return-

To-Work recommendations and employment status after three months. The

percentage of correct predictions (hit rate) of each recommendation was

measured and the relative percentage difference (deviation) was used to

evaluate the precision of Return-To-Work recommendations with respect to

each FCE-based rating.

A moderate agreement of κ = 0.435 was found between Return-To-Work

recommendations and employment status after three months. A statistically

significant difference (p<0.0001) was obtained by the McNemar-Bowker test,

indicating that there is more disagreement over some categories of Return-To-

Work outcomes than others. A percentage of correct predictions of 88.36%

was observed in returning to previous job, 71.57% in changing job, 23.17% in

returning to previous job with modification and 37.50% in not working at the

moment.
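As an illustration of the agreement statistics described above, the following minimal Python sketch computes an unweighted Cohen's kappa and the per-category hit rate from a contingency table of recommendations (rows) against outcomes (columns). The function names and the 2x2 counts are hypothetical and chosen for illustration only; they are not data from Cheng and Cheng (2010), and the McNemar-Bowker test is not covered.

```python
from typing import List, Sequence


def cohens_kappa(table: Sequence[Sequence[int]]) -> float:
    """Unweighted Cohen's kappa for a square table (rows: recommendation, columns: outcome)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed agreement: share of cases on the diagonal.
    observed = sum(table[i][i] for i in range(k)) / n
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(row_tot[i] * col_tot[i] for i in range(k)) / (n * n)
    return (observed - expected) / (1 - expected)


def hit_rates(table: Sequence[Sequence[int]]) -> List[float]:
    """Percentage of correct predictions per recommendation category (per row)."""
    return [100 * table[i][i] / sum(table[i]) for i in range(len(table))]


# Hypothetical counts for illustration, NOT data from the reviewed study.
example = [[20, 5],
           [10, 15]]
kappa = cohens_kappa(example)
rates = hit_rates(example)
print(f"kappa = {kappa:.2f}, hit rates = {rates}")
```

A kappa near 0.4 with unequal hit rates per row, as in this toy table, mirrors the pattern reported above: overall agreement can be moderate while some recommendation categories are predicted far better than others.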


Cheng and Cheng (2011)

Baltimore Therapeutic Equipment (BTE) Work Simulator

Predictive validity

Data of interest on the subjects were extracted from the clinical databases of multiple rehabilitation centres. Based on their performance in the BTE, compared to the physical work demands of the current job (in terms of duration, frequency and intensity of tasks and activities), subjects were given recommendations on Return-To-Work. Three months after evaluation, all subjects were contacted by telephone to find out their current employment status.

Kappa coefficients were calculated to evaluate the strength of agreement and

the McNemar-Bowker test was used to test the symmetry between the Return-

To-Work recommendations and employment status after three months. The

percentage of correct predictions (hit rate) of each recommendation was

measured and the relative percentage difference (deviation) was used to

evaluate the precision of Return-To-Work recommendations with respect to

each FCE-based rating.

A moderate agreement of κ = 0.449 was found between Return-To-Work

recommendations and employment status after three months. A statistically

significant difference (p<0.0001) was obtained by the McNemar-Bowker test.

A percentage of correct predictions of 94.83% was observed in returning to

previous job, 60.47% in changing job, 9.38% in returning to previous job with

modification and 60.47% in not working at the moment.




Brubaker et al. (2007)

Blankenship WorkEval Functional Capacity Evaluation

Sensitivity

Specificity

Subjects were randomly assigned to a 100% effort group and a 50% effort

group. Raters were blinded to allocation. Two raters had to apply an ordinal

scale based on the Blankenship with three possible ratings (valid, invalid or

equivocal) to each of the four observed components: repetitive-movement

tests, static-strength tests, occasional-material-handling tests and hand tests.

Sensitivity, specificity and likelihood ratios were computed from contingency

tables.

A sensitivity of 80.0% and a specificity of 84.2% were found. The positive likelihood ratio was 5 and the negative likelihood ratio was 0.2.
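The likelihood ratios follow from sensitivity and specificity by the standard definitions LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity. The sketch below applies these definitions to the reported percentages; the function name is an assumption made for illustration, and the results are consistent with the rounded values reported above.

```python
from typing import Tuple


def likelihood_ratios(sensitivity: float, specificity: float) -> Tuple[float, float]:
    """Return (positive LR, negative LR) from sensitivity and specificity given as proportions."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


# Reported values: sensitivity 80.0%, specificity 84.2%.
lr_pos, lr_neg = likelihood_ratios(0.800, 0.842)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```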




Gouttebarge et al. (2005)

Ergo-Kit (EK) Functional Capacity Evaluation

Intra-rater reliability

Inter-rater reliability


Each subject was assessed three times on the EK, twice by rater 1 and once by rater 2. Both raters were blinded to each other's ratings. A time interval of 4±1 days was used between the first and second assessment and between the second and third assessment. A time interval of 8±2 days was used between the first and the third assessment. Seven EK tests were administered in the following order: the Back-Torso Lift Test (BTLT), Shoulder Lift Test (SLT), Forward Manipulation Test (FMT), Lower Manipulation Test Crouching (LMTC), Carrying Lifting Strength Test (CLST), Lower Lifting Strength Test (LLST) and the Upper Lifting Strength Test (ULST). Three groups (A, B & C) of nine subjects were formed, according to their availability. Each group was assessed on the EK with a counterbalanced order of raters.

Levels of intra- and inter-rater reliability were expressed as an Intra-class

Correlation Coefficient (ICC); and a 95% confidence interval was calculated

for each ICC.
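The review does not state which ICC model was used in this study. As a hedged illustration only, the sketch below computes a one-way random-effects, single-measurement ICC from a subjects-by-measurements table. The function name and the example scores are hypothetical and are not data from Gouttebarge et al. (2005).

```python
from typing import Sequence


def icc_one_way(scores: Sequence[Sequence[float]]) -> float:
    """One-way random-effects, single-measurement ICC.

    `scores` holds one row per subject and one column per measurement
    (for example, per rater or per session).
    """
    n = len(scores)        # number of subjects
    k = len(scores[0])     # measurements per subject
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    subject_means = [sum(row) / k for row in scores]
    # Between-subjects and within-subjects mean squares.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, subject_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical lifting scores (kg) for three subjects measured twice.
example_scores = [[10, 12], [20, 18], [30, 31]]
icc = icc_one_way(example_scores)
perfect = icc_one_way([[1, 1], [2, 2], [3, 3]])
print(f"ICC = {icc:.3f}")
```

The coefficient approaches 1 when differences between subjects dominate differences between repeated measurements of the same subject, which is how the high and low ICCs reported here should be read.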

Intra-rater reliability per EK test: high ICCs were found in the BTLT, with ICC 0.96 (95% CI = 0.91-0.98) and the SLT, with ICC 0.93 (95% CI = 0.85-0.97). Moderate ICCs were found in the CLST with ICC 0.88 (95% CI = 0.75-0.94), in the LLST with ICC 0.86 (95% CI = 0.72-0.93), in the ULST with ICC 0.84 (95% CI = 0.69-0.92) and in the FMT with ICC 0.76 (95% CI = 0.46-0.89). The ICC for the LMTC was low, with ICC 0.55 (95% CI = 0.21-0.77).

Inter-rater reliability after a four-day time interval: high ICCs were found in the BTLT with ICC 0.96 (95% CI = 0.91-0.98), in the LLST with ICC 0.95 (95% CI = 0.89-0.98), in the SLT with ICC 0.94 (95% CI = 0.87-0.97), in the ULST with ICC 0.94 (95% CI = 0.88-0.97) and in the CLST with ICC 0.93 (95% CI = 0.85-0.97). Moderate ICCs were found in the LMTC with ICC 0.90 (95% CI = 0.78-0.95) and in the FMT with ICC 0.88 (95% CI = 0.74-0.94). These ICCs indicate moderate to high agreement between raters after a four-day time interval.

Inter-rater reliability after an eight-day time interval: high ICCs were found in the BTLT with ICC 0.96 (95% CI = 0.90-0.98), in the SLT with ICC 0.95 (95% CI = 0.88-0.98), in the LLST with ICC 0.92 (95% CI = 0.79-0.97), in the CLST with ICC 0.91 (95% CI = 0.75-0.97) and in the ULST with ICC 0.91 (95% CI = 0.67-0.97). Low ICCs were found in the LMTC with ICC 0.62 (95% CI = 0.01-0.85) and the FMT with ICC 0.53 (95% CI = -0.27-0.82). These ICCs indicate moderate agreement between raters after an eight-day time interval.


Gouttebarge et al. (2006)

Ergo-Kit (EK) Functional Capacity Evaluation

Inter-rater reliability

Agreement

Subjects were independently assessed twice by different raters on five lifting

tests of the EK: the Back-Torso Lift Test (BTLT), Shoulder Lift Test (SLT),

Carrying Lifting Strength Test (CLST), Lower Lifting Strength Test (LLST)

and the Upper Lifting Strength Test (ULST). The time interval was set at three

days and the EK tests were performed at the same time of the day.

Levels of reliability and agreement were expressed with an Intra-class Correlation Coefficient (ICC), and their 95% confidence intervals were calculated. To assess the raters' stability in repeated measurements over time and to gain an insight into the clinical relevance of the Ergo-Kit lifting tests, agreement was expressed with the standard error of measurement (SEM) and its 95% confidence interval.


The inter-rater reliability was high in the BTLT with ICC 0.97 (95% CI = 0.94-0.99), in the SLT with ICC 0.96 (95% CI = 0.91-0.98), in the CLST with ICC 0.95 (95% CI = 0.84-0.98), in the LLST with ICC 0.94 (95% CI = 0.85-0.97) and in the ULST with ICC 0.95 (95% CI = 0.89-0.98).

Agreement: the BTLT had a SEM of 8.6 (95% CI = X±16.7), the SLT had a

SEM of 5.0 (95% CI = X±9.8), the CLST had a SEM of 3.4 (95% CI =

X±6.6), the LLST had a SEM of 3.7 (95% CI = X±7.2) and the ULST had a

SEM of 8.6 (95% CI = X±3.8).
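A common convention, assumed here because the multiplier is not stated in the text, is that the 95% interval around an observed score X is X ± 1.96 × SEM. The sketch below applies this convention to the SEM values reported above; the function name is hypothetical. Under this assumption the half-widths for the BTLT, SLT, CLST and LLST come out within about 0.2 of the reported values, while the ULST pair as printed does not fit the relation.

```python
def half_width_95(sem: float, z: float = 1.96) -> float:
    """Half-width of a 95% interval around an observed score, assuming X +/- z * SEM."""
    return z * sem


# (SEM, reported half-width) per lifting test, as printed in the text above.
reported = {
    "BTLT": (8.6, 16.7),
    "SLT": (5.0, 9.8),
    "CLST": (3.4, 6.6),
    "LLST": (3.7, 7.2),
    "ULST": (8.6, 3.8),
}

# Half-widths implied by the assumed convention, rounded to one decimal.
computed = {test: round(half_width_95(sem), 1) for test, (sem, _) in reported.items()}

for test, (sem, printed) in reported.items():
    print(f"{test}: SEM {sem} -> computed ±{computed[test]}, reported ±{printed}")
```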


Gouttebarge et al. (2009)

Ergo-Kit (EK) Functional Capacity Evaluation

Construct validity:

Discriminant/divergent validity

Convergent validity

Subjects were independently assessed twice by different raters (the number of raters was not specified) on five lifting tests of the EK: the Back-Torso Lift Test (BTLT), Shoulder Lift Test (SLT), Carrying Lifting Strength Test (CLST), Lower Lifting Strength Test (LLST) and the Upper Lifting Strength Test (ULST). The time interval was set at less than ten days and the EK tests were performed at the same time of the day. Subjects were divided into two groups post hoc, based on their outcomes on the Instrument for Disability Risk (IDR), with a cut-off score of 38% for high risk of work disability. Scores on the five lifting tests were compared between the high-risk group and the low-risk group to determine discriminant validity. It was hypothesized that the high-risk group would have a lower score. To determine convergent validity, the association between outcomes on the five lifting tests and an adapted version of the Von Korff Questionnaire (VKQ) was examined.

Independent sample t-tests were performed to determine discriminant validity

of the EK lifting tests in relation to the IDR (p<0.05). To determine convergent

validity, EK lifting tests outcomes and VKQ outcomes were correlated using a

Pearson correlation coefficient.


No statistically significant differences (p>0.05) in the outcomes of the five EK

lifting tests were found between the high risk and the low risk group. This

indicates low discriminative abilities.

Correlations of -0.17, -0.18, -0.27, -0.23 and -0.16 were found between the VKQ outcomes and the EK lifting tests outcomes, which are all low. Only two of these correlations were statistically significant (p<0.05). This indicates low or no association between EK lifting tests outcomes and VKQ outcomes.




Rustenburg, Kuijer and Frings-Dresen (2004)

ERGOS Work Simulator and the Ergo-Kit FCE (EK)

Concurrent validity

Subjects performed the dynamic lifting tests of both the ERGOS Work Simulator and the Ergo-Kit to determine sporadic, frequent and constant lower and upper lifting capacity. Testing occurred on two different days, separated by an interval of seven days (six or eight). The time of day was kept constant. The number of raters is not mentioned.

Concurrent validity was determined by comparing the mean values for the

lifting tests by means of a paired t-test and a Spearman rank correlation

analysis (two-sided).

Some similarity between the FCEs was found for frequent and constant lower lifting capacity outcomes, with moderate correlations for both items: ρ = 0.50. Both correlations were statistically significant at p<0.05. No correlation coefficient could be calculated for sporadic lower lifting capacity, and therefore concurrent validity for this item was not determined. Little or no similarity between the FCEs was found for sporadic upper lifting capacity, with a low correlation of ρ = 0.49. This correlation was statistically significant at p<0.05. Some similarity between the FCEs was found for frequent upper lifting with ρ = 0.66 and for constant upper lifting with ρ = 0.56, which are moderate correlations. Both correlations were statistically significant at p<0.01.




Gross and Battié (2005)

Isernhagen Work Systems FCE

Predictive validity

Data on the subjects undergoing the Isernhagen Work Systems FCE were extracted from databases, specifically the number of failed tasks and the floor-to-waist lift weight. Subjects were contacted by telephone after one year for a follow-up interview on their current work status and employment duties, current pain intensity and perceived disability.

Analysis included Cox and logistic regression to determine the predictive

ability of the number of failed tasks and the floor-to-waist lift weight to predict

recovery. Recovery indicators were days to suspension of time-loss benefits,

days to claim closure and future recurrence.

Associations between FCE results (1 out of 25 tasks failed) and recovery

outcomes: The Adjusted Hazard Ratio (AHR) for days to benefit suspension

was 0.91 (95% CI = 0.87-0.96), the AHR for days to claim closure was 0.93

(95% CI = 0.89-0.98) and the Adjusted Odds Ratio for recurrence was 0.95

(95% CI = 0.89-1.03).

Associations between FCE results (Floor-to-waist lift, 10 kg units) and

recovery outcomes: The Adjusted Hazard Ratio (AHR) for days to benefit

suspension was 1.55 (95% CI = 1.28-1.98), the AHR for days to claim closure

was 1.42 (95% CI = 1.12-1.80) and the Adjusted Odds Ratio for recurrence

was 0.91 (95% CI = 0.63-1.33).

FCE performance indicators were not significantly (p>0.05) associated with

future recurrence, self-reported outcomes of work status, pain intensity and

disability.
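A ratio estimate is conventionally read as statistically significant at the 5% level when its 95% confidence interval excludes 1. The sketch below applies that reading to the failed-task results reported above; the helper name is hypothetical, and the check is a reading aid rather than a re-analysis.

```python
def excludes_null(lower: float, upper: float, null_value: float = 1.0) -> bool:
    """True when the confidence interval [lower, upper] does not contain the null value."""
    return not (lower <= null_value <= upper)


# 95% CIs reported above for one failed task out of 25.
benefit_suspension = excludes_null(0.87, 0.96)  # AHR 0.91
claim_closure = excludes_null(0.89, 0.98)       # AHR 0.93
recurrence = excludes_null(0.89, 1.03)          # adjusted odds ratio 0.95

print(benefit_suspension, claim_closure, recurrence)
```

This matches the summary above: the two time-to-event outcomes show an association with FCE performance, whereas future recurrence does not.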


Gross and Battié (2006)

Isernhagen Work Systems FCE

Predictive validity

Data on variables of interest were extracted from clinical and administrative databases of a rehabilitation centre. Based on the nature of their diagnosis, subjects were divided into two groups: the specific injuries group and the non-specific injuries group. There were also other significant differences between the two groups (p<0.05): days between injury and admission, previous WCB upper extremity claims, number of previous health visits, sex, side of injury and employment status. Performance in the 15 activities in the upper extremity FCE protocol of the Isernhagen Work Systems FCE was reported. Based on the subjects' job requirements, they were graded with a 'pass' or 'fail' for these activities (met/did not meet the required job demand level). The following FCE indicators were used, based on previous research: maximum performance during handgrip and lift testing and the number of tasks where performance was rated as 'fail' (below required job demands). Investigated indicators for timely and sustained recovery were: days receiving time-loss benefits in the year following FCE, days until claim closure and future recurrence (whether benefits restarted, the claim reopened or a new upper extremity claim was filed).

To determine relationships between FCE and days to suspension of time-loss

benefits and claim closure, Cox regression was used. Initially, a univariate

screen of the relation between all predictor variables and outcomes was

performed. The adjusted effect of the FCE indicators was then determined by

entering them into multivariable Cox regression models along with the

potential confounding measures found significant in the univariate screen.

Tests of confounding were also undertaken to determine if any other predictor

variable altered regression coefficients by more than 20%. These analyses


were performed for each FCE variable separately. Logistic regression was

used to evaluate the relation between FCE and future recurrence.

Association between number of failed FCE items (out of 15 tasks) and days to suspension of time-loss benefits resulted in an Adjusted Hazard Ratio (AHR) of 0.97 (95% CI = 0.94-1.00) and associations between maximum performance in handgrip and lifting tests (10 kg units) varied between AHR 0.98-1.55 (95% CI = 0.92-1.78). Association between number of failed FCE items (out of 15 tasks) and days to claim closure resulted in an AHR of 0.97 (95% CI = 0.94-1.00) and associations between maximum performance in handgrip and lifting tests (10 kg units) varied between AHR 1.05-1.81 (95% CI = 0.98-2.20). Association between number of failed FCE items (out of 15 tasks) and future recurrence resulted in an AHR of 0.97 (95% CI = 0.90-1.04) and associations between maximum performance in handgrip and lifting tests (10 kg units) varied between AHR 0.87-1.08 (95% CI = 0.60-1.27).

This indicates low or no associations between the FCE indicators and the

indicators for timely and sustained recovery.

Reneman et al. (2004)

Isernhagen Work Systems FCE

Test-retest reliability

Twenty-eight tests of the Isernhagen Work Systems FCE were administered twice to the subjects within a two to three week interval by one rater. Time of the day of assessment was kept constant. Subjects were asked to perform at maximum capacity.

For all criterion tests, the number of subjects who met the criterion for each test session was calculated and, based on these results, Cohen's kappas and percentages of absolute agreement for the two measurements were calculated. For all tests with a ceiling effect, the number of subjects who reached the ceiling for each test session was calculated and, based on these results, Cohen's kappas and percentages of absolute agreement for the two measurements were calculated.

ICCs of the material-handling tests ranged from 0.68 to 0.98; one of these

ICCs showed low, five showed moderate and two showed high agreement.

ICC of the shuttle walk test was 0.64, which indicates low agreement. Kappa

coefficients of 0.60 and higher were found in five of the criterion and ceiling

tests, which indicates high agreement. One moderate Kappa coefficient was

found and kappa coefficients could not be calculated in eleven of the tests

because of incomplete filling of the cells in the 2x2 tables.

Reneman et al. (2005)

Isernhagen Work Systems FCE

Inter-rater reliability

The subjects were videotaped while performing lifting tests as outlined in the Isernhagen Work Systems FCE. Nine observers had to independently rate the effort levels used by the subjects based on the videos.

Inter-rater reliability for the Isernhagen Work Systems FCE lifting tests was analysed by means of Cohen's kappa.

In healthy subjects kappa was 0.58, which indicates moderate agreement

between raters. In the subjects with chronic low back pain kappa was 0.50,

which is moderate.

Trippolini et al. (2014)

Isernhagen Work Systems FCE

Intra-rater reliability

Inter-rater reliability

Twenty-one raters independently rated physical effort in eighteen soundless videos of four subjects performing eleven tests of the Isernhagen Work Systems FCE. Each video was rated according to observational criteria indicative of physical effort for material handling tests as 'light to moderate', 'heavy' or 'maximal'. Observational criteria for postural tolerance tests and ambulation tests were rated on a scale from 'no or slight functional problem/limitation', 'some functional problem/limitation' to 'substantial functional problem/limitation'. This categorical scale was termed the physical effort determination (PED) scale. Submaximal effort was assumed when a client stopped a material or non-material handling test before the FCE rater observed sufficient criteria indicative of maximal weight or significant functional problems/limitation. This dichotomous scale was termed the submaximal effort determination (SED). In total the videos were 28 minutes long and all raters watched them at the same time. For every test the raters received standardized information on heart rate, lifted weight (kg) and duration of the test. Raters had to fill in ratings of physical effort. This rating was performed twice, once in September 2010 and once in July 2011, with a time interval of ten months to prevent a learning effect. Between these sessions each rater performed approximately 30 short FCEs (material handling tests only), as part of the regular clinical procedure of a work rehabilitation program.

Intra-rater reliability for the PED and the SED scales was assessed by comparing the scores from the first rating session with the scores from the second session for each rater. Inter-rater reliability was assessed twice: by comparing the scores of all the raters in session 1 and in session 2. Category 5 'not classifiable' was excluded from the analyses. Inter-rater and intra-rater reliability were calculated using Cohen's kappa values for dichotomous data, squared weighted kappa values for categorical data, and percentages of agreement.
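As a sketch of the statistic named above, the following Python function computes a squared (quadratic) weighted kappa from a square table of two ratings on an ordinal scale. The function name and the 3x3 counts are hypothetical and are not data from Trippolini et al. (2014); how the authors pooled results across the 21 raters is not described here and is not modelled.

```python
from typing import Sequence


def quadratic_weighted_kappa(table: Sequence[Sequence[int]]) -> float:
    """Weighted kappa with quadratic agreement weights for a k x k ordinal table."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0
    expected = 0.0
    for i in range(k):
        for j in range(k):
            # Full credit on the diagonal, partial credit shrinking with squared distance.
            weight = 1 - (i - j) ** 2 / (k - 1) ** 2
            observed += weight * table[i][j] / n
            expected += weight * row_tot[i] * col_tot[j] / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical ratings on a three-level ordinal scale (rows: rating 1, columns: rating 2).
example_table = [[5, 1, 0],
                 [1, 5, 1],
                 [0, 1, 5]]
kw = quadratic_weighted_kappa(example_table)
print(f"weighted kappa = {kw:.3f}")
```

Because near-misses on adjacent categories receive partial credit, a weighted kappa is usually higher than the unweighted kappa for the same ordinal table.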

Agreement within raters (intra-rater reliability) for the PED scale was

moderate, with a kappa value of 0.49 (95% CI = 0.22–0.75). Agreement within

raters (intra-rater reliability) for the SED scale was high with a kappa value of

0.68 (95% CI = 0.60–0.76).


Inter-rater agreement was 73% for rating 1 and 85% for rating 2 (percentages

of agreement), which is low.

Agreement between raters (inter-rater reliability) for the PED scale was

moderate in rating 1 with a kappa value of 0.51 (95% CI = 0.23–0.80) and

high in rating 2 with a kappa value of 0.7 (95% CI = 0.49–0.94). Agreement

between raters (inter-rater reliability) for the SED scale was high in rating 1

with a kappa value of 0.68 (95% CI = 0.60–0.76) and also high in rating 2 with

a kappa value of 0.77 (95% CI = 0.70-0.84).




Brassard et al. (2006)

Physical Work Performance Evaluation (PWPE)

Test-retest reliability

The PWPE was administered on two separate occasions by the same rater out of four raters, within a time interval of six weeks.

ICC and Cohen's kappa were calculated to determine test-retest reliability.

ICCs for the tasks of the dynamic force-section ranged from 0.79 to 0.91, displaying moderate to high agreement between test-retest results. A kappa value ranging from 0.05 to 0.50 was found for the tasks of the tolerance at different positions-section, which is low to moderate. A kappa ranging from 0.34 to 0.83 was found for the tasks of the mobility-section, with kappa 0.49 for dynamic strength, kappa 0.52 for mobility and kappa 0.43 for the total PWPE score, indicating a moderate agreement between test-retest results.

Durand et al. (2004)

Physical Work Performance Evaluation (PWPE)

Inter-rater reliability

Subjects were evaluated by five raters on the PWPE, and a score for each section and an overall score were determined. These scores were matched to the corresponding category of the five categories of work proposed by the Dictionary of Occupational Titles (DOT): very heavy, heavy, medium, light or sedentary work.

The PWPE was carried out by two selected raters: one of them administered the FCE (the direct rater) and the other functioned as silent rater, who only observed the performance without intervening in the evaluation process. Both raters alternated being the direct rater.

Unweighted Cohen's kappa coefficients were calculated to determine the inter-rater reliability for each task and section of the PWPE and the overall score. Percentage of agreement and 95% confidence intervals were also calculated.

A high agreement of κ = 0.76 was found for the overall PWPE score (95% CI = 0.58-0.93), with 85% of agreement. High agreement was also found for the Dynamic Strength section, with κ = 0.81 (95% CI = 0.65-0.96) and for the Position Tolerance section, with κ = 0.72 (95% CI = 0.54-0.90). Percentages of agreement were 87.5% and 82.5%, respectively. Moderate agreement was found for the Mobility section, with κ = 0.54 (95% CI = 0.28-0.81) and 80% of agreement.

Durand et al. (2008)

Physical Work Performance Evaluation (PWPE)

Internal responsiveness

External responsiveness

Subjects were a group of participants in a work rehabilitation program because of low back pain (rehabilitation group) and a group of healthy subjects (comparison group). The PWPE was administered twice to both groups by the same rater, with a time interval of six weeks. There were five raters. To determine the internal responsiveness, the change in the pre-/post-test PWPE scores of both groups was compared. To determine external responsiveness, the change in PWPE scores in the rehabilitation group was compared to concurrent criteria. Six concurrent criteria were chosen: aerobic capacity, perceived disability in ADLs, fear-avoidance of activity, psychological distress, pain level and therapist and worker judgement. According to literature, these are predictors of Return-To-Work.

To compare results pre-/post-test per group, the Wilcoxon Signed-Ranks Test was used. Differences between both groups were analysed using the Wilcoxon Mann-Whitney Test. Kendall correlation coefficients were used to determine whether a significant correlation existed between pre-/post-test differences.

In the rehabilitation group, PWPE scores significantly (p≤0.05) differed pre-/post-test for the dynamic strength section (p=0.001) and the position tolerance section (p=0.0195) but not for the mobility section or the overall PWPE score (internal responsiveness). In the comparison group no significant (p≤0.05) differences in PWPE score pre-/post-test were found, neither for the overall PWPE score nor for the PWPE sections (internal responsiveness). A statistically significant difference between both groups pre-/post-test was found only for the dynamic strength section of the test (p = 0.0008). All concurrent criteria for the rehabilitation group were significantly different pre-/post-test but correlations between change-scores and changes in concurrent criteria were not statistically significant, except for the postural tolerance section: Kendall Tau = -0.41 (p = 0.0131) (external responsiveness). Perceived change in the rehabilitation group was statistically significant, but did not correlate with the overall PWPE scores (0.28 ≤ Kendall Tau ≤ 0.37). Change in therapists' perception of change was significant but did not correlate with the overall PWPE scores.

Lechner, Page and Sheffield (2008)

Physical Work Performance Evaluation (PWPE)

Predictive validity

This study used data from December 1993 to October 1994 from a previous study. The PWPE was administered to the subjects by two raters and the results were compared to their job requirements (which were observationally analysed by the raters). After a maximum of five weeks of intervention based on the functional areas of deficit, the subjects were re-assessed on the PWPE components for which there was a deficit between job demands and clients' physical abilities at the initial assessment. Based on final test results, a recommendation was made: 'Return-To-Work', 'Return-To-Work: modified duty' or 'no Return-To-Work'. The subjects were contacted by telephone three and six months after discharge to determine their current working level.

A kappa coefficient was calculated to determine the level of agreement between

discharge recommendations and actual Return-To-Work.

At discharge kappa was 0.74 for Return-To-Work recommendations, which is

high. At three months after discharge kappa was 0.69 for Return-To-Work

recommendations, which is high. At six months after discharge kappa was 0.70 for

Return-To-Work recommendations, which is also high.




Branton et al. (2010)

Short-Form Functional Capacity Evaluation

Predictive validity

Previously collected data were used to compare subjects' performance in the items of the Short-Form FCE to administrative recovery outcomes (days to claim closure, days to time loss benefit suspension and future recurrence). Key FCE variables for prediction analysis included: overall subject performance, number of failed FCE items and performance in the individual FCE items.

Analysis included multivariable Cox and logistic regression using a risk factor

modelling strategy.

Associations between FCE results (number of failed items) and administrative

outcomes: Adjusted Hazard Ratio (AHR) for days to benefit suspension = 5.45

(95% CI = 2.73-10.85), AHR for days to claim closure = 5.80 (95% CI = 3.50-

9.61), Adjusted Odds Ratio for recurrence = 1.31 (95% CI = 0.48-3.60). The

proportion of variance explained by the FCE ranged from 18%-27%.




Meterko et al. (2015)

Work Disability Functional Assessment Battery (WD-FAB)

Discriminant/divergent validity

Convergent validity

Subjects were unable to work because of physical or mental disability (two disability groups). Each disability group responded to a survey, consisting of the relevant WD-FAB scales and existing measures (the RAND 36-Item Health Survey's Physical Component Summary (PCS) and Mental Component Summary (MCS)). The number of raters was not mentioned. For the physical disability group, four multi-item physical functioning (PF) scales were identified: changing and maintaining body position, upper body function, upper extremity fine motor and whole body mobility. For the mental disability group, the behavioural health (BH) component's four multi-item scales were identified: Self-Efficacy, Mood and Emotions, Behavioural Control and Social Interactions. The physical disability group was also administered the Patient-Reported Outcomes Measurement Information System (PROMIS) Physical Function 10-Item Short-Form, which measures current capability for mobility, walking, hand and arm use and activities of daily living. The behavioural disability group was also administered five scales of the six-scale Behaviour and Symptom Identification Scale (BASIS).

Construct validity was assessed by examining both convergent and

discriminant correlations between the WD-FAB scales and scores on same-

domain and cross-domain measures.

All correlations between the measures of physical functioning (the PCS and

the PROMIS) and the four PF scales of the WD-FAB were statistically

significant (p<0.05) in the predicted direction. Correlations with the PCS and

the four items of the physical functioning scale were 0.23, which is low; and


0.42, 0.43 and 0.55, which is moderate. Correlations with the PROMIS and the

four PF scales were 0.54, which is moderate and 0.65, 0.69 and 0.70, which is

high. This indicates a low to moderate association between the four PF scales

and same-domain measures (convergent validity). All correlations between the

measures of behavioural health (5 scales of the BASIS) and the four BH scales

of the WD-FAB were statistically significant (p<0.05) in the predicted

direction. These correlations ranged from -0.24 to -0.74, with two high

correlations, 15 moderate correlations and three low correlations. This

indicates a moderate association between the four BH scales and same-domain

measures (convergent validity).

Correlations between the four PF scales and the MCS were all statistically

significant (p<0.05) and they were all low: 0.12, 0.15, 0.21 and 0.24. This

indicates a good ability to differentiate between mental and physical disability

groups (discriminative/divergent validity). Correlations between the four BH

scales and the PCS were statistically significant (p<0.05) in two out of four

correlations. Three out of four correlations were low: 0.06, 0.07 and 0.21; and

one was moderate: 0.31. This indicates a good ability to differentiate between

mental and physical disability groups (discriminative/divergent validity).
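
The labels "low", "moderate" and "high" used for these correlations can be made explicit with a small classification helper. The sketch below is illustrative only: the cut-offs (0.30 and 0.60) are assumptions chosen so that the labels match those given in the text, not thresholds taken from the reviewed study, and the function name is hypothetical.

```python
def classify_correlation(r, low_cutoff=0.30, high_cutoff=0.60):
    """Label the strength of a correlation by its absolute value.

    The cut-offs are assumptions chosen to agree with the labels in the
    text; they are not taken from the reviewed study.
    """
    strength = abs(r)
    if strength < low_cutoff:
        return "low"
    if strength <= high_cutoff:
        return "moderate"
    return "high"


# Correlations reported above between the four PF scales and the PCS / PROMIS.
pcs = [0.23, 0.42, 0.43, 0.55]
promis = [0.54, 0.65, 0.69, 0.70]

pcs_labels = [classify_correlation(r) for r in pcs]
promis_labels = [classify_correlation(r) for r in promis]

print(pcs_labels)
print(promis_labels)
```

Because the sign only expresses the direction of the association, the helper works on the absolute value, so a correlation of -0.74 is labelled in the same way as 0.74.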



James, Mackenzie and Capra (2010): WorkHab Functional Capacity Evaluation

(test-retest reliability; internal consistency)

The manual handling component (floor to bench, bench to shoulder and bench

to bench lift) of the WorkHab FCE was administered to the subjects twice,

with a time interval of one week. The FCE subtests were administered by one

rater, at approximately the same time of the day. Subjects were asked to

perform to their maximum ability.

One-way random Intra-class Correlation Coefficients (ICCs), 95% Confidence

Intervals (CIs), Limits of Agreement (LoAs), paired sample t-test, kappa

(weighted for ordinal data) and percentage agreement were calculated where

appropriate. A ratio between the LoA and the mean score was also calculated

using the following formula: (1.96 x standard deviation of the mean

difference) / (mean of sessions 1 and 2) x 100%.
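
The LoA-to-mean ratio described above can be sketched in a few lines. This is a minimal illustration under two assumptions: the "standard deviation of the mean difference" is read as the sample standard deviation of the paired session differences, and the "mean of sessions 1 and 2" as the grand mean of all scores from both sessions. The function name and the example weights are hypothetical and are not data from the reviewed study.

```python
from statistics import mean, stdev


def loa_ratio(session1, session2):
    """Return the limits-of-agreement half-width as a percentage of the mean.

    Assumes paired scores (same subjects, same order) and uses the sample
    standard deviation of the differences between the two sessions.
    """
    differences = [second - first for first, second in zip(session1, session2)]
    grand_mean = mean(list(session1) + list(session2))
    return 1.96 * stdev(differences) / grand_mean * 100.0


# Hypothetical lifted weights (kg) for four subjects in two sessions.
session_1 = [20, 25, 30, 35]
session_2 = [22, 24, 31, 37]

ratio = loa_ratio(session_1, session_2)
print(f"LoA ratio: {ratio:.1f}%")
```

A smaller ratio means that the expected test-retest variation is small relative to the size of the scores themselves.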

An ICC of 0.74 (95% CI = 0.42-0.88) was found for the overall manual

handling score, which is low. ICCs of 0.92, 0.90 and 0.91 were found for the

three subtests of this component, which are moderate to high.

Cronbach's alpha for the overall manual handling score was 0.92, which

indicates good internal consistency. Cronbach's alpha of the three subtests of

this component varied from 0.82 to 0.86, which is good.
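
For readers unfamiliar with the statistic, the following sketch shows how Cronbach's alpha is commonly computed from item-level scores, using the standard formula alpha = k/(k-1) x (1 - sum of item variances / variance of total scores). The function name and the example scores are hypothetical; the reviewed study does not report its raw data.

```python
from statistics import variance


def cronbach_alpha(items):
    """Compute Cronbach's alpha.

    `items` is a list with one list of scores per item; every inner list
    holds the scores of the same subjects in the same order.
    """
    k = len(items)
    # Total score per subject, summed across all items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical scores of four subjects on three items.
example_items = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [1, 3, 3, 5],
]

alpha = cronbach_alpha(example_items)
print(f"Cronbach's alpha: {alpha:.2f}")
```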

James, Mackenzie and Capra (2011): WorkHab FCE (intra- and inter-rater reliability)


A DVD was made with recordings of the subjects performing the subtests of

the handling component of the WorkHab FCE. This DVD was sent to 17 raters

who rated the recordings according to the WorkHab FCE protocol. They

completed a score form and returned it to the researchers.

Fourteen of the raters re-assessed the same recordings in a different (random)

order approximately 14 days later. The assessors had to record the weight

lifted and calculate a manual handling score. Stance, posture, leverage, torque

and pacing comprise the manual handling score, which is based on the

principles of safe manual handling, with each of these components being rated

on a scale of 0–4, with '0' being no adherence and '4' being the highest safety

score. The sum of the five component scores is recorded as the manual

handling score for each subject.
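
The scoring rule described above (five components, each rated 0 to 4 and then summed) can be written out as a short sketch. The function and variable names are hypothetical and the example ratings are invented for illustration; this is not the WorkHab software or score form.

```python
COMPONENTS = ("stance", "posture", "leverage", "torque", "pacing")


def manual_handling_score(ratings):
    """Sum the five component ratings into one manual handling score.

    `ratings` maps each component name to an integer rating from 0
    (no adherence) to 4 (highest safety score).
    """
    for name in COMPONENTS:
        value = ratings[name]
        if not 0 <= value <= 4:
            raise ValueError(f"{name} rating must be between 0 and 4, got {value}")
    return sum(ratings[name] for name in COMPONENTS)


# Invented ratings for one subject on one lift.
example = {"stance": 4, "posture": 3, "leverage": 3, "torque": 2, "pacing": 4}
print(manual_handling_score(example))  # 16
```

With five components rated 0 to 4, the resulting score ranges from 0 to 20.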

Intra-class Correlation Coefficients (ICCs) and their 95% confidence intervals

were calculated to determine intra- and inter-rater reliability. Intra-rater

reliability, the level of agreement when the same rater viewed the same clip on

two different occasions, was calculated using ICC Model 3 (a mixed

model in which the raters are considered a fixed effect and the subjects are

considered a random effect). The ICC used to determine inter-rater reliability

was Model 2 (both raters and subjects are considered random effects).
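
The difference between the two ICC models can be illustrated with the single-measure forms described by Shrout and Fleiss, ICC(2,1) and ICC(3,1), both derived from a two-way analysis of variance. This is a sketch under assumptions: the summary above does not state whether single or average measures were used, the function name is hypothetical, and confidence intervals are not computed here. The example ratings are the widely reproduced textbook data set of six subjects rated by four judges, not data from the WorkHab study.

```python
def icc_two_way(ratings):
    """Return (ICC(2,1), ICC(3,1)) for a subjects x raters table.

    `ratings` is a list of rows, one row per subject, with one rating per
    rater in each row (no missing values).
    """
    n = len(ratings)        # number of subjects
    k = len(ratings[0])     # number of raters
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_subjects = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_raters = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_subjects - ss_raters

    ms_subjects = ss_subjects / (n - 1)
    ms_raters = ss_raters / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Model 2: raters and subjects are both random effects.
    icc2 = (ms_subjects - ms_error) / (
        ms_subjects + (k - 1) * ms_error + k * (ms_raters - ms_error) / n
    )
    # Model 3: raters are a fixed effect, subjects are a random effect.
    icc3 = (ms_subjects - ms_error) / (ms_subjects + (k - 1) * ms_error)
    return icc2, icc3


# Textbook example: six subjects (rows) rated by four judges (columns).
table = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

icc2, icc3 = icc_two_way(table)
print(f"ICC(2,1) = {icc2:.2f}, ICC(3,1) = {icc3:.2f}")
```

In Model 2, systematic differences between raters count as disagreement and lower the coefficient. In Model 3 they are treated as fixed and left out of the denominator, which is why the two values can differ considerably for the same table.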

For intra-rater reliability of the WorkHab manual handling component an ICC

of 0.97 was found for the manual handling score, which indicates high

agreement within raters. The ICCs of the subtests of the manual handling

component ranged between 0.81 and 0.97, which indicates moderate to high

agreement within raters.

For inter-rater reliability of the WorkHab manual handling component an ICC

of 0.90 was found for the manual handling score, indicating moderate

agreement between raters. The ICCs of the subtests of the manual handling

component ranged between 0.77 and 0.91, which indicates moderate to high

agreement between raters.

3.5. Summary of the results and their interpretations

3.5.1. Baltimore Therapeutic Equipment (BTE) Work Simulator

The job-specific BTE has a moderate predictive validity for Return-To-Work/employment

status (29,30).

3.5.2. Blankenship WorkEval Functional Capacity Evaluation

Brubaker et al. (2007) found a sensitivity of 80.0% and a specificity of 84.2% for the

Blankenship FCE. This indicates good diagnostic abilities of the FCE.
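
Sensitivity and specificity follow directly from a two-by-two classification table. The sketch below assumes, as the title of the cited study suggests, that a "positive" result means the FCE indicators flag submaximal effort. The counts are hypothetical and were chosen only so that the output has the same form as the reported percentages; they are not the data of Brubaker et al. (2007).

```python
def sensitivity(true_positives, false_negatives):
    """Proportion of truly positive cases that the test flags as positive."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Proportion of truly negative cases that the test flags as negative."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical counts, for illustration only.
sens = sensitivity(true_positives=8, false_negatives=2)
spec = specificity(true_negatives=16, false_positives=3)

print(f"Sensitivity: {sens:.1%}")
print(f"Specificity: {spec:.1%}")
```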

3.5.3. Ergo-Kit (EK) Functional Capacity Evaluation

Overall agreement for outcomes on the lifting tests of the EK ranged from 8.6 to 3.4,

which might be interpreted as moderate and indicates moderate variability between

outcomes on the lifting tests (32). Agreement between raters strongly varied for the

lifting tests of the EK from low to high, but was mostly high (31,32). The same applies

to agreement within raters (31).

Low discriminative abilities were found for the EK lifting tests (discriminant/divergent

validity). Little or no association was found between the EK lifting tests and the Von

Korff Questionnaire, indicating good convergent validity (33). Concurrent validity with

the ERGOS Work Simulator was low to moderate (34).

3.5.4. ERGOS Work Simulator (EWS)

Concurrent validity with the Ergo-Kit FCE was low to moderate (34).


3.5.5. WorkWell Systems Functional Capacity Evaluation

A varying test-retest reliability/reproducibility was found in the IWS material-handling

component with moderate to high agreement between outcomes. Overall, this IWS

component provided relatively stable outcomes with limited variation (36). Agreement

between raters was moderate for the lifting tests of the IWS (37) and moderate to high

agreement between raters was found for the physical and behavioural scales used in the

IWS to observationally determine physical effort in eleven IWS subtests (inter-rater

reliability) (38). Agreement within raters was moderate to high for the physical and

behavioural scales used in the IWS to observationally determine physical effort (intra-

rater reliability) (38).

Studies show that performance in the IWS (number of failed tasks and weight lifted) has

no or low predictive value for recovery outcomes such as timely or sustained Return-

To-Work and future pain, based on days to benefit suspension, days to claim closure

and recurrence (11,39).

3.5.6. ErgoScience Physical Work Performance Evaluation (PWPE)

A varying test-retest reliability/reproducibility was found for the sections of the PWPE

with moderate to high agreement between outcomes and moderate agreement between

outcomes of the overall PWPE (41). In other words, the PWPE provides relatively

stable outcomes with limited variation. Agreement between raters was moderate to high

for the PWPE sections and high for the overall PWPE score (42).

Change within subjects was detected by two PWPE sections (the dynamic strength

section and the position tolerance section), but not by the mobility sections or by the

overall PWPE (internal responsiveness) (44). The PWPE was not able to distinguish

change on the subjects' outcomes for reference measures (concurrent and empirical

data) of health status (external responsiveness) (44).

A study by Lechner, Page and Sheffield (2008) shows that performance in the PWPE

(Return-To-Work recommendations based on PWPE outcomes) has high predictive

value for Return-To-Work/employment status (43).

3.5.7. Short-Form Functional Capacity Evaluation

Performance in the Short-Form FCE (number of failed items) has predictive value

for recovery outcomes such as timely and sustained Return-To-Work, based on days to

benefit suspension and days to claim closure (45).

3.5.8. Work Disability Functional Assessment Battery (WD-FAB)

Good discriminative abilities were found for the WD-FAB physical functioning and

behavioural health scales for differentiating between physical and mental disability

(discriminant/divergent validity) (46). Low to moderate association was found between

the WD-FAB physical functioning scale and the PROMIS (convergent validity) (46).

Low to high associations were found between the WD-FAB behavioural health scale

and the BASIS, with mostly moderate associations (convergent validity) (46).

3.5.9. WorkHab Functional Capacity Evaluation

A moderate to high agreement between outcomes in the three subtests of the WorkHab

manual handling component was found and a high agreement between outcomes on the

overall manual handling score (test-retest reliability/reproducibility) (47). Overall, the

WorkHab manual handling score provided stable outcomes with little variation.

Agreement between raters was moderate to high for the outcomes of the subtests of the

WorkHab manual handling component and moderate for the overall manual handling

component outcome (inter-rater reliability) (48). Agreement within raters was moderate

to high for the subtests and high for the overall manual handling component (intra-rater

reliability) (48).

James, Mackenzie and Capra (2010) found a high internal consistency for the manual

handling component and the individual tests of the manual handling component (47).

4. Discussion and conclusion

4.1. Results

Overall, the psychometric properties of the studied FCE methods vary somewhat

between, but also within, methods. The predictive validity, for example, is low in some

methods, but high in others. An important question, however, remains whether FCEs

should be used to 'predict' or to 'establish'/'diagnose' work ability, for more client-

specific interventions. In case of the latter, sensitivity and specificity become more

important measures.

Inter-rater reliability seems to be good in most FCE methods, as well as intra-rater

reliability. Test-retest outcomes are also promising in most cases. It can be suggested

that the reliability of current FCE methods is generally well-researched and shows

positive results.

Validity has been studied less frequently in recent years. Overall, the FCE methods

in this study show variable validity; however, most outcomes show moderate to good

validity values.

4.2. Interpreting the results of the reviewed studies

An important side note to the study results is that, “according to several authors, when a

rater is involved in scoring the evaluation, intra-rater reliability is equivalent to test-

retest reliability because the accuracy of the FCE is influenced by the skill of the rater”

(49). In the present study, however, the term test-retest reliability is used, since this is

the term chosen by the authors of the articles in question. Furthermore,

concurrent validity should be determined by studying the relation between the studied

assessment and its gold standard (6). In FCE, however, there is no gold standard

available (6). Therefore, the use of the term 'concurrent validity' in the study by

Rustenburg, Kuijer and Frings-Dresen (2004) seems to be inadequate. It would be better

to speak in terms of 'comparison' or 'correlation' (6).

Another aspect that should be taken into consideration is the study sample in which the

psychometric properties of the FCE methods are researched. The outcomes of the

reviewed studies should not be generalized to broader populations, because they are

specific to the study sample at hand. The validity or reliability of an FCE method in a

sample with a specific pathology might differ from that in the general population.

Finally and most importantly, this review only included evidence published since May

2004. The outcomes of the present study should be integrated with the pre-existing

evidence found in systematic reviews by Innes and Straker (1999a, 1999b);

Gouttebarge, Wind, Kuijer and Frings-Dresen (2004); Wind, Gouttebarge, Kuijer and

Frings-Dresen (2005) and Innes (2006) (6,8,16–18), for a comprehensive representation.

4.3. More than psychometric properties

Since one of the main aims of this review was to enable comparison of FCE methods in

order to make objectively informed decisions, it is also important to look beyond their

psychometric properties (effectiveness). The practical side (efficiency) of FCE methods

is another essential aspect to take into account. According to Hart, Isernhagen and

Matheson (1993), safety should be achieved before considering validity and reliability.

When validity and reliability are demonstrated, practicality and utility should

subsequently be taken into account, in that order (50). Costs, time spent on the

assessment, user-friendliness and acceptability are some important factors, which might

sometimes play a bigger role in the choice of FCE method than their psychometric

properties. Many FCE methods have promising psychometric properties, but are

burdensome clinical tools in terms of time and cost (51). Short-Form FCEs or FCEs in

the form of a structured interview provide a potential answer to these problems.

Nonetheless, the psychometrics of these FCE methods need more scientific

substantiation.

Acceptability of FCE was studied by questioning its usefulness in an expert panel (13).

The results showed that two thirds of the experts found FCE useful because it

confirms personal judgements and provides objective information. However, reasons for

not finding FCE useful were that it did not seem objective and did not provide any new

information. Job-specific FCE might provide a solution to the lack of new or non-

specific information. The client-centeredness of this kind of FCE is an important asset,

as it should reduce redundant information and provide directly applicable input for the

vocational rehabilitation process. Furthermore, only 20% of the experts judged FCE to

be a useful prognostic instrument, which is low. Most of them argued that FCE is an

evaluative measure and is not to be used for predictive purposes. Studies on the

predictive validity of several FCE methods should replace these opinions with evidence.

4.4. Literature search and potentially missed studies

Some researchers state that the literature search in systematic reviews should be

exhaustive, so that all possibly relevant data are obtained (17). However, given a limited

time frame and limited resources, an exhaustive approach could not be guaranteed for

this review. It is therefore possible that relevant studies were missed,

for example studies that were not available in the searched databases, studies that

used different keywords or studies published in languages other than English, Dutch or

French. An important limitation of this study is that names of known FCE methods were

not used as search terms. This may have resulted in some relevant studies being missed

and this should be taken into consideration when interpreting the results. Lastly,

publication bias may have affected the literature search,

as studies with negative outcomes may not have been published. Nevertheless, the

literature search was performed as thoroughly as possible, by searching all available

databases, using broad search terms and synonyms and applying the inclusion and

exclusion criteria.

4.5. Choice of the quality assessment method

It is important to interpret the quality appraisal scores in this review correctly. The

scoring system used does not consider all important factors that can determine the

quality of reviewed studies. Although a checklist might provide more useful

information on these factors, a scoring system was used to facilitate the interpretation of

the study results. To nuance the (over)simplified representation of the studies'

methodological quality, extra information on the study designs/approaches was

provided in 'Table 4: Methodological aspects of the reviewed studies, susceptible to

bias'. Another reason for using the three-level quality appraisal scale by Gouttebarge et

al. (2004) (6) was that most validated checklists, such as the COSMIN checklist, did not

seem to be suitable for some of the reviewed studies. Other checklists were mostly

designed for appraising experimental studies, but not for studies on psychometric

properties.

4.6. Recommendations for future research

This systematic review has provided a more extensive and updated representation of the

psychometric qualities of several FCE methods. Some more ground has been covered

on the better known FCE methods, while new methods with different approaches are on

the rise and gaining scientific support as well. The newer approaches, such as the Short-

Form FCE, need to be further examined on several psychometric properties.

Psychometrics of most of the well-known methods are thoroughly researched but some

of the research indicates weaknesses in their reliability and validity. Future research

should address how these weaknesses can be overcome, while also taking into account

the practicality and utility aspects of the FCE.

REFERENCES

1. OECD. Sickness, Disability and Work: Breaking the barriers. 2011.

2. Andrén D. Work, Sickness, Earnings, and Early Exits from the Labor Market. An

Empirical Analysis Using Swedish Longitudinal Data. 2001.

3. Hakim C. The Social Consequences of High Unemployment. J Soc Policy.

1982;11:433–67.

4. Dooley D, Fielding J, Levi L. Health and Unemployment. Annu Rev Public

Health. 1996;17:449–56.

5. Takala J, Hämäläinen P, Saarela KL, Yun LY, Manickam K, Jin TW, et al.

Global estimates of the burden of injury and illness at work in 2012. J Occup

Environ Hyg. 2014;11(5):326–37.

6. Gouttebarge V, Wind H, Kuijer PP, Frings-Dresen MHW. Reliability and

validity of Functional Capacity Evaluation methods: A systematic review with

reference to Blankenship system, Ergos work simulator, Ergo-Kit and Isernhagen

work system. Int Arch Occup Environ Health. 2004;77(8):527–37.

7. Young AE, Wasiak R, Roessler RT, McPherson KM, Anema JR, van Poppel

MNM. Return-to-Work Outcomes Following Work Disability: Stakeholder

Motivations, Interests and Concerns. J Occup Rehabil.

2005;15(4):543–56.

8. Innes E. Reliability and Validity of Functional Capacity Evaluations: An Update.

Int J Disabil Manag Res. 2006;1(1):135–48.

9. Reneman MF. Introduction to the Special Issue on Functional Capacity

Evaluations: From Expert Based to Evidence Based. J Occup Rehabil.

2003;13(4):203–6.

10. Groothoff JW, Geertzen JHB, Reneman MF. Towards Consensus in Operational

Definitions in Functional Capacity Evaluation: a Delphi Survey. J Occup

Rehabil. 2008;18:389–400.

11. Gross DP, Battié MC. Functional Capacity Evaluation Performance Does Not

Predict Sustained Return to Work in Claimants With Chronic Back Pain. J Occup

Rehabil. 2005;15(3):285–94.

12. Haglund L, Karlsson G, Kielhofner G, Lai JS. Validity of the Swedish version of

the Worker Role Interview. Scand J Occup Ther. 1997;4(1-4):23–9.

13. Wind H, Gouttebarge V, Frings-Dresen MHW. Het nut van Functionele

Capaciteit Evaluatie: de visie van experts. Tijdschr voor Bedrijfs- en Verzek.

2005;13(10):300–5.

14. Chen JJ. Functional Capacity Evaluation & Disability. Iowa Orthop J.

2007;27:121–7.

15. King PM, Tuckwell N, Barrett TE. A Critical Review of Functional Capacity

Evaluations. J Am Phys Ther Assoc. 1998;78(8):852–66.

16. Innes E, Straker L. Validity of work-related assessments. Work. 1999;13(2):125–

52.

17. Innes E, Straker L. Reliability of work-related assessments. Work.

1999;13(2):107–24.

18. Wind H, Gouttebarge V, Kuijer PP, Frings-Dresen MHW. Assessment of

Functional Capacity of the Musculoskeletal System in the Context of Work,

Daily Living, and Sport: A Systematic Review. J Occup Rehabil.

2005;15(2):253–72.

19. Bieniek S, Bethge M. The reliability of WorkWell Systems Functional Capacity

Evaluation: a systematic review. BMC Musculoskelet Disord.

2014;15(1):1–13.

20. van der Meer S, Trippolini MA, van der Palen J, Verhoeven J, Reneman MF.

Which instruments can detect submaximal physical and functional capacity in

patients with chronic nonspecific back pain? A systematic review. Spine (Phila

Pa 1976). 2013;38(25):E1608–15.

21. Kuijer PP, Gouttebarge V, Brouwer S, Reneman MF. Are performance-based

measures predictive of work participation in patients with musculoskeletal

disorders? A systematic review. Int Arch Occup Environ Health. 2012;85:109–

23.

22. Altman DG. Practical statistics for medical research. London; 1991.

23. Deyo RA, Diehr P, Patrick DL. Reproducibility and responsiveness of health

status measures: statistics and strategies for evaluation. Control Clin Trials.

1991;12(4):142–58.

24. Innes E, Straker L. Reliability of work-related assessments. Work. 1999;13:107–

24.

25. Nunnally J. Psychometric theory. 2nd ed. New York: McGraw-Hill; 1978.

26. Weiner E, Stewart B. Assessing individuals. Boston: Little Brown; 1984.

27. Terwee CB, Bot SDM, de Boer MR, van der Windt DAWM, Knol DL, Dekker J,

et al. Quality criteria were proposed for measurement properties of health status

questionnaires. J Clin Epidemiol. 2007;60(1):34–42.

28. Boland A, Cherry GM, Dickson R, editors. Doing a systematic review. A

student's guide. Sage; 2013.

29. Cheng ASK, Cheng SWC. The Predictive Validity of Job-Specific Functional

Capacity Evaluation on the Employment Status of Patients With Nonspecific

Low Back Pain. J Occup Environ Med. 2010;52(7):719–24.

30. Cheng ASK, Cheng SWC. Use of Job-Specific Functional Capacity Evaluation to

Predict the Return to Work of Patients With a Distal Radius Fracture. Am J

Occup Ther. 2011;65(4):445–52.

31. Gouttebarge V, Wind H, Kuijer PP, Sluiter JK. Intra- and Interrater Reliability of

the Ergo-Kit Functional. Arch Phys Med Rehabil. 2005;86:2354–60.

32. Gouttebarge V, Wind H, Kuijer PP, Sluiter JK, Frings-Dresen MHW. Reliability

and Agreement of 5 Ergo-Kit Functional Capacity Evaluation Lifting Tests in

Subjects With Low Back Pain. Arch Phys Med Rehabil. 2006;87:1365–70.

33. Gouttebarge V, Wind H, Kuijer PP, Sluiter JK, Frings-Dresen MHW. Construct

Validity of Functional Capacity Evaluation Lifting Musculoskeletal Disorders.

Arch Phys Med Rehabil. 2009;90(2):302–8.

34. Rustenburg G, Kuijer PP, Frings-Dresen MHW. The Concurrent Validity of the

ERGOS Work Simulator and the Ergo-Kit With Respect to Maximum Lifting

Capacity. J Occup Rehabil. 2004;14(2):107–18.

35. Brubaker PN, Fearon FJ, Smith SM, McKibben RJ, Alday J, Andrews SS, et al.

Sensitivity and Specificity of the Blankenship FCE System's Indicators of

Submaximal Effort. J Orthop Sport Phys Ther. 2007;37(4):161–8.

36. Reneman MF, Brouwer S, Meinema A, Dijkstra PU, Geertzen JHB, Groothoff

JW. Test-Retest Reliability of the Isernhagen Work Systems Functional Capacity

Evaluation in Healthy Adults. J Occup Rehabil. 2004;14(4):295–305.

37. Reneman MF, Fokkens AS, Dijkstra PU, Geertzen JHB, Groothoff JW. Testing

Lifting Capacity: Validity of Determining Effort Level by Means of Observation.

Spine (Phila Pa 1976). 2005;30(2):40–6.

38. Trippolini MA, Dijkstra PU, Jansen B, Oesch P, Geertzen JHB, Reneman MF.

Reliability of Clinician Rated Physical Effort Determination During Functional

Capacity Evaluation in Patients with Chronic Musculoskeletal Pain. J Occup

Rehabil. 2014;24:361–9.

39. Gross DP, Battié MC. Does Functional Capacity Evaluation predict recovery in

workers' compensation claimants with upper extremity disorders? Occup

Environ Med. 2006;63:404–11.

40. Trippolini MA, Dijkstra PU, Côté P, Scholz-Odermatt SM, Geertzen JHB,

Reneman MF. Can Functional Capacity Tests Predict Future Work Capacity in

Patients With Whiplash-Associated Disorders? Arch Phys Med Rehabil.

2014;95:2357–66.

41. Brassard B, Durand M-J, Loisel P, Lemaire J. Étude de fidélité test-retest de

l'Évaluation des Capacités Physiques reliées au Travail. Can J Occup Ther.

2006;73(4):206–14.

42. Durand M, Loisel P, Mercier R, Stock SR, Lemaire J. The Interrater Reliability

of a Functional Capacity Evaluation: The Physical Work Performance

Evaluation. J Occup Rehabil. 2004;14(2):119–29.

43. Lechner DE, Page JJ, Sheffield G. Predictive validity of a Functional Capacity

Evaluation: The Physical Work Performance Evaluation. Work. 2008;31:21–5.

44. Durand M-J, Brassard B, Nha Hong Q, Loisel P. Responsiveness of the Physical

Work Performance Evaluation, a Functional Capacity Evaluation, in Patients

with Low Back Pain. J Occup Rehabil. 2008;18:58–67.

45. Branton EN, Arnold KM, Appelt SR, Hodges MM, Gross DP, Battié MC. A

Short-Form Functional Capacity Evaluation Predicts Time to Recovery but Not

Sustained Return-to-Work. J Occup Rehabil. 2010;20:387–93.

46. Meterko M, Marfeo EE, McDonough CM, Jette AM, Ni P, Bogusz K, et al. Work

Disability Functional Assessment Battery: Feasibility and Psychometric

Properties. Arch Phys Med Rehabil. 2015;96(6):1028–35.

47. James C, Mackenzie L, Capra M. Test-retest reliability of the manual handling

component of the WorkHab Functional Capacity Evaluation in healthy adults.

Disabil Rehabil. 2010;32(22):1863–9.

48. James C, Mackenzie L, Capra M. Inter- and intra-rater reliability of the manual

handling component of the WorkHab Functional Capacity Evaluation. Disabil

Rehabil. 2011;33(19-20):1797–804.

49. Gibson L, Dang M, Strong J, Khan A. Test-retest reliability of GAPP Functional

Capacity Evaluation in healthy adults. Can J Occup Ther. 2010;77:38–48.

50. Hart DL, Isernhagen SJ, Matheson LN. Guidelines for Functional Capacity

Evaluation of people with medical conditions. J Orthop Sports Phys Ther.

1993;18(6):682–6.

51. Gross DP, Battié MC, Asante AK. Evaluation of a Short-form Functional

Capacity Evaluation: Less may be Best. J Occup Rehabil. 2007;17:422–35.

APPENDICES

1. List of figures and tables

Figure 1: PRISMA Flow Diagram

Table 1: Used key words and their synonyms

Table 2: The methodological quality appraisal

Table 3: Interpretation of reliability and validity outcomes

Table 4: Studies sorted by topic (type of studied psychometric property)

Table 5: Studies sorted by FCE method

Table 6: Overview of the study characteristics

Table 7: Results of the methodological quality appraisal

Table 8: Methodological aspects of the reviewed studies, susceptible to bias

Table 9: Study methods and results

2. Permission for consultation

"The author and the supervisor give permission to make this master's thesis available

for consultation and to copy parts of it for personal use. Any other

use is subject to the restrictions of copyright law, in particular with regard

to the obligation to explicitly cite the source when quoting results from

this master's thesis."

12/05/2016

Noortje Schalley    Dominique Van de Velde (Supervisor)
