Faculty of Medicine and Health Sciences
Updating the Evidence on Functional Capacity Evaluation Methods
A Systematic Review
Noortje SCHALLEY
Master's thesis submitted to obtain the degree of
Master of Science in Occupational Therapy
Promoter: Prof. Dr. Dominique Van de Velde
Co-promoter: Sofie Vertriest
Mentor: Lien Van Peteghem
Academic year 2015-2016
MASTER OF SCIENCE IN OCCUPATIONAL THERAPY
Interuniversity master's programme in collaboration with:
UGent, KU Leuven, UHasselt, UAntwerpen,
Vives, HoGent, Arteveldehogeschool, AP Hogeschool Antwerpen, HoWest, Odisee, PXL,
Thomas More
ABSTRACT
Objectives: The purpose of this systematic review is to synthesize the recent evidence
(published since May 2004) on the psychometrics of currently used Functional Capacity
Evaluation (FCE) methods. This way, information from previous systematic reviews on
this topic can be enriched with up-to-date evidence, enabling more objectively
substantiated decisions in everyday practice.
Methods: A systematic literature search was conducted in nine databases. The titles,
abstracts and full texts of the resulting articles were screened against predefined
inclusion and exclusion criteria. Two reviewers independently performed this screening process.
Included studies were appraised based on their methodological quality. Relevant data
were extracted into extraction tables.
Results: The search resulted in 20 eligible studies of varying methodological quality,
on nine different FCE methods. The Baltimore Therapeutic Equipment Work Simulator
showed a moderate predictive validity and good diagnostic abilities. The Ergo-Kit (EK)
showed moderate variability and high inter- and intra-rater reliability. Low
discriminative abilities and high convergent validity were found for the EK. Concurrent
validity of the EK and the ERGOS Work Simulator was low to moderate. Moderate to
high test-retest, inter-rater and intra-rater reliability was found for the Isernhagen Work Systems
(IWS) FCE. The predictive validity of the IWS was low. The Physical Work
Performance Evaluation (PWPE) showed moderate test-retest reliability and moderate
to high inter-rater reliability. Low internal and external responsiveness were found for
the PWPE, while predictive validity was high. The predictive value of the Short-Form
FCE was also high. Low discriminative and convergent validity were found for the
Work Disability Functional Assessment Battery. The WorkHab showed moderate to
high test-retest, inter- and intra-rater reliability. Its manual handling component’s
internal consistency was high.
Conclusions: Well-known FCE methods have been rigorously studied, but some of the
research indicates weaknesses in their reliability and validity. Future research should
address how these weaknesses can be overcome. Newer methods, such as the
Short-Form FCE, need to be further examined on several psychometric properties.
ABSTRACT
Objectives: The aim of this systematic review is to bring together recent evidence
(published since May 2004) on the psychometric properties of Functional Capacity
Evaluation methods. In this way, the information from previous systematic reviews on
this topic can be enriched with up-to-date evidence, so as to facilitate objectively
substantiated decisions in everyday practice.
Methods: A systematic literature search was carried out in nine databases. The
titles, abstracts and full texts of the retrieved articles were successively screened
on the basis of predefined inclusion and exclusion criteria. Two reviewers carried out
this screening process independently of each other. Included studies were assessed on
the basis of methodological quality. Relevant data were extracted into extraction tables.
Results: The searches yielded 20 eligible studies of varying methodological quality,
concerning nine different FCE methods. The Baltimore Therapeutic Equipment Work
Simulator showed moderate predictive validity and good diagnostic ability. The Ergo-Kit
(EK) showed moderate variability and high inter- and intra-rater reliability. In
addition, low discriminative ability and good convergent validity were found for the
EK. The concurrent validity of the EK and the ERGOS Work Simulator was low to
moderate. Moderate to high test-retest reliability and inter- and intra-rater
reliability were found for the Isernhagen Work Systems (IWS) FCE. The predictive
validity of the IWS was low. The Physical Work Performance Evaluation (PWPE) showed
moderate test-retest reliability and moderate to high inter-rater reliability. Low
internal and external responsiveness were also found for the PWPE, while its
predictive validity was high. The predictive validity of the Short-Form FCE was also
high. Low discriminant and convergent validity were found for the Work Disability
Functional Assessment Battery. The WorkHab showed moderate to high test-retest
reliability and inter- and intra-rater reliability. The internal consistency of the
manual handling component of the assessment was high.
Conclusions: Well-known FCE methods have already been studied extensively, but
certain studies reveal weaknesses in terms of reliability and validity. Further
research should focus on how these weaknesses can be overcome. Newer methods, such as
the Short-Form FCE, should be examined further with regard to various psychometric
properties.
Word count of the master's thesis: 10,303
(excluding table of contents, tables, numerical data, appendices and bibliography)
TABLE OF CONTENTS
PREFACE / ACKNOWLEDGEMENTS
1. Introduction
2. Methods
2.1. Systematic literature search strategy
2.2. Quality assessment
2.3. Synthesis approach
3. Results
3.1. Literature search
3.2. Reviewed studies
3.2.1. Baltimore Therapeutic Equipment (BTE) Work Simulator
3.2.2. Ergo-Kit (EK) Functional Capacity Evaluation
3.2.3. ERGOS Work Simulator (EWS)
3.2.4. Blankenship WorkEval Functional Capacity Evaluation
3.2.5. Isernhagen Work Systems (IWS) Functional Capacity Evaluation / WorkWell Systems Functional Capacity Evaluation
3.2.6. ErgoScience Physical Work Performance Evaluation (PWPE)
3.2.7. Short-Form Functional Capacity Evaluation
3.2.8. Work Disability Functional Assessment Battery (WD-FAB)
3.2.9. WorkHab Functional Capacity Evaluation
3.3. Quality appraisal
3.4. Study outcomes
3.5. Summary of the results and their interpretations
3.5.1. Baltimore Therapeutic Equipment (BTE) Work Simulator
3.5.2. Blankenship WorkEval Functional Capacity Evaluation
3.5.3. Ergo-Kit (EK) Functional Capacity Evaluation
3.5.4. ERGOS Work Simulator (EWS)
3.5.5. Isernhagen Work Systems (IWS) Functional Capacity Evaluation / WorkWell Systems Functional Capacity Evaluation
3.5.6. ErgoScience Physical Work Performance Evaluation (PWPE)
3.5.7. Short-Form Functional Capacity Evaluation
3.5.8. Work Disability Functional Assessment Battery (WD-FAB)
3.5.9. WorkHab Functional Capacity Evaluation
4. Discussion and conclusion
4.1. Results
4.2. Interpreting the results of the reviewed studies
4.3. More than psychometric properties
4.4. Literature search and potentially missed studies
4.5. Choice of the quality assessment method
4.6. Recommendations for future research
REFERENCES
APPENDICES
1. List of figures and tables
2. Permission for consultation
PREFACE / ACKNOWLEDGEMENTS
My gratitude goes out to all the helping hands that contributed to realizing this study.
First of all, and above all, I would like to extend my gratitude to doctor Dominique Van
de Velde, promoter of this thesis. I am very thankful for his guidance and support
throughout the entire writing process; especially during the many consultation meetings,
in which I could always turn to him with my questions and with difficulties I had
encountered. Another great help during these meetings was my mentor Lien Van
Peteghem, who helped me understand the practical aspects, usage and relevance of
Functional Capacity Evaluations. I would also like to thank Sofie Vertriest, co-promoter
of this thesis, who proposed this study topic and followed up on the realization of the
systematic review.
Furthermore, a special thanks goes out to Stijn De Baets for generously providing tips,
materials and feedback; and most importantly, for independently performing the
literature screening process as an external reviewer. I would also like to express my
deep gratitude to the five external readers who took the time to attentively read my work
and provide their very helpful input and feedback. Thank you for your time and effort,
professor Ev Innes, professor Haije Wind, professor Michiel Reneman, Dirk Vandamme
and Linda Gabriël.
Finally, special recognition goes out to my family and close friends, for their support,
encouragement and patience during the realization of my thesis.
Without all this direct and indirect support, I would not have been able to realize this
final result.
1. Introduction
A synthesis of research by the Organization for Economic Co-operation and
Development (OECD) in 2011 showed that almost all OECD countries experience
significant social and economic impacts from the high number of workers permanently
leaving the labour market due to health problems or disability (1,2). Additionally,
research shows that people with reduced work capacity or with more time off work are
less likely to remain in employment (1,2). Costs associated with disability benefits
make up a significant proportion of public expenditure across OECD countries,
averaging 1.2% of countries’ Gross Domestic Product (GDP) (1). In the
Netherlands, Norway and Sweden the proportion of GDP is even higher at 3.5% (1).
Furthermore, employment rates of people with disability are on average 40% lower than
the overall level (1). These low employment rates are accompanied by high social costs
(1,3,4) due to unemployment benefits, lower incomes and much higher poverty risk (1).
Changes in the labour market due to the Global Financial Crisis (GFC), such as
increased unemployment rates, might lead to a higher number of people relying on
sickness and disability benefits as their main source of income (1). Meanwhile, the
incidence of occupational injuries leading to long-term absenteeism and potentially
leading to job loss is still on the rise in high income countries (5,6). Such developments
have led to governments shifting policy focus to concentrate on tackling rising
unemployment (1). Despite the growing awareness of the abovementioned issues,
Return-To-Work approaches are not always adequate and in several cases result in higher
levels of employee litigation and lower levels of employee morale (7), while the prevalence
of occupational disability and related costs continue to increase (5). Evidence shows,
however, that “incapacity-related spending is much higher than unemployment-related
spending” (1). Furthermore, “except for a few countries, the share of spending on
vocational rehabilitation and employment programs is less than 8%, and in most cases
less than 4%, of total disability-related spending” (1). The discrepancy between the
costs associated with unemployment and those related to incapacity and disability is of
concern. Alongside this is the relatively small investment in vocational rehabilitation
and employment programs to address the high unemployment rates of those with
disabilities and the poor work sustainability of those with reduced work capacity. There
is a need for greater investment in vocational rehabilitation in an effort to increase
employment rates, including employment of people with disabilities limiting their work
capacity. In order to determine a person’s work capacity, it is necessary to have
appropriate measurement instruments. Functional Capacity Evaluations (FCE) are such
instruments that could subsequently play a key role in decreasing incapacity-related
spending by determining a person’s work capacity and matching this with appropriate
employment, either by matching workers’ abilities to job requirements or by identifying
necessary modifications to the work environment or workload (8).
Although experts have difficulties in agreeing upon a single definition of Functional
Capacity Evaluations (FCE) (9,10), there seems to be agreement on the different terms
that comprise FCE (10). The following definition, based on the International
Classification of Functioning, Disability and Health, achieved 63% agreement amongst
FCE experts responding to a Delphi Survey: “A FCE is an evaluation of capacity of
activities that is used to make recommendations for participation in work while
considering the person’s body functions and structures, environmental factors, personal
factors and health status” (10). Regardless of the lack of a uniform definition, the
purpose of FCE is considered to be the evaluation of a person’s ability to participate in
work (10) by matching his or her capacities with functional job requirements (11). The
underlying assumption here is that a better performance in the FCE is associated with
faster Return-To-Work and lower risk of re-injury or pain exacerbation (11); however, it
is important to note that many other factors which are not measured by FCE, such as
personal causation, also influence Return-To-Work outcomes (12). Furthermore, FCE is
also described as a systematic, comprehensive and multi-faceted measurement tool,
designed to measure a person’s current physical abilities in work-related tasks (6,13).
FCEs are used for a variety of reasons, in differing settings. Most commonly they are
used with individuals who have work disabilities (6) in a rehabilitation or clinical
setting to: develop an individually oriented, customized rehabilitation program; adjust
the currently used rehabilitation program; measure changes in physical abilities (pre-
and post-intervention); and to determine functional work abilities and match these with
employment prior to Return-To-Work (6,8,14). FCEs have also become part of
medico-legal assessments to determine whether claimants should receive disability benefits,
based on their assessed functional abilities (8). In conclusion, many (health) disciplines
from various organizations use FCEs as part of practice, including for making
recommendations and decisions that have implications for people with work-related
disabilities, employers, insurers, other health and medical professionals and other
stakeholders. Therefore, it is important that FCE users know whether FCE methods
provide reliable and valid information and subsequently which method is preferable.
Over the past 30 years, several FCE methods have been developed. Matheson provided
one of the earliest examples in 1984, followed by Isernhagen in 1988, who also
suggested that FCE should be a multidisciplinary matter. Since then, at least ten
different types of well-known FCEs have been described (14). Some are still being
researched and adapted, while others have fallen out of favour, their
manufacturers/distributors have ceased operation or they have been superseded by later
versions or computer based systems (8).
The credibility of these FCE methods was primarily based on the knowledge and
expertise of the developers (6,9). However, there is a growing interest in FCE’s
psychometric properties because, as for any other test, “a FCE should give reliable and
valid measurements” (15). Consequently, numerous studies have been conducted to
validate FCE methods and to demonstrate their reliability in varying client groups.
These studies have been reviewed over the years, commencing with two comprehensive
reviews of a wide range of work-related assessments by Innes and Straker
(1999a,1999b) (16,17). Since then, other comprehensive systematic reviews addressing
the psychometric properties of FCEs have been conducted by Gouttebarge, Wind,
Kuijer and Frings-Dresen (2004); Wind, Gouttebarge, Kuijer and Frings-Dresen (2005)
and Innes (2006) (6,8,18). These comprehensive reviews synthesize the studies on all
types of psychometric properties of several well-known FCE methods and allow the
comparison of the quality of these methods. There have also been multiple systematic
reviews on more specific topics regarding FCE, for example the psychometrics of one
specific FCE method, as in the review by Bieniek and Bethge (2014) (19).
Furthermore, there have been multiple systematic reviews on specific psychometrics of
FCE and on the use of FCE in specific target groups, such as the systematic reviews by
van der Meer, Trippolini, van der Palen, Verhoeven and Reneman (2013) (20) and
Kuijer, Gouttebarge, Brouwer, Reneman and Frings-Dresen (2012) (21). These reviews
provide answers on more focused questions regarding FCE.
With the most recent comprehensive review on FCE dating from 2006 (8), there is a
clear need for an updated global review of the existing evidence, especially since new,
promising FCE methods have been developed in recent years and known methods
have been refined and more thoroughly researched (8). Another important reason for
updating the current comprehensive evidence on FCE validity and reliability is the
continuing use of FCEs in important decision-making processes regarding occupational
rehabilitation, insurance, disability benefits, and so on. The decision makers should be
objectively informed of the strengths, weaknesses and unknown properties of the
measurement tools they choose or have chosen. This study attempts to enable
such decision makers, for example clinicians in the vocational rehabilitation setting, to
nuance and critically interpret the outcomes FCE methods provide. The main aim of
performing this systematic review on the psychometric properties of FCE methods,
therefore, is to synthesize the recent evidence (published since May 2004) on the
validity and reliability of currently used methods. This way, information in previous,
similar syntheses can be enriched with up-to-date evidence, providing a more
comprehensive frame of reference. By doing so, it will be more feasible to evaluate and
compare the reliability and validity of FCE methods in order to make more objective
and substantiated decisions, for example in choosing between FCE methods.
The research question of this systematic review is therefore: “What are the (recently
studied) psychometric properties of current Functional Capacity Evaluation methods?”.
2. Methods
2.1. Systematic literature search strategy
For this systematic review, a literature search was conducted to identify relevant studies
from the following electronic databases. These particular databases were chosen based
on convenience and relevance.
Broad database(s)
- Web of Science
- Trip Database
- Journal Storage (JSTOR)
- The bibliographic database of the Catholic University of Leuven
Healthcare database(s)
- PubMed
- Embase
- Cochrane Library
Discipline specific database(s)
- PEDro
- OTSeeker
Relevant search terms and their synonyms were identified. MeSH-terms were used
when available. Before the actual literature search, scoping searches were performed to
determine which of these search terms would provide the highest number of relevant
results. These were used as key terms (Table 1: Used key words and their synonyms)
and were combined into search phrases using Boolean operators. For example:
(Functional Capacity Evaluation OR FCE) AND (Psychometrics OR Psychometric
properties OR Validity OR Reliability) AND (Return to Work OR Vocational
rehabilitation OR Work OR Job OR Participation in work). These search phrases were
entered into the abovementioned databases between the 12th and the 22nd of November
2015. Filters for publication date and type of study were used when available, in
accordance with the inclusion criteria. This complete process was performed in three
languages: English, Dutch and French and resulted in a total of 1381 hits. These hits
were screened by applying the following inclusion and exclusion criteria:
Inclusion criteria:
- type of study: RCT (clustered or individual), CCT, meta-analysis or systematic
review,
- type of subjects: healthy subjects or subjects with an occupational disability or
disease (in the process of vocational rehabilitation),
- type of outcome measure: psychometric properties (validity and reliability),
- type of assessment: (components of) Functional Capacity Evaluation methods,
that assess the global physical/functional capacity of the subject,
- publication date: May 2004 to the 12th of November 2015,
- language: English, French or Dutch.
Exclusion criteria:
- studies reporting on FCE methods that are no longer commercially available or
no longer used,
- studies of FCE methods that only measure certain specific functional/physical
capacities, for example the EPIC Lift Capacity Test and the Progressive
Isoinertial Lifting Evaluation (PILE), which only assess lifting,
- studies of FCE methods with a specific target population, for example: the
Whiplash Associated Disorders (WAD) FCE,
- studies that are not relevant to the research question.
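Of the criteria above, only publication date and language can be checked mechanically; the others require reviewer judgement. The following is a minimal sketch of such a pre-filter. The `Hit` record and its field names are hypothetical, and "May 2004" is assumed to mean 1 May 2004, since the criteria do not specify a day.

```python
from dataclasses import dataclass
from datetime import date

# Assumption: "May 2004" is read as 1 May 2004; the end date is taken from the criteria.
WINDOW_START = date(2004, 5, 1)
WINDOW_END = date(2015, 11, 12)
ALLOWED_LANGUAGES = {"english", "french", "dutch"}


@dataclass
class Hit:
    """Hypothetical record describing one search hit."""
    title: str
    published: date
    language: str


def passes_prefilter(hit: Hit) -> bool:
    """Check only the publication-date and language inclusion criteria."""
    in_window = WINDOW_START <= hit.published <= WINDOW_END
    return in_window and hit.language.lower() in ALLOWED_LANGUAGES
```

A hit that passes this pre-filter would still need to be screened by title, abstract and full text against the remaining criteria.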
In the first phase of the screening process, these criteria were applied to the titles of the
1381 hits to determine whether they would be included or excluded. In the second
phase, the abstracts of the studies included based on title were screened against the
criteria described above. When studies showed questionable relevance in these first two
phases, they were included and more thoroughly screened in the following phase. The
studies that were included based on title and abstract, were screened a third time based
on their full texts. In this final phase of the screening process, references of included
systematic reviews were also screened for relevant studies through citation searching.
To ensure objectivity in the screening process, an external reviewer was asked to
independently complete the same process of screening the studies based on title and
abstract. Percentage of agreement and Cohen’s kappa were calculated to compare the
screening results of the two reviewers. Agreement between the two
reviewers on the inclusion of studies based on abstract, was high on the first review
with 98.17% agreement and κ = 0.96 (SE = 0.03; 95% CI = 0.910-1.00), and excellent
after deliberation/discussion, with 100% agreement and κ = 1.00. Disagreements on the
inclusion or exclusion of studies were resolved by a discussion between the two
reviewers. It was not necessary to consult a third reviewer.
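For readers unfamiliar with the two agreement statistics, the sketch below shows how percentage of agreement and Cohen's kappa can be computed from two reviewers' include/exclude decisions. The function names and the example decisions are illustrative assumptions, not the data of this review, and the handling of the degenerate case (both reviewers using one identical category throughout) is a convention chosen here.

```python
from collections import Counter
from typing import Sequence


def percent_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Share of items (in %) on which both reviewers made the same decision."""
    if len(a) != len(b) or not a:
        raise ValueError("both reviewers must rate the same non-empty set of items")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(a) != len(b) or not a:
        raise ValueError("both reviewers must rate the same non-empty set of items")
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: product of each reviewer's marginal proportions per category.
    categories = counts_a.keys() | counts_b.keys()
    p_expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_expected == 1.0:
        # Assumption: kappa is undefined here; perfect agreement is reported as 1.0.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Invented decisions for four abstracts (not the review's data).
reviewer_1 = ["include", "include", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "exclude", "exclude"]
print(percent_agreement(reviewer_1, reviewer_2), cohens_kappa(reviewer_1, reviewer_2))
```

Because kappa subtracts the agreement expected by chance, it is lower than the raw percentage of agreement whenever the reviewers disagree on at least one item.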
The complete process of literature search, selection and screening was documented
to guarantee transparency.
Table 1: Used key words and their synonyms
Assessment: Functional Capacity Evaluation(s); FCE; Functional assessment(s)
Study objective: Psychometric properties; Psychometrics; Reliability; Reliable; Repeatable; Validity; Valid
Study topic: Vocational rehabilitation; Return-To-Work; Vocational participation; Work; Job
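The combination of key terms into search phrases, as described in the search strategy, can be sketched as follows: synonyms within one column are joined with OR, and the columns are joined with AND. The helper names are hypothetical, and database-specific syntax (quotation marks, field tags, MeSH qualifiers) is deliberately left out.

```python
from typing import Iterable, Sequence


def or_group(terms: Iterable[str]) -> str:
    """Join synonyms with OR and wrap them in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def build_search_phrase(groups: Sequence[Sequence[str]]) -> str:
    """Join the OR-groups of the different concepts with AND."""
    return " AND ".join(or_group(group) for group in groups)


# Terms taken from the example search phrase given in the search strategy.
example_groups = [
    ["Functional Capacity Evaluation", "FCE"],
    ["Psychometrics", "Psychometric properties", "Validity", "Reliability"],
    ["Return to Work", "Vocational rehabilitation", "Work", "Job",
     "Participation in work"],
]
phrase = build_search_phrase(example_groups)
print(phrase)
```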
2.2. Quality assessment
The methodological quality of the included studies was assessed by using the three-
level quality appraisal scale, developed by Gouttebarge, Wind, Kuijer and Frings-
Dresen (2004) for a previous systematic review on the psychometrics of FCE methods
(6). The scale is mostly based on other studies (16,22–26) and its purpose is to evaluate
the scientific relevance of studies researching psychometric properties. The items
‘internal consistency’ and ‘responsiveness’ were added for the current review and based
on the criteria proposed by Terwee, Bot, de Boer, van der Windt, Knol, Dekker, Bouter
and de Vet (2007) (27). Overall, the scale considers five methodological quality appraisal
features (6), namely:
1. Functional Capacity Evaluation: to evaluate if it is clearly mentioned and
whether the full FCE method has been used or, if not, which subtests were used
2. Objective: to evaluate whether the objective of the study is clearly defined
3. Study population: to judge whether the study population is well described
4. Procedure: to evaluate whether the study used a well-defined procedure to
achieve the objective
5. Statistics: to evaluate whether the statistics used were clearly described and
properly used to test the hypothesis of the study
Each study was scored on these five features and a total score was calculated by adding
+ and - scores. A plus (+) adds one point to the score, a minus (-) subtracts one point
and a +/- does not change the overall score. The methodological quality of the studies
was rated as follows (6):
- High: + 4 or 5, indicating a high methodological quality
- Moderate: + 2 or 3, indicating a moderate methodological quality
- Low: + 0 or 1, indicating a low methodological quality
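The scoring rule above can be illustrated with a minimal sketch. The function names are hypothetical. The scale as described only lists totals of 0 or 1 as low; treating negative totals as low as well is an assumption made here, labelled in the code.

```python
from typing import Sequence

# Points per rating, as described above: '+' adds one, '-' subtracts one, '+/-' is neutral.
POINTS = {"+": 1, "+/-": 0, "-": -1}


def total_score(ratings: Sequence[str]) -> int:
    """Sum the ratings of the five appraisal features of one study."""
    if len(ratings) != 5:
        raise ValueError("exactly five feature ratings are expected")
    return sum(POINTS[rating] for rating in ratings)


def quality_level(score: int) -> str:
    """Translate a total score into a methodological quality level."""
    if score >= 4:
        return "high"
    if score >= 2:
        return "moderate"
    # Assumption: totals below 0 are not covered by the scale and are treated as low.
    return "low"


# Hypothetical study: FCE +, objective +, population +/-, procedure +, statistics -
example = ["+", "+", "+/-", "+", "-"]
print(total_score(example), quality_level(total_score(example)))
```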
Table 2: ‘The methodological quality appraisal’ shows a comprehensive description of
the full appraisal. When a study did not present the needed information, it immediately
scored a ‘-’ for this item. To determine whether the study procedures and statistics used
were adequate for the study purpose, the descriptions by Terwee et al. (2007) were used
(27). The outcomes of the reviewed studies are expressed through different statistics,
which all require a specific interpretation. Table 3: ‘Interpretation of reliability and
validity outcomes’ displays how specific statistics were translated into a ‘good’,
‘moderate’ or ‘poor’ level of reliability or validity.
Table 2: The methodological quality appraisal
FCE method
+ It is clearly mentioned in this study whether the full FCE-method or which subtests have been
used
- It is not clearly mentioned in this study whether the full FCE-method or which subtests have
been used
Objective
+ The objective of the study is clearly mentioned
- The objective of the study is not clearly mentioned
Population
n: number of subjects/raters, G: gender, A: age, H: health status, W: work status
+ The five Population items (n, G, A, H and W) appear in the article
+/- 3–4 of the five items appear in the article
- 1–2 of the five items appear in the article
Procedure
Intra-rater reliability or Test-retest reliability
+ Time interval (days) between test–retest ranges from 7 to 14
+/- Time interval (days) between test–retest ranges from 3 to 6 and 15 to 21
- Time interval (days) between test–retest is less than 3 or more than 21
Inter-rater reliability
+ Number of raters used is more than 2
+/- Number of raters used is 2 within more than ten measurements
- Number of raters used is 2 within ten measurements or less
Validity
+ The study design is clearly described and appears properly defined to the type of validity that
is meant to be measured
+/- The study design satisfies only one of the conditions described above
- The study design is not clearly described and does not appear properly defined to the type of
validity that is meant to be measured
Internal consistency
+ The study design is clearly described and appears properly defined to measure internal
consistency
+/- The study design satisfies only one of the conditions described above
- The study design is not clearly described and does not appear properly defined to measure
internal consistency
Responsiveness
+ Hypotheses about changes in measures are predefined
- Hypotheses about changes in measures are not predefined
Statistics
+ The statistics used are clearly described and appear properly defined to achieve the objective
of the study
+/- The study design satisfies only one of the conditions described above
- The statistics used are not clearly described and do not appear properly defined to achieve the
objective of the study
Table 3: Interpretation of reliability and validity outcomes
Levels of reliability
Pearson product moment coefficient (r), Spearman correlation coefficient (ρ), Kendall-Tau correlation
High r/ρ > 0.80
Moderate 0.50 ≤ r/ρ ≤0.80
Low r/ρ < 0.50
Intra-class correlation coefficient (ICC)
High ICC > 0.90
Moderate 0.75 ≤ ICC ≤ 0.90
Low ICC < 0.75
Kappa value (κ)
High κ > 0.60
Moderate 0.41 ≤ κ ≤ 0.60
Low κ ≤ 0.40
Cronbach’s alpha (α)
High α > 0.80
Moderate 0.71 ≤ α ≤ 0.80
Low α ≤ 0.70
Percentage of agreement (%)
High > 90% and the raters can choose between more than two score levels
Moderate > 90% and the raters can choose between two score levels
Low < 90% and/or the raters can choose between two score levels
Levels of validity
Face/content validity
High The test measures what it is intended to measure and all relevant components are included
Moderate The test measures what it is intended to measure but not all relevant components are included
Low The test does not measure what it is intended to measure
Criterion-related validity: concurrent and predictive validity
High Substantial similarity between the test and the criterion measure (percentage agreement
≥ 90%, κ > 0.60, r > 0.75)
Moderate Some similarity between the test and the criterion measure (percentage agreement ≥70%, κ ≥
0.40, r ≥ 0.50)
Low Little or no similarity between the test and the criterion measure (percentage agreement <70%,
κ < 0.40, r < 0.50)
Construct validity: convergent and divergent/discriminant validity
High Good ability to differentiate between groups or interventions, or good convergence/divergence
between similar tests (r ≥ 0.60)
Moderate Moderate ability to differentiate between groups or interventions, or moderate
convergence/divergence between similar tests (0.60> r ≥ 0.30)
Low Poor ability to differentiate between groups or interventions, or low convergence/divergence
between similar tests (r < 0.30)
Other
Responsiveness
Adequate Ability to detect clinically important changes over time (Responsiveness Ratio (RR) ≥ 1.96 or
the area under the receiver operating characteristics (ROC) curve (AUC) ≥ 0.70)
Inadequate No ability to detect clinically important changes over time (Responsiveness Ratio (RR) < 1.96
or the area under the receiver operating characteristics (ROC) curve (AUC) < 0.70)
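The reliability cut-offs in Table 3 can be applied mechanically. The following is a minimal sketch, not part of the original review: the function name `classify_reliability` and the statistic labels are hypothetical, and values that fall in the small gaps the table leaves open (for example 0.40 < κ < 0.41) are assumed here to be rated Low.

```python
# Cut-offs transcribed from Table 3: (value above which the level is High,
# lowest value that still counts as Moderate).
THRESHOLDS = {
    "r": (0.80, 0.50),      # Pearson product moment coefficient
    "rho": (0.80, 0.50),    # Spearman correlation coefficient
    "icc": (0.90, 0.75),    # Intra-class correlation coefficient
    "kappa": (0.60, 0.41),  # Kappa value
    "alpha": (0.80, 0.71),  # Cronbach's alpha
}


def classify_reliability(statistic: str, value: float) -> str:
    """Return 'High', 'Moderate' or 'Low' for a reliability coefficient."""
    high_above, moderate_from = THRESHOLDS[statistic]
    if value > high_above:
        return "High"
    if value >= moderate_from:
        return "Moderate"
    # Assumption: anything below the Moderate band, including the gaps
    # the table does not define, is treated as Low.
    return "Low"


if __name__ == "__main__":
    print(classify_reliability("icc", 0.96))
    print(classify_reliability("kappa", 0.435))
```

Percentage of agreement is left out of the sketch because its rating also depends on the number of score levels available to the raters.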
Studies were not excluded based on their score on the three-level quality appraisal by
Gouttebarge et al. (2004) because this appraisal does not take into account all factors
that potentially influence methodological quality. To further nuance the studies' scores
on the methodological quality appraisal, additional information on methodological
aspects that are susceptible to specific types of bias (28) is displayed in Table 8:
'Methodological aspects of the reviewed studies, susceptible to bias'.
2.3. Synthesis approach
Relevant data were extracted from the reviewed studies and gathered into a standardized
data-extraction form. This form was subdivided into extraction tables in which relevant
data were organized. The core findings in each study were expressed by measures of
validity and/or reliability. Where possible, these data were directly extracted from the
original article.
The complete systematic review was reviewed by five external reviewers: three experts
in FCE research with several topic-related publications and two professionals with
extensive practical FCE experience.
3. Results
3.1. Literature search
A total of 1381 hits were retrieved from the nine databases. Of these, 200 were
duplicates. After screening the remaining 1181 references by applying the inclusion and
exclusion criteria to their titles, 1073 were excluded (phase 1 screening). The most
frequent reason for exclusion was that the titles clearly indicated topics not related to
FCE and/or vocational rehabilitation. Of the 108 references included based on title, 69
were excluded following application of the inclusion and exclusion criteria to the
abstracts (phase 2 screening). The most frequent reasons for exclusion were the publication
date of the article (prior to May 2004), the type of article/study (not an RCT, CCT,
meta-analysis or systematic review), the study topic (not related to FCE or vocational
rehabilitation) and/or the study objective (not aiming to research psychometric
properties). The full-texts of the remaining 39 references were retrieved through
bibliographic databases of the Catholic University of Leuven and the University of
Ghent or by contacting the author(s). Original articles accounted for 34 of the 39
publications and the remaining five were systematic reviews. Citation searches were
performed by using the reference lists included in these reviews and applying the
inclusion and exclusion criteria, resulting in one additional eligible original paper. Other
citations were excluded based on title or abstract. Reasons for exclusion were:
duplicates, the publication date, the study topic and the study objective. Inclusion and
exclusion criteria were applied to the 40 eligible full-texts, resulting in the exclusion of
20 full-texts with the following reasons for exclusion: not containing relevant
psychometric data, researching a target group-specific FCE method or not providing
information of relevance to the research question in this current systematic review. This
left 20 original studies to be included in this systematic review. With the exception of
one study written in French, all others were in English.
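The counts reported for each screening phase can be checked with simple arithmetic. The sketch below only re-derives the numbers stated in the paragraph above; the function name `screening_flow` and the dictionary keys are illustrative assumptions, not terms from the review.

```python
def screening_flow() -> dict:
    """Re-derive the screening counts reported in section 3.1."""
    hits = 1381                     # records retrieved from nine databases
    duplicates = 200
    screened_on_title = hits - duplicates

    excluded_on_title = 1073        # phase 1 screening
    screened_on_abstract = screened_on_title - excluded_on_title

    excluded_on_abstract = 69       # phase 2 screening
    retrieved_full_texts = screened_on_abstract - excluded_on_abstract

    added_by_citation_search = 1    # one additional eligible original paper
    eligible_full_texts = retrieved_full_texts + added_by_citation_search

    excluded_on_full_text = 20
    included = eligible_full_texts - excluded_on_full_text

    return {
        "screened_on_title": screened_on_title,
        "screened_on_abstract": screened_on_abstract,
        "retrieved_full_texts": retrieved_full_texts,
        "eligible_full_texts": eligible_full_texts,
        "included": included,
    }


if __name__ == "__main__":
    for step, count in screening_flow().items():
        print(f"{step}: {count}")
```

Each derived value matches the figure given in the text (1181, 108, 39, 40 and 20 respectively).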
Figure 1: PRISMA Flow Diagram
3.2. Reviewed studies
Studies on nine different FCE methods were reviewed (Table 5: Studies sorted by FCE
method). This included two studies on the full Baltimore Therapeutic Equipment Work
Simulator and one on the full Blankenship FCE. Three studies researched the
psychometric properties of subtests of the Ergo-Kit FCE and another study examined
subtests of the Ergo-Kit FCE, in comparison to subtests of the ERGOS Work Simulator.
Five studies were retrieved on the Isernhagen Work Systems FCE, of which two studied
the full assessment and three studied its subtests. Four studies researched the full
Physical Work Performance Evaluation and one study was found on the psychometrics
of the full Short-Form FCE. Lastly, one study researched the subtests of the Work
Disability Functional Assessment Battery and two studies researched the properties of
the WorkHab FCE's subtests. Table 4: 'Studies sorted by topic (type of studied psychometric
property)' shows which psychometric properties were researched per FCE method.
3.2.1. Baltimore Therapeutic Equipment (BTE) Work Simulator
The systematic literature search retrieved no studies on the reliability of the BTE. Two
studies were found on the predictive validity of the assessment (29,30).
3.2.2. Ergo-Kit (EK) Functional Capacity Evaluation
Two studies were found on the reliability of the EK: one on the intra- and inter-rater
reliability (31) and one on the inter-rater reliability and agreement (32). Two studies
were found on the validity of the EK. One studied the discriminant/divergent and
convergent validity (33) and the other study researched the concurrent validity of the
EK and the ERGOS Work Simulator (34).
3.2.3. ERGOS Work Simulator (EWS)
No studies were found on the reliability of the ERGOS Work Simulator. One study was
found on the concurrent validity of the EWS with the Ergo-Kit FCE (34).
3.2.4. Blankenship WorkEval Functional Capacity Evaluation
No studies were found on the reliability or validity of the Blankenship FCE. One study on the
sensitivity and specificity of the assessment was found (35).
3.2.5. Isernhagen Work Systems (IWS) Functional Capacity Evaluation /
WorkWell Systems Functional Capacity Evaluation
Three studies were found on the reliability of the IWS: one on the test-retest
reliability/reproducibility (36), one on the inter-rater reliability (37) and one on both
inter-rater and intra-rater reliability (38). Two studies on the validity of the IWS were
found; both examined its predictive validity (10,25).
An important note is that the name of the Isernhagen Work Systems FCE has recently
been officially replaced with the name 'WorkWell Systems (WWS) FCE'. However,
the five studies that were retrieved on this FCE method still used the name 'Isernhagen
Work Systems FCE'. This name is therefore retained in this systematic review,
bearing in mind that both names refer to
the same assessment.
3.2.6. ErgoScience Physical Work Performance Evaluation (PWPE)
The systematic literature search retrieved two studies on the reliability of the PWPE:
one on the test-retest reliability/reproducibility (41) and one on the inter-rater reliability
(42). One study on the predictive validity of the PWPE was found (43) and one on the
internal and external responsiveness of the assessment (44).
3.2.7. Short-Form Functional Capacity Evaluation
The systematic literature search retrieved no studies on the reliability of the Short-Form
FCE and one study on its predictive validity (45).
3.2.8. Work Disability Functional Assessment Battery (WD-FAB)
No studies were retrieved on the reliability of the WD-FAB and one study was retrieved
on the discriminant/divergent and convergent validity of the assessment (46).
3.2.9. WorkHab Functional Capacity Evaluation
Two studies were found on the reliability of the WorkHab FCE, one researching test-
retest reliability (47) and one researching intra- and inter-rater reliability (48). The study
by James, Mackenzie and Capra (2010) also researched the WorkHab's internal
consistency.
Table 4: Studies sorted by topic (type of studied psychometric property)
Reliability
Reproducibility / Test-retest reliability
Agreement: Gouttebarge et al. (2006)
Reliability: Brassard et al. (2006); Reneman et al. (2004); James, Mackenzie and Capra (2010)
Inter-rater reliability: Durand et al. (2004); Gouttebarge et al. (2005); Gouttebarge et al. (2006); James, Mackenzie and Capra (2011); Reneman et al. (2005); Trippolini et al. (2014)
Intra-rater reliability: Gouttebarge et al. (2005); James, Mackenzie and Capra (2011); Trippolini et al. (2014)
Validity
Criterion-related validity
Concurrent validity: Rustenburg, Kuijer and Frings-Dresen (2004)
Predictive validity: Branton et al. (2010); Cheng and Cheng (2010); Cheng and Cheng (2011); Gross and Battié (2005); Gross and Battié (2006); Lechner, Page and Sheffield (2008)
Construct validity
Discriminant/divergent validity: Gouttebarge et al. (2009); Meterko et al. (2015)
Convergent validity: Gouttebarge et al. (2009); Meterko et al. (2015)
Other
Diagnostic properties
Sensitivity: Brubaker et al. (2007)
Specificity: Brubaker et al. (2007)
Responsiveness
Internal responsiveness: Durand et al. (2008)
External responsiveness: Durand et al. (2008)
Internal consistency: James, Mackenzie and Capra (2010)
Note: some studies are mentioned more than once, because they study multiple psychometric properties
Table 5: Studies sorted by FCE method
Baltimore Therapeutic Equipment (BTE) Work Simulator
Full FCE studied: Cheng and Cheng (2010); Cheng and Cheng (2011)
Subtest(s) studied: none
Blankenship WorkEval Functional Capacity Evaluation
Full FCE studied: Brubaker et al. (2007)
Subtest(s) studied: none
Ergo-Kit (EK) Functional Capacity Evaluation
Full FCE studied: none
Subtest(s) studied: Gouttebarge et al. (2005); Gouttebarge et al. (2006); Gouttebarge et al. (2009); Rustenburg, Kuijer and Frings-Dresen (2004)
ERGOS Work Simulator
Full FCE studied: none
Subtest(s) studied: Rustenburg, Kuijer and Frings-Dresen (2004)
Isernhagen Work Systems FCE
Full FCE studied: Reneman et al. (2004); Gross and Battié (2005)
Subtest(s) studied: Reneman et al. (2005); Trippolini et al. (2014); Gross and Battié (2006)
Physical Work Performance Evaluation (PWPE)
Full FCE studied: Brassard et al. (2006); Durand et al. (2004); Durand et al. (2008); Lechner, Page and Sheffield (2008)
Subtest(s) studied: none
Short-Form Functional Capacity Evaluation
Full FCE studied: Branton et al. (2010)
Subtest(s) studied: none
Work Disability Functional Assessment Battery (WD-FAB)
Full FCE studied: none
Subtest(s) studied: Meterko et al. (2015)
WorkHab Functional Capacity Evaluation
Full FCE studied: none
Subtest(s) studied: James, Mackenzie and Capra (2010); James, Mackenzie and Capra (2011)
Note: some studies are mentioned more than once, because they study multiple FCE methods
3.3. Quality appraisal
The overall methodological quality of the reviewed studies was moderate to high, based
on the three-level quality appraisal scale by Gouttebarge et al. (2004). Sixteen studies
showed high methodological quality (11,29–34,37,39,42–48), three studies showed
moderate methodological quality (35,36,41) and only one study was rated with low
methodological quality (40). Results of the methodological quality appraisal are
displayed in Table 7. Nearly all studies reported whether a full FCE method or subtests
were studied and clearly mentioned the study objective. A few studies did not (clearly)
report data on subjects‟ health status or work status, while the number of subjects, their
gender-distribution and mean age were always mentioned. In most cases, scores
decreased because of a questionable study design/procedure or the use of inadequate
statistics, based on the criteria proposed by Terwee et al. (2007) (27).
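The distribution of quality ratings stated above can be re-counted from the per-study ratings in Table 7. This is a small illustrative sketch: the ratings are transcribed from Table 7, while the names `QUALITY_RATINGS` and `tally_quality` are hypothetical and not taken from the review.

```python
from collections import Counter

# Final methodological quality rating per study, transcribed from Table 7.
QUALITY_RATINGS = {
    "Cheng and Cheng (2010)": "High",
    "Cheng and Cheng (2011)": "High",
    "Brubaker et al. (2007)": "Moderate",
    "Gouttebarge et al. (2005)": "High",
    "Gouttebarge et al. (2006)": "High",
    "Gouttebarge et al. (2009)": "High",
    "Rustenburg, Kuijer and Frings-Dresen (2004)": "High",
    "Gross and Battié (2005)": "High",
    "Gross and Battié (2006)": "High",
    "Reneman et al. (2004)": "Moderate",
    "Reneman et al. (2005)": "High",
    "Trippolini et al. (2014)": "Low",
    "Brassard et al. (2006)": "Moderate",
    "Durand et al. (2004)": "High",
    "Durand et al. (2008)": "High",
    "Lechner, Page and Sheffield (2008)": "High",
    "Branton et al. (2010)": "High",
    "Meterko et al. (2015)": "High",
    "James, Mackenzie and Capra (2010)": "High",
    "James, Mackenzie and Capra (2011)": "High",
}


def tally_quality(ratings: dict) -> Counter:
    """Count how many studies fall into each quality level."""
    return Counter(ratings.values())


if __name__ == "__main__":
    counts = tally_quality(QUALITY_RATINGS)
    for level in ("High", "Moderate", "Low"):
        print(level, counts[level])
```

The tally reproduces the sixteen high, three moderate and one low rating reported in the text.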
Most included studies appeared to be somewhat susceptible to several specific types of
bias, or failed to report important methodological aspects of the study (Table 8). Firstly,
none of the studies reported that study-subjects were randomly sampled. Most studies
did not report the subject selection method, or used convenience sampling. However,
almost all studies reported the inclusion and exclusion criteria used. Of the studies in
which allocation was relevant, only three performed random allocation. The subjects
were never reported to be blinded, mostly because blinding of the subjects was not
possible. In most cases where rater blinding would have been relevant, it was either not
performed or not mentioned. Furthermore, many studies had missing data
but did not report how these were handled. All studies did report the
necessary outcomes, but in some studies it was unclear whether extra results were
reported (possibly to embellish poor results). In studies which needed to use two
equivalent groups of subjects, it was mostly not mentioned whether these groups were
really comparable or significantly different at baseline. In most of the studies that did
report this aspect, groups were equivalent for important characteristics. Lastly, it was
sometimes unclear whether studies received funding from (possibly) involved parties.
Four studies clearly reported that there was no financial influence, while most others did
not mention financial support.
3.4. Study outcomes
Table 6: 'Overview of the study characteristics' shows the main characteristics and
outcomes of the reviewed studies, sorted by FCE method. The study methods that were
used in the reviewed studies and the corresponding outcomes with their interpretations
are more extensively displayed in Table 5: 'Study methods and results'. Some of the
reviewed studies also researched other aspects of the FCE methods besides the
psychometric properties. This information, however, is not included in the results of the
present review because it falls outside the scope of the research.
Table 6: Overview of the study characteristics and main outcomes
ICC = Intra-class Correlation Coefficient, % = percentage of agreement, κ = kappa value, r = Pearson Correlation Coefficient; ρ = Spearman’s rank correlation,
(A)HR = (Adjusted) Hazard Ratio, (A)OR = (Adjusted) Odds Ratio, *= significant, NS= Not Significant.
n number of subjects or raters/G gender/A age/H health status/W work status.
Studies on the Baltimore Therapeutic Equipment (BTE) Work Simulator
Columns: Full FCE method or subtests studied?; Objective (psychometric(s) studied); Population; Procedure; Outcomes; Author(s) and year of publication
Job-specific FCE:
varying composition of
assessment components,
depending on the subject
Predictive validity n: 645
G: 390 male, 255 female
A: 41.59±10.49 years
H: Subjects with non-
specific low back pain
W: Not working because of
low back pain
Retrospective cohort
study: ability of the BTE
to predict Return-To-
Work/employment status
Recommendations & employment
status: κ = 0.435
Cheng and Cheng
(2010)
Job-specific FCE:
varying composition of
assessment components,
depending on the subject
Predictive validity n: 194
G: 126 male, 68 female
A: 43.6±10.11 years
H: Subjects with distal
radius fracture
W: Not working
Retrospective cohort
study: ability of the BTE
to predict Return-To-
Work/employment status
Recommendations & employment
status: κ = 0.449
Percentages of correct predictions:
'Previous job': 94.83%
'Change job': 60.47%
'Previous job with modification': 9.38%
'Do not work at the moment': 60.47%
Cheng and Cheng
(2011)
Studies on the Blankenship WorkEval Functional Capacity Evaluation
Columns: Full FCE method or subtests studied?; Objective (psychometric(s) studied); Population; Procedure; Outcomes; Author(s) and year of publication
Studied subtests:
- Repetitive-
movement tests
- Static-strength tests
- Occasional-material-
handling tests
- Hand tests
Sensitivity
Specificity
n: 49 subjects
G: 17 male, 32 female
A: 36 years, range 18-65
years
H: Subjects with or without
musculoskeletal injury
W: Working, not working
or retired
Single-blinded,
randomized, post-test
only study: comparing
results from two groups,
one performing at
submaximal effort (50%)
and one performing at
maximal effort (100%)
Sensitivity = 80.0%
Specificity = 84.2%
Positive likelihood ratio = 5
Negative likelihood ratio = 0.2
Brubaker et al. (2007)
Studies on the Ergo-Kit (EK) Functional Capacity Evaluation
Columns: Full FCE method or subtests studied?; Objective (psychometric(s) studied); Population; Procedure; Outcomes; Author(s) and year of publication
Subtests studied:
- Back-torso lift test
(BTLT)
- Shoulder lift test
(SLT)
- Forward
manipulation test
(FMT)
- Lower manipulation
test crouching
(LMTC)
- Carrying lifting
strength test (CLST)
- Lower lifting
strength test (LLST)
- Upper lifting
strength test (ULST)
Intra-rater reliability (1)
Inter-rater reliability (2)
n: 2 raters (27 subjects)
G: 15 male, 12 female
A: 40±16 years
H: Subjects without
musculoskeletal complaints
W: Working part-time or
full-time
Time-interval (1):
between t1&t2: 4±1
days, between t2&t3:
8±2 days
Raters used (2): 2
(1) BTLT: ICC = 0.96 (95% CI =
0.91-0.98), SLT: ICC = 0.93 (95%
CI = 0.85-0.97), CLST: ICC = 0.88
(95% CI = 0.75-0.94), LLST: ICC
= 0.86 (95% CI = 0.72-0.93),
ULST: ICC = 0.84 (95% CI =
0.69-0.92), FMT: ICC = 0.76 (95%
CI = 0.46-0.89), LMTC: ICC =
0.55 (95% CI = 0.21-0.77)
(2) After a 4-day time interval
BTLT: ICC = 0.96(95% CI = 0.91-
0.98), LLST: ICC = 0.95 (95% CI
= 0.89-0.98), SLT: ICC = 0.94
(95% CI = 0.87-0.97), ULST: ICC
= 0.94 (95% CI = 0.88-0.97),
CLST: ICC = 0.93 (95% CI =
0.85-0.97), LMTC: ICC = 0.90
(95% CI = 0.78-0.95), FMT: ICC =
0.88 (95% CI = 0.74-0.94)
Gouttebarge, Wind,
Kuijer and Sluiter
(2005)
(2) After an 8-day time interval
BTLT: ICC = 0.96 (95% CI =
0.90-0.98), SLT: ICC = 0.95 (95%
CI = 0.88-0.98), LLST: ICC = 0.92
(95% CI = 0.79-0.97), CLST: ICC
= 0.91 (95% CI = 0.75-0.97),
ULST: ICC = 0.91 (95% CI =0.67-
0.97), LMTC: ICC = 0.62 (95% CI
= 0.01-0.85), FMT: ICC = 0.53
(95% CI = -0.27-0.82)
Subtests studied:
- Back-torso lift test
(BTLT)
- Shoulder lift test
(SLT)
- Carrying lifting
strength test (CLST)
- Lower lifting
strength test (LLST)
- Upper lifting
strength test (ULST)
Inter-rater reliability (1)
Agreement (2)
n: 2 raters (25 subjects)
G: 11 male, 14 female
A: 49±8 years
H: Subjects with low back
pain
W: Working part-time or
full-time
Raters used: 2 (1) BTLT: ICC = 0.97 (95% CI =
0.94-0.99), SLT: ICC = 0.96 (95%
CI = 0.91-0.98), CLST: ICC = 0.95
(95% CI = 0.84-0.98), LLST: ICC
= 0.94 (95% CI = 0.85-0.97),
ULST: ICC = 0.95 (95% CI =
0.89-0.98)
(2) BTLT: SEM = 8.6 (95% CI =
X±16.7), SLT: SEM = 5.0 (95% CI
= X±9.8), CLST: SEM = 3.4 (95%
CI = X±6.6), LLST: SEM = 3.7
(95% CI = X±7.2), ULST: SEM =
8.6 (95% CI = X±3.8)
Gouttebarge, Wind,
Kuijer, Sluiter and
Frings-Dresen (2006)
Subtests studied:
- Back-torso lift test
(BTLT)
- Shoulder lift test
(SLT)
- Carrying lifting
strength test (CLST)
- Lower lifting
strength test (LLST)
- Upper lifting
strength test (ULST)
Construct validity
- Discriminant/
divergent validity
(1)
- Convergent validity
(2)
n: 72 subjects
G: 72 male, 0 female
A: 41±10 years
H: Subjects with
musculoskeletal disorders
(MSD)
W: Construction workers
on sick leave as a result of
MSD
A cross-sectional study
with within-subject
design: comparison
between the Ergo-Kit
and the Instrument for
Disability Risk (IDR) +
the Von Korff
Questionnaire (VKQ)
(1) Differences in performance in
EK lifting tests between low risk
group & high risk group (based on
IDR score): NS (p>0.05)
(2) VKQ & BTLT: ρ = -0.17, VKQ
& SLT: ρ = -0.18, VKQ & CLST:
ρ = -0.27* (p<0.05), VKQ &
LLST: ρ = -0.23* (p<0.05), VKQ
& ULST: ρ = -0.16
Gouttebarge, Wind,
Kuijer, Sluiter and
Frings-Dresen (2009)
Studies on the ERGOS Work Simulator
Columns: Full FCE method or subtests studied?; Objective (psychometric(s) studied); Population; Procedure; Outcomes; Author(s) and year of publication
ERGOS Work Simulator
subtests studied:
- Dynamic lifting tests
The Ergo-Kit (EK)
Functional Capacity
Evaluation
subtests studied:
- Dynamic lifting tests
Concurrent validity n: 25 subjects
G: 25 male, 0 female
A: 34.8±9.5 years
H: No musculoskeletal
problems
W: Fire-fighters
Balanced study:
comparing lifting
capacity for the ERGOS
Work Simulator and for
the Ergo-Kit
Correlation EK & ERGOS
outcomes:
sporadic lower lifting ρ = ?,
frequent lower lifting ρ = 0.50*,
constant lower lifting ρ = 0.50*
sporadic upper lifting ρ = 0.49*,
frequent upper lifting ρ = 0.66**,
constant upper lifting ρ = 0.56**
* = p<0.05, ** = p<0.01
Rustenburg, Kuijer and
Frings-Dresen (2004)
Studies on the Isernhagen Work Systems FCE
Columns: Full FCE method or subtests studied?; Objective (psychometric(s) studied); Population; Procedure; Outcomes; Author(s) and year of publication
Full FCE method studied Predictive validity n: 54 subjects
G: 30 male, 24 female
A: 41±10.4 years
H: Subjects with
compensable back injuries
W: Not working,
compensation claimants
Prospective cohort study:
ability of the Isernhagen
Work Systems FCE to
predict timely and
sustained Return-To-
Work
Number of failed tasks (1 out of
25): AHR days to benefit
suspension = 0.91 (95% CI = 0.87-
0.96), AHR days to claim closure
= 0.93 (95% CI = 0.89-0.98), AOR
recurrence = 0.95 (95% CI = 0.89-1.03)
Floor-to-waist lift: AHR days to
benefit suspension = 1.55 (95% CI
= 1.28-1.98), AHR days to claim
closure = 1.42 (95% CI = 1.12-
1.80), AOR recurrence = 0.91
(95% CI = 0.63-1.33)
Association FCE performance
indicators with future recurrence,
work status, pain intensity and
disability: NS (p>0.05)
Gross and Battié
(2005)
Subtests studied:
- The 15 activities in
the upper extremity
protocol
Predictive validity n: 336 subjects
G: 239 male, 97 female
A: 45±11.2 years
H: Subjects with specific
(a) and nonspecific (b)
upper extremity conditions
W: Not working,
compensation claimants
Longitudinal cohort
study: ability of the
Isernhagen Work
Systems FCE upper
extremity protocol to
predict timely and
sustained recovery
Lifting performance:
AHR days to benefit suspension =
1.51
AHR days to claim closure = 1.23
OR recurrence = 1.17 (95% CI
0.96 to 1.43), NS (p>0.05) after
controlling for confounding
variables
Gross and Battié
(2006)
Full FCE method
studied:
- Modified versions of
nine out of the 24
tests
Test-retest reliability /
reproducibility
n: 26 subjects
G: 14 male, 12 female
A: 34.9±12.7 years
H: Healthy subjects
W: ?
Time interval: 2-3 weeks Material handling tests
ICC range = 0.68-0.98, Shuttle
walk test ICC = 0.64
Five criterion/ceiling tests
κ > 0.60, one criterion/ceiling test
κ = 0.57, 11 other criterion/ceiling
tests κ = ?
Reneman et al. (2004)
Subtests studied:
- Lifting tests
Inter-rater reliability n: 9 raters (31 subjects)
G: 19 male, 12 female
A: (a): 29.5±10.8 years,
(b): 39.6 ±7.1 years.
H: healthy subjects (a),
subjects with chronic, non-
specific low back pain (b)
W: ?
Raters used: 9 (a) CR-10 ratings: ICC = 0.87
(95% CI = 0.69–0.91)
Categorical ratings: κ = 0.58
(b) CR-10 rating: ICC = 0.76 (95%
CI = 0.69–0.83)
Categorical ratings: κ= 0.50
Reneman et al. (2005)
Subtests studied:
- “11 tests”, not
specified
Intra-rater reliability (1)
of the Physical Effort
Determination Scale (a)
& the Submaximal Effort
Determination Scale (b)
Inter-rater reliability (2)
of the Physical Effort
Determination Scale (a)
& the Submaximal Effort
Determination Scale (b)
n: 21 raters (4 subjects)
G: ?
A: 35.5 years, range 21-49
years
H: 3 subjects with non-
specific low back pain and
1 with non-specific neck
pain
W: ?
(1) Raters used: 21
(2) Time interval: 10
months
(1) (a) κ = 0.49 (95% CI = 0.22–
0.75) (1) (b) κ = 0.68 (95% CI =
0.60–0.76)
(2) Rating 1: 73% agreement,
rating 2: 85% agreement
(2) (a) Rating 1: κ = 0.51 (95% CI
= 0.23–0.80), rating 2: κ = 0.7
(95% CI = 0.49–0.94)
(2) (b) Rating 1: κ = 0.68 (95% CI
= 0.60–0.76), rating 2: κ = 0.77
(95% CI = 0.70-0.84)
Trippolini et al. (2014)
Studies on the Physical Work Performance Evaluation (PWPE)
Columns: Full FCE method or subtests studied?; Objective (psychometric(s) studied); Population; Procedure; Outcomes; Author(s) and year of publication
Full FCE method studied Test-retest reliability/
reproducibility
n: 30 subjects
G: 24 male, 6 female
A: 43±7.3 years
H: Healthy subjects
W: Working
Time interval: 6 weeks Dynamic force-section: ICC =
0.79-0.91
Tolerance at different positions
section: κ = 0.05-0.50
Mobility-section: κ = 0.34-0.83
Dynamic strength section: κ = 0.49
Mobility section: κ = 0.52
Overall PWPE score: κ = 0.43
Brassard, Durand,
Loisel and Lemaire
(2006)
Full FCE method studied Inter-rater reliability n: 5 raters (40 subjects)
G: 31 male, 9 female
A: 40.9±9.9 years
H: Subjects with back pain
W: Not working or only
working light duties
because of back pain
Raters used: 5 Dynamic Strength section: κ =
0.81 (95% CI = 0.65-0.96)
Position Tolerance section: κ =
0.72 (95% CI = 0.54-0.90)
Mobility section: κ = 0.54 (95% CI
= 0.28-0.81)
Overall PWPE score: κ = 0.76
(95% CI = 0.58-0.93)
Durand, Loisel,
Mercier, Stock and
Lemaire (2004)
Full FCE method studied Responsiveness
- Internal
responsiveness (1)
- External
responsiveness (2)
n: 57 subjects
G: Experimental group (a):
23 male, 4 female
Control group (b): 24 male,
6 female
A: Experimental group (a):
42±9.4 years. Control
group (b): 43±7.4 years
H: Experimental group (a):
work-related non-specific
low back pain. Control
group (b): healthy.
W: Experimental group (a):
not working because of
work-related non-specific
low back pain. Control
group (b): working
Correlational prospective
pre-/post-test study:
comparing change scores
(1) between group (a)
and (b) and comparing
change scores (2) in
group (a) with concurrent
and empirical data
(1) (a) differences in PWPE-score
pre-/post-test: dynamic strength*
(p = 0.001), postural tolerance (p =
0.0195), NS (p>0.05) for mobility
and overall PWPE-score
(b) differences in PWPE-score pre-
/ post-test: overall PWPE-score NS
(p>0.05) and PWPE sections NS
(p>0.05)
(2) difference between (a) & (b)
pre-/post-test: Dynamic strength*
(p = 0.0008), other PWPE sections
& overall PWPE score NS
(p>0.05)
Difference between concurrent
criteria* pre-/post-test (a) (p≤0.05)
Change-scores & changes in
concurrent criteria: perceived
disability and postural tolerance
Kendall Tau = -0.41* (p =0.0131),
other PWPE sections + overall
PWPE score and measures Kendall
Tau = NS (p>0.05)
Durand, Brassard, Nha
Hong and Loisel
(2008)
Full FCE method studied Predictive validity n: 30 subjects
G: 26 male, 4 female
A: 40.5±10.7 years
H: Subjects with
musculoskeletal
dysfunction
W: Construction workers
on sick leave
Retrospective study:
ability of the PWPE to
predict Return-To-Work
/employment status
Recommendations & employment
status at discharge κ = 0.74
Recommendations & employment
status 3 months after discharge: κ
= 0.69
Recommendations & employment
status at 6 months after discharge
κ= 0.70
Lechner, Page and
Sheffield (2008)
Studies on the Short-Form Functional Capacity Evaluation
Columns: Full FCE method or subtests studied?; Objective (psychometric(s) studied); Population; Procedure; Outcomes; Author(s) and year of publication
Full FCE method studied Predictive validity n: 147 subjects
G: 101 male, 36 female
A: 44.3±11.1 years
H: Subjects with
musculoskeletal injuries
W: Not working,
compensation claimants
Prospective cohort study:
ability of the Short-Form
FCE to predict timely
and sustained Return-To-
Work
AHR days to benefit suspension =
5.45 (95% CI = 2.73-10.85)
AHR days to claim closure = 5.80
(95% CI = 3.50-9.61)
AOR recurrence = 1.31 (95% CI =
0.48-3.60)
Branton et al. (2010)
Studies on the Work Disability Functional Assessment Battery (WD-FAB)
Columns: Full FCE method or subtests studied?; Objective (psychometric(s) studied); Population; Procedure; Outcomes; Author(s) and year of publication
Subscales studied:
- Physical
Functioning scales
(PF): Changing and
Maintaining Body
Position (CMBP),
Upper Body
Function (UBF),
Upper Extremity
Fine Motor (UEFM)
and Whole Body
Mobility (WBM)
- Behavioural Health
scales (BH): Self-
Efficacy (SELF-E),
Mood and Emotions
(MOOD),
Behavioural Control
(BC) and Social
Interactions
(SOCIAL)
Construct validity
- Discriminant/
divergent validity
(1)
- Convergent validity
(2)
n: 973 subjects
G: 420 male, 553 female
A: 56 ±8.52 years
H: Subjects with physical
(a) or mental (b) disability,
no further specifics
W: Not working because of
disability
Cross-sectional study
comparison between:
established and new
WD-FAB scales
(Physical Functioning
(PF) & Behavioural
Health (BH))
(1) Correlations with the cross-
domain measure for PF: CMBP
r=0.12*, UBF r=0.21*, UEFM
r=0.24*, WBM r=0.15*
Correlations with the cross-domain
measure for BH: SELF-E r=0.46*,
MOOD r=0.67*, BC r=0.32*,
SOCIAL r=0.56*
(2) Correlations with the two
same-domain measures for PF:
CMBP r=0.42* & r=0.65*, UBF
r=0.43* and r=0.69*, UEFM
r=0.23* & r=0.54*, WBM r=0.55*
& r=0.70*
Correlations with the scales of the
same domain measure for BH: r
range = -0.24 to -0.74, all
significant (p<0.05)
* = (p<0.05)
Meterko et al. (2015)
Studies on the WorkHab Functional Capacity Evaluation
Columns: Full FCE method or subtests studied?; Objective (psychometric(s) studied); Population; Procedure; Outcomes; Author(s) and year of publication
Subtests studied:
- Tests of the manual
handling component
Test-retest reliability /
reproducibility (1)
Internal consistency (2)
n: 25 subjects
G: 6 male, 19 female
A: 29±12.0 years
H: Healthy subjects
W: Students and staff
members from a university
Time interval: 1 week (1) Overall manual handling score:
ICC = 0.74 (95%CI = 0.42-0.88)
Tests of the manual handling
score: floor to bench ICC = 0.92,
bench to shoulder ICC = 0.90,
bench to bench ICC = 0.91
(2) Manual handling score:
Cronbach's α = 0.92
Tests of the manual handling
score: floor to bench Cronbach's α
= 0.86, bench to shoulder
Cronbach's α = 0.85, bench to
bench Cronbach's α = 0.82
James, Mackenzie and
Capra (2010)
Subtests studied:
- Tests of the manual
handling component
(1) Intra-rater reliability
(2) Inter-rater reliability
n: 17 raters (4 subjects)
G: ?
A: ?
H: Subjects with a work
disability, not specified
W: Not working
(1) Time interval:
“approximately 2 weeks”
(2) Raters used: 17
(1) Overall ICC 0.97 (ICC range =
0.81–0.97)
(2) ICC 0.90 (ICC range = 0.77-
0.91)
James, Mackenzie and
Capra (2011)
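The likelihood ratios listed in Table 6 for the Blankenship FCE (Brubaker et al., 2007) follow from the reported sensitivity and specificity through the standard definitions LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity. The sketch below is an illustration only; the function name `likelihood_ratios` is hypothetical and the inputs are the percentages from the table expressed as proportions.

```python
def likelihood_ratios(sensitivity: float, specificity: float) -> tuple:
    """Return (positive LR, negative LR) from proportions between 0 and 1."""
    positive_lr = sensitivity / (1.0 - specificity)
    negative_lr = (1.0 - sensitivity) / specificity
    return positive_lr, negative_lr


if __name__ == "__main__":
    # Sensitivity 80.0% and specificity 84.2% as reported in Table 6.
    lr_pos, lr_neg = likelihood_ratios(0.80, 0.842)
    print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

With these inputs the ratios come out near 5.06 and 0.24, which is consistent with the rounded values of 5 and 0.2 shown in the table.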
Table 7: Results of the methodological quality appraisal
Studies on the Baltimore Therapeutic Equipment (BTE) Work Simulator
Author(s) and year of publication FCE method Objective Population Procedure Statistics Methodological
quality
Cheng and Cheng (2010) + + + + + 5 = High
Cheng and Cheng (2011) + + + + + 5 = High
Studies on the Blankenship WorkEval Functional Capacity Evaluation
Author(s) and year of publication FCE method Objective Population Procedure Statistics Methodological
quality
Brubaker et al. (2007) + - + + + 3 = Moderate
Studies on the Ergo-Kit (EK) Functional Capacity Evaluation
Author(s) and year of publication FCE method Objective Population Procedure Statistics Methodological
quality
Gouttebarge et al. (2005) + + + + & +/- + 4-5 = High
Gouttebarge et al. (2006) + + + +/- + 4 = High
Gouttebarge et al. (2009) + + + + & + + 5 = High
Studies on the ERGOS Work Simulator (EWS)
Author(s) and year of publication FCE method Objective Population Procedure Statistics Methodological
quality
Rustenburg, Kuijer and Frings-Dresen (2004) + + + + + 5 = High
Studies on the Isernhagen Work Systems (IWS) Functional Capacity Evaluation
Author(s) and year of publication FCE method Objective Population Procedure Statistics Methodological quality
Gross and Battié (2005) + + + + + 5 = High
Gross and Battié (2006) + + + + + 5 = High
Reneman et al. (2004) + + +/- +/- + 3 = Moderate
Reneman et al. (2005) + + +/- + + 4 = High
Trippolini et al. (2014) - + +/- -* & + + 1-2* = Low
* The test-retest time interval was ten months long. However, video recordings of subjects were used to assess intra- and inter-rater reliability, preventing learning effects and carry-over effects in the subjects.
Studies on the Physical Work Performance Evaluation (PWPE)
Author(s) and year of publication FCE method Objective Population Procedure Statistics Methodological quality
Brassard et al. (2006) + + + - + 3 = Moderate
Durand et al. (2004) + + + + + 5 = High
Durand et al. (2008) + + + + +/- 4 = High
Lechner, Page and Sheffield (2008) + + + + + 5 = High
Studies on the Short-Form Functional Capacity Evaluation
Author(s) and year of publication FCE method Objective Population Procedure Statistics Methodological quality
Branton et al. (2010) + + + + + 5 = High
Studies on the Work Disability Functional Assessment Battery (WD-FAB)
Author(s) and year of publication FCE method Objective Population Procedure Statistics Methodological quality
Meterko et al. (2015) + + +/- + + 4 = High
Studies on the WorkHab Functional Capacity Evaluation
Author(s) and year of publication FCE method Objective Population Procedure Statistics Methodological quality
James, Mackenzie and Capra (2010) + + + + & + + 5 = High
James, Mackenzie and Capra (2011) + + - + & + + 4 = High
Table 4: Methodological aspects of the reviewed studies, susceptible to bias
NA = Not Applicable, NM = Not Mentioned, U = Unclear
Methodological aspect and the associated type of potential bias
Studies (column order): Branton et al. (2010); Brassard et al. (2006); Brubaker et al. (2007); Cheng and Cheng (2010); Cheng and Cheng (2011); Durand et al. (2004); Durand et al. (2008); Gouttebarge et al. (2005); Gouttebarge et al. (2006); Gouttebarge et al. (2009); Gross and Battié (2005); Gross and Battié (2006); James, Mackenzie and Capra (2010); James, Mackenzie and Capra (2011); Lechner, Page and Sheffield (2008); Meterko et al. (2015); Reneman et al. (2004); Reneman et al. (2005); Rustenburg, Kuijer and Frings-Dresen (2004); Trippolini et al. (2014)
Selection bias
Was subject selection random? NM No No NM No NM U NM NM U NM NM No No No U No No No No
Were inclusion/exclusion criteria used? NM Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes NM NM Yes NM NM Yes NM NM
Allocation bias
Was allocation random? Yes NA Yes NA NA NA NA NA NA NA NA NA NA NA NA Yes NA NA NM NA
Performance bias
Were subjects blinded to allocation? NM NA No NA NA NA NA NA NA NA NA NA NA NA NA NM NA NA NA NA
Detection bias
Were raters blinded to allocation? No NA Yes NA NA NA NM NA NA NM NA NA NA NA NA NM NA NA NA NA
Attrition bias
Were there missing data/drop-outs/withdrawals/…? NM No Yes Yes Yes NM Yes No Yes NM Yes Yes NM Yes NM Yes Yes Yes Yes NM
Reporting bias
Were all outcomes stated to be measured actually reported? Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes U Yes Yes Yes Yes
Were there extra results measured post-hoc? No No No U No No No No No No No No No No No U U No No No
Confounders
Were there no significant differences between groups at baseline? U NA Yes NA NA NA NA NA NA NA NA NA NA NA NA NM NA NA NM NA
Analysis
Were the data for all subjects included in the final analysis, even those who withdrew from the study? NM Yes NM NM NM NM NM Yes NM NM No NM NM NM NM NM NM U No NM
Funding bias
Was the study funded by a (possibly) involved party? U U NM NM NM U NM No No No U U U U Yes U NM No NM U
Note: Scores are only based on the methodological aspects of the part of the study in which inter-rater reliability is examined
Table 5: Study methods and results
Study FCE method Psychometric property/
properties studied Study methods and results
Cheng and Cheng (2010) Baltimore
Therapeutic
Equipment (BTE)
Work Simulator
Predictive validity FCE data on subjects were extracted from the clinical databases of
multiple rehabilitation centres. Based on their performance in the BTE, compared
to the physical work demands of the current job in terms of duration, frequency
and intensity of tasks and activities, subjects were given recommendations on
Return-To-Work. Three months after the evaluation, all subjects were contacted by
telephone to find out their current employment status.
Kappa coefficients were calculated to evaluate the strength of agreement and
the McNemar-Bowker test was used to test the symmetry between the Return-
To-Work recommendations and employment status after three months. The
percentage of correct predictions (hit rate) of each recommendation was
measured and the relative percentage difference (deviation) was used to
evaluate the precision of Return-To-Work recommendations with respect to
each FCE-based rating.
A moderate agreement of κ = 0.435 was found between Return-To-Work
recommendations and employment status after three months. A statistically
significant difference (p<0.0001) was obtained by the McNemar-Bowker test,
indicating that there is more disagreement over some categories of Return-To-
Work outcomes than others. A percentage of correct predictions of 88.36%
was observed in returning to previous job, 71.57% in changing job, 23.17% in
returning to previous job with modification and 37.50% in not working at the
moment.
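The agreement statistics described for this study can be illustrated with a minimal sketch. It assumes a square contingency table with Return-To-Work recommendations as rows and employment status at follow-up as columns. The function names and the counts are hypothetical and are not taken from the reviewed study.

```python
def cohens_kappa(table):
    """Unweighted Cohen's kappa for a square contingency table.

    Rows are assumed to be recommendations and columns observed outcomes.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Proportion of cases on the diagonal (recommendation equals outcome).
    observed = sum(table[i][i] for i in range(k)) / n
    # Agreement expected by chance, from the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    return (observed - expected) / (1 - expected)


def hit_rates(table):
    """Share of correct predictions per recommendation category (row)."""
    return [table[i][i] / sum(table[i]) for i in range(len(table))]


# Hypothetical counts for two categories, for illustration only.
example = [[20, 5],
           [10, 15]]
kappa = cohens_kappa(example)
rates = hit_rates(example)
print(round(kappa, 2), rates)
```

In this sketch the hit rate is computed per row, so it describes how often each type of recommendation matched the later employment status.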
Cheng and Cheng (2011) Baltimore
Therapeutic
Equipment (BTE)
Work Simulator
Predictive validity Data of interest on the subjects were extracted from the clinical databases of
multiple rehabilitation centres. Based on their performance in the BTE,
compared to physical work demands of the current job (in terms of duration,
frequency and intensity of tasks and activities), subjects were given
recommendations on Return-To-Work. Three months after evaluation, all
subjects were contacted by telephone to find out their current employment
status.
Kappa coefficients were calculated to evaluate the strength of agreement and
the McNemar-Bowker test was used to test the symmetry between the Return-
To-Work recommendations and employment status after three months. The
percentage of correct predictions (hit rate) of each recommendation was
measured and the relative percentage difference (deviation) was used to
evaluate the precision of Return-To-Work recommendations with respect to
each FCE-based rating.
A moderate agreement of κ = 0.449 was found between Return-To-Work
recommendations and employment status after three months. A statistically
significant difference (p<0.0001) was obtained by the McNemar-Bowker test.
A percentage of correct predictions of 94.83% was observed in returning to
previous job, 60.47% in changing job, 9.38% in returning to previous job with
modification and 60.47% in not working at the moment.
Study FCE method Psychometric property/
properties studied Study methods and results
Brubaker et al. (2007) Blankenship
WorkEval
Functional Capacity
Evaluation
Sensitivity
Specificity
Subjects were randomly assigned to a 100% effort group and a 50% effort
group. Raters were blinded to allocation. Two raters had to apply an ordinal
scale based on the Blankenship with three possible ratings (valid, invalid or
equivocal) to each of the four observed components: repetitive-movement
tests, static-strength tests, occasional-material-handling tests and hand tests.
Sensitivity, specificity and likelihood ratios were computed from contingency
tables.
A sensitivity of 80.0% and a specificity of 84.2% were found. The
positive likelihood ratio was 5 and the negative likelihood ratio 0.2.
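The relation between these figures can be checked with a short sketch, using the standard definitions LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity. The function name is an assumption made for illustration. The reviewed study computed its values from contingency tables, which are not reproduced here.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (positive LR, negative LR) from sensitivity and specificity."""
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


# Values reported above for the Blankenship ratings.
lr_pos, lr_neg = likelihood_ratios(0.800, 0.842)
print(round(lr_pos, 2), round(lr_neg, 2))  # 5.06 0.24
```

The results of about 5.1 and 0.24 are in line with the rounded likelihood ratios of 5 and 0.2 given above.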
Study FCE method Psychometric property/
properties studied Study methods and results
Gouttebarge et al. (2005) Ergo-Kit (EK)
Functional Capacity
Evaluation
Intra-rater reliability
Inter-rater reliability
Each subject was assessed three times on the EK, twice by rater 1 and once by
rater 2. Both raters were blinded to each other’s ratings. A time interval of
4±1 days was used between the first and second assessment and between the
second and third assessment. A time interval of 8±2 days was used between
the first and the third (last) assessment. Seven EK tests were administered in
the following order: the Back-Torso Lift Test (BTLT), Shoulder Lift Test (SLT),
Forward Manipulation Test (FMT), Lower Manipulation Test Crouching
(LMTC), Carrying Lifting Strength Test (CLST), Lower Lifting Strength Test
(LLST) and the Upper Lifting Strength Test (ULST). Three groups (A, B & C)
of nine subjects were formed, according to their availability. Each group was
assessed on the EK with a counterbalanced order of raters.
Levels of intra- and inter-rater reliability were expressed as an Intra-class
Correlation Coefficient (ICC); and a 95% confidence interval was calculated
for each ICC.
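How an ICC is obtained from repeated ratings can be sketched as follows. The text above does not state which ICC form was used, so the sketch assumes the two-way random-effects, single-measurement form, often written ICC(2,1), computed from ANOVA mean squares. The function name and the ratings are illustrative and do not come from the reviewed study.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a complete table: one row per subject, one column per rater."""
    n = len(ratings)        # number of subjects
    k = len(ratings[0])     # number of raters (or rating occasions)
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative ratings: six subjects, each scored by four raters.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(example), 2))
```

Because this form penalises systematic differences between raters, a rater who consistently scores higher or lower than the others reduces the coefficient.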
Intra-rater reliability per EK test: high ICCs were found in the BTLT, with
ICC 0.96 (95% CI = 0.91-0.98) and the SLT, with ICC 0.93 (95% CI = 0.85-
0.97). Moderate ICCs were found in the CLST with ICC 0.88 (95% CI = 0.75-
0.94), in the LLST with ICC 0.86 (95% CI = 0.72-0.93), in the ULST with
ICC 0.84 (95% CI = 0.69-0.92) and in the FMT with ICC 0.76 (95% CI = 0.46-
0.89). The ICC for the LMTC was low, with ICC 0.55 (95% CI = 0.21-0.77).
Inter-rater reliability after a four-day time interval: high ICCs were found in
the BTLT with ICC 0.96 (95% CI = 0.91-0.98), in the LLST with ICC 0.95
(95% CI = 0.89-0.98), in the SLT with ICC 0.94 (95% CI = 0.87-0.97), in the
ULST with ICC 0.94 (95% CI = 0.88-0.97) and in the CLST with ICC 0.93
(95% CI = 0.85-0.97). Moderate ICCs were found in the LMTC with ICC 0.90
(95% CI = 0.78-0.95) and in the FMT with ICC 0.88 (95% CI = 0.74-0.94).
These ICCs indicate moderate to high agreement between raters after a four-
day time interval.
Inter-rater reliability after an eight-day time interval: high ICCs were found in
the BTLT with ICC 0.96 (95% CI = 0.90-0.98), in the SLT with ICC 0.95
(95% CI = 0.88-0.98), in the LLST with ICC 0.92 (95% CI = 0.79-0.97), in the
CLST with ICC 0.91 (95% CI = 0.75-0.97) and in the ULST with ICC 0.91
(95% CI = 0.67-0.97). Low ICCs were found in the LMTC with ICC 0.62
(95% CI = 0.01-0.85) and the FMT with ICC 0.53 (95% CI = -0.27-0.82).
These ICCs indicate moderate agreement between raters after an eight-day time
interval.
Gouttebarge et al. (2006) Ergo-Kit (EK)
Functional Capacity
Evaluation
Inter-rater reliability
Agreement
Subjects were independently assessed twice by different raters on five lifting
tests of the EK: the Back-Torso Lift Test (BTLT), Shoulder Lift Test (SLT),
Carrying Lifting Strength Test (CLST), Lower Lifting Strength Test (LLST)
and the Upper Lifting Strength Test (ULST). The time interval was set at three
days and the EK tests were performed at the same time of the day.
Levels of reliability and agreement were expressed with an Intra-class
Correlation Coefficient (ICC), and their 95% confidence intervals were
calculated. To assess the raters’ stability in repeated measurements over time
and to gain an insight into the clinical relevance of the Ergo-Kit lifting tests,
agreement was expressed with the standard error of measurement (SEM) and its
95% Confidence Interval.
The inter-rater reliability was high in the BTLT with ICC 0.97 (95% CI =
0.94-0.99), in the SLT with ICC 0.96 (95% CI = 0.91-0.98), in CLST with ICC
0.95 (95% CI = 0.84-0.98), in the LLST with ICC 0.94 (95% CI = 0.85-0.97)
and in the ULST with ICC 0.95 (95% CI = 0.89-0.98).
Agreement: the BTLT had a SEM of 8.6 (95% CI = X±16.7), the SLT had a
SEM of 5.0 (95% CI = X±9.8), the CLST had a SEM of 3.4 (95% CI =
X±6.6), the LLST had a SEM of 3.7 (95% CI = X±7.2) and the ULST had a
SEM of 8.6 (95% CI = X±3.8).
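The X± notation can be read as an interval around an observed score X. The sketch below assumes that the interval is obtained as 1.96 × SEM (a normal approximation). The reviewed study's exact procedure is not described here, and the function names and the observed score of 40 are hypothetical.

```python
Z_95 = 1.96  # two-sided 95% critical value of the standard normal distribution


def sem_half_width(sem, z=Z_95):
    """Half-width d of the interval X±d around an observed score X."""
    return z * sem


def sem_interval(observed_score, sem, z=Z_95):
    """Lower and upper limit of the interval around an observed score."""
    half_width = sem_half_width(sem, z)
    return observed_score - half_width, observed_score + half_width


# SEM of 5.0 as reported above for the SLT.
slt_half_width = sem_half_width(5.0)
# Hypothetical observed score of 40, for illustration only.
low, high = sem_interval(40.0, 5.0)
print(round(slt_half_width, 1), round(low, 1), round(high, 1))
```

Under this assumption an SEM of 5.0 gives a half-width of 9.8, which matches the value reported for the SLT.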
Gouttebarge et al. (2009) Ergo-Kit (EK)
Functional Capacity
Evaluation
Construct validity:
Discriminant/ divergent
validity
Convergent validity
Subjects were independently assessed twice by different raters (how many was
not specified) on five lifting tests of the EK: the Back-Torso Lift Test (BTLT),
Shoulder Lift Test (SLT), Carrying Lifting Strength Test (CLST), Lower
Lifting Strength Test (LLST) and the Upper Lifting Strength Test (ULST). The
time interval was set at less than ten days and the EK tests were performed at
the same time of the day. Subjects were divided into two groups post hoc,
based on their outcomes on the Instrument for Disability Risk (IDR), with a
cut-off score of 38% for high risk of work disability. Scores on the five lifting
tests were compared between the high-risk group and the low-risk group to
determine discriminant validity. It was hypothesized that the high-risk group
would have a lower score. To determine convergent validity, the association
between outcomes on the five lifting tests and an adapted version of the Von
Korff Questionnaire (VKQ) was examined.
Independent sample t-tests were performed to determine discriminant validity
of the EK lifting tests in relation to the IDR (p<0.05). To determine convergent
validity, EK lifting tests outcomes and VKQ outcomes were correlated using a
Pearson correlation coefficient.
No statistically significant differences (p>0.05) in the outcomes of the five EK
lifting tests were found between the high risk and the low risk group. This
indicates low discriminative abilities.
Correlations of -0.17, -0.18, -0.27, -0.23 and -0.16 were found between the
VKQ outcomes and the EK lifting tests outcomes, all of which are low. Only two
of these correlations were statistically significant (p<0.05). This indicates low
or no association between EK lifting tests outcomes and VKQ outcomes.
Study FCE method Psychometric property/
properties studied Study methods and results
Rustenburg, Kuijer and Frings-Dresen
(2004)
ERGOS Work
Simulator and the
Ergo-Kit FCE (EK)
Concurrent validity Subjects performed the dynamic lifting tests of both the ERGOS Work
Simulator and the Ergo-Kit to determine frequent and constant lower lifting
capacity and sporadic, frequent and constant upper lifting capacity. Testing
occurred on two different days, separated by an interval of seven days (six or
eight). The time of day was kept constant. The number of raters is not mentioned.
Concurrent validity was determined by comparing the mean values for the
lifting tests by means of a paired t-test and a Spearman rank correlation
analysis (two-sided).
Some similarity between the FCEs was found for the frequent and the constant
lower lifting capacity outcomes, with moderate correlations for both items: ρ =
0.50. Both correlations were statistically significant at p<0.05. No correlation
coefficient could be calculated for sporadic lower lifting capacity, and therefore
concurrent validity for this item was not determined. Little or no similarity
between the FCEs was found for sporadic upper lifting capacity, with a low
correlation of ρ = 0.49. This correlation was statistically significant at p<0.05.
Some similarity between the FCEs was found for frequent upper lifting with ρ
= 0.66 and for constant upper lifting with ρ = 0.56, which are moderate
correlations. Both correlations were statistically significant at p<0.01.
Study FCE method Psychometric property/
properties studied Study methods and results
Gross and Battié (2005) Isernhagen Work
Systems FCE
Predictive validity Data on the subjects undergoing the Isernhagen Work Systems FCE were
extracted from databases, specifically their number of failed tasks and
floor-to-waist lift weight. Subjects were contacted by telephone after one year
for a follow-up interview on their current work status and employment duties,
current pain intensity and perceived disability.
Analysis included Cox and logistic regression to determine the ability of the
number of failed tasks and the floor-to-waist lift weight to predict recovery.
Recovery indicators were days to suspension of time-loss benefits, days to
claim closure and future recurrence.
Associations between FCE results (1 out of 25 tasks failed) and recovery
outcomes: The Adjusted Hazard Ratio (AHR) for days to benefit suspension
was 0.91 (95% CI = 0.87-0.96), the AHR for days to claim closure was 0.93
(95% CI = 0.89-0.98) and the Adjusted Odds Ratio for recurrence was 0.95
(95% CI = 0.89-1.03).
Associations between FCE results (Floor-to-waist lift, 10 kg units) and
recovery outcomes: The Adjusted Hazard Ratio (AHR) for days to benefit
suspension was 1.55 (95% CI = 1.28-1.98), the AHR for days to claim closure
was 1.42 (95% CI = 1.12-1.80) and the Adjusted Odds Ratio for recurrence
was 0.91 (95% CI = 0.63-1.33).
FCE performance indicators were not significantly (p>0.05) associated with
future recurrence, self-reported outcomes of work status, pain intensity and
disability.
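Two simple checks help when reading ratios of this kind. The sketch below assumes that each ratio is expressed per one-unit increase in the predictor (for example, per additional failed task) and that the model is log-linear in that predictor, as in a standard Cox model with a continuous covariate. The function names are hypothetical, and the five-task example is an illustration rather than a result of the reviewed study.

```python
def scaled_ratio(ratio_per_unit, units):
    """Ratio implied for a change of several units under a log-linear model."""
    return ratio_per_unit ** units


def ci_excludes_one(lower, upper):
    """True when a 95% CI for a ratio lies entirely below or entirely above 1."""
    return upper < 1.0 or lower > 1.0


# Adjusted Hazard Ratio for days to benefit suspension, as reported above.
benefit_suspension = ci_excludes_one(0.87, 0.96)
# Adjusted Odds Ratio for recurrence, as reported above.
recurrence = ci_excludes_one(0.89, 1.03)
# Illustration only: implied hazard ratio for five additional failed tasks.
five_tasks = scaled_ratio(0.91, 5)
print(benefit_suspension, recurrence, round(five_tasks, 2))
```

A confidence interval that includes 1 is compatible with no association, which fits the statement above that the FCE indicators were not significantly associated with future recurrence.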
Gross and Battié (2006) Isernhagen Work
Systems FCE
Predictive validity Data on variables of interest were extracted from clinical and administrative
databases of a rehabilitation centre. Based on the nature of the subjects’
diagnosis, they were divided into two groups: the specific injuries group and
the non-specific injuries group. There were also other significant differences
between the two groups (p<0.05), in days between injury and admission,
previous WCB upper extremity claims, number of previous health visits, sex,
side of injury and employment status. Performance in the 15
activities in the upper extremity FCE protocol of the Isernhagen Work Systems
FCE was reported. Based on the subjects’ job requirements, they were graded
with a ‘pass’ or ‘fail’ for these activities (met/did not meet the required job
demand level). The following FCE indicators were used, based on previous
research: maximum performance during handgrip and lift testing and the
number of tasks where performance was rated as ‘fail’ (below required job
demands). The investigated indicators for timely and sustained recovery were: days
receiving time-loss benefits in the year following FCE, days until claim
closure and future recurrence (whether benefits restarted, the claim reopened
or a new upper extremity claim was filed).
To determine relationships between FCE and days to suspension of time-loss
benefits and claim closure, Cox regression was used. Initially, a univariate
screen of the relation between all predictor variables and outcomes was
performed. The adjusted effect of the FCE indicators was then determined by
entering them into multivariable Cox regression models along with the
potential confounding measures found significant in the univariate screen.
Tests of confounding were also undertaken to determine if any other predictor
variable altered regression coefficients by more than 20%. These analyses
were performed for each FCE variable separately. Logistic regression was
used to evaluate the relation between FCE and future recurrence.
Association between number of failed FCE items (out of 15 tasks) and days to
suspension of time-loss benefits resulted in an Adjusted Hazard Ratio (AHR) of
0.97 (95% CI = 0.94-1.00) and associations between maximum performance in
handgrip and lifting tests (10 kg units) varied between AHR 0.98-1.55 (95%
CI = 0.92-1.78). Association between number of failed FCE items (out of 15
tasks) and days to claim closure resulted in an Adjusted Hazard Ratio (AHR) of
0.97 (95% CI = 0.94-1.00) and associations between maximum performance in
handgrip and lifting tests (10 kg units) varied between AHR 1.05-1.81 (95%
CI = 0.98-2.20). Association between number of failed FCE items (out of 15
tasks) and future recurrence resulted in an AHR of 0.97 (95% CI = 0.90-1.04)
and associations between maximum performance in handgrip and lifting tests
(10 kg units) varied between AHR 0.87-1.08 (95% CI = 0.60-1.27).
This indicates low or no associations between the FCE indicators and the
indicators for timely and sustained recovery.
Reneman et al. (2004) Isernhagen Work
Systems FCE
Test-retest reliability Twenty-eight tests of the Isernhagen Work Systems FCE were administered
twice to the subjects within a two to three week interval by one rater. Time of
the day of assessment was kept constant. Subjects were asked to perform at
maximum capacity.
For all criterion tests, the number of subjects who met the criterion for each
test session was calculated and, based on these results, Cohen’s kappas and
percentages of absolute agreement for the two measurements were calculated.
For all tests with a ceiling effect, the number of subjects who reached the
ceiling for each test session was calculated and, based on these results, Cohen’s
kappas and percentages of absolute agreement for the two measurements were
calculated.
ICCs of the material-handling tests ranged from 0.68 to 0.98; one of these
ICCs showed low, five showed moderate and two showed high agreement.
ICC of the shuttle walk test was 0.64, which indicates low agreement. Kappa
coefficients of 0.60 and higher were found in five of the criterion and ceiling
tests, which indicates high agreement. One moderate Kappa coefficient was
found and kappa coefficients could not be calculated in eleven of the tests
because of incomplete filling of the cells in the 2x2 tables.
Reneman et al. (2005) Isernhagen Work
Systems FCE
Inter-rater reliability The subjects were videotaped while performing lifting tests as outlined in the
Isernhagen Work System FCE. Nine observers had to independently rate the
effort levels used by the subjects based on the videos.
Inter-rater reliability for the Isernhagen Work Systems FCE lifting tests was
analysed by means of Cohen’s kappa.
In healthy subjects kappa was 0.58, which indicates moderate agreement
between raters. In the subjects with chronic low back pain kappa was 0.50,
which is moderate.
Trippolini et al. (2014) Isernhagen Work
Systems FCE
Intra-rater reliability
Inter-rater reliability
Twenty-one raters independently rated physical effort in eighteen soundless
videos of four subjects performing eleven tests of the Isernhagen Work
Systems FCE. Each video was rated according to observational criteria
indicative of physical effort for material handling tests as ‘light to moderate’,
‘heavy’ or ‘maximal’. Observational criteria for postural tolerance tests and
ambulation tests were rated on a scale from ‘no or slight functional
problem/limitation’, ‘some functional problem/limitation’ to ‘substantial
functional problem/limitation’. This categorical scale was termed the physical
effort determination (PED) scale. Submaximal effort was assumed when a
client stopped a material or non-material handling test before the FCE rater
observed sufficient criteria indicative of maximal weight or significant
functional problems/limitation. This dichotomous scale was termed the
submaximal effort determination (SED). In total the videos were 28 minutes
long and all raters watched them at the same time. For every test the raters
received standardized information on heart rate, lifted weight (kg) and duration
of the test. Raters had to fill in ratings of physical effort. This rating was
performed twice, once in September 2010 and once in July 2011, with a time
interval of ten months to prevent a learning effect. Between these sessions
each rater performed approximately 30 short FCEs (material handling tests
only), as part of the regular clinical procedure of a work rehabilitation
program.
Intra-rater reliability for the PED and the SED scales was assessed by
comparing the scores from the first rating session with the scores from the
second session for each rater. Inter-rater reliability was assessed twice, by
comparing the scores of all the raters within session 1 and within session 2.
Category 5 ‘not classifiable’ was excluded from the analyses. Inter-rater and
intra-rater reliability were calculated using Cohen’s kappa values for
dichotomous data, and squared weighted kappa values for categorical data and
percentages of agreement.
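A weighted kappa for ordered categories can be sketched as follows. The sketch assumes that "squared weighted" refers to quadratic disagreement weights, (i - j)² / (k - 1)², so that ratings two categories apart count more heavily than ratings one category apart. The function name and the counts are hypothetical and are not taken from the reviewed study.

```python
def quadratic_weighted_kappa(table):
    """Weighted kappa with quadratic disagreement weights for a square table.

    Rows and columns are assumed to hold the same ordered categories,
    for example two rating sessions or two raters.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]

    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = ((i - j) ** 2) / ((k - 1) ** 2)
            observed += weight * table[i][j] / n
            expected += weight * row_totals[i] * col_totals[j] / (n * n)
    return 1.0 - observed / expected


# Hypothetical counts for a three-category effort scale.
example = [[10, 2, 0],
           [2, 10, 2],
           [0, 2, 10]]
print(round(quadratic_weighted_kappa(example), 2))
```

For a dichotomous scale the only off-diagonal weight is 1, so this function returns the same value as the unweighted Cohen's kappa.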
Agreement within raters (intra-rater reliability) for the PED scale was
moderate, with a kappa value of 0.49 (95% CI = 0.22–0.75). Agreement within
raters (intra-rater reliability) for the SED scale was high with a kappa value of
0.68 (95% CI = 0.60–0.76).
Inter-rater agreement was 73% for rating 1 and 85% for rating 2 (percentages
of agreement), which is low.
Agreement between raters (inter-rater reliability) for the PED scale was
moderate in rating 1 with a kappa value of 0.51 (95% CI = 0.23–0.80) and
high in rating 2 with a kappa value of 0.7 (95% CI = 0.49–0.94). Agreement
between raters (inter-rater reliability) for the SED scale was high in rating 1
with a kappa value of 0.68 (95% CI = 0.60–0.76) and also high in rating 2 with
a kappa value of 0.77 (95% CI = 0.70-0.84).
Study FCE method Psychometric property/
properties studied Study methods and results
Brassard et al. (2006) Physical Work
Performance
Evaluation (PWPE)
Test-retest reliability The PWPE was administered at two separate occasions by the same rater out of
four raters, within a time interval of six weeks.
ICC and Cohen’s kappa were calculated to determine test-retest reliability.
ICCs for the tasks of the dynamic force-section ranged from 0.79 to 0.91,
displaying moderate to high agreement between test-retest results. A kappa value
ranging from 0.05 to 0.50 was found for the tasks of the tolerance at different
positions-section, which is low to moderate. A kappa ranging from 0.34 to 0.83
was found for the tasks of the mobility-section, with kappa 0.49 for dynamic
strength, kappa 0.52 for mobility and kappa 0.43 for the total PWPE score;
indicating a moderate agreement between test-retest results.
Durand et al. (2004) Physical Work
Performance
Evaluation (PWPE)
Inter-rater reliability Subjects were evaluated by five raters on the PWPE, and a score for each section
and an overall score were determined. These scores were matched to the
corresponding category of the five categories of work proposed by the Dictionary
of Occupational Titles (DOT): very heavy, heavy, medium, light or sedentary
work.
The PWPE was carried out by two selected raters: one of them administered the
FCE (the direct rater) and the other functioned as silent rater, who only
observed the performance without intervening in the evaluation process. The two
raters alternated being the direct rater.
Unweighted Cohen’s kappa coefficients were calculated to determine the inter-rater
reliability for each task and section of the PWPE and the overall score.
Percentage of agreement and 95% confidence intervals were also calculated.
A high agreement of κ = 0.76 was found for the overall PWPE score (95% CI =
0.58-0.93), with 85% of agreement. High agreement was also found for the
Dynamic Strength section, with κ = 0.81 (95% CI = 0.65-0.96) and for the
Position Tolerance section, with κ = 0.72 (95% CI = 0.54-0.90). Percentages of
agreement were 87.5% and 82.5%, respectively. Moderate agreement was found for the
Mobility section, with κ = 0.54 (95% CI = 0.28-0.81) and 80% of agreement.
Durand et al. (2008) Physical Work
Performance
Evaluation (PWPE)
Internal responsiveness
External responsiveness
Subjects were a group of participants in a work rehabilitation program because of
low back pain (rehabilitation group) and a group of healthy subjects (comparison
group). The PWPE was administered twice to both groups by the same rater, with
a time interval of six weeks. There were five raters. To determine the internal
responsiveness, the change in the pre-/post-test PWPE scores of both groups was
compared. To determine external responsiveness, the change in PWPE scores in
the rehabilitation group was compared to concurrent criteria. Six concurrent
criteria were chosen: aerobic capacity, perceived disability in ADLs, fear-avoidance
of activity, psychological distress, pain level and therapist and worker
judgement. According to the literature, these are predictors of Return-To-Work.
To compare pre-/post-test results per group, Wilcoxon Signed-Ranks Tests were used.
Differences between the two groups were analysed using the Wilcoxon Mann-Whitney
Test. Kendall correlation coefficients were used to determine whether a
significant correlation existed between pre-/post-test differences.
In the rehabilitation group, PWPE-scores significantly (p≤0.05) differed pre-/post-
test for the dynamic strength section (p=0.001) and the position tolerance section
(p=0.0195) but not for the mobility section or the overall PWPE score (internal
responsiveness). In the comparison group no significant (p≤0.05) differences on
PWPE-score pre-/ post-test were found, not for the overall PWPE score or PWPE
sections (internal responsiveness). A statistically significant difference between
both groups pre-/post-test was found only for the dynamic strength section of the
test (p = 0.0008). All concurrent criteria for the rehabilitation group were
significantly different pre-/post-test but correlations between change-scores and
changes in concurrent criteria were not statistically significant; except for the
postural tolerance section: Kendall Tau = -0.41, (p =0.0131) (external
responsiveness). Perceived change in the rehabilitation group was statistically
significant, but did not correlate with the overall PWPE scores (0.28 ≤ Kendall
Tau ≤ 0.37). Therapists’ perception of change was significant but did not
correlate with the overall PWPE scores.
Lechner, Page and Sheffield (2008) Physical Work
Performance
Evaluation (PWPE)
Predictive validity This study used data from December 1993 to October 1994 from a previous study.
The PWPE was administered to the subjects by two raters and the results were
compared to their job requirements (which were observationally analysed by the
raters). After a maximum of five weeks of intervention based on the functional
areas of deficit, the subjects were re-assessed for the PWPE components for which
there was a deficit between job demands and clients’ physical abilities at the initial
assessment. Based on final test results, a recommendation was made: ‘Return-To-Work’,
‘Return-To-Work: modified duty’ or ‘no Return-To-Work’. The subjects
were contacted by telephone three and six months after discharge to determine
their current working level.
A kappa coefficient was calculated to determine the level of agreement between
discharge recommendations and actual Return-To-Work.
At discharge kappa was 0.74 for Return-To-Work recommendations, which is
high. At three months after discharge kappa was 0.69 for Return-To-Work
recommendations, which is high. At six months after discharge kappa was 0.70 for
Return-To-Work recommendations, which is also high.
Study FCE method Psychometric property/
properties studied Study methods and results
Branton et al. (2010) Short-Form
Functional Capacity
Evaluation
Predictive validity Previously collected data were used to compare subjects’ performance in the
items of the Short-Form FCE to administrative recovery outcomes (days to
claim closure, days to time loss benefit suspension and future recurrence). Key
FCE variables for prediction analysis included: overall subject performance,
number of failed FCE items and performance in the individual FCE items.
Analysis included multivariable Cox and logistic regression using a risk factor
modelling strategy.
Associations between FCE results (number of failed items) and administrative
outcomes: Adjusted Hazard Ratio (AHR) for days to benefit suspension = 5.45
(95% CI = 2.73-10.85), AHR for days to claim closure = 5.80 (95% CI = 3.50-
9.61), Adjusted Odds Ratio for recurrence = 1.31 (95% CI = 0.48-3.60). The
proportion of variance explained by the FCE ranged from 18%-27%.
Study FCE method Psychometric property/
properties studied Study methods and results
Meterko et al. (2015) Work Disability
Functional
Assessment Battery
(WD-FAB)
Discriminant/divergent
validity
Convergent validity
Subjects were unable to work because of physical or mental disability (two
disability groups). Each disability group responded to a survey, consisting of
the relevant WD-FAB scales and existing measures (the RAND 36-Item
Health Survey’s Physical Component Summary (PCS) and the Mental
Component Summary (MCS)). The number of raters was not mentioned. For
the physical disability group, four multi-item physical functioning (PF) scales
were identified: changing and maintaining body position, upper body function,
upper extremity fine motor and whole body mobility. For the mental disability
group, four multi-item behavioural health (BH) scales were
identified: Self-Efficacy, Mood and Emotions, Behavioural Control and Social
Interactions. The physical disability group was also administered the
Patient-Reported Outcomes Measurement Information System (PROMIS) Physical
Function 10-Item Short-Form, which measures current capability for mobility,
walking, hand and arm use and activities of daily living. The mental
disability group was also administered five scales of the six-scale Behaviour
and Symptom Identification Scale (BASIS).
Construct validity was assessed by examining both convergent and
discriminant correlations between the WD-FAB scales and scores on same-
domain and cross-domain measures.
All correlations between the measures of physical functioning (the PCS and
the PROMIS) and the four PF scales of the WD-FAB were statistically
significant (p<0.05) in the predicted direction. Correlations between the PCS and
the four PF scales were 0.23, which is low, and 0.42, 0.43 and 0.55, which are
moderate. Correlations between the PROMIS and the four PF scales were 0.54, which
is moderate, and 0.65, 0.69 and 0.70, which are high. This indicates a low to
high association between the four PF scales and same-domain measures
(convergent validity). All correlations between the
measures of behavioural health (5 scales of the BASIS) and the four BH scales
of the WD-FAB were statistically significant (p<0.05) in the predicted
direction. These correlations ranged from -0.24 to -0.74, with two high
correlations, 15 moderate correlations and three low correlations. This
indicates a moderate association between the four BH scales and same-domain
measures (convergent validity).
Correlations between the four PF scales and the MCS were all statistically
significant (p<0.05) and they were all low: 0.12, 0.15, 0.21 and 0.24. This
indicates a good ability to differentiate between mental and physical disability
groups (discriminant/divergent validity). Correlations between the four BH
scales and the PCS were statistically significant (p<0.05) in two out of four
cases. Three out of four correlations were low: 0.06, 0.07 and 0.21; and
one was moderate: 0.31. This indicates a good ability to differentiate between
mental and physical disability groups (discriminant/divergent validity).
Study: James, Mackenzie and Capra (2010)
FCE method: WorkHab Functional Capacity Evaluation
Psychometric properties studied: Test-retest reliability; internal consistency
Study methods and results: The manual handling component (floor to bench, bench to shoulder and bench
to bench lift) of the WorkHab FCE was administered twice to the subjects,
with a time interval of one week. The FCE subtests were administered by one
rater, at approximately the same time of the day. Subjects were asked to
perform to their maximum ability.
One-way random Intra-class Correlation Coefficients (ICCs), 95% Confidence
Intervals (CIs), Limits of Agreement (LoAs), paired sample t-test, kappa
(weighted for ordinal data) and percentage agreement were calculated where
appropriate. A ratio between the LoA and the mean score was also calculated
using the following formula: (1.96 x standard deviation of the mean
difference) / (mean of sessions 1 and 2) x 100%.
An ICC of 0.74 (95% CI = 0.42-0.88) was found for the overall manual
handling score, which is low. ICCs of 0.92, 0.90 and 0.91 were found for the
three subtests of this component, which are moderate to high.
Cronbach's alpha for the overall manual handling score was 0.92, which
indicates good internal consistency. Cronbach's alpha of the three subtests of
this component varied from 0.82 to 0.86, which is good.
Study: James, Mackenzie and Capra (2011)
FCE method: WorkHab FCE
Psychometric properties studied: Intra-rater reliability; inter-rater reliability
Study methods and results: A DVD was made with recordings of the subjects performing the subtests of
the handling component of the WorkHab FCE. This DVD was sent to 17 raters
who had to rate the recordings according to the WorkHab FCE protocol. They
had to fill in a score form which they had to send back to the researchers.
Fourteen of the raters re-assessed the same recordings in a different (random)
order approximately 14 days later. The assessors had to record the weight
lifted and calculate a manual handling score. Stance, posture, leverage, torque
and pacing comprise the manual handling score, which is based on the
principles of safe manual handling, with each of these components being rated
on a scale of 0–4, with '0' being no adherence and '4' being the highest safety
score. The sum of the component scores is recorded as the manual
handling score for each subject.
Intra-class Correlation Coefficients (ICCs) and their 95% confidence intervals
were calculated to determine intra- and inter-rater reliability. Intra-rater
reliability, the level of agreement when the same rater viewed the same clip on
two different occasions, was calculated using ICC Model 3 (a mixed
model where the rater is considered the fixed effect and the subjects are
considered the random effect). The ICC used to determine inter-rater reliability
was Model 2 (both raters and subjects are considered random effects).
For intra-rater reliability of the WorkHab manual handling component an ICC
of 0.97 was found for the manual handling score, which indicates high
agreement within raters. The ICCs of the subtests of the manual handling
component ranged between 0.81 and 0.97, which indicates moderate to high
agreement within raters.
For inter-rater reliability of the WorkHab manual handling component an ICC
of 0.90 was found for the manual handling score, indicating moderate
agreement between raters. The ICCs of the subtests of the manual handling
component ranged between 0.77 and 0.91 for the manual handling score,
which displays moderate to high agreement between raters.
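The difference between the two ICC models named in the James, Mackenzie and Capra (2011) entry can be illustrated with a minimal sketch. It assumes the single-rater forms commonly written ICC(2,1) and ICC(3,1), computed from a two-way ANOVA decomposition; the reviewed study may have used a different form. The function name and the ratings matrix are hypothetical and serve illustration only.

```python
def icc_models(ratings):
    """Return (ICC Model 2, ICC Model 3) for a subjects x raters table.

    ratings: list of rows, one row per subject, one column per rater.
    Assumption: single-rater forms from a two-way ANOVA without replication.
    """
    n = len(ratings)       # number of subjects
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)

    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares: total, between subjects, between raters, residual
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in row_means)
    ss_raters = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_subjects - ss_raters

    ms_subjects = ss_subjects / (n - 1)
    ms_raters = ss_raters / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Model 2: raters are random, so rater variance counts against agreement
    icc2 = (ms_subjects - ms_error) / (
        ms_subjects + (k - 1) * ms_error + k * (ms_raters - ms_error) / n
    )
    # Model 3: raters are fixed, so rater variance is left out
    icc3 = (ms_subjects - ms_error) / (ms_subjects + (k - 1) * ms_error)
    return icc2, icc3


# Illustrative ratings only (6 subjects x 4 raters), not data from the reviewed studies.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc2, icc3 = icc_models(example)
print(f"ICC Model 2: {icc2:.2f}")
print(f"ICC Model 3: {icc3:.2f}")
```

In this invented example the raters differ systematically in how high they score, which lowers the Model 2 value far more than the Model 3 value. This is one reason why the model used should be reported alongside the coefficient.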
3.5. Summary of the results and their interpretations
3.5.1. Baltimore Therapeutic Equipment (BTE) Work Simulator
The job-specific BTE has moderate predictive validity for Return-To-Work/employment
status (29,30).
3.5.2. Blankenship WorkEval Functional Capacity Evaluation
Brubaker et al. (2007) found a sensitivity of 80.0% and a specificity of 84.2% for the
Blankenship FCE system's indicators of submaximal effort (35). This indicates good
diagnostic abilities of these indicators.
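How sensitivity and specificity follow from a two-by-two classification table can be sketched as below. The function name is hypothetical and the counts are invented: they were chosen only so that the resulting proportions match the percentages reported above, and they are not the counts from Brubaker et al. (2007).

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from the four cells of a 2x2 table.

    sensitivity = true positives / all subjects who truly have the condition
    specificity = true negatives / all subjects who truly do not have it
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts (assumption), NOT taken from the reviewed study.
sens, spec = sensitivity_specificity(tp=16, fn=4, tn=16, fp=3)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
```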
3.5.3. Ergo-Kit (EK) Functional Capacity Evaluation
Overall agreement for outcomes on the lifting tests of the EK ranged from 3.4 to 8.6,
which might be interpreted as moderate and indicates moderate variability between
outcomes on the lifting tests (32). Agreement between raters strongly varied for the
lifting tests of the EK from low to high, but was mostly high (31,32). The same applies
to agreement within raters (31).
Low discriminative abilities were found for the EK lifting tests (discriminant/divergent
validity). Little or no association was found between the EK lifting tests and the Von
Korff Questionnaire, indicating good convergent validity (33). Concurrent validity with
the ERGOS Work Simulator was low to moderate (34).
3.5.4. ERGOS Work Simulator (EWS)
Concurrent validity with the Ergo-Kit FCE was low to moderate (34).
3.5.5. Isernhagen Work Systems (IWS) Functional Capacity Evaluation /
WorkWell Systems Functional Capacity Evaluation
A varying test-retest reliability/reproducibility was found in the IWS material-handling
component with moderate to high agreement between outcomes. Overall, this IWS
component provided relatively stable outcomes with limited variation (36). Agreement
between raters was moderate for the lifting tests of the IWS (37) and moderate to high
agreement between raters was found for the physical and behavioural scales used in the
IWS to observationally determine physical effort in eleven IWS subtests (inter-rater
reliability) (38). Agreement within raters was moderate to high for the physical and
behavioural scales used in the IWS to observationally determine physical effort (intra-
rater reliability) (38).
Studies show that performance in the IWS (number of failed tasks and weight lifted) has
no or low predictive value for recovery outcomes such as timely or sustained Return-
To-Work and future pain, based on days to benefit suspension, days to claim closure
and recurrence (11,39).
3.5.6. ErgoScience Physical Work Performance Evaluation (PWPE)
A varying test-retest reliability/reproducibility was found for the sections of the PWPE
with moderate to high agreement between outcomes and moderate agreement between
outcomes of the overall PWPE (41). In other words, the PWPE provides relatively
stable outcomes with limited variation. Agreement between raters was moderate to high
for the PWPE sections and high for the overall PWPE score (42).
Change within subjects was observed by two PWPE sections (the dynamic strength
section and the position tolerance section), but not by the mobility sections or by the
overall PWPE (internal responsiveness) (44). The PWPE was not able to distinguish
change on the subjects' outcomes for reference measures (concurrent and empirical
data) of health status (external responsiveness) (44).
A study by Lechner, Page and Sheffield (2008) shows that performance in the PWPE
(Return-To-Work recommendations based on PWPE outcomes) has high predictive
value for Return-To-Work/employment status (43).
3.5.7. Short-Form Functional Capacity Evaluation
Performance in the Short-Form FCE (number of failed items) has predictive value
for timely recovery, based on days to benefit suspension and days to claim closure,
but not for sustained Return-To-Work, based on future recurrence (45).
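One way to read the adjusted ratios reported for the Short-Form FCE is to check whether each 95% confidence interval excludes 1, the value of no association. The sketch below applies that check to the estimates listed for Branton et al. (2010) in the study methods and results; the function name is hypothetical, and the check is a reading aid rather than the analysis performed in the study.

```python
def excludes_no_effect(ci_lower, ci_upper, null_value=1.0):
    """True if the confidence interval lies entirely above or below null_value."""
    return ci_lower > null_value or ci_upper < null_value


# (estimate, lower 95% limit, upper 95% limit) as reported for Branton et al. (2010)
reported = {
    "AHR, days to benefit suspension": (5.45, 2.73, 10.85),
    "AHR, days to claim closure": (5.80, 3.50, 9.61),
    "AOR, recurrence": (1.31, 0.48, 3.60),
}

for label, (estimate, lower, upper) in reported.items():
    verdict = "excludes 1" if excludes_no_effect(lower, upper) else "includes 1"
    print(f"{label}: {estimate} (95% CI {lower}-{upper}) {verdict}")
```

Both hazard ratio intervals exclude 1, whereas the interval for recurrence includes 1, so an association with recurrence is not supported by these figures.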
3.5.8. Work Disability Functional Assessment Battery (WD-FAB)
Good discriminative abilities were found for the WD-FAB physical functioning and
behavioural health scales for differentiating between physical and mental disability
(discriminant/divergent validity) (46). Low to moderate associations were found between
the WD-FAB physical functioning scales and the PCS, and moderate to high associations
between these scales and the PROMIS (convergent validity) (46).
Low to high associations were found between the WD-FAB behavioural health scales
and the BASIS, with mostly moderate associations (convergent validity) (46).
3.5.9. WorkHab Functional Capacity Evaluation
Moderate to high agreement between outcomes was found in the three subtests of the
WorkHab manual handling component, but agreement between outcomes on the
overall manual handling score was low (test-retest reliability/reproducibility) (47). Overall, the
subtests of the WorkHab manual handling component provided stable outcomes with little variation.
Agreement between raters was moderate to high for the outcomes of the subtests of the
WorkHab manual handling component and moderate for the overall manual handling
component outcome (inter-rater reliability) (48). Agreement within raters was moderate
to high for the subtests and high for the overall manual handling component (intra-rater
reliability) (48).
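The ratio between the Limits of Agreement and the mean score, which was used in the test-retest study of the WorkHab FCE (47), can be sketched as follows. This is a minimal sketch under two assumptions: the standard deviation is the sample standard deviation of the within-subject differences, and "mean of sessions 1 and 2" is the mean of all scores from both sessions. The function name and the scores are invented for illustration.

```python
from statistics import mean, stdev


def loa_ratio(session1, session2):
    """LoA half-width as a percentage of the mean score over both sessions.

    Assumption: (1.96 x SD of the session differences) / (mean of all scores) x 100.
    """
    differences = [second - first for first, second in zip(session1, session2)]
    half_width = 1.96 * stdev(differences)   # sample SD of the differences
    overall_mean = mean(session1 + session2)
    return half_width / overall_mean * 100.0


# Invented scores for five subjects tested twice; not data from the study.
session1 = [20.0, 25.0, 30.0, 22.0, 28.0]
session2 = [22.0, 24.0, 31.0, 23.0, 27.0]
print(f"LoA ratio: {loa_ratio(session1, session2):.1f}%")
```

A smaller percentage means that the expected variation between two sessions is small relative to the size of the scores themselves.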
James, Mackenzie and Capra (2010) found a high internal consistency for the manual
handling component and the individual tests of the manual handling component (47).
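Internal consistency as expressed by Cronbach's alpha can be computed from the variances of the individual items and of the summed score. The sketch below uses the standard formula with sample variances; the function name and the scores are invented for illustration and are not taken from James, Mackenzie and Capra (2010).

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a subjects x items table of scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    """
    k = len(scores[0])  # number of items
    item_variances = [variance([row[j] for row in scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Invented scores: five subjects (rows) on three items (columns).
scores = [
    [2, 3, 3],
    [4, 4, 5],
    [3, 3, 4],
    [5, 5, 5],
    [1, 2, 2],
]
print(f"Cronbach's alpha: {cronbach_alpha(scores):.2f}")
```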
4. Discussion and conclusion
4.1. Results
Overall, the psychometric properties of the studied FCE methods vary somewhat
between, but also within, methods. The predictive validity, for example, is low in some
methods, but high in others. An important question that remains, however, is whether FCEs
should be used to 'predict' or to 'establish'/'diagnose' work ability, for more
client-specific interventions. In case of the latter, sensitivity and specificity become more
important measures.
Inter-rater reliability seems to be good in most FCE methods, as well as intra-rater
reliability. Test-retest outcomes are also promising in most cases. It can be suggested
that the reliability of current FCE methods is generally well-researched and shows
positive results.
Validity has been studied less frequently in recent years. Overall, the FCE methods
in this study show variable validity; however, most outcomes indicate moderate to good
validity.
4.2. Interpreting the results of the reviewed studies
An important side note to the study results is that, “according to several authors, when a
rater is involved in scoring the evaluation, intra-rater reliability is equivalent to test-
retest reliability because the accuracy of the FCE is influenced by the skill of the rater”
(49). In the present study, however, the term test-retest reliability is used, since this is
the term chosen by the authors of the articles in question. Furthermore,
concurrent validity should be determined by studying the relation between the studied
assessment and its gold standard (6). For FCE, however, no gold standard is
available (6). Therefore, the use of the term 'concurrent validity' in the study by
Rustenburg, Kuijer and Frings-Dresen (2004) seems inappropriate. It would be better
to speak in terms of 'comparison' or 'correlation' (6).
Another aspect that should be taken into consideration is the study sample in which the
psychometric properties of the FCE methods are researched. The outcomes of the
reviewed studies should not be generalized to broader populations, because they are
specific to the study sample at hand. Validity or reliability of FCE methods in certain
pathologies might not be the same for the general population.
Finally and most importantly, this review only included evidence published since May
2004. The outcomes of the present study should be integrated with the pre-existing
evidence found in systematic reviews by Innes and Straker (1999a, 1999b);
Gouttebarge, Wind, Kuijer and Frings-Dresen (2004); Wind, Gouttebarge, Kuijer and
Frings-Dresen (2005) and Innes (2006) (6,8,16–18), for a comprehensive representation.
4.3. More than psychometric properties
Since one of the main aims of this review was to enable comparison of FCE methods in
order to make objectively informed decisions, it is also important to look beyond their
psychometric properties (effectiveness). The practical side (efficiency) of FCE methods
is another essential aspect to take into account. According to Hart, Isernhagen and
Matheson (1993), safety should be achieved before considering validity and reliability.
When validity and reliability are demonstrated, practicality and utility should
subsequently be taken into account, in that order (50). Costs, time spent on the
assessment, user-friendliness and acceptability are some important factors, which might
sometimes play a bigger role in the choice of FCE method than their psychometric
properties. Many FCE methods have promising psychometric properties, but are
burdensome clinical tools in terms of time and cost (51). Short-Form FCEs or FCEs in
the form of a structured interview provide a potential answer to these practical problems.
Nonetheless, the psychometrics of these FCE methods need more scientific
substantiation.
Acceptability of FCE was studied by questioning its usefulness in an expert panel (13).
The results showed that two thirds of the experts found FCE useful because it
confirms their own judgement and provides objective information. However, reasons for
not finding FCE useful were that it did not seem objective and did not provide any new
information. Job-specific FCE might provide a solution to the lack of new information and to
non-specific information. The client-centeredness of this kind of FCE is an important asset,
as it should reduce redundant information and provide directly applicable input for the
vocational rehabilitation process. Furthermore, only 20% of the experts judged FCE to
be a useful prognostic instrument, which is low. Most of them argued that FCE is an
evaluative measure and is not to be used for predictive purposes. Studies on the
predictive validity of several FCE methods should replace these opinions with evidence.
4.4. Literature search and potentially missed studies
Some researchers state that the literature search in systematic reviews should be
exhaustive, so that all possibly relevant data are obtained (17). However, given the limited
time frame and resources, an exhaustive approach could not be guaranteed for
this review. It is therefore possible that relevant studies were missed,
for example studies that were not available in the searched databases, studies that
used different keywords or studies published in languages other than English, Dutch or
French. An important limitation of this study is that names of known FCE methods were
not used as search terms. This may have resulted in some relevant studies being missed
and should be taken into consideration when interpreting the results. Lastly,
publication bias may have affected the literature search, as
studies with negative outcomes may not have been published. Nevertheless, the
literature search was performed as thoroughly as possible, by searching all available
databases, using broad search terms and synonyms and applying the inclusion and
exclusion criteria.
4.5. Choice of the quality assessment method
It is important to interpret the quality appraisal scores in this review correctly. The
scoring system used does not consider all important factors that can determine the
quality of reviewed studies. Although a checklist might provide more useful
information on these factors, a scoring system was used to facilitate the interpretation of
the study results. To nuance the (over)simplified representation of the studies'
methodological quality, extra information on the study designs/approaches was
provided in 'Table 8: Methodological aspects of the reviewed studies, susceptible to
bias'. Another reason for using the three-level quality appraisal scale by Gouttebarge et
al. (2004) (6) was that most validated checklists, such as the COSMIN checklist, did not
seem to be suitable for some of the reviewed studies. Other checklists were mostly
designed for appraising experimental studies, but not for studies on psychometric
properties.
4.6. Recommendations for future research
This systematic review has provided a more extensive and updated representation of the
psychometric qualities of several FCE methods. Some more ground has been covered
on the better known FCE methods, while new methods with different approaches are on
the rise and gaining scientific support as well. The newer approaches, such as the
Short-Form FCE, need to be further examined on several psychometric properties.
The psychometrics of most of the well-known methods have been thoroughly researched, but some
of the research indicates weaknesses in their reliability and validity. Future research
should address how these weaknesses can be overcome, while also taking into account
the practicality and utility of the FCE.
REFERENCES
1. OECD. Sickness, Disability and Work: Breaking the barriers. 2011.
2. Andrén D. Work, Sickness, Earnings, and Early Exits from the Labor Market. An
Empirical Analysis Using Swedish Longitudinal Data. 2001.
3. Hakim C. The Social Consequences of High Unemployment. J Soc Policy.
1982;11:433–67.
4. Dooley D, Fielding J, Levi L. Health and Unemployment. Annu Rev Public
Health. 1996;17:449–56.
5. Takala J, Hämäläinen P, Saarela KL, Yun LY, Manickam K, Jin TW, et al.
Global estimates of the burden of injury and illness at work in 2012. J Occup
Environ Hyg. 2014;11(5):326–37.
6. Gouttebarge V, Wind H, Kuijer PP, Frings-Dresen MHW. Reliability and
validity of Functional Capacity Evaluation methods: A systematic review with
reference to Blankenship system, Ergos work simulator, Ergo-Kit and Isernhagen
work system. Int Arch Occup Environ Health. 2004;77(8):527–37.
7. Young AE, Wasiak R, Roessler RT, McPherson KM, Anema JR, van Poppel MNM.
Return-to-Work Outcomes Following Work Disability: Stakeholder
Motivations, Interests and Concerns. J Occup Rehabil.
2005;15(4):543–56.
8. Innes E. Reliability and Validity of Functional Capacity Evaluations: An Update.
Int J Disabil Manag Res. 2006;1(1):135–48.
9. Reneman MF. Introduction to the Special Issue on Functional Capacity
Evaluations: From Expert Based to Evidence Based. J Occup Rehabil.
2003;13(4):203–6.
10. Groothoff JW, Geertzen JHB, Reneman MF. Towards Consensus in Operational
Definitions in Functional Capacity Evaluation: a Delphi Survey. J Occup
Rehabil. 2008;18:389–400.
11. Gross DP, Battié MC. Functional Capacity Evaluation Performance Does Not
Predict Sustained Return to Work in Claimants With Chronic Back Pain. J Occup
Rehabil. 2005;15(3):285–94.
12. Haglund L, Karlsson G, Kielhofner G, Lai JS. Validity of the Swedish version of
the Worker Role Interview. Scand J Occup Ther. 1997;4(1-4):23–9.
13. Wind H, Gouttebarge V, Frings-Dresen MHW. Het nut van Functionele
Capaciteit Evaluatie: de visie van experts. Tijdschr voor Bedrijfs- en Verzek.
2005;13(10):300–5.
14. Chen JJ. Functional Capacity Evaluation & Disability. Iowa Orthop J.
2007;27:121–7.
15. King PM, Tuckwell N, Barrett TE. A Critical Review of Functional Capacity
Evaluations. J Am Phys Ther Assoc. 1998;78(8):852–66.
16. Innes E, Straker L. Validity of work-related assessments. Work. 1999;13(2):125–
52.
17. Innes E, Straker L. Reliability of work-related assessments. Work.
1999;13(2):107–24.
18. Wind H, Gouttebarge V, Kuijer PP, Frings-Dresen MHW. Assessment of
Functional Capacity of the Musculoskeletal System in the Context of Work,
Daily Living, and Sport: A Systematic Review. J Occup Rehabil.
2005;15(2):253–72.
19. Bieniek S, Bethge M. The reliability of WorkWell Systems Functional Capacity
Evaluation: a systematic review. BMC Musculoskelet Disord.
2014;15(1):1–13.
20. van der Meer S, Trippolini MA, van der Palen J, Verhoeven J, Reneman MF.
Which instruments can detect submaximal physical and functional capacity in
patients with chronic nonspecific back pain? A systematic review. Spine (Phila
Pa 1976). 2013;38(25):E1608–15.
21. Kuijer PP, Gouttebarge V, Brouwer S, Reneman MF. Are performance-based
measures predictive of work participation in patients with musculoskeletal
disorders? A systematic review. Int Arch Occup Environ Health. 2012;85:109–
23.
22. Altman DG. Practical statistics for medical research. London; 1991.
23. Deyo RA, Diehr P, Patrick DL. Reproducibility and responsiveness of health
status measures: statistics and strategies for evaluation. Control Clin Trials.
1991;12(4):142–58.
24. Innes E, Straker L. Reliability of work-related assessments. Work. 1999;13:107–
24.
25. Nunnally J. Psychometric theory. 2nd ed. New York: McGraw-Hill; 1978.
26. Weiner E, Stewart B. Assessing individuals. Boston: Little Brown; 1984.
27. Terwee CB, Bot SDM, de Boer MR, van der Windt DAWM, Knol DL, Dekker J,
et al. Quality criteria were proposed for measurement properties of health status
questionnaires. J Clin Epidemiol. 2007;60(1):34–42.
28. Boland A, Cherry GM, Dickson R, editors. Doing a systematic review. A
student's guide. Sage; 2013.
29. Cheng ASK, Cheng SWC. The Predictive Validity of Job-Specific Functional
Capacity Evaluation on the Employment Status of Patients With Nonspecific
Low Back Pain. J Occup Environ Med. 2010;52(7):719–24.
30. Cheng ASK, Cheng SWC. Use of Job-Specific Functional Capacity Evaluation to
Predict the Return to Work of Patients With a Distal Radius Fracture. Am J
Occup Ther. 2011;65(4):445–52.
31. Gouttebarge V, Wind H, Kuijer PP, Sluiter JK. Intra- and Interrater Reliability of
the Ergo-Kit Functional. Arch Phys Med Rehabil. 2005;86:2354–60.
32. Gouttebarge V, Wind H, Kuijer PP, Sluiter JK, Frings-Dresen MHW. Reliability
and Agreement of 5 Ergo-Kit Functional Capacity Evaluation Lifting Tests in
Subjects With Low Back Pain. Arch Phys Med Rehabil. 2006;87:1365–70.
33. Gouttebarge V, Wind H, Kuijer PP, Sluiter JK, Frings-Dresen MHW. Construct
Validity of Functional Capacity Evaluation Lifting Musculoskeletal Disorders.
Arch Phys Med Rehabil. 2009;90(2):302–8.
34. Rustenburg G, Kuijer PP, Frings-Dresen MHW. The Concurrent Validity of the
ERGOS Work Simulator and the Ergo-Kit With Respect to Maximum Lifting
Capacity. J Occup Rehabil. 2004;14(2):107–18.
35. Brubaker PN, Fearon FJ, Smith SM, McKibben RJ, Alday J, Andrews SS, et al.
Sensitivity and Specificity of the Blankenship FCE System's Indicators of
Submaximal Effort. J Orthop Sport Phys Ther. 2007;37(4):161–8.
36. Reneman MF, Brouwer S, Meinema A, Dijkstra PU, Geertzen JHB, Groothoff
JW. Test-Retest Reliability of the Isernhagen Work Systems Functional Capacity
Evaluation in Healthy Adults. J Occup Rehabil. 2004;14(4):295–305.
37. Reneman MF, Fokkens AS, Dijkstra PU, Geertzen JHB, Groothoff JW. Testing
Lifting Capacity: Validity of Determining Effort Level by Means of Observation.
Spine (Phila Pa 1976). 2005;30(2):40–6.
38. Trippolini MA, Dijkstra PU, Jansen B, Oesch P, Geertzen JHB, Reneman MF.
Reliability of Clinician Rated Physical Effort Determination During Functional
Capacity Evaluation in Patients with Chronic Musculoskeletal Pain. J Occup
Rehabil. 2014;24:361–9.
39. Gross DP, Battié MC. Does Functional Capacity Evaluation predict recovery in
workers' compensation claimants with upper extremity disorders? Occup
Environ Med. 2006;63:404–11.
40. Trippolini MA, Dijkstra PU, Côté P, Scholz-Odermatt SM, Geertzen JHB,
Reneman MF. Can Functional Capacity Tests Predict Future Work Capacity in
Patients With Whiplash-Associated Disorders? Arch Phys Med Rehabil.
2014;95:2357–66.
41. Brassard B, Durand M-J, Loisel P, Lemaire J. Étude de fidélité test-retest de
l'Évaluation des Capacités Physiques reliées au Travail. Can J Occup Ther.
2006;73(4):206–14.
42. Durand M, Loisel P, Mercier R, Stock SR, Lemaire J. The Interrater Reliability
of a Functional Capacity Evaluation: The Physical Work Performance
Evaluation. J Occup Rehabil. 2004;14(2):119–29.
43. Lechner DE, Page JJ, Sheffield G. Predictive validity of a Functional Capacity
Evaluation: The Physical Work Performance Evaluation. Work. 2008;31:21–5.
44. Durand M-J, Brassard B, Nha Hong Q, Loisel P. Responsiveness of the Physical
Work Performance Evaluation, a Functional Capacity Evaluation, in Patients
with Low Back Pain. J Occup Rehabil. 2008;18:58–67.
45. Branton EN, Arnold KM, Appelt SR, Hodges MM, Gross DP, Battié MC. A
Short-Form Functional Capacity Evaluation Predicts Time to Recovery but Not
Sustained Return-to-Work. J Occup Rehabil. 2010;20:387–93.
46. Meterko M, Marfeo EE, McDonough CM, Jette AM, Ni P, Bogusz K, et al. Work
Disability Functional Assessment Battery: Feasibility and Psychometric
Properties. Arch Phys Med Rehabil. 2015;96(6):1028–35.
47. James C, Mackenzie L, Capra M. Test-retest reliability of the manual handling
component of the WorkHab Functional Capacity Evaluation in healthy adults.
Disabil Rehabil. 2010;32(22):1863–9.
48. James C, Mackenzie L, Capra M. Inter- and intra-rater reliability of the manual
handling component of the WorkHab Functional Capacity Evaluation. Disabil
Rehabil. 2011;33(19-20):1797–804.
49. Gibson L, Dang M, Strong J, Khan A. Test-retest reliability of GAPP Functional
Capacity Evaluation in healthy adults. Can J Occup Ther. 2010;77:38–48.
50. Hart DL, Isernhagen SJ, Matheson LN. Guidelines for Functional Capacity
Evaluation of people with medical conditions. J Orthop Sports Phys Ther.
1993;18(6):682–6.
51. Gross DP, Battié MC, Asante AK. Evaluation of a Short-form Functional
Capacity Evaluation: Less may be Best. J Occup Rehabil. 2007;17:422–35.
APPENDICES
1. List of figures and tables
Figure 1: PRISMA Flow Diagram
Table 1: Used key words and their synonyms
Table 2: The methodological quality appraisal
Table 3: Interpretation of reliability and validity outcomes
Table 4: Studies sorted by topic (type of studied psychometric property)
Table 5: Studies sorted by FCE method
Table 6: Overview of the study characteristics
Table 7: Results of the methodological quality appraisal
Table 8: Methodological aspects of the reviewed studies, susceptible to bias
Table 9: Study methods and results
2. Permission for consultation
“The author and the promotor give permission to make this master's thesis available for
consultation and to copy parts of it for personal use. Any other
use is subject to the restrictions of copyright law, in particular with regard
to the obligation to explicitly cite the source when quoting results from
this master's thesis.”
12/05/2016
Noortje Schalley Dominique Van de Velde (Promotor)