Author's response to reviews Addressing the challenges of validity …10.1186... · Author's...

Author's response to reviews

Title: Addressing the challenges of validity and reliability of mental health measures in a 27 year longitudinal cohort study - the Northern Swedish Cohort study

Authors:
Anne Hammarström ([email protected])
Hugo Westerlund ([email protected])
Kaisa Kirves ([email protected])
Karina Nygren ([email protected])
Pekka J Virtanen ([email protected])
Bruno Hägglöf ([email protected])

Version: 3 Date: 16 September 2015

Author's response to reviews: see over



Authors’ answers to reviewers’ comments

1st Reviewer's report

Title: Addressing the challenges of validity and reliability of mental health measures in a 27 year longitudinal cohort study - the Northern Swedish Cohort study

Version: 2 Date: 8 April 2015

Reviewer: Karen van Leeuwen

Reviewer's report:

With this paper the authors provided a valuable insight in the multivariate structure of a group of single mental health items used in a cohort study. This provides confidence that single items can be used to construct composite measures using the prespecified factor structure of the dimensions anxiousness, depressiveness, and functional somatic symptoms.

As such, the paper provides support for the internal consistency and structural validity of the mental health subscales, and shows that the multivariate structure is retained over time. However, the methods used and results do not provide evidence for the test-retest reliability of the composite measures, for the content validity of the measures, and for the ability of the scales to detect changes over time. There is no information about whether the items adequately reflect (changes in) the intended construct; it is very well possible that important signs or symptoms are missing in the composite measures, or that other unidimensional constructs are being measured with the composite measures. The authors should acknowledge this and could provide more information about the constructs that are intended to be measured to increase our confidence in the content validity of the scales. These precautions need to be taken into account before concluding that the composite measures are valid and reliable.

Moreover, concluding from this study that it is possible to overcome inherent methodological challenges in using historical data in longitudinal research (see abstract) is a step too far, as the study only deals with some specific challenges.

ANSWER: Thank you for your valuable comments. In relation to this major comment we provide four answers below. First, in relation to your comment about test-retest reliability we have added the following text to our new subtitle “Strengths and limitations” (last in the Discussion):

Another limitation is that we have not been able to assess the test-retest reliability, which would have required repeated measurements with the same respondents at time points that are close enough to each other so that actual changes in the underlying phenomena are unlikely.


Second, analysing the extent to which the composite measures are able to detect changes over time in the phenomena that the measures are intended to reflect is, unfortunately, not possible to do in the data that we have access to, since it would require some kind of external criterion of the real change (for instance repeated psychiatric examinations). However, we would argue that the close correspondence between the items in the composite measures and the DSM 5 criteria makes it reasonably plausible that changes would be detected. A test of this assumption, although not a proof, will be when the composite measures are used in longitudinal analyses with repeated measurements.

In order to acknowledge this, we have added the following sentences in the Strengths and limitations section:

“Analysing responsiveness, i.e. the extent to which the composite measures are able to detect changes over time in the phenomena that the measures are intended to reflect, is unfortunately not possible with the data that we have access to, since it would require some kind of external criterion of the real change (for instance repeated psychiatric examinations). However, we would argue that the correspondence between the items in the composite measures and current concepts of mental problems and symptoms makes it reasonably plausible that changes would be detected. A test of this assumption, although not a proof, will be made when the composite measures are used in longitudinal analyses with repeated measurements.”

Third, in relation to your comment about content validity we have added the following text under a new subheading in the Method:

“Content validity

The question about the content validity of our mental health measures can be analysed in relation to the categorical diagnostic criteria of DSM 5 as well as in relation to a dimensional symptoms approach, based on self-reported questionnaires [19-21]. In the current discussions about diagnostic systems in psychiatry the focus tends to be on both categorical diagnosis and on symptom dimensions [22, 23].

For depressive symptoms our measure can be fairly well validated by DSM 5 since all six symptoms of our measures are within the nine DSM 5 criteria for major depression.

Our items represent rather broad aspects of anxiety symptoms. “Worried or Anxious” and “Anxiety or Panic”, which are included in our questionnaire, are main criteria for most anxiety syndromes of DSM 5. “Restlessness” and “Concentration difficulties” are symptoms of Generalized Anxiety Disorder. “Palpitation or stomach problems” are symptoms of both social anxiety disorder and panic disorder. We would argue that our items have a high face validity, which is corroborated by the fact that similar items are included in validated measures of anxiety like the Hospital Anxiety and Depression Scale [24]. Overall, we believe that there are good reasons to regard our measure of anxiety symptoms as having acceptable content validity.


FSS is a complex concept and there is an ongoing debate about its nature, diagnosis and impact [25]. As described above we used a panel in order to construct our FSS measure and thus, the face validity of our measure is high. The symptoms of our measure also correspond well with what most researchers agree upon [20, 21, 26]. Additional support for the predictive validity of our measure was found in a study of FSS among 16-year-old pupils, which showed that FSS can predict severe adult mental health disorders [27]. DSM 5 cannot be used as a comparison, as the main focus of the corresponding diagnosis (Somatic Symptom Disorder) is on all possible somatic symptoms which are distressing or disruptive of daily life.

In summary, the same or similar items can be found in different self-reported measures that assess depression, anxiety and FSS symptoms as well as in categorical diagnostic systems such as DSM. Also, the symptom criteria for depression and anxiety disorders are almost identical according to the DSM manual from mid adolescence up to adulthood. Therefore, we believe that the content validity of our measurements on depressive symptoms, anxiety symptoms and FSS is good.”

Fourth, in relation to your comment about the conclusion we have added the words “some specific” to the sentence:

Thus, it can be possible to overcome some specific inherent methodological challenges in using historical data in longitudinal research.

Reviewer: Next to these precautions, the limitations of the study are not discussed and the results are not placed in a wider context. I suggest major revisions, as the discussion and conclusion should be rewritten with the several study limitations in mind. In the current version the conclusions are not completely backed by the results.

ANSWER:

The conclusion has been rewritten in order to be backed by our findings:

“Conclusion

Testing the properties of the mental health measures used in older studies according to the standards of today is of great importance in longitudinal research. Our study demonstrates that composite measures of mental health problems can be constructed from single items which are more than 30 years old and that these measures seem to have the same factorial structure and internal consistency across a significant part of the life course. Thus, it can be possible to overcome some specific inherent methodological challenges in using historical data in longitudinal research.”


In addition, a new subtitle ‘Strengths and limitations’ has been added at the end of the Discussion:

“Strengths and limitations

One of the major strengths of the Northern Swedish Cohort study has been its extraordinarily high response rate. In the last follow-up, 94.3% of those still alive participated in the study. As a result, the cohort includes a group of people who are otherwise hard to reach [42], e.g. due to poor health, where mental health problems interfere with their ability or willingness to respond to questionnaires, threatening the representativeness of the findings.

A possible limitation is that, although CFA was developed to study the structure of a proposed measure, it is often criticized because of the fit indices and their vague cut-off values [43]. However, these problems are most pronounced in small datasets, and since our data consist of more than 900 respondents, we see CFA as the most appropriate method to investigate the structure of the proposed mental health measures.

Analysing responsiveness, i.e. the extent to which the composite measures are able to detect changes over time in the phenomena that the measures are intended to reflect, is unfortunately not possible with the data that we have access to, since it would require some kind of external criterion of the real change (for instance repeated psychiatric examinations). However, we would argue that the correspondence between the items in the composite measures and current concepts of mental problems and symptoms makes it reasonably plausible that changes would be detected. A test of this assumption, although not a proof, will be made when the composite measures are used in longitudinal analyses with repeated measurements.

We would furthermore argue that the content validity of the measures of depressive and anxiety symptoms as well as of FSS is high due to face validity and a relatively close correspondence between the included items and internationally used self-report scales and the DSM 5 criteria for depression and anxiety. Regarding functional somatic symptoms, the symptoms included in our FSS scale are commonly found in measurements of FSS in children and adults. There is, however, no clear gold standard for FSS.

Although the content validity is arguably high, a clear limitation is the lack of a quantitative assessment of criterion validity. This will, however, be analysed in an ongoing study where the measures presented in this paper are validated in a clinical population of youths who are diagnosed according to the DSM 5 system combined with self-reports on mental health problems by young people (YSR, SDQ).

Another limitation is that we have not been able to assess the test-retest reliability, which would have required repeated measurements with the same respondents at time points that are close enough to each other so that actual changes in the underlying phenomena are unlikely.

Although the data mainly come from one region in Sweden, the cohort has been shown to be representative of the country as a whole [16].”


Reviewer: Furthermore, there are some other issues which should be addressed to further improve the paper, as well as the quality of the writing. I will address specific issues below. Specific comments:

Major compulsory revisions

Title/abstract

Reviewer 1. It is true that internal consistency, structural validity and factorial invariance over time are supported by the results. However, I am urging caution to use the overall conclusion that the composite measures are valid and reliable and that methodological challenges in using historical data in longitudinal research can be overcome.

ANSWER: The conclusion has been rewritten as follows:

“Testing the properties of the mental health measures used in older studies according to the standards of today is of great importance in longitudinal research. Our study demonstrates that composite measures of mental health problems can be constructed from single items which are more than 30 years old and that these measures seem to have the same factorial structure and internal consistency across a significant part of the life course. Thus, it can be possible to overcome some specific inherent methodological challenges in using historical data in longitudinal research.”

Reviewer 2. The discussion should be rewritten:

The limitations of the study should be discussed

ANSWER: As shown above, a new section has been included at the end of the Discussion with the subtitle “Strengths and limitations”.

Reviewer 3. See my comments to the abstract

ANSWER: The abstract has been changed accordingly.

Reviewer 4. The authors should acknowledge that the structural validity was assessed but that this does not provide information about content validity of the composite measures. Some discussion about deviations from definitions of constructs or more established instruments concerning the content of the items in each scale would be welcome.

ANSWER: In relation to your question about structural validity we refer to your comment no 22 below where you suggest that we use the COSMIN terminology to describe the measurement properties. We have followed your advice (see below). However, structural validity is not applicable, since it is meaningful only when the items on a scale are highly correlated and interchangeable since they are manifestations of the same underlying concept, which is not in accordance with the multi-dimensional nature of mental diagnoses.

In relation to your question about content validity we have written a new section about this topic in the Method (as shown above). We have also included the following discussion of the content validity of our composite measures under Strengths and limitations:

“Although the content validity is arguably high, a clear limitation is the lack of a quantitative assessment of criterion validity. This will, however, be analysed in an ongoing study where the measures presented in this paper are validated in a clinical population of youths who are diagnosed according to the DSM 5 system combined with self-reports on mental health problems by young people (YSR, SDQ).”

Reviewer: Which items/symptoms are included in the scales that were used as inspiration but were not included in the scales composed in this study?

ANSWER: As a response to your question we have added the following text to the Method:

“When the study started at the beginning of the 1980s we found no validated measures of mental health directed towards young people themselves. Instead, we were inspired by the single item questions about mental health symptoms used by a Norwegian child psychiatrist in his studies of 16-year-old pupils [5].”

Reviewer: 5. What do you mean with ‘The focus of the present study is valid for most longitudinal studies’?

ANSWER: The sentence has been changed: The focus of the present study is relevant for most longitudinal studies.

6. The paragraph starting with ‘since our sample’ should be reformulated. It is unclear what the authors mean.

ANSWER: We agree and have deleted the sentence.

Reviewer: 7. Last paragraph of discussion: A threat to validity of what? Do you mean external validity? The use of the word ‘validity’ is confusing as it seems that something different is meant than in the rest of the paper. Where do ‘you’ and ‘those’ refer to? ‘Worse’ than what?

ANSWER: The sentence has been changed: “As a result, the cohort includes a group of people who are otherwise hard to reach [42], e.g. due to poor health, where mental health problems interfere with their ability or willingness to respond to questionnaires, threatening the representativeness of the findings.”


8. The results should be placed in a wider context. Are there other cohort studies in which mental health scales are composed of single items? Are the composite measures similar? How can other researchers (that do not use data from the Northern Swedish Cohort) benefit from this study?

ANSWER: The following paragraph has been added to the Result Discussion:

“Placing our findings in a wider context, our analysis provides an innovative approach and could be an inspiration for both old and newer cohorts. Many of the other old public health oriented cohort studies from the early 1980s included single items about mental health symptoms, rather than clinical investigations or validated measures, at least in their first wave(s). This is the case for the Isle of Wight study [35], the 1958 British birth cohort [36], the Nord-Trøndelag Health Study (the so-called HUNT study) [37], the Tampere cohort study of school leavers [38], and the US Wisconsin Longitudinal Study [39]. However, the consistency between data collections is far lower for several of these cohorts, which means that longitudinal analyses of composite measures of mental health would be more difficult to perform. In the National Longitudinal Study of Youths from the US [40, 41] the factor structure of anxiety and depressive symptoms was analysed by CFA longitudinally as in our study, but in a younger population of children from 4 to 14 years of age. Overall, we argue that our work could be useful for several of the existing old cohort studies. Also, our paper is an inspiration for newer cohorts to keep their initial questions over time.”

In addition, we have clarified the main implications as well as recommendations to other researchers in the conclusions.

9. There are no recommendations for further research work

ANSWER: We have added the following sentence in the Conclusions:

Our recommendations to old cohorts are to stick to their original questions of mental health symptoms and to test their validity as composite measures.

10. What are the implications of the results?

ANSWER: We have clarified the implications in the Conclusion:

“The main implication of our study is that composite measures of mental health problems can be constructed from single items which are more than 30 years old and that these measures seem to have the same factorial structure and internal consistency across a significant part of the life course.”

The conclusions should be rewritten:

ANSWER: The conclusion has been rewritten as follows:

“CONCLUSIONS

Testing the properties of the mental health measures used in older studies according to the standards of today is of great importance in longitudinal research. The main implication of our study is that composite measures of mental health problems can be constructed from single items which are more than 30 years old and that these measures seem to have the same factorial structure and internal consistency across a significant part of the life course. Thus, it can be possible to overcome some specific inherent methodological challenges in using historical data in longitudinal research.

Our recommendations to old cohorts are to stick to their original questions of mental health symptoms and to test their validity as composite measures.”

11. Although the study shows that composite measures can be constructed from single items, it does not show that the development of mental health can be traced. To trace the development of mental health, there should be evidence for the content validity and responsiveness of the scales.

ANSWER: As stated above we have written a new section about content validity in the Method. In relation to your question about development of mental health we have added the following sentences under Strengths and limitations:

“Analysing responsiveness, i.e. the extent to which the composite measures are able to detect changes over time in the phenomena that the measures are intended to reflect, is unfortunately not possible with the data that we have access to, since it would require some kind of external criterion of the real change (for instance repeated psychiatric examinations). However, we would argue that the correspondence between the items in the composite measures and current concepts of mental problems and symptoms makes it reasonably plausible that changes would be detected. A test of this assumption, although not a proof, will be made when the composite measures are used in longitudinal analyses with repeated measurements.”

12. The results only apply to the Northern Swedish Cohort, or to other studies in which the same items are used.

ANSWER: We have answered this question by adding the new paragraph about the wider context (as shown in our answer to your comment number 8 above) as well as by adding more information about the representativity of the cohort:

“The cohort has been shown to be representative for Sweden as a whole in relation to demographics, socioeconomic status and health complaints [15] as well as to Scandinavian young people in relation to self-reported mental health symptoms [16].”

13. This study shows that some of methodological challenges can be overcome

Quality of the writing

14. This article would benefit from a close editing. I found it difficult to follow some of the author’s argument due to stylistic and grammatical errors

ANSWER: We have performed a close editing of the paper including a careful correction of the language.

Minor essential revisions

Introduction

15. The authors provide little information about the construct ‘mental health’ and the several subscales that they intend to measure. The introduction could benefit from describing definitions for these constructs.

ANSWER: As a response we have added the following sentences in the Introduction:

“There is a constant development of definitions of mental health problems as well as of the demands on the properties of mental health measures. Also, mental health problems are defined and described in different terms and taxonomies depending on period and age. In addition, adolescents may have a different understanding and different expressions of the same mental health problems compared with adults of different ages [1, 2].

---

Internalised problems represent depression, anxiety, and functional somatic symptoms (FSS) whereas externalised problems describe different symptoms of acting-out behaviour such as antisocial, delinquent and aggressive behaviour [8, 10]. These self-report measures can predict depression and anxiety disorders [11, 12]. There is also a concept of positive mental health which is seen as individual or contextual characteristics that could promote good mental health.”

16. Please describe what (extended) internalized and externalized problems are

ANSWER: We have added the following sentence in the Introduction:

“Internalised problems represent depression, anxiety, and functional somatic symptoms (FSS) whereas externalised problems describe different symptoms of acting-out behaviour such as antisocial, delinquent and aggressive behaviour [8, 10]. These self-report measures can predict depression and anxiety disorders [11, 12]. There is also the concept of positive mental health that is seen as individual or context characteristics that could promote good mental health.”

Methods

17. Could you provide more information about the Northern Swedish Cohort? What was the aim of the cohort? Is the municipality representative for the rest of the country? How and where are the questionnaires administered? Was informed consent obtained?

ANSWER: We have provided the following information in the Method:

“The cohort

The initial aim of the Northern Swedish Cohort was to analyse the health consequences of youth unemployment. Thus, the questionnaires have from the start contained a large number of questions about both somatic and mental health symptoms. The cohort has been shown to be representative for Sweden as a whole in relation to demographics, socioeconomic status and health complaints [15] as well as to Scandinavian young people in relation to self-reported mental health symptoms [17].”

The following text has been added to the Data collection:

The questionnaires were collected during school hours at age 16 and at school class reunions at the follow-ups. The questionnaire was posted to those who could not participate in these reunions.

18. The percentage of missing responses is not described, and there is no description of how missing items were handled. Furthermore, the authors report the effective sample size (line 119), but did not describe why data from participants not included in this effective sample size were not used.

ANSWER: The sentence about effective sample size has been replaced by the following text under “The Population”:

“In the final questionnaire analyses of this paper, the sample size varied between 914–934 cases due to missing values. The missing data was handled with maximum likelihood estimation provided by Mplus. Of the 934 participants in 2008, 44.1% were women and 34.9% worked as a blue-collar, 13.6% as a lower white-collar, and 51.6% as an upper white-collar worker. Moreover, 57.9% rated their general health as good, 4.5% as bad, and 28.1% evaluated themselves to be in between good and bad health.”

19. Why were scale values computed as the mean of item values?

ANSWER: To obtain scales with the same range, which facilitates interpretation of the values.
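The effect of averaging rather than summing can be illustrated with a minimal sketch. The item counts, the 0-3 response coding and the function name below are hypothetical, chosen only for illustration; they are not taken from the questionnaire.

```python
def scale_score(item_values):
    """Return the mean of the item values for one respondent."""
    return sum(item_values) / len(item_values)

# Hypothetical responses, coded 0-3 (coding assumed for illustration only).
six_item_scale = [3, 3, 3, 3, 3, 3]   # a scale built from six items
four_item_scale = [3, 3, 3, 3]        # a scale built from four items

# Sum scores depend on the number of items, so the ranges differ.
print(sum(six_item_scale), sum(four_item_scale))

# Mean scores stay within the range of the response coding for both scales.
print(scale_score(six_item_scale), scale_score(four_item_scale))
```

With sum scores the two scales would have different maxima simply because they contain different numbers of items; with mean scores both stay within the range of the response categories, so values can be read on the same footing.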

20. Confirmatory factor analysis requires sufficient variation in the data. However, there is no report of response distributions (See discretionary revisions for a suggestion to rearrange this table)

ANSWER: The response distribution of all variables in Table 1 has been added from age 16. Due to the categorical nature of the items and the large number of them (altogether 108), we are not able to present all the response distributions in the manuscript. Nevertheless, we have checked that there is enough variation in the data (i.e. answers in each response category), and furthermore the estimation method, i.e. WLSMV, takes into account the categorical nature of the variables and uneven distributions.

Results

21. What were the characteristics of the final sample in terms of age, sex distribution, socio-economic status and disease characteristics (including mental health)?


ANSWER: The following text has been added in the Method: “In the final questionnaire analyses of this paper, the sample size varied between 914–934 cases due to missing values. The missing data was handled with maximum likelihood estimation provided by Mplus. Of the 934 participants in 2008, 44.1% were women and 34.9% worked as a blue-collar, 13.6% as a lower white-collar, and 51.6% as an upper white-collar worker. Moreover, 57.9% rated their general health as good, 4.5% as bad, and 28.1% evaluated themselves to be in between good and bad health.”

Discretionary revisions

Title/abstract

22. I would suggest using the COSMIN terminology to describe the measurement properties that are assessed in this paper (i.e. internal consistency and structural validity). COSMIN stands for COnsensus-based Standards for the selection of health Measurement Instruments, and the group performed a Delphi study in which international consensus was reached on terminology, definitions, and a taxonomy of the relationships of measurement properties of health-related measurement instruments.

ANSWER: We have now gone through the paper to check that the COSMIN terminology is consistently used according to the COSMIN definitions (allowing for discrepancies such as replacing ‘patients’ with ‘participants’). However, due to limitations in our material, we are not able to cover all aspects of measurement properties in the COSMIN manual.

The COSMIN terms that seem to be applicable to our study are Content validity (the degree to which an instrument measures the construct(s) it purports to measure) and Internal consistency. Structural validity is not applicable, since it is meaningful only when the items on a scale are highly correlated and interchangeable since they are manifestations of the same underlying concept, which is not in accordance with the multi-dimensional nature of mental diagnoses. We have therefore retained “factorial invariance” which is not covered by COSMIN but should be a part of construct validity (just as cross-cultural validity is).

Discussion

23. Would it be possible to compare the proportion of participants with mental health problems (or with symptoms of anxiety, depression etc.) or mean number of symptoms with prevalence numbers known from other studies? This would improve the confidence in the validity of the composite scales

ANSWER: We have no cut-off points in our composite measures and therefore such a comparison is difficult. However, we have compared single items and added the following information:

“The cohort has been shown to be representative for Sweden as a whole in relation to demographics, socioeconomic status and health complaints [15] as well as to Scandinavian young people in relation to self-reported mental health symptoms [17].”

Tables

24. Table 1 is a bit confusing, it is not very clear which items are included in which composite scales and the scoring is more clearly described in the text. Furthermore, what does x mean?

ANSWER: We have now re-arranged Table 1 to make it clearer according to the suggestions of the reviewer. It may still be fairly complex, but hopefully it is easier to understand.

25. In addition, the response distributions could be included in Table 1

ANSWER: The response distribution from age 16 has been added to Table 1.

26. To improve the comprehensibility of this table, it could be rearranged in the following columns: item – response categories – response distribution – inclusion in composite measure anx/dep/fss/ghq6-n/ghq6-p

ANSWER: Thank you; we have followed your suggestion, although we decided to merge response categories and response distribution into one column in order to avoid repetition of the response categories.

27. Table 2-4: it would be informative to show the range of factor loadings for each measurement model

ANSWER: The range of factor loadings for each measurement model has been added to Tables 2-4.

Quality of writing:

Written below are some examples that indicate insufficient quality of writing.

Lines 86-88: clumsy sentence

ANSWER: The sentence has been deleted.

95: replace cluster with clusters

ANSWER: Done.

102: What do you mean with ‘the acceptability of the scientific rigor of constructs over the life course’? Scientific rigor is very broad, could you be more specific?

ANSWER: We agree that this was not well formulated, and we have now re-written the whole sentence to read: “A question that remains to be analysed is if measures of more modern constructs of mental health symptoms can be derived from old single items as well as if the properties of such measures are acceptable over the life course.”

105: replace items with item

ANSWER: Done.

104-109: long and confusing sentence.

ANSWER: The sentence has been reformulated as follows: The aim of this paper was to construct composite measures of mental health problems from single item questions of mental health problems from the early 1980s which conform to contemporary measurement standards with items largely parallel to the criteria in the DSM diagnostic system [13] and constructs from internationally validated self-reported questionnaires [14, 15]. The aim was further to evaluate the content validity, internal consistency and factorial invariance of these composite measures from adolescence to middle age using the Northern Swedish Cohort.

108: I don’t think ‘over the life course’ is the correct phrase:

ANSWER: The phrase has been changed to: “from adolescence to midlife”.

188: what are ‘measurement models’? Could you be more specific?

ANSWER: The following explanation has been added to our paper:

The measurement model is a model that examines the relationship between the latent factors and the items related to them.

188: use ‘tested by’ instead of ‘tested with’ (also in rest of the paper):

ANSWER: The change has been made throughout the paper.

188: replace factors with factor

ANSWER: Done.

233/234: clumsy sentence

239: use ‘disappear’ instead of ‘are disappeared’

ANSWER: Done.

241: replace ‘were from acceptable to good’ with ‘ranged from acceptable to good’.

ANSWER: Done.

293: ‘as well as’ and ‘also’ is a bit excessive

ANSWER: ‘Also’ has been deleted.

316: replace ‘strength’ with ‘strengths’

ANSWER: Done.

Level of interest: An article of limited interest

Quality of written English: Needs some language corrections before being published

ANSWER: The language has been checked again and corrected.

Statistical review: No, the manuscript does not need to be seen by a statistician.

Declaration of competing interests: I declare that I have no competing interests

2nd Reviewer's report Title: Addressing the challenges of validity and reliability of mental health measures in a 27 year longitudinal cohort study - the Northern Swedish Cohort study Version: 2 Date: 9 July 2015 Reviewer: Michael King Reviewer's report: This is an interesting analysis of data from a longitudinal cohort in Norway, which has been followed for several decades using mainly the same questions. The authors’ aim was to score the various questions on anxiety, depression and psychosomatic concern into groupings that would align with current DSM diagnostic categories or mimic modern psychological scales. Their interest is to do this not solely for this study but also to show that older style questions can be used to tap modern concepts of psychological distress. It is elegant and well written, the follow-up rate is impressive and they make a good point about the use of the GHQ scoring. However there is need for clarification on a number of points:

Major compulsory

1. Although the authors have confirmed a stable factor structure across time, does their analysis violate the assumption of independence? In other words, should they have taken account of repeated measures in the same participants?

ANSWER: Thank you for your valuable comments. The issue of repeated measures has already been taken into account by allowing the corresponding measurement errors of original items to covary across time, as is now specified in the Method: “In both models, corresponding measurement errors of original items were allowed to covary across time.”

Minor essential

1. The terms validity and reliability are used somewhat loosely. They have shown that the questions group in patterns rather like modern classifications – whether this means they are truly valid can never be certain. It might be more accurate to say it is a form of concurrent or even content validity. Instead of reliability, it is essentially internal consistency that they have demonstrated.

ANSWER: We have now, in line with Reviewer 1, made sure that the terminology conforms to the COSMIN criteria.

2. Could table 1 be improved to make it clearer how parts of questions (anxiousness and depressiveness) were combined and scored? Those terms might be better expressed as ‘anxiety symptoms’ and ‘depression symptoms’ as their words are very clumsy in English.

ANSWER: Table 1 has been changed according to your suggestion (as well as according to comments from Reviewer 1).

Level of interest: An article of importance in its field

Quality of written English: Needs some language corrections before being published.

ANSWER: The language has been corrected.

Statistical review: Yes, and I have assessed the statistics in my report.

Declaration of competing interests: I declare that I have no competing interests

3rd Reviewer's report Title: Addressing the challenges of validity and reliability of mental health measures in a 27 year longitudinal cohort study - the Northern Swedish Cohort study Version: 2 Date: 3 July 2015 Reviewer: Sarah White Reviewer's report: This paper sets out to explore whether composite measures of mental health problems can be constructed and shown to be valid and reliable over time. Using the Northern Swedish Cohort they show that composite measures of anxiousness, depressiveness, functional somatic symptoms, GHQ and positive health have adequate psychometric properties which hold over time and test explicitly that the factor structure is invariant over time. Competing models of the composite measures are compared. This is an interesting and important methodological issue, the study of which is warranted given the very high retention rate of this particular cohort ensuring it as an invaluable resource for studying the change in mental health problems over the life course. As the same single item questions have been used at each data collection the opportunity arises to explore the factor structure in this manner. The methods and statistical analysis are appropriate and the results are largely written up in a satisfactory manner. The authors have used complex statistical methods and have made a good attempt at making them accessible to a wide audience. I have no doubts in the veracity of the data. However there are areas where greater clarity is required to improve the readability of the paper. I describe these below.

Major Compulsory Revisions

1. General - It is necessary to define the terminology explicitly with respect to the following terms; item, measure, scale, composite, single-item, factor, and then be consistent in their use throughout. This is particularly when the combined scales analysis is reported. It may be helpful to edit Table 1 so to indicate what factor structures are being tested and this would be an opportunity to indicate the relationship between item, measure, scale, etc.

ANSWER: Thank you for valuable comments. We have defined the terminology in the Method. Also, we do not use ‘scale’ any longer:

“By questionnaire item we denote an individual question that the respondents have answered in the questionnaire by a single response. By measure (or composite measure) we mean a set of items that are thought to represent the same latent concept (e.g. depressive symptoms). A factor denotes a statistical variable which summarises variance shared between a number of observed variables, e.g. responses to questionnaire items, potentially corresponding to an underlying, unobserved latent variable. The extent to which the observed variance of the individual items in a theoretically constructed (composite) measure can be described by such a factor is an indication of the internal consistency of the measure.”

Methods

2. The first stage of a confirmatory factor analysis (CFA) is to specify the models to be tested. These models need to be supported by theory and/or previous research. More detail on the approach taken to define the models is needed.

ANSWER: As a response to this comment, as well as to a comment by Reviewer 1, we have added a new paragraph about Content validity in the Methods in which we also support the models by previous research:

“Content validity

The question about the content validity of our mental health measures can be analysed in relation to the categorical diagnostic criteria of DSM 5 as well as in relation to a dimensional symptoms approach, based on self-reported questionnaires [7, 18, 19]. In the current discussions about diagnostic systems in psychiatry the focus tends to be on both categorical diagnosis and on symptom dimensions [20, 21].

For depressive symptoms our measure can be fairly well validated by DSM 5 since all six symptoms of our measures are within the nine DSM 5 criteria for major depression.

Our items represent rather broad aspects of anxiety symptoms. “Worried or Anxious” and “Anxiety or Panic”, which are included in our questionnaire, are main criteria for most anxiety syndromes of DSM 5. “Restlessness” and “Concentration difficulties” are symptoms in Generalized Anxiety Disorder. “Palpitation or stomach problems” are symptoms of both social anxiety disorder and panic disorder. We would argue that our items have a high face validity, which is corroborated by the fact that similar items are included in validated measures of anxiety like the Hospital Anxiety and Depression Scale [22]. Overall, we believe that there are good reasons to regard our measure of anxiety symptoms as having acceptable content validity.

FSS is a complex concept and there is an ongoing debate about its nature, diagnosis and impact [23]. As described above we used a panel in order to construct our FSS measure and thus, the face validity of our measure is high. The symptoms of our measure also correspond well with what most researchers agree upon [19, 24, 25]. An additional support for the predictive validity of our measure was found in a study of FSS among 16-year-old pupils, which showed that FSS can predict severe adult mental health disorders [26]. DSM 5 cannot be used as a comparison, as the main focus of the corresponding diagnosis (Somatic Symptom Disorder) is on all possible somatic symptoms which are distressing or disruptive of daily life.

In summary, the same or similar items can be found in different self-reported measures that assess depression, anxiety and FSS symptoms as well as in categorical diagnostic systems such as DSM. Also, the symptom criteria for depression and anxiety disorders are almost identical according to the DSM manual from mid adolescence up to adulthood. Therefore, we believe that the content validity of our measurements on depressive symptoms, anxiety symptoms and FSS is good.”

3. Line 119 - Explain ‘effective sample size’. Is this because there is missing data in the 1010, leaving 929 complete cases for analysis?

ANSWER: The sentence has been rewritten as follows:

In the final analyses of this paper, the sample size varied between 914–934 cases due to missing values. The missing data was handled with maximum likelihood estimation provided by Mplus.

4. Expand on ‘factorial invariance’.

ANSWER: The following text has been added to the Method:

“Factorial invariance was tested at the level of factor loadings in order to verify that the same manifest variables are measuring the same latent attributes (e.g., anxiousness) in the same way in each year.”
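As an illustration only, the hypothesis described in this sentence can be written out as follows. The notation is ours and is not taken from the manuscript: x_{jt} is the response to item j at time point t, \eta_t the latent factor at that time point, \lambda_{jt} the factor loading, \tau_{jt} an intercept and \varepsilon_{jt} the measurement error.

```latex
% Measurement model for item j at time point t (notation assumed for illustration)
x_{jt} = \tau_{jt} + \lambda_{jt}\,\eta_{t} + \varepsilon_{jt}

% Factorial invariance at the level of factor loadings:
% each item has the same loading at every time point t = 1, \dots, T
\lambda_{j1} = \lambda_{j2} = \dots = \lambda_{jT} \quad \text{for every item } j
```

Under this constraint, a one-unit difference in the latent factor corresponds to the same expected difference in an item's response at every data collection.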

5. Line 142 to 154 – to make this clearer it would be helpful to include an example respondent, ie, someone who checked three symptoms, with a frequency of Often would score (0+0+2+2+2)/5= 1.2. I assume this is correct.

ANSWER: Yes, this is correct and we have now provided such an example: “For example, someone who first indicated that they had experienced restlessness and palpitation, and then answered that they had had such symptoms often, received a total score of (1*2+0*2+0*2+1*2+0*2)/5=0.8 for anxiety symptoms, the theoretical range being 0-2.”
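The arithmetic in this example can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than the scoring code used in the study: the function name `composite_score`, the constant `OFTEN` and the order of the five items in the lists are hypothetical. Only the code 2 for ‘often’, the division by the number of items and the 0-2 range come from the example itself.

```python
# Hypothetical helper illustrating the scoring example; not the study's own code.
OFTEN = 2  # frequency code for "often", as used in the worked example


def composite_score(checked, frequency_code):
    """Mean of (symptom indicator * frequency code) over all items.

    checked        -- list of 0/1 indicators, one per item in the measure
    frequency_code -- frequency reported for the checked symptoms (0-2)
    """
    if not checked:
        raise ValueError("at least one item is required")
    return sum(c * frequency_code for c in checked) / len(checked)


# Two of five anxiety symptoms checked, experienced often (authors' example).
print(composite_score([1, 0, 0, 1, 0], OFTEN))  # 0.8

# Three of five symptoms checked, experienced often (reviewer's example).
print(composite_score([0, 0, 1, 1, 1], OFTEN))  # 1.2
```

With no symptoms checked the score is 0, and with all five checked at ‘often’ it is 2, which matches the stated theoretical range.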

6. Line 213 – this paragraph needs a final sentence indicating what criteria are used to judge which is the most appropriate factor structure for a combined scale.

ANSWER: The following sentence has been added:

“The significantly lowest chi-square was chosen.”

Results

7. Paragraph beginning on line 232 – as well as the fit indices described in the data analysis section the issue of significant factor loadings above 0.6 is considered here. Include in the data analysis section an additional criteria to judge the fit of the measurement model, ie, regarding factor loadings.

ANSWER: The information which you ask for has been included in the following sentence: “The fit of the measurement models was evaluated using χ², the Comparative Fit Index (CFI), the Root Mean Square Error of Approximation (RMSEA) and its 90% confidence interval and factor loadings. A good fit is indicated by a non-significant χ², CFI ≥ .95, RMSEA ≤ .06 and loadings ≥ .40 [28].”
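The good-fit criteria quoted above can be expressed as a simple check, sketched here in Python. The function name `check_good_fit`, the 0.05 significance level for the χ² test and all input values are assumptions made for illustration; the thresholds for CFI, RMSEA and loadings are the ones stated in the quoted sentence. The values shown are not results from the study.

```python
# Hypothetical helper; thresholds follow the quoted criteria, alpha is assumed.
def check_good_fit(chi2_p_value, cfi, rmsea, loadings, alpha=0.05):
    """Report which of the quoted good-fit criteria a model meets."""
    return {
        "non_significant_chi2": chi2_p_value > alpha,   # assumed alpha = 0.05
        "cfi_at_least_.95": cfi >= 0.95,
        "rmsea_at_most_.06": rmsea <= 0.06,
        "all_loadings_at_least_.40": all(l >= 0.40 for l in loadings),
    }


# Made-up fit statistics for one measurement model.
result = check_good_fit(chi2_p_value=0.21, cfi=0.97, rmsea=0.04,
                        loadings=[0.55, 0.62, 0.48])
for criterion, met in result.items():
    print(criterion, met)
print("good fit on all criteria:", all(result.values()))
```

Returning each criterion separately, rather than a single yes/no, makes it visible which index is responsible when a model falls short.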

Discussion

8. Line 298-302 – I don’t understand this point, it needs to be elaborated upon.

ANSWER: The sentences have been rewritten as follows:

“GHQ differs from the rest of the studied composite measures in that it is based on an established measure [17]. Also, its validity in detecting ‘cases’ of ‘non-psychotic psychiatric disease’ has already been established [33]. Our analysis shows that there are problems in the factor structure of GHQ when used as a simple score, but they disappear when modelled over time. In other words, in cross-sectional settings it is preferable to use GHQ as a dichotomous screening instrument, while in longitudinal settings it seems to be possible to use it as a measure.”

9. Some of the English is not very fluent. Lines 304-305 should be rewritten to make clearer.

ANSWER: The text has been rewritten:

“There are problems in longitudinal cohort studies as informants grow older and develop, as culture and society differ over time and as the same items might have different meanings over time.”

10. I would have liked to have seen some discussion of whether another similar study had been done elsewhere using the same methods, but maybe different measures and/or different cohort. If what is being described here is an innovative approach this should be highlighted in the discussion.

ANSWER: As a response to your comment and to Reviewer 1 we have added a discussion about this topic:

“Placing our findings in a wider context, our analysis provides an innovative approach and could be an inspiration for both old and newer cohorts. Many of the other old public health oriented cohort studies from the early 1980s included single items about mental health symptoms, rather than clinical investigations or validated measures, at least in their first wave(s). This is the case for the Isle of Wight study [35], the 1958 British birth cohort [36], The Nord-Trøndelag Health Study (the so called HUNT-study) [37], the Tampere cohort study of school leavers [38], and the US Wisconsin Longitudinal Study [39]. However, the consistency between data collections is far lower for several of these cohorts, which means that longitudinal analyses of composite measures of mental health would be more difficult to perform. In the National Longitudinal Study of Youths from the US [40, 41] the factor structure of anxiety and depressive symptoms was analysed by CFA longitudinally as in our study but in a younger population of children from 4 to 14 years of age. Overall, we argue that our work could be useful for several of the existing old cohort studies. Also, our paper is an inspiration for newer cohorts to keep their initial questions over time.”

11. There are known limitations to using confirmatory factor analysis. It would be useful to include a short critique of the pros and cons of CFA and how this may be relevant in the paper.

ANSWER: As a response, the following text has been added in discussion of the limitations:

“A possible limitation is that, although CFA was developed to study the structure of a proposed measure, it is often criticized because of the fit indices and their vague cut-off values [42]. However, these problems are most pronounced in small datasets, and since our data consists of more than 900 respondents, we see CFA as the most appropriate method to investigate the structure of the proposed mental health measures.”

Minor Essential Revisions

12. In Table 2, 3 and 4 rather than the age value in Age column I would use T1, T2, etc. as this is how the results are described, with respect to timepoints.

ANSWER: These changes have been performed in the Tables.

13. FSS, GHQ – abbreviations should be given at their first use of the full phrase and then abbreviations only after that.

ANSWER: These corrections have been made.

14. Line 316 – strength should be strengths

ANSWER: Has been corrected.

Level of interest: An article of importance in its field

Quality of written English: Needs some language corrections before being published

ANSWER: The language has again been corrected.

Statistical review: Yes, and I have assessed the statistics in my report.

Declaration of competing interests: I declare that I have no competing interests