The joys of consulting

March 2007

No two consulting problems are ever the same. This story, related by Chris Chatfield, shows how what appears to be a straightforward problem can become a tricky exercise when non-standard data include a few extreme values. And it shows how good statistics can turn a failure into a success.


Most consultancies start on the phone or by personal contact. This one started in church. I knew Mary (it is not her real name) had been writing up a dissertation for membership of her professional body, but that morning she looked rather upset. I enquired what the matter was. It transpired that her dissertation had been rejected, primarily for statistical reasons. Naturally, I offered to help.

The problem

The first step for the statistical consultant is to find out exactly what the problem is. The following is a description of what I found.

Mary is a pathologist. Pathology departments in hospitals routinely analyse large numbers of blood samples. Ideally, these specimens should be analysed straight away to avoid any deterioration in the content but, for operational reasons, this is not always possible. Therefore, a study was carried out to assess the effect of:

A. keeping a specimen in a refrigerator overnight;
B. keeping a specimen frozen for 1 month.

The experiments and analysis of the results were the subjects of Mary's dissertation.

For experiment A, nine samples were collected on a particular day at two local clinics, and each sample was divided in two. One half of each sample was analysed straight away; the other was analysed in the morning after being refrigerated overnight. Although several measurements were taken, we confine attention to measurements of the concentration of one particular hormone whose acronym is NT-proBNP. The results, in units of micromoles per litre (μmol l⁻¹), are shown in Table 1, exactly as they were given to me.

For experiment B, 19 samples were selected subjectively by the pathologist from the many samples collected over a 1-week period, in order to get a good range of values from 'small' to 'large'. These samples were also divided so that one half could be analysed immediately, and the other half frozen, to be analysed 1 month later. The results are shown in Table 2, again exactly as they were given to me.

The pathologist gave the results to a statistician who carried out paired-sample t-tests and reported non-significant results for both sets of data. The pathologist was slightly surprised by this outcome, as she was expecting the reverse result, but naturally trusted the statistician's findings. Was she right to do so?

Before attempting to answer this question, I sought more information. Understanding the background context is vital. Further questioning revealed that readings under ~15–20 μmol l⁻¹ (depending on age) were considered "normal", while high values could indicate health problems. The pathologist also said that a reduction of more than ~5% in any reading could be clinically important, especially when it resulted in a reading going into the "normal" range.

Initial data analysis

Having clarified the problem, I began, as always, by looking at the data and carrying out what I call an "initial data analysis" [1]. Tables 1 and 2 are presented, as they came, in text files from the pathologist, and are horrid. Little can be seen until the decimal points are lined up. The reader will also notice that the computer producing the data files has cut off any final zeros. This is still a surprisingly common occurrence and can also be very misleading.

I began by tidying up the data, lining up the decimal points, and adding final zeros, so that data within a given experiment are listed to the same number of decimal places (in the first experiment, readings are given to one decimal place, whereas, in the second experiment, it appears that all readings have been given to four significant figures, except where this gives more than two decimal places). I also calculated numerical differences and percentage differences, as well as the averages for each column and the standard deviations for the two types of differences. My results are shown in Tables 3 and 4.
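The tidying-up arithmetic is simple enough to sketch in a few lines of Python. The values below are transcribed from Table 1; the variable names are mine, and the rounding to one decimal place matches how Table 3 is presented:

```python
# Experiment A readings, transcribed from Table 1
fresh = [7.4, 8.9, 11.1, 12.3, 21.8, 39.6, 43.0, 46.6, 279.6]
refrigerated = [6.3, 7.9, 10.3, 11.3, 20.7, 36.8, 38.7, 44.2, 254.8]

# Numerical and percentage differences, as tabulated in Table 3
diffs = [round(a - b, 1) for b, a in zip(fresh, refrigerated)]
pct_diffs = [round(100 * (a - b) / b, 1) for b, a in zip(fresh, refrigerated)]

print(diffs)      # the extreme value -24.8 stands out
print(pct_diffs)  # the percentage differences are far less skewed
```

The same two lines applied to Table 2 reproduce the difference columns of Table 4.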

It is now evident that the readings have a (very) skewed distribution and that differences are also severely skewed. In contrast, the percentage differences look much more normally distributed, as one might expect. Note that the results in Table 1 were already given in ascending order of magnitude of the "before" value (why?), whereas those in Table 2 were not. It may be worth reordering the data in Table 4 to see the shape of the distribution even more clearly.

Table 1. Results for experiment A

Sample number   NT-proBNP, fresh sample   NT-proBNP, refrigerated sample
101             7.4                       6.3
102             8.9                       7.9
103             11.1                      10.3
104             12.3                      11.3
105             21.8                      20.7
106             39.6                      36.8
107             43                        38.7
108             46.6                      44.2
109             279.6                     254.8

Table 2. Results for experiment B

Sample number   NT-proBNP, before freezing   NT-proBNP, after freezing
1               7.19                         6.55
2               60.64                        60.85
3               20.56                        19.92
4               4076                         3954
5               14.69                        13.74
6               23.65                        23.65
7               15.37                        15.2
8               11.71                        11.88
9               36.48                        36.7
10              211.1                        211.1
11              413.7                        422.1
12              24.5                         23.53
13              18.4                         18
14              96.23                        93.01
15              15.61                        15.3
16              6.78                         6.47
17              141.7                        140.6
18              187.7                        185.3
19              29.57                        29.17


Note also that all the summary statistics have been given to two decimal places, even when the data were recorded to one decimal place.

In experiment A, all the differences had a negative sign, suggesting a significant reduction. However, in experiment B, the picture was more mixed, with 13 negative differences, two unchanged and four positive differences.

Further analysis

A little detective work showed that the statistical tests previously conducted on the data were paired-sample tests on the numerical differences, as is the usual textbook approach. For experiment A, the average difference was –4.37. However, the standard deviation of the differences is as high as 7.75, leading to a t-value of –1.69 on 8 DF (degrees of freedom), for which p = 0.13. This suggests that the average difference is not significantly different from 0. For experiment B, the average difference is higher, at –6.55, but the standard deviation is also larger, at 28.05, leading to a smaller t-value of –1.02. This is nowhere near significant. These results agree with those computed by the pathologist's statistician. Thus, the analysis has been done correctly. But is it appropriate?

Laying out the tables properly was not just an exercise in neatness. It showed that the ordinary differences were clearly not normally distributed, which nullifies one of the standard assumptions required for the t-test. The distributions were skewed, and the outlying observations (the difference of –24.8 in Table 3, and the whopping –122.00 in Table 4, two orders of magnitude greater than any of the others) had inflated the standard deviations in each case. This meant that Mary's statistician's paired t-test had not been appropriate at all.
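The two t-values can be checked from the tables with nothing beyond the Python standard library. This is a minimal sketch, not the statistician's actual code; the helper name `paired_t` is mine:

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """t statistic for a paired-sample t-test on the differences."""
    d = [a - b for b, a in zip(before, after)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Experiment A (Table 1)
fresh = [7.4, 8.9, 11.1, 12.3, 21.8, 39.6, 43.0, 46.6, 279.6]
fridge = [6.3, 7.9, 10.3, 11.3, 20.7, 36.8, 38.7, 44.2, 254.8]
t_A = paired_t(fresh, fridge)        # about -1.69 on 8 df, p = 0.13

# Experiment B (Table 2)
before_B = [7.19, 60.64, 20.56, 4076.0, 14.69, 23.65, 15.37, 11.71, 36.48,
            211.1, 413.7, 24.5, 18.4, 96.23, 15.61, 6.78, 141.7, 187.7, 29.57]
after_B = [6.55, 60.85, 19.92, 3954.0, 13.74, 23.65, 15.2, 11.88, 36.7,
           211.1, 422.1, 23.53, 18.0, 93.01, 15.3, 6.47, 140.6, 185.3, 29.17]
t_B = paired_t(before_B, after_B)    # about -1.02: nowhere near significant
```

Note how the single outlying differences (–24.8 and –122) drive the standard deviations, and hence the non-significance, in each case.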

There are at least three possible alternatives. One approach is to carry out a non-parametric test on the numerical differences. The simplest such test is a sign test, and I suspect that is what the reader did visually when looking at Tables 3 and 4. As all the differences for experiment A have a negative sign, the sign test will certainly produce a significant result whereas, for experiment B, the result looks less clear cut. The reader will therefore wish to carry out a more powerful non-parametric test, and the Wilcoxon test gives a significant result for both experiments, A and B.
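The sign test is easy to carry out exactly. The sketch below is my own implementation (using the common convention of discarding zero differences); it gives p ≈ 0.004 for experiment A and a borderline p ≈ 0.049 for experiment B, consistent with the "less clear cut" impression above:

```python
from math import comb

def sign_test_p(diffs):
    """Exact two-sided sign test; zero differences are discarded."""
    pos = sum(d > 0 for d in diffs)
    neg = sum(d < 0 for d in diffs)
    n, k = pos + neg, min(pos, neg)
    # probability of k or fewer minority signs out of n, doubled for two sides
    return min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)

# Experiment A: all nine differences negative
p_A = sign_test_p([-1.1, -1.0, -0.8, -1.0, -1.1, -2.8, -4.3, -2.4, -24.8])
# Experiment B: 13 negative, 4 positive and 2 zero differences
p_B = sign_test_p([-0.64, 0.21, -0.64, -122.0, -0.95, 0.0, -0.17, 0.17,
                   0.22, 0.0, 8.4, -0.97, -0.4, -3.22, -0.31, -0.31,
                   -1.1, -2.4, -0.4])
```

The more powerful Wilcoxon signed-rank test (for instance `scipy.stats.wilcoxon`) sharpens the experiment B result, as reported above.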

Table 3. Revised results for experiment A

Before After Difference Percentage difference

7.4 6.3 –1.1 –14.9

8.9 7.9 –1.0 –11.2

11.1 10.3 –0.8 –7.2

12.3 11.3 –1.0 –8.1

21.8 20.7 –1.1 –5.0

39.6 36.8 –2.8 –7.1

43.0 38.7 –4.3 –10.0

46.6 44.2 –2.4 –5.2

279.6 254.8 –24.8 –8.9

Average 52.26 47.89 –4.37 –8.62

Standard deviation 7.75 3.11

Table 4. Revised results for experiment B

Before After Difference Percentage difference

7.19 6.55 –0.64 –8.90

60.64 60.85 0.21 0.35

20.56 19.92 –0.64 –3.11

4076.00 3954.00 –122.00 –2.99

14.69 13.74 –0.95 –6.47

23.65 23.65 0.00 0.00

15.37 15.20 –0.17 –1.11

11.71 11.88 0.17 1.45

36.48 36.70 0.22 0.60

211.10 211.10 0.00 0.00

413.70 422.10 8.40 2.03

24.50 23.53 –0.97 –3.96

18.40 18.00 –0.40 –2.17

96.23 93.01 –3.22 –3.35

15.61 15.30 –0.31 –1.99

6.78 6.47 –0.31 –4.57

141.70 140.60 –1.10 –0.78

187.70 185.30 –2.40 –1.28

29.57 29.17 –0.40 –1.35

Average 284.82 278.27 –6.55 –1.98

Standard deviation 28.05 2.73


A second possibility is to Winsorise the data. Winsorisation is not widely used, but it can be useful. It deals with extreme observations, not by throwing them away but, in the simplest case, by replacing them with the next most extreme value. I replaced the largest difference in each sample with the next largest. This leads to a significant result for both experiments when using a standard paired-sample t-test on the differences. For experiment A, it is noteworthy that, even though all differences have the same sign and the Winsorised sample gives a highly significant result, making the largest observation even more extreme (which might be expected to add to the evidence against the null hypothesis) actually reduces the level of significance. I had not expected this. I knew that significance levels for normal tests were inaccurate when data were skewed, but I did not realise that making an observation more extreme could actually send the significance level in the "wrong" direction.
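Both effects can be seen in a short sketch. The code below applies the simplest Winsorisation to the experiment A differences and then, to illustrate the curiosity, makes the outlier more extreme instead; the value –50 is an invented illustration, not data:

```python
import math
from statistics import mean, stdev

def t_stat(d):
    """t statistic for testing that the mean difference is zero."""
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Differences for experiment A (from Table 3)
d_A = [-1.1, -1.0, -0.8, -1.0, -1.1, -2.8, -4.3, -2.4, -24.8]

# Simplest Winsorisation: replace the most extreme difference (-24.8)
# with the next most extreme one (-4.3)
w_A = sorted(d_A, key=abs)
w_A[-1] = w_A[-2]
t_w = t_stat(w_A)            # about -4.38 on 8 df: highly significant

# The curiosity: making the outlier MORE extreme weakens the evidence,
# because it inflates the standard deviation faster than the mean
d_more = d_A[:-1] + [-50.0]  # -50 is an illustrative value, not real data
t_more = t_stat(d_more)      # further from significance than the original -1.69
```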

A third possible way of handling our data is to analyse percentage differences, which have already been tabulated in Tables 3 and 4. This is implicitly suggested by the contextual information that the reduction thought to be clinically important is expressed as a percentage difference (5%) rather than as a numerical difference. Moreover, the percentage differences look much closer to following a normal distribution, which does allow us to carry out a paired-sample t-test on them. It seems reasonable that very high values are likely to decrease more than low values, and with such a wide range of individual values it seems sensible to look at percentage differences. It is, perhaps, rather surprising that textbooks rarely suggest this possibility. Equally strangely, when later I set the same data as a problem for my final-year undergraduate students, several of them thought that they were not allowed to analyse percentage differences. Perhaps they thought it would be breaking some unwritten statistical edict; perhaps it was culture shock at being faced for the first time with real messy data.

For experiment A, we find that the mean percentage difference of –8.62, with s = 3.11, leads to t = –8.32, which is highly significant. For experiment B, the mean percentage difference of –1.98, with s = 2.73, leads to t = –3.16, which is also highly significant. The Wilcoxon test also gives highly significant results on the percentage differences in each case.
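These figures can be reproduced directly from the percentage columns of Tables 3 and 4; working from the rounded table values gives t ≈ –8.31 rather than –8.32, the small discrepancy being rounding. A minimal sketch (helper name mine):

```python
import math
from statistics import mean, stdev

def t_stat(d):
    """t statistic for testing that the mean is zero."""
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Percentage differences from Table 3 (experiment A) and Table 4 (experiment B)
pct_A = [-14.9, -11.2, -7.2, -8.1, -5.0, -7.1, -10.0, -5.2, -8.9]
pct_B = [-8.90, 0.35, -3.11, -2.99, -6.47, 0.00, -1.11, 1.45, 0.60, 0.00,
         2.03, -3.96, -2.17, -3.35, -1.99, -4.57, -0.78, -1.28, -1.35]

t_A = t_stat(pct_A)   # about -8.3: highly significant
t_B = t_stat(pct_B)   # about -3.16: also highly significant
```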

For experiment A, these results are intuitively obvious, as all the differences have the same sign. For experiment B, the results are less obvious in view of the zero and positive differences. Moreover, in the latter case, the mean percentage difference is less than 2%, which is below the figure of 5% thought to be clinically important. Thus, it is arguable that there is no point in doing a significance test for experiment B at all, given that the results are not clinically important, so that rejection of the null hypothesis is of little interest.

Further questions

When carrying out a consultancy such as this, it is inevitable that further questions arise during the course of the investigation. For example:

• Did the same person carry out all the analyses?

• Is there any reason to think that the samples used for experiment A (on one day in two particular clinics) are not random?

• Exactly how were samples for experiment B selected?

• How high is the measurement error expected to be?

• Exactly what is the hormone involved, and what are the medical consequences if a person has a high value?

• How likely is it that samples would need to be kept (A) overnight in a refrigerator, or (B) frozen for a month?

In the absence of further information, it sounds as though the data for experiment A can be treated as if they were random although, strictly speaking, they are not. The randomness of the samples used in experiment B seems more suspect, which is why it would be helpful to find out more about the selection process. However, although the fresh sample value was known when the samples were selected, the value after freezing was not, so it is not unreasonable to treat these observations as if they were random as well. Of course, as the average percentage reduction is less than 2%, it probably does not matter too much, as there is really no need to carry out significance tests at all.

Communicating the results

After completing my analysis, I arranged another meeting with Mary and went through the results. In summary, I said that her statistician had carried out inappropriate tests on skewed data using a parametric t-test that requires a normality assumption. Instead, a non-parametric approach, or a parametric analysis of Winsorised differences, yielded different results. Even better, an analysis of percentage differences showed that the refrigerated samples did give significantly lower values (down 8.6%) than the "control" samples, and that the differences were clinically important. On the other hand, the frozen samples gave values that were less than 2% lower than the control samples. While this reduction is statistically significant, it is not thought to be clinically important.

Mary rewrote her dissertation and resubmitted it. I am delighted to say that it was passed!

Reference

1. Chatfield, C. (1995) Problem Solving, 2nd edn. Boca Raton: Chapman and Hall/CRC.

Chris Chatfield is Reader in Statistics at Bath University, an Honorary Fellow of the International Institute of Forecasters and the author of five textbooks. His interests include all forms of practical problem-solving as well as time-series forecasting (not to mention golf, now that he is working part-time!).
