SADC Course in Statistics: Introduction to Non-Parametric Methods (Session 19)




Learning Objectives

At the end of this session, you will be able to

• Understand the general meaning of non-parametric methods and when they might be used

• Implement and interpret a simple non-parametric test, the sign test, and understand its advantages and limitations

• Appreciate some practical problems associated with non-parametric methods


An illustrative example

A random sample of 12 small businesses was asked, “What percentage of last year’s profit was reinvested?”

Data (%): 5.1, 6.4, 7.1, 23.6, 4.7, 14.3, 5.9, 5.5, 11.6, 17.5, 8.2, 7.7

A government official claims the real “average” is 10%.

How can this claim be tested?


Start by plotting

- A very skewed distribution

[Figure: Boxplot of % Reinvested; horizontal axis "reinvest", ticks from 5 to 25]
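The quantities behind a boxplot can be computed directly. The following is a minimal sketch in Python (the slides themselves use Stata and Excel); the variable names are illustrative, and the quartiles follow the default rule of Python's statistics module, so other packages may report slightly different values.

```python
import statistics

# Reinvestment percentages from the example (12 small businesses).
reinvest = [5.1, 6.4, 7.1, 23.6, 4.7, 14.3, 5.9, 5.5, 11.6, 17.5, 8.2, 7.7]

# Quartiles use the library's default ("exclusive") rule; this is an
# assumption of the sketch, not the rule used to draw the slide's boxplot.
q1, median, q3 = statistics.quantiles(reinvest, n=4)
summary = {
    "min": min(reinvest),
    "q1": q1,
    "median": median,
    "q3": q3,
    "max": max(reinvest),
    "mean": statistics.mean(reinvest),
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")

# A crude indicator of skewness: compare the two tails around the median.
upper_tail = summary["max"] - summary["median"]
lower_tail = summary["median"] - summary["min"]
print(f"upper tail {upper_tail:.1f} vs lower tail {lower_tail:.1f}")
```

The upper tail is several times longer than the lower tail, and the mean (about 9.8) sits well above the median (about 7.4), which is what a right-skewed sample looks like numerically.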


Addressing the question …

• A one-sample t-test is often employed in such cases, but the procedure assumes normally distributed data

• This is clearly NOT the case here, and hence the validity of the t-test procedure is questionable


Robustness of the t-test

Recall that the t-test is robust to departures from normality because of the Central Limit Theorem

We only need to worry if the sample size is quite small and/or the underlying distribution is very non-normal

Here the sample is small (n = 12) and the distribution is very skewed, so we might be concerned about applying a t-test in our example
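For comparison only, the one-sample t statistic for the claim of 10% can be computed as below. This is a sketch under the normality assumption that the slides call into question, so the result should be read with that caveat; the critical value 2.201 is the two-sided 5% point of the t distribution with 11 degrees of freedom, taken from standard tables rather than from the slides.

```python
import math
import statistics

reinvest = [5.1, 6.4, 7.1, 23.6, 4.7, 14.3, 5.9, 5.5, 11.6, 17.5, 8.2, 7.7]
claimed_mean = 10.0

n = len(reinvest)
sample_mean = statistics.mean(reinvest)
sample_sd = statistics.stdev(reinvest)          # uses the n - 1 denominator
standard_error = sample_sd / math.sqrt(n)
t_statistic = (sample_mean - claimed_mean) / standard_error

# Two-sided 5% critical value for 11 degrees of freedom (standard t tables).
T_CRIT_11_DF = 2.201

print(f"mean = {sample_mean:.2f}, sd = {sample_sd:.2f}, t = {t_statistic:.3f}")
print("reject H0 at 5%" if abs(t_statistic) > T_CRIT_11_DF else "do not reject H0 at 5%")
```

Note that this tests the mean, not the median, and that the single large value 23.6 inflates both the mean and the standard deviation.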


Two alternative approaches

• Transformations: Are the measurements approximately normally distributed on a different measurement scale, e.g. a logarithmic scale? If so, analyse the data on the transformed scale

• Non-parametric methods: Utilise a technique that does not assume a normal distribution. Such methods are often collectively referred to as non-parametric methods …
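To illustrate the transformation approach, the sketch below compares a simple skewness coefficient on the original and the natural-log scale. The helper skewness() is a hypothetical name introduced here, and the moment coefficient is only one of several skewness measures; with 12 observations it can suggest, but not establish, that the log scale is closer to normal.

```python
import math
import statistics

def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5 (0 for a symmetric sample)."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(v - mean) ** 2 for v in values])
    m3 = statistics.fmean([(v - mean) ** 3 for v in values])
    return m3 / m2 ** 1.5

reinvest = [5.1, 6.4, 7.1, 23.6, 4.7, 14.3, 5.9, 5.5, 11.6, 17.5, 8.2, 7.7]
log_reinvest = [math.log(v) for v in reinvest]

skew_raw = skewness(reinvest)
skew_log = skewness(log_reinvest)
print(f"skewness on original scale: {skew_raw:.2f}")
print(f"skewness on log scale:      {skew_log:.2f}")
```

For these data the log scale reduces the skewness but does not remove it, which is one reason to consider the non-parametric route as well.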


Non-Parametric methods

• Non-parametric methods (or tests) derive their name from the fact that no explicit distribution (e.g. normal, gamma, …) is associated with the data

• Occasionally the techniques are called distribution-free methods, but assumptions may be made, e.g. a symmetrical distribution. Hence, the name is potentially misleading

• To illustrate the above we shall now apply a simple sign test to the example


Back to the example

• Let us make no assumption about the distribution of reinvestment percentages

• Having said this, the distribution is clearly very skewed. When attempting to summarise the “average” of such a distribution the median is a natural choice

– Sample median = 7.4%

• The median is a flexible summary and so hypotheses of interest are generally phrased in terms of a population median


The sign test

Hypotheses:

H0: Population median = 10% vs.

H1: Population median ≠ 10%

Assumptions: Data values are independent. No distributional assumption is necessary

Logic: If H0 is true, then we would expect half of the observed values to fall below 10 and half above 10. How inconsistent are our data with this expectation?


Applying the sign test

• List the data in ascending order: 4.7, 5.1, …, 8.2, 11.6, …, 23.6

• If a value is < 10 assign a negative sign; if a value is > 10 assign a positive sign

• Under H0, we have a random sample of n=12 binary outcomes (– or +):

– – – – – – – – + + + +

• This gives 8 –ve and 4 +ve signs compared to the expected 6 and 6 respectively
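The sign assignment above can be reproduced in a few lines. This is a minimal Python sketch; the variable names are illustrative and not part of the course material.

```python
reinvest = [5.1, 6.4, 7.1, 23.6, 4.7, 14.3, 5.9, 5.5, 11.6, 17.5, 8.2, 7.7]
hypothesised_median = 10.0

# Values below the hypothesised median get "-", values above get "+".
n_negative = sum(1 for v in reinvest if v < hypothesised_median)
n_positive = sum(1 for v in reinvest if v > hypothesised_median)
n_used = n_negative + n_positive   # values exactly equal to 10 would not count

signs = " ".join(
    "-" if v < hypothesised_median else "+"
    for v in sorted(reinvest)
    if v != hypothesised_median
)

print(signs)
print(f"{n_negative} negative, {n_positive} positive; expected {n_used / 2:.0f} of each under H0")
```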


Applying the sign test

• How unusual is this result under H0?

• A natural test statistic is literally the number of +ve signs [the choice –ve vs. +ve is arbitrary]

• A sufficiently small or large value is evidence to reject H0

• Under H0, R = number of +ve signs follows a binomial distribution with n=12 and p=0.5

– This is a symmetric distribution

• A two-sided p-value is then Prob(R ≤ 4) + Prob(R ≥ 8) = 2 × Prob(R ≤ 4)
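The symmetry argument can be checked numerically. The sketch below tabulates the Binomial(12, 0.5) probabilities with Python's math.comb and confirms that the two tails are equal; the names are illustrative.

```python
from math import comb

n = 12

# Prob(R = r) for R ~ Binomial(n, 0.5): every outcome has probability 1 / 2**n.
pmf = [comb(n, r) / 2 ** n for r in range(n + 1)]

lower_tail = sum(pmf[:5])    # Prob(R <= 4)
upper_tail = sum(pmf[8:])    # Prob(R >= 8)
p_two_sided = lower_tail + upper_tail

print(f"Prob(R <= 4) = {lower_tail:.4f}")
print(f"Prob(R >= 8) = {upper_tail:.4f}")
print(f"two-sided p  = {p_two_sided:.4f}")
```

Both tails equal 794/4096, so adding them is the same as doubling either one.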


The p-value

• Using statistical software, e.g. Stata:

Two-sided test:

Ho: median of reinvest - 10 = 0 vs.

Ha: median of reinvest - 10 != 0

Pr(#positive >= 8 or #negative >= 8) =

min(1, 2*Binomial(n = 12, x >= 8, p = 0.5)) = 0.3877

• P-value = 0.39

• This may be calculated by using the Excel BINOMDIST worksheet function
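The same calculation can be written around a cumulative binomial function, which is what Excel's BINOMDIST returns when its last argument is TRUE. The following is a sketch only: binom_cdf is a hypothetical helper defined here, not a library function, and the min(1, …) cap mirrors the Stata output shown above.

```python
from math import comb

def binom_cdf(k, n, p):
    """Prob(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, r) * p ** r * (1 - p) ** (n - r) for r in range(k + 1))

n = 12
n_positive = 4

# Double the tail of the less frequent sign; cap at 1 because the doubled
# tail can exceed 1 when the two counts are equal (e.g. 6 and 6).
smaller_count = min(n_positive, n - n_positive)
p_value = min(1.0, 2 * binom_cdf(smaller_count, n, 0.5))

print(f"two-sided p-value = {p_value:.4f}")
```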


Conclusions

• The p-value is very large. Hence, there is no evidence to reject H0

• The estimated median reinvestment, 7.4%, is not significantly different from 10%

• There is no evidence based on this survey against the government official’s claim


Further notes

• P-value calculation

– The p-value may be approximated using the normal approximation to the binomial distribution

– Under H0, Z = (R − n/2) / (√n / 2) ~ N(0,1) approximately

– Compare Z with the tails of a N(0,1) distribution

– n > 20 will usually give a reasonable approximation
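As a rough check, the normal approximation can be applied to the example even though n = 12 is below the suggested n > 20. The sketch uses R = 4 positive signs with mean n/2 and standard deviation √n/2 under H0. The continuity-corrected version is an addition that does not appear on the slides; it is a common refinement and is included only for comparison.

```python
import math
from statistics import NormalDist

n = 12
r = 4                                   # observed number of +ve signs

# Under H0, R has mean n/2 and standard deviation sqrt(n)/2.
z = (r - n / 2) / (math.sqrt(n) / 2)
p_normal = 2 * NormalDist().cdf(-abs(z))

# Continuity correction (an assumption of this sketch, not from the slides).
z_cc = (abs(r - n / 2) - 0.5) / (math.sqrt(n) / 2)
p_normal_cc = 2 * NormalDist().cdf(-z_cc)

print(f"Z = {z:.3f}, approximate p = {p_normal:.3f}")
print(f"with continuity correction: p = {p_normal_cc:.3f}")
```

With such a small sample the uncorrected approximation (about 0.25) is noticeably below the exact value of 0.39, while the corrected version comes much closer.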


Further notes

• No signs

– If any value equals the hypothesised median of 10 then it is ignored and the sample size is reduced accordingly

• One-sided tests

– Although a two-sided example was discussed, one-sided tests are also possible
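Both notes can be folded into one small function. The sketch below is a hypothetical helper, sign_test, written for illustration: it drops values equal to the hypothesised median and supports one-sided alternatives. The alternative names "less" and "greater" are a naming choice made here, not terminology from the slides.

```python
from math import comb

def sign_test(values, m0, alternative="two-sided"):
    """Exact sign test of H0: median = m0.

    Returns (n_positive, n_used, p_value); values equal to m0 are dropped.
    """
    n_positive = sum(1 for v in values if v > m0)
    n_negative = sum(1 for v in values if v < m0)
    n_used = n_positive + n_negative
    if n_used == 0:
        raise ValueError("every value equals the hypothesised median")

    def prob_at_most(k):
        # Prob(R <= k) for R ~ Binomial(n_used, 0.5)
        return sum(comb(n_used, r) for r in range(k + 1)) / 2 ** n_used

    if alternative == "less":          # H1: median < m0 (few +ve signs)
        p_value = prob_at_most(n_positive)
    elif alternative == "greater":     # H1: median > m0 (few -ve signs)
        p_value = prob_at_most(n_negative)
    elif alternative == "two-sided":
        p_value = min(1.0, 2 * prob_at_most(min(n_positive, n_negative)))
    else:
        raise ValueError("alternative must be 'two-sided', 'less' or 'greater'")
    return n_positive, n_used, p_value

reinvest = [5.1, 6.4, 7.1, 23.6, 4.7, 14.3, 5.9, 5.5, 11.6, 17.5, 8.2, 7.7]
print(sign_test(reinvest, 10.0))                  # two-sided, as in the example
print(sign_test(reinvest, 10.0, "less"))          # one-sided: median below 10%?
print(sign_test(reinvest + [10.0], 10.0))         # a tied value is ignored
```

Adding an observation of exactly 10 leaves the result unchanged, because it carries no sign.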


Pros and cons of the sign test

Advantages

• Simple and logical

• Widely applicable

– Few assumptions

• Robust to outliers

– Recorded values are not used, only signs
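A small experiment shows what robustness to outliers means in practice. In the sketch below the largest value, 23.6, is replaced by 236 to mimic a hypothetical recording error (this altered data set is invented for illustration only). The sign counts are unchanged, whereas the sample mean changes substantially.

```python
import statistics

reinvest = [5.1, 6.4, 7.1, 23.6, 4.7, 14.3, 5.9, 5.5, 11.6, 17.5, 8.2, 7.7]

# Hypothetical recording error: 23.6 entered as 236.
with_outlier = [236.0 if v == 23.6 else v for v in reinvest]

def count_signs(values, m0):
    """Return (number of -ve signs, number of +ve signs) relative to m0."""
    negatives = sum(1 for v in values if v < m0)
    positives = sum(1 for v in values if v > m0)
    return negatives, positives

signs_original = count_signs(reinvest, 10.0)
signs_outlier = count_signs(with_outlier, 10.0)
mean_original = statistics.mean(reinvest)
mean_outlier = statistics.mean(with_outlier)

print(f"signs: {signs_original} vs {signs_outlier}")
print(f"means: {mean_original:.1f} vs {mean_outlier:.1f}")
```

Because the test statistic depends only on the signs, the sign test gives the same p-value for both data sets.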


Pros and cons of the sign test

Major Disadvantages

• Severe loss of information

– Recorded values not used, only signs

– Makes the sign test inefficient

• Confidence intervals (CIs)

– A CI for the true median can be constructed, but it is cumbersome

– Software packages tend not to present a CI for the median, instead concentrating on the p-value
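One standard construction, not described on the slides, builds the interval from order statistics: take the k-th smallest and k-th largest values, where k is the largest integer with Prob(R ≤ k − 1) ≤ α/2 for R ~ Binomial(n, 0.5). The sketch below is an illustration under that assumption; the function name is hypothetical, and because the binomial distribution is discrete the achieved coverage is usually somewhat above the nominal level.

```python
from math import comb

def median_confidence_interval(values, confidence=0.95):
    """Order-statistic CI for the median.

    Returns (lower, upper, achieved_coverage).
    """
    ordered = sorted(values)
    n = len(ordered)
    alpha = 1 - confidence

    # Find the largest k with Prob(R <= k - 1) <= alpha / 2.
    k = 0
    cumulative = 0.0
    for r in range(n // 2):
        cumulative += comb(n, r) / 2 ** n
        if cumulative > alpha / 2:
            break
        k = r + 1
    if k == 0:
        raise ValueError("sample too small for the requested confidence level")

    tail = sum(comb(n, r) for r in range(k)) / 2 ** n
    coverage = 1 - 2 * tail
    # k-th smallest and k-th largest observations (1-based k).
    return ordered[k - 1], ordered[n - k], coverage

reinvest = [5.1, 6.4, 7.1, 23.6, 4.7, 14.3, 5.9, 5.5, 11.6, 17.5, 8.2, 7.7]
lower, upper, coverage = median_confidence_interval(reinvest)
print(f"CI for the median: ({lower}, {upper}), coverage about {coverage:.3f}")
```

For the example this gives the 3rd smallest and 3rd largest values, 5.5% and 14.3%. The interval contains 10%, in line with the sign test not rejecting H0.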


Concluding remarks

• Non-parametric methods generally concentrate on hypothesis testing, and hence the p-value

• The lack of confidence intervals is a major disadvantage

• We shall return to these issues in Session 20


References

The two references below apply to both Sessions 19 and 20 and also to non-parametric methods in general.

• Conover, W.J. (1999) Practical Nonparametric Statistics, 3rd edn. Wiley, 584 pp.

• Sprent, P. (1993) Applied Nonparametric Statistical Methods, 2nd edn. Chapman and Hall, London.


Practical work follows …