Nonparametric Statistical Methods. Definition When the data is generated from process (model) that...

Nonparametric Statistical Methods


  • Slide 1

Nonparametric Statistical Methods

Slide 2
Definition
When the data is generated from a process (model) that is known except for a finite number of unknown parameters, the model is called a parametric model. Otherwise, the model is called a nonparametric model. Statistical techniques that assume a nonparametric model are called nonparametric.

Slide 3
For example
If you assume that your data has come from a normal distribution with mean μ and standard deviation σ (both unknown), then the data is generated from a process (model) that is known except for two parameters (μ and σ). The model is called a parametric model. Models that do not assume normality (or some other distribution with a finite number of parameters) are nonparametric.

Slide 4
We will consider two nonparametric tests:
1. The sign test
2. Wilcoxon's signed rank test
These are tests for the central location of a population. They are alternatives to the z-test and the t-test for the mean of a normal population.

Slide 5
Nonparametric Statistical Methods

Slide 6
Single sample nonparametric tests for central location
1. The sign test
2. Wilcoxon's signed rank test
These are tests for the central location of a population. They are alternatives to the z-test and the t-test for the mean of a normal population.

Slide 7
Both the z-test and the t-test assume the data is coming from a normal population. If the data is not coming from a normal population, properties of the z-test and the t-test that require this assumption will no longer be true. In particular, the probability of a type I error may be different from the desired value (0.05 or 0.01).

Slide 8
Single sample nonparametric tests
If the data is not coming from a normal population we should use one of the two nonparametric tests:
1. The sign test
2. Wilcoxon's signed rank test
These tests do not assume the data is coming from a normal population.

Slide 9
The sign test
A nonparametric test for the central location of a distribution

Slide 10
We want to test:
H0: median = μ0
against
HA: median ≠ μ0
(or against a one-sided alternative)

Slide 11
The Sign test:
1. The test statistic: S = the number of observations that exceed μ0
Comment: If H0: median = μ0 is true we would expect 50% of the observations to be above μ0 and 50% of the observations to be below μ0.

Slide 12
[Figure: distribution with median = μ0, 50% of the probability on each side]
If H0 is true then S will have a binomial distribution with p = 0.50, n = sample size.

Slide 13
If H0 is not true then S will still have a binomial distribution. However, p will not be equal to 0.50.
[Figure: μ0 > median, so p < 0.50]

Slide 14
[Figure: μ0 < median, so p > 0.50]
p = the probability that an observation is greater than μ0.

Slide 15
n = 10
Summarizing: If H0 is true then S will have a binomial distribution with p = 0.50, n = sample size.

Slide 16
n = 10
The critical and acceptance region:
Choose the critical region so that α is close to 0.05 or 0.01.
e.g. If the critical region is {0, 1, 9, 10} then α = .0010 + .0098 + .0098 + .0010 = .0216

Slide 17
n = 10
e.g. If the critical region is {0, 1, 2, 8, 9, 10} then α = .0010 + .0098 + .0439 + .0439 + .0098 + .0010 = .1094

Slide 18
Example
Suppose that we are interested in determining if a new drug is effective in reducing cholesterol. Hence we administer the drug to n = 10 patients with high cholesterol and measure the reduction.

Slide 19
The data (table not reproduced in the transcript)

Slide 20
Suppose we want to test
H0: the drug is not effective (median reduction ≤ 0)
against
HA: the drug is effective (median reduction > 0)
The Sign test: S = the no. of positive obs

Slide 21
The Sign test
The test statistic S = the no. of positive obs = 8
We will use the p-value approach:
p-value = P[S ≥ 8] = 0.0439 + 0.0098 + 0.0010 = 0.0547
Since p-value > 0.05 we cannot reject H0.

Slide 22
Summarizing: To carry out the Sign Test we
1. Compute S = the number of observations greater than μ0.
2. Let s_observed = the observed value of S.
3. Compute the p-value = P[S ≥ s_observed] (2 P[S ≥ s_observed] for a two-tailed test). Use the table for the binomial distribution (p = ½, n = sample size).
4. Conclude HA (reject H0) if the p-value is less than 0.05 (or 0.01).

Slide 23
Sign Test for Large Samples

Slide 24
If n is large we can use the Normal approximation to the Binomial. Namely, S has a Binomial distribution with p = ½ and n = sample size. Hence for large n, S has approximately a Normal distribution with mean n/2 and standard deviation √n / 2.

Slide 25
Hence for large n, use z = (S - n/2) / (√n / 2) as the test statistic (in place of S). Choose the critical region for z from the Standard Normal distribution, i.e. reject H0 if z < -z(α/2) or z > z(α/2) for a two-tailed test. A one-tailed test can also be set up.

Slide 26
Nonparametric Confidence Intervals

Slide 27
Assume that the data, x1, x2, x3, …, xn, is a sample from an unknown distribution. Now arrange the data x1, x2, x3, …, xn in increasing order.
Hence x(1) < x(2) < x(3) < … < x(n), where
x(1) = the smallest observation
x(2) = the 2nd smallest observation
x(n) = the largest observation

Slide 28
Consider the k-th smallest observation and the k-th largest observation in the data x1, x2, x3, …, xn, namely x(k) and x(n-k+1).
P[x(k) < median < x(n-k+1)] = P[at least k observations lie below the median and at least k observations lie above the median]
If at least k observations lie below the median then x(k) < median.
If at least k observations lie above the median then median < x(n-k+1).

Slide 29
Thus
P[x(k) < median < x(n-k+1)]
= P[at least k observations lie below the median and at least k observations lie above the median]
= P[the number of observations below the median is at least k and at most n-k]
= P[k ≤ S ≤ n-k]
where S = the number of observations below the median. S has a binomial distribution with n = the sample size and p = 1/2.

Slide 30
Hence
P[x(k) < median < x(n-k+1)] = P[k ≤ S ≤ n-k] = p(k) + p(k+1) + … + p(n-k) = P
where the p(i) are binomial probabilities with n = the sample size and p = 1/2.
This means that x(k) to x(n-k+1) is a P·100% confidence interval for the median.

Slide 31
Summarizing
x(k) to x(n-k+1) is a P·100% confidence interval for the median, where P = p(k) + p(k+1) + … + p(n-k) and the p(i) are binomial probabilities with n = the sample size and p = 1/2.

Slide 32
Example: n = 10 and k = 2
Using the binomial probabilities, P = p(2) + p(3) + p(4) + p(5) + p(6) + p(7) + p(8) = .9784
Hence x(2) to x(9) is a 97.84% confidence interval for the median.

Slide 33
Example
Suppose that we are interested in determining if a new drug is effective in reducing cholesterol. Hence we administer the drug to n = 10 patients with high cholesterol and measure the reduction.

Slide 34
The data (table not reproduced in the transcript)

Slide 35
The data arranged in order
x(2) = -3 to x(9) = 15 is a 97.84% confidence interval for the median.

Slide 36
Example
Suppose we repeat the study of the previous example with n = 20 patients with high cholesterol.

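The binomial sums used above for the sign test p-value (n = 10, S = 8) and for the coverage of the interval x(k) to x(n-k+1) can be checked with a short script. This is a minimal sketch, not part of the original slides; the function names are illustrative, and it assumes H0 gives S a Binomial(n, 1/2) distribution as stated in the slides.

```python
from math import comb


def binom_pmf(i, n, p=0.5):
    """Binomial probability p(i) = C(n, i) p^i (1 - p)^(n - i)."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)


def sign_test_upper_p(s_obs, n):
    """One-sided p-value P[S >= s_obs] when S ~ Binomial(n, 1/2)."""
    return sum(binom_pmf(i, n) for i in range(s_obs, n + 1))


def median_ci_coverage(k, n):
    """Coverage P = p(k) + ... + p(n - k) of the interval x(k) to x(n-k+1)."""
    return sum(binom_pmf(i, n) for i in range(k, n - k + 1))


def median_ci(data, k):
    """Return (x(k), x(n-k+1)), the k-th smallest and k-th largest values."""
    xs = sorted(data)
    n = len(xs)
    return xs[k - 1], xs[n - k]  # 0-based indices of x(k) and x(n-k+1)


print(round(sign_test_upper_p(8, 10), 4))   # 0.0547
print(round(median_ci_coverage(2, 10), 4))  # 0.9785
```

The exact coverage for n = 10, k = 2 is 1002/1024, which rounds to 0.9785; the slide's .9784 comes from summing table entries that were already rounded to four decimals.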
Slide 37
The data (table not reproduced in the transcript)

Slide 38
The binomial distribution with n = 20, p = 0.5
Note: p(6) + p(7) + p(8) + p(9) + p(10) + p(11) + p(12) + p(13) + p(14)
= 0.037 + 0.0739 + 0.1201 + 0.1602 + 0.1762 + 0.1602 + 0.1201 + 0.0739 + 0.037 = 0.9586
Hence x(6) to x(15) is a 95.86% confidence interval for the median reduction in cholesterol.

Slide 39
The data arranged in order
x(6) = -1 to x(15) = 9 is a 95.86% confidence interval for the median.

Slide 40
For large values of n one can use the normal approximation to the Binomial to find the value of k so that x(k) to x(n-k+1) is a 95% confidence interval for the median, i.e. we want to find k so that P[k ≤ S ≤ n-k] = 0.95.

Slide 41
(equation slide, not reproduced in the transcript)

Slide 42
Next we will consider:
1. The Wilcoxon signed rank test
The Wilcoxon signed rank test is an alternative to the Sign test, a test for the central location of a single population.

Slide 43
The sign test
A nonparametric test for the central location of a distribution

Slide 44
We want to test:
H0: median = μ0
against
HA: median ≠ μ0
(or against a one-sided alternative)

Slide 45
The Sign test:
1. The test statistic: S = the number of observations that exceed μ0
Comment: If H0: median = μ0 is true then the distribution of S is binomial with
- n = sample size,
- p = 0.50

Slide 46
To carry out the Sign test:
1. Compute the test statistic: S = the number of observations that exceed μ0 = s_observed
2. Compute the p-value of the test statistic, s_observed: p-value = P[S ≥ s_observed] (= 2 P[S ≥ s_observed] for a 2-tailed test), where S is binomial, n = sample size, p = 0.50
3. Reject H0 if the p-value is low (< 0.05)

Slide 47
Non-parametric confidence intervals for the median of a population
With P = p(k) + p(k+1) + … + p(n-k), where the p(i) are binomial probabilities with n = the sample size and p = 1/2,
x(k) to x(n-k+1) is a (1 - α)100% = P·100% confidence interval for the median, where x(k) = the k-th smallest x_i and x(n-k+1) = the k-th largest x_i.

Slide 48
The Wilcoxon Signed Rank Test
An alternative to the sign test

Slide 49
Situation
A sample of size n, (x1, x2, …, xn), from an unknown distribution, and we want to test
H0: the centre of the distribution, μ = μ0,
against
HA: μ ≠ μ0

Slide 50
For the sign test we would count S, the number of positive values of (x1 - μ0, x2 - μ0, …, xn - μ0). We would reject H0 if S was not close to n/2.

Slide 51
For Wilcoxon's signed-rank test we would assign ranks to the absolute values of (x1 - μ0, x2 - μ0, …, xn - μ0):
A rank of 1 to the value of x_i - μ0 which is smallest in absolute value.
A rank of n to the value of x_i - μ0 which is largest in absolute value.
W+ = the sum of the ranks associated with positive values of x_i - μ0.
W- = the sum of the ranks associated with negative values of x_i - μ0.

Slide 52
Note: W+ + W- = 1 + 2 + 3 + … + n = n(n + 1)/2
If H0 is true then W+ ≈ W- ≈ n(n + 1)/4
If H0 is not true then either
1. W+ will be small (W- large), or
2. W+ will be large (W- small)

Slide 53
[Figure: μ0 > true median; W+ small, W- large]

Slide 54
[Figure: μ0 < true median; W+ large, W- small]

Slide 55
Note: It is possible to work out the sampling distribution of W+ (and W-) when H0 is true.
Note: We use the fact that if H0 is true there is an equal probability (1/2) that the sign attached to any rank is plus (+) or minus (-).

Slide 56
Example: n = 4. Each of the 16 sign patterns has probability 1/16.

Signs on ranks
1 2 3 4     W+    W-
- - - -      0    10
+ - - -      1     9
- + - -      2     8
- - + -      3     7
- - - +      4     6
+ + - -      3     7
+ - + -      4     6
+ - - +      5     5
- + + -      5     5
- + - +      6     4
- - + +      7     3
+ + + -      6     4
+ + - +      7     3
+ - + +      8     2
- + + +      9     1
+ + + +     10     0

Slide 57
The distribution of W+ and W-: n = 4. W+ and W- have the same distribution:

Value    Prob
0        1/16
1        1/16
2        1/16
3        2/16
4        2/16
5        2/16
6        2/16
7        2/16
8        1/16
9        1/16
10       1/16

Slide 58
If T = W+ or W-, n = 4:
T     P[T = t]   P[T ≤ t]
0     1/16       0.0625
1     1/16       0.1250
2     1/16       0.1875
3     2/16       0.3125
4     2/16       0.4375
5     2/16       0.5625
6     2/16       0.6875
7     2/16       0.8125
8     1/16       0.8750
9     1/16       0.9375
10    1/16       1.000
These are the values found in table A.6 in the textbook.

Slide 59
Table A.6 (page A-15): Distribution of the test statistic for the Wilcoxon signed-rank test

      Sample size
T     2        3        4        5        6
1     0.5000   0.2500   0.1250   0.0625   0.0313
2              0.3750   0.1875   0.0938   0.0469
3              0.6250   0.3125   0.1563   0.0782
4                       0.4375   0.2188   0.1094
5                       0.5625   0.3125   0.1563
6                                0.4063   0.2188
7                                0.5000   0.2813
8                                         0.3438
9                                         0.4219
10                                        0.5000

Slide 60
Table A.6 (page A-15) only goes up to n = sample size = 12. For sample sizes n > 12 we can use the fact that T (W+ or W-) has approximately a normal distribution with mean n(n + 1)/4 and standard deviation √(n(n + 1)(2n + 1)/24).

Slide 61
[Figure: exact values compared with the normal approximation]

Slide 62
Example
In this example we are interested in the quantity FVC (Forced Vital Capacity) in patients with cystic fibrosis.
FVC (Forced Vital Capacity) = the volume of air that a person can expel from the lungs in a 6 sec period. This will be reduced with time for cystic fibrosis patients.
The research question: Will this reduction be less when a new experimental drug is administered?

Slide 63
The Experimental Design
The design will be a matched pair design:
Pairs of patients are matched (using initial FVC readings).
One member of the pair is given the new drug; the other member is given a placebo.
We measure the reduction in FVC for each member and compute the difference
x_i = Reduction in FVC (placebo) - Reduction in FVC (drug)
These values will be generally positive if the drug is effective in minimizing the deterioration in Forced Vital Capacity (FVC).
W+ will be large (W- will be small).

Slide 64
Table: Reduction in forced vital capacity (FVC) for a matched pair sample of patients with cystic fibrosis

          Reduction in FVC
Subject   Placebo   Drug    Difference x_i   Rank   Signed rank
1         224       213     11               1      1
2         80        95      -15              2      -2
3         75        33      42               3      3
4         541       440     101              4      4
5         74        -32     106              5      5
6         85        -28     113              6      6
7         293       445     -152             7      -7
8         -23       -178    155              8      8
9         525       367     158              9      9
10        -38       140     -178             10     -10
11        508       323     185              11     11
12        255       10      245              12     12
13        525       65      460              13     13
14        1023      343     680              14     14

W+ = 86, W- = 19

Slide 65
We have to judge if W+ = 86 is large (or W- = 19 is small).
Since the p-value is small (< 0.05) we conclude the drug is effective in reducing the deterioration of FVC.

Slide 66
Summarizing: To carry out Wilcoxon's signed rank test we
1. Compute T = W+ or W- (usually it would be the smaller of the two).
2. Let t_observed = the observed value of T.
3. Compute the p-value = P[T ≤ t_observed] (2 P[T ≤ t_observed] for a two-tailed test).
   i. For n ≤ 12 use the table.
   ii. For n > 12 use the Normal approximation.
4. Conclude HA (reject H0) if the p-value is less than 0.05 (or 0.01).

Slide 67
Alternative tests for this example
1. The t test
2. The sign test

Slide 68
Comments
1. The t test
   i. This test requires the assumption of normality.
   ii. If the data is not normally distributed the test is invalid: the probability of a type I error may not be equal to its desired value (0.05 or 0.01).
   iii. If the data is normally distributed, the t-test commits type II errors with a smaller probability than any other test (in particular Wilcoxon's signed rank test or the sign test).
2. The sign test
   i. This test does not require the assumption of normality (true also for Wilcoxon's signed rank test).
   ii. This test ignores the magnitude of the observations completely. Wilcoxon's test takes the magnitude into account by ranking them.

Slide 69
Nonparametric Statistical Methods

Slide 70
Single sample nonparametric tests for central location
1. The sign test
2. Wilcoxon's signed rank test
These are tests for the central location of a population.
They are alternatives to the z-test and the t-test for the mean of a normal population.

Slide 71
The Sign test

Slide 72
Summarizing: To carry out the Sign Test we
1. Compute S = the number of observations greater than μ0.
2. Let s_observed = the observed value of S.
3. Compute the p-value = P[S ≥ s_observed] (2 P[S ≥ s_observed] for a two-tailed test). Use the table for the binomial distribution (p = ½, n = sample size).
4. Conclude HA (reject H0) if the p-value is less than 0.05 (or 0.01).

Slide 73
[Figure: μ0 = true median; S ≈ n/2]

Slide 74
[Figure: μ0 > true median; S small (close to 0)]

Slide 75
[Figure: μ0 < true median; S large (close to n)]

Slide 76
Wilcoxon's signed-rank test

Slide 77
For Wilcoxon's signed-rank test we would assign ranks to the absolute values of (x1 - μ0, x2 - μ0, …, xn - μ0):
A rank of 1 to the value of x_i - μ0 which is smallest in absolute value.
A rank of n to the value of x_i - μ0 which is largest in absolute value.
W+ = the sum of the ranks associated with positive values of x_i - μ0.
W- = the sum of the ranks associated with negative values of x_i - μ0.

Slide 78
Note: W+ + W- = 1 + 2 + 3 + … + n = n(n + 1)/2
If H0 is true then W+ ≈ W- ≈ n(n + 1)/4
If H0 is not true then either
1. W+ will be small (W- large), or
2. W+ will be large (W- small)

Slide 79
[Figure: μ0 = true median; W+ ≈ W- ≈ n(n + 1)/4]

Slide 80
[Figure: μ0 > true median; W+ small, W- large]

Slide 81
[Figure: μ0 > true median; W+ small, W- large]

Slide 82
[Figure: μ0 < true median; W+ large, W- small]

Slide 83
Summarizing: To carry out Wilcoxon's signed rank test we
1. Compute T = W+ or W- (usually it would be the smaller of the two).
2. Let t_observed = the observed value of T.
3. Compute the p-value = P[T ≤ t_observed] (2 P[T ≤ t_observed] for a two-tailed test).
   i. For n ≤ 12 use the table.
   ii. For n > 12 use the Normal approximation.
4. Conclude HA (reject H0) if the p-value is less than 0.05 (or 0.01).

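The null distribution of W+ described above (each rank independently carries a plus or minus sign with probability 1/2) can be reproduced by brute-force enumeration. The sketch below is illustrative only and is not from the slides: the function names are hypothetical, it assumes no zero or tied differences, and it reuses the placebo and drug readings of the FVC example to recompute W+ and W-.

```python
from fractions import Fraction
from itertools import product


def signed_rank_null_distribution(n):
    """Return {w: P[W+ = w]} under H0 by enumerating all 2**n sign patterns."""
    counts = {}
    for signs in product((0, 1), repeat=n):  # 1 means the rank has a + sign
        w_plus = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        counts[w_plus] = counts.get(w_plus, 0) + 1
    total = 2**n
    return {w: Fraction(c, total) for w, c in sorted(counts.items())}


def signed_rank_statistics(diffs):
    """Return (W+, W-) for differences x_i - mu0 (assumes no zeros or ties)."""
    ordered = sorted(diffs, key=abs)  # rank 1 = smallest absolute value
    w_plus = sum(r for r, d in enumerate(ordered, start=1) if d > 0)
    w_minus = sum(r for r, d in enumerate(ordered, start=1) if d < 0)
    return w_plus, w_minus


# n = 4: cumulative probability P[T <= 5], as in the small table above
dist4 = signed_rank_null_distribution(4)
print(float(sum(p for w, p in dist4.items() if w <= 5)))  # 0.5625

# FVC example: differences = reduction (placebo) - reduction (drug)
placebo = [224, 80, 75, 541, 74, 85, 293, -23, 525, -38, 508, 255, 525, 1023]
drug = [213, 95, 33, 440, -32, -28, 445, -178, 367, 140, 323, 10, 65, 343]
diffs = [p - d for p, d in zip(placebo, drug)]
print(signed_rank_statistics(diffs))  # (86, 19)

# Exact one-sided p-value P[W- <= 19] for n = 14 (16384 sign patterns)
dist14 = signed_rank_null_distribution(14)
p_value = sum(p for w, p in dist14.items() if w <= 19)
print(float(p_value))
```

Enumeration is only practical for small n; for larger samples the normal approximation mentioned in the summary is the usual route.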
Slide 84
Two-sample Non-parametric tests

Slide 85
Mann-Whitney Test
A non-parametric two sample test for comparison of central location

Slide 86
The Mann-Whitney Test
This is a non-parametric alternative to the two sample t test (or z test) for independent samples. These tests (t and z) assume the data is normal. The Mann-Whitney test does not make this assumption.
Sample of n from population 1: x1, x2, x3, …, xn
Sample of m from population 2: y1, y2, y3, …, ym

Slide 87
The Mann-Whitney test statistic U1 counts the number of times an observation in sample 1 precedes an observation in sample 2.
An equivalent statistic U2, which counts the number of times an observation in sample 2 precedes an observation in sample 1, can also be computed.

Slide 88
Example
n = m = 4 measurements of bacteria counts per unit volume were made for two types of cultures.
The n = 4 measurements for culture 1 were 27, 31, 26, 25.
The m = 4 measurements for culture 2 were 32, 29, 35, 28.

Slide 89
To compute the Mann-Whitney test statistics U1 and U2, arrange the observations from the two samples combined in increasing order (retaining sample membership):
25 (1), 26 (1), 27 (1), 28 (2), 29 (2), 31 (1), 32 (2), 35 (2)
For each observation in sample 2 let u_i denote the number of observations in sample 1 that precede that value:
u1 = 3, u2 = 3, u3 = 4, u4 = 4
Then U1 = u1 + u2 + u3 + u4 = 3 + 3 + 4 + 4 = 14

Slide 90
To compute U2, repeat the process for the second sample:
25 (1), 26 (1), 27 (1), 28 (2), 29 (2), 31 (1), 32 (2), 35 (2)
For each observation in sample 1 let v_i denote the number of observations in sample 2 that precede that value:
v1 = 0, v2 = 0, v3 = 0, v4 = 2
Then U2 = v1 + v2 + v3 + v4 = 0 + 0 + 0 + 2 = 2

Slide 91
Note: U1 + U2 = mn = 16. This is true in general.
For each pair (x_i, y_j) either x_i < y_j or x_i > y_j (assume no ties). In one case U1 will be increased by 1, while in the other case U2 will be increased by 1. There are mn such pairs.

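The pair-counting definition of U1 and U2 can be written directly as a double loop over all mn pairs. This is a minimal sketch using the bacteria counts from the example; the function name is illustrative, and ties are assumed absent, as in the slides.

```python
def mann_whitney_u(sample1, sample2):
    """Return (U1, U2) by counting pairs (x, y); assumes no tied values.

    U1 = number of pairs where the sample-1 value precedes the sample-2 value.
    U2 = number of pairs where the sample-2 value precedes the sample-1 value.
    """
    u1 = sum(1 for x in sample1 for y in sample2 if x < y)
    u2 = sum(1 for x in sample1 for y in sample2 if y < x)
    return u1, u2


culture1 = [27, 31, 26, 25]
culture2 = [32, 29, 35, 28]

u1, u2 = mann_whitney_u(culture1, culture2)
print(u1, u2)                                    # 14 2
print(u1 + u2 == len(culture1) * len(culture2))  # True
```

Counting pairs costs m·n comparisons, which is fine for small samples; a rank-based computation gives the same values with one sort.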
Slide 92
An alternative way of computing the Mann-Whitney test statistic U
Arrange the observations from the two samples combined in increasing order (retaining sample membership) and assign ranks to the observations.

Value (sample)   25 (1)   26 (1)   27 (1)   28 (2)   29 (2)   31 (1)   32 (2)   35 (2)
Rank             1        2        3        4        5        6        7        8

Let W1 = the sum of the ranks for sample 1 = 1 + 2 + 3 + 6 = 12
Let W2 = the sum of the ranks for sample 2 = 4 + 5 + 7 + 8 = 24

Slide 93
It can be shown that
U1 = nm + n(n + 1)/2 - W1
and
U2 = nm + m(m + 1)/2 - W2
Note: (equation not reproduced in the transcript)

Slide 94
The distribution function of U (U1 or U2) has been tabled for various values of n and m.

The Mann-Whitney test for large samples
For large samples (n > 10 and m > 10) the statistics U1 and U2 have approximately a Normal distribution with mean nm/2 and standard deviation √(nm(n + m + 1)/12).

Slide 99
Thus we can convert U_i to a standard normal statistic
z = (U_i - nm/2) / √(nm(n + m + 1)/12)
and reject H0 if z < -z(α/2) or z > z(α/2) (for a two-tailed test).

Slide 100
The Kruskal-Wallis Test
Comparing the central location for k populations
A nonparametric alternative to the one-way ANOVA F-test

Slide 101
Situation: Data is collected from k populations. The sample size from population i is n_i. The data from population i is: x_i1, x_i2, …, x_in_i

Slide 102
The computation of the Kruskal-Wallis statistic
We group the N = n_1 + n_2 + … + n_k observations from the k populations together and rank these observations from 1 to N. Let r_ij be the rank associated with the observation x_ij.
Handling of tied observations: If a group of observations are equal, the ranks that would have been assigned to those observations are averaged.

Slide 103
The computation of the Kruskal-Wallis statistic
Let R_i = r_i1 + r_i2 + … + r_in_i
Note: If the k populations do not differ in central location the … (remainder not reproduced in the transcript)

Slide 104
The Kruskal-Wallis statistic
H = [12 / (N(N + 1))] × (R_1²/n_1 + R_2²/n_2 + … + R_k²/n_k) - 3(N + 1)
where R_i = the sum of the ranks for the i-th sample

Slide 105
The Kruskal-Wallis test
Reject H0: the k populations have the same central location, if H exceeds the upper α critical value of the chi-square distribution with k - 1 degrees of freedom.

Slide 106
Example
In this example we are measuring an enzyme level in three groups of patients who have received open heart surgery.
The three groups of patients differ in age:
1. Age 30-45
2. Age 46-60
3. Age 61+

Slide 107
The data (table not reproduced in the transcript)

Slide 108
Computation of the Kruskal-Wallis statistic
[Tables: the raw data and the data ranked; not reproduced in the transcript]

Slide 109
The Kruskal-Wallis statistic (value not reproduced in the transcript)
H0 is rejected. There are significant differences in the central enzyme levels between the three age groups.
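The ranking and rank-sum steps of the Kruskal-Wallis computation can be sketched as follows. The enzyme data are not available in this transcript, so the sketch reuses the two bacteria-count samples from the Mann-Whitney example as a k = 2 illustration. It assumes the common textbook form H = 12/(N(N+1)) · Σ R_i²/n_i - 3(N+1), which may differ in notation from the lost slide equation; the function names are illustrative and no tie correction factor is applied.

```python
def midranks(values):
    """Ranks 1..N of the values, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def kruskal_wallis(groups):
    """Return (H, rank_sums) for a list of samples, one list per population."""
    pooled = [x for g in groups for x in g]
    n_total = len(pooled)
    ranks = midranks(pooled)
    rank_sums, start = [], 0
    for g in groups:
        rank_sums.append(sum(ranks[start:start + len(g)]))
        start += len(g)
    total = sum(r * r / len(g) for r, g in zip(rank_sums, groups))
    h = 12.0 * total / (n_total * (n_total + 1)) - 3 * (n_total + 1)
    return h, rank_sums


# Bacteria counts from the Mann-Whitney example, treated as k = 2 groups
h, rank_sums = kruskal_wallis([[27, 31, 26, 25], [32, 29, 35, 28]])
print(rank_sums)  # [12.0, 24.0]
print(h)          # 3.0
```

The resulting H would then be compared with a chi-square critical value on k - 1 degrees of freedom; with samples this small the chi-square approximation is rough, so the number is illustrative only.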