Page 1

Nonparametrics.Zip (a compressed version of nonparametrics)

Tom Hettmansperger
Department of Statistics, Penn State University

References:
1. Higgins (2004), Introduction to Modern Nonparametric Statistics.
2. Hollander and Wolfe (1999), Nonparametric Statistical Methods.
3. Arnold Notes.
4. Johnson, Morrell, and Schick (1992), Two-Sample Nonparametric Estimation and Confidence Intervals Under Truncation, Biometrics, 48, 1043-1056.
5. Beers, Flynn, and Gebhardt (1990), Measures of Location and Scale for Velocities in Clusters of Galaxies - A Robust Approach, Astronomical Journal, 100, 32-46.
6. Website: http://www.stat.wmich.edu/slab/RGLM/

Page 2

Robustness and a little philosophy

Robustness: Insensitivity to assumptions

a. Structural Robustness of Statistical Procedures

b. Statistical Robustness of Statistical Procedures

Page 3

Structural Robustness (developed in the 1970s)

Influence: How does a statistical procedure respond to a single outlying observation as it moves farther from the center of the data?

Want: Methods with bounded influence.

Breakdown Point: What proportion of the data must be contaminated in order to move the procedure beyond any bound?

Want: Methods with positive breakdown point.

Beers et al. (1990) is an excellent reference.
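A quick numerical illustration of these two ideas (a minimal sketch with simulated data, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=20)

# Influence: drag a single added observation farther and farther out.
# The mean follows it without bound; the median barely moves.
for outlier in (5.0, 50.0, 500.0):
    y = np.append(x, outlier)
    print(outlier, round(y.mean(), 2), round(np.median(y), 2))

# Breakdown: contaminate 45% of the sample.  The median is still sensible
# (its breakdown point is 50%), while the mean was already ruined above
# by a single bad value.
y = x.copy()
y[:9] = 1e6
print(round(np.median(y), 2), round(y.mean(), 2))
```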

Page 4

Statistical Robustness (developed in the 1960s)

Hypothesis Tests:

• Level Robustness, in which the significance level is not sensitive to model assumptions.

• Power Robustness, in which the statistical power of a test to detect important alternative hypotheses is not sensitive to model assumptions.

Estimators:

• Variance Robustness, in which the variance (precision) of an estimator is not sensitive to model assumptions.

"Not sensitive to model assumptions" means that the property remains good throughout a neighborhood of the assumed model.

Page 5

Examples

1. The sample mean is not structurally robust and is not variance robust.

2. The sample median is structurally robust and is variance robust.

3. The t-test is level robust (asymptotically) but is neither structurally robust nor power robust.

4. The sign test is structurally and statistically robust. Caveat: it is not very powerful at the normal model.

5. Trimmed means are structurally and variance robust.

6. The sample variance is neither structurally robust nor variance robust.

7. The interquartile range is structurally robust.

Page 6

Recall that the sample mean minimizes

$\sum_i (x_i - \theta)^2.$

Replace the quadratic by $\rho(x)$, which does not increase like a quadratic, and then minimize

$\sum_i \rho(x_i - \theta).$

The result is an M-Estimator, which is structurally robust and variance robust.

See Beers et al. (1990).
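A minimal sketch of such an M-estimate of location, using the Huber ρ with the conventional tuning constant k = 1.345 and a MAD-based scale (these choices are assumptions, not taken from the slides):

```python
import numpy as np

def huber_location(x, k=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location: minimizes sum rho((x_i - theta)/s),
    with rho quadratic near zero and linear in the tails."""
    x = np.asarray(x, dtype=float)
    s = 1.4826 * np.median(np.abs(x - np.median(x)))   # MAD-based robust scale
    theta = np.median(x)                               # robust starting value
    for _ in range(max_iter):
        u = (x - theta) / s
        w = np.minimum(1.0, k / np.maximum(np.abs(u), 1e-12))  # Huber weights
        new_theta = np.sum(w * x) / np.sum(w)
        if abs(new_theta - theta) < tol:
            break
        theta = new_theta
    return theta

data = np.array([-5.1, -6.7, -7.0, -7.2, -7.3, -7.4, -7.5, -7.6, 20.0])
print(np.mean(data), np.median(data), huber_location(data))
```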

Page 7

The nonparametric tests described here are often called distribution free because their significance levels do not depend on the underlying model assumption. Hence, they are level robust.

They are also power robust and structurally robust.

The estimators that are associated with the tests are structurally and variance robust.

Opinion: The importance of nonparametric methods resides in their robustness, not in the distribution-free property (level robustness) of nonparametric tests.

Page 8

Single Sample Methods

Page 9

• Robust Data Summaries

• Graphical Displays

• Inference: Confidence Intervals and Hypothesis Tests

Location, Spread, Shape

CI-Boxplots (notched boxplots)

Histograms, dotplots, kernel density estimates.

Page 10

Absolute Magnitude, Planetary Nebulae

Milky Way Abs Mag (n = 81): -5.140 -6.700 -6.970 -7.190 -7.273 -7.365 -7.509 -7.633 -7.741 -8.000 -8.079 -8.359 -8.478 -8.558 -8.662 -8.730 -8.759 -8.825 …

Dotplot of Abs Mag

Page 11

Summary for Abs Mag

Anderson-Darling Normality Test: A-Squared = 0.30, P-Value = 0.567
Mean = -10.324, StDev = 1.804, Variance = 3.253
Skewness = 0.305015, Kurtosis = -0.048362, N = 81
Minimum = -14.205, 1st Quartile = -11.564, Median = -10.557, 3rd Quartile = -9.144, Maximum = -5.140
85% Confidence Interval for Mean: (-10.615, -10.032)
85% Confidence Interval for Median: (-10.699, -10.208)
85% Confidence Interval for StDev: (1.622, 2.039)

Page 12

Probability Plot of Abs Mag (Normal - 95% CI)

Mean = -10.32, StDev = 1.804, N = 81, AD = 0.303, P-Value = 0.567

Page 13

But don't be too quick to "accept" normality:

Probability Plot of Abs Mag (3-Parameter Weibull - 95% CI)

Shape = 2.680, Scale = 5.027, Threshold = -14.79, N = 81, AD = 0.224, P-Value > 0.500

Page 14

Histogram of Abs Mag with 3-Parameter Weibull fit

Shape = 2.680, Scale = 5.027, Threshold = -14.79, N = 81

Page 15

Weibull Distribution:

$f(x) = \frac{c}{b}\left(\frac{x - t}{b}\right)^{c-1} \exp\left\{-\left(\frac{x - t}{b}\right)^{c}\right\}$ for $x > t$, and $0$ otherwise,

where $t$ = threshold, $b$ = scale, $c$ = shape.
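For reference, a small sketch of how the fitted 3-parameter Weibull from the previous slides can be evaluated with scipy; `weibull_min` uses `loc` for the threshold, and the numbers below are the slide's fitted values rather than a new fit:

```python
import numpy as np
from scipy.stats import weibull_min

# Slide's fitted values for Abs Mag: shape c, threshold t, scale b.
c, t, b = 2.680, -14.79, 5.027

dist = weibull_min(c, loc=t, scale=b)   # scipy's loc plays the role of the threshold

xs = np.array([-13.0, -11.0, -9.0, -7.0])
print(dist.pdf(xs))                     # density f(x) at a few magnitudes (x > t)

# With the raw sample in hand, the three parameters could be re-estimated by
# maximum likelihood:
# c_hat, t_hat, b_hat = weibull_min.fit(abs_mag)
```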

Page 16

Null hypothesis: the population distribution $F(x)$ is normal.

The Kolmogorov-Smirnov statistic:

$D = \max_x |F_n(x) - F(x)|$

The Anderson-Darling statistic:

$AD = n \int [F_n(x) - F(x)]^2 \, [F(x)(1 - F(x))]^{-1} \, dF(x)$

Here $F_n$ is the empirical distribution function of the sample.
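Both statistics are available in scipy; a minimal sketch with simulated data standing in for Abs Mag:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=-10.3, scale=1.8, size=81)   # stand-in for the Abs Mag sample

# Kolmogorov-Smirnov: empirical CDF vs. a normal CDF with estimated parameters.
# (With estimated parameters the tabled p-value is only approximate;
#  Lilliefors' correction would be needed for an exact level.)
D, p_ks = stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1)))

# Anderson-Darling test for normality (scipy reports critical values, not a p-value).
ad = stats.anderson(x, dist='norm')

print(D, p_ks)
print(ad.statistic, ad.critical_values, ad.significance_level)
```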

Page 17

Boxplot of Abs Mag (with 95% CI)

Labeled features: outlier, whisker, 3rd quartile, median, 1st quartile; the 95% confidence interval for the median is shown in red.

Page 18

Anatomy of a 95% CI-Boxplot

• Box formed by the quartiles and the median.
• IQR (interquartile range) = Q3 - Q1.
• Whiskers extend from the ends of the box to the farthest points within 1.5×IQR. For a normal benchmark distribution, IQR = 1.348×StDev and 1.5×IQR = 2×StDev, so outliers beyond the whiskers are more than 2.7 StDevs from the median. For a normal distribution this should happen about 0.7% of the time.
• Pseudo StDev = .75×IQR.

Page 19

Estimation of scale or variation. Recall that the sample standard deviation is not robust.

From the boxplot we can find the interquartile range.

Define a pseudo standard deviation by .75IQR. This is a consistent estimate of the population standard deviation in a normal population. It is robust.

The MAD is even more robust. Define the MAD by Med|x - med(x)|. Further define another pseudo standard deviation by 1.5MAD. Again, this is calibrated to a normal population.
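A minimal numerical check of these calibrations on simulated normal data (not the slide's data):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=-10.3, scale=1.8, size=81)    # true sigma = 1.8

def robust_scales(v):
    q1, q3 = np.percentile(v, [25, 75])
    mad = np.median(np.abs(v - np.median(v)))
    # 0.75*IQR and 1.5*MAD approximate sigma for normal data (1.4826*MAD is exact).
    return np.std(v, ddof=1), 0.75 * (q3 - q1), 1.5 * mad

print(robust_scales(x))

# Flip the sign of the largest observation (like entering 5.14 instead of -5.14
# on the next slide): the StDev jumps, the two pseudo StDevs barely move.
x[np.argmax(x)] = -x[np.argmax(x)]
print(robust_scales(x))
```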

Page 20

Suppose we enter 5.14 rather than -5.14 in the data set. The table shows the effect on the non-robust StDev and on the robust IQR and MAD.

           No Outlier   Outlier   Breakdown
StDev      1.80         2.43      1 obs
.75IQR     1.82         1.82      25%
1.5MAD     1.78         1.78      50%

Page 21

The confidence interval and hypothesis test

A population is located at $d_0$ if the population median is $d_0$.

Sample $X_1, \ldots, X_n$ from the population. Say $X_1, \ldots, X_n$ is located at $d$ if $X_1 - d, \ldots, X_n - d$ is located at 0.

$S(d) = S(X_1 - d, \ldots, X_n - d)$ is a statistic useful for location analysis if $E_d(S(d)) = 0$ when the population is located at $d$.
Page 22

Sign Statistic:

$S(d) = \sum_i \mathrm{sgn}(X_i - d) = \#(X_i > d) - \#(X_i < d)$

Write $S^+(d) = \#(X_i > d)$, so that $S(d) = 2S^+(d) - n$.

Estimate $d$ from the data; note $E_d(S(d)) = 0$.

Find $\hat d$ so that $S(\hat d) = 0$ (equivalently $S^+(\hat d) = n/2$).

Solution: $\hat d = \mathrm{median}(X_i)$.
Page 23

HYPOTHESIS TEST of $H_0: d = d_0$ vs. $H_A: d \ne d_0$.

Rule: reject $H_0$ if $|S(d_0)| = |2S^+(d_0) - n| \ge c$,

where $\alpha = P(|2S^+(d_0) - n| \ge c)$.

Under $H_0: d = d_0$, $S^+(d_0)$ is distributed Binomial$(n, \tfrac{1}{2})$.

Distribution Free.
Page 24

CONFIDENCE INTERVAL: $d$ is the population location.

Choose $k$ so that $P_d(k < S^+(d) < n - k) = 1 - \alpha$, i.e. $P(S^+(d) \le k) = \alpha/2$.

The endpoints are order statistics: the lower endpoint is $X_{(k+1)}$, the smallest $d$ for which $\#(X_i > d) \le n - k - 1$; likewise the upper endpoint, $X_{(n-k)}$, is the largest $d$ for which $\#(X_i < d) \le n - k - 1$.

Then $[X_{(k+1)}, X_{(n-k)}]$ is a $(1 - \alpha)100\%$ confidence interval for $d$.

Distribution Free.
Page 25

SUMMARY:

$X_1, \ldots, X_n$ a sample from a population located at $d$.

SIGN STATISTIC: $S(d) = \#(X_i > d) - \#(X_i < d)$; write $S^+(d) = \#(X_i > d)$, so $S(d) = 2S^+(d) - n$.

ESTIMATE: $\hat d$ with $S(\hat d) = 0$; $\hat d = \mathrm{median}(X_i)$.

TEST of $H_0: d = d_0$ vs. $H_A: d \ne d_0$, i.e. $H_0: P(X > d_0) = 1/2$ vs. $H_A: P(X > d_0) \ne 1/2$: reject $H_0$ if $S^+(d_0) \le k$ or $S^+(d_0) \ge n - k$, where $P(S^+(d_0) \le k) = \alpha/2$ and $S^+(d_0)$ is Binomial$(n, 1/2)$.

CONFIDENCE INTERVAL: if $P(S^+(d) \le k) = \alpha/2$, then $[X_{(k+1)}, X_{(n-k)}]$ has confidence coefficient $(1 - \alpha)100\%$.
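The whole summary in a short sketch (simulated data; the binomial critical value and order-statistic interval follow the recipe above):

```python
import numpy as np
from scipy.stats import binom

def sign_inference(x, d0, alpha=0.05):
    """Sign-test p-value for H0: median = d0, the median estimate,
    and the distribution-free order-statistic confidence interval."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)

    d_hat = np.median(x)                       # solves S(d) = 0

    s_plus = int(np.sum(x > d0))               # S+(d0) ~ Binomial(n, 1/2) under H0
    p_value = min(1.0, 2 * min(binom.cdf(s_plus, n, 0.5),
                               binom.sf(s_plus - 1, n, 0.5)))

    # Largest k with P(S+ <= k) <= alpha/2, so coverage is at least (1 - alpha).
    k = int(binom.ppf(alpha / 2, n, 0.5))
    if binom.cdf(k, n, 0.5) > alpha / 2:
        k -= 1
    ci = (x[k], x[n - k - 1])                  # [X_(k+1), X_(n-k)]
    coverage = 1 - 2 * binom.cdf(k, n, 0.5)
    return d_hat, p_value, ci, coverage

rng = np.random.default_rng(4)
print(sign_inference(rng.normal(-10.3, 1.8, size=81), d0=-10.0))
```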

Page 26

Boxplot of Abs Mag (with 95% CI)

Q1      Median   SE Med   Q3      IQR
-11.5   -10.7    0.18     -9.14   2.42

Page 27

Additional Remarks:

The median is a robust measure of location. It is not affected by outliers. It is efficient when the population has heavier tails than a normal population.

The sign test is also robust and insensitive to outliers. It is efficient when the tails are heavier than those of a normal population.

Similarly for the confidence interval.

In addition, the test and the confidence interval are distribution free and do not depend on the shape of the underlying population to determine critical values or confidence coefficients.

They are only 64% efficient relative to the mean and t-test when the population is normal.

If the population is symmetric then the Wilcoxon Signed Rank statistic can be used; it is robust against outliers and 95% efficient relative to the t-test.

Page 28

Two-Sample Methods

Page 29

Two-Sample Comparisons

85% CI-Boxplots

Mann-Whitney-Wilcoxon Rank Sum Statistic

• Estimate of the difference in locations
• Test of the difference in locations
• Confidence interval for the difference in locations

Levene's Rank Statistic for differences in scale or variance.

Page 30

85% CI-Boxplots of MW and M-31

Page 31

Boxplot of App Mag, M-31

Page 32

Dotplot of App Mag, M-31

Page 33

Summary for App Mag, M-31

Anderson-Darling Normality Test: A-Squared = 1.79, P-Value < 0.005
Mean = 14.458, StDev = 1.195, Variance = 1.427
Skewness = -0.396822, Kurtosis = 0.366104, N = 360
Minimum = 10.749, 1st Quartile = 13.849, Median = 14.540, 3rd Quartile = 15.338, Maximum = 18.052
85% Confidence Interval for Mean: (14.367, 14.549)
85% Confidence Interval for Median: (14.453, 14.610)
85% Confidence Interval for StDev: (1.134, 1.263)

Page 34

Summary for App Mag (low outliers removed)

Anderson-Darling Normality Test: A-Squared = 1.01, P-Value = 0.012
Mean = 14.522, StDev = 1.115, Variance = 1.243
Skewness = -0.172496, Kurtosis = 0.057368, N = 353
Minimum = 11.685, 1st Quartile = 13.887, Median = 14.550, 3rd Quartile = 15.356, Maximum = 18.052
85% Confidence Interval for Mean: (14.436, 14.607)
85% Confidence Interval for Median: (14.483, 14.639)
85% Confidence Interval for StDev: (1.058, 1.179)

Page 35

Probability Plot of App Mag (Normal - 95% CI)

Mean = 14.46, StDev = 1.195, N = 360, AD = 1.794, P-Value < 0.005

Page 36

Why 85% Confidence Intervals?

We have the following test of

$H_0: d_1 - d_2 = 0$ vs. $H_A: d_1 - d_2 \ne 0$.

Rule: reject the null hypothesis if the 85% confidence intervals do not overlap.

The significance level is close to 5% provided the ratio of sample sizes is less than 3.

Page 37

$X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$, with the Xs from population F and the Ys from population G, and $d = d_Y - d_X$.

$U(d) = \sum_{i,j} \mathrm{sgn}(Y_j - X_i - d) = \#(Y_j - X_i > d) - \#(Y_j - X_i < d)$

Mann-Whitney-Wilcoxon Statistic: the sign statistic on the pairwise differences.

Unlike the sign test (64% efficiency for a normal population), the MWW test has 95.5% efficiency for a normal population. And it is robust against outliers in either sample.

Page 38

SUMMARY:

MWW STATISTIC: $U(d) = \#(Y_j - X_i > d) - \#(Y_j - X_i < d)$; write $U^+(d) = \#(Y_j - X_i > d)$, so $U(d) = 2U^+(d) - mn$.

ESTIMATE: $\hat d$ with $U(\hat d) = 0$; $\hat d = \mathrm{median}(Y_j - X_i)$.

TEST of $H_0: d = 0$ vs. $H_A: d \ne 0$, i.e. $H_0: P(Y > X) = 1/2$ vs. $H_A: P(Y > X) \ne 1/2$: reject $H_0$ if $U^+(0) \le k$ or $U^+(0) \ge mn - k$, where $P(U^+(0) \le k) = \alpha/2$ and $U^+(0)$ has a tabled distribution.

CONFIDENCE INTERVAL: if $P(U^+(d) \le k) = \alpha/2$, then $[D_{(k+1)}, D_{(mn-k)}]$ has confidence coefficient $(1 - \alpha)100\%$, where $D_{(1)}, \ldots, D_{(mn)}$ are the ordered pairwise differences $Y_j - X_i$.
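A parallel sketch for two samples (simulated data; scipy's `mannwhitneyu` supplies the test, and the estimate and interval come from the ordered pairwise differences, with a large-sample normal approximation standing in for the tabled critical value):

```python
import numpy as np
from scipy.stats import mannwhitneyu, norm

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=30)           # sample from F
y = rng.normal(0.8, 1.0, size=25)           # sample from G, shifted by d = 0.8

# Test of H0: d = 0 (two-sided)
U, p = mannwhitneyu(y, x, alternative='two-sided')
print(U, p)

# Hodges-Lehmann estimate: median of the m*n pairwise differences Y_j - X_i
diffs = np.sort((y[:, None] - x[None, :]).ravel())
d_hat = np.median(diffs)

# Approximate 95% CI from the ordered differences, using the normal
# approximation to the null distribution of U+.
m, n = len(x), len(y)
mu, sd = m * n / 2.0, np.sqrt(m * n * (m + n + 1) / 12.0)
k = int(np.floor(mu - norm.ppf(0.975) * sd))
ci = (diffs[k], diffs[m * n - k - 1])       # [D_(k+1), D_(mn-k)]
print(d_hat, ci)
```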

Page 39

Mann-Whitney Test and CI: App Mag, Abs Mag

                 N     Median
App Mag (M-31)   360   14.540
Abs Mag (MW)     81    -10.557

Point estimate for d is 24.900

95.0 Percent CI for d is (24.530, 25.256)

W = 94140.0

Test of d = 0 vs d not equal 0 is significant at 0.0000

What is W?

Page 40

$U^+ = \#(Y_j > X_i)$

$W = U^+ + \frac{n(n+1)}{2} = \sum_{j=1}^{n} R_j$, where $R_1, \ldots, R_n$ are the ranks of $Y_1, \ldots, Y_n$ in the combined data.

$\bar R_Y - \bar R_X = \left(\frac{1}{n} + \frac{1}{m}\right)\left(U^+ - \frac{mn}{2}\right)$

Hence the MWW statistic can be written as the difference in average ranks, rather than $\bar Y - \bar X$ as in the t-test.

Page 41

Dotplot of MW and M-31 (each symbol represents up to 2 observations)

What about spread or scale differences between the two populations?

Below we shift the MW observations to the right by 24.9 to line up with M-31.

Variable   StDev   IQR     PseudoStdev
MW         1.804   2.420   1.815
M-31       1.195   1.489   1.117

Page 42

Levene’s Rank Test

Compute |Y – Med(Y)| and |X – Med(X)|, called absolute deviations.

Apply MWW to the absolute deviations. (Rank the absolute deviations)

The test rejects equal spreads in the two populations when the difference in average ranks of the absolute deviations is too large.

Idea: After we have centered the data, if the null hypothesis of no difference in spreads is true, all permutations of the combined data are roughly equally likely. (Permutation Principle)

So randomly select a large set of permutations, say B permutations. Assign the first n to the Y sample and the remaining m to the X sample and compute MWW on the absolute deviations.

The approximate p-value is #(permutation MWW > original MWW) divided by B.
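A minimal sketch of this permutation version of Levene's rank test (simulated data; a two-sided p-value is used here, whereas the slide describes the one-sided count):

```python
import numpy as np
from scipy.stats import rankdata

def levene_rank_perm(x, y, B=2000, rng=None):
    """Permutation Levene rank test for a difference in spread.
    Statistic: difference in mean ranks of the absolute deviations
    from each sample's median."""
    rng = np.random.default_rng() if rng is None else rng
    dx = np.abs(x - np.median(x))
    dy = np.abs(y - np.median(y))
    combined = np.concatenate([dy, dx])
    n = len(dy)

    def stat(vals):
        r = rankdata(vals)
        return r[:n].mean() - r[n:].mean()   # first n play the role of the Y sample

    observed = stat(combined)
    perm = np.array([stat(rng.permutation(combined)) for _ in range(B)])
    p_value = np.mean(np.abs(perm) >= abs(observed))   # two-sided version
    return observed, p_value

rng = np.random.default_rng(6)
x = rng.normal(0, 1.8, size=81)      # wider spread (like MW)
y = rng.normal(0, 1.2, size=360)     # narrower spread (like M-31)
print(levene_rank_perm(x, y, rng=rng))
```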

Page 43

Difference of rank means of the absolute deviations: 51.9793

Histogram of levenerk (1000 permutation values, normal fit): Mean = 0.1644, StDev = 16.22, N = 1000

So we easily reject the null hypothesis of no difference in spreads and conclude that the two populations have significantly different spreads.

Page 44

k-Sample Methods

One Sample Methods

Two Sample Methods

Page 45

Variable     Mean    StDev   Median  .75IQR   Skew    Kurtosis
Messier 31   22.685  0.969   23.028  1.069    -0.67   -0.67
Messier 81   24.298  0.274   24.371  0.336    -0.49   -0.68
NGC 3379     26.139  0.267   26.230  0.317    -0.64   -0.48
NGC 4494     26.654  0.225   26.659  0.252    -0.36   -0.55
NGC 4382     26.905  0.201   26.974  0.208    -1.06   1.08

All one-sample and two-sample methods can be applied one at a time or two at a time. Plots, summaries, inferences.

We begin k-sample methods by asking if the location differences between the NGC nebulae are statistically significant.

We will briefly discuss issues of truncation.

Page 46

85% CI-Boxplots of Planetary Nebula Luminosities: NGC 4382, NGC 4494, NGC 3379, M-81, M-31

Page 47

Extending MWW to several samples:

Given the total sample size $N = n_1 + n_2 + n_3$, rank the combined data. With $\bar R_1, \bar R_2, \bar R_3$ the average ranks of the three samples, construct

$KW = \frac{12}{N(N+1)}\left\{ n_1\left(\bar R_1 - \frac{N+1}{2}\right)^2 + n_2\left(\bar R_2 - \frac{N+1}{2}\right)^2 + n_3\left(\bar R_3 - \frac{N+1}{2}\right)^2 \right\}$

Equivalently, in terms of pairwise differences of average ranks,

$KW = \frac{12}{N^2(N+1)}\left\{ n_1 n_2 (\bar R_2 - \bar R_1)^2 + n_1 n_3 (\bar R_3 - \bar R_1)^2 + n_2 n_3 (\bar R_3 - \bar R_2)^2 \right\}$

Generally, use a chi-square distribution with $k - 1$ degrees of freedom ($k$ = number of samples) as the approximate sampling distribution for KW.
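In scipy this is a single call; the sketch below uses simulated stand-ins for the three NGC samples and also recomputes KW from the definition:

```python
import numpy as np
from scipy.stats import kruskal, rankdata

rng = np.random.default_rng(7)
g1 = rng.normal(26.2, 0.3, size=45)    # stand-in for NGC 3379
g2 = rng.normal(26.7, 0.2, size=101)   # stand-in for NGC 4494
g3 = rng.normal(27.0, 0.2, size=59)    # stand-in for NGC 4382

H, p = kruskal(g1, g2, g3)             # chi-square approximation, k - 1 = 2 df
print(H, p)

# The same statistic from the definition (no ties):
ranks = rankdata(np.concatenate([g1, g2, g3]))
N = len(ranks)
kw, start = 0.0, 0
for g in (g1, g2, g3):
    r_bar = ranks[start:start + len(g)].mean()
    kw += len(g) * (r_bar - (N + 1) / 2) ** 2
    start += len(g)
kw *= 12 / (N * (N + 1))
print(kw)
```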

Page 48

Kruskal-Wallis Test on NGC

sub       N     Median   Ave Rank      Z
1         45    26.23        29.6   -9.39
2         101   26.66       104.5    0.36
3         59    26.97       156.4    8.19
Overall   205               103.0

KW = 116.70  DF = 2  P = 0.000

This test can be followed by multiple comparisons.

For example, if we assign a family error rate of .09, then we would conduct 3 MWW tests, each at a level of .03. (Bonferroni)

Page 49

85% CI-Boxplots of NGC 3379, NGC 4494, and NGC 4382

Page 50

What to do about truncation.

1. See a statistician

2. Read the Johnson, Morrell, and Schick (1992) reference, and then see a statistician.

Here is the problem: Suppose we want to estimate the difference in locations between two populations: $F(x)$ and $G(y) = F(y - d)$.

But (with right truncation at $a$) the observations come from

$F_a(x) = \frac{F(x)}{F(a)}$ for $x \le a$, and $1$ otherwise;

$G_a(y) = \frac{F(y - d)}{F(a - d)}$ for $y \le a$, and $1$ otherwise.

Suppose $d > 0$, so we want to shift the X-sample to the right toward the truncation point. As we shift the Xs, some will pass the truncation point and will be eliminated from the data set. This changes the sample sizes and requires adjustment when computing the corresponding MWW to see if it is equal to its expectation. See the reference for details.

Page 51

Computation of shift estimate with truncation

d m n W E(W)

25.3 88 59 5.10 4750.5 4366.0

28.3 84 59 3.60 4533.5 4248.0

30.3 83 59 2.10 4372.0 4218.5

32.3 81 59 0.80 4224.5 4159.5

33.3 81 59 -0.20 4144.5 4159.5

33.1 81 59 -0.00 4161.5 4159.5

Comparison of NGC4382 and NGC 4494

Data multiplied by 100 and 2600 subtracted. Truncation point taken as 120.

Point estimate for d is 25.30; W = 6595.5

m = 101 and n = 59
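A rough sketch of my reading of this procedure (details are in Johnson, Morrell, and Schick 1992): shift the X sample by a candidate d, drop shifted values beyond the truncation point, and compare the Y rank sum W with its null expectation n(m + n + 1)/2, as in the table above; the estimate is the d where the two agree. Simulated data, illustrative only:

```python
import numpy as np
from scipy.stats import rankdata

def w_minus_expectation(x, y, d, trunc):
    """Shift the X sample by d, drop shifted values beyond the truncation
    point, and return (W - E(W), adjusted m, W) for the Y rank sum."""
    x_shifted = x + d
    x_kept = x_shifted[x_shifted <= trunc]
    m, n = len(x_kept), len(y)
    ranks = rankdata(np.concatenate([y, x_kept]))
    W = ranks[:n].sum()
    return W - n * (m + n + 1) / 2.0, m, W

rng = np.random.default_rng(8)
trunc = 1.2
x = rng.normal(0.0, 1.0, size=150)
x = x[x <= trunc]                            # truncated X sample
y = rng.normal(0.6, 1.0, size=100)
y = y[y <= trunc]                            # truncated Y sample (true shift 0.6)

# Scan candidate shifts; the estimate of d is where W crosses its expectation.
for d in np.arange(0.0, 1.01, 0.1):
    diff, m, W = w_minus_expectation(x, y, d, trunc)
    print(round(float(d), 2), m, round(W, 1), round(diff, 1))
```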

Page 52

Robust regression fitting and correlation (association)

Dataset (http://astrostatistics.psu.edu/datasets/HIP_star.html)

We have extracted a sample of 50 from the subset of 2719 Hipparcos stars.

Vmag = Visual band magnitude. This is an inverted logarithmic measure of brightness.

Plx = Parallactic angle (mas = milliarcseconds). 1000/Plx gives the distance in parsecs (pc).

B-V = Color of star (mag).

The HR diagram plots logL vs. B-V, where (roughly) the log-luminosity in units of solar luminosity is

logL = (15 - Vmag - 5 log Plx)/2.5.

All logs are base 10.
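In code the transformation is one line (the example values are illustrative, not from the dataset):

```python
import numpy as np

def log_luminosity(vmag, plx_mas):
    """logL (roughly, in solar units) from visual magnitude and parallax in mas."""
    return (15.0 - vmag - 5.0 * np.log10(plx_mas)) / 2.5

# Example: Vmag = 8.5, Plx = 25 mas (distance 1000/25 = 40 pc)
print(log_luminosity(8.5, 25.0))
```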

Page 53

Row LogL BV

1 0.69233 0.593

2 1.75525 0.935

3 -0.30744 0.830

4 -0.17328 0.685

5 0.57038 0.529

6 -1.04471 1.297

7 0.51396 0.510

8 0.52149 0.607

9 -1.06306 1.288

10 0.41990 0.677

11 -0.76152 0.950

12 -1.10608 1.260

13 0.42593 0.651

14 -0.44066 0.909

15 -0.90039 1.569

16 -0.74118 1.065

17 -0.66820 1.049

18 -0.26810 0.884

19 0.56722 0.480

20 -0.93809 0.490

21 -0.38095 1.160

22 -0.19267 0.810

23 0.54619 0.498

24 0.20161 0.614

25 0.37348 0.538

26 -0.38556 0.879

27 -0.22978 0.723

28 0.57671 0.455

29 -1.00092 1.110

30 -0.00215 0.637

31 -0.95768 1.616

32 0.10378 0.606

33 -1.43872 1.365

34 1.23674 0.395

35 0.10866 0.630

36 -1.60621 *

37 0.06468 0.599

38 -0.18214 0.709

39 0.37988 0.561

40 1.23793 0.257

41 -0.16896 0.864

42 -0.59331 0.955

43 1.78028 1.010

44 -0.63099 1.100

45 0.61900 0.664

46 -0.28520 0.706

47 -0.71404 0.898

48 0.35061 0.616

49 0.55002 0.466

50 0.37922 0.548

Page 54

Fitted Line Plot of LogL vs. BV

Least squares line (blue): LogL = 1.253 - 1.605 BV

Resistant line (black): LogL = 1.513 - 2.067 BV

Page 55

The resistant line is robust and not affected by the outliers. It follows the bulk of the data much better than the non-robust least squares regression line.

There are various ways to find a robust or resistant line. The most typical is to use the ideas of M-estimation and minimize

$\sum_i \rho(y_i - a - b x_i),$

where $\rho(x)$ does not increase as fast as a quadratic.
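One concrete way to do this is Huber M-estimation fitted by iteratively reweighted least squares; the sketch below is not the resistant-line algorithm used for the plot above (that is a median-based fit), so the coefficients will differ somewhat:

```python
import numpy as np

def huber_line(x, y, k=1.345, n_iter=50):
    """Fit y ~ a + b*x by iteratively reweighted least squares with Huber
    weights, approximating 'minimize sum rho(residual)' for a non-quadratic rho."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]                    # LS start
    for _ in range(n_iter):
        resid = y - X @ beta
        s = 1.4826 * np.median(np.abs(resid - np.median(resid)))   # robust scale
        u = resid / max(s, 1e-12)
        w = np.minimum(1.0, k / np.maximum(np.abs(u), 1e-12))      # Huber weights
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta                                                    # (intercept, slope)

# Synthetic data shaped like the HR-diagram example: a linear bulk plus a few
# high-luminosity outliers that drag the least squares line around.
rng = np.random.default_rng(9)
bv = rng.uniform(0.3, 1.6, size=50)
logl = 1.5 - 2.0 * bv + rng.normal(0, 0.2, size=50)
logl[:3] += 2.5
print(np.polyfit(bv, logl, 1))      # least squares (slope, intercept)
print(huber_line(bv, logl))         # robust fit (intercept, slope)
```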

The strength of the relationship between variables is generally measured by correlation.

Next we look briefly at non-robust and robust measures of correlation or association.

Page 56

Pearson product moment correlation is not robust.

Spearman’s rank correlation coefficient is simply the Pearson coefficient with the data replaced by their ranks.

Spearman's coefficient measures association, or the tendency of the two measurements to increase or decrease together. Pearson's measures the degree to which the measurements cluster along a straight line.

Page 57

For the example:

Pearson r = -0.673

Spearman r_s = -0.743

Significance tests:

Pearson r: refer z to a standard normal distribution, where

$z = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = -6.28$

Spearman r_s: refer z to a standard normal distribution, where

$z = \sqrt{n-1}\; r_s = -5.20$

Page 58

Kendall's tau coefficient is defined as

$\hat\tau = \frac{P - [n(n-1)/2 - P]}{n(n-1)/2} = \frac{4P}{n(n-1)} - 1,$

where P is the number of concordant pairs out of the n(n-1)/2 total pairs.

For example (1, 3) and (2, 7) are concordant since 2>1 and 7>3. Note that Kendall’s tau estimates the probability of concordance minus the probability of discordance in the bivariate population.

Page 59

For the example:

Kendall's tau = -0.63095

Significance test: refer z to a standard normal distribution, where

$z = \frac{3\,\hat\tau\,\sqrt{n(n-1)}}{\sqrt{2(2n+5)}} = -6.47$
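All three coefficients, and z statistics of the form quoted above, can be computed in a few lines; the arrays below are simulated stand-ins for the LogL and B-V columns:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

rng = np.random.default_rng(10)
bv = rng.uniform(0.3, 1.6, size=50)                    # stand-in for B-V
logl = 1.5 - 2.0 * bv + rng.normal(0, 0.3, size=50)    # stand-in for LogL

n = len(bv)
r, _ = pearsonr(bv, logl)
rs, _ = spearmanr(bv, logl)
tau, _ = kendalltau(bv, logl)

# Normal-approximation z statistics, as on the slides.
z_pearson = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
z_spearman = np.sqrt(n - 1) * rs
z_kendall = 3 * tau * np.sqrt(n * (n - 1)) / np.sqrt(2 * (2 * n + 5))

print(r, rs, tau)
print(z_pearson, z_spearman, z_kendall)
```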

Page 60

What more can we do?

1. Multiple regression

2. Analysis of designed experiments (AOV)

3. Analysis of covariance

4. Multivariate analysis

These analyses can be carried out using the website:

http://www.stat.wmich.edu/slab/RGLM/

Page 61

Professor Lundquist, in a seminar on compulsive thinkers, illustrates his brainstapling technique.

Page 62

The End