STAT 3022 SLIDES UMN CHAPTER3
-
7/29/2019 STAT 3022 SLIDES UMN CHAPTER3
Introduction Robustness Resistance Transformation Outlier
Chapter 3
A Closer Look at Assumptions
STAT 3022, School of Statistics, University of Minnesota
Spring 2013
-
Introduction
In Chapter 2, we discussed the mechanics of using t-procedures to perform statistical inference, namely t-tests and confidence intervals.
We base these procedures on certain assumptions:
we have random samples, representative of the populations
the data come from Normal populations
the samples are drawn independently
in pooled two-sample settings, the populations have equal variances ($\sigma_1 = \sigma_2 = \sigma$)
In practice, these assumptions are usually not strictly met. When are these procedures still appropriate?
-
Case Study: Making it Rain
Data were collected in southern Florida between 1968 and 1972 to test the hypothesis that massive injection of silver iodide (AgI) into cumulus clouds can lead to increased rainfall.
This process is called cloud seeding. On each of 52 days, researchers either seeded a target cloud or left it unseeded (as a control). The treatment was randomly assigned.
Researchers were blind to the treatment: pilots flew through the cloud every day, whether treatment or control, and a mechanism in the plane either seeded the cloud or left it unseeded.
Question: Did cloud seeding have an effect on rainfall? If so, how much?
-
Graphical Summaries
library("Sleuth2")
boxplot(Rainfall ~ Treatment, ylab='Rainfall (acre-feet)', data=case0301)
[Figure: side-by-side box plots of Rainfall (acre-feet) for the Unseeded and Seeded groups; vertical axis from 0 to 2500.]
-
Graphical Summaries
par(mfrow=c(2,1), mar=c(4,4,1,0.5))
hist(case0301$Rainfall[case0301$Treatment=="Seeded"], breaks=10,
     main="Seeded - Rainfall", xlim=c(0,3000), col="gray", xlab="")
hist(case0301$Rainfall[case0301$Treatment=="Unseeded"], breaks=8,
     main="Unseeded - Rainfall", xlim=c(0,3000), col="gray", xlab="")
[Figure: two stacked histograms of Rainfall, titled "Seeded - Rainfall" (top) and "Unseeded - Rainfall" (bottom); horizontal axis from 0 to 3000, vertical axis Frequency.]
-
Numerical Summaries and Interpretations
Numerical Summaries: Do it yourself (follow the R code on page 42 of the Chapter 2 slides).
Graphical and numerical summaries indicate that rainfall tended to be greater on seeded days. However, there are problems with our necessary assumptions:
both distributions are very skewed
both distributions have outliers
variability is much greater in the seeded group than in the unseeded group
Can we use our usual t-tools to analyze these data? How?
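One possible way to produce per-group numerical summaries is sketched below. The helper `group_summary` and the toy data frame are hypothetical (they are not from the slides or the Chapter 2 code), and the toy values are not the cloud-seeding measurements; the commented lines at the end assume the Sleuth2 package is installed.

```r
# Hypothetical helper (not from the slides): per-group n, mean, median, SD.
group_summary <- function(values, groups) {
  stats <- function(x) c(n = length(x), mean = mean(x),
                         median = median(x), sd = sd(x))
  # split() makes one vector per group; sapply() returns one column per group.
  t(sapply(split(values, groups), stats))
}

# Toy data for illustration only (NOT the cloud-seeding measurements).
toy <- data.frame(y = c(1, 2, 3, 10, 20, 60),
                  g = factor(rep(c("A", "B"), each = 3)))
toy_summary <- group_summary(toy$y, toy$g)
print(toy_summary)

# With the case-study data (requires the Sleuth2 package):
# library(Sleuth2)
# group_summary(case0301$Rainfall, case0301$Treatment)
```

Comparing the mean with the median, and the SDs across groups, is a quick numerical check on the skewness and unequal spread listed above.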
-
Can we do this?
> t.test(Rainfall ~ Treatment, alternative="two.sided",
+ var.equal=TRUE, data=case0301)
Two Sample t-test
data: Rainfall by Treatment
t = -1.9982, df = 50, p-value = 0.05114
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-556.224179 1.431851
sample estimates:
mean in group Unseeded mean in group Seeded
164.5885 441.9846
How much did the violations of our assumptions affect these results?
-
Robustness
t-tools may be used even when assumptions are violated, to a certain degree, because the t-tools are robust.
Robustness: A statistical procedure is robust to departures from a particular assumption if it is valid even when the assumption is not met.
-
Type 1: Robustness Against Departures from Normality
Recall that the Central Limit Theorem (CLT) states that, for large samples, sample averages have approximately Normal sampling distributions regardless of the shape of the population distribution.
As long as the samples are large enough, the t-ratio will follow an approximate t-distribution even if the data are non-Normal.
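A small simulation can make the CLT statement concrete. The sketch below is illustrative only: the exponential population, the sample sizes 5 and 100, the seed, and the helper names are all assumptions chosen for the example, not taken from the slides.

```r
set.seed(3022)  # arbitrary seed so the sketch is reproducible

# Assumed skewed population: exponential with mean 1 (an illustrative choice).
n_reps <- 5000
sample_mean <- function(n) mean(rexp(n, rate = 1))

means_small <- replicate(n_reps, sample_mean(5))
means_large <- replicate(n_reps, sample_mean(100))

# Simple moment-based skewness; values near 0 suggest a symmetric shape.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_small <- skewness(means_small)
skew_large <- skewness(means_large)
cat("skewness of averages, n = 5:  ", round(skew_small, 2), "\n")
cat("skewness of averages, n = 100:", round(skew_large, 2), "\n")
```

The averages based on the larger samples should show noticeably less skewness, which is the sense in which the sampling distribution becomes approximately Normal.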
-
Type 1: Robustness Against Departures from Normality
Effects of Skewness
If two populations have the same standard deviations and approximately the same shapes, and if $n_1 \approx n_2$, then the validity of t-tools is affected very little by skewness.
If two populations have the same standard deviations and approximately the same shapes, but $n_1 \neq n_2$, then the validity of t-tools is affected substantially by skewness. Larger sample sizes diminish this effect.
If skewness in the two populations differs considerably, t-tools can be very misleading with small and moderate sample sizes.
See Display 3.4 in the textbook for simulation results.
-
Type 2: Robustness Against Differing Standard Deviations
When we cannot assume $\sigma_1 = \sigma_2$, more serious problems may arise:
$s_p$ no longer estimates any parameter
$SE(\bar{x}_1 - \bar{x}_2)$ no longer estimates the standard deviation of the difference between averages
the t-ratio no longer follows a t-distribution
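For reference, the pooled quantities named above can be written out as follows. This is the standard pooled two-sample setup; the symbols $s_1$, $s_2$ (sample standard deviations) and $\mu_1$, $\mu_2$ (population means) are notation assumed here rather than taken from this slide.

```latex
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}},
\qquad
SE(\bar{x}_1 - \bar{x}_2) = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}},
\qquad
t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{SE(\bar{x}_1 - \bar{x}_2)}
```

Because $s_p$ averages the two sample variances into a single value, it has no common $\sigma$ to estimate when the population standard deviations differ.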
What can we do?
If $n_1 \approx n_2$, t-tools remain fairly valid even when $\sigma_1 \neq \sigma_2$.
When $n_1$ and $n_2$ are very different, we need the ratio $\sigma_1/\sigma_2$ to be between 1/2 and 2 to have reliable results.
See Display 3.5 in the textbook for simulation results.
> t.test(x1, x2, alternative = 'two.sided', var.equal = FALSE)
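In R, `var.equal = FALSE` requests the Welch version of the two-sample t-test, which does not pool the variances. The following sketch compares how often the pooled and Welch 95% intervals cover the true difference in one assumed scenario; the sample sizes, standard deviations, seed, and the helper `covers_zero` are illustrative choices, not values from the textbook display.

```r
set.seed(3022)  # arbitrary seed for reproducibility

# Assumed scenario: Normal data, equal means (true difference 0),
# but the smaller sample comes from the population with the larger SD.
covers_zero <- function(var_equal) {
  x1 <- rnorm(10, mean = 0, sd = 4)
  x2 <- rnorm(40, mean = 0, sd = 1)
  ci <- t.test(x1, x2, var.equal = var_equal)$conf.int
  ci[1] <= 0 && 0 <= ci[2]
}

n_reps <- 2000
coverage_pooled <- mean(replicate(n_reps, covers_zero(TRUE)))
coverage_welch  <- mean(replicate(n_reps, covers_zero(FALSE)))

cat("pooled 95% CI coverage:", coverage_pooled, "\n")
cat("Welch 95% CI coverage: ", coverage_welch, "\n")
```

With sample sizes this unequal and a standard deviation ratio of 4, the pooled interval would be expected to cover the true difference less often than the nominal 95%, while the Welch interval should stay close to it.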
-
Resistance and Outliers
An outlier is an observation judged to be far from its group average.
A statistical procedure is resistant if it does not change very much when a small part of the data changes, perhaps drastically.
Whether or not we should simply remove such observations depends on how resistant our tools are to changes in the data.
Question: Can you tell the difference between Robustness and Resistance?
-
Example of Outlier
[Figure: two example plots illustrating an outlier.]
-
Example of Resistance
Consider a hypothetical sample:
10, 20, 30, 50, 70
The sample mean is 36, and the sample median is 30.
Now consider the sample:
10, 20, 30, 50, 700
What happens to the sample mean? What about the sample median?
The sample median is resistant to any change in a single observation, while the sample mean is not.
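The two hypothetical samples above can be checked directly in R. This is a minimal sketch; the variable names are chosen here for illustration.

```r
original <- c(10, 20, 30, 50, 70)
altered  <- c(10, 20, 30, 50, 700)  # only the largest value changes

mean(original)    # 36, as stated above
median(original)  # 30, as stated above

mean(altered)     # 162: the mean is pulled far upward by one value
median(altered)   # 30: the median is unchanged
```

One changed observation moves the mean from 36 to 162 while leaving the median at 30, which is what resistance means in practice.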
-
Practical Strategies for the Two-Sample Problem
Our task is to size up actual conditions, using the available data, and evaluate the appropriateness of t-tools:
1. Think about possible cluster and serial effects.
2. Evaluate the suitability of t-tools by examining graphical displays (side-by-side histograms or box plots).
3. Consider alternatives:
a. Transform the data (Section 3.5) to see whether the transformed data better satisfy the assumptions.
b. Use alternative tools that do not require model assumptions (Chapter 4).
-
Transformations of Data
For positive data, the most useful transformation is the logarithm (log), particularly the natural (base e) logarithm (e = 2.71828...).
$\log(1) = 0$
$\log(e^x) = x$
[Figure: plot of the log function, log(x) against x for x from 0 to 10.]
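The two properties of the natural logarithm listed on this slide can be verified numerically, along with one further standard property that is not stated on the slide: the log of a ratio equals a difference of logs. The values of `x`, `a`, and `b` below are arbitrary illustrative choices.

```r
x <- c(0.5, 1, 2, 10)

log(1)                     # 0
all.equal(log(exp(x)), x)  # TRUE: log undoes exp

# A ratio on the original scale becomes a difference on the log scale.
a <- 200
b <- 50
all.equal(log(a / b), log(a) - log(b))  # TRUE
```

The ratio property is why a difference between group averages on the log scale can be read as a multiplicative effect on the original scale.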
-
Cloud Seeding - Transformation
Recall both groups are skewed, with the seeded days having a larger average and a greater spread.
> max(case0301$Rainfall[case0301$Treatment=="Seeded"])/
+ min(case0301$Rainfall[case0301$Treatment=="Seeded"])
[1] 669.6586
> max(case0301$Rainfall[case0301$Treatment=="Unseeded"])/
+ min(case0301$Rainfall[case0301$Treatment=="Unseeded"])
[1] 1202.6
> case0301$logRain <- log(case0301$Rainfall)
> head(case0301)
Rainfall Treatment logRain
1 1202.6 Unseeded 7.092241
2 830.1 Unseeded 6.721546
3 372.4 Unseeded 5.919969
4 345.5 Unseeded 5.844993
5 321.2 Unseeded 5.772064
6 244.3 Unseeded 5.498397
[Figure: side-by-side box plots for the Unseeded and Seeded groups in two panels: "before transformation" (vertical axis 0 to 2500) and "after transformation" (vertical axis 0 to 8).]
-
Two-Sample t-Analysis
Before:
> t.test(Rainfall ~ Treatment, alternative="two.sided",
+ var.equal=TRUE, data=case0301)
Two Sample t-test
data: Rainfall by Treatment
t = -1.9982, df = 50, p-value = 0.05114
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-556.224179 1.431851
sample estimates:
mean in group Unseeded mean in group Seeded
164.5885 441.9846
After:
> t.test(logRain ~ Treatment, data=case0301,
+ alternative="less", var.equal=TRUE)
Two Sample t-test
data: logRain by Treatment
t = -2.5444, df = 50, p-value = 0.007041
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -0.3904045
sample estimates:
mean in group Unseeded mean in group Seeded
3.990406 5.134187
There is convincing evidence that seeding increased rainfall (one-sided p-value = 0.007).
-
Multiplicative Treatment Effect
Definition: Suppose $Z = \log Y$. It is estimated that the response of an experimental unit to treatment 2 will be $e^{\bar{Z}_2 - \bar{Z}_1}$ times as large as its response to treatment 1, where $\bar{Z}_1$ is the average of $\log(Y_1)$ and $\bar{Z}_2$ is the average of $\log(Y_2)$.
> m1 m2 (diffmeans (est.mult.effect
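The R commands on this slide are truncated in the source, so the following is a sketch rather than a reconstruction of them. It applies the definition to the two group averages of log rainfall reported in the "After" t-test output (3.990406 for Unseeded, 5.134187 for Seeded); the variable names are chosen here for illustration.

```r
# Averages of log(rainfall) as reported in the "After" t-test output.
mean_log_unseeded <- 3.990406
mean_log_seeded   <- 5.134187

# Difference on the log scale, then back-transform with exp().
diff_log_means  <- mean_log_seeded - mean_log_unseeded
est_mult_effect <- exp(diff_log_means)  # seeded relative to unseeded

round(est_mult_effect, 2)  # about 3.14
```

Under this reading, rainfall on seeded days is estimated to be roughly 3.1 times the rainfall on unseeded days.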
-
Confidence Interval
> (test test$conf.int
[1] -2.0466973 -0.2408651
attr(,"conf.level")
[1] 0.95
> exp(test$conf.int)
[1] 0.1291608 0.7859476
attr(,"conf.level")
[1] 0.95
A 95% confidence interval for the multiplicative effect of not seeding relative to seeding (unseeded/seeded) is 0.129 to 0.786.
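The back-transformation can be sketched from the printed endpoints alone. The code below copies the log-scale interval from the output on this slide, exponentiates it, and then takes reciprocals to express the same interval as seeded relative to unseeded; the variable names are assumptions made for this example.

```r
# Interval endpoints on the log scale, copied from the printed output
# (difference taken as Unseeded minus Seeded).
ci_log <- c(-2.0466973, -0.2408651)

# exp() turns the interval for a difference of logs into one for a ratio.
ci_unseeded_over_seeded <- exp(ci_log)

# Reciprocals give the ratio in the other direction; rev() restores
# lower-then-upper ordering.
ci_seeded_over_unseeded <- rev(1 / ci_unseeded_over_seeded)

round(ci_unseeded_over_seeded, 3)  # 0.129 0.786
round(ci_seeded_over_unseeded, 2)  # about 1.27 and 7.74
```

Read in the seeded/unseeded direction, the same interval says seeded rainfall is estimated to be between about 1.27 and 7.74 times the unseeded rainfall.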
-
Removing Outliers and Other Data Points
> library(Sleuth2); ex0327[15:17, ]
        Country Life Income           Type
15     Portugal 68.1    956 Industrialized
16 South_Africa 68.2    NaN Industrialized
17       Sweden 74.7   5596 Industrialized
> range(ex0327$Income, na.rm=TRUE)
[1] 110 5596
> data
> d1 ### dealing with Missing data ###
> (cc data2
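The commands for handling missing data are truncated in the source, so the sketch below is not a reconstruction of them. It shows one common way to drop rows with a missing value, using a small data frame re-entered by hand from the three rows printed above; the object names `d`, `has_income`, and `d_complete` are hypothetical.

```r
# The three rows shown above, re-entered by hand for a self-contained example.
d <- data.frame(
  Country = c("Portugal", "South_Africa", "Sweden"),
  Life    = c(68.1, 68.2, 74.7),
  Income  = c(956, NaN, 5596)
)

has_income <- !is.na(d$Income)  # is.na() is TRUE for both NA and NaN
d_complete <- d[has_income, ]

nrow(d_complete)          # 2 rows remain
range(d_complete$Income)  # 956 5596
```

Dropping a row because a value is missing is a different decision from dropping an outlier, and either kind of removal should be reported alongside the analysis.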
-
Q: How many conservative economists does it take to change a light bulb?
-
A: None, they're all waiting for the unseen hand of the market to correct the lighting disequilibrium.