Assignment #7
Chapter 12: 18, 24 Chapter 13: 28 Due this Friday Nov. 20th by 2pm in your TA’s homework box
Assignment #8
Chapter 14: 26 Chapter 15: 18, 27 Due next Friday Nov. 27th by 2pm in your TA’s homework box
Reading
For Today: Chapter 16 For Thursday: Chapter 17
Lab Report
• Posted on website
• Dates
– Rough draft due to TA’s homework box Monday Nov. 16th
– Rough draft returned in your registered lab section next week
– Final draft due at the start of your registered lab section the week of Nov. 30th
• 10% of course grade
– Rough draft - 5%
– Final draft - 5%
– If you’re happy with your rough draft mark, you can tell your TA to use it for the final draft
• Read the “Writing a Lab Report” section of your lab notebook for guidance!!
Chapter 15 Review
Null hypothesis for simple ANOVA
• H0 : Variance among groups = 0
OR
• H0 : µ1 = µ2 = µ3 = µ4 = ... = µk
Key to ANOVA
If there really are no differences among populations, then differences in sample means should be due to sampling error alone. We can estimate how much variation in group means ought to be present if it is due to sampling error alone.
\sigma_{\bar{X}}^2 = \frac{\sigma^2}{n} + \mathrm{Var}[\mu_i]

• \sigma_{\bar{X}}^2 : the variance among groups (the variance of the group means)
• \sigma^2 : the variance within groups
• \mathrm{Var}[\mu_i] : the variance due to differences among the true population means
  – 0 if H0 is true
  – >0 if HA is true
If H0 is true:

\sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}

If HA is true:

\sigma_{\bar{X}}^2 > \frac{\sigma^2}{n}

Is \sigma_{\bar{X}}^2 > \frac{\sigma^2}{n}? Equivalently, is n\sigma_{\bar{X}}^2 > \sigma^2?
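One way to check this sampling-error baseline is a small simulation. The sketch below is a minimal Python illustration; the population standard deviation, group size, number of groups, and seed are made-up values, not from the lecture.

import numpy as np

rng = np.random.default_rng(1)
sigma, n, k, reps = 2.0, 10, 5, 2000  # made-up: population SD, per-group n, number of groups, replicates

# Under H0 every group is drawn from the same population, so across many
# replicates the variance of the k group means should be close to sigma^2 / n.
group_means = rng.normal(0.0, sigma, size=(reps, k, n)).mean(axis=2)
print(group_means.var(axis=1, ddof=1).mean())  # observed variance among group means
print(sigma ** 2 / n)                          # expected value under H0: 0.4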
n\sigma_{\bar{X}}^2 is estimated by the “Mean Squares Group” (MS_{group})

\sigma^2 is the variance within groups, estimated by the “Mean Squares Error” (MS_{error})

Population parameters: n\sigma_{\bar{X}}^2 and \sigma^2
Estimates from the sample: MS_{group} and MS_{error}
Mean squares group
SS_{group} = \sum n_i (\bar{X}_i - \bar{X})^2

df_{group} = k - 1

MS_{group} = \frac{SS_{group}}{df_{group}}
Mean squares error

Error sum of squares:

SS_{error} = \sum_i \sum_j (Y_{ij} - \bar{Y}_i)^2 = \sum df_i s_i^2 = \sum s_i^2 (n_i - 1)

Error degrees of freedom:

df_{error} = \sum df_i = \sum (n_i - 1) = N - k
MSerror is like the pooled variance in a 2-sample t-test:
MS_{error} = \frac{SS_{error}}{df_{error}} = \frac{\sum s_i^2 (n_i - 1)}{N - k}

s_p^2 = \frac{df_1 s_1^2 + df_2 s_2^2}{df_1 + df_2}
Test statistic: F
If H0 is true:

F = \frac{n\sigma_{\bar{X}}^2}{\sigma^2} = 1

If HA is true:

F = \frac{n\left(\sigma^2/n + \mathrm{Var}[\mu_i]\right)}{\sigma^2} > 1

Estimate of F:

F = \frac{MS_{group}}{MS_{error}}
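As a rough sketch of the bookkeeping (assuming NumPy; the helper name anova_from_summary is made up for illustration), F can be computed directly from per-group means, standard deviations, and sample sizes:

import numpy as np

def anova_from_summary(means, sds, ns):
    """Single-factor ANOVA pieces from group summary statistics."""
    means, sds, ns = map(np.asarray, (means, sds, ns))
    grand_mean = np.sum(ns * means) / np.sum(ns)
    ss_group = np.sum(ns * (means - grand_mean) ** 2)   # SS_group
    df_group = len(means) - 1                           # k - 1
    ss_error = np.sum((ns - 1) * sds ** 2)              # SS_error
    df_error = np.sum(ns) - len(means)                  # N - k
    ms_group = ss_group / df_group
    ms_error = ss_error / df_error
    return ms_group, ms_error, ms_group / ms_error      # MS_group, MS_error, F

Plugging the caffeine summary table below into this sketch should reproduce the hand calculation that follows.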
Assumptions of ANOVA
(1) Random samples
(2) Normal distributions for each population
(3) Equal variances for all populations.
Example: Does the mean amount of nectar taken by bees depend on the concentration of caffeine in the nectar?
Hypotheses
H0: Mean amount of nectar does not differ among caffeine concentrations HA: Mean amount of nectar differs among caffeine concentrations
Example:
caffeine concentration   mean     standard deviation   n
50 ppm                    0.008   0.289                5
100 ppm                  -0.172   0.169                5
150 ppm                   0.376   0.309                5
200 ppm                   0.378   0.393                5
Does the mean amount of nectar taken by bees depend on the concentration of caffeine in the nectar?
Mean squares Error
SS_{error} = \sum df_i s_i^2 = 4(0.289)^2 + 4(0.169)^2 + 4(0.309)^2 + 4(0.393)^2 = 1.4482

df_{error} = 4 + 4 + 4 + 4 = 16

MS_{error} = \frac{1.4482}{16} = 0.0905
Mean Squares Group:
\bar{X} = \frac{5(0.008) + 5(-0.172) + 5(0.376) + 5(0.378)}{5 + 5 + 5 + 5} = 0.1475

SS_{group} = \sum n_i (\bar{X}_i - \bar{X})^2
           = 5(0.008 - 0.1475)^2 + 5(-0.172 - 0.1475)^2 + 5(0.376 - 0.1475)^2 + 5(0.378 - 0.1475)^2
           = 1.1344
Mean Squares Group:
df_{group} = k - 1 = 4 - 1 = 3

MS_{group} = \frac{SS_{group}}{df_{group}} = \frac{1.1344}{3} = 0.3781
The test statistic for ANOVA is F
F = \frac{MS_{group}}{MS_{error}} = \frac{0.3781}{0.0905} = 4.18

MS_{group} is always in the numerator; MS_{error} is always in the denominator.
Compare to Fα(1),df_group,df_error
F0.05(1),3,16 = 3.24 F0.025(1),3,16 = 4.08 F0.01(1),3,16 = 5.29
Since 4.08 < 4.18 < 5.29, we have 0.025 > P > 0.01, and we can reject the null hypothesis. The mean amount of nectar taken differs for at least one of the caffeine concentrations.
ANOVA table
Source   SS       df   MS       F      P
Group    1.1344    3   0.3781   4.18   <0.025
Error    1.4482   16   0.0905
Total    2.5826   19
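To turn F = 4.18 with df = 3 and 16 into a P-value rather than bracketing it from the table, one option (a hedged check, assuming SciPy is available) is the upper tail of the F distribution:

from scipy import stats

# Upper-tail probability of F = 4.18 with 3 and 16 degrees of freedom;
# it should fall between 0.01 and 0.025, consistent with the table lookup above.
p = stats.f.sf(4.18, 3, 16)
print(p)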
In-class exercise
Does PLP1 gene expression differ among people with schizophrenia, bipolar disorder and control groups?
Group           mean     standard deviation   n
control         -0.004   0.218                15
schizophrenia   -0.195   0.182                15
bipolar         -0.263   0.151                15
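A small sketch for checking your hand calculation afterwards, using the same summary-statistic recipe as above (NumPy assumed; nothing here is from the slides):

import numpy as np

# Summary statistics from the table above
means = np.array([-0.004, -0.195, -0.263])
sds   = np.array([0.218, 0.182, 0.151])
ns    = np.array([15, 15, 15])

grand_mean = np.sum(ns * means) / np.sum(ns)
ms_group = np.sum(ns * (means - grand_mean) ** 2) / (len(means) - 1)
ms_error = np.sum((ns - 1) * sds ** 2) / (np.sum(ns) - len(means))
F = ms_group / ms_error   # compare with the critical value for df = 2 and 42
print(F)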
Variation Explained: R2
R^2 = \frac{SS_{group}}{SS_{total}}
The fraction of variation in Y that is “explained” by differences among groups
Kruskal-Wallis test
• A non-parametric test similar to a single factor ANOVA
• Uses the ranks of the data points
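A minimal sketch of how this is usually run in practice (assuming SciPy; the three groups are made-up numbers, not data from the lecture):

from scipy import stats

# Made-up example groups; kruskal pools and ranks all the observations,
# then asks whether at least one group tends to have higher ranks.
g1 = [2.1, 3.4, 1.9, 2.8]
g2 = [3.9, 4.2, 3.7, 4.5]
g3 = [2.5, 2.9, 3.1, 2.2]
H, p = stats.kruskal(g1, g2, g3)
print(H, p)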
Multiple comparisons
Probability of at least one Type I error in N tests = 1 - (1 - α)^N
For 20 tests, the probability of at least one Type I error is ~65%.
For 100 tests, ~99.4%!
"Bonferroni correction" for multiple comparisons
Uses a smaller α value:
\alpha' = \frac{\alpha}{\text{number of tests}}
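Both quantities are one-liners to compute; a small sketch using the 20-test case from above:

alpha, n_tests = 0.05, 20

# Probability of at least one Type I error across n_tests independent tests
print(1 - (1 - alpha) ** n_tests)   # about 0.64, i.e. roughly 65%

# Bonferroni-corrected significance level used for each individual test
print(alpha / n_tests)              # 0.0025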
Which groups are different?
After finding evidence for differences among means with ANOVA, sometimes we want to know: Which groups are different from which others?
One method for this: the Tukey-Kramer test
The Tukey-Kramer test
Done after finding variation among groups with single-factor ANOVA.
Compares all group means to all other group means.
With the Tukey-Kramer method, the probability of making at least one Type I error throughout the course of testing all pairs of means is no greater than the significance level α.
Groups which cannot be distinguished share the same letter.
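One common implementation is pairwise_tukeyhsd from statsmodels; the sketch below uses made-up measurements and group labels, so the resulting groupings are only illustrative.

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Made-up measurements and their group labels
values = np.array([1.1, 0.9, 1.3, 2.4, 2.1, 2.6, 1.2, 1.0, 1.4])
groups = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C"])

# All pairwise comparisons, holding the family-wise Type I error rate at alpha
result = pairwise_tukeyhsd(values, groups, alpha=0.05)
print(result)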
Another imaginary example:
H0 : µ1 = µ2   Cannot reject
H0 : µ1 = µ3   Cannot reject
H0 : µ2 = µ3   Reject
Chapter 16: Correlation between numerical variables
Two variables: Which test?
Response variable   Explanatory: Categorical         Explanatory: Numerical
Categorical         Contingency analysis             Logistic regression, Survival analysis
Numerical           t-test, Analysis of variance     Regression, Correlation
Scatter plot
Tattersall et al. (2004) Journal of Experimental Biology 207:579-585
Correlation: r
• r is called the “correlation coefficient”
• Describes the relationship between two numerical variables
• Parameter: ρ (rho) Estimate: r
• -1 ≤ ρ ≤ 1 and -1 ≤ r ≤ 1
Estimating the correlation coefficient
r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}

The numerator is the “sum of products”; each term under the square root is a “sum of squares”.
Standard error of r
SE_r = \sqrt{\frac{1 - r^2}{n - 2}}
Example
How strong is the association between the number of encounters with aggressive adults as a chick and future aggressive behavior?
Example
Number of visits   Future aggressive behavior
1                  -0.80
7                  -0.92
15                 -0.80
4                  -0.46
11                 -0.47
14                 -0.46
23                 -0.23
14                 -0.16
9                  -0.23
5                  -0.23
4                  -0.16
10                 -0.10
13                 -0.10
13                  0.04
14                  0.13
12                  0.19
13                  0.25
9                   0.23
8                   0.15
18                  0.23
22                  0.31
22                  0.18
23                  0.17
31                  0.39
X is events experienced while nestling, Y is future behavior
\sum X = 315       \sum Y = -2.85
\sum X^2 = 5329    \sum Y^2 = 3.5553
\sum XY = -4.32    n = 24
Shortcuts
\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n}

\sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum X_i^2 - \frac{(\sum X_i)^2}{n}

\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum Y_i^2 - \frac{(\sum Y_i)^2}{n}
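These identities are easy to check numerically; a quick sketch using the first five (X, Y) pairs from the chick data above (NumPy assumed):

import numpy as np

# First five (number of visits, future aggression) pairs from the table above
x = np.array([1.0, 7.0, 15.0, 4.0, 11.0])
y = np.array([-0.80, -0.92, -0.80, -0.46, -0.47])
n = len(x)

sum_of_products = np.sum((x - x.mean()) * (y - y.mean()))
shortcut        = np.sum(x * y) - np.sum(x) * np.sum(y) / n
print(np.isclose(sum_of_products, shortcut))   # True: both forms give the same value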
Finding r
\sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n} = -4.32 - \frac{(315)(-2.85)}{24} = 33.086

\sum (X_i - \bar{X})^2 = \sum X_i^2 - \frac{(\sum X_i)^2}{n} = 5329 - \frac{(315)^2}{24} = 1194.625

\sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - \frac{(\sum Y_i)^2}{n} = 3.5553 - \frac{(-2.85)^2}{24} = 3.217

r = \frac{33.086}{\sqrt{(1194.625)(3.217)}} = 0.534

SE_r = \sqrt{\frac{1 - r^2}{n - 2}} = \sqrt{\frac{1 - (0.534)^2}{24 - 2}} = 0.180
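The same arithmetic in a few lines of Python, plugging in the summary sums listed above (standard library only; just a check of the hand calculation):

from math import sqrt

# Summary sums for the chick aggression example
sum_x, sum_y   = 315.0, -2.85
sum_x2, sum_y2 = 5329.0, 3.5553
sum_xy, n      = -4.32, 24

sp  = sum_xy - sum_x * sum_y / n     # sum of products, ~33.09
ssx = sum_x2 - sum_x ** 2 / n        # sum of squares for X, ~1194.6
ssy = sum_y2 - sum_y ** 2 / n        # sum of squares for Y, ~3.217

r    = sp / sqrt(ssx * ssy)          # ~0.534
se_r = sqrt((1 - r ** 2) / (n - 2))  # ~0.180
print(r, se_r)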
If ρ = 0,...

r is normally distributed with mean 0, and the test statistic

t = \frac{r}{SE_r}

has a t-distribution with df = n - 2.
Example
• Are the effects of new mutations on mating success and productivity correlated?
• Data from various visible mutations in Drosophila melanogaster
Hypotheses
H0: Mating success and productivity are not related (ρ = 0).
HA: Mating success and productivity are correlated (ρ ≠ 0).
X is productivity, Y is the mating success
\sum X = -24.228       \sum Y = 9.498
\sum X^2 = 35.1808     \sum Y^2 = 4.5391
\sum XY = -4.62741     n = 31
Finding r
\sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n} = -4.627 - \frac{(-24.228)(9.498)}{31} = 2.796

\sum (X_i - \bar{X})^2 = \sum X_i^2 - \frac{(\sum X_i)^2}{n} = 35.1808 - \frac{(-24.228)^2}{31} = 16.245

\sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - \frac{(\sum Y_i)^2}{n} = 4.5391 - \frac{(9.498)^2}{31} = 1.6289

r = \frac{2.796}{\sqrt{(16.245)(1.6289)}} = 0.5435

SE_r = \sqrt{\frac{1 - r^2}{n - 2}} = \sqrt{\frac{0.7045}{29}} = 0.1558

t = \frac{r}{SE_r} = \frac{0.5435}{0.1558} = 3.49

df = n - 2 = 31 - 2 = 29
t=3.49 is greater than t0.05(2), 29 = 2.045, so we can reject the null hypothesis and say that productivity and male mating success are correlated across genotypes.
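A compact check of the whole chain (r, SE, t, and a two-tailed P-value) from the summary sums, assuming SciPy for the t distribution:

from math import sqrt
from scipy import stats

# Summary sums for the mutation example
sum_x, sum_y   = -24.228, 9.498
sum_x2, sum_y2 = 35.1808, 4.5391
sum_xy, n      = -4.62741, 31

sp = sum_xy - sum_x * sum_y / n
r  = sp / sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))   # ~0.54
t  = r / sqrt((1 - r ** 2) / (n - 2))                                   # ~3.49
p  = 2 * stats.t.sf(abs(t), n - 2)                                      # two-tailed P
print(r, t, p)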
Correlation assumes...
• Random sample
• X is normally distributed with equal variance for all values of Y
• Y is normally distributed with equal variance for all values of X
Bivariate Normal Distribution
• The relationship between X and Y is linear
• The cloud of points in a scatter plot of X and Y has a circular or elliptical shape
• The frequency distributions of X and Y separately are normal
Most frequent departures from the bivariate normal distribution
Spearman's rank correlation
• An alternative to correlation that does not make so many assumptions
Example: Spearman's rs
Versions of the trick:
1. Boy climbs up rope, climbs down again
2. Boy climbs up rope, seems to vanish, re-appears at top, climbs down again
3. Boy climbs up rope, seems to vanish at top
4. Boy climbs up rope, vanishes at top, reappears somewhere the audience was not looking
5. Boy climbs up rope, vanishes at top, reappears in a place which has been in full view
Example: Spearman's rs
Hypotheses H0: The difficulty of the described trick is not correlated with the time elapsed since it was observed. HA: The difficulty of the described trick is correlated with the time elapsed since it was observed.
Years elapsed   Rank (years)   Impressiveness score   Rank (impressiveness)
2               1              1                      2
5               3.5            1                      2
5               3.5            1                      2
4               2              2                      5
17              5.5            2                      5
17              5.5            2                      5
31              13             3                      7
20              7              4                      12.5
22              8              4                      12.5
25              9              4                      12.5
28              10.5           4                      12.5
29              12             4                      12.5
34              14.5           4                      12.5
43              17             4                      12.5
44              18             4                      12.5
46              19             4                      12.5
34              14.5           4                      12.5
28              10.5           5                      19.5
39              16             5                      19.5
50              20.5           5                      19.5
50              20.5           5                      19.5
Finding rs
\sum (R_i - \bar{R})(S_i - \bar{S}) = \sum R_i S_i - \frac{(\sum R_i)(\sum S_i)}{n} = 566

\sum (R_i - \bar{R})^2 = \sum R_i^2 - \frac{(\sum R_i)^2}{n} = 767.5

\sum (S_i - \bar{S})^2 = \sum S_i^2 - \frac{(\sum S_i)^2}{n} = 678.5

r_S = \frac{566}{\sqrt{(767.5)(678.5)}} = 0.784
rS(0.05, 21) = 0.434 and rS(0.01, 21) = 0.550. Since rS = 0.784 is greater than 0.550, P < 0.01, so we reject the null hypothesis. There is a positive correlation between the impressiveness score and the number of years elapsed.
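For reference, scipy.stats.spearmanr assigns average ranks to ties in the same way as the hand calculation, so running it on the raw (years, score) pairs from the table should give roughly the same rS (a hedged check, assuming SciPy is available):

from scipy import stats

# (years elapsed, impressiveness score) for the 21 accounts in the table above
years = [2, 5, 5, 4, 17, 17, 31, 20, 22, 25, 28, 29, 34, 43, 44, 46, 34, 28, 39, 50, 50]
score = [1, 1, 1, 2, 2, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5]

rs, p = stats.spearmanr(years, score)
print(rs, p)   # rs should come out near the hand-calculated 0.78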
Spearman’s rank correlation for n >100
SE[r_S] = \sqrt{\frac{1 - r_S^2}{n - 2}}

t = \frac{r_S}{SE[r_S]}

df = n - 2
Attenuation: the estimated correlation will be lower if X or Y is estimated with error.
[Figure: scatter plots comparing the real correlation, Y estimated with measurement error, and both X and Y estimated with measurement error]
Correlation depends on range