Descriptive measures of the strength of a linear association
r-squared and the (Pearson) correlation coefficient r
Translating a research question into a statistical procedure
• How strong is the linear relationship between skin cancer mortality and latitude?
– (Pearson) correlation coefficient r
– Coefficient of determination r²
Where does this topic fit in?
• Model formulation
• Model estimation
• Model evaluation
• Model use
Situation #1: A very weak linear relationship
[Regression plot of y versus x. Fitted line: y = 54.4758 - 0.764016 x; S = 7.81137, R-Sq = 6.5 %, R-Sq(adj) = 3.2 %]
$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = 119.1$
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 1708.5$
$SSTO = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 1827.6$
Situation #2: A fairly strong linear relationship
[Regression plot of y versus x. Fitted line: y = 75.5458 - 5.76402 x; S = 7.81137, R-Sq = 79.9 %, R-Sq(adj) = 79.2 %]
$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = 6779.3$
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 1708.5$
$SSTO = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 8487.8$
Coefficient of determination r²
$r^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}$
• r² is a number (a proportion!) between 0 and 1.
• If r² = 1:
– all data points fall perfectly on the regression line
– the predictor x accounts for all of the variation in y
• If r² = 0:
– the fitted regression line is perfectly horizontal
– the predictor x accounts for none of the variation in y
Interpretation of r²
• r² × 100 percent of the variation in y is reduced by taking into account predictor x.
• r² × 100 percent of the variation in y is “explained by” the variation in predictor x.
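The decomposition behind r² can be checked numerically. The sketch below fits a least-squares line to a small made-up data set (the x and y values are illustrative assumptions, not data from these slides), computes SSR, SSE and SSTO, and confirms that SSR/SSTO and 1 - SSE/SSTO give the same proportion.

```python
# Illustrative data only (assumed for this sketch, not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope b1 and intercept b0.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# The three sums of squares.
ssto = sum((yi - y_bar) ** 2 for yi in y)               # total variation in y
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)            # variation explained by the line
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # variation left over

r_squared = ssr / ssto
print(f"SSR={ssr:.2f} SSE={sse:.2f} SSTO={ssto:.2f} r^2={r_squared:.2f}")
```

For these values about 60 percent of the variation in y is accounted for by x, and SSR + SSE reproduces SSTO.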
R-sq in Minitab fitted line plot
[Fitted line plot: Mortality versus Latitude (at center of state). Mort = 389.189 - 5.97764 Lat; S = 19.1150, R-Sq = 68.0 %, R-Sq(adj) = 67.3 %]
R-sq in Minitab regression output
The regression equation is
Mort = 389.189 - 5.97764 Lat

S = 19.1150   R-Sq = 68.0 %   R-Sq(adj) = 67.3 %

Analysis of Variance

Source      DF       SS       MS        F      P
Regression   1  36464.2  36464.2  99.7968  0.000
Error       47  17173.1    365.4
Total       48  53637.3
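The reported R-Sq can be recovered by hand from the SS column of the Analysis of Variance table. A minimal check, using only the three sums of squares printed in the output above (the variable names are my own):

```python
# Sums of squares copied from the Minitab Analysis of Variance table.
ssr = 36464.2    # Regression SS
sse = 17173.1    # Error SS
ssto = 53637.3   # Total SS

r_squared = ssr / ssto            # proportion of variation explained
r_squared_alt = 1 - sse / ssto    # same quantity via the error sum of squares

print(f"R-Sq = {100 * r_squared:.1f} %")
```

Both routes give 68.0 %, matching the R-Sq value Minitab prints.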
Pearson correlation coefficient r
If r² is represented in decimal form, e.g. 0.39 or 0.87, then:
$r = \pm\sqrt{r^2}$
• r is a (unitless) number between -1 and 1, inclusive.
• Sign of coefficient of correlation
– plus sign if slope of fitted regression line is positive
– negative sign if slope of fitted regression line is negative
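As a quick worked case, the skin cancer output gives r² = 36464.2 / 53637.3 with a negative fitted slope (-5.97764), so r should come out negative. The helper name `r_from_r_squared` below is hypothetical, introduced only for this sketch.

```python
import math

def r_from_r_squared(r_squared, slope):
    """Return the square root of r_squared carrying the sign of the slope."""
    return math.copysign(math.sqrt(r_squared), slope)

# r^2 and slope taken from the mortality-versus-latitude regression output.
r = r_from_r_squared(36464.2 / 53637.3, -5.97764)
print(round(r, 3))  # -0.825
```

The result agrees with the Pearson correlation of -0.825 that Minitab reports for Mort and Lat.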
Formulas for the Pearson correlation coefficient r
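$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
$r = \frac{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \times b_1$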
What do we learn from the formulas for r?
• The correlation coefficient r gets its sign from the slope b1.
• The correlation coefficient r is a unitless measure.
• The correlation coefficient r = 0 when the estimated slope b1 = 0 and vice versa.
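The two formulas for r can be compared on a small example. This is a sketch on made-up data (the x and y values are assumptions for illustration): it computes r directly from the cross-products and again by rescaling the estimated slope b1, and the two values should agree.

```python
import math

# Illustrative data only (assumed for this sketch, not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Formula 1: r from the sums of cross-products and squares.
r_direct = s_xy / math.sqrt(s_xx * s_yy)

# Formula 2: r as the estimated slope b1 rescaled by the spread of x and y.
b1 = s_xy / s_xx
r_via_slope = math.sqrt(s_xx) / math.sqrt(s_yy) * b1

print(f"{r_direct:.4f} {r_via_slope:.4f}")
```

Because the rescaling factor is always positive, r inherits its sign from b1, and r is zero exactly when b1 is zero.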
Interpretation of Pearson correlation coefficient r
• There is no nice practical interpretation for r as there is for r².
• r = -1 is perfect negative linear relationship.
• r = 1 is perfect positive linear relationship.
• r = 0 is no linear relationship.
• For other r, how strong the relationship between x and y is deemed depends on the research area.
Pearson correlation coefficient r in Minitab
Correlations: Mort, Lat
Pearson correlation of Mort and Lat = -0.825
Correlations: Lat, Mort
Pearson correlation of Lat and Mort = -0.825
How strong is the linear relationship between Celsius and Fahrenheit?
[Fitted line plot: Fahrenheit versus Celsius. Fahrenheit = 32 + 1.8 Celsius; S = 0, R-Sq = 100.0 %, R-Sq(adj) = 100.0 %]
Pearson correlation of Celsius and Fahrenheit = 1.000
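The Celsius and Fahrenheit result is easy to reproduce, since every point lies exactly on the line Fahrenheit = 32 + 1.8 Celsius. In the sketch below the particular Celsius readings are assumed (the slide does not list its data), and `pearson_r` is a small helper written for this example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Assumed temperature readings; any set of distinct values would do.
celsius = [0, 10, 20, 30, 40, 50]
fahrenheit = [32 + 1.8 * c for c in celsius]

r = pearson_r(celsius, fahrenheit)
print(f"{r:.3f}")  # 1.000
```

Changing the units of either variable leaves r unchanged, which is what "unitless" means in practice.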
How strong is the linear relationship between # of stories and height?
[Fitted line plot: HEIGHT versus STORIES. HEIGHT = 90.3096 + 11.2924 STORIES; S = 58.3259, R-Sq = 90.4 %, R-Sq(adj) = 90.2 %]
Pearson correlation of HEIGHT and STORIES = 0.951
How strong is the linear relationship between driver age and see distance?
[Fitted line plot: Distance versus DrivAge. Distance = 576.682 - 3.00684 DrivAge; S = 49.7616, R-Sq = 64.2 %, R-Sq(adj) = 62.9 %]
Pearson correlation of Distance and DrivAge = -0.801
How strong is the linear relationship between height and g.p.a.?
[Fitted line plot: gpa versus height. gpa = 3.41021 - 0.0065630 height; S = 0.542316, R-Sq = 0.3 %, R-Sq(adj) = 0.0 %]
Pearson correlation of height and gpa = -0.053
Caution #1
• The correlation coefficient r quantifies the strength of a linear relationship.
• It is possible to get r = 0 with a perfect curvilinear relationship.
Example of Caution #1
[Fitted line plot: y versus x, with a straight line fitted to curved data. y = 14 - 0.0000000 x; S = 13.4907, R-Sq = 0.0 %, R-Sq(adj) = 0.0 %]
Pearson correlation of x and y = 0.000
Clarification of Caution #1
[Fitted line plot: y versus x, with a quadratic curve fitted to the same data. y = 0.0000000 - 0.0000000 x + 1 x**2; S = 0, R-Sq = 100.0 %, R-Sq(adj) = 100.0 %]
Pearson correlation of x and y = 0.000
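A perfect parabola makes the caution concrete. The sketch below assumes the data are x = -6, ..., 6 with y = x²; the slide does not list its points, and this choice is a guess made because it reproduces the fitted intercept of 14 and slope of 0 shown on the plot.

```python
import math

# Assumed data: integers from -6 to 6, with y an exact quadratic function of x.
x = [float(v) for v in range(-6, 7)]
y = [v ** 2 for v in x]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

b1 = s_xy / s_xx                   # slope of the fitted straight line
b0 = y_bar - b1 * x_bar            # intercept of the fitted straight line
r = s_xy / math.sqrt(s_xx * s_yy)  # Pearson correlation

print(f"b0={b0:.1f} b1={b1:.1f} r={r:.3f}")
```

The straight line is flat and r is 0 even though y is determined exactly by x, because r measures only the linear part of a relationship.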
Caution #2
• A large r² value should not be interpreted as meaning that the estimated regression line fits the data well.
• Another function might better describe the trend in the data.
Example of Caution #2
[Fitted line plot: US Population (millions) versus Year. USPopn = -2217.46 + 1.21862 Year; S = 22.8349, R-Sq = 92.0 %, R-Sq(adj) = 91.6 %]
Pearson correlation of Year and USPopn = 0.959
Caution #3
• The coefficient of determination r² and the correlation coefficient r can both be greatly affected by just one data point (or a few data points).
Example of Caution #3
[Fitted line plot: Deaths versus Magnitude. Deaths = -1121.94 + 179.468 Magnitude; S = 140.359, R-Sq = 53.5 %, R-Sq(adj) = 41.9 %]
Pearson correlation of Deaths and Magnitude = 0.732
Example of Caution #3
[Fitted line plot: Deaths versus Magnitude, over a narrower range of magnitudes. Deaths = 647.967 - 87.1465 Magnitude; S = 13.1447, R-Sq = 92.1 %, R-Sq(adj) = 89.4 %]
Pearson correlation of Deaths and Magnitude = -0.960
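The same kind of sign flip can be reproduced with a handful of made-up points. The values below are assumptions for illustration only (they are not the earthquake data): five points on a perfectly decreasing line, plus one extreme point.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Five illustrative points lying exactly on a decreasing line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [5.0, 4.0, 3.0, 2.0, 1.0]
r_without = pearson_r(x, y)

# The same points plus a single extreme observation.
r_with = pearson_r(x + [10.0], y + [20.0])

print(f"without: {r_without:.3f}  with: {r_with:.3f}")
```

One added point is enough to turn a perfect negative correlation into a fairly strong positive one, so r and r² should always be read alongside a plot of the data.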
Caution #4
• Correlation (association) does not imply causation.
Example of Caution #4
[Fitted line plot: Heart disease deaths (per 100,000 people) versus Wine consumption (liters of wine per person per year). Heart = 260.563 - 22.9688 Wine; S = 37.8786, R-Sq = 71.0 %, R-Sq(adj) = 69.3 %]
Pearson correlation of Wine and Heart = -0.843
Caution #5
• Ecological correlations are correlations that are based on rates or averages.
• Ecological correlations tend to overstate the strength of an association.
Example of Caution #5
• Data from 1988 Current Population Survey
• Treating individuals as the units
– Correlation between income and education for men age 25-64 in U.S. is r ≈ 0.4.
• Treating nine regions as the units
– Compute average income and average education for men age 25-64 in each of the nine regions.
– Correlation between the average incomes and the average education in U.S. is r ≈ 0.7.
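Why averaging inflates a correlation can be seen in a small constructed example. The data below are entirely made up (three groups rather than nine regions, generic x and y rather than education and income): within each group the deviations of x and y are unrelated, so only the group-to-group differences carry the association.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Assumed construction: group centers 0, 1, 2 and four individuals per group
# whose x and y deviations from the center are uncorrelated.
deviations = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
groups = {g: [(g + dx, g + dy) for dx, dy in deviations] for g in (0, 1, 2)}

# Individuals as the units.
xs = [px for pts in groups.values() for px, _ in pts]
ys = [py for pts in groups.values() for _, py in pts]
r_individual = pearson_r(xs, ys)

# Group averages as the units (an ecological correlation).
x_means = [sum(px for px, _ in pts) / len(pts) for pts in groups.values()]
y_means = [sum(py for _, py in pts) / len(pts) for pts in groups.values()]
r_group = pearson_r(x_means, y_means)

print(f"individuals: {r_individual:.2f}  group averages: {r_group:.2f}")
```

Averaging removes the within-group scatter, so the group-level correlation (1.00 here) overstates the individual-level one (0.40 here).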
Example of Caution #5
[Fitted line plot: Heart disease deaths (per 100,000 people) versus Wine consumption (liters of wine per person per year). Heart = 260.563 - 22.9688 Wine; S = 37.8786, R-Sq = 71.0 %, R-Sq(adj) = 69.3 %]
Example of Caution #5
[Fitted line plot: Mortality versus Latitude (at center of state). Mort = 389.189 - 5.97764 Lat; S = 19.1150, R-Sq = 68.0 %, R-Sq(adj) = 67.3 %]
Caution #6
• A “statistically significant” r² does not imply that the slope β1 is meaningfully different from 0.
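Sample size drives this caution. The sketch below uses the standard test statistic for a correlation, t* = r√(n-2)/√(1-r²), with hypothetical values of r and n chosen for illustration, and compares it with 1.96, the approximate two-sided 5 % critical value when the degrees of freedom are large.

```python
import math

def correlation_t_statistic(r, n):
    """t* = r * sqrt(n - 2) / sqrt(1 - r^2) for testing rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical values: a very weak correlation observed in a very large sample.
r = 0.03
n = 10_000

t_star = correlation_t_statistic(r, n)
print(f"r^2 = {100 * r ** 2:.2f} %  t* = {t_star:.2f}")
```

Here x accounts for under a tenth of a percent of the variation in y, yet t* is about 3, so the result would be declared statistically significant.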
Caution #7
• A large r² does not necessarily mean that a useful prediction of the response y_new (or estimation of the mean response μ_Y) can be made.
• It is still possible to get prediction (or confidence) intervals that are too wide to be useful.
Using the sample correlation r to learn about the population correlation ρ
Translating a research question into a statistical procedure
• Is there a linear relationship between skin cancer mortality and latitude?
– t-test for testing H0: β1 = 0
– ANOVA F-test for testing H0: β1 = 0
• Is there a linear correlation between husband’s age and wife’s age?
– t-test for testing population correlation coefficient H0: ρ = 0
Is there a linear correlation between husband’s age and wife’s age?
[Scatter plot: Husband's Age (years) versus Wife's Age (years)]
Pearson correlation of HAge and WAge = 0.939
Is there a linear correlation between husband’s age and wife’s age?
[Scatter plot: Wife's Age (years) versus Husband's Age (years)]
Pearson correlation of WAge and HAge = 0.939
The formal t-test for correlation coefficient ρ
Null hypothesis H0: ρ = 0
Alternative hypothesis HA: ρ ≠ 0 or ρ < 0 or ρ > 0
Test statistic: $t^* = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$
P-value = What is the probability that we’d get a t* statistic as extreme as we did, if the null hypothesis is true?
The P-value is determined by comparing t* to a t distribution with n-2 degrees of freedom.
Is there a linear correlation between husband’s age and wife’s age?
Test statistic:
$t^* = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.939\sqrt{170-2}}{\sqrt{1-0.939^2}} = 35.39$
Help in determining the P-value:
Student's t distribution with 168 DF
x         P( X <= x )
35.3900   1.0000
Just let Minitab do the work:
Pearson correlation of WAge and HAge = 0.939
P-Value = 0.000
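The arithmetic for the test statistic can be reproduced in a few lines. The function name `correlation_t_statistic` is my own; the inputs r = 0.939 and n = 170 are the values used on the slide.

```python
import math

def correlation_t_statistic(r, n):
    """t* = r * sqrt(n - 2) / sqrt(1 - r^2), compared to a t distribution with n - 2 DF."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = 0.939   # sample correlation of WAge and HAge
n = 170     # number of couples with both ages recorded

t_star = correlation_t_statistic(r, n)
degrees_of_freedom = n - 2
print(round(t_star, 2), degrees_of_freedom)  # 35.39 168
```

A statistic this far out in a t distribution with 168 degrees of freedom leaves essentially no tail area, which is why the P-value prints as 0.000.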
When is it okay to use the t-test for testing H0: ρ = 0?
• When it is not obvious which variable is the response.
• When the (x, y) pairs are a random sample from a bivariate normal population.
– For each x, the y’s are normal with equal variances.
– For each y, the x’s are normal with equal variances.
– Either, y can be considered a linear function of x.
– Or, x can be considered a linear function of y.
• The (x, y) pairs are independent.
The three tests will always yield similar results.
Pearson correlation of WAge and HAge = 0.939
P-Value = 0.000

The regression equation is
HAge = 3.59 + 0.967 WAge
170 cases used 48 cases contain missing values

Predictor     Coef  SE Coef      T      P
Constant     3.590    1.159   3.10  0.002
WAge       0.96670  0.02742  35.25  0.000

S = 4.069   R-Sq = 88.1%   R-Sq(adj) = 88.0%

Analysis of Variance
Source       DF     SS     MS        F      P
Regression    1  20577  20577  1242.51  0.000
Error       168   2782     17
Total       169  23359
The three tests will always yield similar results.
The regression equation is
WAge = 1.57 + 0.911 HAge
170 cases used 48 cases contain missing values

Predictor     Coef  SE Coef      T      P
Constant     1.574    1.150   1.37  0.173
HAge       0.91124  0.02585  35.25  0.000

S = 3.951   R-Sq = 88.1%   R-Sq(adj) = 88.0%

Analysis of Variance
Source       DF     SS     MS        F      P
Regression    1  19396  19396  1242.51  0.000
Error       168   2623     16
Total       169  22019

Pearson correlation of WAge and HAge = 0.939
P-Value = 0.000
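In simple linear regression the three statistics are algebraically tied together: the slope t statistic equals the correlation t* statistic, and the ANOVA F statistic is its square. The sketch below checks this on made-up data (the x and y values are assumptions for illustration, not the age data).

```python
import math

# Illustrative data only (assumed for this sketch).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)   # this is SSTO

b1 = s_xy / s_xx          # estimated slope
ssr = b1 * s_xy           # regression sum of squares
sse = s_yy - ssr          # error sum of squares
mse = sse / (n - 2)       # mean squared error

# 1. t-test for the slope: b1 divided by its standard error.
t_slope = b1 / math.sqrt(mse / s_xx)

# 2. ANOVA F-test: MSR / MSE, where MSR = SSR / 1.
f_stat = ssr / mse

# 3. t-test for the correlation.
r = s_xy / math.sqrt(s_xx * s_yy)
t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(f"t_slope={t_slope:.4f} t_corr={t_corr:.4f} F={f_stat:.4f}")
```

In the age output, F = 1242.51 is the square of T = 35.25 up to rounding, and the small gap between T = 35.25 and t* = 35.39 is consistent with r having been rounded to 0.939 before t* was computed.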
Which results should I report?
• If one of the variables can be clearly identified as the response, report the t-test or F-test results for testing H0: β1 = 0.
– Does it make sense to use x to predict y?
• If it is not obvious which variable is the response, report the t-test results for testing H0: ρ = 0.
– Does it only make sense to look for an association between x and y?