
1

Chapter 10

Linear regression and correlation

Relationship between variables

2

Relationship between variables

Age and blood pressure

Nutrient level and growth of cells

Height and weight

To determine the strength of the relationship between two variables and to test whether it is statistically significant

3

Two-sample t test vs regression

group 1   group 2
22        18
21        15
27        13
34        20
30        32

Observation   Group
22            1
21            1
27            1
34            1
30            1
18            2
15            2
13            2
20            2
32            2

4

Difference, variation and association analysis

Relationship                 Variable Y    Variable X
Two-sample t test            (quantity)    Group (0,1) (category)
One-way ANOVA                (quantity)    Group (A,B,C) (category)
Regression and correlation   (quantity)    (quantity)

5

Sir Francis Galton (16 February 1822 – 17 January 1911)

Polymath: meteorology (the anticyclone and the first popular weather maps); psychology (synaesthesia); biology (the nature and mechanism of heredity); eugenics; criminology (fingerprints); statistics (regression and correlation).


8

Related but Different

Regression analysis:

one of the variables (e.g. blood pressure) is dependent on (caused by) the other, which is fixed and measured without error (e.g. age).

Correlation analysis:

both variables are experimental and measured with error (e.g. height and weight).

9

Regression analysis

The experimental data (repeated experiments):

Recording number   X: temperature (°C)   Y: heart rate (beats/minute)
1                  2                     5
2                  4                     11
3                  6                     11
4                  8                     14
5                  10                    22
6                  12                    23
7                  14                    32
8                  16                    29
9                  18                    32

10

Correlation analysis

The experimental data (more individuals measured):

Animal   X: Length (cm)   Y: Width (cm)
1        10.7             5.8
2        11.0             6.0
3        9.5              5.0
4        11.1             6.0
5        10.3             5.3
6        10.7             5.8
7        9.9              5.2
8        10.6             5.7
9        10.0             5.3
10       12.0             6.3

11

Equation for a straight line

Ŷ = a + bX

If you know a and b, you can predict Y from X: the goal of regression analysis.

12

Regression vs correlation

13

Concepts

Simple linear regression

Simple linear correlation

Correlation analysis based on ranks

14

Example

Consider the growth rate of a yeast colony and the nutrient level.

If you increase nutrient level, the growth rate would increase.

Growth rate is dependent on nutrient level but nutrient level is NOT dependent on growth rate.

15

Growth rate is called the Dependent Variable and is given the symbol Y.

Nutrient level (the causal factor) is called the Independent Variable and is given the symbol X.

Variables in Regression

[Plot: growth rate (Y) against nutrient level (X).]

16

Simple linear model assumptions

a) X's are fixed and measured without error

b) E(Y|X) = μ_{Y|X} = α + βX

c) Y_i = α + βX_i + ε_i, with ε_i ~ i.i.d. N(0, σ²) (independent, identically, normally distributed errors)

d) Homoscedastic (the error variance σ² is the same at every X)

α, β: constant real numbers, β ≠ 0


18

General steps for simple linear regression analysis

① Graphing the data

② Fitting the best straight line

③ Testing whether the linear relationship is statistically significant or not

19

① Graphing the data

[Scatterplot panels: no relationship; relationship but not straight-lined; negative linear relationship; positive linear relationship.]

② Fitting the best straight line

Which line is best? We need a criterion.

20

Example: Area of a yeast colony on successive days.

[Plot: area (y) against time in days (x), with a straight line through the points. Slope (b) = H/L, the rise H over the run L; the intercept a is the value at x = 0.]

The best fit?

21

Problem

[Plot: area (y) against time in days (x).]

How to estimate a and b?

22

Method

Fit the horizontal line Ŷ = Ȳ (the estimate of E(Y) = μ_Y) to the data points (X_i, Y_i).

Total sum of squares for Y: SS_Total = Σ(Y_i − Ȳ)²

23

Method

Fit Ŷ = a + bX to the data points (X_i, Y_i).

Residual error sum of squares: SS_E = Σ(Y_i − Ŷ_i)²

a and b should minimize the residual error.

24

Method

Write each deviation from the mean as

(Y_i − Ȳ) = (Ŷ_i − Ȳ) + (Y_i − Ŷ_i)

25

Squaring and summing:

Σ(Y_i − Ȳ)² = Σ[(Ŷ_i − Ȳ) + (Y_i − Ŷ_i)]² = Σ(Ŷ_i − Ȳ)² + 2Σ(Ŷ_i − Ȳ)(Y_i − Ŷ_i) + Σ(Y_i − Ŷ_i)²

The cross term equals 0, so

Σ(Y_i − Ȳ)² = Σ(Ŷ_i − Ȳ)² + Σ(Y_i − Ŷ_i)²

Total sum of squares, SS_Total = sum of squares due to regression, SS_R (maximize) + sum of squares residual or error, SS_E (minimize)

26

Least squares regression equation

Minimize SS_Error by partial derivatives. Write the fitted line as Ŷ_i = a′ + b(X_i − X̄), so that

SS_Error = Σ(Y_i − Ŷ_i)² = Σ[Y_i − a′ − b(X_i − X̄)]²

Setting the partial derivative with respect to a′ to zero:

∂SS_Error/∂a′ = −2 Σ[Y_i − a′ − b(X_i − X̄)] = 0

Since Σ(X_i − X̄) = 0, this reduces to ΣY_i = na′, i.e.

a′ = ΣY_i/n = Ȳ

27

Least squares regression equation

Setting the partial derivative with respect to b to zero:

∂SS_Error/∂b = −2 Σ{[Y_i − a′ − b(X_i − X̄)](X_i − X̄)} = 0

Σ Y_i(X_i − X̄) − a′ Σ(X_i − X̄) − b Σ(X_i − X̄)² = 0

Since Σ(X_i − X̄) = 0, the middle term vanishes, giving

b = Σ(X_i − X̄)(Y_i − Ȳ) / Σ(X_i − X̄)² = [ΣXY − ΣXΣY/n] / [ΣX² − (ΣX)²/n] = SS_XY / SS_X

28

Result

Least squares regression line:

b = SS_XY / SS_X = [ΣXY − ΣXΣY/n] / [ΣX² − (ΣX)²/n]

a = Ȳ − bX̄

Ŷ = μ̂_{Y|X} = a + bX, or equivalently Ŷ = a′ + b(X − X̄) with a′ = Ȳ
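As an illustration (not part of the original slides), a minimal Python sketch of these least-squares formulas, applied to the temperature/heart-rate data from the earlier slide:

# Least-squares estimates computed from raw sums, as in the formulas above.
def least_squares(x, y):
    n = len(x)
    ss_x = sum(xi**2 for xi in x) - sum(x)**2 / n                    # SS_X
    ss_xy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y) / n   # SS_XY
    b = ss_xy / ss_x                     # slope
    a = sum(y)/n - b * sum(x)/n          # intercept: a = Ybar - b*Xbar
    return a, b

x = [2, 4, 6, 8, 10, 12, 14, 16, 18]     # temperature (°C)
y = [5, 11, 11, 14, 22, 23, 32, 29, 32]  # heart rate (beats/minute)
print(least_squares(x, y))               # roughly a = 2.14, b = 1.78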

29

③ Simple Linear Regression Analysis

A global test for regression (ANOVA)

A test for the regression coefficient (Student's t test)

30

Hypothesis

H0: The variation in Y is not explained by a linear model, i.e., β = 0

Ha: A significant portion of the variation in Y is explained by a linear model, i.e., β ≠ 0

31

Partitioning the Sum of Squares

SS_R = Σ(Ŷ_i − Ȳ)² = Σ[Ȳ + b(X_i − X̄) − Ȳ]² = b² Σ(X_i − X̄)² = b² SS_X = (SS_XY)²/SS_X = b·SS_XY

SS_Total = Σ(Y_i − Ȳ)² = ΣY² − (ΣY)²/n = SS_Y

SS_E = Σ(Y_i − Ŷ_i)² = SS_Total − SS_R

32

The ANOVA table for a regression analysis:

Source of variation   DF    SS         MS     E(MS)                F           c.v.
Regression            1     SS_R       MS_R   σ²_{Y·X} + β²SS_X    MS_R/MS_E   See Table C.7
Error                 n−2   SS_E       MS_E   σ²_{Y·X}
Total                 n−1   SS_Total

Test statistic: F = MS_R/MS_E ~ F(1, n−2)

If Ha is true (β ≠ 0): E(F) ≈ (σ²_{Y·X} + β²SS_X)/σ²_{Y·X} > 1

If H0 is true (β = 0): E(F) ≈ σ²_{Y·X}/σ²_{Y·X} = 1
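A short Python sketch of this F test (an illustration under the same notation; it assumes scipy is available):

# Build the regression ANOVA from the sums of squares and test F = MS_R/MS_E.
from scipy.stats import f as f_dist

def regression_anova(x, y):
    n = len(x)
    ss_x  = sum(xi**2 for xi in x) - sum(x)**2 / n
    ss_y  = sum(yi**2 for yi in y) - sum(y)**2 / n                   # SS_Total
    ss_xy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y) / n
    ss_r = ss_xy**2 / ss_x               # SS due to regression (df = 1)
    ss_e = ss_y - ss_r                   # residual SS (df = n - 2)
    F = ss_r / (ss_e / (n - 2))          # MS_R / MS_E
    p = f_dist.sf(F, 1, n - 2)           # upper-tail p value
    return F, p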

33

Coefficient of determination: a measure of the amount of the variability in Y that is explained by its dependence on X.

Coeff. of D. = SS_R / SS_Total

34

Simple Linear Regression Analysis

A global test for regression (ANOVA)

A test for the regression coefficient (Student's t test)

35

Hypothesis

H0: The variation in Y is not explained by a linear model, i.e., β = 0

Ha: A significant portion of the variation in Y is explained by a linear model, i.e., β ≠ 0

36

t test statistic

t = b/s_b ~ t(n − 2)

Variance of b: σ²_b = σ²_{Y·X} (1/SS_X)

Its estimate: s²_b = [SS_E/(n − 2)] (1/SS_X) = MS_E/SS_X

Standard error of b: s_b = √(MS_E/SS_X)

37

F(1, n−2) = t²(n−2)

ANOVA: F = MS_R/MS_E = b²SS_X/MS_E

Student's t: t = b/s_b = b/√(MS_E/SS_X)

Hence F = t².

38

Confidence interval for β

(b − β)/s_b follows Student's t distribution with df = n − 2:

P(−t_{1−α/2} ≤ (b − β)/s_b ≤ t_{1−α/2}) = 1 − α

Confidence interval: C(b − t_{1−α/2} s_b ≤ β ≤ b + t_{1−α/2} s_b) = 1 − α, with df = n − 2

L1 = b − t_{1−α/2} s_b,  L2 = b + t_{1−α/2} s_b

39

Confidence interval for μ_{Y|X}

Since Ŷ_i = μ̂_{Y|X} = Ȳ + b(X_i − X̄)

and s²_Ȳ = MS_E/n, s²_b = MS_E/SS_X,

the standard error (sampling error) of Ŷ_i is:

s_Ŷᵢ = √{ MS_E [ 1/n + (X_i − X̄)²/SS_X ] }

40

Confidence interval for μ_{Y|X}

(Ŷ − μ_{Y|X})/s_Ŷ follows Student's t distribution with df = n − 2.

Confidence interval: C(Ŷ − t_{1−α/2} s_Ŷ ≤ μ_{Y|X} ≤ Ŷ + t_{1−α/2} s_Ŷ) = 1 − α

L1 = Ŷ − t_{1−α/2} s_Ŷ,  L2 = Ŷ + t_{1−α/2} s_Ŷ

41

Confidence interval for an individual Y_i

Since Ŷ_i = Ȳ + b(X_i − X̄)

and s²_Ȳ = MS_E/n, s²_b = MS_E/SS_X,

the standard error (sampling error) for predicting a single new Y_i is:

s_Yᵢ = √{ MS_E [ 1 + 1/n + (X_i − X̄)²/SS_X ] }
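The two standard errors differ only by the extra 1 under the square root. A small Python sketch (an illustration, assuming the summary quantities n, X̄, SS_X, MS_E, a and b have already been computed) of both intervals at a chosen x0:

import math
from scipy.stats import t as t_dist

def intervals(x0, n, xbar, ss_x, ms_e, a, b, alpha=0.05):
    y_hat = a + b * x0
    se_mean  = math.sqrt(ms_e * (1/n + (x0 - xbar)**2 / ss_x))      # for mu_{Y|X}
    se_indiv = math.sqrt(ms_e * (1 + 1/n + (x0 - xbar)**2 / ss_x))  # for a new Y_i
    t_crit = t_dist.ppf(1 - alpha/2, n - 2)
    return ((y_hat - t_crit*se_mean,  y_hat + t_crit*se_mean),
            (y_hat - t_crit*se_indiv, y_hat + t_crit*se_indiv))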

42

Understand the regression analysis via example

43

Example 1: Yield of tomato varieties

          Variety 1   Variety 2
          22          18
          21          15
          27          13
          34          20
          30          32
          28          27
          21          11
          29          20
          22          16
          14          17
Totals:   248         189

Summarized data:

                      Variety 1   Variety 2
Mean                  24.8        18.9
St. dev.              5.8271      6.3675
Variance              33.96       40.54
No. of observations   10          10

44

A. Student's t test

H0: σ1² = σ2²: F = s1²/s2² = 33.96/40.54 = 0.84, p = 0.79. Accept H0: there is no difference between the two variances.

Pooled variance: s_p² = (40.54 + 33.96)/2 = 37.25

H0: μ1 = μ2:

t = [(24.8 − 18.9) − 0] / √[37.25(1/10 + 1/10)] = 5.9/2.73 = 2.16, with df = n1 + n2 − 2 = 18

p = 0.045. Reject H0: there is a difference between the two means.

45

B. ANOVA

H0: μ1 = μ2

Item      df   SS       MS       F(1,18)   P
Between   1    174.05   174.05   4.67*     <0.05
Within    18   670.5    37.25
Total     19   844.55

Conclusion: Reject H0

46

Compare ANOVA with t test

t was 2.16 for 18 df, 0.05 > P > 0.01

F was 4.67 for 1 and 18 df, 0.05 > P > 0.01

In fact, F = t² (i.e. 4.67 = 2.16²)

Why? Because with t we are dealing with differences, while with F we are dealing with variances (differences squared).

47

C. Regression

The yields from Example 1 are recoded: each yield is an observation Y, with variety as X.

Observation   Variety
22            1
21            1
27            1
34            1
30            1
18            2
15            2
13            2
20            2
32            2
…             …

48

Calculations

n = 20, ΣX = 30, ΣY = 437

ΣX² = 10(1²) + 10(2²) = 50, ΣY² = 10393, ΣXY = 626

SS_X = ΣX² − (ΣX)²/n = 50 − 30²/20 = 5

SS_Y = ΣY² − (ΣY)²/n = 10393 − 437²/20 = 844.55

SS_XY = ΣXY − ΣXΣY/n = 626 − (30)(437)/20 = −29.5

49

Estimation

SS_X = 5, SS_Y = 844.55, SS_XY = −29.5

Regression coefficient: b = SS_XY/SS_X = −29.5/5 = −5.9

Intercept at X̄: a′ = Ȳ = 21.85

Intercept: a = Ȳ − bX̄ = 21.85 − (−5.9)(1.5) = 30.7

Regression equation: Ŷ = 30.7 − 5.9X, or Ŷ = 21.85 − 5.9(X − 1.5)

50

Testing the significance: ANOVA

H0: no linear relation between Y and X, i.e., β = 0
Ha: the variation in Y is linearly explained by the variation in X, i.e., β ≠ 0

With the group coding, μ_{Y|X} follows the regression line:

E(Y | X = 1) = α + β
E(Y | X = 2) = α + 2β

so E(Y | X = 2) − E(Y | X = 1) = β.

Therefore H0: μ2 − μ1 = 0 is equivalent to H0: β = 0.

51

ANOVA sums of squares

SS_X = 5, SS_Y = 844.55, SS_XY = −29.5

SS_Total = SS_Y = 844.55

SS_R = (SS_XY)²/SS_X = (−29.5)²/5 = 174.05

SS_E = SS_Total − SS_R = 844.55 − 174.05 = 670.5

52

Regression ANOVA

Item         df   SS       MS       F       P
Regression   1    174.05   174.05   4.67*   <0.05
Error        18   670.5    37.25
Total        19   844.55

Conclusion: Reject H0

53

Coefficient of determination: a measure of the amount of the variability in y that is explained by its dependence on x.

Coeff. of D. = SS_R/SS_Total = 174.05/844.55 = 20.6%

54

Test for the regression coefficient

H0: β = 0

s_b = √(MS_E/SS_X) = √(37.25/5) = 2.729

t = b/s_b = −5.9/2.729 = −2.16 ~ t(18)

55

Example 2: Yeast data

A yeast colony grown on agar. Area measured (mm²) on 9 successive days, and area transformed to logs.

Time (days)   Area (log mm²)
1             3.6
2             3.8
3             4.2
4             4.5
5             5.0
6             5.2
7             5.5
8             5.6
9             6.1

Σx = 45, Σy = 43.5

56

Exponential

57

Scatterplot of yeast data

[Scatterplot: Area (log mm²) against Days.]

Apparent positive linear relationship

58

Nonlinear → linear

Power law: Y = aX^b, so log Y = log a + b log X

Exponential: Y = ab^X, so log Y = log a + X log b
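A brief Python sketch of the idea (illustration only; the x and y values here are hypothetical raw measurements, not from the slides):

import math
from scipy.stats import linregress

x = [1, 2, 3, 4, 5]
y = [2.1, 4.2, 7.9, 16.3, 31.8]          # hypothetical raw (untransformed) data

# Exponential model Y = a*b**X: regress log Y on X (semi-log).
exp_fit = linregress(x, [math.log10(v) for v in y])
# Power-law model Y = a*X**b: regress log Y on log X (log-log).
pow_fit = linregress([math.log10(v) for v in x], [math.log10(v) for v in y])

# exp_fit.slope estimates log10(b); pow_fit.slope estimates the exponent b.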

59

Calculations

n = 9, ΣX = 45, ΣY = 43.5

ΣX² = 1² + 2² + … + 9² = 285
ΣY² = 3.6² + 3.8² + … + 6.1² = 216.15
ΣXY = 1(3.6) + 2(3.8) + … + 9(6.1) = 236.2

SS_X = ΣX² − (ΣX)²/n = 285 − 45²/9 = 60

SS_Y = ΣY² − (ΣY)²/n = 216.15 − 43.5²/9 = 5.9

SS_XY = ΣXY − ΣXΣY/n = 236.2 − (45)(43.5)/9 = 18.7

60

Estimation

SS_X = 60, SS_Y = 5.9, SS_XY = 18.7

Regression coefficient: b = SS_XY/SS_X = 18.7/60 = 0.3117

Intercept at X̄: a′ = Ȳ = 4.83

Intercept: a = Ȳ − bX̄ = 4.83 − 0.3117(5) = 3.27

Regression equation: Ŷ = 3.27 + 0.3117X, or Ŷ = 4.83 + 0.3117(X − 5.0)

61

To fit the line

Regression equation: Ŷ = a + bX = 3.27 + 0.3117X

Use two extreme values of x, 0 and 9:
When x = 0, y = 3.27
When x = 9, y = 3.27 + 0.3117(9) = 6.08

62

Fitting the best line.

[Plot: the yeast data with the fitted line ŷ = 3.27 + 0.3117x, passing through (0, 3.27), (5, 4.83), and (9, 6.08).]

63

Testing the significance: ANOVA

H0: no linear relation between y and x, i.e., β = 0
Ha: the variation in y is linearly explained by the variation in x, i.e., β ≠ 0

Item         df    SS   MS   F   c.v.
Regression   1
Error        n−2
Total        n−1

64

ANOVA sums of squares

SS_X = 60, SS_Y = 5.9, SS_XY = 18.7

SS_Total = SS_Y = 5.9

SS_R = (SS_XY)²/SS_X = (18.7)²/60 = 5.8282

SS_E = SS_Total − SS_R = 5.9 − 5.8282 = 0.0718

65

Regression ANOVA

Item         df   SS       MS       F         P
Regression   1    5.8282   5.8282   565.8**   <0.01
Error        7    0.0718   0.0103
Total        8    5.9000

Conclusion: Reject H0 (of no relationship) and conclude that a significant portion of the variability in colony area is explained by regression on time.

66

Coefficient of determination: a measure of the amount of the variability in y that is explained by its dependence on x.

Coeff. of D. = SS_R/SS_Total = 5.8282/5.9 = 98.8%

67

Inference from yeast data

H0: Log area has no linear relationship with time
Ha: Log area has a linear relationship with time

Inference: Reject H0 and accept Ha, i.e. log area changes linearly with time. Moreover, the regression explains 98.8% of the variation.

68

Confidence interval for β

Standard error of b:

s_b² = MS_E/SS_X = 0.0103/60 = 0.0001717
s_b = √0.0001717 = 0.0131

t = b/s_b = 0.3117/0.0131 = 23.79, and t² = 565.8 = F

95% confidence interval for β:

L1 = 0.3117 − 2.365(0.0131) = 0.2807
L2 = 0.3117 + 2.365(0.0131) = 0.3427
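These numbers can be checked in a few lines of Python (a sketch using scipy's linregress, which reports the slope and its standard error directly):

from scipy.stats import linregress, t as t_dist

days = [1, 2, 3, 4, 5, 6, 7, 8, 9]
log_area = [3.6, 3.8, 4.2, 4.5, 5.0, 5.2, 5.5, 5.6, 6.1]

res = linregress(days, log_area)             # slope ~ 0.3117, stderr ~ 0.0131
t_stat = res.slope / res.stderr              # ~ 23.79; t**2 ~ 565.8 = F
t_crit = t_dist.ppf(0.975, len(days) - 2)    # 2.365 for df = 7
print(res.slope - t_crit*res.stderr,         # ~ 0.2807
      res.slope + t_crit*res.stderr)         # ~ 0.3427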

69

Confidence interval for μ_{Y|X}

At X_i = 5: Ŷ = μ̂_{Y|5} = 4.83 + 0.3117(5 − 5) = 4.83

Standard error:

s_Ŷ = √{ MS_E [1/n + (X_i − X̄)²/SS_X] } = √{ 0.0103 [1/9 + (5 − 5)²/60] } = 0.034

95% confidence interval limits (df = 7):

L1 = 4.83 − 2.365(0.034) = 4.75
L2 = 4.83 + 2.365(0.034) = 4.91

What happens to this interval as X_i moves away from X̄?

70

Summary of regression

Regression analysis is used when one variable is fixed (x) and likely to cause variation in the other (y)

Graph the data to ascertain whether a linear relationship is apparent

Calculate the regression equation using the least squares method

Test the significance of this equation with ANOVA

If significant, plot the equation on the graphed data

Calculate required confidence intervals

71

Example 3: The effect of carbon dioxide on respiration rate

     Partial pressure CO2 (torr)   Respiration rate (breaths/minute)
1    30                            8.1
2    32                            8.0
3    34                            9.9
4    36                            11.2
5    38                            11.0
6    40                            13.2
7    42                            14.6
8    44                            16.6
9    46                            16.7
10   48                            18.3
11   50                            18.2

① Construct a scatterplot of these data
② Compute the linear regression equation
③ Test the significance of this equation via ANOVA
④ Calculate the 95% CI for β
⑤ Find the predicted respiration rate for 48 torr and the 95% CI
⑥ Find the predicted respiration rate for 38 torr and the 95% CI
⑦ Why do these two CIs have different lengths?

72

[Scatterplot of CO2 pressure against respiration rate.]

Positive linear relationship

73

Compute the linear regression equation

n = 11
ΣX = 440, X̄ = 40, ΣX² = 18040
ΣY = 145.8, Ȳ = 13.25, ΣY² = 2082.04
ΣXY = 6085

b = [ΣXY − ΣXΣY/n] / [ΣX² − (ΣX)²/n] = [6085.0 − (440)(145.8)/11] / [18040 − 440²/11] = 253/440 = 0.575

a = Ȳ − bX̄ = 13.2545 − 0.575(40) = −9.745

Ŷ = −9.745 + 0.575X

74

[Fitted line plot of res rate against CO2: res rate = −9.745 + 0.5750 CO2; S = 0.671009, R-Sq = 97.3%, R-Sq(adj) = 97.0%.]

75

Test the significance

SS_Total = ΣY² − (ΣY)²/n = 2082.04 − 145.8²/11 = 149.53

SS_R = [ΣXY − ΣXΣY/n]² / [ΣX² − (ΣX)²/n] = 253²/440 = 145.47

SS_E = SS_Total − SS_R = 149.53 − 145.47 = 4.05

Item         df   SS       MS       F          P
Regression   1    145.47   145.47   323.10**   <0.01
Remainder    9    4.05     0.45
Total        10   149.53

Conclusion: Reject H0 and accept Ha; respiration rate changes linearly with partial pressure of CO2.

76

95% confidence limits:

L1 = b − t_{1−α/2} s_b,  L2 = b + t_{1−α/2} s_b
L1 = Ŷ − t_{1−α/2} s_Ŷ,  L2 = Ŷ + t_{1−α/2} s_Ŷ

t_{1−α/2} = ? (2.365, 2.306, 2.262 or 2.228, for df 7, 8, 9, 10; here df = n − 2 = 9, so t = 2.262)

The CI for β: [0.503, 0.647]
The CI for ŷ at 48 torr: [17.53, 18.18], length 0.65
The CI for ŷ at 38 torr: [11.89, 12.32], length 0.43

77

The two CIs differ in length because 48 torr is further from the average partial pressure of CO2 (40 torr) than 38 torr is.

The CI length increases as the partial pressure gets further from the mean value, because the length is determined by the standard error of Ŷ, and that standard error grows with the squared difference between the individual partial pressure and the average partial pressure:

s_Ŷᵢ = √{ MS_E [1/n + (X_i − X̄)²/SS_X] }

L1 = Ŷ − t_{1−α/2} s_Ŷ,  L2 = Ŷ + t_{1−α/2} s_Ŷ

78

Confidence interval for μ_{Y|X}

Since Ŷ_i = Ȳ + b(X_i − X̄), with s²_Ȳ = MS_E/n and s²_b = MS_E/SS_X, the standard error (sampling error) of Ŷ_i is:

s_Ŷᵢ = √{ MS_E [1/n + (X_i − X̄)²/SS_X] }

For an individual Y_i the standard error is:

s_Yᵢ = √{ MS_E [1 + 1/n + (X_i − X̄)²/SS_X] }

79

[Plot: the CO2 data with the fitted line y = 0.575x − 9.7455 (R² = 0.9729), with confidence bands for the mean response μ_{Y|X} and wider bands for individual Y_i.]

80

Concepts

Simple linear regression

Simple linear correlation

Correlation analysis based on ranks

81

Correlation analysis

To measure the intensity of association observed between any pair of variables and to test its statistical significance

To test whether two variables covary, i.e. are interdependent


83

Correlation analysis

Two variables, X and Y, are of equal status; we can't tell which is cause and which is effect.

There may be some common underlying cause of both.

Examples:

height and weight

height of siblings

84

Types of relationship.

Positive correlation: large X's are associated with large Y's

Negative correlation: large X's are associated with small Y's

No correlation: Y and X have no linear correlation

85

No linear correlation ≠ independent

Example: Y = X². Y is completely dependent on X, yet the linear correlation is zero.

Linear correlation implies dependence, but zero linear correlation does not imply independence.

86

Index of association

Σ [(X_i − X̄)/s_X] [(Y_i − Ȳ)/s_Y]

(the products of the standardized normal deviates for X and for Y)

Positive correlation: large X's associated with large Y's, so Σ(X_i − X̄)(Y_i − Ȳ) > 0

Negative correlation: large X's associated with small Y's, so Σ(X_i − X̄)(Y_i − Ȳ) < 0

87

Pearson product-moment correlation coefficient

Abbr.: Pearson correlation coefficient

r = Σ(X_i − X̄)(Y_i − Ȳ) / [(n − 1) s_X s_Y]
  = SS_XY / √(SS_X · SS_Y)
  = [ΣXY − ΣXΣY/n] / √{ [ΣX² − (ΣX)²/n] [ΣY² − (ΣY)²/n] }

where SS_X and SS_Y are the corrected sums of squares of X and of Y, SS_XY is the corrected cross products, and n is the sample size.
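A minimal Python sketch of this formula (illustration only, not from the slides):

import math

def pearson_r(x, y):
    n = len(x)
    ss_x  = sum(xi**2 for xi in x) - sum(x)**2 / n   # corrected SS of X
    ss_y  = sum(yi**2 for yi in y) - sum(y)**2 / n   # corrected SS of Y
    ss_xy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y) / n  # cross products
    return ss_xy / math.sqrt(ss_x * ss_y)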

88

Pearson correlation coefficient

A widely used index of association between two quantitative variables

An estimate of the population correlation coefficient ρ = σ_XY/(σ_X σ_Y), where COV(X, Y) = σ_XY = E[(X − E(X))(Y − E(Y))]

89

Linear correlation model

Population correlation coefficient:

ρ = (1/N) Σ [(x − μ_x)/σ_x] [(y − μ_y)/σ_y] = Σ(x − μ_x)(y − μ_y) / √{ Σ(x − μ_x)² Σ(y − μ_y)² }

(products of the standard normal deviates of X and of Y)

Sample correlation coefficient:

r = [1/(n − 1)] Σ [(X − X̄)/s_x] [(Y − Ȳ)/s_y] = SS_XY / √(SS_X · SS_Y)

90

Linear correlation model

Y's at each X are normally distributed; X's at each Y are normally distributed.

The conditional population means are

μ_{Y/X} = μ_Y + β_{Y/X}(X − μ_X)
μ_{X/Y} = μ_X + β_{X/Y}(Y − μ_Y)

which follow the regression equations:

Ŷ = Ȳ + b_{Y/X}(X − X̄)
X̂ = X̄ + b_{X/Y}(Y − Ȳ)

91

Regression vs correlation

b_{Y|X} = SS_XY/SS_X;  b_{X|Y} = SS_XY/SS_Y

b_{Y|X} · b_{X|Y} = (SS_XY)² / (SS_X · SS_Y) = r²

SS_R = b²SS_X = (SS_XY)²/SS_X

Coeff. of D. = SS_R/SS_Total, where SS_R is the explainable variability and SS_Total = SS_Y is the total variability, so

r² = (SS_XY)²/(SS_X · SS_Y) = SS_R/SS_Y = Coeff. of D.

92

Characteristics of the correlation coefficient

r² = SS_R/SS_Total = Coeff. of D., so r = ±√(SS_R/SS_Total)

SS_Total = SS_R + SS_E and 0 ≤ SS_R ≤ SS_Total

Therefore 0 ≤ SS_R/SS_Total ≤ 1, i.e. 0 ≤ r² ≤ 1

Hence −1 ≤ r ≤ 1

93

r = +1: a complete positive correlation between x and y

r = −1: a complete negative correlation between x and y

r = 0: x and y are not linearly correlated

Correlation has upper and lower limits of +1 and −1 respectively.

94

Test of hypothesis: t test

To test if r differs from zero. Hypotheses:

H0: ρ = 0
Ha: ρ ≠ 0

The standard error of r is: s_r = √[(1 − r²)/(n − 2)]

t = (r − 0)/s_r = r / √[(1 − r²)/(n − 2)] follows the t distribution with df = n − 2

Critical value of r for given n and α: r = t_{1−α/2} / √(n − 2 + t²_{1−α/2})

95

Regression = correlation

r² = SS_R/SS_Total = Coeff. of D. = 1 − SS_E/SS_Total, so 1 − r² = SS_E/SS_Total

t_r = r / √[(1 − r²)/(n − 2)] = [SS_XY/√(SS_X · SS_Y)] · √[SS_Total (n − 2)/SS_E] = SS_XY / √(SS_X · MS_E)

t_b = b/s_b = (SS_XY/SS_X) / √(MS_E/SS_X) = SS_XY / √(SS_X · MS_E), with df = n − 2

So the t test for r and the t test for b are the same test.

96

Understand the correlation analysis via example

97

① Preliminary calculations

The length and width of the eight overlapping plates composing the shell, for 10 Chiton olivaceous:

Animal   X: Length (cm)   Y: Width (cm)
1        10.7             5.8
2        11.0             6.0
3        9.5              5.0
4        11.1             6.0
5        10.3             5.3
6        10.7             5.8
7        9.9              5.2
8        10.6             5.7
9        10.0             5.3
10       12.0             6.3

n = 10
ΣX = 105.8, X̄ = 10.58, ΣX² = 1123.9
ΣY = 56.4, Ȳ = 5.64, ΣY² = 319.68
ΣXY = 599.31

98

② Calculate the correlation coefficient:

r = [ΣXY − ΣXΣY/n] / √{ [ΣX² − (ΣX)²/n] [ΣY² − (ΣY)²/n] }
  = [599.31 − (105.8)(56.4)/10] / √{ [1123.9 − 105.8²/10] [319.68 − 56.4²/10] }
  = 0.969

③ Test of significance

Hypotheses: H0: ρ = 0; Ha: ρ ≠ 0

Test statistic: t = (r − 0)/s_r = (0.969 − 0)/√[(1 − 0.969²)/(10 − 2)] = 0.969/0.0871 = 11.14

Since 11.14 = t > t_{0.05(8)} = 2.306, reject H0.

④ Confidence interval ???
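As a check (a sketch, assuming scipy is available), the same r and an equivalent p value come straight from scipy.stats.pearsonr:

from scipy.stats import pearsonr

length = [10.7, 11.0, 9.5, 11.1, 10.3, 10.7, 9.9, 10.6, 10.0, 12.0]
width  = [5.8, 6.0, 5.0, 6.0, 5.3, 5.8, 5.2, 5.7, 5.3, 6.3]

r, p = pearsonr(length, width)
print(r, p)   # r ~ 0.969; p far below 0.05, matching t = 11.14 > 2.306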

99

Need different tests of ρ

We can ONLY use the t test for the H0 that ρ = 0, because only in this case (independence) is the distribution of r approximately normal. In all other cases the distribution of r is asymmetrical, and so we have to use a different test, as follows.

100

Distribution of r in samples

[Sketch: the sampling distribution of r over (−1, +1) is symmetric when ρ = 0 but strongly skewed when ρ = 0.8.]

Fisher's Z transformation must be employed: Z is the inverse hyperbolic tangent of r.

101

Fisher’s Z Transformation

Transform r to Z: hyperbolic tangent

Z follow normal distribution approximately

Standard error of Z

1

tanh

1 1 1tanh ln [ln(1 ) ln(1 )]

2 1 2

z z

z z

e er z

e er

z r r rr

1( , )

2( 1) 3N

n n

1

3Z n

102

Testing when ρ0 ≠ 0

Hypotheses: H0: ρ = ρ0; Ha: ρ ≠ ρ0

With Z = tanh⁻¹ r and ζ0 = tanh⁻¹ ρ0, the test statistic follows an approximately normal distribution:

u = (Z − ζ0)/σ_Z = (tanh⁻¹ r − tanh⁻¹ ρ0) / (1/√(n − 3)) ~ N(0, 1)

103

Confidence interval for the correlation coefficient

95% CI for ζ:

L1 = Z − 1.960 σ_Z = Z − 1.960/√(n − 3)
L2 = Z + 1.960 σ_Z = Z + 1.960/√(n − 3)

95% CI for ρ:

ρ1 = tanh L1
ρ2 = tanh L2
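A compact Python sketch of the whole Fisher-Z machinery (illustration only; math.atanh and math.tanh do the transform and its inverse):

import math

def fisher_z_test_and_ci(r, rho0, n):
    z, zeta0 = math.atanh(r), math.atanh(rho0)   # Z = tanh^-1(r)
    se = 1 / math.sqrt(n - 3)                    # standard error of Z
    u = (z - zeta0) / se                         # ~ N(0,1) under H0: rho = rho0
    lo, hi = z - 1.960*se, z + 1.960*se          # 95% CI on the zeta scale
    return u, math.tanh(lo), math.tanh(hi)       # back-transform the CI to rho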

104

Example when expected ρ ≠ 0

For genetical reasons, the correlation of height among sibs (brothers or sisters) is expected to be 0.5, i.e., H0 is that ρ = 0.5.

Family   X: brother (cm)   Y: sister (cm)
1        71                69
2        68                64
3        66                65
4        67                64
5        70                65
6        71                62
7        70                62
8        73                64
9        72                66
…        …                 …
50       66                62

r = SS_XY / √(SS_X · SS_Y) = 0.558;  r² = 0.3114

105

Test of significance when ρ0 ≠ 0

For ρ0 = 0.5: ζ0 = ½ ln[(1 + 0.5)/(1 − 0.5)] = 0.5493

For r = 0.558: Z = ½ ln[(1 + 0.558)/(1 − 0.558)] = 0.63

Standard error: σ_Z = 1/√(50 − 3) = 0.146

u = (Z − ζ0)/σ_Z = (0.63 − 0.5493)/0.146 = 0.55 < 1.960

Therefore we cannot reject H0.

106

Example when expected ρ ≠ 0

95% CI for ζ:

L1 = 0.63 − 1.960(0.146) = 0.63 − 0.2859 = 0.3441
L2 = 0.63 + 1.960(0.146) = 0.63 + 0.2859 = 0.9159

Back-transforming to the ρ scale:

ζ:   0.3441   0.63    0.9159
ρ:   0.3311   0.558   0.7240

107

Concepts

Simple linear regression

Simple linear correlation

Correlation analysis based on ranks

108

Bivariate random sample (not normally distributed)

Examples: first-grade students' ages and their performance on a standardized test; student heights and GPAs

To test the relationship between the two variables

To validate dependence of two random variables

109

The characteristics for an index of association

Values only between −1 and +1, inclusive

The stronger the positive correlation, the closer the value is to +1

The stronger the negative correlation, the closer the value is to −1

For uncorrelated pairs of X and Y, the value should be close to 0

110

Kendall correlation coefficient τ

Directly compare the n observations with each other. For each pair (i, j):

Concordant (C): (X_i − X_j)(Y_i − Y_j) > 0 (positive association)
Discordant (D): (X_i − X_j)(Y_i − Y_j) < 0 (negative association)
Tie (E): (X_i − X_j)(Y_i − Y_j) = 0

The total number of comparisons is n(n − 1)/2 = C + D + E, and C − D is the difference between the number of concordant and discordant pairs:

τ = (C − D) / [n(n − 1)/2] = 2(C − D) / [n(n − 1)]
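A direct Python sketch of this definition (illustration only; the pairwise loop mirrors the comparison above, so it is O(n²)):

from itertools import combinations

def kendall_tau(x, y):
    n = len(x)
    c = d = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            c += 1          # concordant pair
        elif s < 0:
            d += 1          # discordant pair
        # s == 0 is a tie (E) and enters neither count
    return 2 * (c - d) / (n * (n - 1))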

111

① Rank X and Y
② Compare each observation with the other observations below it
③ Summarize C, D and E

X_i   Y_i   Rank X_i   Rank Y_i   Concordant below   Discordant below   Tied below
1     53    1          12         0                  11                 0
2     47    2          11         0                  10                 0
3     32    3          5.5        4                  4                  1
4     42    4          10         0                  8                  0
5     35    5          7          2                  5                  0
6     32    6          5.5        2                  4                  0
7     37    7          8          1                  4                  0
8     38    8          9          0                  4                  0
9     27    9          2.5        1                  1                  1
10    24    10         1          2                  0                  0
11    29    11         4          0                  1                  0
12    27    12         2.5        0                  0                  0

C = 0 + 0 + 4 + … = 12;  D = 11 + 10 + 4 + … = 52;  E = 1 + 1 = 2

④ Calculate the Kendall correlation coefficient:

τ = 2(C − D)/[n(n − 1)] = 2(12 − 52)/[12(12 − 1)] = −0.606

112

The relative age effects on academic and social performance in Geneva.

X: the month after the cut-off date
Y: the number of students in grades K through 4 evaluated for the district's Gifted and Talented Student Program

Birth month   Month after cut-off   Students evaluated
Dec           1                     53
Jan           2                     47
Feb           3                     32
Mar           4                     42
Apr           5                     35
May           6                     32
Jun           7                     37
Jul           8                     38
Aug           9                     27
Sep           10                    24
Oct           11                    29
Nov           12                    27

The older students tend to be overrepresented and the younger ones underrepresented.

113

⑥ Test significance: two-tailed test

Hypotheses: H0: τ = 0; Ha: τ ≠ 0

Test statistic (splitting the ties between the two counts): min(C + E/2, D + E/2) = 12 + 1 = 13

From Table C.12 for n = 12: p = 2(0.0027) = 0.0054 < 0.01, so reject H0.

Conclusion: there is a moderately strong negative correlation between month after cut-off date and referrals for gifted evaluation.

114

Spearman’s rank coefficient rs

One of the most common correlation coefficients

Idea: Rank X and Y observations separately and

compute the Pearson correlation coefficient on the

ranks rather than on the original data.

Advantage: to compute more simply

2

12

61

( 1)

n

iis

dr

n n

i ii x ywhere d r r
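A short Python sketch of this formula (illustration only; it assumes no tied ranks, which is what the d_i² shortcut requires):

def spearman_rs(x, y):
    n = len(x)
    def ranks(v):                       # rank 1 for the smallest value
        order = sorted(range(n), key=lambda i: v[i])
        r = [0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b)**2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n**2 - 1))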

115

With no ties, the ranks r_x and r_y are each a permutation of 1, …, n, so:

Σ r_{xi} = Σ r_{yi} = n(n + 1)/2, hence r̄_x = r̄_y = (n + 1)/2

Σ r_{xi}² = Σ r_{yi}² = n(n + 1)(2n + 1)/6

SS_rx = SS_ry = n(n + 1)(2n + 1)/6 − n[(n + 1)/2]² = n(n² − 1)/12

Since d_i = r_{xi} − r_{yi},

Σ d_i² = SS_rx + SS_ry − 2 Σ (r_{xi} − r̄_x)(r_{yi} − r̄_y)

so the corrected cross products of the ranks are

Σ (r_{xi} − r̄_x)(r_{yi} − r̄_y) = [SS_rx + SS_ry − Σ d_i²]/2 = n(n² − 1)/12 − Σ d_i²/2

Applying Pearson's formula to the ranks:

r_s = [n(n² − 1)/12 − Σ d_i²/2] / √(SS_rx · SS_ry) = [n(n² − 1)/12 − Σ d_i²/2] / [n(n² − 1)/12]
    = 1 − 6 Σ d_i² / [n(n² − 1)]

116

X: the month
Y: the difference between actual and expected numbers of professional players

Month Actual players Expected players Difference

1 37 28.27 8.73

2 33 27.38 5.62

3 40 26.26 13.74

4 25 27.60 -2.60

5 29 29.16 -0.16

6 33 30.05 2.95

7 28 31.38 -3.38

8 25 31.83 -6.83

9 25 31.16 -6.16

10 23 30.71 -7.71

11 30 30.93 -0.93

12 27 30.27 -3.27

The distribution of German professional players' birthdays and that of the general population of Germany.

117

① Rank X and Y
② Compute the differences d_i = r_x − r_y
③ Calculate r_s

X: Month   Y: Difference   r_x   r_y   d_i
1          8.73            1     11    −10
2          5.62            2     10    −8
3          13.74           3     12    −9
4          −2.60           4     6     −2
5          −0.16           5     8     −3
6          2.95            6     9     −3
7          −3.38           7     4     3
8          −6.83           8     2     6
9          −6.16           9     3     6
10         −7.71           10    1     9
11         −0.93           11    7     4
12         −3.27           12    5     7

Σ d_i² = (−10)² + (−8)² + … + 7² = 494

r_s = 1 − 6 Σ d_i² / [n(n² − 1)] = 1 − 6(494)/[12(12² − 1)] = −0.727

A negative correlation.

118

④ Test significance

Hypotheses: H0: ρ_s = 0; Ha: ρ_s ≠ 0

Table C.13 critical value = 0.587, for n = 12 and α = 0.05.

Since |r_s| = 0.727 > 0.587, reject H0.

Inference: there is an excess in the number of players born early in the competition year, and a lack of those born late, among professional soccer players in Germany.

119

Compare the two coefficients

Kendall's correlation coefficient τ: easy to test for a significant difference from zero

Spearman's rank correlation coefficient r_s: more common, related to Pearson's correlation coefficient

The two produce no radically different correlation values.

120

Comparing different regressions

A common situation is either to compare your regression with a published one, or to compare the regressions in two or more experiments you have done.

Two linear regression equations:

Ŷ1 = a1 + b1X1
Ŷ2 = a2 + b2X2

121

Compare intercepts: using a t test

Hypothesis: H0: α1 = α2, or α1 − α2 = 0

t = (a1 − a2)/s_{(a1−a2)} follows the t distribution with df = (n1 − 2) + (n2 − 2)

Standard error for a1 − a2:

s_{(a1−a2)} = √{ s²_{Y·X} [ 1/n1 + 1/n2 + X̄1²/SS_X(1) + X̄2²/SS_X(2) ] }

using the pooled residual variance

s²_{Y·X} = [ s²_{Y·X(1)}(n1 − 2) + s²_{Y·X(2)}(n2 − 2) ] / [ (n1 − 2) + (n2 − 2) ]

122

Compare slopes: using a t test

Hypothesis: H0: β1 = β2, or β1 − β2 = 0

t = (b1 − b2)/s_{(b1−b2)} follows the t distribution with df = (n1 − 2) + (n2 − 2)

Standard error for b1 − b2:

s_{(b1−b2)} = √{ s²_{Y·X} [ 1/SS_X(1) + 1/SS_X(2) ] }

using the same pooled residual variance s²_{Y·X} as above.
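A Python sketch of this slope comparison (illustration only; the arguments are the per-experiment summary statistics under the notation above):

import math
from scipy.stats import t as t_dist

def compare_slopes(b1, ss_x1, ss_e1, n1, b2, ss_x2, ss_e2, n2):
    df = (n1 - 2) + (n2 - 2)
    s2 = (ss_e1 + ss_e2) / df                    # pooled residual variance
    se = math.sqrt(s2 * (1/ss_x1 + 1/ss_x2))     # standard error of b1 - b2
    t = (b1 - b2) / se
    p = 2 * t_dist.sf(abs(t), df)                # two-tailed p value
    return t, df, p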

123

Combining data from two experiments

If the data from the two experiments are not significantly different, they can be combined.

Combined estimate of b:

b = [ SS_XY(1) + SS_XY(2) ] / [ SS_X(1) + SS_X(2) ]

Standard error:

s_b = √{ [ (SS_E(1) + SS_E(2)) / ((n1 − 2) + (n2 − 2)) ] / [ SS_X(1) + SS_X(2) ] }

Ideally, though, one should combine the original data.

124

Two experiments

[Sketch: two y-against-x panels. Left: Expt. 1 and Expt. 2 have the same slopes but different means. Right: Expt. 1 and Expt. 2 have different slopes; the means may or may not differ.]

125

Compare two different correlations

Use Fisher's Z transformation. Hypothesis: H0: ρ1 = ρ2, or ζ1 = ζ2

Variance for z1 − z2: σ²_{z1−z2} = 1/(n1 − 3) + 1/(n2 − 3)

u = [(z1 − z2) − (ζ1 − ζ2)] / √[1/(n1 − 3) + 1/(n2 − 3)] ~ N(0, 1)
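A final Python sketch of this test (illustration only):

import math
from scipy.stats import norm

def compare_correlations(r1, n1, r2, n2):
    z1, z2 = math.atanh(r1), math.atanh(r2)      # Fisher's Z for each r
    se = math.sqrt(1/(n1 - 3) + 1/(n2 - 3))
    u = (z1 - z2) / se                           # ~ N(0,1) under H0: rho1 = rho2
    return u, 2 * norm.sf(abs(u))                # two-tailed p value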