Correlation_and_Simple_Linear_Regression
Transcript of Correlation_and_Simple_Linear_Regression
8/7/2019 Correlation_and_Simple_Linear_Regression
Correlation and Simple Linear Regression
Department of Health Informatics
BINF 5210, Spring 2011
Correlation Analysis
It is used to measure the linear association (the degree to which they are related) between two quantitative variables measured on the same subjects.
For example, if you want to examine the relationship between the height and weight of a group of children ages 8 to 10 to investigate physical growth, correlation analysis might be a good option for you.
Plotting the variables of interest in a scatter plot and then examining the relationship visually is one way of examining correlation, and it is a recommended practice.
Pearson's product-moment correlation, or Pearson's correlation, is the most commonly used measure of correlation between two quantitative variables.
Pearson's Correlation
Pearson's product-moment correlation measured on a population is denoted ρ (Greek letter rho), which is the measure of the degree to which the variables of interest (two quantitative variables) are related. When measured (estimated) on a sample, it is designated r (Pearson's r).
It measures the extent (degree) to which the points in a scatter plot of the variables of interest fall on a straight line (a linear relationship).
The value of Pearson's correlation ranges from +1 to -1: +1 for a perfect positive correlation, -1 for a perfect negative correlation, and 0 for no correlation (zero correlation).
Calculating Pearson's Correlation
Let's say we want to verify the correlation between variable X and variable Y (both quantitative variables) in a sample dataset.
The formula to calculate Pearson's correlation is:
r = [ΣXY − (ΣX)(ΣY)/N] / √[(ΣX² − (ΣX)²/N) (ΣY² − (ΣY)²/N)]
N is the number of elements (observations or subjects)
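As a quick sanity check of this formula outside SAS, here is a minimal Python sketch (Python is used purely for illustration; the course itself uses SAS) that computes r directly from the raw-score formula above:

```python
import math

def pearson_r(x, y):
    """Pearson's r via the raw-score (computational) formula on the slide."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = sum_xy - (sum_x * sum_y) / n
    denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator

# Perfectly linear data gives r = +1; reversing the direction gives r = -1.
x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # 1.0 (perfect positive correlation)
print(pearson_r(x, [10, 8, 6, 4, 2]))   # -1.0 (perfect negative correlation)
```

The two calls confirm the extreme values of the range described earlier: points falling exactly on a rising line give r = +1, and points on a falling line give r = -1.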
Correlation in SAS
SAS provides a procedure called PROC CORR for estimating the correlation coefficient between two (quantitative) variables.
It tests the hypotheses:
H0 (Null Hypothesis): There is no linear relationship between the two variables of interest (Pearson's r = 0)
Ha (Alternative Hypothesis): There is a linear relationship between the two variables of interest (Pearson's r ≠ 0)
and determines whether the estimated correlation coefficient differs significantly from 0.
PROC CORR Assumptions
The data are a random sample drawn from a (bivariate) normally distributed population.
If the population is not normal, use a nonparametric correlation estimation procedure (the most common is Spearman's rho).
PROC CORR provides Spearman's rho as well, but you have to request it with a PROC CORR option.
Spearman's correlation can be calculated by ranking each of the values of the variables of interest and then applying the Pearson correlation coefficient method to the ranks of the variables.
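The rank-then-Pearson recipe just described can be sketched in a few lines of Python (again, an illustration outside SAS; in SAS you would simply add the SPEARMAN option):

```python
import math

def ranks(values):
    """1-based ranks; tied values receive the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1   # mean of 1-based ranks i+1 .. j+1
        i = j + 1
    return r

def pearson_r(x, y):
    """Pearson's r (deviation-score form)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of the variables."""
    return pearson_r(ranks(x), ranks(y))

# A monotone but non-linear relationship: the ranks line up perfectly,
# so Spearman's rho is exactly 1 even though the raw relationship is curved.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]     # y = x**3
print(spearman_rho(x, y))   # 1.0
```

This also illustrates why Spearman's rho is the recommended fallback for non-normal data: it depends only on the ordering of the values, not on their actual magnitudes.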
PROC CORR Structure
PROC CORR <options>;
<statements>;
<options>: commonly used ones are
DATA=your_dataset_name
SPEARMAN (to request the nonparametric test for a non-normal population). You can also use NOSIMPLE (do not display simple statistics) and NOPROB (do not display the probability value, the p-value).
Statements can be:
VAR variables_of_interest;
BY variable; /* Optional (for a categorical variable; it will produce output separately for each category level) */
WITH variable; /* Optional (when you want the correlation between the variables in the VAR list and other variables, listed in the WITH list) */
PROC CORR Example
Consider the data set for this assignment (the external, tab-delimited text file smoke_drug in my documents). All data are numerical:
First column: Gender (1 = male, 2 = female)
Second column: Age
Third column: Race of subject (1 = white, 2 = black, 3 = Hispanic, 4 = other)
Fourth column: Smoker? (1 = yes, 2 = no)
Fifth column: Systolic blood pressure
Sixth column: Diastolic blood pressure
As an investigator, you are interested in examining the relationship between age and (systolic and diastolic) blood pressure in randomly selected subjects as part of a clinical trial.
PROC CORR in SAS
First we read the data into SAS:
data mydata;
INFILE "C:\smoke_drug.txt" DLM ='09'x;
INPUT GENDER AGE RACE SMOKER SYSTOLIC DIASTOLIC;
RUN;
Then we run PROC CORR on the variables of our interest
ODS HTML;
PROC CORR DATA=MYDATA;
VAR AGE SYSTOLIC DIASTOLIC; /* list of the variables we are interested in; this will generate correlations for the variables pairwise (3 pairs) */
TITLE 'CORRELATION OF AGE SYSTOLIC, AGE DIASTOLIC AND SYSTOLIC AND DIASTOLIC BLOOD PRESSURE';
RUN;
ODS HTML CLOSE;
PROC CORR OUTPUT
[Screenshot of the PROC CORR output, annotated. It shows (1) a table of basic (simple) statistics, which you can tell SAS not to display by using the NOSIMPLE option in PROC CORR, and (2) the correlation matrix containing the pairwise Pearson correlations between each of the 3 variables; (3) each cell of the matrix shows the value of r and its significance level.]
PROC CORR Output Interpretation
Table 3 is of interest in this example.
We can see that the correlation between AGE and SYSTOLIC pressure is 0.511150 (a positive, but not perfect, relationship) and the p-value is small enough to reject the null hypothesis of no linear relationship.
PROC CORR
If your population is non-normal, use Spearman's correlation test by specifying it in the PROC CORR options, either together with Pearson's or by itself.
Using the WITH statement: sometimes we want to examine the correlations of one or more variables with other variables. The WITH statement comes in handy in such cases.
Let's say that in our example you want to verify the correlations of AGE with multiple measures of systolic blood pressure (say 4 measures: Sys1, Sys2, Sys3, Sys4). In this case you have to include a WITH statement, for example:
PROC CORR DATA=data_set_name;
VAR Sys1-Sys4;
WITH AGE;
RUN;
This will produce the correlations between AGE and each of Sys1, Sys2, Sys3, and Sys4.
PROC CORR - Plot
You should always produce a scatter plot of the variables of interest to verify the correlation between them.
One option is to use ODS Graphics with PROC CORR (turn it on with ODS GRAPHICS ON; before the procedure). This will generate the graphs and plots associated with the PROC CORR output.
Linear Regression
Correlation gives you a measure of the linear relationship between two variables; regression analysis utilizes this relationship to predict the dependent variable from the independent variable.
To predict the (value of the) dependent variable from a given value of an independent variable, simple linear regression is appropriate.
For example, as part of an investigation of the effects of physical exercise (amount of time spent exercising daily) on BMI, a simple linear regression can be used to predict BMI from the amount of time spent daily on physical exercise.
Simple Linear Regression Model Basics
The following mathematical equation of a (theoretical) line describes the association (relationship) between an independent variable X and a dependent variable Y:
Y = α + βx + ε
(α is the Y intercept, β is the slope of the line, and ε is the error, whose mean is 0 and whose variance is fixed. If the slope β = 0, then there is no predictive relationship between the variables.)
When we perform a regression analysis on data to predict a variable, we actually calculate a regression line to describe the relationship between the variables of our interest; this regression line is an estimate of the theoretical line above.
Simple Linear Regression Model Basics
The regression line we calculate has the following equation:
Y = a + bx
where a and b are the least-squares estimates of the parameters α and β respectively, x is the given value of the independent variable, and Y is the (value of the) dependent variable we are trying to predict.
Note: They are called least-squares estimates because the regression line minimizes the sum of the squared errors of the predictions (the squares of the errors between the actual values of the outcome variable and the predicted values of the outcome variable; please check the Lane textbook, chapter 15, for details).
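The least-squares estimates a and b have closed-form solutions, which a short Python sketch can demonstrate (an illustration outside SAS; PROC REG computes these for you):

```python
def least_squares(x, y):
    """Least-squares estimates a (intercept) and b (slope) for Y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # b = sum of cross-deviations over sum of squared x-deviations
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx                # the line passes through the point of means
    return a, b

# Data that lie exactly on y = 1 + 2x: the fitted line recovers a=1, b=2,
# and the sum of squared prediction errors it minimizes is zero.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
a, b = least_squares(x, y)
print(a, b)                        # 1.0 2.0
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
print(sse)                         # 0.0
```

For real, noisy data the errors are not zero, but no other choice of a and b would make their sum of squares smaller; that is exactly what "least squares" means.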
Simple Linear Regression in SAS
SAS provides a procedure called PROC REG for regression analysis of data.
When we specify the regression model in SAS by specifying the dependent variable and independent variable, SAS formulates a regression line (the same equation as on the previous slide) based on the given dataset and predicts the (value of the) dependent variable.
The first step is to check whether any relationship exists between the variables specified in SAS.
This is done by testing the null hypothesis that there is no predictable linear relationship between the variables (that is, the slope of the equation, β = 0).
Ha (Alternative Hypothesis): There is a predictable linear relationship between the two variables of interest (the slope of the equation is not 0).
Simple Linear Regression in SAS
Therefore, H0: β = 0 and
Ha: β ≠ 0
If we have a small p-value (usually < 0.05), we reject the null hypothesis and conclude that a predictive linear relationship exists between the variables.
Simple Linear Regression Using PROC REG
SAS PROC REG takes the following structure:
PROC REG <options>;
<statements>;
The MODEL statement has the structure:
MODEL dependent_var = independent_var / options;
Some of the MODEL statement options are (check the SAS manual to learn their functions): P (to request a table of predicted values), R (for residual analysis), CLM (confidence limits for the expected value), CLI (confidence limits for individual values of the dependent variable), INCLUDE, SELECTION, SLSTAY, SLENTRY.
Simple linear regression using PROC REG
Example
Let's consider the data we used for the correlation analysis example. In this example we are interested in whether systolic blood pressure can be used to predict diastolic blood pressure. After reading the dataset into SAS, we run the following PROC REG:
ODS HTML;
TITLE 'SIMPLE LINEAR REGRESSION EXAMPLE';
PROC REG DATA=MYDATA;
MODEL DIASTOLIC = SYSTOLIC;
/* SPECIFYING THE OUTCOME (DEPENDENT) VARIABLE AND THE PREDICTOR (INDEPENDENT) VARIABLE FOR THE REGRESSION MODEL: what you want to predict from what */
RUN;
ODS HTML CLOSE;
PROC REG Output
[Screenshot of the PROC REG output, annotated; four tables are marked (1)-(4):
- The first tables are associated with the regression model (the overall model test).
- One table tells you about the strength of the relationship: R-square measures how strong the relationship between the variables is. The closer it is to 1, the stronger the relationship. In this example the value is very small, 0.0141 (about 0.01, so there is barely a relationship). This is how to interpret the value: only 1% of the variability in the DIASTOLIC variable can be explained by the SYSTOLIC variable.
- The parameter-estimates table gives the least-squares estimates of a (Intercept row) and b (SYSTOLIC row) in Y = a + bx. The statistical test on the SYSTOLIC row is for β = 0; here we cannot reject the null hypothesis, so there is no relationship (the slope is not significantly different from 0).
- We cannot predict DIASTOLIC from SYSTOLIC because there is no significant relationship between them, so we do not go any further. Only if we could reject the null hypothesis would we have continued on to the parameter estimates.]
PROC REG Output
When reading PROC REG output, three things are usually of interest for understanding the results (also marked in the output on the previous slide):
1. R-square (tells you the strength of the relationship)
2. Slope (check the row for the independent variable in the parameter-estimates table and check the p-value of its test for significance; this is the test of whether the slope = 0 or not)
3. Parameter estimates: Intercept and independent variable (the estimates of a and b for the regression equation used for prediction)
From this example, we conclude that there is no significant predictive linear relationship between diastolic and systolic blood pressure in our dataset, since the t-test of slope = 0 is not significant. (The slope involves the dependent and independent variables; the intercept is not involved. So we check the row for the independent variable in the parameter-estimates table and its p-value.)
Therefore, we cannot predict diastolic from systolic. We stop our analysis with this conclusion and do not need to work out the parameters and regression equation for prediction.
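The connection between items 1 and 2 above is worth seeing concretely: in simple linear regression, R-square is exactly the square of Pearson's r between the two variables. A small Python check (illustrative only; SAS reports both values for you):

```python
import math

def r_squared(x, y):
    """R-square of the simple linear regression of y on x: 1 - SSE/SST."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))   # error sum of squares
    sst = sum((yi - my) ** 2 for yi in y)                          # total sum of squares
    return 1 - sse / sst

def pearson_r(x, y):
    """Pearson's r (deviation-score form)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(r_squared(x, y))          # 0.64
print(pearson_r(x, y) ** 2)     # 0.64 -- the same number
```

So a correlation of r = 0.8 corresponds to an R-square of 0.64, i.e., 64% of the variability in y explained by x; conversely the slide's R-square of 0.0141 corresponds to |r| of about 0.12.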
How to Interpret PROC REG Output
(When the slope is not zero)
Now consider the following output. Think of it as based on the same table but with different data values, and assume that this time the slope test produced a smaller p-value (let's say 0.04), making it significant.
This is a made-up output in which just the p-value of SYSTOLIC is changed to make it significant, so that the null hypothesis is rejected. It is only meant to show how to read and interpret the output of a regression when the slope is not 0, and how to predict the dependent variable from a value of the independent variable using the regression-line equation.
(Again, this is just for explanation, not correct output for a dataset.)
PROC REG Output (Slope ≠ 0)
[Made-up PROC REG output, annotated; four tables are marked (1)-(4):
- R-square measures how strong the relationship between the variables is; the closer it is to 1, the stronger the relationship. In this example the value is 0.0141. This is how to interpret the value: only 1% of the variability in the DIASTOLIC variable can be explained by the SYSTOLIC variable.
- The statistical test on the SYSTOLIC row is for β = 0. Here we can reject the null hypothesis, so there is a relationship (the slope is significantly different from 0).
- The parameter-estimates table (4) gives the least-squares estimates of a (Intercept row) and b (SYSTOLIC row) in Y = a + bx; the predictive equation is DIASTOLIC = 1.47628 + 0.00110 * SYSTOLIC.
- How to read it: check R-square for the strength of the relationship. Then check the slope-test p-value (Pr > |t|), in the last column of the parameter-estimates table (4), on the row for the independent variable, SYSTOLIC. Then report the parameter estimates (a and b) from the estimates column of table (4). The test for the Intercept is not of our interest, but its value is.]
PROC REG Output (Slope0)
In this case, the parameter estimates are 1.47628 and 0.00110 for the Intercept and SYSTOLIC (remember, these are the least-squares estimates of a and b in the regression-line equation). So we can write the equation of the regression line as:
Y = a + bx
DIASTOLIC (outcome variable) = 1.47628 + 0.00110 * SYSTOLIC (predictor; the value of x)
In this situation, we would use this regression equation to predict the dependent variable from values of the independent variable.
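Using the equation is just a matter of plugging a predictor value into the fitted line; a one-line Python sketch (the coefficients are the made-up estimates from the illustrative output above, not values fitted to real data):

```python
def predict_diastolic(systolic):
    """Prediction from the slide's made-up fitted line Y = a + b*x,
    with a = 1.47628 (intercept) and b = 0.00110 (slope for SYSTOLIC)."""
    a, b = 1.47628, 0.00110
    return a + b * systolic

# For a systolic reading of 120, the line predicts 1.47628 + 0.00110*120:
print(predict_diastolic(120))   # ~1.60828
```

(The predicted value is clinically nonsensical, which is expected: as the slides note, this output is fabricated purely to demonstrate how to read the table and apply the equation.)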
-
8/7/2019 Correlation_and_Simple_Linear_Regression
26/27
Simple Linear Regression Plot
It is always recommended to create a plot of the variables of interest to visually inspect the linear relationship in the data. The regression line can give you an idea of the predicted values of the dependent variable for each unit change of the independent variable.
You can plot the variables by adding a PLOT statement after the MODEL statement
(PLOT dependent_variable * independent_variable;) or by using the separate PROC GPLOT procedure.