Correlation_and_Simple_Linear_Regression
Transcript of Correlation_and_Simple_Linear_Regression
8/7/2019 Correlation_and_Simple_Linear_Regression
Correlation and Simple Linear Regression
Department of Health Informatics
BINF 5210, Spring 2011
Correlation Analysis
It is used to measure the linear association (the degree to which they are related) between two quantitative variables measured on the same subjects.
For example, if you want to examine the relationship between the height and weight of a group of children ages 8 to 10 to investigate physical growth, correlation analysis might be a good option for you.
Plotting the variables of interest in a scatter plot and then examining the relationship visually is one way of examining correlation, and it is a recommended practice.
Pearson's product-moment correlation, or Pearson's correlation, is the most commonly used measure of correlation between two quantitative variables.
Pearson's Correlation
Pearson's product-moment correlation measured on a population is denoted ρ (Greek letter rho), which is the measure of the degree to which the variables of interest (two quantitative variables) are related. When measured (estimated) on a sample, it is designated r (Pearson's r).
It measures the extent (degree) to which the points in a scatter plot of the variables of interest fall on a straight line (a linear relationship).
The value of Pearson's correlation ranges from +1 to -1: +1 for a perfect positive correlation, -1 for a perfect negative correlation, and 0 for no correlation (zero correlation).
Calculating Pearson's Correlation
Let's say we want to verify the correlation between variable X and variable Y (both quantitative variables) in a sample dataset.
The formula to calculate Pearson's correlation is:
r = [ΣXY − (ΣX)(ΣY)/N] / √[(ΣX² − (ΣX)²/N) (ΣY² − (ΣY)²/N)]
N is the number of elements (observations or subjects)
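As a quick sanity check of this formula outside SAS, here is a minimal Python sketch (Python is used purely for illustration; the course itself uses SAS) that computes r directly from the raw-score formula above:

```python
import math

def pearson_r(x, y):
    """Pearson's r via the raw-score (computational) formula on the slide."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = sum_xy - (sum_x * sum_y) / n
    denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator

# Perfectly linear data gives r = +1; reversing the direction gives r = -1.
x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # 1.0 (perfect positive correlation)
print(pearson_r(x, [10, 8, 6, 4, 2]))   # -1.0 (perfect negative correlation)
```

The two calls confirm the extreme values of the range described earlier: points falling exactly on a rising line give r = +1, and points on a falling line give r = -1.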
Correlation in SAS
SAS provides a procedure called PROC CORR for estimating the correlation coefficient between two (quantitative) variables.
It tests the hypotheses:
H0 (Null Hypothesis): There is no linear relationship between the two variables of interest (Pearson's r = 0)
Ha (Alternative Hypothesis): There is a linear relationship between the two variables of interest (Pearson's r ≠ 0)
and determines whether the estimated correlation coefficient differs significantly from 0.
PROC CORR Assumptions
The data are a random sample drawn from a (bivariate) normally distributed population.
If the population is not normal, use a nonparametric correlation estimation procedure (the most common is Spearman's rho).
PROC CORR provides Spearman's rho as well, but you have to request it with a PROC CORR option.
Spearman's correlation can be calculated by ranking each of the values of the variables of interest and then applying the Pearson correlation coefficient method to the ranks of the variables.
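The rank-then-Pearson recipe just described can be sketched in a few lines of Python (again, an illustration outside SAS; in SAS you would simply add the SPEARMAN option):

```python
import math

def ranks(values):
    """1-based ranks; tied values receive the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1   # mean of 1-based ranks i+1 .. j+1
        i = j + 1
    return r

def pearson_r(x, y):
    """Pearson's r (deviation-score form)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of the variables."""
    return pearson_r(ranks(x), ranks(y))

# A monotone but non-linear relationship: the ranks line up perfectly,
# so Spearman's rho is exactly 1 even though the raw relationship is curved.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]     # y = x**3
print(spearman_rho(x, y))   # 1.0
```

This also illustrates why Spearman's rho is the recommended fallback for non-normal data: it depends only on the ordering of the values, not on their actual magnitudes.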
PROC CORR Structure
PROC CORR <options>;
<statements>;
<options>: commonly used ones are
DATA=your_dataset_name
SPEARMAN (to request the nonparametric test for a non-normal population). You can also use NOSIMPLE (do not display simple statistics) and NOPROB (do not display the probability value, the p-value).
Statements can be:
VAR variables_of_interest;
BY variable; /* Optional (for a categorical variable; it will produce output separately for each category level) */
WITH variable; /* Optional (when you want the correlation between the variables in the VAR list and other variables, listed in the WITH list) */
PROC CORR Example
Consider the data set for this assignment (the external, tab-delimited text file smoke_drug in my documents). All data are numerical:
First column: Gender (1 = male, 2 = female)
Second column: Age
Third column: Race of subject (1 = white, 2 = black, 3 = Hispanic, 4 = other)
Fourth column: Smoker? (1 = yes, 2 = no)
Fifth column: Systolic blood pressure
Sixth column: Diastolic blood pressure
As an investigator, you are interested in examining the relationship between age and (systolic and diastolic) blood pressure in randomly selected subjects as part of a clinical trial.
PROC CORR in SAS
First we read the data into SAS:
data mydata;
INFILE "C:\smoke_drug.txt" DLM ='09'x;
INPUT GENDER AGE RACE SMOKER SYSTOLIC DIASTOLIC;
RUN;
Then we run PROC CORR on the variables of our interest
ODS HTML;
PROC CORR DATA=MYDATA;
VAR AGE SYSTOLIC DIASTOLIC; /* list of the variables we are interested in; this will generate correlations for the variables pairwise (3 pairs) */
TITLE 'CORRELATION OF AGE SYSTOLIC, AGE DIASTOLIC AND SYSTOLIC AND DIASTOLIC BLOOD PRESSURE';
RUN;
ODS HTML CLOSE;
PROC CORR OUTPUT
[Screenshot of the PROC CORR output, annotated. It shows (1) a table of basic (simple) statistics, which you can tell SAS not to display by using the NOSIMPLE option in PROC CORR, and (2) the correlation matrix containing the pairwise Pearson correlations between each of the 3 variables; (3) each cell of the matrix shows the value of r and its significance level.]
PROC CORR Output Interpretation
Table 3 is of interest in this example.
We can see that the correlation between AGE and SYSTOLIC pressure is 0.511150 (a positive, but not perfect, relationship) and the p-value is small enough to reject the null hypothesis of no linear relationship.
PROC CORR
If your population is non-normal, use Spearman's correlation test by specifying it in the PROC CORR options, either together with Pearson's or by itself.
Using the WITH statement: sometimes we want to examine the correlations of one or more variables with other variables. The WITH statement comes in handy in such cases.
Let's say that in our example you want to verify the correlations of AGE with multiple measures of systolic blood pressure (say 4 measures: Sys1, Sys2, Sys3, Sys4). In this case you have to include a WITH statement, for example:
PROC CORR DATA=data_set_name;
VAR Sys1-Sys4;
WITH AGE;
RUN;
This will produce the correlations between AGE and each of Sys1, Sys2, Sys3, and Sys4.
PROC CORR - Plot
You should always produce a scatter plot of the variables of interest to verify the correlation between them.
One option is to use ODS Graphics with PROC CORR (turn it on with ODS GRAPHICS ON; before the procedure). This will generate the graphs and plots associated with the PROC CORR output.
Linear Regression
Correlation gives you a measure of the linear relationship between two variables; regression analysis utilizes this relationship to predict the dependent variable from the independent variable.
To predict the (value of the) dependent variable from a given value of an independent variable, simple linear regression is appropriate.
For example, as part of an investigation of the effects of physical exercise (amount of time spent exercising daily) on BMI, a simple linear regression can be used to predict BMI from the amount of time spent daily on physical exercise.
Simple Linear Regression Model Basics
The following mathematical equation of a (theoretical) line describes the association (relationship) between an independent variable X and a dependent variable Y:
Y = α + βx + ε
(α is the Y intercept, β is the slope of the line, and ε is the error, whose mean is 0 and whose variance is fixed. If the slope β = 0, then there is no predictive relationship between the variables.)
When we perform a regression analysis on data to predict a variable, we actually calculate a regression line to describe the relationship between the variables of our interest; this regression line is an estimate of the theoretical line above.
Simple Linear Regression Model Basics
The regression line we calculate has the following equation:
Y = a + bx
where a and b are the least-squares estimates of the parameters α and β respectively, x is the given value of the independent variable, and Y is the (value of the) dependent variable we are trying to predict.
Note: They are called least-squares estimates because the regression line minimizes the sum of the squared errors of the predictions (the squares of the errors between the actual values of the outcome variable and the predicted values of the outcome variable; please check the Lane textbook, chapter 15, for details).
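The least-squares estimates a and b have closed-form solutions, which a short Python sketch can demonstrate (an illustration outside SAS; PROC REG computes these for you):

```python
def least_squares(x, y):
    """Least-squares estimates a (intercept) and b (slope) for Y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # b = sum of cross-deviations over sum of squared x-deviations
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx                # the line passes through the point of means
    return a, b

# Data that lie exactly on y = 1 + 2x: the fitted line recovers a=1, b=2,
# and the sum of squared prediction errors it minimizes is zero.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
a, b = least_squares(x, y)
print(a, b)                        # 1.0 2.0
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
print(sse)                         # 0.0
```

For real, noisy data the errors are not zero, but no other choice of a and b would make their sum of squares smaller; that is exactly what "least squares" means.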
Simple Linear Regression in SAS
SAS provides a procedure called PROC REG for regression analysis of data.
When we specify the regression model in SAS by specifying the dependent variable and independent variable, SAS formulates a regression line (the same equation as on the previous slide) based on the given dataset and predicts the (value of the) dependent variable.
The first step is to check whether any relationship exists between the variables specified in SAS.
This is done by testing the null hypothesis that there is no predictable linear relationship between the variables (that is, the slope of the equation, β = 0).
Ha (Alternative Hypothesis): There is a predictable linear relationship between the two variables of interest (the slope of the equation is not 0).
Simple Linear Regression in SAS
Therefore, H0: β = 0 and
Ha: β ≠ 0
If we have a small p-value (usually < 0.05), we reject the null hypothesis and conclude that a predictive linear relationship exists between the variables.
Simple Linear Regression Using PROC REG
SAS PROC REG takes the following structure:
PROC REG <options>;
<statements>;
The MODEL statement has the structure:
MODEL dependent_var = independent_var / options;
Some of the MODEL statement options are (check the SAS manual to learn their functions): P (to request a table of predicted values), R (for residual analysis), CLM (confidence limits for the expected value), CLI (confidence limits for individual values of the dependent variable), INCLUDE, SELECTION, SLSTAY, SLENTRY.
Simple linear regression using PROC REG
Example
Let's consider the data we used for the correlation analysis example. In this example we are interested in whether systolic blood pressure can be used to predict diastolic blood pressure. After reading the dataset into SAS, we run the following PROC REG:
ODS HTML;
TITLE 'SIMPLE LINEAR REGRESSION EXAMPLE';
PROC REG DATA=MYDATA;
MODEL DIASTOLIC = SYSTOLIC;
/* SPECIFYING THE OUTCOME (DEPENDENT) VARIABLE AND THE PREDICTOR (INDEPENDENT) VARIABLE FOR THE REGRESSION MODEL: what you want to predict from what */
RUN;
ODS HTML CLOSE;
PROC REG Output
[Screenshot of the PROC REG output, annotated; four tables are marked (1)-(4):
- The first tables are associated with the regression model (the overall model test).
- One table tells you about the strength of the relationship: R-square measures how strong the relationship between the variables is. The closer it is to 1, the stronger the relationship. In this example the value is very small, 0.0141 (about 0.01, so there is barely a relationship). This is how to interpret the value: only 1% of the variability in the DIASTOLIC variable can be explained by the SYSTOLIC variable.
- The parameter-estimates table gives the least-squares estimates of a (Intercept row) and b (SYSTOLIC row) in Y = a + bx. The statistical test on the SYSTOLIC row is for β = 0; here we cannot reject the null hypothesis, so there is no relationship (the slope is not significantly different from 0).
- We cannot predict DIASTOLIC from SYSTOLIC because there is no significant relationship between them, so we do not go any further. Only if we could reject the null hypothesis would we have continued on to the parameter estimates.]
PROC REG Output
When reading PROC REG output, three things are usually of interest for understanding the results (also marked in the output on the previous slide):
1. R-square (tells you the strength of the relationship)
2. Slope (check the row for the independent variable in the parameter-estimates table and check the p-value of its test for significance; this is the test of whether the slope = 0 or not)
3. Parameter estimates: Intercept and independent variable (the estimates of a and b for the regression equation used for prediction)
From this example, we conclude that there is no significant predictive linear relationship between diastolic and systolic blood pressure in our dataset, since the t-test of slope = 0 is not significant. (The slope involves the dependent and independent variables; the intercept is not involved. So we check the row for the independent variable in the parameter-estimates table and its p-value.)
Therefore, we cannot predict diastolic from systolic. We stop our analysis with this conclusion and do not need to work out the parameters and regression equation for prediction.
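The connection between items 1 and 2 above is worth seeing concretely: in simple linear regression, R-square is exactly the square of Pearson's r between the two variables. A small Python check (illustrative only; SAS reports both values for you):

```python
import math

def r_squared(x, y):
    """R-square of the simple linear regression of y on x: 1 - SSE/SST."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))   # error sum of squares
    sst = sum((yi - my) ** 2 for yi in y)                          # total sum of squares
    return 1 - sse / sst

def pearson_r(x, y):
    """Pearson's r (deviation-score form)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(r_squared(x, y))          # 0.64
print(pearson_r(x, y) ** 2)     # 0.64 -- the same number
```

So a correlation of r = 0.8 corresponds to an R-square of 0.64, i.e., 64% of the variability in y explained by x; conversely the slide's R-square of 0.0141 corresponds to |r| of about 0.12.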
How to Interpret PROC REG Output
(When the slope is not zero)
Now consider the following output. Think of it as based on the same table but with different data values, and assume that this time the slope test produced a smaller p-value (let's say 0.04), making it significant.
This is a made-up output in which just the p-value of SYSTOLIC is changed to make it significant, so that the null hypothesis is rejected. It is only meant to show how to read and interpret the output of a regression when the slope is not 0, and how to predict the dependent variable from a value of the independent variable using the regression-line equation.
(Again, this is just for explanation, not correct output for a dataset.)
PROC REG Output (Slope ≠ 0)
[Made-up PROC REG output, annotated; four tables are marked (1)-(4):
- R-square measures how strong the relationship between the variables is; the closer it is to 1, the stronger the relationship. In this example the value is 0.0141. This is how to interpret the value: only 1% of the variability in the DIASTOLIC variable can be explained by the SYSTOLIC variable.
- The statistical test on the SYSTOLIC row is for β = 0. Here we can reject the null hypothesis, so there is a relationship (the slope is significantly different from 0).
- The parameter-estimates table (4) gives the least-squares estimates of a (Intercept row) and b (SYSTOLIC row) in Y = a + bx; the predictive equation is DIASTOLIC = 1.47628 + 0.00110 * SYSTOLIC.
- How to read it: check R-square for the strength of the relationship. Then check the slope-test p-value (Pr > |t|), in the last column of the parameter-estimates table (4), on the row for the independent variable, SYSTOLIC. Then report the parameter estimates (a and b) from the estimates column of table (4). The test for the Intercept is not of our interest, but its value is.]
PROC REG Output (Slope0)
In this case, the parameter estimates are 1.47628 and 0.00110 for the Intercept and SYSTOLIC (remember, these are the least-squares estimates of a and b in the regression-line equation). So we can write the equation of the regression line as:
Y = a + bx
DIASTOLIC (outcome variable) = 1.47628 + 0.00110 * SYSTOLIC (predictor; the value of x)
In this situation, we would use this regression equation to predict the dependent variable from values of the independent variable.
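Using the equation is just a matter of plugging a predictor value into the fitted line; a one-line Python sketch (the coefficients are the made-up estimates from the illustrative output above, not values fitted to real data):

```python
def predict_diastolic(systolic):
    """Prediction from the slide's made-up fitted line Y = a + b*x,
    with a = 1.47628 (intercept) and b = 0.00110 (slope for SYSTOLIC)."""
    a, b = 1.47628, 0.00110
    return a + b * systolic

# For a systolic reading of 120, the line predicts 1.47628 + 0.00110*120:
print(predict_diastolic(120))   # ~1.60828
```

(The predicted value is clinically nonsensical, which is expected: as the slides note, this output is fabricated purely to demonstrate how to read the table and apply the equation.)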
-
8/7/2019 Correlation_and_Simple_Linear_Regression
26/27
Simple Linear Regression Plot
It is always recommended to create a plot of the variables of interest to visually inspect the linear relationship in the data. The regression line can give you an idea of the predicted values of the dependent variable for each unit change of the independent variable.
You can plot the variables by adding a PLOT statement after the MODEL statement
(PLOT dependent_variable * independent_variable;) or by using the separate PROC GPLOT procedure.