Correlation_and_Simple_Linear_Regression

  • 8/7/2019 Correlation_and_Simple_Linear_Regression

    1/27

    Correlation and Simple Linear Regression

    Department of Health Informatics

    BINF 5210 Spring 2011

    Correlation Analysis

    Correlation analysis measures the linear association (the degree to which they are related) between two quantitative variables measured on the same subjects.

    For example, to investigate physical growth in a group of children aged 8 to 10, you could use correlation analysis to examine the relationship between their height and weight.

    Plotting the variables of interest in a scatter plot and examining the relationship visually is one way of examining correlation. It is a recommended practice.

    Pearson's product-moment correlation (Pearson's correlation) is the most commonly used measure of correlation between two quantitative variables.

    Pearson's Correlation

    Pearson's product-moment correlation measured on a population is denoted $\rho$ (the Greek letter rho). It measures the degree to which the two quantitative variables of interest are related. When it is estimated from a sample, it is denoted r (Pearson's r).

    It measures the extent to which the points in a scatter plot of the variables of interest fall on a straight line (linear relationship).

    Pearson's correlation ranges from -1 to +1: +1 indicates a perfect positive correlation, -1 a perfect negative correlation, and 0 no linear correlation (zero correlation).

    Calculating Pearson's Correlation

    Let's say we want to examine the correlation between variable X and variable Y (both quantitative) in a sample dataset.

    The formula for Pearson's correlation is:

    $$
    r = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{N}}{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{N}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{N}\right)}}
    $$

    N is the number of elements (observations or subjects)
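
    As a worked illustration of the formula, take five made-up (X, Y) pairs that are not from the course data: (1, 2), (2, 4), (3, 5), (4, 4), (5, 5). Then $N = 5$, $\sum X = 15$, $\sum Y = 20$, $\sum XY = 66$, $\sum X^2 = 55$, and $\sum Y^2 = 86$, so:

    $$
    r = \frac{66 - \frac{(15)(20)}{5}}{\sqrt{\left(55 - \frac{15^2}{5}\right)\left(86 - \frac{20^2}{5}\right)}} = \frac{6}{\sqrt{(10)(6)}} = \frac{6}{\sqrt{60}} \approx 0.775
    $$

    This is a fairly strong positive correlation for these illustrative values.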

    Correlation in SAS

    SAS provides a procedure called PROC CORR for analyzing the correlation coefficient between two quantitative variables.

    It tests the hypotheses:

    H0 (Null Hypothesis): There is no linear relationship between the two variables of interest (population correlation $\rho = 0$)

    Ha (Alternative Hypothesis): There is a linear relationship between the two variables of interest (population correlation $\rho \neq 0$)

    and determines whether the estimated correlation coefficient differs significantly from 0.

    PROC CORR Assumptions

    The data are a random sample drawn from a normally distributed (bivariate) population.

    If the population is not normal, use a nonparametric correlation procedure (the most common is Spearman's rho).

    PROC CORR also provides Spearman's rho, but you have to request it with a PROC CORR option.

    Spearman's correlation can be calculated by ranking the values of each variable of interest and then applying the Pearson correlation coefficient method to the ranks.
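
    The rank-then-Pearson description can be checked with a small sketch. The data set DEMO and its values are made up for illustration (they are not from the course data), and the names RANKED, RX, and RY are arbitrary. TIES=MEAN makes PROC RANK give tied values their average rank; under that assumption the Pearson correlation of the ranks should agree with what the SPEARMAN option reports.

    ```sas
    /* Hypothetical toy data, for illustration only */
    data demo;
      input x y;
      datalines;
    1 2
    2 4
    3 5
    4 4
    5 5
    ;
    run;

    /* Step 1: replace each value by its rank (ties get the average rank) */
    proc rank data=demo out=ranked ties=mean;
      var x y;
      ranks rx ry;
    run;

    /* Step 2: Pearson correlation computed on the ranks */
    proc corr data=ranked pearson nosimple;
      var rx ry;
    run;

    /* For comparison: Spearman's rho requested directly */
    proc corr data=demo spearman nosimple;
      var x y;
    run;
    ```

    Comparing the two PROC CORR outputs side by side is a quick way to see that Spearman's rho is a Pearson correlation on ranks.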

    PROC CORR Structure

    PROC CORR <options>;
      <statements>;

    <options>: commonly used ones are

    DATA=your_dataset_name

    SPEARMAN (requests the nonparametric test, for a non-normal population)

    You can also use NOSIMPLE (do not display simple statistics) and NOPROB (do not display the probability value, p-value).

    <statements> could be:

    VAR variables of interest;

    BY variable; /* Optional (for a categorical variable, it produces separate output for each category level) */

    WITH variable; /* Optional (when you want the correlations between the variables in the VAR list and other variables, listed in the WITH list) */
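
    As a sketch of how these options and statements combine, the step below assumes a data set named MYDATA with variables GENDER, AGE, SYSTOLIC, and DIASTOLIC (the names used in the example on the following slides); MYDATA_SORTED is an arbitrary name. BY-group processing normally expects the data to be sorted by the BY variable, so a PROC SORT step comes first.

    ```sas
    /* BY-group processing expects data sorted by the BY variable */
    proc sort data=mydata out=mydata_sorted;
      by gender;
    run;

    /* Pearson and Spearman correlations of AGE with each blood pressure
       measure, reported separately for each GENDER level; NOSIMPLE
       suppresses the simple statistics table */
    proc corr data=mydata_sorted pearson spearman nosimple;
      var systolic diastolic;
      with age;
      by gender;
    run;
    ```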

    PROC CORR Example

    Consider the data set for this assignment (external text file smoke_drug in my document). Column values are tab-delimited, and all data are numeric.

    First column is Gender (1 = male, 2 = female)

    Second column is Age

    Third column is Race of subjects (1 = white, 2 = black, 3 = Hispanic, 4 = other)

    Fourth column is Smoker? (1 = yes, 2 = no)

    Fifth column is Systolic blood pressure

    Sixth column is Diastolic blood pressure

    As an investigator, you are interested in examining the relationship between age and blood pressure (systolic and diastolic) of randomly selected subjects as part of a clinical trial.

    PROC CORR in SAS

    First we read the data into SAS:

    DATA MYDATA;
      INFILE "C:\smoke_drug.txt" DLM='09'x; /* '09'x is the tab character */
      INPUT GENDER AGE RACE SMOKER SYSTOLIC DIASTOLIC;
    RUN;

    Then we run PROC CORR on the variables of interest:

    ODS HTML;
    PROC CORR DATA=MYDATA;
      VAR AGE SYSTOLIC DIASTOLIC; /* list of variables we are interested in; this generates pairwise correlations (3 pairs) */
      TITLE 'CORRELATION OF AGE SYSTOLIC, AGE DIASTOLIC AND SYSTOLIC AND DIASTOLIC BLOOD PRESSURE';
    RUN;
    ODS HTML CLOSE;

    PROC CORR OUTPUT

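    [Figure: PROC CORR output, with its tables numbered 1 to 3. The callouts on the figure read as follows.]

    You can tell SAS not to display the table of basic statistics by using the NOSIMPLE option in PROC CORR.

    The correlation matrix contains the pairwise Pearson correlations between each of the 3 variables. Each cell shows the value of r together with its significance level.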

    PROC CORR Output Interpretation

    Table 3 is the one of interest in this example.

    We can see that the correlation between AGE and SYSTOLIC pressure is 0.511150 (a positive relationship, but not a perfect positive one) and the p-value is

    PROC CORR

    If your population is not normal, use Spearman's correlation test by specifying the SPEARMAN option in the PROC CORR statement, either together with PEARSON or by itself.

    Using the WITH statement: sometimes we want to examine the correlations of one or more variables with other variables. The WITH statement is handy in such cases.

    Let's say that in our example you want to examine the correlations of AGE with multiple measures of systolic blood pressure (say 4 measures: Sys1, Sys2, Sys3, Sys4). In this case you include a WITH statement, for example:

    PROC CORR DATA=data_set_name;
      VAR Sys1-Sys4;
      WITH AGE;
    RUN;

    This will produce correlations between AGE and each of Sys1, Sys2, Sys3, and Sys4.

    PROC CORR Plot

    You should always produce a scatter plot of the variables of interest to check the correlations between them.

    One option is to enable ODS GRAPHICS when running PROC CORR. This will generate the graphs and plots associated with the PROC CORR output.
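
    A minimal sketch of that option, assuming a SAS release that supports ODS Graphics and the MYDATA data set from the earlier example. PLOTS=MATRIX and PLOTS=SCATTER are PROC CORR options; which one to request is a matter of preference.

    ```sas
    ods graphics on;

    /* PLOTS=MATRIX draws a scatter plot matrix of the VAR variables;
       PLOTS=SCATTER would instead draw one scatter plot per pair */
    proc corr data=mydata plots=matrix;
      var age systolic diastolic;
    run;

    ods graphics off;
    ```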

    Linear Regression

    Correlation measures the linear relationship between two variables; regression analysis uses this relationship to predict the dependent variable from the independent variable.

    Simple linear regression is appropriate for predicting the value of a dependent variable from a given value of an independent variable.

    For example, as part of an investigation of the effects of physical exercise (amount of time spent exercising daily) on BMI, a simple linear regression can be used to predict BMI from the amount of time spent exercising daily.

    Simple Linear Regression Model Basics

    The following mathematical equation of a (theoretical) line describes the association (relationship) between an independent variable X and a dependent variable Y:

    $Y = \alpha + \beta x + \varepsilon$

    ($\alpha$ is the Y intercept, $\beta$ is the slope of the line, and $\varepsilon$ is the error, whose mean is 0 and whose variance is fixed. If the slope $\beta = 0$, there is no predictive relationship between the variables.)

    When we perform a regression analysis on data to predict a variable, we actually calculate a regression line that describes the relationship between the variables of interest. This regression line is an estimate of the theoretical line above.

    Simple Linear Regression Model Basics (continued)

    The regression line we calculate has the following equation:

    Y = a + bx

    where a and b are the least squares estimates of the parameters $\alpha$ and $\beta$ respectively, x is the given value of the independent variable, and Y is the value of the dependent variable we are trying to predict.

    Note: They are called least squares estimates because the regression line minimizes the sum of the squared errors of prediction (the squared differences between the actual and predicted values of the outcome variable). Please check the Lane textbook, chapter 15, for details.
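
    For reference, the standard closed-form least squares estimates for a single predictor (a textbook result, not shown on the slides) are given below, where $\bar{X}$ and $\bar{Y}$ are the sample means and $N$ is the number of observations:

    $$
    b = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{N}}{\sum X^2 - \frac{(\sum X)^2}{N}}, \qquad a = \bar{Y} - b\,\bar{X}
    $$

    As an illustration with five made-up (X, Y) pairs that are not from the course data, (1, 2), (2, 4), (3, 5), (4, 4), (5, 5), we have $\sum X = 15$, $\sum Y = 20$, $\sum XY = 66$, $\sum X^2 = 55$, $\bar{X} = 3$, and $\bar{Y} = 4$, so:

    $$
    b = \frac{66 - 60}{55 - 45} = 0.6, \qquad a = 4 - (0.6)(3) = 2.2, \qquad Y = 2.2 + 0.6x
    $$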

    Simple Linear Regression in SAS

    SAS provides a procedure called PROC REG for regression analysis of data.

    When we specify the regression model in SAS by naming the dependent variable and the independent variable, SAS fits a regression line (the same equation as on the previous slide) to the given dataset and predicts the value of the dependent variable.

    The first step is to check whether any relationship exists between the variables specified in SAS.

    This is done by testing the null hypothesis that there is no predictive linear relationship between the variables (that is, the slope of the equation, $\beta$, is 0).

    Ha (Alternative Hypothesis): There is a predictive linear relationship between the two variables of interest (the slope of the equation is not 0).

    Simple Linear Regression in SAS (continued)

    Therefore, H0: $\beta = 0$ and

    Ha: $\beta \neq 0$

    If we have a small p-value (usually

    Simple Linear Regression Using PROC REG

    SAS PROC REG takes the following structure:

    PROC REG <options>;
      <statements>;

    The MODEL statement has the structure:

    MODEL dependent_var = independent_var / options;

    Some of the MODEL statement options are (check the SAS manual to learn their functions):

    P (requests a table of predicted values)

    R (requests residual analysis)

    CLM (confidence limits for the expected value of the dependent variable)

    CLI (confidence limits for individual values of the dependent variable)

    INCLUDE, SELECTION, SLSTAY, SLENTRY
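
    As an illustration of the MODEL statement options, the sketch below requests predicted values, residual analysis, and both kinds of confidence limits. It assumes the MYDATA data set and the DIASTOLIC and SYSTOLIC variables from the correlation example; the particular combination of options is only an example.

    ```sas
    proc reg data=mydata;
      /* P   : table of predicted values
         R   : residual analysis
         CLM : confidence limits for the mean (expected) value
         CLI : prediction limits for an individual value */
      model diastolic = systolic / p r clm cli;
    run;
    quit; /* PROC REG is interactive, so QUIT ends the procedure */
    ```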

    Simple Linear Regression Using PROC REG: Example

    Let's consider the data we used for the correlation analysis example. In this example we are interested in whether systolic blood pressure can be used to predict diastolic blood pressure. After reading the dataset into SAS, we run the following PROC REG:

    ODS HTML;
    TITLE 'SIMPLE LINEAR REGRESSION EXAMPLE';
    PROC REG DATA=MYDATA;
      /* Specify the outcome (dependent) variable and the predictor (independent) variable of the regression model: what you want to predict, from what */
      MODEL DIASTOLIC = SYSTOLIC;
    RUN;
    QUIT; /* ends the interactive PROC REG step */
    ODS HTML CLOSE;

    PROC REG Output

    [Figure: PROC REG output with tables numbered 1 to 4. The callouts on the figure read as follows.]

    This table is associated with the regression model.

    This table tells you about the strength of the relationship. R-square measures how strong the relationship between the variables is: the closer it is to 1, the stronger the relationship. In this example the value is very small, 0.0141 (about 0.01, so there is barely a relationship). This is how to interpret this value: only about 1% of the variability in the DIASTOLIC variable can be explained by the SYSTOLIC variable.

    Least squares estimate of a and least squares estimate of b, for Y = a + bx.

    The statistical test on the SYSTOLIC row is the test of $\beta = 0$. We cannot reject the null hypothesis, so there is no relationship (the slope is not significantly different from 0).

    If we could reject the null hypothesis, only then would we continue here.

    We cannot predict DIASTOLIC from SYSTOLIC because there is no significant relationship between them, so we do not go any further.

    PROC REG Output (continued)

    When we read PROC REG output, three things are usually of interest for understanding the results (also shown in the output on the previous slide):

    1- R-square (tells you the strength of the relationship)

    2- Slope (check the row for the independent variable in the regression table and see whether the p-value of the test is significant. This is the test of whether the slope = 0.)

    3- Parameter estimates: Intercept and independent variable (the estimates of a and b for the regression equation used for prediction)

    From this example, we conclude that there is no significant predictive linear relationship between Diastolic and Systolic blood pressure in our dataset, since the t-test of slope = 0 is not significant. (The slope relates the dependent and independent variables; the intercept is not involved. So we check the row for the independent variable in the regression table and see whether the p-value of the test is significant.)

    Therefore, we cannot predict Diastolic from Systolic. We stop the analysis at this conclusion; we do not need to examine the parameter estimates or the regression equation for prediction.

    How to Interpret PROC REG Output

    (When slope is not zero)

    Now consider the following output. Think of it as the same tables but for different data values, and suppose that this time the slope test produced a smaller, significant p-value (let's say .04).

    This is a made-up output in which only the p-value of Systolic has been changed to make it significant, so that the null hypothesis is rejected. It is meant only to show how to read and interpret regression output when the slope is not 0, and how to predict the dependent variable from a value of the independent variable using the regression line equation.

    (Again, this is just for explanation; it is not correct output from a dataset.)

    PROC REG Output (Slope ≠ 0)

    [Figure: made-up PROC REG output with tables numbered 1 to 4. The callouts on the figure read as follows.]

    This table is associated with the regression model.

    This table tells you about the strength of the relationship. R-square measures how strong the relationship between the variables is: the closer it is to 1, the stronger the relationship. In this example the value is 0.0141. This is how to interpret this value: only about 1% of the variability in the DIASTOLIC variable can be explained by the SYSTOLIC variable.

    The statistical test on the SYSTOLIC row is the test of $\beta = 0$. We can reject the null hypothesis, so there is a relationship (the slope is significantly different from 0).

    Least squares estimate of a and least squares estimate of b, for Y = a + bx:

    DIASTOLIC = 1.47628 + 0.00110 * SYSTOLIC

    (This is the predictive equation.)

    Check R-square to see the strength of the relationship. Then check the slope test p-value (Pr > |t|) in the last column of the regression table (4), in the row for the independent variable, in this case SYSTOLIC. Then report the parameter estimates (a and b) from the third column of the regression table (4). The test for the Intercept is not of interest, but its value is.

    PROC REG Output (Slope ≠ 0, continued)

    In this case, the parameter estimates are 1.47628 for the Intercept and 0.00110 for Systolic (remember, these are the least squares estimates of a and b in the regression line equation).

    So we can write the equation of the regression line as:

    Y = a + bx

    DIASTOLIC = 1.47628 + 0.00110 * SYSTOLIC

    Here DIASTOLIC is the outcome variable (Y) and SYSTOLIC is the predictor (the value of x).

    In this situation, we would use this regression equation to predict the dependent variable from values of the independent variable.
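
    As a worked illustration with this made-up output, suppose a subject has SYSTOLIC = 120 (a hypothetical value chosen only to show the arithmetic). Substituting into the regression equation gives:

    $$
    \text{DIASTOLIC} = 1.47628 + 0.00110 \times 120 = 1.47628 + 0.13200 = 1.60828
    $$

    Because the coefficients come from an output altered for explanation, the predicted value has no clinical meaning; only the substitution step carries over to real output.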

    Simple Linear Regression Plot

    It is always recommended to plot the variables of interest to visually inspect the linear relationship in the data. The regression line shows how the predicted value of the dependent variable changes for each unit change in the independent variable.

    You can plot the variables in SAS by adding a PLOT statement after the MODEL statement (PLOT DEPENDENTVARIABLE * INDEPENDENTVARIABLE;) or by running the separate PROC GPLOT procedure after the PROC REG step.
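
    A sketch of the PROC GPLOT route, assuming SAS/GRAPH is available and the MYDATA data set from the example. INTERPOL=RL on the SYMBOL statement asks GPLOT to overlay a fitted linear regression line on the scatter plot.

    ```sas
    /* Dots for the observations, plus a fitted linear regression line */
    symbol1 value=dot interpol=rl;

    proc gplot data=mydata;
      plot diastolic*systolic; /* vertical*horizontal = dependent*independent */
    run;
    quit;
    ```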
