Correlation and Regression Feb2014


  • 8/12/2019 Correlation and Regression Feb2014


    PLT 6133

    QUANTITATIVE DATA ANALYSIS


    CORRELATION AND REGRESSION


    Statistics may be regarded as a method of dealing with data. This definition stresses the view that statistics is a tool concerned with the collection, organization and analysis of numerical facts and observations... The major concern of descriptive statistics is to present information in a convenient, usable, and understandable form.

    - Richard Runyon & Audrey Haber


    Summary of Major Types of Descriptive Statistics

    TYPE OF TECHNIQUE   STATISTICAL TECHNIQUE                  PURPOSE
    Univariate          Frequency distribution, measures of    Describe one variable
                        central tendency, standard deviation
    Bivariate           Correlation, percentage table,         Describe a relationship or the
                        chi-square                             association between two variables
    Multivariate        Elaboration paradigm, linear and       Describe relationships among several
                        multiple regression                    variables, or see how several
                                                               independent variables have an effect
                                                               on a dependent variable


    Three Broad Types of Research Questions:

    1 Descriptive Research Questions

    2 Associational Research Questions

    3 Difference Research Questions


    DESCRIPTIVE RESEARCH QUESTIONS

    Descriptive Research Questions are not answered with inferential statistics.

    They merely describe or summarize data, without trying to generalize to a larger population of individuals.

    Mean, Percentage, SD, Mode, Median, etc.


    INFERENTIAL STATISTICS rely on principles from probability

    sampling, whereby a researcher uses a random process to

    select cases from the entire population.

    Inferential statistics are a precise way to talk about how

    confident a researcher can be when inferring from the

    results in a sample to the population.


    ASSOCIATIONAL RESEARCH QUESTIONS

    Associational Research Questions are those in which 2 or more variables are associated or related.

    This approach usually involves an attempt to see how 2 or more variables covary (as one grows larger, the other grows larger or smaller) or how one or more variables enable one to predict another variable.

    Pearson Correlation, Spearman Correlation, Eta

    Correlation, etc.


    DIFFERENCE RESEARCH QUESTIONS

    Difference Research Questions: For these questions, we compare scores (on the dependent

    variable) of 2 or more different groups, each of

    which is composed of individuals with one of the

    values or levels on the independent variable.

    This type of question attempts to demonstrate that

    groups are not the same on the dependent

    variable.

    T-test, ANOVA, ANCOVA, MANOVA, MANCOVA,

    etc.


    CORRELATION

    The correlation is one of the most common and

    most useful statistics.

    Definition - A correlation is a single number that describes

    the degree of relationship (dependence) between two

    variables. It characterizes the existence of a relationship between variables.

    Relationship between 2 variables can vary from strong to

    weak.

    More accurately, correlation is the co-variation of

    standardized variables.
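    The statement that correlation is the co-variation of standardized variables can be written out as a short sketch. The symbols z_{x,i} and z_{y,i} (the standardized scores) are notation assumed here, not taken from the slides; \bar{x}, \bar{y} are the sample means and s_x, s_y the sample standard deviations computed with n - 1.

```latex
z_{x,i} = \frac{x_i - \bar{x}}{s_x}, \qquad
z_{y,i} = \frac{y_i - \bar{y}}{s_y}, \qquad
r = \frac{1}{n-1} \sum_{i=1}^{n} z_{x,i}\, z_{y,i}
```

    Because both variables are rescaled to unit standard deviation first, r does not depend on the units in which X and Y were measured.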


    However, correlation does not imply causation. A strong positive or strong negative correlation between 2 variables does not mean that one variable is caused by the other variable. Many statisticians stress that a strong correlation, on its own, never implies a cause-effect relationship between two variables.


    GENERALLY

    Two variables may correlate with each other in 3 possible ways:

    Positive relationship:
    Both variables vary in the same direction: as one goes up, the other goes up. E.g. salary and years of education are positively correlated because people who get the highest salaries tend to be the ones who have gone to school the longest.

    Negative relationship:
    The two variables vary in opposite directions: as one goes up, the other goes down. E.g. the number of problems faced and the amount of immunoglobulin A in a person's system are negatively correlated because as the number of problems goes up, the amount of immunoglobulin A tends to go down.

    Zero relationship:
    The two variables have no relationship with each other: one changes without affecting the other. E.g. the average speed of a car being driven and the average speed of a mouse. Also, the relationship between personality fluctuations and the movement of distant stars has a zero correlation.


    Degree of Correlation: How Strongly are

    variables correlated?

    The degree of correlation between two variables can

    be established using two methods:

    Scatter plot: a graph with plotted values for the two variables being compared.

    Correlation Coefficient methods.


    SCATTER PLOTS


    Scatter Plots - Example

    Example of negative correlation
    - Hours of exercise per week and months machine owned

    Example of uncorrelated data
    - Height and months machine owned

    Example of positive correlation
    - Cardiovascular fitness score and months machine owned


    Scatter Plots - Example

    Example of (a) weak and (b) strong correlation


    Scatter Plots - Example

    Researchers laid out 10 circular plots, each 4 meters in diameter, in an area where beavers were cutting down cottonwood trees. The number of stumps and the number of clusters of beetle larvae were recorded in each plot with the following results:

    Stumps   Beetle Larvae
    2        10
    2        30
    1        12
    3        24
    4        40
    1        11
    5        56
    3        40
    1        8
    2        14


    Scatter Plots - Example

    The scatter plot for the previous data:

    From the scatter plot, there appears to be a fairly strong positive association

    between the number of cottonwood stumps and the number of clusters of

    beetle larvae.
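    The visual impression of a fairly strong positive association can be checked numerically. The sketch below computes Pearson's r (introduced later in these slides) for the ten plots, using only the Python standard library; the function name pearson_r and the variable names are illustrative choices, not part of the original material.

```python
import math

# Data from the slide: one (stumps, larvae clusters) pair per circular plot.
stumps = [2, 2, 1, 3, 4, 1, 5, 3, 1, 2]
larvae = [10, 30, 12, 24, 40, 11, 56, 40, 8, 14]


def pearson_r(xs, ys):
    """Pearson's r from deviations about the sample means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r_beetles = pearson_r(stumps, larvae)
print(f"r = {r_beetles:.3f}")  # prints "r = 0.916"
```

    A value this close to 1 agrees with what the scatter plot suggests, although it says nothing about why stumps and larvae clusters go together.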


    CORRELATION COEFFICIENT


    Correlation Coefficient

    The correlation coefficient is used to measure the degree of correlation between variables. It is a quantitative indicator.

    There are several types of correlation coefficient, depending on the type of relationship.

    The most common is Pearson's correlation coefficient (denoted by r), which is sensitive only to a linear relationship between two variables.

    Other common correlation coefficients include Spearman's rank correlation coefficient (denoted by ρ) and Kendall's rank correlation coefficient (denoted by τ).


    Correlation Coefficient

    A correlation coefficient is a calculated number that indicates the degree of correlation between two variables:

    Perfect positive correlation is calculated as a value of 1 (or 100%).

    Perfect negative correlation is calculated as a value of -1.

    A value of zero shows no correlation at all.


    Correlation Coefficient

    TABLE 1.0 Interpreting a Correlation Coefficient

    Size of the correlation coefficient   General interpretation
    0.8 to 1.0                            Very strong relationship
    0.6 to 0.8                            Strong relationship
    0.4 to 0.6                            Moderate relationship
    0.2 to 0.4                            Weak relationship
    0.0 to 0.2                            Weak or no relationship
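    Table 1.0 can be turned into a small lookup, sketched below. Two things are assumptions rather than part of the table: the sign of r is ignored (only its size is classified), and a value sitting exactly on a shared boundary such as 0.6 is placed in the higher band. The function name interpret_r is hypothetical.

```python
def interpret_r(r):
    """Map a correlation coefficient to the verbal label in Table 1.0."""
    size = abs(r)  # assumption: the sign is ignored, only the size matters
    if size > 1:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    # Assumption: boundary values (0.8, 0.6, ...) go to the higher band.
    if size >= 0.8:
        return "Very strong relationship"
    if size >= 0.6:
        return "Strong relationship"
    if size >= 0.4:
        return "Moderate relationship"
    if size >= 0.2:
        return "Weak relationship"
    return "Weak or no relationship"


print(interpret_r(-0.5))  # prints "Moderate relationship"
```

    Labels like these are rules of thumb; what counts as strong varies between fields.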


    Correlation Coefficient

    A much more precise way to interpret the correlation coefficient is to compute the coefficient of determination. The coefficient of

    determination is the percentage of variance in one variable that is

    accounted for by the variance in the other variable.

    Coefficient of determination = Square of correlation coefficient

    Example: If the correlation between GPA and the number of hours of

    study is 0.7, then the coefficient of determination is _______.

    This means _______% of the variance in GPA can be explained by the

    variance in studying time. The stronger the correlation, the more the

    variance can be explained.

    However, this means that _______ % cannot be explained. The amount

    of unexplained variance is called the coefficient of alienation (or

    coefficient of non-determination).
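    For readers who want to check their answers to the GPA example, the arithmetic is a single squaring step, worked out below under the example's stated value r = 0.7.

```latex
r = 0.7 \;\Rightarrow\; r^2 = 0.7^2 = 0.49 \quad (49\%\ \text{of the variance explained}), \qquad
1 - r^2 = 1 - 0.49 = 0.51 \quad (51\%\ \text{unexplained})
```

    Note that a correlation of 0.7, labelled "strong" in Table 1.0, still leaves about half of the variance unaccounted for.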


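    Pearson's Correlation Coefficient

    If we have a series of n measurements of X and Y written as x_i and y_i where i = 1, 2, ..., n, then the sample correlation coefficient can be used to estimate the population Pearson correlation between X and Y. The sample correlation coefficient is written as:

    \[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y} \]

    where \bar{x} and \bar{y} are the sample means of X and Y, and s_x and s_y are the sample standard deviations of X and Y.

    This can also be written as:

    \[ r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\bigl(n \sum x_i^2 - (\sum x_i)^2\bigr)\bigl(n \sum y_i^2 - (\sum y_i)^2\bigr)}} \]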


    Correlation Coefficient Example

    Is there a linear relationship between the age at which a child first begins to speak and his or her mental ability later on? To answer this question a study was conducted in which the age (in months) at which a child first spoke and the child's score on an aptitude test as a teenager were recorded:

    Age   Score
    15    95
    26    71
    10    83
    9     91
    15    102
    20    87
    18    93
    11    100
    8     104
    20    94

    Draw a scatter plot and determine whether there appears to be a linear relationship between these two variables. If so, describe the relationship, calculate r, and determine what percentage of the variability in the aptitude score can be explained by the variability in the age at which a child begins speaking.


    Correlation Coefficient Example

    The scatter plot for the data:

    There appears to be a moderate negative association between the

    age at which a baby first begins to speak and mental ability later in

    life.


    Correlation Coefficient Example

    Calculation of the correlation coefficient:

    \[ r = \frac{10(13676) - (152)(920)}{\sqrt{\bigl(10(2616) - 152^2\bigr)\bigl(10(85510) - 920^2\bigr)}} = -0.5973301213 \approx -0.60 \]

    The variability in the age at which a child first speaks explains only about 36% (r^2 ≈ 0.36) of the variability in aptitude test scores later in life.
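    The hand calculation above can be reproduced step by step. The sketch below builds the five sums that appear in the computational formula for r and then combines them; the variable names are illustrative, and only the Python standard library is used.

```python
import math

# Data from the slide: age at first speech (months) and later aptitude score.
age = [15, 26, 10, 9, 15, 20, 18, 11, 8, 20]
score = [95, 71, 83, 91, 102, 87, 93, 100, 104, 94]

n = len(age)
sum_x = sum(age)                                  # 152
sum_y = sum(score)                                # 920
sum_xy = sum(x * y for x, y in zip(age, score))   # 13676
sum_x2 = sum(x * x for x in age)                  # 2616
sum_y2 = sum(y * y for y in score)                # 85510

# Computational form:
# r = (n*Sxy - Sx*Sy) / sqrt((n*Sx2 - Sx^2) * (n*Sy2 - Sy^2))
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r_age_score = numerator / denominator

print(f"r = {r_age_score:.4f}, r^2 = {r_age_score ** 2:.2f}")
# prints "r = -0.5973, r^2 = 0.36"
```

    Keeping the intermediate sums as named variables makes it easy to compare each one with the figures in the slide's formula.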


    Exercise

    Compute the correlation between the men's heights (in cm) and weights (in kg) for the following data:

    Man   Height (X)   Weight (Y)
    A     182          86
    B     167          61
    C     175          70
    D     182          75
    E     180          70


    When is a correlation strong enough?

    0.9 very high correlation; very

    dependable relationship


    Words of Caution

    Examine your data distribution (e.g. using a scatter plot) before you do anything with the correlation, and make sure you know the dos and don'ts of the correlation coefficient!

    The correlation coefficient is just an index of relationship, which tells nothing about the cause and effect of the relationship!

    Limit yourself to linear relationships if you don't have an adequate statistical background!


    REGRESSION


    Regression Analysis

    In statistics, regression analysis is a statistical technique for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables.

    More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed.

    All regression analyses test whether a significant quantitative relationship exists.


    Some Commonly Used Jargon

    Linear Regression

    Line of Best Fit

    Regression Equation


    The General Idea About Regression

    Suppose we are asked to investigate the relationship between two variables, namely Variable P (the independent variable) and Variable Q (the dependent variable):

    Pair     Variable P   Variable Q
    Pair 1   10           7
    Pair 2   20           12
    Pair 3   30           17
    Pair 4   40           22

    What would be the predicted value of Q if P = 15? If P = 25?

    How do you predict these?


    [Figure: scatter plot of the Q variable (vertical axis) against the P variable (horizontal axis), with points Pair 1 to Pair 4]


    Notice that if we connect these points, we would get a

    straight line. This line fits ALL the observed points.

    This straight line is called the line of best fit or

    regression line.

    The line of best fit defines a basis for predicting values

    of Q, given values of P (and vice versa).

    The concept of the line of best fit can be extended to

    form a basis for linear regression as well as non-linear

    regression.
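    Since the four P/Q pairs lie exactly on one straight line, the predictions asked for earlier (Q at P = 15 and P = 25) follow from the slope and intercept of that line. The sketch below is one way to do this; the function name predict_q is hypothetical, and taking the slope from the first and last points is a shortcut that works only because the points are perfectly collinear.

```python
# Data from the slide: (P, Q) for Pair 1 to Pair 4.
p_values = [10, 20, 30, 40]
q_values = [7, 12, 17, 22]

# Slope and intercept taken from the first and last points; because all four
# points are collinear, any two points give the same line.
slope = (q_values[-1] - q_values[0]) / (p_values[-1] - p_values[0])
intercept = q_values[0] - slope * p_values[0]


def predict_q(p):
    """Predicted Q for a given P on the line through the observed points."""
    return intercept + slope * p


# Check that the line really passes through every observed point.
assert all(predict_q(p) == q for p, q in zip(p_values, q_values))

print(predict_q(15), predict_q(25))  # prints "9.5 14.5"
```

    With real data the points rarely fall exactly on a line, which is why the method of least squares covered later in these slides is needed to choose the line.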


    Linear Regression


    Non-Linear Regression


    Regression Models

    Regression models involve the following variables:
    - The unknown parameters, denoted as β, which may represent a scalar or a vector.
    - The independent variables, X.
    - The dependent variable, Y.

    Regression models can predict a value of the Y variable given values of the X variables. Prediction within the range of values in the dataset used for model-fitting is known informally as interpolation. Prediction outside this range of the data is known as extrapolation.


    Linear Regression

    In linear regression, data is modeled using linear predictor

    functions, and unknown model parameters are estimated from

    the data.

    Such models are called linear models. Most commonly, linear regression refers to a model in which the conditional mean of Y given the value of X is an affine function of X. Less commonly, linear regression could refer to a model in which the median, or some other quantile of the conditional distribution of Y given X, is expressed as a linear function of X.

    Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of Y given X, rather than on the joint probability distribution of Y and X, which is the domain of multivariate analysis.
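    For a single independent variable, the phrase "the conditional mean of Y is an affine function of X" can be written as one line. The parameter names \beta_0 (intercept) and \beta_1 (slope) are a common convention assumed here; they play the same roles as a and b in the least-squares example later in these slides.

```latex
\operatorname{E}[\,Y \mid X = x\,] = \beta_0 + \beta_1 x
```

    "Affine" simply means a straight line that is allowed a non-zero intercept.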


    Non-Linear Regression

    In non-linear regression, data are modeled by a function which is a non-linear combination of the model parameters and depends on one or more independent variables.

    As linear regression is much easier, some non-linear regression problems can be transformed or segmented into linear regressions.
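    As an illustration of such a transformation (an example chosen here, not one taken from the slides), an exponential growth model is non-linear in its parameters \alpha and \beta, but taking natural logarithms of both sides gives a straight-line relationship between \ln y and x. This assumes every observed y is positive.

```latex
y = \alpha e^{\beta x}
\quad\Longrightarrow\quad
\ln y = \ln \alpha + \beta x
```

    Fitting a straight line to the pairs (x, \ln y) then gives estimates of \ln \alpha as the intercept and \beta as the slope.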


    Method of least squares

    The method of least squares gives a way to find the best estimate of a particular measurement or set of data, assuming that the errors (i.e. the differences from the true value) are random and unbiased.

    "Least squares" means that the overall solution minimizes the sum of the squares of the errors made in the results of every single equation.

    The best fit in the least-squares sense minimizes the sum of squared residuals, a residual being the difference between an observed value and the fitted value provided by a model.


    Method of least squares: the line of best fit

    The method of least squares calculates the line of best fit by minimising the sum of the squares of the vertical distances of the points to the line. Let's illustrate with a simple example.
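    In symbols, "minimising the sum of the squares of the vertical distances" can be sketched as follows for a line Y = a + bX. The name S for the sum of squares is notation assumed here. Setting the two partial derivatives of S to zero yields the two normal equations, one for a and one for b.

```latex
S(a, b) = \sum_{i=1}^{n} \bigl( y_i - (a + b x_i) \bigr)^2

\frac{\partial S}{\partial a} = 0 \;\Rightarrow\; \sum y_i = n a + b \sum x_i

\frac{\partial S}{\partial b} = 0 \;\Rightarrow\; \sum x_i y_i = a \sum x_i + b \sum x_i^2
```

    Solving this pair of linear equations for a and b gives the intercept and slope of the line of best fit.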


    Example - Method of least squares

    Fit a least-squares line to the following data.

    X   1   2   3   4   5
    Y   2   5   3   8   7


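    Example - Method of least squares

    Solution:

    X    Y    XY    X^2
    1    2    2     1
    2    5    10    4
    3    3    9     9
    4    8    32    16
    5    7    35    25

    From the table, \sum X = 15, \sum Y = 25, \sum XY = 88, \sum X^2 = 55 and n = 5.

    The equation of the least-squares line is

    \[ Y = a + bX \]

    Normal equation for a:

    \[ \sum Y = na + b \sum X \quad\Rightarrow\quad 25 = 5a + 15b \qquad (1) \]

    Normal equation for b:

    \[ \sum XY = a \sum X + b \sum X^2 \quad\Rightarrow\quad 88 = 15a + 55b \qquad (2) \]

    To eliminate a, multiply equation (1) by 3 and subtract it from equation (2). This gives 10b = 13, and substituting b back into equation (1) gives a.

    Here a = 1.1 and b = 1.3, so the equation of the least-squares line becomes Y = 1.1 + 1.3X.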
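    The values of a and b in the worked example can be checked with a few lines of code. This sketch solves the two normal equations in closed form for the line Y = a + bX, using only the Python standard library; the variable names are illustrative.

```python
# Data from the slide.
xs = [1, 2, 3, 4, 5]
ys = [2, 5, 3, 8, 7]

n = len(xs)
sum_x = sum(xs)                              # 15
sum_y = sum(ys)                              # 25
sum_xy = sum(x * y for x, y in zip(xs, ys))  # 88
sum_x2 = sum(x * x for x in xs)              # 55

# Closed-form solution of the two normal equations for Y = a + bX.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
a = (sum_y - b * sum_x) / n                                   # intercept

print(f"Y = {a:.1f} + {b:.1f}X")  # prints "Y = 1.1 + 1.3X"
```

    The same two formulas apply to any set of paired data, so the lists xs and ys can be swapped for other observations.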


    Exercise

    A researcher investigates the relationship between individuals' scores on a Reading Aptitude Test and the average number of hours they spend reading (simply called Hours). The data gathered from 10 students are as follows:

    Student   Score on Reading Aptitude Test (X)   Hours (Y)
    S1        20                                   5
    S2        5                                    1
    S3        5                                    2
    S4        40                                   7
    S5        30                                   8
    S6        35                                   9
    S7        5                                    3
    S8        5                                    2
    S9        15                                   5
    S10       40                                   8


    DO NOT WORRY ABOUT APPLYING THE

    EQUATIONS!

    You will use SPSS (Statistical Package for the Social Sciences) to run all of the analyses.


    The first step in any applied research is to get a good THEORETICAL grasp of the topic to be studied.

    The best data analysts don't start with the data; they start with theory.


    THANK YOU

    PREPARED BY ASSOC PROF DR NORMAH MULOP