SChapter 5


Transcript of SChapter 5

  • UECM2623/UCCM2623 Numerical Methods and Statistics/UECM1693 Mathematics for Physics II

    Chapter 5: Regression and Correlation

    5.1 Introduction

    The main objective of this chapter is to analyze a collection of paired sample data (or bivariate data) and determine whether there appears to be a relationship between the two variables.

    For a set of bivariate data, $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, the sum of cross products of X and Y is given by

    $$S_{XY} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n}.$$

    Similarly,

    the sum of squares of X is $S_{XX} = \sum x^2 - \frac{(\sum x)^2}{n}$,

    the sum of squares of Y is $S_{YY} = \sum y^2 - \frac{(\sum y)^2}{n}$.
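    The computational form of the cross-product sum can be checked by expanding the deviations. The following is a short sketch of that step, assuming the usual sample means $\bar{x} = \sum x / n$ and $\bar{y} = \sum y / n$:

    $$\sum (x - \bar{x})(y - \bar{y}) = \sum xy - \bar{y} \sum x - \bar{x} \sum y + n \bar{x} \bar{y} = \sum xy - \frac{(\sum x)(\sum y)}{n}$$

    Each of the last three terms in the middle expression equals $(\sum x)(\sum y)/n$ in magnitude, so two of them cancel. Setting $y = x$ in the same expansion gives $\sum (x - \bar{x})^2 = \sum x^2 - (\sum x)^2 / n$, the sum of squares of X.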

    5.2 Correlation

    A correlation exists between two variables when one of them is related to the other in some way.

    A scatterplot (or scatter diagram) is a graph in which the paired (x, y) sample data are plotted with a horizontal x-axis and a vertical y-axis. Each individual (x, y) pair is plotted as a single point.

    Example 5.1. Suppose we take a sample of seven households and collect information on their incomes and food expenditures for the past month. The information obtained (in hundreds of RM) is given below.

    Income (hundreds)             35  49  21  39  15  28  25
    Food expenditure (hundreds)    9  15   7  11   5   8   9

    Solution. The scatter diagram for this set of data is

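    [Figure: scatter diagram of food expenditure (vertical axis, 4 to 16) against income (horizontal axis, 10 to 50), both in hundreds of RM.]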


    5.2.1 Linear Correlation Coefficient

    The linear correlation coefficient, r (also called the Pearson product-moment correlation coefficient), measures the strength of the linear relationship between the paired quantitative x- and y-values in a sample.

    $$r = \frac{S_{XY}}{\sqrt{S_{XX}\, S_{YY}}}$$
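    As an illustration of the formula for r, here is a minimal Python sketch applied to the income and food expenditure data of Example 5.1. The function names `sums_of_squares` and `correlation` are illustrative choices, not part of the notes.

```python
from math import sqrt


def sums_of_squares(x, y):
    """Return (S_XX, S_YY, S_XY) using the computational formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    s_yy = sum(yi * yi for yi in y) - sum_y ** 2 / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    return s_xx, s_yy, s_xy


def correlation(x, y):
    """Linear correlation coefficient r = S_XY / sqrt(S_XX * S_YY)."""
    s_xx, s_yy, s_xy = sums_of_squares(x, y)
    return s_xy / sqrt(s_xx * s_yy)


# Data from Example 5.1 (both variables in hundreds of RM).
income = [35, 49, 21, 39, 15, 28, 25]
food = [9, 15, 7, 11, 5, 8, 9]

r = correlation(income, food)
print(f"r = {r:.4f}")
```

    For these seven households the sketch gives r of about 0.96, which is consistent with the upward trend in the scatter diagram.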

    Properties of the linear correlation coefficient, r:

    1. The value of r is always between −1 and 1 inclusive. That is, $-1 \le r \le 1$.
    2. r measures the strength of a linear relationship. It is not designed to measure the strength of a relationship that is not linear.
    3. If $r = 0$, then there is no linear relationship between the two variables.
    4. $0 \le r^2 \le 1$.

    Degree of correlation    Positive correlation    Negative correlation
    Perfect                  +1                      −1
    Strong                   0.8 to 1.0
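    The point that r only measures linear association can be seen with a small made-up data set. In the sketch below (the data and the helper `correlation` are illustrative assumptions), y depends exactly on x through a quadratic, yet r is zero, while an exact straight line with negative slope gives r equal to −1.

```python
from math import sqrt


def correlation(x, y):
    """Linear correlation coefficient from the computational formulas."""
    n = len(x)
    s_xx = sum(v * v for v in x) - sum(x) ** 2 / n
    s_yy = sum(v * v for v in y) - sum(y) ** 2 / n
    s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return s_xy / sqrt(s_xx * s_yy)


x = [-2, -1, 0, 1, 2]
y_curve = [v * v for v in x]      # exact quadratic relationship, y = x^2
y_line = [3 - 2 * v for v in x]   # exact linear relationship, negative slope

r_curve = correlation(x, y_curve)
r_line = correlation(x, y_line)
print(r_curve, r_line)
```

    A value of r near zero therefore rules out a linear relationship only; it says nothing about whether some other kind of relationship exists.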


    5.3 Regression

    The simple regression equation (model) expresses a relationship between x (called the independent variable, predictor variable or explanatory variable) and y (called the dependent variable or response variable).

    A (simple) regression model that gives a straight-line relationship between two variables is called a linear regression model, $y = A + Bx$.

    Given a collection of paired sample data, the regression equation $\hat{y} = a + bx$ describes the relationship between the two variables algebraically. The graph of the regression equation is called the regression line.

    The error sum of squares, denoted by SSE, is

    $$SSE = \sum (y - \hat{y})^2$$

    The values of a and b which give the minimum SSE are called the least squares estimates of A and B, and the regression line obtained with these estimates is called the least squares line.

    For the least squares regression line $\hat{y} = a + bx$,

    $$b = \frac{S_{XY}}{S_{XX}} \quad \text{and} \quad a = \bar{y} - b\bar{x}.$$

    The least squares regression line $\hat{y} = a + bx$ is also called the regression of y on x.
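    The least squares estimates can be obtained by minimizing SSE with calculus. The following is a sketch of the standard argument, writing SSE as a function of a and b and setting both partial derivatives to zero:

    $$SSE(a, b) = \sum (y - a - bx)^2$$

    $$\frac{\partial SSE}{\partial a} = -2 \sum (y - a - bx) = 0 \quad \Rightarrow \quad a = \bar{y} - b\bar{x}$$

    $$\frac{\partial SSE}{\partial b} = -2 \sum x (y - a - bx) = 0 \quad \Rightarrow \quad \sum xy = a \sum x + b \sum x^2$$

    Substituting $a = \bar{y} - b\bar{x}$ into the second equation and collecting terms gives

    $$\sum xy - \frac{(\sum x)(\sum y)}{n} = b \left( \sum x^2 - \frac{(\sum x)^2}{n} \right), \quad \text{that is,} \quad b = \frac{S_{XY}}{S_{XX}}.$$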

    Example 5.3. Find the regression equation for the data in Example 5.2.

    Solution.

    5.3.1 Interpretation of a and b

    Interpretation of a: a is the y-intercept of the regression line, that is, the value of y when $x = 0$.

    Interpretation of b: b is the slope of the regression line. The value of b in a regression line gives the change in y due to a change of one unit in x.


    Note:

    If b is positive, there is a positive linear relationship between x and y: as x increases, y increases, and as x decreases, y decreases.

    If b is negative, there is a negative linear relationship between x and y: as x increases, y decreases, and as x decreases, y increases.

    5.3.2 Using the regression equation for predictions

    If there is a linear correlation between x and y, the best predicted y-value is found by substituting the x-value into the regression equation.

    Example 5.4. Find the least squares regression line for the data in Example 5.1, using income as the independent variable and food expenditure as the dependent variable.
    a) What is the predicted food expenditure for a household with an income of RM3000?
    b) Give a brief interpretation of the values of a and b in the fitted regression line in this context.

    Solution.

    Income, x    Food expenditure, y    xy     x²      y²
    35           9                      315    1225    81
    49           15                     735    2401    225
    21           7                      147    441     49
    39           11
    15           5                      75     225     25
    28           8                      224    784     64
    25           9                      225    625     81

    Σx =         Σy = 64                Σxy =          Σx² =          Σy² =
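    One way to check the hand calculation for Example 5.4 is a short Python sketch like the one below. It computes the column totals, the least squares estimates, and the prediction for an income of RM3000 (x = 30, since incomes are in hundreds of RM). The variable names and the helper `predict` are illustrative assumptions.

```python
# Data from Example 5.1 (both variables in hundreds of RM).
income = [35, 49, 21, 39, 15, 28, 25]
food = [9, 15, 7, 11, 5, 8, 9]
n = len(income)

# Column totals of the working table.
sum_x = sum(income)
sum_y = sum(food)
sum_xy = sum(x * y for x, y in zip(income, food))
sum_x2 = sum(x * x for x in income)
sum_y2 = sum(y * y for y in food)

# Computational formulas for S_XY and S_XX.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n

# Least squares estimates: b = S_XY / S_XX, a = y_bar - b * x_bar.
b = s_xy / s_xx
a = sum_y / n - b * sum_x / n


def predict(x):
    """Predicted food expenditure (hundreds of RM) for income x."""
    return a + b * x


predicted = predict(30)  # RM3000 is 30 in units of hundreds of RM
print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(f"a = {a:.4f}, b = {b:.4f}, prediction = {predicted:.4f}")
```

    In this sketch the slope comes out near 0.264 and the intercept near 1.142; rounding b to 0.2642 before computing a gives 1.1414 instead. The prediction is roughly 9.07 hundred RM, or about RM907. Read in context, the slope suggests that each additional RM100 of income is associated with about RM26 more spent on food, while the intercept is the fitted expenditure at zero income, which lies outside the range of the observed data.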


    5.4 Coefficient of determination

    The total sum of squares, denoted by SST, is given by

    $$SST = \sum (y - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n} = S_{YY}$$

    Example 5.5. For the regression line in Example 5.4, find the value of its SSE and SST.

    Solution.

    x     y     $\hat{y} = 1.1414 + 0.2642x$    $y - \hat{y}$    $(y - \hat{y})^2$
    35    9     10.3884                         -1.3884          1.9277
    49    15    14.0872                         0.9128           0.8332
    21    7     6.6896                          0.3104           0.0963
    39    11    11.4452                         -0.4452          0.1982
    15    5
    28    8     8.5390                          -0.5390          0.2905
    25    9     7.7464                          1.2536           1.5715

    $\sum (y - \hat{y})^2 =$

    These values indicate that the sum of squared errors decreased from 60.8571 to 4.9283 when we used $\hat{y}$ in place of $\bar{y}$ to predict food expenditure.

    This reduction in squared errors is called the regression sum of squares and is denoted by SSR. Thus $SSR = SST - SSE$.
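    The three sums of squares can be reproduced with a few lines of Python. This sketch assumes the rounded regression line with intercept 1.1414 and slope 0.2642 used in Example 5.5; the variable names are illustrative.

```python
# Data from Example 5.1 (both variables in hundreds of RM).
income = [35, 49, 21, 39, 15, 28, 25]
food = [9, 15, 7, 11, 5, 8, 9]

# Rounded coefficients of the regression line quoted in Example 5.5.
a, b = 1.1414, 0.2642

y_bar = sum(food) / len(food)
fitted = [a + b * x for x in income]

# SSE: squared errors around the regression line.
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(food, fitted))
# SST: squared errors around the sample mean of y.
sst = sum((y - y_bar) ** 2 for y in food)
# SSR: the reduction achieved by using the line instead of the mean.
ssr = sst - sse

print(f"SSE = {sse:.4f}, SST = {sst:.4f}, SSR = {ssr:.4f}")
```

    With these inputs SSE and SST agree with the values 4.9283 and 60.8571 quoted in the text, to four decimal places.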

    The coefficient of determination, denoted by $r^2$, represents the proportion of SST that is explained by the use of the linear regression model.

    $$r^2 = \frac{SSR}{SST}$$

    The computational formula for $r^2$ is

    $$r^2 = \frac{S_{XY}^2}{S_{XX}\, S_{YY}}, \quad \text{and} \quad 0 \le r^2 \le 1.$$
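    The link between the definition SSR/SST and the computational formula can be sketched as follows, assuming the least squares estimates $b = S_{XY}/S_{XX}$ and $a = \bar{y} - b\bar{x}$, so that $y - \hat{y} = (y - \bar{y}) - b(x - \bar{x})$:

    $$SSE = \sum \left[ (y - \bar{y}) - b(x - \bar{x}) \right]^2 = S_{YY} - 2b\, S_{XY} + b^2 S_{XX} = S_{YY} - b\, S_{XY}$$

    The last step uses $b\, S_{XX} = S_{XY}$. Since $SST = S_{YY}$,

    $$SSR = SST - SSE = b\, S_{XY} = \frac{S_{XY}^2}{S_{XX}}, \qquad r^2 = \frac{SSR}{SST} = \frac{S_{XY}^2}{S_{XX}\, S_{YY}}.$$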

    Example 5.6. For the data in Example 5.4, calculate the coefficient of determination and interpret the result.
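    A possible numerical check for this example is sketched below in Python. It computes the coefficient of determination twice, once from SSR/SST using the unrounded least squares line and once from the computational formula, so the two routes can be compared. The variable names are illustrative assumptions.

```python
# Data from Example 5.1 (both variables in hundreds of RM).
income = [35, 49, 21, 39, 15, 28, 25]
food = [9, 15, 7, 11, 5, 8, 9]
n = len(income)

x_bar = sum(income) / n
y_bar = sum(food) / n

s_xx = sum(x * x for x in income) - sum(income) ** 2 / n
s_yy = sum(y * y for y in food) - sum(food) ** 2 / n
s_xy = sum(x * y for x, y in zip(income, food)) - sum(income) * sum(food) / n

# Unrounded least squares line.
b = s_xy / s_xx
a = y_bar - b * x_bar

# Route 1: definition, r^2 = SSR / SST with SSR = SST - SSE.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(income, food))
sst = s_yy
r2_definition = (sst - sse) / sst

# Route 2: computational formula, r^2 = S_XY^2 / (S_XX * S_YY).
r2_computational = s_xy ** 2 / (s_xx * s_yy)

print(f"{r2_definition:.4f} {r2_computational:.4f}")
```

    Both routes give a value of about 0.92, which can be read as roughly 92% of the total variation in food expenditure being explained by the linear regression on income.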