
We thankfully acknowledge Krishna Kanta Handiqui State Open

University (KKHSOU) for adoption of their e-content in our course

materials. However the screenshots of Microsoft Excel have been

captured from www.excel-easy.com.


GENERIC ELECTIVE IN

COMMERCE (GECO)

GECO-3

Business Statistics

BLOCK – 3

SIMPLE CORRELATION AND

REGRESSION ANALYSIS

UNIT-5 CORRELATION ANALYSIS

UNIT-6 REGRESSION ANALYSIS


UNIT 5 : CORRELATION ANALYSIS

Structure

5.0 Learning Objectives

5.1 Introduction

5.2 Correlation Analysis

5.3 Correlation and Causation

5.4 Types of Correlation

5.5 Methods of Measuring Correlation

5.5.1 Scatter Diagram

5.5.2 Karl Pearson's Coefficient of Correlation

5.5.3 Spearman's Rank Correlation

5.6 Probable error of correlation coefficient

5.7 Coefficient of determination

5.8 Correlation with the use of MS Excel

5.9 Let us Sum up

5.10 Further Readings

5.11 Answers to Check Your Progress

5.12 Model Questions

5.0 LEARNING OBJECTIVES

After going through this unit you will be able to:

Learn how correlation analysis expresses quantitatively the degree and

direction of the association between two variables.

Compute and interpret different measures of correlation, namely Karl Pearson's

Correlation Coefficient and Spearman's Rank Correlation Coefficient.

5.1 INTRODUCTION

In many business environments, we come across problems or situations where two

variables seem to move in the same direction such as both are increasing or

decreasing. At times an increase in one variable is accompanied by a decline in

another. Thus, if two variables are such that when one changes the other also changes

then the two variables are said to be correlated. The knowledge of such a relationship

is important to make inferences from the relationship between variables in a given

situation. For example, a marketing manager may be interested to investigate the

degree of relationship between advertising expenditure and the sales volume. The

manager would like to know whether the money he is going to spend on advertising

is justified or not in terms of sales generated.

Correlation analysis is used as a statistical technique to ascertain the association

between two quantitative variables. Usually, in correlation, the relationship between

two variables is expressed by a pure number i.e., a number without having any unit

of measurement. This pure number is referred to as coefficient of correlation or

correlation coefficient which indicates the strength and direction of statistical

relationship between variables.

It may be noted that correlation analysis is one of the most widely employed

statistical devices adopted by applied statisticians and has been used extensively not

only in biological problems but also in agriculture, economics, business and several

other fields. In this unit we shall introduce correlation analysis for two variables.

The importance of examining the relationship between two or more variables can be

stated in the form of the following questions, which accordingly require statistical

devices to arrive at conclusions:

i. Does there exist an association between two or more variables? If exists, what

is the form and the degree of that relationship?

ii. Is the relationship strong or significant enough to be useful to arrive at a

desirable conclusion?

iii. Can the relationship be used for predictions of future events, that is, to

forecast the most likely value of a dependent variable corresponding to the

given value of independent variable or variables?

The first two questions can be answered with the help of correlation analysis while

the final question can be answered by using the regression analysis.

In case of correlation analysis, the data on values of two variables must come from

sampling in pairs, one for each of the two variables.

5.2 CORRELATION ANALYSIS

By the term ‘correlation’ we mean the relationship between two variables. Two

variables are said to be correlated if the change in one variable results in a

corresponding change in the other variable. In the practical field we need to

investigate the type of relationship that might exist between the ages of husbands and

wives, the heights of fathers and sons, the amount of rainfall and the volume of

production of a particular crop, the price of a commodity and the demand for it, and

so on. Similarly, we may consider the example of the price of a commodity and the demand

for it. If the price of a commodity increases, there is a decline in its demand. Thus

there exists correlation between the two variables price and demand. In correlation,

we study the nature and the degree of relationship between the two variables.

Uses of Correlation Analysis:

In spite of certain limitations correlation analysis is a widely used statistical device.

With the help of correlation analysis one can ascertain the existence as well as degree

and direction of relations between two variables. It is an indispensable tool of

analysis for the people of Economics and Business. Variables in Economics and

Business are usually interrelated. In order to study the nature (positive or negative)

and degree (low, moderate or high) of relationship between any two of such related

variables correlation analysis is used. In reality, besides Business and Economics, it

is extensively used in various other branches.

5.3 CORRELATION AND CAUSATION

Correlation analysis helps us to have an idea about the degree and direction of the

relationship between the two variables under study. However, it fails to reflect upon

the cause and effect relationship between the variables. If there exists a cause and

effect relationship between the two variables, they are bound to vary in sympathy

with each other and, therefore, there is bound to be a high degree of correlation

between them. In other words, causation always implies correlation. However, the

converse is not true i.e., even a fairly high degree of correlation between the two

variables need not imply a cause and effect relationship between them. The high

degree of correlation between the variables may be due to the following reasons:

(i) Mutual dependence: The phenomena under study may mutually influence

each other. Such situations are usually observed in data relating to economic and

business situations. For example, variables like price, supply, and demand of a

commodity are mutually correlated. According to the principle of economics, as the

price of a commodity increases, its demand decreases, so price influences the

demand level. But if demand of a commodity increases due to growth in population,

then its price also increases. In this case increased demand makes an effect on the

price. However, the amount of export of a commodity is influenced by an increase or

decrease in custom duties but the reverse is normally not true.

(ii) Pure chance: It may happen that a small randomly selected sample from

a bivariate distribution (i.e., a distribution in which each unit of the series assumes

two values) may show a fairly high degree of correlation though, actually, such a

relationship may not exist in the universe. Such correlation may be due to chance


fluctuations. Moreover, the conscious or unconscious bias on the part of the

investigator, in the selection of the sample may also result in high degree of

correlation in the sample. It may be noted that in both the phenomena a fairly high

degree of correlation may be observed, though it is not possible to conceive them as

being causally related.

(iii) Influence of external factors: A high degree of correlation may be

observed between two variables due to the effect or interaction of a third variable or

a number of variables on each of these variables. For example, a fairly high degree of

correlation may be observed between the yield per hectare of two crops, say, rice

and potato, due to the effect of a number of factors like favorable weather conditions,

fertilizers used, irrigation facilities, etc., on each of them. But none of the two is the

cause of the other.

5.4 TYPES OF CORRELATION

Correlation may be broadly classified into the following three types:

(a) Positive, Negative and Zero Correlation,

(b) Linear and Non-linear Correlation,

(c) Simple, Partial and Multiple Correlation.

(a) Positive, Negative and Zero Correlation:

Positive Correlation: Two variables are said to be positively or directly correlated if

the values of the two variables deviate or move in the same direction i.e., if the

increase in the values of one variable results, on an average, in a corresponding

increase in the values of the other variable or if a decrease in the values of one

variable results, on an average, in a corresponding decrease in the values of the other

variable. For example, there exists positive correlation between the following pairs of

variables.

(i) The income and expenditure of a family on luxury items,

(ii) Advertising expenditure and the sales volume of a company,

(iii) Amount of rainfall and yield of a crop,

(iv) Height and weight of a student

(v) Price and supply of a commodity,

(vi) Temperature and sale of cold drinks on different days of a month in summer.


When the changes in two related variables are exactly proportional and are in the

same direction then we say that there is perfect positive correlation between them.

For example, there exists perfect positive correlation between the following pairs of

sets of data where each set of data may be assumed to be the values of a variable.

X: 10 20 30 40 50

Y: 2 4 6 8 10

U: 40 35 30 25 20 15 10

V: 14 12 10 8 6 4 2

Negative Correlation: The correlation between two variables is said to be negative

or inverse if the variables deviate in the opposite direction i.e., if the increase

(decrease) in the values of one variable results, on an average, in a corresponding

decrease (increase) in the values of the other variable. The following pairs of

variables are negatively correlated:

(i) Price and demand for a commodity,

(ii) Volume and pressure of a perfect gas,

(iii) Sale of woolen garments and the day temperature,

(iv) Number of workers and time required to complete a work

When the changes in two related variables are exactly proportional but are in the

opposite directions then we say that there is perfect negative correlation between

them. The correlation between each of the following pairs of variables is perfectly

negative:

X: 60 50 40 30 20

Y: 2 4 6 8 10

U: 0 1 2 3

V: 2 -3 -8 -13

Zero Correlation: Two variables are said to have zero correlation or no correlation

if they tend to change with no connection to each other. In such situation the

variables are said to be uncorrelated. For example, one should expect zero correlation

between the yield of crop and the heights of students, or between price of rice and

demand for sugar.


(b) Linear and Non-linear Correlation: The correlation between two variables is

said to be linear if corresponding to a unit change in one variable, there is a constant

change in the other variable over the entire range of the values. The following

example illustrates a linear correlation between the two variables X and Y.

X: 10 20 30 40 50

Y: 40 60 80 100 120

When these pairs of values X and Y are plotted on a graph paper, the line obtained

by joining the points would be a straight line.

In general, two variables X and Y are said to be linearly related if the relationship

that exists between the two variables is of the form given by,

Y = a + bX

where 'b' is the slope and 'a' the intercept.
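As a quick illustrative sketch (not part of the original text; the values a = 20 and b = 2 are read off from the data above, they are not stated in the unit), the X and Y pairs given earlier do satisfy a relation of this linear form:

```python
# The linearly related X-Y pairs from the example above.
xs = [10, 20, 30, 40, 50]
ys = [40, 60, 80, 100, 120]

# A unit change in X always produces the same change in Y, so Y = a + bX.
b = (ys[1] - ys[0]) / (xs[1] - xs[0])  # slope: 20/10 = 2.0
a = ys[0] - b * xs[0]                  # intercept: 40 - 2*10 = 20.0
print(a, b)                            # 20.0 2.0
print(all(y == a + b * x for x, y in zip(xs, ys)))  # True
```

Every pair fits the same line exactly, which is what "constant change over the entire range" means.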

On the other hand, a non-linear correlation indicates a varying rate of change in one

variable with respect to changes in the values of the other variable. In other words,

correlation is said to be non-linear when the amount of change in the values of one

variable does not bear a constant ratio to the amount of change in the corresponding

values of another variable. The following example illustrates a non-linear correlation

between the given variables.

X: 8 9 9 10 10 28 29 30

Y: 80 130 170 150 230 560 460 600

When these pairs of values are plotted on a graph paper, the line obtained by joining

these points would not be a straight line; rather, it would be curvilinear.

(c) Simple, Partial and Multiple Correlation: The distinction amongst Simple,

Partial and Multiple Correlation depends upon the number of variables involved

under study.

In Simple correlation only two variables are introduced to study the relationship

between them. A study on income with respect to saving only, or sales revenue with

respect to amount of money spent on advertisement etc. are a few examples studied

under Simple Correlation.

When the study involves more than two variables then it is a problem of either Partial

or Multiple Correlation. In Partial Correlation, we study relationship between two

variables while the effect of other variables is held constant. In other words, in Partial

Correlation we study the linear relationship between a dependent variable and one

particular independent variable out of a set of independent variables when all other

variables are held constant. For example, suppose our study involves three variables

X1, X2 and Y where X1 is the number of hours studied, X2 is I.Q. and Y is the marks

secured in the examination. Now if we study the relationship between number of

hours (X1) and marks obtained (Y) by the student keeping the effect of I.Q. (X2)

constant then it is a problem of Partial Correlation.

In Multiple Correlation three or more than three variables are studied simultaneously.

For example, the study of the relationship between the production of a particular crop

on one side and rainfall and use of fertilizer on the other side falls under Multiple

Correlation.

CHECK YOUR PROGRESS

Q 1: State whether the following statements are true or false:

(i) Correlation helps to formulate the relationship between the variables.

(ii) If the relationship between variables x and y is positive, then as the variable y

decreases, the variable x increases.

(iii) In a negative relationship as x increases, y decreases.

(iv) Multiple correlation deals with studying three or more than three variables

simultaneously.

5.5 METHODS OF MEASURING CORRELATION

Here we shall confine our discussion to the methods of measuring linear

relationships only. The commonly used methods for studying the correlation between

two variables are:

(i) Scatter Diagram method

(ii) Karl Pearson’s correlation coefficient method

(iii) Spearman’s Rank correlation method

(iv) Concurrent Deviation method

5.5.1 SCATTER DIAGRAM METHOD

Scatter Diagram is one of the simplest methods of diagrammatic representation of a

bivariate distribution and used to study the nature (i.e., positive, negative and zero)

and degree (i.e., weak or strong) of correlation between two variables. A scatter

diagram can be obtained on a graph paper by plotting observed pairs of values of


variables x and y, considering the independent variable values on the x-axis and the

dependent variable values on the y-axis. Suppose we are given n pairs of values

(x1, y1), (x2, y2), ……, (xn, yn) of two variables X and Y. These n points may be

plotted as dots (•) in the xy-plane. The diagram of dots so obtained is known as a

scatter diagram. From the scatter diagram we can form a fairly good, though rough, idea

about the existence of relationship between the two variables. After plotting the

points in the xy plane we may have one of the types of scatter diagrams as shown

below:

Fig 5.1


Now with the help of Scatter Diagram, we can interpret the correlation between the

two variables as:

(i) If the points are very close to each other, then we can expect a fairly good amount

of correlation between the two variables. On the other hand, if there appears to be no

obvious pattern of the points of the scatter diagram then it indicates that there is

either no correlation or very low amount of correlation between the variables.

(ii) If the points on the scatter diagram reveal any trend (either upward or downward),

the variables are said to be correlated and the variables are uncorrelated if no trend is

revealed.

(iii) If there is an upward trend rising from lower left hand corner to the upper right

hand corner then we expect a positive correlation because in such a situation both the

variables move in the same direction. On the other hand, if the points on the scatter

diagram depict a downward trend starting from upper left hand corner to the lower

right hand corner, the correlation is negative since in this case the values of the two

variables run in the opposite direction.

(iv) In particular, if all the points lie on a straight line starting from the left bottom

and going up towards the right top, the correlation is perfect and positive, and if all

the points lie on a straight line starting from left top and coming down to right

bottom, the correlation is perfect and negative.

Remark: 1. The Scatter Diagram method enables us to form a rough idea of the

nature of the relationship between the two variables simply by inspection of the

graph. However, this method is not suitable for situations involving a large number of

observations.

2. The method of scatter diagram provides information only about the nature of the

relationship that is whether it is positive or negative and whether it is high or low but

fails to provide an exact measure of the extent of the relationship between the two

variables.

Example 5.1: The percentage examination scores of 10 students in Data Analysis

and Economics were as follows. Draw a scatter diagram for the data and comment on

the nature of correlation.

Student : A B C D E F G H I J

Data Analysis: 65 90 52 44 95 36 48 63 80 15

Economics : 62 71 58 58 64 40 42 66 67 55


Solution: A scatter diagram will give a preliminary indication of whether linear

correlation exists. We plot the ordered pairs (65, 62), (90, 71), ……, (15, 55) as

shown in the following figure.

Fig 5.2

Since the points are very close to each other, we may expect a high degree of

correlation. Further, since the points reveal an upward trend starting from left bottom

to top right hand corner, the correlation is positive. Hence, we conclude that there

exists a high degree of positive correlation between the scores of the students in Data

Analysis and Economics.

5.5.2 KARL PEARSON’S CORRELATION COEFFICIENT METHOD

The scatter diagram gives a rough indication of the nature and extent/strength of the

relationship between the two variables. The quantitative measurement of the degree

of linear relationship between two variables, say 'x' and 'y', is given by a parameter

called the correlation coefficient. It was developed by Karl Pearson. Karl Pearson's

method of measuring correlation between two variables is called the coefficient of

correlation or correlation coefficient. It is also known as the product moment coefficient.

For a set of n pairs of values of x and y, Karl Pearson's correlation coefficient,

usually denoted by r_XY, r_xy, or simply r, is defined by,

r = Cov(X, Y) / √[Var(X) · Var(Y)]   ……(11.1)

Or, r = Cov(X, Y) / (σ_X σ_Y)

where Cov(X, Y) = (1/n) Σ (X − X̄)(Y − Ȳ),

Var(X) = σ_X² = (1/n) Σ (X − X̄)²,

Var(Y) = σ_Y² = (1/n) Σ (Y − Ȳ)².

Substituting these values in the definition of r, we have

r = Σ (X − X̄)(Y − Ȳ) / √[Σ (X − X̄)² · Σ (Y − Ȳ)²]   ……(11.2)

= [n ΣXY − (ΣX)(ΣY)] / √[{n ΣX² − (ΣX)²} {n ΣY² − (ΣY)²}]   ……(11.3)

Step deviation method for ungrouped data: When the actual mean values X̄ and Ȳ

are in fractions, the calculation of Pearson's correlation coefficient can be simplified

by taking deviations of x and y values from their assumed means A and B,

respectively. Thus when d_x = X − A and d_y = Y − B, where A and B are assumed

means of x and y values, the formula (11.3) becomes

r = [n Σ d_x d_y − (Σ d_x)(Σ d_y)] / √[{n Σ d_x² − (Σ d_x)²} {n Σ d_y² − (Σ d_y)²}]   ……(11.4)

Step deviation method for grouped data: When data on x and y values are

classified or grouped into a frequency distribution, (11.3) is modified as:

r = [n Σ f d_x d_y − (Σ f d_x)(Σ f d_y)] / √[{n Σ f d_x² − (Σ f d_x)²} {n Σ f d_y² − (Σ f d_y)²}]   ……(11.5)
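As an illustrative sketch (not part of the original text), formula (11.3) translates directly into a few lines of Python:

```python
from math import sqrt

def pearson_r(x, y):
    """Karl Pearson's correlation coefficient by formula (11.3):
    r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx**2) * (n*Syy - Sy**2))."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    sxy = sum(a * b for a, b in zip(x, y))
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# The perfectly correlated pairs from Section 5.4:
print(pearson_r([10, 20, 30, 40, 50], [2, 4, 6, 8, 10]))   # 1.0
print(pearson_r([60, 50, 40, 30, 20], [2, 4, 6, 8, 10]))   # -1.0
```

The two data sets from Section 5.4 come out as exactly +1 and −1, matching the definitions of perfect positive and perfect negative correlation.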

Assumptions of Using Pearson’s Correlation Coefficient:

Karl Pearson's correlation coefficient r_XY is based on the following four

assumptions:

(i) It is appropriate to calculate when both variables x and y are measured

on an interval or a ratio scale.

(ii) The random variables X and Y are normally distributed.

(iii) There is a linear relationship between the variables.

(iv) There is a cause and effect relationship between two variables that

influences the distributions of both the variables.

Merits and Limitations of Correlation Coefficient: The correlation coefficient is a

number lying between −1 and +1 that summarizes the magnitude as well as

direction of association between two variables. The chief merits of this method are

given below:

(i) Karl Pearson’s coefficient of correlation is a widely used statistical

device.

(ii) It summarizes the degree (high, moderate or low) in one figure.

(iii) It is based on all the observations.

However, analysis based on Pearsonian coefficient is subject to certain severe

limitations which are presented as:

(i) A value of r_XY which is near to zero does not necessarily indicate that the two

variables X and Y are unrelated; it merely indicates that there is no linear

relationship between them. There may be a curvilinear or some complex relationship

between the two variables which Pearson's formula cannot detect, as it is an

instrument for measuring linear correlation only.

(ii) Correlation is only a measure of the nature and degree of relationship between

two variables and it gives no indication of the kind of cause and effect relationship

that may exist between the two variables. It fails to identify the variables as

dependent or independent variables. Correlation theory simply seeks to discover if a

covariation between two variables exists or not. Statistical correlation technique may

reveal a very close relationship between two variables, but it cannot tell us about

cause and effect relationship between them or which variable causes the other to react.


(iii) Two uncorrelated variables may exhibit a high degree of correlation between

them. For example, the data relating to the yield of rice and wheat may show a fairly

high degree of positive correlation although there is no connection between the two

variables, viz., yield of rice and yield of wheat. This may be due to the favourable

impact of extraneous factors like weather conditions, fertilizers used, irrigation

facilities, improved variety of seeds etc. on both of them.

(iv) Sometimes high correlation between two variables may be entirely spurious.

However, such a high correlation may exist due to chance and consequently such

correlations are termed as chance correlations.

Properties of Correlation Coefficient:

Some of the important properties of Correlation coefficient are given below:

(a) Correlation coefficient is a pure number.

(b) Correlation coefficient is independent of change of origin and scale of

measurement.

We shall now establish property (b).

Proof: Suppose we want to study the relationship between two variables X and Y.

Let these variables be transformed to the new variables U and V by the change of

origin and scale, viz.,

U = (X − a)/h,   V = (Y − b)/k   ……(11.6)

where a and b are known as assumed means or origins and h and k are known as

scales of measurement.

Therefore from (11.6) we have,

X = a + hU and Y = b + kV   ……(11.7)

Summing both sides and dividing by n we get,

X̄ = a + hŪ and Ȳ = b + kV̄   ……(11.8)

Subtracting (11.8) from (11.7) we have,

X − X̄ = h(U − Ū) and Y − Ȳ = k(V − V̄)

Putting these values in equation (11.2) we get,

r_XY = Σ h(U − Ū) · k(V − V̄) / √[Σ h²(U − Ū)² · Σ k²(V − V̄)²]

= hk Σ (U − Ū)(V − V̄) / [hk √{Σ (U − Ū)² · Σ (V − V̄)²}]

= Σ (U − Ū)(V − V̄) / √[Σ (U − Ū)² · Σ (V − V̄)²]

= r_UV

Since r_XY = r_UV, the correlation coefficient between the two original

variables X and Y is equal to the correlation coefficient between the new variables U

and V (where the new variables U and V are obtained from X and Y respectively

after changing the origin and scale of X and Y). Hence, we conclude that the correlation

coefficient is independent of change of origin and scale of measurement.
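Property (b) is easy to verify numerically. The following is a small sketch (the shifts a = 100, b = 10 and scales h = 2, k = 5 are arbitrary choices made for illustration, not values from the text):

```python
from math import sqrt

def pearson_r(x, y):
    # r by formula (11.2): deviations from the arithmetic means.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

x = [3, 7, 11, 15, 19]
y = [24, 20, 30, 28, 38]
u = [(xi - 100) / 2 for xi in x]   # U = (X - a)/h with a = 100, h = 2
v = [(yi - 10) / 5 for yi in y]    # V = (Y - b)/k with b = 10, k = 5
print(abs(pearson_r(x, y) - pearson_r(u, v)) < 1e-9)  # True
```

Shifting and rescaling both series leaves r unchanged (up to floating-point rounding), exactly as the proof shows.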

(c) Correlation coefficient lies between −1 and +1, i.e., −1 ≤ r ≤ +1.

Proof: Let us consider two variables X and Y with arithmetic means X̄ and Ȳ

and with standard deviations σ_X and σ_Y respectively.

Let us now consider the sum of squares

Σ [ (X − X̄)/σ_X ± (Y − Ȳ)/σ_Y ]²

which is always non-negative,

i.e., Σ [ (X − X̄)/σ_X ± (Y − Ȳ)/σ_Y ]² ≥ 0

⇒ Σ (X − X̄)²/σ_X² + Σ (Y − Ȳ)²/σ_Y² ± 2 Σ (X − X̄)(Y − Ȳ)/(σ_X σ_Y) ≥ 0

Now dividing by n, we get

(1/σ_X²) (1/n) Σ (X − X̄)² + (1/σ_Y²) (1/n) Σ (Y − Ȳ)² ± (2/(σ_X σ_Y)) (1/n) Σ (X − X̄)(Y − Ȳ) ≥ 0

⇒ 1 + 1 ± 2r ≥ 0

⇒ 2 ± 2r ≥ 0

⇒ 1 + r ≥ 0 and 1 − r ≥ 0

⇒ r ≥ −1 and r ≤ +1

⇒ −1 ≤ r ≤ +1


Remarks: 1. This property provides us a check on our calculations. If, in any problem,

the obtained value of r lies outside the limits ±1, this implies that there is some

mistake in our calculations.

2. r = +1 indicates perfect positive correlation between the variables and r = −1

indicates perfect negative correlation between the variables.

(d) If X and Y are two independent variables, then r_XY = 0, but the converse is not true.

Proof: We have

r = Cov(X, Y) / (σ_X σ_Y) = E[(X − X̄)(Y − Ȳ)] / √[E(X − X̄)² · E(Y − Ȳ)²]   ……(11.9)

Now, E[(X − X̄)(Y − Ȳ)] = E[XY − XȲ − X̄Y + X̄Ȳ]

= E(XY) − Ȳ E(X) − X̄ E(Y) + X̄Ȳ

= E(XY) − X̄Ȳ − X̄Ȳ + X̄Ȳ   (since E(X) = X̄ and E(Y) = Ȳ)

= E(XY) − X̄Ȳ

= E(X) E(Y) − X̄Ȳ   (since X and Y are given to be independent, E(XY) = E(X) E(Y))

= X̄Ȳ − X̄Ȳ

= 0

∴ E[(X − X̄)(Y − Ȳ)] = 0

Therefore from (11.9),

r = 0 / √[E(X − X̄)² · E(Y − Ȳ)²] = 0

Thus when two variables are independent, they are uncorrelated, i.e., r_XY = 0.

But the converse is not true:

If two variables are uncorrelated then they may not be independent. Since we know

that Karl Pearson's correlation coefficient is a measure of only linear correlation

between two variables, r_XY = 0 indicates that there exists no linear

relationship between the variables. There may, however, exist a strong non-linear or

curvilinear relationship between x and y even though r_XY = 0.
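A minimal numerical sketch of this last point (illustrative, not from the text): take Y = X² with X symmetric about zero, so Y is completely determined by X, yet the linear correlation vanishes.

```python
x = [-2, -1, 0, 1, 2]
y = [xi * xi for xi in x]          # [4, 1, 0, 1, 4]: Y is a function of X
n = len(x)
mx, my = sum(x) / n, sum(y) / n    # 0.0 and 2.0
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
print(cov)  # 0.0  -> r_XY = 0 although X and Y are clearly dependent
```

The positive and negative products cancel exactly, so Cov(X, Y) = 0 and hence r_XY = 0 despite the perfect (curvilinear) dependence.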

(e) Correlation coefficient is symmetric, i.e., r_XY = r_YX.

Proof: We have,

r_XY = Σ (X − X̄)(Y − Ȳ) / √[Σ (X − X̄)² · Σ (Y − Ȳ)²]   ……(*)

Now interchanging X and Y we get,

r_YX = Σ (Y − Ȳ)(X − X̄) / √[Σ (Y − Ȳ)² · Σ (X − X̄)²]

= Σ (X − X̄)(Y − Ȳ) / √[Σ (X − X̄)² · Σ (Y − Ȳ)²]   ……(**)

From (*) and (**) we find that r_XY = r_YX.

Interpretation of Various values of Correlation Coefficient:

The interpretations of various values of r_XY are as follows:

(i) 0 < r_XY < 1 implies that there is positive correlation between X and Y. The closer

the value of r_XY to +1, the stronger is the positive correlation.

(ii) r_XY = +1 implies that there exists perfect and positive correlation between the

variables.

(iii) −1 < r_XY < 0 implies that there is negative correlation between X and Y. The

closer the value of r_XY to −1, the stronger is the negative correlation.

(iv) r_XY = −1 indicates that the correlation between the variables is perfect and

negative.

(v) r_XY = 0 means that there is no correlation between the variables and hence the

variables are said to be uncorrelated.

CHECK YOUR PROGRESS

Q 2: Given Σ (X − X̄)(Y − Ȳ) = 43, Σ (X − X̄)² = 32 and Σ (Y − Ȳ)² = 72,

what is the correlation coefficient r?

Q 3: Given r = 0.25, Cov(X, Y) = 3.6 and Var(X) = 36, what is S.D.(Y)?

Example 5.2: The following data gives indices of industrial production and number

of registered unemployed people (in lakh). Determine Karl Pearson’s correlation

coefficient

Year    Index of Production    Number Unemployed

1991 100 15

1992 102 12

1993 104 13

1994 107 11

1995 105 12

1996 112 12

1997 103 19

1998 99 26

Solution: To calculate the Karl Pearson’s correlation coefficient we prepare the

following table:

Year    X     X − X̄   (X − X̄)²   Y     Y − Ȳ   (Y − Ȳ)²   (X − X̄)(Y − Ȳ)

1991    100   −4      16         15    0       0          0

1992    102   −2      4          12    −3      9          6

1993    104   0       0          13    −2      4          0

1994    107   3       9          11    −4      16         −12

1995    105   1       1          12    −3      9          −3

1996    112   8       64         12    −3      9          −24

1997    103   −1      1          19    4       16         −4

1998    99    −5      25         26    11      121        −55

Total   832   0       120        120   0       184        −92

X̄ = ΣX/n = 832/8 = 104;   Ȳ = ΣY/n = 120/8 = 15

Thus Karl Pearson's Correlation Coefficient,

r_XY = Σ (X − X̄)(Y − Ȳ) / √[Σ (X − X̄)² · Σ (Y − Ȳ)²]

= −92 / √(120 × 184)

= −0.619

Since the coefficient of correlation r = −0.619 is moderately negative, it indicates that

there is a moderately large inverse correlation between the two variables. Hence, we

conclude that as the production index increases, the number of unemployed

decreases and vice-versa.
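The arithmetic of Example 5.2 can be double-checked with a short script (an illustrative sketch using formula (11.2)):

```python
from math import sqrt

x = [100, 102, 104, 107, 105, 112, 103, 99]  # index of production
y = [15, 12, 13, 11, 12, 12, 19, 26]         # number unemployed (lakh)
n = len(x)
mx, my = sum(x) / n, sum(y) / n              # 104.0 and 15.0
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # -92.0
sxx = sum((a - mx) ** 2 for a in x)                   # 120.0
syy = sum((b - my) ** 2 for b in y)                   # 184.0
print(round(sxy / sqrt(sxx * syy), 3))  # -0.619
```

The intermediate sums (−92, 120, 184) match the totals row of the table above.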

Example 5.3: The following data relate to age of employees and the number of days

they reported sick in a month.

Employees Age Sick

days

1 30 1

2 32 0

3 35 2

4 40 5

5 48 2

6 50 4

7 52 6

8 55 5

9 57 7

10 61 8

Calculate Karl Pearson’s coefficient of correlation and interpret it.

Solution: Let age and sick days be represented by variables X and Y respectively.

Then Karl Pearson’s

Correlation coefficient

r_XY = Σ (X − X̄)(Y − Ȳ) / √[Σ (X − X̄)² · Σ (Y − Ȳ)²]

Now we prepare the following table for calculation:

X (Age)   X − X̄   (X − X̄)²   Y (Sick days)   Y − Ȳ   (Y − Ȳ)²   (X − X̄)(Y − Ȳ)

30        −16     256        1               −3      9          48

32        −14     196        0               −4      16         56

35        −11     121        2               −2      4          22

40        −6      36         5               1       1          −6

48        2       4          2               −2      4          −4

50        4       16         4               0       0          0

52        6       36         6               2       4          12

55        9       81         5               1       1          9

57        11      121        7               3       9          33

61        15      225        8               4       16         60

460       0       1092       40              0       64         230

X̄ = ΣX/n = 460/10 = 46;   Ȳ = ΣY/n = 40/10 = 4

Therefore we have,

r_XY = Σ (X − X̄)(Y − Ȳ) / √[Σ (X − X̄)² · Σ (Y − Ȳ)²]

= 230 / √(1092 × 64)

= 0.870

Since the coefficient of correlation is close to +1 and positive, the age of employees

and the number of sick days are positively correlated to a high degree. Hence we

conclude that as the age of an employee increases, he is likely to go on sick leave

more often than others.
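The same answer falls out of the step deviation formula (11.4); the following sketch uses assumed means A = 45 and B = 4 (arbitrary convenient values chosen for illustration; by property (b) any choice gives the same r):

```python
from math import sqrt

ages = [30, 32, 35, 40, 48, 50, 52, 55, 57, 61]
sick = [1, 0, 2, 5, 2, 4, 6, 5, 7, 8]
A, B = 45, 4                       # assumed means for X and Y
dx = [x - A for x in ages]
dy = [y - B for y in sick]
n = len(ages)
# Formula (11.4): no true means needed, only sums of the deviations.
num = n * sum(a * b for a, b in zip(dx, dy)) - sum(dx) * sum(dy)
den = sqrt((n * sum(a * a for a in dx) - sum(dx) ** 2)
           * (n * sum(b * b for b in dy) - sum(dy) ** 2))
print(round(num / den, 2))  # 0.87
```

This agrees with the direct calculation in Example 5.3 to the rounding shown.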

Example 5.4: The following data gives Sales and Net Profit for some of the top

Auto-makers during the quarter July-September 2006. Find the correlation

coefficient.

Company Average sales

estimates

(Rs.’00crores )

Average Net profit

Estimates(Rs.’0crores)

Tata Motors 65.00 47

Hero Honda 22.00 22

Bajaj Auto 24.00 34.5

TVS Motor 10.00 3.5

Bharat Forge 5 6

Ashok Leyland 16.00 9

M & M 24.00 20

Maruti Udyog 34.00 32


Solution: Let the average sales and average net profit of the given automobiles be denoted by X and Y respectively. Then the correlation coefficient is given by

r = [nΣXY - ΣX ΣY] / √[(nΣX² - (ΣX)²)(nΣY² - (ΣY)²)]   [Using equation (11.3)]

Now we make the following table for calculation:

Company         X       Y      X²      Y²        XY
Tata Motors     65.00   47.0   4225    2209.00   3055
Hero Honda      22.00   22.0   484     484.00    484
Bajaj Auto      24.00   34.5   576     1190.25   828
TVS Motor       10.00   3.5    100     12.25     35
Bharat Forge    5.00    6.0    25      36.00     30
Ashok Leyland   16.00   9.0    256     81.00     144
M & M           24.00   20.0   576     400.00    480
Maruti Udyog    34.00   32.0   1156    1024.00   1088
Total           200     174    7398    5436.50   6144

Therefore the correlation coefficient is given by

r = (8 × 6144 - 200 × 174) / √[(8 × 7398 - 200²)(8 × 5436.50 - 174²)]
  = (49152 - 34800) / √[(59184 - 40000)(43492 - 30276)]
  = 14352 / √(19184 × 13216)
  = 14352 / (138.51 × 114.96)
  = 14352 / 15922.8
  = 0.90135
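The computational (raw-sums) formula above is easy to translate into code. The following sketch (ours, for illustration) reproduces the Example 5.4 result:

```python
# Pearson's r from raw sums (Example 5.4 data)
from math import sqrt

sales  = [65, 22, 24, 10, 5, 16, 24, 34]         # X, Rs. '00 crores
profit = [47, 22, 34.5, 3.5, 6, 9, 20, 32]       # Y, Rs. '0 crores
n = len(sales)

sx, sy = sum(sales), sum(profit)                 # 200, 174.0
sxx = sum(x * x for x in sales)                  # 7398
syy = sum(y * y for y in profit)                 # 5436.5
sxy = sum(x * y for x, y in zip(sales, profit))  # 6144.0

r = (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 3))  # 0.901
```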

Correlation Coefficient in case of Grouped Data:

In the case of a bivariate frequency distribution, when we have to deal with a large volume of data, the observations are classified in the form of a two-way frequency table known as a bivariate table or correlation table. Here, for each of the variables, the values are classified into different classes following the same considerations as in the case of a univariate distribution. If there are m classes for the values of the X variable and n classes for the values of the Y variable, then there will be m × n cells in the two-way table.

Now we shall discuss the calculation of Karl Pearson's correlation coefficient with the help of the following example.

Example 5.5: Family income and its percentage spent on food in the case of 100 families gave the following bivariate frequency distribution. Calculate the coefficient of correlation.

Food Expenditure          Family income (Rs.)
(in %)            200-300   300-400   400-500   500-600   600-700   Total
10-15               __        __        __        3         7        10
15-20               __        4         9         4         3        20
20-25               7         6         12        5         __       30
25-30               3         10        19        8         __       40
Total               10        20        40        20        10       100

Solution: Let us denote the income (in Rs.) and the food expenditure (%) by the variables X and Y respectively. Now, to calculate Karl Pearson's coefficient of correlation, we follow these steps:

Step 1: Find the mid points of the various classes for the X and Y series.

Step 2: Change the origin and scale in the X series and Y series to the new variables u and v by using the transformations

u = (x - A)/h = (x - 450)/100,  and  v = (y - B)/k = (y - 17.5)/5

where x and y denote mid points of the X series and Y series respectively, and h and k denote the magnitude of the classes of the X and Y series respectively.

Step 3: For each class of X, find the total of cell frequencies over all the classes of Y, and similarly for each class of Y find the total of cell frequencies over all the classes of X.

Step 4: Multiply the frequencies of X by the corresponding values of the variable u and find the sum Σfu.

Step 5: Multiply the frequencies of Y by the corresponding values of the variable v and find the sum Σfv.

Step 6: Multiply the frequency of each cell by the corresponding values of u and v and write the product f·u·v within a square in the right hand top corner of each cell.

Step 7: Add together all the figures in the top corner squares as obtained in Step 6 to get the last column (fuv) for each of the X and Y series. Finally, find the total of the last column to get Σfuv.

Step 8: Multiply the values of fu and fv by the corresponding values of u and v to get the columns for fu² and fv². Add these values to obtain Σfu² and Σfv².

The above calculations are presented in the table below:

Family income (Rs.):   200-300   300-400   400-500   500-600   600-700
Mid point (x):         250       350       450       550       650
u:                     -2        -1        0         1         2

(Each cell below shows the frequency f, with the product f·u·v in brackets.)

Food exp.  Mid pt.  v |  u = -2    u = -1    u = 0    u = 1    u = 2   |   f     fv     fv²    fuv
10-15      12.5    -1 |   __        __        __      3 [-3]   7 [-14] |  10    -10     10     -17
15-20      17.5     0 |   __       4 [0]     9 [0]    4 [0]    3 [0]   |  20     0      0       0
20-25      22.5     1 |  7 [-14]   6 [-6]   12 [0]    5 [5]     __     |  30     30     30     -15
25-30      27.5     2 |  3 [-12]  10 [-20]  19 [0]    8 [16]    __     |  40     80     160    -16
f                     |   10        20        40       20       10     | N=100  Σfv=100 Σfv²=200 Σfuv=-48
fu                    |  -20       -20        0        20       20     | Σfu=0
fu²                   |   40        20        0        20       40     | Σfu²=120
fuv                   |  -26       -26        0        18      -14     | Σfuv=-48

r_uv = [NΣfuv - Σfu·Σfv] / √[(NΣfu² - (Σfu)²)(NΣfv² - (Σfv)²)]
     = [100 × (-48) - 0 × 100] / √[(100 × 120 - 0²)(100 × 200 - 100²)]
     = -4800 / √(12000 × 10000)
     = -48 / √12000
     = -0.4381

Since the correlation coefficient is independent of change of origin and scale of measurement, therefore, r_XY = r_uv = -0.4381.
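The step-deviation calculation for the grouped data above can be verified with the following sketch (ours, for illustration), which encodes the cell frequencies and the coded values u and v:

```python
# Step-deviation (coded) Pearson's r for the Example 5.5 bivariate table
from math import sqrt

# Rows = food-expenditure classes (v = -1, 0, 1, 2);
# columns = income classes (u = -2, -1, 0, 1, 2); 0 marks an empty cell.
freq = [
    [0, 0, 0, 3, 7],    # 10-15
    [0, 4, 9, 4, 3],    # 15-20
    [7, 6, 12, 5, 0],   # 20-25
    [3, 10, 19, 8, 0],  # 25-30
]
u_vals = [-2, -1, 0, 1, 2]
v_vals = [-1, 0, 1, 2]

N = sum(sum(row) for row in freq)                                  # 100
fu = sum(f * u for row in freq for f, u in zip(row, u_vals))       # 0
fv = sum(sum(row) * v for row, v in zip(freq, v_vals))             # 100
fu2 = sum(f * u * u for row in freq for f, u in zip(row, u_vals))  # 120
fv2 = sum(sum(row) * v * v for row, v in zip(freq, v_vals))        # 200
fuv = sum(f * u * v for row, v in zip(freq, v_vals)
          for f, u in zip(row, u_vals))                            # -48

r = (N * fuv - fu * fv) / sqrt((N * fu2 - fu ** 2) * (N * fv2 - fv ** 2))
print(round(r, 3))  # -0.438
```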

5.5.3 SPEARMAN'S RANK CORRELATION COEFFICIENT

So far, we have confined our discussion to correlation between two variables which can be measured and quantified in appropriate units of money, time, etc. Sometimes, however, the data on two variables is given in the form of ranks of the two variables based on some criterion. Here we introduce a method to study correlation between the ranks of the variables rather than their absolute values. This method was developed by the British psychologist Charles Edward Spearman in 1904. In other words, this method is used in situations in which a quantitative measure of certain qualitative factors, such as judgement, brand personalities, TV programmes, leadership, colour or taste, cannot be fixed, but individual observations can be arranged in a definite order. The ranking is assigned by using a set of ordinal rank numbers, with 1 for the individual observation ranked first, either in terms of quantity or quality, and n for the individual observation ranked last in a group of n pairs of observations. Mathematically, Spearman's rank correlation coefficient is defined as:

R = 1 - 6Σd² / (n(n² - 1))    …………(11.11)

where R = rank correlation coefficient,
d = the difference between the pairs of ranks of the same individual in the two characteristics,
n = the number of pairs.

Advantages and Disadvantages of Spearman's Correlation Coefficient Method:

Advantages:

(i) It is easy to understand and its application is simpler than Pearson's method.
(ii) It can be used to study correlation when variables are expressed in qualitative terms like beauty, intelligence, honesty, efficiency and so on.
(iii) It is appropriate for measuring the association between two variables if the data type is at least ordinal scaled (ranked).
(iv) The sample values of the two variables are converted into ranks, either in ascending or descending order, for calculating the degree of correlation between the two variables.

Disadvantages:

(i) The values of both variables are assumed to be normally distributed and to describe a linear rather than a non-linear relationship.
(ii) It is not applicable in the case of a bivariate frequency distribution.
(iii) It needs a large amount of computational time when the number of pairs of values of the two variables exceeds 30.

Case I: When ranks are given

When observations in a data set are already arranged in a particular order (ranked), take the differences in pairs of observations to determine d. Square these differences and obtain the total Σd². Finally, apply formula (11.11) to calculate the correlation coefficient.

Example 5.6: An office has 12 clerks. The long service clerks feel that they should

have a seniority increment based on length of service built into their salary structure.

An assessment of their efficiency by their departmental manager and the personnel

department produces a ranking of efficiency. This is shown below together with a

ranking of their length of service.

Ranking according to

length of service : 1 2 3 4 5 6 7 8 9 10 11 12

Ranking according

to efficiency : 2 3 5 1 9 10 11 12 8 7 6 4

Do the data support the clerks' claim for a seniority increment?

Solution: To determine whether the data support the clerks' claim, we use Spearman's correlation coefficient, which is given by

R = 1 - 6Σd² / (n(n² - 1))

Since the ranks have already been assigned in the given data, we prepare the following table for calculation.

Ranking according to      Ranking according to      Difference      d²
length of service (R1)    efficiency (R2)           d = R1 - R2

1 2 -1 1

2 3 -1 1

3 5 -2 4

4 1 3 9

5 9 -4 16

6 10 -4 16

7 11 -4 16

8 12 -4 16

9 8 1 1

10 7 3 9

11 6 5 25

12 4 8 64

Total 178

Therefore, Spearman's correlation coefficient is given by

R = 1 - (6 × 178)/(12(12² - 1)) = 1 - 1068/1716 = 0.378

Thus from the result we observe that there exists a low degree of positive correlation between length of service and efficiency. Therefore the claim of the clerks for a seniority increment based on length of service is not justified.
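The rank-correlation formula applied in Example 5.6 can be sketched in a few lines of Python (ours, for illustration):

```python
# Spearman's R when ranks are already given (Example 5.6 data)
service = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
efficiency = [2, 3, 5, 1, 9, 10, 11, 12, 8, 7, 6, 4]

n = len(service)
d2 = sum((a - b) ** 2 for a, b in zip(service, efficiency))  # 178
R = 1 - 6 * d2 / (n * (n * n - 1))
print(round(R, 3))  # 0.378
```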

Example 5.7: Ten competitors in a beauty contest are ranked by three judges in the following order:

1st Judge : 1 6 5 10 3 2 4 9 7 8
2nd Judge : 3 5 8 4 7 10 2 1 6 9
3rd Judge : 6 4 9 8 1 2 3 10 5 7

Use the rank correlation coefficient to determine which pair of judges has the nearest approach to common tastes in beauty.

Solution: The pair of judges who have the nearest approach to common tastes in beauty can be chosen in 3C2 = 3 ways, as follows:

(i) Judge 1 and Judge 2, (ii) Judge 2 and Judge 3, and (iii) Judge 3 and Judge 1.

Now let R1, R2 and R3 denote the ranks assigned by the first, second and third judges respectively, and let Rij be the rank correlation coefficient between the ranks assigned by the ith and jth judges, i, j = 1, 2, 3. Let dij = Ri - Rj be the difference of ranks of an individual given by the ith and jth judges.

R1   R2   R3   d12=R1-R2   d13=R1-R3   d23=R2-R3   d12²   d13²   d23²

1 3 6 -2 -5 -3 4 25 9

6 5 4 1 2 1 1 4 1

5 8 9 -3 -4 -1 9 16 1

10 4 8 6 2 -4 36 4 16

3 7 1 -4 2 6 16 4 36

2 10 2 -8 0 8 64 0 64

4 2 3 2 1 -1 4 1 1

9 1 10 8 -1 9 64 1 81

7 6 5 1 2 1 1 4 1

8 9 7 -1 1 2 1 1 4

Total

200 60 214

We have n = 10.

Applying the formula, Spearman's rank correlation coefficients are given by

R12 = 1 - 6Σd12² / (n(n² - 1)) = 1 - (6 × 200)/(10 × 99) = 1 - 1200/990 = -0.2121

R13 = 1 - 6Σd13² / (n(n² - 1)) = 1 - (6 × 60)/(10 × 99) = 1 - 360/990 = 0.6364

R23 = 1 - 6Σd23² / (n(n² - 1)) = 1 - (6 × 214)/(10 × 99) = 1 - 1284/990 = -0.2970

Since the correlation coefficient R13 = 0.6364 is the maximum, the pair of the first and third judges has the nearest approach to common tastes in beauty.

Since R12 and R23 are negative, the pairs of judges (1, 2) and (2, 3) have opposite tastes in beauty.

Case II: When ranks are not given

Spearman's rank correlation coefficient can also be used when we are dealing with variables which are measured quantitatively, i.e., when the pairs of observations in the data set are not ranked as in Case I. In such a situation, we have to assign ranks to the given data. The highest (or the lowest) observation is given rank 1, the next highest (or next lowest) observation is given rank 2, and so on. It is to be noted that the same approach (either descending or ascending) should be followed for all the variables under consideration.

Example 5.8: Calculate Spearman's rank correlation coefficient between advertising cost and sales from the following data:

Advertisement cost ('000 Rs.) : 39 65 62 90 82 75 25 98 36 78
Sales (Rs. lakhs)             : 47 53 58 86 62 68 60 91 51 84

Solution: Let the variable X denote the advertisement cost ('000 Rs.) and the variable Y denote the sales (Rs. lakhs).

Let us now start ranking from the highest value for both the variables as given below:

X    Y    Rank of X (x)   Rank of Y (y)   d = x - y   d²
39   47   8               10              -2          4
65   53   6               8               -2          4
62   58   7               7               0           0
90   86   2               2               0           0
82   62   3               5               -2          4
75   68   5               4               1           1
25   60   10              6               4           16
98   91   1               1               0           0
36   51   9               9               0           0
78   84   4               3               1           1
Total                                     Σd² = 30

Here n = 10.

Therefore, Spearman's rank correlation is

R = 1 - 6Σd² / (n(n² - 1)) = 1 - (6 × 30)/(10 × 99) = 1 - 180/990 = 0.82

The result shows a high degree of positive correlation between Advertising cost and

sales.
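When ranks are not given, as in Example 5.8, they first have to be assigned from the raw values. The sketch below (ours, for illustration) ranks from the highest value, as in the worked solution; the data set has no ties, so no tie handling is needed here:

```python
# Spearman's R when ranks must be assigned first (Example 5.8 data, no ties)
adv   = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]
sales = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]

def ranks_desc(values):
    # rank 1 for the largest value
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

rx, ry = ranks_desc(adv), ranks_desc(sales)
n = len(adv)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))   # 30
R = 1 - 6 * d2 / (n * (n * n - 1))
print(round(R, 2))  # 0.82
```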

Case III: When ranks are equal

While ranking observations in the data set by taking either the highest or the lowest value as rank 1, we may encounter a situation where more than one observation is of equal size. In such a case, the rank assigned to the individual observations is the average of the ranks which these observations would have received had they differed from each other. For example, if two observations are tied at third place, then the average rank (3+4)/2 = 3.5 is assigned to both observations. Similarly, if three observations are tied at third place, then the average rank (3+4+5)/3 = 4 is assigned to all three observations.

When equal ranks are assigned to some observations in the data set, an adjustment is needed in Spearman's rank correlation coefficient formula, as given below:

R = 1 - 6[Σd² + (m1³ - m1)/12 + (m2³ - m2)/12 + ……] / (n(n² - 1))

where mi (i = 1, 2, …) stands for the number of times an observation is repeated in the data set for either variable.

Example 5.9: A financial analyst wanted to find out whether inventory turnover influences a company's earnings per share (in per cent). A random sample of 7 companies listed on a stock exchange was selected and the following data was recorded for each:

Company                              : A  B  C  D  E  F  G
Inventory turnover (number of times) : 4  5  7  8  6  3  5
Earnings per share (%)               : 11 9  13 7  13 8  8

Find the strength of association between inventory turnover and earnings per share. Interpret the findings.

Solution: Let us start ranking from the lowest value of both the variables. Since there are tied ranks, the tied ranks are averaged and the average is assigned to each of the tied observations, as shown below:

Inventory turnover (x)   Rank R1   Earnings per share (y)   Rank R2   d = R1 - R2   d²
4                        2         11                       5         -3            9.00
5                        3.5       9                        4         -0.5          0.25
7                        6         13                       6.5       0.5           0.25
8                        7         7                        1         6.0           36.00
6                        5         13                       6.5       -1.5          2.25
3                        1         8                        2.5       -1.5          2.25
5                        3.5       8                        2.5       1.0           1.00
Total                                                                               Σd² = 51

It may be noted that the value 5 of variable x is repeated twice (m1 = 2) and the values 8 and 13 of variable y are each repeated twice, so m2 = 2 and m3 = 2. Applying the formula,

R = 1 - 6[Σd² + (m1³ - m1)/12 + (m2³ - m2)/12 + (m3³ - m3)/12] / (n(n² - 1))
  = 1 - 6[51 + (2³ - 2)/12 + (2³ - 2)/12 + (2³ - 2)/12] / (7(7² - 1))
  = 1 - 6[51 + 0.5 + 0.5 + 0.5] / 336
  = 1 - 315/336
  = 1 - 0.9375 = 0.0625

The result shows a very weak positive association between inventory turnover and earnings per share.
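The tie-corrected formula of Example 5.9 can be sketched as follows (our illustration; the average-rank helper is a hypothetical implementation, not from the original text):

```python
# Spearman's R with tied ranks (Example 5.9 data)
from collections import Counter

turnover = [4, 5, 7, 8, 6, 3, 5]
eps      = [11, 9, 13, 7, 13, 8, 8]

def ranks_asc(values):
    # average ranks for ties, rank 1 for the smallest value
    order = sorted(values)
    return [(order.index(v) + 1 + order.index(v) + order.count(v)) / 2
            for v in values]

r1, r2 = ranks_asc(turnover), ranks_asc(eps)
n = len(turnover)
d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))   # 51.0

# tie correction: add (m^3 - m)/12 for each group of m tied values
tie_corr = sum((m ** 3 - m) / 12
               for vals in (turnover, eps)
               for m in Counter(vals).values() if m > 1)  # 1.5

R = 1 - 6 * (d2 + tie_corr) / (n * (n * n - 1))
print(round(R, 4))  # 0.0625
```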

5.6 PROBABLE ERROR OF CORRELATION COEFFICIENT

Having determined the value of the correlation coefficient, the next step is to find the extent to which it is dependable. The probable error of the correlation coefficient, usually denoted by P.E.(r), is an old measure of testing the reliability of an observed value of the correlation coefficient, in so far as it depends upon the conditions of random sampling.

If r is the observed correlation coefficient in a sample of n pairs of observations, then its standard error, usually denoted by S.E.(r), is given by

S.E.(r) = (1 - r²)/√n

The probable error of the correlation coefficient is given by

P.E.(r) = 0.6745 × S.E.(r)

We take the factor 0.6745 because in a normal distribution 50% of the observations lie in the range μ ± 0.6745σ, where μ is the mean and σ is the s.d.


Uses of Probable Error:

The important uses of the probable error of the correlation coefficient, P.E.(r), are:

(a) P.E.(r) may be used to determine the limits r ± P.E.(r) within which there is a 50% chance that the correlation coefficients of randomly selected samples from the same population will lie.

(b) P.E.(r) may be used to test whether an observed value of the sample correlation coefficient is significant of any correlation in the population. The rules for testing the significance of the population correlation coefficient are as follows:

(i) If |r| < P.E.(r), then the correlation is not significant.
(ii) If |r| > 6 P.E.(r), then the correlation is significant.
(iii) In other situations nothing can be concluded with certainty.

It should be mentioned that one should use the probable error to test the significance of the population correlation coefficient only when n, the number of pairs of observations, is fairly large. Moreover, the probable error can be applied only under the following conditions:

(a) The data must have been drawn from a normal population.
(b) The observations included in the sample must be drawn randomly.

Example 5.10: The following are the marks obtained by 10 students in Mathematics and Statistics in an examination. Determine Karl Pearson's coefficient of correlation for these two series of marks. Calculate the probable error of this correlation coefficient and examine the reliability (significance) of the correlation coefficient. Also compute the limits within which the population correlation coefficient may be expected to lie.

Marks in Maths      : 45 70 65 30 90 40 50 75 85 60
Marks in Statistics : 35 90 70 40 95 40 60 80 80 50

Solution: Let the marks in Maths and the marks in Statistics be denoted by the variables X and Y respectively. Let us shift both the origin and the scale of the original variables X and Y to obtain the new variables U and V given by

U = (X - 60)/5,  V = (Y - 65)/5  (the scale 5 being a common factor of each of X and Y)

We have r_XY = r_UV. Now we prepare the following table to compute r_UV:

X Y U V U² V² UV

45 35 -3 -6 9 36 18

70 90 2 5 4 25 10

65 70 1 1 1 1 1

30 40 -6 -5 36 25 30

90 95 6 6 36 36 36

40 40 -4 -5 16 25 20

50 60 -2 -1 4 1 2

75 80 3 3 9 9 9

85 80 5 3 25 9 15

60 50 0 -3 0 9 0

Total 2 -2 140 176 141

Now,

r_XY = r_UV = [nΣUV - ΣU·ΣV] / √[(nΣU² - (ΣU)²)(nΣV² - (ΣV)²)]
     = [10 × 141 - 2 × (-2)] / √[(10 × 140 - 2²)(10 × 176 - (-2)²)]
     = 1414 / √(1396 × 1756)
     = 0.9031

Again,

P.E.(r) = 0.6745 × (1 - r²)/√n
        = 0.6745 × (1 - 0.9031²)/√10
        ≈ 0.6745 × 0.19 / 3.1623
        = 0.128155 / 3.1623
        ≈ 0.0405

Reliability of the value of r:

We have r = 0.9031 and 6 × P.E.(r) = 6 × 0.0405 = 0.243. Since the value of r is much higher than 6 × P.E.(r), the value of r is highly significant.

Limits for the population correlation coefficient:

r ± P.E.(r) = 0.9031 ± 0.0405, i.e., 0.8626 and 0.9436

This implies that if we take another sample of size 10 from the same population, then

its correlation coefficient can be expected to lie between 0.8626 and 0.9436.

Note: When we say r is reliable or significant then it usually means that, on

average, students getting good marks in Mathematics also get good marks in

Statistics and students getting poor marks in Mathematics also get poor marks in

Statistics. We must not interpret that all the students getting good (poor) marks in

Mathematics also get good (poor) marks in Statistics. It happens since correlation

indicates an average relationship between two series only and not between the

individual items of the series.
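The probable-error calculation can be sketched directly (our illustration). Note that the worked example rounds 1 - r² to 0.19, which gives 0.0405; carrying full precision gives approximately 0.0393, and the significance conclusion is the same either way:

```python
# Standard error and probable error of r (Example 5.10 figures)
from math import sqrt

r, n = 0.9031, 10
se = (1 - r * r) / sqrt(n)   # standard error of r
pe = 0.6745 * se             # probable error of r

print(round(pe, 4))          # ~0.0393 at full precision
print(r > 6 * pe)            # True: r is highly significant
print(round(r - pe, 4), round(r + pe, 4))  # 50% limits for the population r
```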

5.7 COEFFICIENT OF DETERMINATION

Coefficient of correlation between two variables is a measure of the degree of linear relationship that may exist between them and indicates the amount of variation in one variable which is associated with, or accounted for by, the other variable. A more useful and readily comprehensible measure for this purpose is the coefficient of determination, which indicates the percentage of the total variability of the dependent variable that is accounted for, or explained, by the independent variable. In other words, the coefficient of determination gives the ratio of the explained variance to the total variance. The coefficient of determination is the square of the correlation coefficient, i.e., r². The value of r² lies between 0 and 1. For example, let the two variables, say x and y, be inter-dependent, with variation in x causing variation in y, and let the correlation coefficient between them be, say, 0.9. The coefficient of determination in this situation is 0.9² = 0.81, which implies that 81% of the variation in the dependent variable y is due to variation in the independent variable x, or is explained by the variation in x. The remaining 19% (= 100% - 81%) is due to, or is explained by, some other factors.


The various values of the coefficient of determination r² can be interpreted in the following way:

(i) r² = 0 indicates that no variation in y can be explained by the variable x, which in turn indicates that there exists no association between x and y.

(ii) r² = 1 indicates that the values of y are completely explained by x, which in turn indicates that there exists a perfect association between x and y.

(iii) 0 < r² < 1 reveals the degree of explained variation in y as a result of variation in the values of x. A value of r² closer to 0 shows a low proportion of the variation in y explained by x, while a value of r² closer to 1 shows that the value of x can predict the actual value of y.
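As a minimal illustration (ours, using the r = 0.9 figure from the example above):

```python
# Coefficient of determination from the correlation coefficient
r = 0.9                   # correlation coefficient from the example in the text
r2 = r ** 2               # coefficient of determination
explained = 100 * r2      # % of variation in y explained by x
print(round(r2, 2), round(explained))  # 0.81 81
```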

Example 5.11: Six students of a Management Programme at a certain Institute were selected at random. Their Intelligence Quotient (I.Q.) and the marks obtained by them in the paper in Decision Science (including Statistics) were as follows:

I.Q.                                   : 120 110 130 115 125 120
Marks in Decision Science (out of 100) : 85  80  90  88  92  87

Calculate the coefficient of determination and interpret the result.

Solution: Here, we may consider I.Q. as the independent variable X, and marks in Decision Science as the dependent variable Y. This is because the marks obtained would generally depend on the I.Q. of a student.

Now, we prepare the following table for calculation:

Student   I.Q. (X)   Marks (Y)   dx = X - 100   dy = Y - 80   dx²    dy²   dx·dy
I         120        85          20             5             400    25    100
II        110        80          10             0             100    0     0
III       130        90          30             10            900    100   300
IV        115        88          15             8             225    64    120
V         125        92          25             12            625    144   300
VI        120        87          20             7             400    49    140
Total                            120            42            2650   382   960
Average                          20             7

Now, the coefficient of determination = r², where

r = [Σdxdy - n·d̄x·d̄y] / √[(Σdx² - n·d̄x²)(Σdy² - n·d̄y²)]
  = (960 - 6 × 20 × 7) / √[(2650 - 6 × 20²)(382 - 6 × 7²)]
  = 120 / √(250 × 88)
  = 120 / 148.32
  = 0.809

r² = 0.6545

which implies that 65.45% of the variation in the marks is explained by I.Q. The remaining 34.55% of the variation in the marks could be due to some other factors like preparation for the examination by the students, their mental frame during the examination, etc.

5.8 CORRELATION WITH THE USE OF MS EXCEL

The correlation coefficient (a value between -1 and +1) tells you how strongly two variables are related to each other. We can use the CORREL function or the Analysis ToolPak add-in in Excel to find the correlation coefficient between two variables. A correlation coefficient of +1 indicates a perfect positive correlation: as variable X increases, variable Y increases, and as variable X decreases, variable Y decreases.

CHECK YOUR PROGRESS

Q 4: Under what situation is the rank correlation coefficient used?

Q 5: What is the coefficient of determination? Interpret the meaning of r² = 0.49.


- A correlation coefficient of -1 indicates a perfect negative correlation. As variable X increases, variable Z decreases. As variable X decreases, variable Z increases.

- A correlation coefficient near 0 indicates no correlation.

To use the Analysis ToolPak add-in in Excel to quickly generate correlation coefficients between multiple variables, execute the following steps.

1. On the Data tab, in the Analysis group, click Data Analysis.

Note: if you can't find the Data Analysis button, you first need to load the Analysis ToolPak add-in.


2. Select Correlation and click OK.

3. For example, select the range A1:C6 as the Input Range.

4. Check Labels in first row.

5. Select cell A8 as the Output Range.

6. Click OK.

Result :


Conclusion: Variables A and C are positively correlated (0.91). Variables A and B are not correlated (0.19). Variables B and C are also not correlated (0.11). You can verify these conclusions by looking at the graph.
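The same correlation matrix can be produced outside Excel. The sketch below (ours) uses hypothetical columns A, B and C invented for illustration; the 0.91/0.19/0.11 values above come from the spreadsheet's own data, which is not reproduced in the text:

```python
# A Python analogue of Excel's Correlation tool: pairwise Pearson's r
from math import sqrt

def pearson(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

cols = {"A": [25, 30, 95, 63, 38],   # hypothetical data
        "B": [79, 71, 69, 75, 78],
        "C": [30, 38, 82, 60, 40]}

names = list(cols)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(pearson(cols[a], cols[b]), 2))
```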

5.9 LET US SUM UP

Correlation means the existence of a relationship between variables. When two variables deviate in the same direction we have positive correlation, and when they move in opposite directions we say that there exists negative correlation between the variables. Again, we have learnt that correlation between two variables is said to be perfectly positive when there is a proportional change in the two variables in the same direction. On the other hand, correlation between two variables is said to be perfectly negative when there is a proportional change in the two variables in opposite directions. Again, if the change in one variable has no relation with the change in the other variable, then it is said that there is zero correlation.

We have learnt various methods with the help of which we can ascertain the

existence of relationship between variables. These methods include: (i) Scatter

diagram method, (ii) Karl Pearson’s correlation coefficient method, (iii) Spearman’s

rank correlation coefficient method.

Scatter diagram is a graphic tool to portray the relationship between variables. Karl Pearson's correlation coefficient measures the strength of the linear association between variables, with values near zero indicating a lack of linearity while values near -1 or +1 suggest linearity. Karl Pearson's coefficient of correlation is designated by r.


We have also discussed a very important concept in correlation analysis, termed the coefficient of determination. It is defined as the fraction of the variation in one variable that is explained by the variation in the other variable; in other words, it measures the proportion of variation in the dependent variable y that can be attributed to the independent variable x. It ranges from 0 to 1 and is the square of the coefficient of correlation. Thus a coefficient of determination of 0.82 suggests that 82% of the variation in Y is accounted for by X.

5.10 FURTHER READINGS

1) Srivastava, T.N., & Rego, S. (2008). Statistics for Management. New Delhi: Tata McGraw Hill Education Private Limited.

2) Sharma, J.K. (2007). Business Statistics. New Delhi: Pearson Education Ltd.

3) Hazarika, P.L. (2016). Essential Statistics for Economics and Business Studies. New Delhi: Akansha Publishing House.

4) Lind, D.A., Marshal, W.G., & Wathen, S.A. (2009). Statistical Techniques in Business and Economics. New Delhi: Tata McGraw Hill Education Private Limited.

5) Bajpai, N. (2014). Business Statistics. New Delhi: Pearson Education Ltd.

5.11 ANSWERS TO CHECK YOUR PROGRESS

Ans. to Q No 1: (i) False, (ii) False, (iii) True, (iv) True.

Ans. to Q No 2:

r = Σ(X - X̄)(Y - Ȳ) / √[Σ(X - X̄)² Σ(Y - Ȳ)²] = 43/√(32 × 72) = 43/48 ≈ 0.9

Ans. to Q No 3: We have r = Cov(X, Y)/(σX σY)

0.25 = 3.6/(6 × σY)   (since Var(X) = 36, therefore σX = 6)

σY = 3.6/1.5 = 2.4

Thus S.D.(Y) = 2.4.

Ans. to Q No 4: Rank correlation coefficient is used in a situation in which

quantitative measure of certain qualitative factors such as judgment, leadership,

colour, tastes etc. cannot be fixed, but individual observations can be arranged in a

definite order.


Ans. to Q No 5: Coefficient of determination is a statistical measure of the proportion

of the variation in the dependent variable that is explained by independent variable.

Coefficient of determination 49.02 r or 49% indicates that only 49% of the

variation in the dependent variable y can be accounted for in terms of variable x.The

remaining 51% of the variability may be due to other factors.

Ans. to Q No 6: (i) True, (ii) True, (iii) False, (iv) False, (v) True.

Ans. to Q No 7: (i) True, (ii) False, (iii) True.

5.12 MODEL QUESTIONS

1. What is correlation? Define positive, negative and zero correlation.

2. What is a scatter diagram? Discuss by means of suitable scatter diagrams different

types of correlation that may exist between the variables in bivariate data.

3. What is Karl Pearson's correlation coefficient? How would you interpret the value of a coefficient of correlation?

4. Distinguish between the coefficient of determination and the coefficient of

correlation. How would you interpret the value of a coefficient of determination?

5. What is rank correlation coefficient method? Bring out its usefulness.

6. Write a short note on the probable error of correlation coefficient.

7. Find the coefficient of correlation from the following data:

Cost : 39 65 62 90 82 75 25 98 36 78

Sales : 47 53 58 86 62 68 60 91 51 84

Also interpret the result.

8. From the following data, calculate Karl Pearson's correlation coefficient:

(i) Σ(Xi - X̄)² = 120, Σ(Yi - Ȳ)² = 346, Σ(Xi - X̄)(Yi - Ȳ) = 193, where each sum runs over i = 1, 2, …, 9.

(ii) ΣX = 125, ΣY = 80, n = 10, ΣX² = 1586, ΣY² = 650, ΣXY = 1007.

9. Calculate the coefficient of correlation and its probable error from the following

data:

X: 1 2 3 4 5 6 7 8 9 10

Y: 20 16 14 10 10 9 8 7 6 5


10. Two departmental managers ranked a few trainees according to their perceived

abilities. The ranking are given below:

Trainee A B C D E F G H I J

Manager A 1 9 6 2 5 8 7 3 10 4

Manager B 3 10 8 1 7 5 6 2 9 4

Calculate an appropriate correlation coefficient to measure the consistency in the

ranking.


UNIT 6 : REGRESSION ANALYSIS

Structure

6.0 Learning Objectives

6.1 Introduction

6.2 Regression Analysis

6.3 Correlation vs Regression

6.4 Regression Lines

6.4.1 Determination of Regression Lines of Y on X

6.4.2 Determination of Regression Lines of X on Y

6.4.3 Regression Coefficients

6.5 Standard Error of Estimate

6.6 Regression Analysis with the use of Ms EXCEL

6.7 Let us Sum up

6.8 Further Readings

6.9 Answer to Check your Progress

6.10 Model Questions

6.0 LEARNING OBJECTIVES

After going through this unit you will be able to

• Use regression analysis for estimating the relationship between variables

• Understand how correlation differs from regression

• Use the least squares method to estimate an equation for predicting future values of the dependent variable

6.1 INTRODUCTION

In many business environments, we come across problems or situations where two

variables seem to move in the same direction such as both are increasing or

decreasing. At times an increase in one variable is accompanied by a decline in

another. Thus, if two variables are such that when one changes the other also

changes then the two variables are said to be correlated. The knowledge of such a

relationship is important to make inferences from the relationship between variables

in a given situation. For example, a marketing manager may be interested to

investigate the degree of relationship between advertising expenditure and the sales

volume. The manager would like to know whether the money that he is going to spend on advertising is justified or not in terms of the sales generated.


Correlation analysis is used as a statistical technique to ascertain the association

between two quantitative variables. Usually, in correlation, the relationship between

two variables is expressed by a pure number i.e., a number without having any unit

of measurement. This pure number is referred to as coefficient of correlation or

correlation coefficient which indicates the strength and direction of statistical

relationship between variables.

It may be noted that correlation analysis is one of the most widely employed

statistical devices adopted by applied statisticians and has been used extensively not

only in biological problems but also in agriculture, economics, business and several other fields. Correlation analysis for two variables was introduced in the preceding unit.

The importance of examining the relationship between two or more variables can be

stated in the form of the following questions, which accordingly require statistical devices to arrive at conclusions:

(i) Does there exist an association between two or more variables? If it exists, what is the form and the degree of that relationship?

(ii) Is the relationship strong or significant enough to be useful to arrive at a

desirable conclusion?

(iii) Can the relationship be used for predictions of future events, that is, to forecast

the most likely value of a dependent variable corresponding to the given value

of independent variable or variables?

The first two questions can be answered with the help of correlation analysis while

the final question can be answered by using the regression analysis.

In case of correlation analysis, the data on values of two variables must come from

sampling in pairs, one for each of the two variables.

6.2 REGRESSION ANALYSIS

Correlation analysis deals with exploring the correlation that might exist between

two or more variables and indicates the degree and direction of their association, but

fails to answer the question:

Is there any functional relationship between two variables? If yes, can it be used to

estimate the most likely value of one variable, given the value of the other variable?

Thus the statistical technique that expresses the relationship between two or more

variables in the form of an equation to estimate the value of a variable, based on the

given value of another variable, is called regression analysis. The variable whose

value is estimated using the algebraic equation is called dependent variable and the


variable whose value is used to estimate this value is called independent variable.

The linear algebraic equation used for expressing a dependent variable in terms of

independent variable is called linear regression equation.

In many business situations, it has been observed that decision making is based upon

the understanding of the relationship between two or more variables. For example, a

sales manager might be interested in knowing the impact of advertising on sales.

Here, advertising can be considered as an independent variable and sales can be

considered as the dependent variable. This is an example of simple linear regression

where a single independent variable is used to predict a single numerical dependent

variable.

The meaning of the term regression is “stepping back towards the average.” The

term regression was first introduced by Sir Francis Galton in 1877. His study on the

height of one thousand fathers and sons exhibited an interesting result where he

found that tall fathers tend to have tall sons and short fathers tend to have short sons.

However, the average height of the sons of a group of tall fathers was less than that

of the fathers. Galton concluded that abnormally tall or short parents tend to

“regress” or “step-back” to the average population height.

Advantages of Regression Analysis:

Some of the important advantages of regression analysis are given below:

1. Regression analysis helps in developing a regression equation with the help of

which the value of a dependent variable can be estimated for any given value of

the independent variable.

2. It helps to determine the standard error of estimate, which measures the variability of the values of the dependent variable about the regression line, i.e., how well the line fits the data. When all the points fall on the line, the standard error of estimate becomes zero.

3. When the sample size is large (n ≥ 30), interval estimation for predicting the value of the dependent variable based on the standard error of estimate is considered acceptable. The magnitude of r² remains the same regardless of which of the two variables is treated as the dependent one.

6.3 CORRELATION VS REGRESSION

(a) With the help of correlation one measures the covariation between two variables. In correlation, neither variable can be termed the dependent or the independent variable.

Since correlation does not establish a relationship between the two variables as such

one cannot estimate the value of one variable corresponding to a given value of the

other variable. Regression establishes a functional relationship between two


variables and hence one can estimate the value of one variable corresponding to a

given value of the other variable.

(b) With the help of Correlation analysis, one cannot study which variable is the

cause and which variable is the effect. For example, a high degree of positive

correlation between price and supply does not indicate whether supply is the effect

of price or price is the effect of supply. Regression analysis, in contrast to

correlation, presumes a cause-and-effect relationship between x and y: a change in the value of the independent variable x causes a corresponding change in the value of the dependent variable y, provided all other factors that affect y remain unchanged.

(c) The correlation coefficient between two variables x and y is always symmetric, i.e., r_xy = r_yx. But the regression coefficient is not symmetric in general, i.e., b_xy ≠ b_yx.

(d) The correlation coefficient is independent of the change of both origin and scale; the regression coefficients are independent of the change of origin only, but not of scale.

6.4 REGRESSION LINES

A regression line is the line from which one can get the best estimated value of the dependent variable corresponding to a given value of the independent variable. Thus a regression line is the line of best fit. The term best fit is interpreted in accordance with the Principle of Least Squares, which consists in minimizing the sum of the squares of the residuals or errors of estimates.

In case of two variables x and y we usually have two regression lines because each

variable may usually be treated as the dependent as well as the independent variable.

For example, let us consider two variables, namely price (P) and supply (S). We know that, other conditions remaining the same, if the price of a commodity increases (decreases) the supply of the commodity also increases (decreases). In this case, P is the independent variable and S is the dependent variable. Also, other conditions remaining the same, when the supply of a commodity increases (decreases), its price decreases (increases). In this case supply S is the independent variable and price P is the dependent variable. Thus for two variables x and y we have two regression lines. It is to be noted that

(i) when r_xy = ±1, i.e., when there exists either perfect positive or perfect negative correlation between x and y, the two lines of regression coincide;

(ii) when r_xy = 0, i.e., when x and y are uncorrelated, the two lines of regression are perpendicular to each other.


Thus when we consider the variable x as the independent and the variable y as

dependent then we get the regression equation of y on x and similarly in case of

regression equation of x on y we will have y as the independent variable and x as

the dependent variable. Sometimes, of course, from two correlated variables it is not

possible to obtain both the regression lines. For example, if the variable x denotes

the amount of rainfall in some years and the variable y denotes the production of

paddy in these years then obviously, x can be considered only as the independent

variable whereas y can be considered only as the dependent variable.

The regression line of y on x is that line which gives the best estimated value of y corresponding to a given value of x.

The regression line of x on y is that line which gives the best estimated value of x corresponding to a given value of y.

6.4.1 DETERMINATION OF THE REGRESSION LINE OF y ON x

Let (x₁, y₁), (x₂, y₂), ……, (x_n, y_n) be n pairs of observations on the two variables x and y under study. Let

y = a + bx …………….(11.12)

be the line of regression (best fit) of y on x.

For any given point P_i(x_i, y_i) in the scatter diagram, the error of estimate or residual given by the line of best fit (11.12) is P_iH_i, where H_i is the point on the line directly below (or above) P_i. The x-coordinates of H_i and P_i are the same, viz. x_i, and since H_i lies on the line (11.12), the y-coordinate of H_i, i.e., H_iM, is a + bx_i. Hence the error of estimate for P_i is given by

P_iH_i = P_iM − H_iM = y_i − (a + bx_i)

which is the error for the i-th point. We will have such errors for all the points on the scatter diagram. For the points which lie above the line the error is positive, and for the points which lie below the line the error is negative.

By applying the method of least squares, the unknown constants a and b in (11.12) need to be determined in such a manner that the sum of the squares of the errors of estimates is minimum. In other words, we have to minimize

E = Σ_{i=1}^{n} (P_iH_i)² = Σ_{i=1}^{n} [y_i − (a + bx_i)]²

subject to variations in a and b.

E may also be expressed as

E = Σ(y − y_e)² = Σ[y − (a + bx)]² ……(11.13)

where y_e is the estimated value of y as given by (11.12) for a given value of x, and the summation (Σ) is taken over the n pairs of observations.

Using the principle of maxima and minima in differential calculus, E will have an optimum (maximum or minimum) for variations in a and b if its partial derivatives w.r.t. a and b vanish separately. Hence, from (11.13) we get

∂E/∂a = 0 and ∂E/∂b = 0

i.e., ∂E/∂a = −2·Σ[y − (a + bx)] = 0

and ∂E/∂b = −2·Σx[y − (a + bx)] = 0

On simplifying, we have

Σy = na + bΣx …………(11.14)

and Σxy = aΣx + bΣx² ……….....(11.15)

Equations (11.14) and (11.15) are called normal equations. From the given values of x and y we calculate Σx, Σy, Σx² and Σxy. Putting these values in (11.14) and (11.15) and solving these two equations simultaneously for a and b, we get

b = (nΣxy − Σx·Σy) / (nΣx² − (Σx)²) ………………….(11.16)

and a = ȳ − b·x̄

Now putting these values in equation (11.12) we get the regression equation of y on x, which is given by

y − ȳ = b(x − x̄)

where b is called the regression coefficient of y on x and is generally denoted by the symbol b_yx.

Thus, writing b_yx for b in the above equation, we have

y − ȳ = b_yx(x − x̄) ……………..(11.17)

The regression line of y on x given by (11.17) is to be used to estimate the most probable or average value of y for any given value of x.
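The closed-form solution of the normal equations translates directly into code. Below is a minimal sketch (not from the text) that computes b from formula (11.16) and a = ȳ − b·x̄ for a small made-up data set; the names x, y, a and b mirror the symbols above.

```python
# Least squares fit of the regression line y = a + bx, using the
# closed-form solution (11.16) of the normal equations.
def fit_y_on_x(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope b_yx, eq. (11.16)
    a = sy / n - b * sx / n                        # intercept a = y-bar - b*x-bar
    return a, b

# Illustrative (made-up) noise-free data: y = 2 + 3x exactly.
x = [1, 2, 3, 4, 5]
y = [5, 8, 11, 14, 17]
a, b = fit_y_on_x(x, y)
print(a, b)  # -> 2.0 3.0
```

Because the data are noise-free, the fitted line recovers the intercept and slope exactly; with real data the residuals would be non-zero and E in (11.13) would be positive.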

6.4.2 DETERMINATION OF THE REGRESSION LINE OF x ON y

Let the line of regression of x on y to be estimated for the given data be

x = a + by ……………(11.18)

Applying the least squares method in a similar manner as discussed for the regression line of y on x, we have the following two normal equations for determining the values of a and b:

Σx = na + bΣy …………….(11.19)

Σxy = aΣy + bΣy² ……………..(11.20)

By solving these equations simultaneously for a and b and putting these values in equation (11.18), we obtain the regression line of x on y given by

x − x̄ = b_xy(y − ȳ) ………………(11.21)

where

b_xy = (nΣxy − Σx·Σy) / (nΣy² − (Σy)²) ……………….(11.22)

Equation (11.21) is the regression line of x on y, where b_xy given by equation (11.22) is called the regression coefficient of x on y.

The regression line of y on x given by (11.17) is to be used to estimate the most probable or average value of y for any given value of x; the regression line of x on y given by (11.21) is to be used to estimate the most probable or average value of x for any given value of y.

Remark: (i) When there exists either perfect positive correlation or perfect negative correlation between x and y, then r = ±1, and consequently the regression equation of y on x becomes

(y − ȳ)/σ_y = ±(x − x̄)/σ_x ………………..(*)

On the other hand, in such a situation the regression equation of x on y becomes

(x − x̄)/σ_x = ±(y − ȳ)/σ_y ……………..(**)

From equations (*) and (**) we conclude that we have the same line. Thus if there exists either perfect positive correlation or perfect negative correlation between the two variables, i.e., when r = ±1, the two regression lines coincide.

(ii) The two lines of regression pass through the common point (x̄, ȳ), since this point satisfies both the regression equations.

6.4.3 REGRESSION COEFFICIENT

The regression coefficient of y on x, i.e., b_yx, gives the amount of increase (decrease) in y corresponding to one unit increase (decrease) in x when b_yx is positive. On the other hand, a negative value of b_yx gives the amount of decrease (increase) in y corresponding to a unit increase (decrease) in x.

Similarly, if b_xy is positive, it gives the amount of increase (decrease) in x corresponding to a unit increase (decrease) in y. Again, a negative value of b_xy gives the amount of decrease (increase) in x corresponding to a unit increase (decrease) in y.

Other expressions of Regression coefficients:

The regression coefficient of y on x, i.e., b_yx, can also be expressed as

b_yx = Cov(x, y) / σ_x² ………………(A)

and b_yx = r·σ_y/σ_x ……………….(B)

where r is the correlation coefficient between x and y.

Again, we have

Cov(x, y) = (1/n)·Σ(x − x̄)(y − ȳ) and σ_x² = (1/n)·Σ(x − x̄)²

Hence from (A),

b_yx = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² = (nΣxy − Σx·Σy) / (nΣx² − (Σx)²) ……………..(C)

and we are left with the same expression as obtained in equation (11.16).

Similarly, the regression coefficient of x on y can also be expressed as below. We have

b_xy = Cov(x, y) / σ_y² ……..………(A′)

Also, b_xy = r·σ_x/σ_y …………..(B′)

Again, since σ_y² = (1/n)·Σ(y − ȳ)², from (A′) we get

b_xy = Σ(x − x̄)(y − ȳ) / Σ(y − ȳ)² = (nΣxy − Σx·Σy) / (nΣy² − (Σy)²) ……………(C′)

which is the same expression as obtained in equation (11.22).

Step deviation method for ungrouped data: When the actual means x̄ and ȳ are fractional, the calculation of the regression coefficients can be simplified by taking deviations of the x and y values from their assumed means A and B, respectively. Thus, with d_x = x − A and d_y = y − B, where A and B are the assumed means of the x and y values,

b_yx = (nΣd_x·d_y − Σd_x·Σd_y) / (nΣd_x² − (Σd_x)²) ……….(D)

b_xy = (nΣd_x·d_y − Σd_x·Σd_y) / (nΣd_y² − (Σd_y)²) ……….(D′)
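A quick way to convince yourself of formula (D) is to check numerically that deviations from any assumed means give the same coefficient as deviations from zero (i.e., the raw values). A small sketch; the data and the assumed means 15 and 5 are illustrative only:

```python
def b_yx_step(x, y, A, B):
    # Regression coefficient b_yx from deviations d_x = x - A, d_y = y - B (formula D).
    n = len(x)
    dx = [xi - A for xi in x]
    dy = [yi - B for yi in y]
    sdx, sdy = sum(dx), sum(dy)
    sdxdy = sum(a * b for a, b in zip(dx, dy))
    sdx2 = sum(d * d for d in dx)
    return (n * sdxdy - sdx * sdy) / (n * sdx2 - sdx ** 2)

x = [5, 25, 7, 19, 10]
y = [3.0, 5.0, 3.25, 6.5, 5.5]
# The choice of assumed means does not change the result (independence of origin).
b1 = b_yx_step(x, y, 0, 0)    # deviations from 0, i.e., the raw values
b2 = b_yx_step(x, y, 15, 5)   # deviations from assumed means A = 15, B = 5
print(b1, b2)                  # the two values coincide
```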

Properties of Regression Coefficients:

Property 1: The correlation coefficient is the geometric mean of the regression coefficients.

Proof: We have the regression coefficient of y on x,

b_yx = r·σ_y/σ_x

and similarly the regression coefficient of x on y,

b_xy = r·σ_x/σ_y

Multiplying the two, we get

b_yx·b_xy = r·(σ_y/σ_x)·r·(σ_x/σ_y) = r²

so that r = ±√(b_yx·b_xy)

Thus the correlation coefficient r is the geometric mean of the two regression coefficients b_yx and b_xy.
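Property 1 is easy to verify numerically: compute b_yx, b_xy and r from the same data and check that r² = b_yx·b_xy. A minimal sketch with made-up data:

```python
import math

def coefficients(x, y):
    # Returns b_yx, b_xy and r computed from the deviation sums.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    b_yx = sxy / sxx                # regression coefficient of y on x
    b_xy = sxy / syy                # regression coefficient of x on y
    r = sxy / math.sqrt(sxx * syy)  # correlation coefficient
    return b_yx, b_xy, r

x = [10, 12, 15, 23, 20]
y = [14, 17, 23, 25, 21]
b_yx, b_xy, r = coefficients(x, y)
# r is the geometric mean of the two regression coefficients: r**2 == b_yx * b_xy
print(r ** 2, b_yx * b_xy)
```

The same run also illustrates Property 2: for these data r and both regression coefficients come out positive together.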

Property 2: The correlation coefficient and the two regression coefficients are simultaneously positive or simultaneously negative.

Proof: We have

b_yx = r·σ_y/σ_x and b_xy = r·σ_x/σ_y

A standard deviation, being the square root of a squared quantity, can never be negative. Here we assume σ_x > 0 and σ_y > 0. Therefore the signs of b_yx and b_xy are both the same as the sign of r: when r is positive, both b_yx and b_xy are positive, and when r is negative, both b_yx and b_xy are negative.

Thus we can conclude that when both b_yx and b_xy are positive,

r = +√(b_yx·b_xy)

and when both b_yx and b_xy are negative,

r = −√(b_yx·b_xy)

Property 3: The product of the two regression coefficients cannot exceed unity.

Proof: We have r = ±√(b_yx·b_xy) and −1 ≤ r ≤ 1.

If the product of the two regression coefficients exceeded unity, then √(b_yx·b_xy) would exceed unity, since the square root of a number greater than one is also greater than one. In that case r would be greater than 1 if b_yx and b_xy are positive, and less than −1 if b_yx and b_xy are negative, which is impossible since −1 ≤ r ≤ 1. This indicates that the product of the two regression coefficients cannot exceed unity.

Property 4: The two regression coefficients are independent of the change of origin but are dependent on the change of scale.

Proof: The property states that if we change the origin of the variables then the values of the regression coefficients remain unchanged, but if we change their scale then their values get changed.

Let u and v be the new variables obtained by changing the origin and the scale of the original variables x and y as follows:

u = (x − a)/h, v = (y − b)/k ……………….(i)

where a, b, h and k (with h, k > 0) are constants.

Since the correlation coefficient is independent of the change of origin and scale, we have

r_xy = r_uv ……………….(ii)

Due to the transformation (i), σ_x = h·σ_u and σ_y = k·σ_v.

Now,

b_yx = r_xy·σ_y/σ_x = r_uv·(k·σ_v)/(h·σ_u) = (k/h)·b_vu ………………(iii)

Similarly,

b_xy = r_xy·σ_x/σ_y = r_uv·(h·σ_u)/(k·σ_v) = (h/k)·b_uv ……………………..(iv)

Hence, from equations (iii) and (iv) we conclude that the regression coefficients do not involve the origin constants a and b, and so are independent of the change of origin; but they do involve the scale factors h and k, and so are dependent on the change of scale.

Property 5: The regression coefficient is not symmetric, i.e., in general b_xy ≠ b_yx.

Proof: We have

b_xy = r·σ_x/σ_y and b_yx = r·σ_y/σ_x

We observe that, in general, b_xy ≠ b_yx. The regression coefficients b_yx and b_xy and the correlation coefficient r become equal only when σ_x = σ_y, which usually does not occur.

CHECK YOUR PROGRESS

Q 6: State whether the following statements are true or false:

(i) Regression analysis is a statistical technique that expresses the functional

relationship in the form of an equation.

(ii) Correlation coefficient is the geometric mean of regression coefficients.

(iii) If one of the regression coefficients is greater than one the other must also be

greater than one.

(iv) The product of regression coefficients is always more than one.

(v) If b_xy is negative, then b_yx is negative.

6.5 STANDARD ERROR OF AN ESTIMATE

The regression equations enable us to estimate the value of the dependent variable

for any given value of the independent variable. The estimates so obtained are,

however, not perfect. A measure of the precision of the estimates so obtained from

the regression equations is provided by the Standard Error (S.E.) of the estimate.

While standard deviation of the values of a variable measures the variation or

scatteredness of the values about their arithmetic mean, the standard error of estimate

measures the variation of scatteredness of the points or dots of the scatter diagram

about the regression line. The more closely the dots cluster around the regression

line, the more representative the line is so far as the relationship between the two

variables is concerned and the better is the estimate based on the equation of this

line. If all the dots lie on the regression line then there exists no variation about the

line and, as a result, the correlation between the variables will be perfect.

Thus, the standard error (S.E.) of estimate of y for given x, denoted by S_yx, is defined by

S_yx = √[ Σ(y − ŷ)² / (n − 2) ]

To simplify the calculation of S_yx, the following equivalent formula is used:

S_yx = √[ (Σy² − a·Σy − b·Σxy) / (n − 2) ]

where a and b are respectively the intercept and the slope of the regression line of y on x, which are to be determined by using the method of least squares.

Similarly, the standard error (S.E.) of estimate of x for given y, denoted by S_xy, is defined by

S_xy = √[ Σ(x − x̂)² / (n − 2) ]

To simplify the calculation of S_xy, the following equivalent formula is used:

S_xy = √[ (Σx² − a·Σx − b·Σxy) / (n − 2) ]

where a and b are respectively the intercept and the slope of the regression line of x on y, which are to be determined by using the method of least squares.

Again, much more convenient formulas for numerical computation are given by

S_yx = σ_y·√(1 − r²) and S_xy = σ_x·√(1 − r²)
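The definitional form and the computational shortcut for S_yx are algebraically equivalent, which can be checked in a few lines. A sketch with made-up data (the residual-based definition on one side, Σy² − a·Σy − b·Σxy on the other):

```python
import math

x = [10, 12, 15, 23, 20]
y = [14, 17, 23, 25, 21]
n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)
b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope of y on x
a = sy / n - b * sx / n                        # intercept

# Definition: root mean square of residuals with n - 2 degrees of freedom.
s1 = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
# Computational shortcut used in the text.
s2 = math.sqrt((syy - a * sy - b * sxy) / (n - 2))
print(s1, s2)  # the two values agree
```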

Example 6.1 : A company is introducing a job evaluation scheme in which all jobs

are graded by points for skill, responsibility, and so on. Monthly pay scales (Rs. in

1000’s) are then drawn up according to the number of points allocated and other

factors such as experience and local conditions. To date the company has applied

this scheme to 9 jobs:

Job Points Pay(Rs.)

A 5 3.0

B 25 5.0

C 7 3.25

D 19 6.5

E 10 5.5

F 12 5.6

G 15 6.0

H 28 7.2

I 16 6.1

(a) Find the least squares regression line for linking pay scales to points.

(b) Estimate the monthly pay for a job graded by 20 points.


Solution: We consider monthly pay (Y) as the dependent variable and job grade points (X) as the independent variable. Now, the least squares regression line linking pay scales to points, i.e., the line of regression of Y on X, is given by

Y − Ȳ = b_YX(X − X̄)

Taking deviations d_X = X − 15 and d_Y = Y − 5 from the assumed means A = 15 and B = 5, we prepare the following table for calculation:

Points (X)   d_X    d_X²   Pay (Y)   d_Y      d_Y²    d_X·d_Y
5            −10    100    3.0       −2.00    4.00    20.0
25           10     100    5.0       0.00     0.00    0.0
7            −8     64     3.25      −1.75    3.06    14.0
19           4      16     6.5       1.50     2.25    6.0
10           −5     25     5.5       0.50     0.25    −2.5
12           −3     9      5.6       0.60     0.36    −1.8
15           0      0      6.0       1.00     1.00    0.0
28           13     169    7.2       2.20     4.84    28.6
16           1      1      6.1       1.10     1.21    1.1
Total: 137   2      484    48.15     3.15     16.97   65.4

(a) Here, X̄ = ΣX/n = 137/9 = 15.22 and Ȳ = ΣY/n = 48.15/9 = 5.35. Since the mean values X̄ and Ȳ are non-integer, deviations are taken from the assumed means as done in the above table.

b_YX = (nΣd_X·d_Y − Σd_X·Σd_Y) / (nΣd_X² − (Σd_X)²)
     = (9 × 65.4 − 2 × 3.15) / (9 × 484 − 2²)
     = 582.3 / 4352 = 0.133

Substituting these values of X̄, Ȳ and b_YX in the regression line, we have

Y − 5.35 = 0.133(X − 15.22)
Y = 0.133X + 5.35 − 2.024
Y = 0.133X + 3.326

(b) For job grade point X = 20, the estimated average pay scale is given by

Y = 3.326 + 0.133X = 3.326 + 0.133 × 20 = 5.986

Hence the likely monthly pay for a job with grade points 20 is Rs. 5.986 thousand, i.e., about Rs. 5,986.
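As a cross-check, the fit in Example 6.1 can be reproduced in a few lines (a sketch; the small differences from the text's figures are only rounding, since the text carries b to three decimals):

```python
points = [5, 25, 7, 19, 10, 12, 15, 28, 16]   # job grade points (X)
pay = [3.0, 5.0, 3.25, 6.5, 5.5, 5.6, 6.0, 7.2, 6.1]  # monthly pay (Y)
n = len(points)
sx, sy = sum(points), sum(pay)
sxy = sum(a * b for a, b in zip(points, pay))
sxx = sum(a * a for a in points)

b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # ~0.1338; text rounds to 0.133
a = sy / n - b * sx / n                        # intercept
pred_20 = a + 20 * b                           # ~5.989; text gets 5.986 with b = 0.133
print(round(b, 4), round(pred_20, 3))
```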

Example 6.2: In the estimation of the regression equations of two variables X and Y the following results were obtained:

X̄ = 90, Ȳ = 70, n = 10, Σx² = 6360, Σy² = 2860, Σxy = 3900,

where x = X − X̄ and y = Y − Ȳ. Obtain the two regression equations.

Solution: We have the line of regression of Y on X given by

Y − Ȳ = b_YX(X − X̄) ....................(i)

where

b_YX = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² = Σxy/Σx² = 3900/6360 = 0.6132

From (i) the required regression equation is

Y − 70 = 0.6132(X − 90)
Y = 0.6132X + 70 − 55.188
Y = 0.6132X + 14.812

Similarly, the line of regression of X on Y is given by

X − X̄ = b_XY(Y − Ȳ) ....................(ii)

where

b_XY = Σ(X − X̄)(Y − Ȳ) / Σ(Y − Ȳ)² = Σxy/Σy² = 3900/2860 = 1.3636

From (ii) the required regression equation is

X − 90 = 1.3636(Y − 70)
X = 1.3636Y + 90 − 95.452
X = 1.3636Y − 5.452
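Example 6.2 needs only the given summary figures, so both regression equations can be checked directly (a sketch using the text's numbers):

```python
# Given summary values (x and y are deviations from the respective means).
X_bar, Y_bar, n = 90, 70, 10
sum_x2, sum_y2, sum_xy = 6360, 2860, 3900

b_YX = sum_xy / sum_x2        # ~0.6132
b_XY = sum_xy / sum_y2        # ~1.3636

# Intercepts of Y = b_YX*X + c1 and X = b_XY*Y + c2.
c1 = Y_bar - b_YX * X_bar     # ~14.812
c2 = X_bar - b_XY * Y_bar     # ~-5.452
print(round(b_YX, 4), round(c1, 3), round(b_XY, 4), round(c2, 3))
```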

Example 6.3: The two lines of regression are:

4x − 5y + 30 = 0 and 20x − 9y − 107 = 0.

Which of these is the line of regression of x on y? Determine r_xy and σ_y when σ_x = 3.

Solution: We are given the regression lines

4x − 5y + 30 = 0 ....................(i)
20x − 9y − 107 = 0 ....................(ii)

In order to identify the line of regression of x on y, we apply the property of regression coefficients b_yx·b_xy = r². In the given problem, let (i) be the line of regression of x on y and (ii) be the line of regression of y on x.

From (i), x = (5/4)y − 30/4, so that b_xy = 5/4.
From (ii), y = (20/9)x − 107/9, so that b_yx = 20/9.

Now, r² = b_yx·b_xy = (20/9) × (5/4) = 25/9 = 2.7778. But 0 ≤ r² ≤ 1, therefore our assumption is wrong.

Hence (i) is the line of regression of y on x and (ii) is the line of regression of x on y.

Taking (i) as the line of regression of y on x, we have

4x + 30 = 5y, i.e., y = (4/5)x + 6

so the regression coefficient of y on x is b_yx = 4/5.

Similarly, taking (ii) as the line of regression of x on y, we have

20x = 9y + 107, i.e., x = (9/20)y + 107/20

so the regression coefficient of x on y is b_xy = 9/20.

r² = b_yx·b_xy = (4/5) × (9/20) = 0.36
r = ±0.6
r = 0.6 (since both the regression coefficients are positive, r must be positive.)

Again, we have b_yx = r·σ_y/σ_x, so that

4/5 = 0.6 × σ_y/3 (since σ_x = 3)
σ_y = (4/5) × 3 / 0.6 = 4
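The identification argument in Example 6.3 can be mechanized: read each line both ways and keep the assignment whose product of regression coefficients lies in [0, 1]. A sketch using the example's two lines:

```python
import math

# Lines: 4x - 5y + 30 = 0  and  20x - 9y - 107 = 0.
# Solving each line for the "dependent" variable gives candidate coefficients.
b_yx_i, b_xy_i = 4 / 5, 5 / 4       # line (i) read as y on x, or as x on y
b_yx_ii, b_xy_ii = 20 / 9, 9 / 20   # line (ii) read as y on x, or as x on y

wrong = b_yx_ii * b_xy_i   # (20/9)*(5/4) = 25/9 > 1 -> impossible as r**2
right = b_yx_i * b_xy_ii   # (4/5)*(9/20) = 0.36 -> valid r**2

r = math.sqrt(right)       # +0.6, since both coefficients are positive
sigma_x = 3
sigma_y = b_yx_i * sigma_x / r   # from b_yx = r * sigma_y / sigma_x -> 4.0
print(wrong, right, r, sigma_y)
```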

Example 6.4 : The following data relate to advertising expenditure (Rs. in lakh) and

their corresponding sales (Rs. in crore):

Advertising expenditure: 10 12 15 23 20

Sales : 14 17 23 25 21

(a) Find the equation of the least squares line fitting the data.

(b) Estimate the value of sales corresponding to advertising expenditure of Rs. 30

lakh.

(c) Calculate the standard error of estimate of sales on advertising expenditure.

Solution: (a) Let the advertising expenditure be denoted by x and sales by y. Then we obtain the least squares line of y on x, which is of the form

y − ȳ = b_yx(x − x̄) ....................(i)

where

b_yx = (nΣd_x·d_y − Σd_x·Σd_y) / (nΣd_x² − (Σd_x)²) ……………….(ii)

Taking deviations d_x = x − 16 and d_y = y − 20, we construct the following table for calculation:

Advt. expenditure (x)   d_x    d_x²   Sales (y)   d_y    d_y²   d_x·d_y
10                      −6     36     14          −6     36     36
12                      −4     16     17          −3     9      12
15                      −1     1      23          3      9      −3
23                      7      49     25          5      25     35
20                      4      16     21          1      1      4
Total: 80               0      118    100         0      84     84

x̄ = Σx/n = 80/5 = 16; ȳ = Σy/n = 100/5 = 20

From (ii), b_yx = (5 × 84 − 0)/(5 × 118 − 0) = 420/590 = 0.712

(a) Therefore from (i) the regression equation of y on x is

y − 20 = 0.712(x − 16)
y = 8.608 + 0.712x

which is the required least squares line of sales on advertising expenditure.

(b) The least squares line obtained in part (a) may be applied to estimate the sales turnover corresponding to an advertising expenditure of Rs. 30 lakh:

ŷ = 8.608 + 0.712x = 8.608 + 0.712 × 30 = 29.968, i.e., Rs. 29.968 crore.

(c) The standard error of estimate of sales (y) on advertising expenditure (x), denoted by S_yx, is defined by

S_yx = √[ (Σy² − a·Σy − b·Σxy) / (n − 2) ] …………………….(iii)

Now we make the following table to calculate S_yx:

x     y     y²     xy
10    14    196    140
12    17    289    204
15    23    529    345
23    25    625    575
20    21    441    420
80    100   2080   1684

From (iii),

S_yx = √[ (2080 − 8.608 × 100 − 0.712 × 1684) / (5 − 2) ]
     = √[ (2080 − 860.8 − 1199.008) / 3 ] ≈ 2.594
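Example 6.4 can be checked end to end in a few lines (a sketch; the small differences from the text's 0.712, 29.968 and 2.594 are rounding, since the code keeps full precision for b):

```python
import math

x = [10, 12, 15, 23, 20]   # advertising expenditure (Rs. lakh)
y = [14, 17, 23, 25, 21]   # sales (Rs. crore)
n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)

b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # ~0.7119 (text: 0.712)
a = sy / n - b * sx / n                        # ~8.610 (text: 8.608)
pred_30 = a + b * 30                           # ~29.97 (text: 29.968)
s_yx = math.sqrt((syy - a * sy - b * sxy) / (n - 2))  # ~2.595 (text: 2.594)
print(round(b, 3), round(a, 3), round(pred_30, 2), round(s_yx, 3))
```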

CHECK YOUR PROGRESS

Q 7: State whether the following statements are true or false:

(i) Standard error of estimate is a measure of scatter of the observations about the

regression line.

(ii) The standard error of estimate of y on x, S_yx, is equal to σ_y(1 − r²)

(iii) Smaller the value of yxS , better the line fits the data.

6.6 REGRESSION ANALYSIS WITH THE USE OF MS EXCEL

This example teaches you how to run a linear regression analysis in Excel and how

to interpret the Summary Output. Below you can find our data. The big question is:

is there a relation between Quantity Sold (Output) and Price and Advertising (Input).

In other words: can we predict Quantity Sold if we know Price and Advertising?

1. On the Data tab, in the Analysis group, click Data Analysis.

Note: can't find the Data Analysis button? You first need to load the Analysis ToolPak add-in.


2. Select Regression and click OK.

3. Select the Y Range (A1:A8). This is the response variable (also called the dependent variable).

4. Select the X Range (B1:C8). These are the explanatory variables (also called

independent variables). These columns must be adjacent to each other.

5. Check Labels.

6. Click in the Output Range box and select cell A11.

7. Check Residuals.

8. Click OK.

Excel produces the following Summary Output (rounded to 3 decimal places).

R Square


R Square equals 0.962, which is a very good fit. 96% of the variation in Quantity

Sold is explained by the independent variables Price and Advertising. The closer to

1, the better the regression line (read on) fits the data.

Significance F and P-values

To check if your results are reliable (statistically significant), look at Significance F

(0.001). If this value is less than 0.05, you're OK. If Significance F is greater than

0.05, it's probably better to stop using this set of independent variables. Delete a

variable with a high P-value (greater than 0.05) and rerun the regression until

Significance F drops below 0.05. Most or all P-values should be below 0.05.

In our example this is the case. (0.000, 0.001 and 0.005).

Coefficients

The regression line is: y = Quantity Sold = 8536.214 -835.722 * Price + 0.592 *

Advertising. In other words, for each unit increase in price, Quantity Sold decreases by 835.722 units. For each unit increase in Advertising, Quantity Sold increases by 0.592 units. This is valuable information.

You can also use these coefficients to do a forecast. For example, if price equals $4

and Advertising equals $3000, you might be able to achieve a Quantity Sold of

8536.214 -835.722 * 4 + 0.592 * 3000 = 6970.


Residuals

The residuals show you how far away the actual data points are from the predicted

data points (using the equation). For example, the first data point equals 8500. Using

the equation, the predicted data point equals 8536.214 -835.722 * 2 + 0.592 * 2800 =

8523.009, giving a residual of 8500 - 8523.009 = -23.009.

You can also create a scatter plot of these residuals.
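The arithmetic Excel performs with the fitted coefficients can be reproduced directly. A sketch using the rounded coefficients from the Summary Output above (the exact residual differs slightly from the text's −23.009 because Excel keeps full, unrounded precision):

```python
# Rounded coefficients from the Summary Output above.
intercept, b_price, b_adv = 8536.214, -835.722, 0.592

def quantity_sold(price, advertising):
    # Fitted equation: Quantity Sold = intercept + b_price*Price + b_adv*Advertising.
    return intercept + b_price * price + b_adv * advertising

forecast = quantity_sold(4, 3000)          # ~6969.33; the text rounds to 6970
residual = 8500 - quantity_sold(2, 2800)   # actual minus predicted for the first point
print(round(forecast, 3), round(residual, 3))
```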

6.7 LET US SUM UP

This unit focuses on the process of developing a model, known as a regression model, which is used to predict the value of a dependent variable from at least one independent variable. Here we have discussed only simple linear regression analysis, which involves just two variables. The variable whose value is influenced or is to be predicted is called the dependent variable, and the variable which influences the value or is used for prediction is called the independent


variable. Once the line of regression is developed, by substituting the required

variable values and values of regression coefficient, regressed values, or predicted

values can be obtained.

Having developed a regression model to predict the dependent variable with the help

of independent variable, we need to focus on a few measures of variation. As one of

such measures we have introduced the standard error of estimate (S.E.). It measures the dispersion of the actual values y_i around the regression line. Low values of S.E.

indicate that the points cluster closely about the regression line.

6.8 FURTHER READINGS

Srivastava, T.N., Rego, S. (2008). Statistics for Management. New Delhi.

Tata McGraw Hill Education Private Limited.

Sharma, J.K. (2007). Business Statistics. New Delhi. Pearson Education Ltd.

Hazarika, P.L. (2016). Essential Statistics For Economics And Business

Studies. New Delhi. Akansha Publishing House.

Lind, D.A., Marchal, W.G., Wathen, S.A. (2009). Statistical Techniques in

Business and Economics. New Delhi. Tata McGraw Hill Education Private

Limited.

Bajpai, N. (2014). Business Statistics. New Delhi. Pearson Education Ltd.

6.9 ANSWERS TO CHECK YOUR PROGRESS

Ans. to Q No 1: (i) False, (ii) False, (iii) True, (iv) True.

Ans. to Q No 2:

r = Σ(X − X̄)(Y − Ȳ) / √[ Σ(X − X̄)²·Σ(Y − Ȳ)² ] = 43/√(32 × 72) = 43/48 ≈ 0.9

Ans. to Q No 3: We have r = Cov(X, Y)/(σ_X·σ_Y). Since Var(X) = 36, σ_X = 6. Hence

0.25 = 3.6/(6 × σ_Y)

which gives σ_Y = 2.4. Thus S.D.(Y) = 2.4.

Ans. to Q No 4: Rank correlation coefficient is used in a situation in which

quantitative measure of certain qualitative factors such as judgment, leadership,


colour, tastes etc. cannot be fixed, but individual observations can be arranged in a

definite order.

Ans. to Q No 5: Coefficient of determination is a statistical measure of the proportion of the variation in the dependent variable that is explained by the independent variable.

A coefficient of determination r² = 0.49, or 49%, indicates that only 49% of the variation in the dependent variable y can be accounted for in terms of the variable x. The remaining 51% of the variability may be due to other factors.

Ans. to Q No 6: (i) True, (ii) True, (iii) False, (iv) False, (v) True.

Ans. to Q No 7: (i) True, (ii) False, (iii) True.

6.10 MODEL QUESTIONS

1. Explain the concept of regression and point out its usefulness in dealing with

business problems.

2. What is linear regression? Why are there two regression lines? When do these

become identical?

3. Show that r² = b_xy·b_yx.

4. You are given below the following information about advertisement expenditure

and sales:

Adv. Exp.(x) Sales(y)

(Rs. in crore) (Rs. in crore)

Mean 20 120

Standard deviation 5 25

Correlation coefficient is 0.8.

(a) Obtain both the regression equations.

(b) Find the likely sales when advertisement expenditure is Rs. 25 crore.

(c) What should be the advertisement budget if the company wants to attain sales

target of Rs. 150 crore?

5. A company believes that the number of salespersons employed is a good predictor

of sales. The following table exhibits sales (in thousands Rs.) and the number of

salespersons employed for different years:

Sales (in thousands Rs.):        120 125 118 115 100 130 140 135 130 123

Number of salespersons employed: 10  15  12  18  20  21  22  20  15  19

Obtain a simple regression model to predict sales based on the number of

salespersons employed.

6. The HR manager of a multinational company wants to determine the relationship

between experience and income of employees. The following data are collected

from 14 randomly selected employees.

Employee:                  1  2  3  4  5  6  7  8  9  10 11 12 13 14

Experience (in years):     2  4  5  6  7  8  9  10 12 13 14 15 16 18

Income (in thousands Rs.): 30 40 45 35 50 60 70 65 60 55 75 80 85 75

(a) Develop a regression model to predict income based on the years of experience.

(b) Calculate the coefficient of determination and interpret the result.

(c) Calculate the standard error of estimate.

(d) Predict the income of an employee who has 22 years of experience.
