ETC1000 Topic 2

ETC1000 2013

ETC1000/ETW1000/ETX9000 Business and Economic Statistics

LECTURE NOTES

Topic 2: Understanding What is Happening

1. Relating Variables Together

In the diabetes example in Topic 1 we had a contingency table that suggested not doing enough exercise was related to diabetes. With bivariate data we can do a lot to see how the two characteristics (variables) relate to each other. This can be particularly helpful in policy. For example, if we can get some idea of how exercise can affect the incidence of disease, then the government can come up with some strategies to improve public health and social wellbeing (and its budget).

1.1 Scatter Plots

A SCATTER PLOT is a graph that allows us to see visually how two characteristics (variables) relate to each other. It is generally most appropriate for numerical data or data with some natural ordering.

Suppose we are interested in another major government responsibility: education. In particular, suppose we are interested in whether investing in education improves income. To draw a scatter plot, we begin with some data. The data we need comes in pairs: a data point for years of education and a data point for income. Here’s what the data looks like in our spreadsheet:


We then plot income against education:

Notice that we have put education on the X-axis and income on the Y-axis. In general, the variable or characteristic that we have some control over goes on the X-axis, and the variable we want to influence goes on the Y-axis. You will see later why this is important. We can see from the scatter plot above that someone with 15 years of education earns a lot more than someone with only 9 years of education. As we move along the X-axis, education increases, and so does income. We would say there does seem to be a relationship between education and income, and this relationship is positive. This would suggest that by finishing your degree, you could earn a much higher income. The scatter plot in general is used for exploring questions like:

• Is there a relationship between X and Y?  

[Scatter plot: "X and Y related" – Y plotted against X]


[Scatter plot: "No apparent relationship" – Y plotted against X]

• What direction is the relationship?  

[Scatter plot: "Icecream Demand - Positive Relationship" – demand plotted against Temperature]

[Scatter plot: "Icecream Demand - Negative Relationship" – demand plotted against Price]


• Does the relationship appear to be linear or non-linear?  

[Scatter plot: "Sunscreen Consumption" plotted against Temperature]

[Scatter plot: "Electricity Consumption" plotted against Temperature]

1.2 Covariance

We can quantify how (i.e. in what direction) two variables move together by a summary measure called the COVARIANCE. The covariance of two data sets, X and Y, is given by the formula:

Cov(X,Y) = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / n

Technically speaking, the covariance is the average of the products of paired deviations from the mean in two sets of data. To see what that means, notice that inside the summation operator, the formula involves subtracting the mean of X from each X data point, subtracting the mean of Y from each Y data point, and multiplying each “de-meaned” pair together. To see how this works visually, first consider where the mean of X and the mean of Y lie in a scatter plot of the data:


Drawing lines for the mean of X and the mean of Y, we split the data into 4 quadrants. The values we sum – the de-meaned pairs – will be positive or negative depending on which quadrant the data lies in, and the sign of the covariance will be positive or negative accordingly:

[Quadrant diagram: a vertical line at X̄ and a horizontal line at Ȳ split the scatter plot into 4 quadrants. Above Ȳ, (Yᵢ − Ȳ) is positive; below, it is negative. To the right of X̄, (Xᵢ − X̄) is positive; to the left, it is negative.]


In the scatter plot above, there is clearly a positive relationship between X and Y. For the data points in the lower left quadrant, (Xᵢ − X̄) is negative and (Yᵢ − Ȳ) is negative, so (Xᵢ − X̄)(Yᵢ − Ȳ) will be positive in each case. For the data points in the upper right quadrant, (Xᵢ − X̄) is positive and (Yᵢ − Ȳ) is positive, so (Xᵢ − X̄)(Yᵢ − Ȳ) will again be positive in each case. Summing mostly positive values together gives a covariance that is positive.

Now consider an example with a negative relationship. In this case, the data will mostly lie in the upper left and lower right quadrants. When (Xᵢ − X̄) is negative, (Yᵢ − Ȳ) is mostly positive, so (Xᵢ − X̄)(Yᵢ − Ȳ) will be negative. Likewise, when (Xᵢ − X̄) is positive, (Yᵢ − Ȳ) will mostly be negative, so (Xᵢ − X̄)(Yᵢ − Ȳ) will again be negative. This gives us a negative covariance overall.

[Quadrant diagram for a negative relationship: most of the data lies in the upper left quadrant, where (Xᵢ − X̄) is negative and (Yᵢ − Ȳ) is positive, and the lower right quadrant, where (Xᵢ − X̄) is positive and (Yᵢ − Ȳ) is negative.]


We can calculate the covariance automatically in Excel using:

=COVAR(range of X data, range of Y data)

e.g. =COVAR(B2:B55, C2:C55)

Or we can bring up the dialog box by going to Data Analysis under the Data tab and choosing Covariance.
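As a sketch of what =COVAR computes, here is the population covariance formula written out in Python. The data below is made up purely for illustration – it is not the course spreadsheet:

```python
def population_covariance(x, y):
    # average of the products of paired deviations from the means
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n

# hypothetical paired data: years of education and income ($000s)
education = [9, 12, 15, 16, 21]
income = [30, 42, 55, 60, 85]
print(population_covariance(education, income))  # positive: the variables move together
```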

The formula for the covariance we have seen so far is technically only appropriate if the data comprises the whole population. In practice, the data we have is a sample rather than the population. In this case, we have to estimate the means as well as the covariance, so the formula is slightly different to take into account the extra uncertainty.


The sample covariance of two random variables, X and Y, is given by the formula:

Cov(X,Y) = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1)

Excel does not have an in-built function to calculate the sample covariance, but when n is large, there is practically no difference between the two calculations.

What does the covariance mean? The covariance indicates the direction of the linear association between two variables. Its magnitude, however, has no meaningful interpretation – it is measured in the product of the units of X and Y, much as the variance of a random variable is measured in squared units. The interesting thing we learn from the covariance is the direction of a relationship – do the two variables tend to move in the same direction or in opposite directions?

A positive covariance indicates that when X is big, there is a higher probability of a big Y. Similarly, when X is small there is a higher probability of Y being small as well, i.e. X and Y tend to move together. This is captured with a positive covariance.

e.g. Income and education: when someone has completed a lot of education (e.g. a degree as opposed to completion of VCE), chances are they will earn a higher income. So we’d say education and income have a positive covariance.

e.g. Interest rates and prices of bonds: when interest rates increase, the prices of bonds decrease. Conversely, the prices of bonds increase when interest rates decrease. So we’d say that interest rates and bond prices have a negative covariance.
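The only difference between the two versions of the formula is the divisor: n for the population, n − 1 for a sample. A sketch (with made-up data) showing that the two converge as n grows:

```python
def population_covariance(x, y):
    # divide by n: appropriate when the data is the whole population
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n

def sample_covariance(x, y):
    # divide by n - 1: allows for the extra uncertainty of estimating the means
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# for large n the two calculations are practically identical
x = list(range(1000))
y = [2 * xi + 1 for xi in x]
ratio = sample_covariance(x, y) / population_covariance(x, y)  # n / (n - 1), about 1.001
```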

1.3 Correlation

The problem with using the covariance to measure the relationship between two variables is that we can only interpret the direction of the relationship, not its strength. An alternative measure of relationship is the COEFFICIENT OF CORRELATION. The coefficient of correlation is a standardised measure: its values range from -1 (perfect negative correlation) to +1 (perfect positive correlation). If two variables are perfectly correlated, all the points in a scatter plot could be connected with a straight line.

Perfect negative correlation: this means that when X is big, Y is small, and it is small in a (perfectly) predictable manner.

[Scatter plot: perfect negative correlation – all points lie on a downward-sloping straight line]


No correlation: when two variables are uncorrelated, a change in one variable carries no tendency for the other variable to change.

Perfect positive correlation: this means that when X is big, Y is also predictably big.

In the intervening cases, e.g. a coefficient of correlation of -0.8, we can say that large values of X tend to be paired with small values of Y. The data do not form a complete straight line, though, so the relationship cannot be described as ‘perfect’.

In the case of income and education, our coefficient of correlation is 0.693, which means that large values of X tend to be paired with large values of Y.

[Scatter plots: no correlation, perfect positive correlation, and a correlation of -0.8]


How do we calculate the coefficient of correlation? The formula is:

Corr(X,Y) = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / √[ Σᵢ₌₁ⁿ (Xᵢ − X̄)² · Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² ]
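The formula above can be checked numerically. Here is a minimal Python sketch (with made-up, perfectly linear data, not the course spreadsheet) performing the same calculation as Excel’s =CORREL:

```python
import math

def correlation(x, y):
    # numerator: same as the covariance numerator
    # denominator: standardises away the units of X and Y
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) *
                    sum((yi - y_bar) ** 2 for yi in y))
    return num / den

# perfectly linear data gives a correlation of exactly +1
correlation([1, 2, 3, 4], [5, 7, 9, 11])   # 1.0
```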

The numerator is the same as the numerator of the covariance, but the denominator is different: notice that the denominator effectively standardises the numerator by removing the units of X and Y. In Excel, we can automatically calculate the correlation using the function:

=CORREL(range of X data, range of Y data)

2. Making Causal Connections

While tools such as scatter plots, covariance and correlation can tell us a great deal about the strength and direction of a relationship between two variables, they do not quantify the effect in a way that would allow us to make informed decisions. For example, we may be interested in:

• Predicting how much more your income could be if you complete your degree rather than finish education at the end of secondary school.

• Establishing the effectiveness of advertising expenditure in increasing market share.

• Estimating how production or service delivery costs vary with certain key factors.

• Predicting how sales / turnover will increase if certain inputs are varied.

Consider again the scatter plot of income on education.


We can call income Y, as that is the factor we want to influence, and education X, as that is the variable we can potentially change. Now let’s draw a line roughly through the middle of the dots.

The line represents the average slope of Y with respect to X. The equation for a straight line can be written:

The equation for a straight line can be written:

Y = a + bX (you may have seen it written as Y = mX + c)

where b = dY/dX = rise / run is the slope of the line (the average slope of Y with respect to X). The slope of the line tells us how much Y would change (and in what direction) if X were 1 unit larger. That is, the slope of around 4.047 tells us that if someone were to do 1 more year of education, their income would be $4,047 (4.047*1000) higher.

We can develop a mathematical model which exploits this nice interpretation of the slope. The model defines the relationship between X and Y, and allows us to check whether some relationship exists, and then quantify it. The model is a mathematical function which describes how one variable (Y) changes in response to another (X). The model we will use is called the SIMPLE LINEAR REGRESSION MODEL.


2.1 The Simple Linear Regression Model

We write the Simple Linear Regression Model as:

Yᵢ = β₀ + β₁Xᵢ + εᵢ

The subscript i denotes the ith observation, such as the ith person, the ith country, the ith firm, etc.

Y is the Dependent Variable (the thing you want to influence).
X is the Independent or Explanatory Variable (the thing you can control).
ε (pronounced “epsilon”) is the error – how far the observed value of Yᵢ is from the regression line. We need to include ε in our model because not all data points lie on the line – e.g. education does not entirely determine income.

[Scatter plot: Income ($) against Years of Education, with the fitted line drawn and the errors marked as vertical distances from each point to the line]

Now, the way we have set up the model so far is really only valid when we have data on the whole population of interest (e.g. all Australians, all countries, all firms, etc.). But in reality we never have information on the whole population. Rather, we have data on a sample. But that’s okay. As long as our sample is representative of the population (e.g. a non-representative sample would be one where we select a few firms in a particular industry rather than a representative sample of all industries), we can use the sample data to estimate the true population model. We just make some slight but important notational differences.

The true population model is: Yᵢ = β₀ + β₁Xᵢ + εᵢ

An estimate of the true model using our sample data is: Yᵢ = b₀ + b₁Xᵢ + eᵢ

Now let’s go back to our sample scatter plot.


[Scatter plot: Income ($) against Years of Education, with a candidate fitted line drawn through the points]

2.2 The Concept of Least Squares

We actually could have drawn a number of possible lines through the plot, each with a different intercept and/or slope. That is, each possible line has a different b₀ and/or b₁. Which line is the “best” one? That is, how do we choose b₀ and b₁? We need a criterion for deciding the “best” line. The best line would be the one where the errors are closest to 0. The criterion we use is to minimise the SUM OF SQUARED ERRORS. The aim is to make the errors as small as possible:

error = eᵢ = Yᵢ − b₀ − b₁Xᵢ

But we will have some errors which are positive, and some negative. So we square all the errors, to make them all positive, and then add them up. We then choose the line which minimises the sum of squared errors. This criterion translates into formulas for calculating b₀ and b₁ from a given set of data. We won’t go through this calculation, but it is worth noting the formulas for b₀ and b₁ that result.

First, we have b₀ = Ȳ − b₁X̄. This tells us that the intercept is a function of the mean of Y and the mean of X. From the scatter plot, this makes sense: the intercept will be somewhere in the middle of the range of Y, adjusted by the range that X takes.

Second, we have:

b₁ = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)²


If we were to divide both the numerator and the denominator by n − 1 (roughly the number of observations we have), we would not change the value of b₁. We’d then have:

b₁ = [ Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1) ] / [ Σᵢ₌₁ⁿ (Xᵢ − X̄)² / (n − 1) ] = sample cov(X,Y) / sample var(X)

So the slope is a function of how much and in what direction X and Y vary together, standardised by the overall variation in X. The equation for the regression line, based on sample data, can then be written as:

Ŷᵢ = b₀ + b₁Xᵢ

Notice that the estimated equation for the line has a “hat” on the Y and no error term. We call Ŷ the predicted value of Y, given X. This prediction is essentially an average value of Y for any given X. There is no error term in this prediction because we assume the errors are unpredictable, and on average they cancel out.
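The two least-squares formulas can be sketched directly in Python (a hypothetical illustration with made-up data; this is not Excel’s own routine):

```python
def least_squares(x, y):
    # minimises the sum of squared errors for the line y = b0 + b1*x
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    # slope: covariation of X and Y relative to the variation in X
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar   # the fitted line passes through (x_bar, y_bar)
    return b0, b1

# exactly linear data recovers the line y = 3 + 2x
b0, b1 = least_squares([1, 2, 3, 4], [5, 7, 9, 11])
```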

We can get Excel to compute b0 and b1 for us using Data Analysis under the Data tab and choosing Regression. The Excel output for our income/education model looks like this:

Ŷᵢ = 22596 + 2154.3 Xᵢ


From this we end up with a line as follows:

Ŷᵢ = b₀ + b₁Xᵢ

or

Predicted Incomeᵢ = 22596 + 2154.3 × Years of Educationᵢ

This is a model which we can use to help with policy decision-making.

e.g. “The government decides that the minimum number of years of education an individual should have is 10. What is the salary that an individual with 10 years of education can expect?”

We get: Ŷᵢ = 22596 + 2154.3 × 10 = $44,139

So, the government can expect an individual educated for 10 years to earn $44,139 annually.

e.g. “Stephen has had 15 years of education and earns an annual income of $50,000. Is Stephen earning more or less than an average individual with the same amount of education?”

The model gives us an indication of the average income for an individual with X years of education, so we can look to the model to estimate the average income of an individual with 15 years of education. The model predicts that an individual with 15 years of education should earn an annual income of $54,910.50 on average:

Ŷᵢ = 22596 + 2154.3 × 15 = $54,910.50

Since Stephen’s income is lower than this, Stephen is earning less than an average individual with the same amount of education, according to this model.
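Predictions like these are just the estimated line evaluated at a chosen X. A small sketch using the coefficients from the Excel output above:

```python
def predicted_income(years_of_education):
    # Y-hat = 22596 + 2154.3 * X, the estimated regression line from the notes
    return 22596 + 2154.3 * years_of_education

predicted_income(15)   # about 54910.5: average income at 15 years of education
```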



Before going too far on the uses of these models, we need to look more closely at what it all means, and how we can evaluate the Excel output we have obtained.

2.3 Interpreting the Model

We want to understand what the model we have estimated tells us about the nature of the relationship between X and Y. More specifically, the actual estimates b₀ and b₁ are informative.

b₀ is known as the intercept – the estimated value of Y when X = 0.
b₁ is the slope of Y with respect to X – the estimated change in Y for a 1 unit change in X.

e.g. In the above example, we found:

Predicted Incomeᵢ = 22596 + 2154.3 × Years of Educationᵢ

We have b₀ = 22596. This means that an individual who has zero years of education can expect an annual income of $22,596. This is an example of a case where b₀ doesn’t have a meaningful interpretation: it does not really make sense to speak of an individual with zero years of education. This often happens with the intercept. Whenever interpreting b₀, we need to consider whether the interpretation is sensible.

b₁ = 2154.3 is an estimate of the effect of education on income: it tells us how much income would change if education were 1 year higher. In particular, we can say: “take 2 people, one of whom has 1 more year of education than the other. The person with 1 more year of education can expect to earn, on average, $2,154.30 more per annum than the person with less education.” Being able to specify quantities like this can be very important in policy and planning. In this case, it suggests that if a number of people could be encouraged to stay at school one year longer, they could earn more income (and the government could receive more tax income!).

2.4 Is There Really a Relationship between X and Y?

Usually a regression model is used to consider a factor that affects Y. If you are working as an analyst, it will probably be up to you to decide which factor is important in explaining or causing Y to change. You may have a good idea about what influences Y, but how do you know that X really does cause Y?

e.g. We think money spent on training schemes for the unemployed (X) should reduce the unemployment rate (Y). But does it?

Recall β₁ is the slope of the line relating Y and X. If there were no relationship between X and Y, they would not co-vary together (i.e. the covariance would be 0), and the slope of the line would be 0:


[Scatter plot: Unemployment Rate (%) against Training Expenditure ($m)]

From this scatter plot, there doesn’t seem to be an obvious relationship. Here’s Excel’s regression output:

We see the estimated coefficient on Training Expenditure is very small (-0.002). Now, b₁ is a continuous numerical variable – it could take any value, depending on how many decimal places you go to. Chances are that because we have a sample, we will not get an estimate of β₁ = 0 exactly. So how big does b₁ need to be before we can say the true slope is nonzero? We can answer this with a HYPOTHESIS TEST of whether the true slope, β₁, is actually zero. How do we do this test? There are 4 steps:

1. Formulate “Null” and “Alternative” Hypotheses

H₀: β₁ = 0 i.e. the null is that there is NO relationship between X and Y.
H₁: β₁ ≠ 0 i.e. the alternative is that there IS a relationship between X and Y: the true slope is not 0.


2. Decide a “Significance Level”

This is a small value we choose ourselves, denoted α. Usually we choose α = 0.05, or 5%.

3. Calculate the p-value

The p-value is a probability – we will go through the meaning and theory underlying p-values in Topic 3. For now, it is a value we calculate to determine whether b₁ is close enough to 0. Excel calculates the p-value for us in its regression output. From the output above, the p-value for our estimate of b₁ is 0.007272.

4. Make a Decision

To make a decision, we compare the p-value for b₁ to our chosen significance level α. The decision rule is to reject H₀: β₁ = 0 if the p-value < α. That is, if the p-value is smaller than α, we conclude that there IS a relationship between X and Y. Conversely, if the p-value is bigger than α, we conclude that there is NOT sufficient evidence of a relationship between X and Y.

In the above example, since 0.007272 < 0.05, we reject H₀ and conclude that the amount of money spent on training schemes DOES affect the unemployment rate. In fact, with b₁ negative, we could say that higher training expenditure reduces the unemployment rate. Who would have thought this by looking at the scatter plot!
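Step 4 boils down to one comparison. A tiny sketch of the decision rule (the p-value itself comes from Excel’s regression output):

```python
def reject_null(p_value, alpha=0.05):
    # reject H0: beta1 = 0 when the p-value falls below the significance level
    return p_value < alpha

reject_null(0.007272)   # True -> conclude there IS a relationship
reject_null(0.40)       # False -> insufficient evidence of a relationship
```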

So, what next? If you conclude the slope is nonzero, then you can go ahead with the associated policy and planning. But if you conclude the slope is zero, you may need to think about trying another factor that influences Y. Of course, finding no relationship between X and Y can often be useful for policy and planning too!

We could also have done a similar test on β₀ – then we’d be testing whether the intercept is 0 or not. However, the implication is that if we conclude β₀ = 0, we should drop the intercept. This is generally not a good idea, as dropping the intercept effectively forces the regression line through the origin, which may change the slope of the line inappropriately.

Let’s do another example. A well-known model in finance, called the market model, assumes that the rate of return on a share (R) is linearly related to the rate of return on the overall market (M). The mathematical description of the model is:

Rᵢ = β₀ + β₁Mᵢ + eᵢ

For practical purposes, M is taken to be the rate of return on some major stock market index, such as the Australian All Ordinaries Index. The coefficient β₁, well known as the share’s “beta-coefficient”, measures how sensitive the share’s rate of return is to changes in the level of the overall market. For example, if β₁ > 1, the share’s rate of return is more sensitive to changes in the level of the overall market than is the average share. Conversely, β₁ < 1 suggests it is less sensitive.
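The beta-coefficient interpretation can be written as a small helper (purely illustrative; the classification labels are our own wording):

```python
def beta_sensitivity(beta1):
    # classify a share's sensitivity to the overall market via its beta
    if beta1 > 1:
        return "more sensitive than the average share"
    if beta1 < 1:
        return "less sensitive than the average share"
    return "moves one-for-one with the market"

beta_sensitivity(0.843)   # the ANZ estimate below: less sensitive than average
```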


The scatter plot below shows the return of a particular share – ANZ – against the market (All Ordinaries) return. From the scatter plot alone, a positive linear relationship seems to exist.

[Scatter plot: "ANZ Share Returns against Market Returns" – ANZ share returns plotted against market (All Ordinaries) returns, showing a positive linear relationship]

The estimation output is given below:

SUMMARY OUTPUT

Regression Statistics
  Multiple R           0.500805
  R Square             0.250806
  Adjusted R Square    0.241557
  Standard Error       0.043341
  Observations         83

ANOVA
                df    SS         MS         F         Significance F
  Regression     1    0.050937   0.050937   27.116    1.42E-06
  Residual      81    0.152156   0.001878
  Total         82    0.203093

                      Coefficients   Standard Error   t Stat    P-value    Lower 95%    Upper 95%
  Intercept           0.005039       0.005053         0.9972    0.3216     -0.005015    0.015094
  M (Market Return)   0.843433       0.161971         5.2073    1.42E-06   0.521162     1.165704

Estimated equation:

R̂ᵢ = 0.0050 + 0.843 Mᵢ

Interpretation of coefficients:

b₀ = 0.0050: when the market return is zero, one would expect the ANZ share return to be 0.0050, or 0.5%.

b₁ = 0.843: consider a share in ANZ on two particular trading days. On the second day, the All Ordinaries return was 1% higher than it was on the first day. On the second day, the ANZ share would be expected to earn a return 0.843% higher than it would on the first day.


Hypothesis test:

1. Formulate “Null” and “Alternative” Hypotheses

H₀: β₁ = 0 Market returns have no impact on ANZ share returns.
H₁: β₁ ≠ 0 Market returns have an impact on ANZ share returns.

2. Decide a “Significance Level”

Test at the 5% level of significance, i.e. α = 0.05.

3. Calculate the p-value

p-value = 1.42 × 10⁻⁶

4. Make a Decision

The decision rule is to reject H₀: β₁ = 0 if the p-value < α. Since 1.42 × 10⁻⁶ < 0.05, we reject H₀ and conclude that the return on the market does affect the return on ANZ shares.

N.B. Wouldn’t it be nice if we could test specific financial and economic theories, like whether the ANZ share return moves one-for-one with the overall market return (i.e. H₀: β₁ = 1)?

2.5 Evaluating the Model

We have come up with a simple model to explain the behaviour of Y. It assumes that X is the key factor in explaining Y, and that the relationship is linear. We have shown how the model can be used to aid understanding, policy and decision-making. An important next step, though, is to evaluate how good the model is. There is no point coming up with a model and making predictions from it if it is lousy at explaining the relationship between X and Y: any conclusions we draw from it are likely to be misleading and unhelpful. So how do we evaluate our model? There are three things we can use.

(1) R²

R² is closely related to the correlation of X and Y. In fact, R² equals the square of the sample correlation coefficient:

R² = [Corr(X,Y)]²

In the Excel output you will see a quantity called R Square.


This will be a value between 0 and 1. It measures the proportion of variation in Y that the model has been able to explain. So, a value of R2 close to 1 indicates that the model has been able to explain a large proportion of variation in Y, and hence is a very good model. A value of R2 close to zero indicates a poor model – not much of Y has been explained.

Let’s look a little more closely at what R² measures.

[Diagram: a scatter plot with the fitted line and the mean line Ȳ, decomposing each point’s deviation from the mean into an explained part and an error:]

SST = Σ (Yᵢ − Ȳ)²   (Total Sum of Squares)
SSE = Σ (Yᵢ − Ŷᵢ)²   (Sum of Squared Errors)
SSR = Σ (Ŷᵢ − Ȳ)²   (Regression Sum of Squares)

The distance of each observed data point, Yᵢ, from the fitted line is the error, so if we sum the squares of all these errors we get the SSE, or Sum of Squared Errors:

eᵢ = Yᵢ − b₀ − b₁Xᵢ = Yᵢ − Ŷᵢ

SSE = Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)²

The distance of Yᵢ from the mean line contributes to SST, the Total Sum of Squares – the total variation in Y from its mean. The other distance, from the fitted value Ŷᵢ on the line to the mean, is the part of Y’s behaviour that the model has been able to explain. So if we square and add these for all data points, we get SSR, the Regression Sum of Squares.


R² is defined as:

R² = SSR / SST = Σ (Ŷᵢ − Ȳ)² / Σ (Yᵢ − Ȳ)²

Hence the interpretation as the proportion of variation in Y explained by the model.

e.g. In the above output, we have an R² of 0.0358. So we can say that 3.58% of the variation in unemployment rates can be explained by variation in training expenditure. This is a very poor result, but not unexpected, as there are a lot of other factors that affect the unemployment rate, such as education, economic conditions, etc.
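The SSR/SST definition can be sketched directly (illustrative only; valid as a proportion for least-squares fits with an intercept, where SST = SSR + SSE):

```python
def r_squared(y, y_hat):
    # R^2 = SSR / SST: explained variation relative to total variation in Y
    y_bar = sum(y) / len(y)
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)   # regression sum of squares
    sst = sum((yi - y_bar) ** 2 for yi in y)       # total sum of squares
    return ssr / sst

r_squared([5, 7, 9, 11], [5, 7, 9, 11])   # perfect fit: 1.0
```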

(2) Standard Error

The STANDARD ERROR provides another measure of how good our model is. It is essentially the standard deviation of the error term in the model. Positive errors (where the model predicts a Y value smaller than what actually happened) cancel out with the negative errors, so the average error is zero. This means the error standard deviation is just the square root of the average of the squared errors. Or more loosely, the standard error gives us an estimate of the magnitude of the typical error produced by the model. This is the same intuitive interpretation we gave to the standard deviation in Topic 1.

e.g. In the unemployment rate / training expenditure case, we have a standard error of 1.685. So we say that, on average, the model’s predictions of Y (the unemployment rate) are in error by 1.685 percentage points, either above or below the actual value of Y. This can be very useful. For example, if we were using the model to predict how the government’s pledge to increase training for the unemployed might improve the overall unemployment rate, we can gauge from the standard error that our prediction of the unemployment rate will be out, on average, by 1.685 percentage points either way.

Another way to get a handle on the standard error is to compare it with the actual mean of Y and/or with the kinds of values Y takes. In Australia, the unemployment rate fluctuates around 6%, not usually going below 4% or above 8%. So a model that predicts the unemployment rate with an “average error” of 1.685 percentage points is not particularly accurate.
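Loosely the standard error is the root of the average squared error; strictly, the figure Excel reports for a simple regression divides SSE by n − 2 (one intercept and one slope estimated). A sketch with made-up data:

```python
def standard_error(y, y_hat):
    # square root of SSE / (n - 2), as reported for a simple regression
    n = len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return (sse / (n - 2)) ** 0.5

# one prediction off by 1 unit out of four observations
standard_error([1, 2, 3, 4], [1, 2, 3, 5])
```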

(3) Error / Residual Plots

The aim of a model is to explain patterns in Y. Sometimes variables just fluctuate randomly, in a totally unpredictable way; no model can expect to explain this. But what we do hope is that a good model will at least have explained the main patterns in Y. A tool for evaluating whether the model has succeeded in this is the ERROR PLOT, or RESIDUAL PLOT (residual = error). We would hope that the errors do not contain an obvious pattern. If they do, then there is something wrong with or missing from the model – there is some pattern in the data that the model should have picked up. What we should ideally see in the errors is only the random, unpredictable part left over after the model has taken care of all the patterns. Excel can produce residual plots as part of the regression function. At this stage, we look to the residual plots for two things:

• Evidence that the use of a linear model may not have been appropriate.
• Evidence that there may be an important variable left out of our model.

Let’s take a look at some examples. In the graph below, the errors seem to be randomly distributed – no particular pattern. Looks good.

[Residual plot: residuals against X, scattered randomly with no particular pattern]

The second graph, below, suggests either a variable has been left out, or more likely we have fitted a linear model when the relationship is not linear.

[Residual plot: residuals against X, showing a clear U-shaped pattern]

In fact, the above residual plot came from the following scatter plot and fitted line:

[Scatter plot: "Observed and Predicted Y" – the observed Y values and the fitted straight line plotted against X, where the data clearly follow a curve]


Clearly Y and X are not related in a linear way – some kind of curve would fit much better. The linear model gives errors which are all negative for X between 25 and 70, and all positive for X outside this range. A linear model is clearly not appropriate for this data. The third graph, below, has a clear cyclical pattern to it. This is clear evidence of some other variable being important in causing Y.

[Figure: residual plot – residuals (roughly -3 to 2) plotted against X (0 to 80), showing a clear cyclical, wave-like pattern.]

What do we do if we find evidence in our residual plots of a problem with the model? Nothing much at this stage – we need more skills – see the next section. But it’s important to be aware of the problems right from the beginning, so that we draw conclusions from the model with a little more caution.

3. The Multiple Regression Model

We want to generalise the simple regression model because we believe that there are likely to be several factors causing variation in Y. It is unrealistic to restrict ourselves to a model which allows for just one factor. In fact, sometimes considering just one factor can give us a misleading impression about what is relevant.

e.g. Suppose we are seeking to explain differences in infant mortality rates across countries (number of infants who died before the age of 1, per 1000 live births). There are three possible causal factors: average income levels (Real per capita GDP), average education levels (secondary school enrolment ratio), and number of TV sets per capita.

Here are the results we obtain.


Model 1: Y = Infant Mortality Rate. X = Number of TV sets per capita.

Model 2: Y = Infant Mortality Rate. X = Real GDP per capita.


Model 3: Y = Infant Mortality Rate. X = Secondary School Enrolment Rate.

All three models look plausible, and suggest a relationship between the variables: p-values on the X variables are all small, R2 values are reasonably big, and the β1 estimates have the right sign (infant mortality rates fall as income rises, as education rises, and as the number of TV sets rises). There is a hint that the most important of the three variables could be the education variable – it has a much bigger R2. But even better, it is conceivable that all three of these variables are important in explaining the infant mortality rate.

Here’s what happens when we estimate a MULTIPLE LINEAR REGRESSION MODEL – a model with more than one X variable. It’s easy to do in Excel – put all three X variables together in 3 adjacent columns (getting rid of rows where there are blanks / missing data). Then choose Data Analysis under the Data tab, then Regression, and select these columns in the Input X Range part of the dialog box.
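For those curious what Excel is doing under the hood, here is a rough sketch in Python (with synthetic data, not the actual country dataset) of the same least-squares computation for a model with three X variables:

```python
import numpy as np

# Sketch of a multiple regression fit - the same computation behind Excel's
# Data Analysis > Regression - on made-up data with three X variables.
rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 3))                 # columns play the role of X1, X2, X3
y = 5 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.0 * X[:, 2] + rng.normal(0, 0.5, n)

# Add a column of ones for the intercept, then solve the least-squares problem.
A = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(A, y, rcond=None)   # b = [b0, b1, b2, b3]

print(np.round(b, 2))  # estimates should be close to the true values 5, 2, -1.5, 0
```

The estimated coefficients recover the values used to generate the data (up to sampling noise), just as the regression output in Excel reports b0, b1, b2 and b3.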


The story now is very different. We now have four β coefficients to interpret and test: the constant / intercept, and the coefficients of each of the three variables. Note that all the coefficients have the same sign as before – all negative, as expected. But the actual coefficients are all smaller, especially for TV sets and GDP. Increasing the number of TV sets or the income level has a much smaller effect on infant mortality rates than the simple regression results suggested. In fact, the p-values for these two variables are 0.0816 and 0.6634 – bigger than α = 0.05, so we would not reject the null hypothesis that these coefficients are zero.

This is not surprising. When we just do simple regression, a factor like GDP can show up as significant. But really, it is not important. It is the education level that really matters in reducing infant mortality rates. GDP appears significant in the simple regression because countries with higher GDP usually have higher education levels, so GDP is capturing the effect of higher education, and thus appears to be an important factor. When the model includes both GDP and education, the reality shows up – variations in GDP cannot explain variations in infant mortality rates once differences in education level are taken into account.

Let’s go into the ideas of multiple regression a little more systematically.

3.1 The Model and Interpretation

Here’s the mathematical representation of the multiple regression model:

Yi = β0 + β1X1i + β2X2i + … + βkXki + ei

assuming we have k possible X variables explaining Y – in the infant mortality example, k = 3. The β’s represent the population parameters – they have a similar interpretation as in the simple regression model, but with a small but important difference.

β0 = intercept = the average value of Y when ALL the X variables equal zero.

N.B. Usually we don’t worry about interpreting β0, as it often doesn’t make a lot of sense for all the X’s to be zero (e.g. what sense is there in considering what the mortality rate would be for a country with no TV sets, zero income, and no people in secondary school?).

β1 = the change in Y for a 1 unit change in X1, holding all other X variables constant.

In a simple regression of Y on X1, β1 tells us about the total contribution of X1 to explaining Y. In multiple regression, β1 tells us about the contribution of X1 to explaining Y after taking account of the possible impact of the other variables on Y.

The same interpretation applies for β2, β3, etc.


Return to the infant mortality rate example. Using our sample, we came up with estimates of these β’s – the bj’s. These could be interpreted as follows:

b1 = -45.2359. If there were two countries which had the same GDP per capita and the same secondary school enrolment rate, but one country had 1 more TV set per capita than the other, then the country with more TVs per capita would have an infant mortality rate 45.2 deaths per 1000 live births lower than the country with fewer TVs per capita.

b2 = -0.00034. If there were two countries which had the same number of TV sets per capita and the same secondary school enrolment rate, but one country’s GDP was US$1 more than the other, then the country with the higher GDP would have an infant mortality rate 0.00034 deaths per 1000 live births lower than the country with lower GDP.

b3 = -0.62023. If there were two countries which had the same GDP per capita and the same number of TV sets per capita, but one country’s secondary school enrolment rate was 1% higher than the other, then the country with the higher rate would have an infant mortality rate 0.62 deaths per 1000 live births lower than the country with the lower enrolment rate.

N.B. Sometimes we want to vary our interpretations to take into account the sorts of values the different X variables take. For example, the data on TV sets per capita range in value from 0.0 to 0.71 (the USA). No country has more than 1 TV set per person! So to talk of differences in the number of TV sets per capita of ONE is totally unrealistic. It’s probably better to talk, say, of differences of 0.1 – this might represent the difference between Uruguay (0.2) and Taiwan (0.3), for example. If b1 = -45.2, then a 1 unit difference leads to a 45.2 unit difference in Y. So our interpretation would be more relevant if we said:

If there were two countries which had the same GDP per capita and the same secondary school enrolment rate, but one country had 0.1 more TV sets per capita than the other, then the country with more TVs per capita would have an infant mortality rate 4.52 deaths per 1000 live births lower than the country with fewer TVs per capita.

From a policy angle, we can use these differences in mortality rates to predict what might happen if we were able to change one of the X variables. For example, if a particular country could introduce a policy aimed at increasing the secondary school enrolment rate by 1%, then our model suggests that that country could reduce its infant mortality rate by 0.62 deaths per 1000 live births.

3.2 Hypothesis Testing

Using more than one X variable in our model does not change the way we do hypothesis tests. What we did in testing β1 in simple regression can be extended to testing β2 and β3, etc. Each coefficient has its own p-value which can be compared to the chosen significance level. Let’s go through how we’d do the test for the infant mortality example. Here’s the relevant part of the output.


Let’s look first at whether the TV sets variable affects Y. Our null and alternative hypotheses would be:

H0: β1 = 0 TV sets per capita has no impact on infant mortality rate
H1: β1 ≠ 0 TV sets per capita does have an impact on infant mortality rate

From the output, the p-value is 0.082. If we choose a significance level of 5%, or 0.05, then since the p-value exceeds 0.05, we do not reject H0. That is, we conclude that there is not sufficient evidence to support the view that the number of TV sets per capita can help explain infant mortality rates, once the other variables have been taken into account.

Note in this case that the decision is pretty close: had we chosen a significance level of 10%, or 0.1, then we would have rejected H0 and concluded that the number of TV sets is important in explaining infant mortality. This is where we need to be careful not to be too “black and white”. The choice of a 5% significance level was subjective, and could have made a difference to our conclusion. Given all this uncertainty and subjectivity, a reasonable conclusion might be “TV sets per capita might be an important explanatory variable, but the evidence for this is not conclusive”.

In the case of Real GDP and school enrolments, we have:

H0: β2 = 0 Real GDP per capita has no impact on infant mortality rate
H1: β2 ≠ 0 Real GDP per capita has an impact on infant mortality rate

From the output, the p-value is 0.664. If we choose a significance level of 5%, or 0.05, then since the p-value exceeds 0.05, we do not reject H0. That is, we conclude that there is not sufficient evidence to support the view that real GDP per capita can help explain infant mortality rates. In this case, the decision is clear-cut: the p-value is huge compared to the significance level. There is virtually no evidence suggesting GDP is an important variable, once the other variables have been taken into account.
For secondary school enrolment rates:

H0: β3 = 0 Secondary school enrolment rates have no impact on infant mortality rate
H1: β3 ≠ 0 Secondary school enrolment rates have an impact on infant mortality rate (i.e. higher schooling, lower mortality)

From the output, the p-value is 0.0000006. If we choose a significance level of 5%, or 0.05, then since the p-value is less than 0.05, we reject H0. That is, we conclude that there is sufficient evidence to support the view that secondary school enrolment rates help explain infant mortality rates. In this case, the decision is also clear-cut: the p-value is tiny compared to the significance level. There is quite convincing evidence suggesting schooling is an important variable, once the other variables have been taken into account.
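Excel reports these p-values for us, but the logic can be sketched in a few lines. The numbers below are illustrative only (a hypothetical coefficient and standard error chosen to give a test statistic near the TV-sets case), and a normal approximation stands in for the exact t distribution, which is reasonable for large samples:

```python
import math

def two_sided_p_value(b, se):
    """Two-sided p-value for H0: beta = 0, using a normal approximation
    (a reasonable stand-in for the t distribution in large samples)."""
    z = b / se
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical numbers: a coefficient of -45.2 with a standard error of 25.6
# gives a test statistic of about -1.77 and a p-value of roughly 0.08.
p = two_sided_p_value(-45.2, 25.6)
print(p < 0.05, p < 0.10)  # not significant at 5%, but significant at 10%
```

This is exactly the borderline situation described for the TV-sets variable: the conclusion flips depending on the (subjective) choice of significance level.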


Take note of the slight but important difference in how we interpret the results of the hypothesis test – “once the other variables have been taken into account”. If we conclude that X1 is not important, but X2 is important, we would say that X1 is not important once X2 is taken into account. That’s exactly the case we had in this infant mortality example – at first, when we did regressions one variable at a time, it seemed that all three variables were important in explaining the infant mortality rate. But, for example, countries which have more TV sets per capita and higher GDP tend to have higher school enrolment rates. When we put these variables in separately, they seemed significant, but only because they were acting as a proxy for the effect of higher school enrolment rates. So when we put all three in together, it was clear that the only strong determinant of infant mortality was education.

3.3 Assessing the Overall Model

In simple regression we used three things to help us decide if we have a decent model: R2, the standard error, and residual analysis. All of these tools are available to us in multiple regression. The relevant output for our infant mortality example is:

(1) R2

As with simple regression, R2 measures the proportion of variation in Y explained by the fitted equation – that is, the proportion explained by X1, X2, …, Xk. In this case, we have R2 = 68.2%, which is reasonable. Note that in the simple regression model with just the school enrolments variable, we got an R2 of 66.3%, so the addition of the two extra variables has done little to improve the model. This fits with the fact that the tests discussed above suggested neither of the other variables was a convincingly significant explanatory variable. We’d say that 68.2% of the variation in infant mortality rates is explained by the variation in TVs, GDP and the secondary school enrolment rate.

N.B. One important point about R2: we can show mathematically that R2 will always increase if we add more explanatory variables to a model. Thus it is not a good idea to use R2 as a guide to whether one model is better than another, especially if one has more explanatory variables than the other. If you are considering adding another variable, then using a hypothesis test to decide if its coefficient is significantly different from zero is a much better guide.
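The claim that R2 never falls when a variable is added is easy to demonstrate numerically. The sketch below (Python, made-up data) adds a column of pure noise – a variable with no real relationship to Y – and compares the two R2 values:

```python
import numpy as np

def r_squared(X, y):
    """R^2 from a least-squares fit of y on the columns of X (plus an intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ b
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
y = 3 + 2 * x1 + rng.normal(size=200)
junk = rng.normal(size=200)          # a variable with no real relation to y

r2_small = r_squared(x1.reshape(-1, 1), y)
r2_big = r_squared(np.column_stack([x1, junk]), y)
print(r2_big >= r2_small)  # prints: True - adding any variable never lowers R^2
```

Even though `junk` explains nothing real, R2 creeps up slightly – which is exactly why a hypothesis test, not R2, should decide whether the extra variable earns its place.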


(2) Standard Error

The understanding of the standard error is the same as for simple regression. The standard error is the standard deviation of the error term in the model – if our additional X variables are helpful in explaining Y, then the error term should be smaller on average, because more of Y is able to be explained. In the case of our example, we get a standard error of 21.34 deaths per 1000 live births. That is, loosely speaking our model will provide predictions of the infant mortality rate which are in error by an average of 21.34 deaths. This is reasonable but not brilliant. Infant mortality rates range from 3.7 for Japan to 135 for Mozambique, so to have a model that on average will make an error of around 20 is something: the model can put us in the general ball park of the kinds of infant mortality rate a country should have, given its level of TV ownership and income and education levels, but it is not able to predict outcomes in a very precise manner. There are clearly still several other important factors explaining differences in infant mortality rates.

(3) Residual Analysis.

When we plotted the residuals in simple regression, we plotted them against the X variable. With multiple regression, there is more than one X variable. So we need to do a number of different plots, with each different X variable. If you have a large number of X variables, this would be very tedious! So we tend not to use this tool in most cases.

4. Regression with Different Types of X Variables

4.1 When the Relationship between X and Y is Not Linear

So far we have limited ourselves to modelling the relationship between the X variables and Y with a linear relationship. This is somewhat necessary for practical reasons, but it would be nice to be able to look at some alternatives to the linear relationship. At this stage we can consider only one specific type of “non-linear” model, namely those where the data can be transformed so that the relationship becomes linear. This then allows us to estimate linear relationships in the transformed variables. More complex options are explored in further econometrics subjects! Let’s look at our infant mortality example. If we do a scatter plot of each X variable against the dependent variable, here’s how they look:


[Figure: scatter plot of infant mortality rate (0 to 150) against TV sets per capita (0 to 0.8) – a clearly curved, non-linear relationship.]

It is obvious here that the variables do not have a linear relationship with infant mortality rates. We want a way of capturing this non-linear relationship. One possibility is to add a quadratic term for each Xj variable to the model. It is conceivable that a quadratic curve may provide a reasonably accurate fit to the data as shown in these scatter plots. The model we want to estimate would now be:

Yi = β0 + β1X1i + β2X2i + β3X3i + β4X1i² + β5X2i² + β6X3i² + ei

This allows for each of the X variables to influence Y in a non-linear (quadratic) way. To estimate this model in Excel, we just need to add more columns to our data sheet, where all the values of each of the three X variables are squared. Here’s how a part of the data sheet would look, showing the formulas:

Of course, we can enter these formulas by typing in just one, and getting the rest by clicking and dragging.

[Figure: scatter plot of infant mortality rate against real GDP per capita (0 to 25,000) – again clearly non-linear.]

[Figure: scatter plot of infant mortality rate against secondary school enrolments (0 to 200) – again clearly non-linear.]
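The squaring step is easy to mimic outside Excel too. Here is a sketch in Python with made-up curved data: the quadratic model is just a linear regression with an extra column holding X squared, and it fits far better than the straight line when the true relationship is a curve.

```python
import numpy as np

# Sketch of the "add squared columns" step, on made-up data with a genuinely
# quadratic relationship between X and Y.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 150)
y = 50 - 8 * x + 0.5 * x ** 2 + rng.normal(0, 1, 150)

# Linear model vs quadratic model (the extra column holds x squared).
A_lin = np.column_stack([np.ones(x.size), x])
A_quad = np.column_stack([np.ones(x.size), x, x ** 2])
b_lin, *_ = np.linalg.lstsq(A_lin, y, rcond=None)
b_quad, *_ = np.linalg.lstsq(A_quad, y, rcond=None)

def sse(A, b):
    """Sum of squared errors for a fitted model."""
    return np.sum((y - A @ b) ** 2)

print(sse(A_quad, b_quad) < sse(A_lin, b_lin))  # prints: True - the quadratic fits far better
```

The estimated coefficient on the squared column recovers the 0.5 used to generate the data, which is the analogue of the significant squared terms in the regression output discussed next.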


Now, we simply do the regression analysis (Data Analysis under Data tab, Regression) with the X data range to include these extra columns (i.e. the X range is C1:H99). Here’s how the output looks:

Examining the p-values for each of the X variables, we see that the venture into quadratic models has been successful. School Enrolment Squared is clearly significant, and GDP Squared is also significant. More interestingly, the GDP variable is now also very significant (p-value = 0.006). In the purely linear model we had concluded that GDP was not relevant. But now when we include it in quadratic form, we find that it is an important variable.

On the other hand, TV sets is possibly not an appropriate variable to include: the p-value for the linear term is 0.053 (almost significant), and for the quadratic term it is 0.108. Both of these values are in that indecisive range – not small enough to be absolutely clear that this is a relevant variable, but not big enough to lead us to discard the variable outright.

In general, the p-values on the quadratic terms in the model tell us whether we were justified in using a quadratic model instead of a linear model. If none of the quadratic terms had small p-values (i.e. none were significantly different from zero), then this would suggest that the linear model was adequate and there is no need to consider a quadratic model.

Notice the improvement in R2 for this quadratic model: we had an R2 of 68.2% with the purely linear model. This has now risen to 83.6% – quite a substantial improvement.

The final model we might consider is the quadratic model without either TV sets or TV sets squared. After all, we don’t really believe more TV sets make people healthier, and the evidence is not entirely convincing that the variable should be there.


Here’s how the model output looks now:

All the variables are highly significant, so we might settle on this as our preferred model. How do we interpret the coefficients in this quadratic model? Do we even know if they have the right sign? This is nowhere near as easy to answer as in the straight linear model. The easiest way to do it is to come up with predictions of Y for a range of values of each of the X variables, then observe how varying the X’s affects our predictions. For example, the table below shows how varying Real GDP per capita affects the model’s predictions for Y, given a School Enrolment of 60%:

Real GDP p.c.     2000      4000      6000      8000      10000     12000     14000
School enrolment  60        60        60        60        60        60        60
Predicted Y       43.0454   29.57807  18.91118  11.04474  5.978734  3.713169  4.248044

[Figure: predicted infant mortality rate (Predicted Y, 0 to 50) plotted against real GDP per capita (2000 to 14000) – the curve falls steeply at first and then flattens out.]
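The flattening of the curve can be read straight off the prediction table: each extra $2,000 of GDP buys a smaller improvement in the predicted mortality rate. A few lines of Python make the successive changes explicit (the values are copied from the table above):

```python
# Predicted infant mortality rates from the table (school enrolment fixed at 60%),
# for real GDP per capita of $2,000 up to $14,000 in $2,000 steps.
pred_y = [43.0454, 29.57807, 18.91118, 11.04474, 5.978734, 3.713169, 4.248044]

# Change in predicted mortality for each extra $2,000 of GDP per capita.
changes = [b - a for a, b in zip(pred_y, pred_y[1:])]
print([round(c, 1) for c in changes])
# prints: [-13.5, -10.7, -7.9, -5.1, -2.3, 0.5]
```

Each successive step is a smaller improvement, and by $14,000 the fitted quadratic actually turns up slightly – a reminder not to extrapolate a quadratic fit too far beyond the bulk of the data.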


Increases in GDP of $2000 reduce infant mortality by around 10-14 deaths per 1000 live births at the low-income end (up to $6000), but the improvements in infant mortality rates are not as substantial if income continues to increase. This is not surprising – for a very poor country moving to middle income, we would expect rapid improvements in many social welfare outcomes. But as that country moves on towards being a high-income country, there is less room for improvement, and the kinds of improvements that are needed cost substantially more, so improvements are much slower in coming.

4.2 Categorical Explanatory Variables

4.2.1 Variables with Two Categories

So far we have concentrated on X variables that are numerical. However, there are occasions when an explanatory factor is a categorical variable. For example, you may be interested in whether there is gender discrimination in income. You have data like the following:

Gender is qualitative: individuals are male or female. So how can we include characteristics like this in our model? We invent an extra variable, known as a DUMMY variable, which takes only the values 0 and 1, depending on which category each observation (person) belongs to. Your spreadsheet would look like this:


The column “MALE” is what we call a dummy variable. A value of 1 indicates that the person is male, and a 0 indicates the person is not male (i.e. female). We use the MALE column as X in our regression:

The main difference with using dummy variables is in the interpretation of the “slope” coefficient. What does 12628 mean for the MALE dummy variable? It means that males are estimated to earn $12,628 more than females on average. How did we make this interpretation? Recall the β1 coefficient tells us how much Y would change if X were 1 unit higher. With a dummy variable, we only have two values: 0 for female and 1 for male, so increasing X by 1 unit means changing X = 0 to X = 1. The β1 coefficient thus tells us what the difference in Y would be if the person were male rather than female. In particular, we can say “take 2 people who are identical in every way, except one is female and the other is male. The person who is male can expect to earn, on average, $12,628 more per annum than the person who is female.” So b1 can give us an idea of whether there is income discrimination purely on the basis of gender.

What about the intercept? Recall β0 tells us what Y would be, on average, if X = 0. When X = 0, the person is female, so β0 tells us what Y would be, on average, for females. In this case, we could say “the model predicts that females earn, on average, $56,672 per annum”. It follows, then, that males are estimated to earn, on average, $56,672 + $12,628 = $69,300 per annum. That is, b0 = 56672 is the estimated average income of females, and b1 = 12628 is the estimated male-female difference.


i.e. Essentially we have 2 different intercepts: one for males and one for females. That is, we have a model with just an intercept for females:

Ŷi = 56,672

and a model with a different intercept for males:

Ŷi = 56,672 + 12,628 × 1 = 69,300

i.e. This model is the same as the first, except the intercept is bigger, by 12,628. It also follows that if there is no difference in income between males and females, β1 = 0.

N.B. If we were to regress Y on just a constant, there would be no X and b0 would just be equal to Ȳ, the sample mean of Y. In tutorials you will test this out using data in Excel, and in Topic 3 we will build on this concept.

To help us think more about this, consider a simple model with only an intercept and one X variable, a 0-1 dummy variable. Recall the formulae for the estimated coefficients for a simple regression:

b0 = Ȳ − b1X̄

b1 = Σi (Xi − X̄)(Yi − Ȳ) / Σi (Xi − X̄)²

If X takes only the values 0 and 1, then X̄ is the proportion of observations that take the value 1, e.g. the proportion of males in the sample.

[Figure: income for females and males shown as two levels – β0 for females and β0 + β1 for males, differing by β1.]


With a fair bit of manipulation, it can be shown that:

b0 = ȲX=0

b1 = ȲX=1 − ȲX=0

where ȲX=0 is the average value of Y for observations where X = 0 and ȲX=1 is the average value of Y for observations where X = 1. So the intercept is the mean of Y for females, and b1 is the difference between the mean of Y for males and the mean of Y for females. So we can interpret b1 as the average difference in income between males and females.

4.2.2 Including Other X Variables

The same concept holds when we include other X variables in the model: we just add the phrase “holding all other variables constant” to our interpretations. Here is an example. We now include an additional X variable in our data – Years of Education. The spreadsheet would look like this:
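This result is easy to verify numerically. The sketch below (Python, with made-up income figures) regresses income on a 0/1 MALE dummy and confirms that the intercept equals the female sample mean and the slope equals the male-female difference in means:

```python
import numpy as np

# Made-up data: 60 females (MALE = 0) and 40 males (MALE = 1).
rng = np.random.default_rng(4)
male = np.repeat([0, 1], [60, 40])
income = np.where(male == 1, 69300, 56672) + rng.normal(0, 5000, 100)

# Regress income on an intercept and the MALE dummy.
A = np.column_stack([np.ones(100), male])
(b0, b1), *_ = np.linalg.lstsq(A, income, rcond=None)

female_mean = income[male == 0].mean()
male_mean = income[male == 1].mean()
print(np.isclose(b0, female_mean), np.isclose(b1, male_mean - female_mean))
# prints: True True
```

With a single 0/1 dummy, the regression is just a comparison of the two group means dressed up in regression clothing.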

Therefore, our Y-variable is income and our X-variables are MALE and Years of Education.


We now have:

b0 = 23396: The model predicts that females with no education earn, on average, $23,396 per annum. As before, it is not really plausible to speak about someone having no years of education.

b1 = 4598: On average, males earn $4,598 more per annum than females, holding years of education constant. That is, take 2 people with the same number of years of education, but one is female and the other male: the male would be expected to earn around $4,598 more per annum than the female.

b2 = 2013: Take 2 people of the same gender, but one has 1 more year of education than the other. The person with the higher education would be expected to earn, on average, $2,013 per annum more than the person with less education.

With a continuous X variable in our model, we are effectively estimating a model of income on education with two intercepts, depending on whether the person is male or female. That is, for females the model becomes:

Ŷi = 23,396 + 2,013X2i

While for males:

Ŷi = 23,396 + 4,598 × 1 + 2,013X2i = 27,994 + 2,013X2i


i.e. By including another factor in the model, we again see the importance of multiple regression over simple regression. In the simple regression case, where we just included gender, there was strong evidence of gender discrimination – the p-value for MALE was 0.001. When we take into account education, however, the p-value increases to 0.13. Clearly, once we take education into account, there is no longer any evidence of gender discrimination. What was driving the apparent gender bias was the fact that, in this particular sample, females on average have lower education than males, and with education strongly related to earnings, the gender dummy picked up a lot of the effect of education on income.

This kind of distinction is important – if we had gone with the first model, the appropriate policy would be to tackle gender discrimination in the workforce. But with the second model, we can see that policy action would be more effective in the education sector – females are earning less because they have lower levels of education in general, not because of gender discrimination by employers. The more pertinent policy issue is why females seem to have lower education levels than men.

4.2.3 Multiple Categories

Now suppose we have a categorical variable with more than 2 categories.

e.g. We have 3 broad mutually exclusive and exhaustive occupation types – managerial, clerical and labour. What would we do in this case? We would create a set of dummy variables as follows:

[Figure: income against years of education – two parallel lines with common slope β2, intercept β0 for females and β0 + β1 for males.]


We then include 2 of the occupational category columns along with the other X variables in our regression. Why only 2 categories? We need to leave one category out to refer to as our base category (note that in the gender example, we only included the MALE category, not MALE and FEMALE). If you try to put all 3 in, Excel will come up with an error – you’ll find out the technical reason why in second year!
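As a sketch of the dummy-building step (Python, with a handful of made-up records), we create one 0/1 column per non-base category and leave labour out as the base:

```python
# Made-up occupation records; "Labour" is chosen as the base category.
occupations = ["Managerial", "Clerical", "Labour", "Clerical", "Managerial"]

clerical = [1 if occ == "Clerical" else 0 for occ in occupations]
managerial = [1 if occ == "Managerial" else 0 for occ in occupations]

# Only these two columns enter the regression; a labourer is a row where both
# dummies are zero. A third "Labour" column would duplicate the information
# already in the intercept, which is why Excel reports an error if you add it.
print(clerical)    # prints: [0, 1, 0, 1, 0]
print(managerial)  # prints: [1, 0, 0, 0, 1]
```

Each person belongs to exactly one category, so the row of dummies identifies the occupation completely once the base category is understood.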

Here, our “base” person is female and works as a labourer. Our interpretations then become:

b0 = 8872: This is the case where MALE = 0, Education = 0, Clerical = 0 and Managerial = 0, i.e. the individual is female, with no education, and works as a labourer. The model predicts that a person with these characteristics would earn $8,872 per annum on average.


b1 = 5593: The coefficient on the MALE dummy tells us the difference in income between males and females, holding the other X variables constant. That is, if we were to take 2 people with the same level of education who work in the same occupation, except that one was male and the other female, the male would be expected to earn, on average, $5,593 more than the female.

b2 = 2618: The coefficient on Years of Education tells us how education affects income. If there were 2 people of the same gender and same occupational type, but one had 1 more year of education than the other, the person with higher education is estimated to earn $2,618 more than the person with lower education.

b3 = 2085: The coefficient on the Clerical dummy tells us the difference in income between clerical occupations and labour occupations, holding the other X variables constant. That is, if there were 2 people of the same gender and same level of education, but one worked as a clerk and the other worked as a labourer, the clerk is estimated to earn $2,085 more than the labourer, on average.

b4 = 20625: The coefficient on the Managerial dummy tells us the difference in income between managerial occupations and labour occupations, holding the other X variables constant. That is, if there were 2 people of the same gender and same level of education, but one worked as a manager and the other worked as a labourer, the manager is estimated to earn $20,625 more than the labourer, on average.

[Figure: predicted income against years of education (0 to 20) – six parallel lines, one for each combination of gender (male/female) and occupation (labour/clerical/managerial).]


5. Putting It All Together and Exploring What Might Happen Next

So far we have developed a mathematical model which allows us to identify the factors that contribute to variations in Y. We can use that model to make predictions about what Y would be if we were to change a key factor. This is a powerful tool for all kinds of problems in business and economics. Let’s go through an example to see how a regression model can be used to shape government policy.

The state and commonwealth governments have begun to include spending on mental health care in their health priorities. Critics argue, however, that, given its high prevalence and heavy disease burden, the amount allocated to mental health care falls far below what is needed in Australia. It is claimed that, amongst a number of other things, those who are mentally ill tend to earn less than those without mental illness.

Using our knowledge of Topic 2, we could investigate these claims. We could model income ($’000) as a function of a bunch of characteristics like education (categories: primary (base), secondary and tertiary), age, age2 and gender (male=1/female=0), as well as a dummy variable indicating whether or not the person has been diagnosed with a mental illness. Using data from a recent health survey, we obtain the following regression output:

What do these results tell us? There are a number of factors that do not appear to affect income:

• Age does not appear to affect income in either a linear or non-linear (squared) way, once other factors are controlled for.

• Secondary education is no different from primary education in terms of income, once gender, age and mental health status are taken into account.


There are also a number of factors that do seem to affect income:

• Males are estimated to earn $9,423 more per annum than females of the same age, education and mental health status. This would suggest there is gender discrimination in income.

• Individuals who received tertiary education tend to earn $15,237 more per annum than individuals who received only primary education. This seems to be a sensible result, as we would expect more highly educated people to earn higher incomes, and the magnitude looks plausible.

• An individual with mental illness earns, on average, $17,555 less per annum than an otherwise identical individual without mental illness. This result goes in favour of those who argue that current government spending is inadequate.

The important point to note from the output is that those with mental illness do seem to earn significantly less than those without mental illness, even after we take into account education, age and gender. In fact, if we imagine two people who are identical in terms of gender, age and education, except that one has a mental illness, then the person with mental illness would be expected to earn, on average, $17,555 per annum less than the person without mental illness.

So far this is nothing new – the results do suggest that people with mental illness experience poorer income outcomes than those without mental illness, and the government could do well to allocate more funds towards mental health services. But we can also use this information to do more than this. We can use it to quantify how much mental illness is costing Australia, in terms of individual lost earnings (and government tax revenue), and therefore how much the government should pledge towards mental health care services.

Health professionals estimate that around 12% of the Australian working-age population suffers from mental illness. With a working-age population of 12 million people, that would mean 1,440,000 people suffer from mental illness in Australia.

$17,555 per person per annum × 1,440,000 people = $25,279,200,000 per annum

We estimate that the annual "cost" of mental illness in Australia, in terms of lost income, is around $25.3 billion. Current spending on mental health care is less than one tenth of this amount, which suggests there is a need for the government to consider sizeable increases to its spending on mental health.

6. Regression When the Data is Recorded Over Time

So far the examples that we have looked at involved observations at the level of the individual, country, firm, household etc. That is, we've used a subscript "i" for our variables. Regression can also be used when the data is observed over time.
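This back-of-the-envelope calculation can be sketched in a few lines of code (Python, for illustration only; the figures are those quoted above):

```python
# Estimated annual income penalty per person with mental illness,
# taken from the regression output in the notes.
income_gap = 17_555                 # dollars per person per annum

# Prevalence figures quoted in the notes.
working_age_population = 12_000_000
prevalence = 0.12

affected = round(working_age_population * prevalence)
total_cost = income_gap * affected

print(affected)      # number of people affected
print(total_cost)    # annual "cost" of mental illness in lost income ($)
```

Dividing the result by 1e9 gives the figure of roughly $25.3 billion per annum quoted in the text.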
For example, we could look at factors that affect the Australian inflation rate over time. In this case, we tend to use a “t” subscript:

Yt = β0 + β1Xt + εt

or, with several explanatory variables:

Yt = β0 + β1X1t + β2X2t + … + βkXkt + εt


Data over time is called a TIME SERIES: it is a series of observations on some variable of interest over a sequence of time periods. Most areas of business and economics would record or maintain time series data – e.g. production, unemployment, sales, inflation, interest rates, prices, inventory etc. The data would look something like this:

The spreadsheet tells us that in 1981, total revenue for this company was $1,622.8 million. A helpful way of representing this data is with a simple LINE GRAPH – the data has a natural ordering, so a line graph is most appropriate.

[Figure: line graph of Revenue ($ million), 1982–1998, vertical axis from 0 to 2,500]

Why do we want to model a time series? Probably the most common reason is that we need to be able to forecast what is going to happen to this series in the future. For example, this series might represent revenue of some company, and knowing what is likely to happen to revenue in the future is important to being able to anticipate share price movements and hence to plan investment strategies.


There are many approaches to forecasting the future. One is to use one's intuition to make an educated guess about what might happen. Sometimes this works well, but often it works badly – it all depends on the experiences and biases of the person making the forecast. A much more reliable approach is to study the past values of the series and look for patterns. If we can safely assume that the patterns will persist into the future, then we can make use of the patterns we have identified to provide predictions for the future. For example, you may find the following pattern in the past data: "sales seem to grow from one year to the next; typically they grow by about 5% per year". This provides some guidance for future forecasts: you would forecast ongoing growth at around 5% per year. Modelling time series is about observing and defining / measuring the patterns that are present in the data. It is the building block for successful forecasting.

6.1 Components of a Time Series

Data that is recorded over time has some special features or components. These features can be seen in the graph of mobile phone sales from Topic 1.

[Figure: line graph of monthly mobile phone sales, Jan-96 to Sep-06, vertical axis from 0 to 350]

The features to note in this graph are:

Trend: The trend is a persistent, long-term upward or downward pattern of movement. The duration of a trend is usually several years. The source of such a trend might be gradual and ongoing changes in technology, population, wealth, etc.

Cycle: The cycle is a pattern of up-and-down swings that tend to repeat every 2-10 years. There are periods of expansion, leading to peaks, then contractions leading to troughs, with the cycle then repeating itself.

Seasonal: A seasonal pattern is a regular pattern of fluctuations that occur within each year, and tend to repeat year after year.

Irregular: This component represents whatever is "left over" after identifying the other three systematic components. It represents the random, unpredictable fluctuations in the data. There is no pattern to the irregular component.


We will now look briefly at how to model the Trend and Seasonal components of a time series. We will ignore Cycle – that's for next year! And of course, the Irregular component can't be modelled, by definition – there is no pattern to the irregular component.

6.2 Modelling the Trend

The aim in trend analysis is to fit a simple model that captures the long-term movement in a series. It is essential to medium or long term forecasting. Consider the following example, which shows annual revenue for the Coca-Cola company over a 25-year period.

[Figure: Annual Revenues at Coca-Cola (US$ billion), 1975–1999, vertical axis from 0 to 25, showing steady growth]

From the graph there is a general upward trend, with revenue growing steadily over the sample period. There are a range of possible trend models that can be applied to a set of data, but in this course we will focus on only one: the linear trend model. This model can be understood as a linear relationship between the actual series (Yt) and the time sequence variable “time” (t). For the Coca-Cola example, the data and the time variable is given as follows:


The linear trend model is given by the equation:

Yt = β0 + β1t + et

In this model, we are assuming that there is reasonably steady growth in Y each time period. Since t represents years in this case, the model implies that Y grows, on average, by the amount β1 per period. Here's the model for the above Coca-Cola data:

[Figure: Annual Revenues at Coca-Cola (US$ billion), 1975–1999, with fitted linear trendline y = 0.7382x + 0.316, R² = 0.9349]

To get the equation and R2 printed with the chart, we have to choose Options in the Add Trendline dialog box, and tick the Display Equation on chart and Display R2 value on chart options, then return and select the linear trend model. To estimate the β coefficients, we choose Data Analysis under the Data tab, Regression, and choose the column of Y data for the Y range, and the column of observations on time (t) for the X range. If the data were laid out as per the adjacent sheet, the regression dialog box would look like this:


Here’s how the Excel output would look in this case:

Notice that the coefficients and the R² value are the same as those from the trendline option in the chart output. How do we interpret the estimates of β0 and β1?

b0 = 0.316. This is the fitted value of Y when X = 0; since X is time in this case, and t = 1 in 1975, then t = 0 corresponds to 1974. So we say that the model predicts a trend value for revenue of $316 million in 1974.

b1 = 0.738. This is the change in Y for a 1-unit change in X. In this case, X changes by one unit each year. So we say that the model estimates the average growth in revenue to be $738 million per year.
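A quick sketch of using the fitted trend equation (Python, for illustration only; the coefficients are those from the output above):

```python
# Fitted linear trend from the Coca-Cola example: Yhat = 0.316 + 0.738*t,
# with t = 1 in 1975, so t = 25 corresponds to 1999.
b0, b1 = 0.316, 0.738

def trend(t):
    """Trend value of revenue (US$ billion) in period t."""
    return b0 + b1 * t

print(round(trend(1), 3))    # trend value for 1975
print(round(trend(25), 3))   # trend value for 1999
```

The second value, about $18.8 billion, is the trend value of revenue for the final year of the sample.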

6.3 Modelling Seasonal Patterns

Let's consider another time series example. This time we will look at quarterly sales of Wal-Mart, a huge chain of department stores which started in the USA but has since spread world-wide. Here's what a selection of the data looks like (several rows are hidden so that it doesn't take too much space), together with a graph of the time series.


[Figure: Global Sales of Wal-Mart (US$ million), quarterly, 1992-1 to 2000-4, vertical axis from 0 to 60,000]

From this graph, two features of this series strike us immediately: a general upward trend, and a recurring pattern of ups and downs throughout the four quarters of each year. The peak occurs in the 4th quarter of each year (October-December), and the 1st quarter (January-March) is usually the lowest. We have learned how to model the trend already. Here’s how the linear trend fits, as modelled using the Add Trendline function:

[Figure: Global Sales of Wal-Mart (US$ million) with fitted linear trendline y = 908.47x + 7304, R² = 0.9162]

Now, how do we augment our model to take account of the obvious seasonal pattern in the data? To deal with seasonality, we need to add dummy variables to the trend equation, to allow the model to have a different intercept for each quarter of the year.


The linear model we estimate is:

Yt = β0 + β1t + β2Q1t + β3Q2t + β4Q3t + et

Q1t is a dummy variable that takes the value 1 in the first quarter of each year, and zero in the other quarters. Likewise, Q2t = 1 in the 2nd quarter of each year and zero in other quarters, and Q3t is the 3rd quarter dummy variable. Recall that when we discussed dummy variables earlier in this topic, we interpreted them as ways of allowing for a different intercept in periods where the dummy variable equals one. The same interpretation applies here. Consider the first quarter of each year. In this case Q1 = 1, but Q2 = Q3 = 0. So the model becomes:

Yt = (β0 + β2) + β1t + et

So the intercept in the first quarter of each year is β0 + β2. By similar logic, the intercept in the 2nd quarter of each year is β0 + β3, and it is β0 + β4 for the 3rd quarter. In the 4th quarter, all the dummies equal zero, so the intercept is just β0. Here's how the data sheet would look for this example, with the three dummy variables added.
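The dummy-variable columns can also be sketched in code (Python, for illustration only – the notes build them in an Excel worksheet). The 4th quarter is the benchmark, so it gets no dummy of its own:

```python
# A sketch of building quarterly dummy variables for the seasonal model.
# Each row holds (Q1, Q2, Q3); a 4th-quarter row is (0, 0, 0).
def quarter_dummies(n_periods, start_quarter=1):
    """Return a list of (Q1, Q2, Q3) 0/1 tuples, one per period."""
    rows = []
    for t in range(n_periods):
        q = (start_quarter - 1 + t) % 4 + 1   # calendar quarter of period t
        rows.append((int(q == 1), int(q == 2), int(q == 3)))
    return rows

# First eight quarters, starting in 1992 Q1:
for t, row in enumerate(quarter_dummies(8), start=1):
    print(t, row)
```

In the worksheet these three columns sit alongside the time variable t, and the pattern (1,0,0), (0,1,0), (0,0,1), (0,0,0) repeats every four rows.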

Now we use Data Analysis (under the Data tab), Regression, with column B as the Y range and columns C, D, E and F as the X range, and obtain the following results:


This is the “complete” model, incorporating a linear trend and seasonal components:

Ŷt = 11283.24 + 888.849t − 5760.68Q1t − 4180.75Q2t − 4523.6Q3t

We can interpret the coefficients as follows:

b0 = 11283.24: the estimated trend value of the intercept in period t = 0 (4th quarter 1991) is $11,283.24 million.

b1 = 888.849: the estimated average growth in sales is $888,849,000 (about $889 million) per quarter.

b2 = −5760.68: we estimate that sales in the 1st quarter each year are typically $5,761 million below what they would be in the 4th quarter, after adjusting for trend.

b3 = −4180.75: we estimate that sales in the 2nd quarter each year are typically $4,181 million below what they would be in the 4th quarter, after adjusting for trend.

b4 = −4523.6: we estimate that sales in the 3rd quarter each year are typically $4,524 million below what they would be in the 4th quarter, after adjusting for trend.

The 4th quarter of each year is the quarter for which we don’t have a dummy variable, so it is like the “benchmark” quarter. The information in these seasonal estimates can be quite useful to business planners. For example, knowing just how much higher sales are in the 4th quarter of each year can help them in planning staff levels, and ensuring adequate supply of stock, etc. Similarly, they would not want to be overstaffed in the 1st quarter of each year, as this is always much quieter than the rest of the year. But more importantly, the model can now be used to generate forecasts into the future – see the next section.


6.4 Using the Time Series Model to Forecast

Now that we have our complete time series model, incorporating both trend and seasonal components, we can easily use it to generate forecasts for Yt into the future. All that we need to do is plug the appropriate values for t (time) and the quarterly dummy variables into the right hand side of the model. The model we estimated was:

Ŷt = 11283.24 + 888.849t − 5760.68Q1t − 4180.75Q2t − 4523.6Q3t

The data series ended in the 4th quarter of 2000, with a t value of 36. So to forecast sales for the 1st quarter of 2001, we plug in t = 37 and Q1t = 1, Q2t = Q3t = 0, giving:

1st quarter of 2001: Ŷ37 = 11283.24 + 888.849 × 37 − 5760.68 = 38,409.98

Predictions for the quarters that follow are calculated in the same way:

2nd quarter of 2001: Ŷ38 = 11283.24 + 888.849 × 38 − 4180.75 = 40,878.75

3rd quarter of 2001: Ŷ39 = 11283.24 + 888.849 × 39 − 4523.6 = 41,424.75

4th quarter of 2001: Ŷ40 = 11283.24 + 888.849 × 40 = 46,837.20
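These calculations are easy to reproduce in code (Python, for illustration only; the coefficients are those from the regression output above):

```python
# Forecasting with the fitted model:
# Yhat_t = 11283.24 + 888.849*t - 5760.68*Q1 - 4180.75*Q2 - 4523.6*Q3
b0, b1, b2, b3, b4 = 11283.24, 888.849, -5760.68, -4180.75, -4523.6

def forecast(t, q1, q2, q3):
    """Forecast sales (US$ million) for period t with quarter dummies q1-q3."""
    return b0 + b1 * t + b2 * q1 + b3 * q2 + b4 * q3

print(round(forecast(37, 1, 0, 0), 2))  # 2001 Q1: about 38,409.98
print(round(forecast(38, 0, 1, 0), 2))  # 2001 Q2: about 40,878.75
print(round(forecast(39, 0, 0, 1), 2))  # 2001 Q3: about 41,424.75
print(round(forecast(40, 0, 0, 0), 2))  # 2001 Q4: about 46,837.20
```

Note how the 4th-quarter forecast uses no dummy adjustment, since the 4th quarter is the benchmark.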

[Figure: Global Sales of Wal-Mart (US$ million), 1992-1 to 2001-4, showing actual Sales and the Predicted values from the model, vertical axis from 0 to 60,000]


6.5 Other Aspects of Modelling Time Series

There are several other important issues we have not dealt with in relation to modelling time series and forecasting with the linear trend model. We will briefly highlight these here. Even if you don't know a lot about these things, it's good to be aware of the issues, and to make use of what you know in practical modelling situations.

6.5.1 Outliers

When you graph the time series at the beginning of your analysis, look for any unusual data points – values which are especially large or small relative to the rest. This might suggest a data error (check that the data point is actually correct!), or an "outlier" – a value that occurs because of a one-off event, e.g. the September 11 attacks, a big strike, or a political revolution. If we don't recognise this value, it can have undue influence on the rest of the modelling, and produce unhelpful models. Once we identify the outlier, sometimes it's best to just omit that data point from the analysis. Here's an example:

[Figure: Average Cost of One Night's Accommodation in Hotels / Motels in NSW, quarterly, Mar-97 to Sep-01, vertical axis from $100 to $170, with a sharp spike in Sep-00]

Notice the significant extra cost of a night’s accommodation in the September quarter 2000. This is when the Sydney 2000 Olympics were taking place, and hotel / motel rates increased astronomically. This once-off event can be thought of as an outlier – the increase in average cost from around $110 per night to near $160 per night was clearly not part of an ongoing trend, nor a seasonal fluctuation. It is the result of a one-off event.
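One simple way to handle such a point, sketched below with made-up accommodation prices (Python, for illustration only), is to flag any value far above the typical level and drop it before fitting the trend. The 1.3-times-mean rule used here is an arbitrary choice for this sketch; in practice, judgement and context matter as much as any rule.

```python
# A sketch of dropping a one-off outlier (like the Olympics quarter)
# before fitting a trend. Prices here are hypothetical $/night values.
prices = [110, 112, 111, 113, 158, 114, 115]
periods = list(range(1, len(prices) + 1))

mean = sum(prices) / len(prices)
threshold = 1.3 * mean     # crude rule of thumb for "unusually large"

clean = [(t, p) for t, p in zip(periods, prices) if p <= threshold]
print(clean)               # the spike observation has been removed
```

After removal, the remaining observations can be used for trend and seasonal modelling as before.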


6.5.2 Cycle

We have learned how to model trend and seasonal patterns, but not cycle. This will come in later years! Meanwhile, we should recognise that our models will produce forecasts that ignore this component, and perhaps make some allowance for it. For example, if your model produces a set of forecasts for sales, and it's generally believed that we are about to enter a bad economic recession, then your forecasts are likely to be too high. You might want to make some adjustment to your forecasts.

6.5.3 Standardising

In Topic 1 we talked about the importance of standardising data before drawing interpretations or conclusions from it. For example, to say that sales by Wal-Mart have risen from $9 billion to $51 billion over the past 10 years sounds impressive. But remember there has been inflation in that period, so some of this growth can be attributed to rising prices, and doesn't represent "real" growth. A better picture would be obtained by looking at "Real Sales" – sales adjusted for inflation by dividing the sales figure by the CPI.

When modelling time series, it sometimes helps to analyse a time series in each of its component parts, rather than the "final" series. For example, movements in total sales of Wal-Mart come about because of growth in the number of stores, growth in real sales per store, and growth in the overall price level. That is:

Total Nominal Sales = Number of Stores × Average Real Sales per Store × Price Level

We could do trend and seasonal models of each of these three components, and then combine them. Sometimes this gives better insight into what drives the total sales variable (e.g. is the growth mostly from opening new stores, from growth in sales at each store, or from inflation?), and possibly more accurate forecasts.

6.5.4 Other Functional Forms

In this course we have only modelled a linear trend and additive seasonal dummies. For the trend, this means that we expect Y to grow by a constant amount per year.
Often this is not appropriate, and other functional forms would fit the data better and make more accurate forecasts. For example, if we believe Y grows at a constant percentage rate per year, an exponential model might be more appropriate:


[Figure: Global Sales of Wal-Mart (US$ million) with fitted exponential trendline y = 10566e^(0.0399x), R² = 0.9476]

Or, if the rate of growth of Y is not constant, but levels off over time, a logarithmic model may be better:

[Figure: Global Sales of Wal-Mart (US$ million) with fitted logarithmic trendline y = 9783ln(x) − 1901.1, R² = 0.7168]
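An exponential trend can be estimated with the same least-squares machinery by regressing ln(Y) on t. Here is a sketch with synthetic data growing at exactly 10% per period (Python, for illustration only; this is not the Wal-Mart data):

```python
import math

# Fit an exponential trend y = a * exp(b*t) by OLS on ln(y) against t.
ys = [100.0, 110.0, 121.0, 133.1, 146.41]   # synthetic: ~10% growth/period
ts = list(range(1, len(ys) + 1))
logs = [math.log(y) for y in ys]

# Ordinary least squares slope and intercept for ln(y) = ln(a) + b*t.
n = len(ts)
tbar = sum(ts) / n
lbar = sum(logs) / n
b = sum((t - tbar) * (l - lbar) for t, l in zip(ts, logs)) / \
    sum((t - tbar) ** 2 for t in ts)
a = math.exp(lbar - b * tbar)

print(round(b, 4))   # estimated continuous growth rate per period
```

For this series the estimated b is ln(1.1) ≈ 0.0953, i.e. growth of about 10% per period; Excel's exponential trendline performs the same log-linear fit behind the scenes.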

6.5.5 Growth Rates vs. Levels

So far we have only talked about modelling the level of a series – e.g. sales. Sometimes it makes more sense to model the growth rate of a series. The growth rate is more difficult to model – it fluctuates more than the underlying level of the series – but it can sometimes give more accurate forecasts of the actual level. You'll learn more about this in later years!
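Converting a level series into period-on-period growth rates is straightforward; here is a sketch with hypothetical sales levels (Python, for illustration only):

```python
# A sketch of converting levels into percentage growth rates.
sales = [9.0, 12.5, 16.0, 22.4, 30.0]   # hypothetical levels (US$ billion)

# Growth rate in period t = (Y_t - Y_{t-1}) / Y_{t-1} * 100.
growth = [(curr - prev) / prev * 100 for prev, curr in zip(sales, sales[1:])]
print([round(g, 1) for g in growth])
```

Notice that the growth series has one fewer observation than the level series, since the first period has no predecessor to compare against.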