ACTL2002/ACTL5101 Probability and Statistics: Week 11
ACTL2002/ACTL5101 Probability and Statistics
© Katja Ignatieva
School of Risk and Actuarial Studies, Australian School of Business, University of New South Wales
Last ten weeks
Introduction to probability;
Moments: (non-)central moments, mean, variance (standard deviation), skewness & kurtosis;
Special univariate (parametric) distributions (discrete & continuous);
Joint distributions;
Convergence, with applications LLN & CLT;
Estimators (MME, MLE, and Bayesian);
Evaluation of estimators;
Interval estimation.
Final two weeks
Simple linear regression:
- Idea;
- Estimating using LSE (& BLUE estimator & relation to MLE);
- Partition of variability of the variable;
- Testing:
  i) Slope;
  ii) Intercept;
  iii) Regression line;
  iv) Correlation coefficient.

Multiple linear regression:
- Matrix notation;
- LSE estimates;
- Tests;
- R-squared and adjusted R-squared.
Multiple Linear regression
Matrix notation
  Linear Algebra and Matrix Approach
  The Model in Matrix Form
  Linear models
Statistical Properties of the Least Squares Estimates
  Statistical Properties of the Least Squares Estimates
  CI and Tests for Individual Regression Parameters
  CI and Tests for functions of Regression Parameters
Example: Multiple Linear Regression
  Example regression output
  Exercise: Multiple Linear Regression
  Example: Multiple Linear Regression
Appendix
  Simple linear regression in matrix form
Linear Algebra and Matrix Approach
In general we will consider the multiple regression problem:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_{p-1} x_{p-1}$$

and data points:

$$\begin{array}{ccccc}
y_1 & x_{11} & x_{12} & \ldots & x_{1,p-1} \\
y_2 & x_{21} & x_{22} & \ldots & x_{2,p-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
y_n & x_{n1} & x_{n2} & \ldots & x_{n,p-1}
\end{array}$$
Multiple Regression: Linear Algebra and Matrix Approach
Observations $y_i$ are written in a vector $y$.

Regression coefficients are the vector ($p$ by 1) $\beta = [\beta_0, \beta_1, \ldots, \beta_{p-1}]^\top$, where $\top$ indicates transpose ($\beta$ is a column vector).

The matrix $X$ (size $n$ by $p$) is:

$$X = \begin{bmatrix}
1 & x_{11} & x_{12} & \ldots & x_{1,p-1} \\
1 & x_{21} & x_{22} & \ldots & x_{2,p-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \ldots & x_{n,p-1}
\end{bmatrix}$$

Predicted values are:

$$\hat{y} = X\hat\beta.$$
Multiple Regression: Linear Algebra and Matrix Approach
The least squares problem is to select $\beta$ to minimize:

$$S(\beta) = (y - X\beta)^\top (y - X\beta).$$

Proof: see next slides.

Differentiating with respect to each of the $\beta$'s, the normal equations become:

$$X^\top X \beta = X^\top y.$$

If $X^\top X$ is non-singular then the parameter estimates are:

$$\hat\beta = (X^\top X)^{-1} X^\top y.$$

The residuals are:

$$\hat\varepsilon = y - \hat{y} = y - X\hat\beta.$$
The least squares problem is to find the vector $\beta$ that minimizes:

$$S(\beta) = \sum_{i=1}^n \varepsilon_i^2 = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n \left(y_i - \beta_0 - \beta_1 x_{i1} - \ldots - \beta_{p-1} x_{i,p-1}\right)^2 = (y - X\beta)^\top (y - X\beta).$$

Derivation of the least squares estimator:

$$\begin{aligned}
0 &= \frac{\partial}{\partial\beta} (y - X\beta)^\top (y - X\beta) \\
  &= \frac{\partial}{\partial\beta} \left(y^\top y - 2(X^\top y)^\top \beta + \beta^\top X^\top X \beta\right) \\
  &= -2X^\top y + X^\top X\beta + (X^\top X)^\top \beta \\
  &= -2X^\top y + 2X^\top X\beta \\
\Rightarrow\quad X^\top y &= X^\top X\hat\beta \quad\Rightarrow\quad \hat\beta = (X^\top X)^{-1} X^\top y.
\end{aligned}$$
The Least Squares Estimates

Differentiating this matrix expression w.r.t. $\beta$ and equating to zero leads to:

$$X^\top X \beta = X^\top Y,$$

i.e., the normal equations. If $(X^\top X)^{-1}$ exists, the solution is:

$$\hat\beta = (X^\top X)^{-1} X^\top Y.$$

The corresponding vector of fitted (or predicted) values of $y$ is:

$$\hat{Y} = X\hat\beta$$

and the vector of residuals:

$$\hat\varepsilon = Y - \hat{Y} = Y - X\hat\beta$$

gives the differences between the observed and fitted values.
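As a quick numerical sketch of these formulas (the data below are invented for illustration, not from the course), the normal equations can be solved directly and checked against numpy's least-squares routine:

```python
import numpy as np

# Hypothetical small dataset: n = 5 observations, p = 3 columns
# (intercept plus two regressors).
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 3.0],
              [1.0, 5.0, 5.0]])
y = np.array([3.0, 4.0, 8.0, 9.0, 12.0])

# Normal equations: X'X beta = X'y, hence beta = (X'X)^{-1} X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values and residuals.
y_fit = X @ beta_hat
resid = y - y_fit

# Cross-check against numpy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)

# The normal equations say the residuals are orthogonal to the columns of X.
assert np.allclose(X.T @ resid, 0.0)
```

Solving $X^\top X\beta = X^\top y$ with `np.linalg.solve` avoids forming the explicit inverse, which is numerically preferable to computing $(X^\top X)^{-1}$ directly.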
The Model in Matrix Form
Consider the regression model of the form:

$$y = \beta_0 + \beta_1 x_1 + \ldots + \beta_{p-1} x_{p-1} + \varepsilon.$$

Fitted to data, the model becomes:

$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_{p-1} x_{i,p-1} + \varepsilon_i, \quad \text{for } i = 1, 2, \ldots, n.$$

Define the vectors:

$$\underset{[n\times 1]}{Y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
\underset{[p\times 1]}{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix}, \quad \text{and} \quad
\underset{[n\times 1]}{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}.$$
The Model in Matrix Form
Together with the matrix:

$$\underset{[n\times p]}{X} = \begin{bmatrix}
1 & x_{11} & \ldots & x_{1,p-1} \\
1 & x_{21} & \ldots & x_{2,p-1} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & \ldots & x_{n,p-1}
\end{bmatrix},$$

write the model in matrix form as follows:

$$\underset{[n\times 1]}{Y} = \underset{[n\times p]}{X}\;\underset{[p\times 1]}{\beta} + \underset{[n\times 1]}{\varepsilon}.$$

The fitted value is:

$$\underset{[n\times 1]}{\hat Y} = \underset{[n\times p]}{X}\;\underset{[p\times 1]}{\hat\beta}.$$
Introduction
To apply linear regression properly:

- Effects of the covariates (explanatory variables) must be additive;
- Homoskedastic (constant) variance (otherwise use an AutoRegressive Conditional Heteroskedasticity (ARCH) model, from Robert Engle; 2003 Nobel prize in Economics);
- Errors must be independent of the explanatory variables with mean zero (weak assumptions);
- Errors must be Normally distributed, and hence symmetric (only in the case of testing, i.e., strong assumptions).
Linear models in general
A linear model involves a response variable datum, $y_i$, treated as an observation on a random variable, $(Y_i \mid X = x)$, where $\mathrm{E}[Y_i \mid X = x] \equiv \mu_i$, the $\varepsilon_i$'s are zero-mean random variables independent of $X$, and the $\beta_i$'s are model parameters, whose values are unknown and need to be estimated using data.

The following are examples of linear models:

- Affine form: $\mu_i = \beta_0 + x_i\beta_1$;
- Polynomial (cubic) form: $\mu_i = \beta_0 + x_i\beta_1 + x_i^2\beta_2 + x_i^3\beta_3$;
- Affine form with interaction terms: $\mu_i = \beta_0 + x_i\beta_1 + z_i\beta_2 + (x_i z_i)\beta_3$.

For all linear forms we have: $Y_i = \mu_i + \varepsilon_i$.
Linear models
The first model can be re-written in matrix-vector form as:

$$\begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \vdots \\ \mu_n \end{bmatrix} =
\underbrace{\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}}_{X}
\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = [\mathbf{1}_n\ X]\,\beta.$$

So the model has the general form $\mu = X\beta$, i.e., the expected value vector $\mu$ is given by a model matrix (or design matrix), $X$, multiplied by a parameter vector, $\beta$.

All linear models can be written in this general form.
Linear models
The second model (the cubic) given above can be written in matrix-vector form as:

$$\begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \vdots \\ \mu_n \end{bmatrix} =
\underbrace{\begin{bmatrix}
1 & x_1 & x_1^2 & x_1^3 \\
1 & x_2 & x_2^2 & x_2^3 \\
1 & x_3 & x_3^2 & x_3^3 \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_n & x_n^2 & x_n^3
\end{bmatrix}}_{X}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix}.$$
Models in which the data are divided into different groups, each of which is assumed to have a different mean, are less obviously of the form $\mu = X\beta$, but they can be written like this using dummy variables.

Consider the model:

$$y_i = \beta_j + \varepsilon_i \quad \text{if observation } i \text{ is in group } j,$$

and suppose there are three groups, each with two data points. Then the model can be re-written:

$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \end{bmatrix} =
\underbrace{\begin{bmatrix}
1 & 0 & 0 \\
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
0 & 0 & 1
\end{bmatrix}}_{X}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} + \varepsilon.$$
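The dummy-variable construction above can be sketched numerically (the group labels and responses below are made up for illustration):

```python
import numpy as np

# Hypothetical group labels: three groups, two observations each,
# matching the 6x3 dummy design on the slide.
groups = np.array([0, 0, 1, 1, 2, 2])
y = np.array([1.0, 3.0, 10.0, 12.0, 20.0, 24.0])

# Build X with one dummy column per group: X[i, j] = 1 iff obs i is in group j.
X = (groups[:, None] == np.arange(3)[None, :]).astype(float)

# LSE via the normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# With this design, each beta_j is simply the sample mean of group j.
```

Here $X^\top X$ is diagonal (the groups do not overlap), so the least squares estimate of each $\beta_j$ reduces to the mean of the observations in group $j$.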
Marginal effects
Assume that we have the multiple regression model of the form:

$$y = \beta_0 + \beta_1 x_1 + \ldots + \beta_{p-1} x_{p-1} + \varepsilon.$$

Assume that $x_k$ is a continuous variable, so that if we increase it by one unit while holding the values of the other variables fixed, the value of $y$ becomes:

$$y_{\text{new}} = \beta_0 + \beta_1 x_1 + \ldots + \beta_k (x_k + 1) + \ldots + \beta_{p-1} x_{p-1} + \varepsilon.$$

Since $\mathrm{E}[\varepsilon] = 0$, the marginal effect of $x_k$:

$$\beta_k = \mathrm{E}[y_{\text{new}}] - \mathrm{E}[y],$$

is therefore the expected increase (or decrease) in the value of $y$ whenever you increase the value of $x_k$ by one unit.
Assumptions
The residual terms $\varepsilon_i$ satisfy the following:

- $\mathrm{E}[\varepsilon_i \mid X = x] = 0$, for $i = 1, 2, \ldots, n$;
- $\mathrm{Var}(\varepsilon_i \mid X = x) = \sigma^2$, for $i = 1, 2, \ldots, n$;
- $\mathrm{Cov}(\varepsilon_i, \varepsilon_j \mid X = x) = 0$, for all $i \neq j$.

In words: the residuals have zero mean, common variance, are uncorrelated with the explanatory variables, and are independent of the other residuals.

In matrix form, we have:

$$\mathrm{E}[\varepsilon] = 0; \qquad \mathrm{Cov}(\varepsilon) = \sigma^2 I_n,$$

where $I_n$ is a matrix of size $n \times n$ with ones on the diagonal and zeros on the off-diagonal elements.
Statistical Properties of the Least Squares Estimates

The following properties of the least squares estimates can be verified:

1. The least squares estimates are unbiased: $\mathrm{E}[\hat\beta] = \beta$.

2. The variance-covariance matrix of the least squares estimates is: $\mathrm{Var}(\hat\beta) = \sigma^2 \cdot (X^\top X)^{-1}$.

3. An unbiased estimate of $\sigma^2$ is:

$$s^2 = \frac{1}{n-p}\left(y - \hat y\right)^\top \left(y - \hat y\right).$$

Note that:

$$\frac{(n-p) \cdot S^2}{\sigma^2} \sim \chi^2(n-p),$$

and $\hat\beta$ and $S^2$ are independent.
Statistical Properties of the Least Squares Estimates
4. Each component $\hat\beta_k$ is normally distributed with mean:

$$\mathrm{E}[\hat\beta_k] = \beta_k,$$

and variance:

$$\mathrm{Var}(\hat\beta_k) = \sigma^2 \cdot c_{kk},$$

where $c_{kk}$ is the $(k+1)$th diagonal entry of the matrix $C = (X^\top X)^{-1}$ (because $c_{11}$ corresponds to the constant), and the covariance between $\hat\beta_k$ and $\hat\beta_l$ is:

$$\mathrm{Cov}(\hat\beta_k, \hat\beta_l) = \sigma^2 \cdot c_{kl}.$$
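Properties 2 and 3 above can be sketched numerically; the simulated dataset below is hypothetical, chosen only to exercise the formulas:

```python
import numpy as np

# Hypothetical data: n = 50 observations, p = 3 parameters.
rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

# Unbiased estimate of sigma^2: s^2 = (y - y_hat)'(y - y_hat) / (n - p).
s2 = resid @ resid / (n - p)

# Estimated covariance matrix of beta_hat: s^2 * (X'X)^{-1};
# standard errors are the square roots of its diagonal entries c_kk.
C = np.linalg.inv(X.T @ X)
cov_beta = s2 * C
se_beta = np.sqrt(np.diag(cov_beta))
```

In repeated simulations of this kind, averaging `beta_hat` over many draws approaches `beta_true` (property 1), and `s2` averages to the true error variance.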
CI and Tests for Individual Regression Parameters
The standard error of $\hat\beta_k$ is estimated using:

$$se(\hat\beta_k) = s\sqrt{c_{kk}}.$$

Under the normality (strong) assumption, we have:

$$\frac{\hat\beta_k - \beta_k}{se(\hat\beta_k)} \sim t(n-p).$$

A $100(1-\alpha)\%$ confidence interval for $\beta_k$ is given by:

$$\hat\beta_k \pm t_{1-\alpha/2,\,n-p} \cdot se(\hat\beta_k).$$
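This confidence interval can be sketched with scipy supplying the $t$ quantile (the data are again simulated and hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical data: n = 60 observations, p = 3 parameters.
rng = np.random.default_rng(4)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.4, size=n)

C = np.linalg.inv(X.T @ X)
beta_hat = C @ X.T @ y
resid = y - X @ beta_hat
s = np.sqrt(resid @ resid / (n - p))

# se(beta_k) = s * sqrt(c_kk), one value per coefficient.
se = s * np.sqrt(np.diag(C))

# 100(1 - alpha)% CI with alpha = 0.05: beta_k +/- t_{1-alpha/2, n-p} * se.
t_crit = stats.t.ppf(0.975, df=n - p)
ci_lower = beta_hat - t_crit * se
ci_upper = beta_hat + t_crit * se
```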
CI and Tests for Individual Regression Parameters

In testing the null hypothesis $H_0: \beta_k = \beta_{k0}$ for some fixed constant $\beta_{k0}$, we use the test statistic:

$$T = \frac{\hat\beta_k - \beta_{k0}}{se(\hat\beta_k)},$$

which, under the null hypothesis, has a $t$-distribution with $n-p$ degrees of freedom. The most common test is for the significance of the presence of the variable $x_k$, in which case the test statistic simply becomes:

$$T = \frac{\hat\beta_k}{se(\hat\beta_k)},$$

because we test $H_0: \beta_k = 0$ against $H_1: \beta_k \neq 0$ when we test for the significance/importance of the variable.
CI and Tests for Individual Regression Parameters

However, we can always have more general tests for the regression coefficients, as demonstrated in the three cases below:

1. Test the null hypothesis:

$$H_0: \beta_k = \beta_{k0}$$

against the alternative:

$$H_1: \beta_k \neq \beta_{k0}.$$

Use the decision rule (using the generalized LRT, week 7):

$$\text{Reject } H_0 \text{ if: } |T| = \left|\frac{\hat\beta_k - \beta_{k0}}{se(\hat\beta_k)}\right| > t_{1-\alpha/2,\,n-p}.$$
CI and Tests for Individual Regression Parameters
2. Test the hypothesis:

$$H_0: \beta_k = \beta_{k0} \quad \text{v.s.} \quad H_1: \beta_k > \beta_{k0}.$$

Use the decision rule (using UMP, week 7):

$$\text{Reject } H_0 \text{ if: } T = \frac{\hat\beta_k - \beta_{k0}}{se(\hat\beta_k)} > t_{1-\alpha,\,n-p}.$$

3. Test the hypothesis:

$$H_0: \beta_k = \beta_{k0} \quad \text{v.s.} \quad H_1: \beta_k < \beta_{k0}.$$

Use the decision rule (using UMP, week 7):

$$\text{Reject } H_0 \text{ if: } T = \frac{\hat\beta_k - \beta_{k0}}{se(\hat\beta_k)} < -t_{1-\alpha,\,n-p}.$$
CI and Tests for functions of Regression Parameters

Let $D$ be a matrix (size $m \times p$) of $m$ linear combinations of the regression parameters. Then we have:

$$\mathrm{E}[D\hat\beta] = D\beta, \qquad \mathrm{Var}(D\hat\beta) = D\,\mathrm{Var}(\hat\beta)\,D^\top = \sigma^2 D (X^\top X)^{-1} D^\top.$$

Under the normality (strong) assumption, we have (for a single linear combination, $m = 1$):

$$\frac{D(\hat\beta - \beta)}{\underbrace{\sqrt{s^2 D (X^\top X)^{-1} D^\top}}_{= se(D\hat\beta)}} \sim t(n-p).$$

A $100(1-\alpha)\%$ confidence interval for $D\beta$ is given by:

$$D\hat\beta \pm t_{1-\alpha/2,\,n-p} \cdot se(D\hat\beta).$$
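A numerical sketch of this interval for a single linear combination, here $D = [0\ 1\ {-1}]$ targeting $\beta_1 - \beta_2$ (the data are simulated and hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical data with beta_1 = beta_2 by construction.
rng = np.random.default_rng(1)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, 2.0]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)

# D beta estimates beta_1 - beta_2; se(D beta) = sqrt(s^2 D (X'X)^{-1} D').
D = np.array([[0.0, 1.0, -1.0]])
se_Db = np.sqrt(s2 * (D @ XtX_inv @ D.T))[0, 0]

t_crit = stats.t.ppf(0.975, df=n - p)
est = (D @ beta_hat)[0]
ci = (est - t_crit * se_Db, est + t_crit * se_Db)
```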
Adjusted R-Squared

The coefficient of determination is:

$$R^2 = \frac{\text{SST} - \text{SSE}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}.$$

In the simple linear regression model, the R-squared provides a descriptive measure of the success of the regressor variables in explaining the variation in the dependent variable.

The R-squared will always increase when adding additional regressor variables, even if the added regressor variables do not strongly influence the dependent variable.

An alternative is to correct it for the number of regressor variables present. Thus, we define the adjusted R-squared:

$$R_a^2 = 1 - \frac{\text{SSE}/(n-p)}{\text{SST}/(n-1)} = 1 - \frac{s^2}{\text{MST}} = 1 - \frac{n-1}{n-p}\left(1 - R^2\right).$$
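Both definitions can be sketched from a fitted model; the data below are hypothetical, and the last line checks that the two forms of the adjusted R-squared agree:

```python
import numpy as np

# Hypothetical data: n = 30 observations, p = 4 parameters.
rng = np.random.default_rng(2)
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -0.5, 0.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

sse = resid @ resid                    # sum of squared residuals
sst = np.sum((y - y.mean()) ** 2)      # total sum of squares

r2 = 1 - sse / sst
r2_adj = 1 - (sse / (n - p)) / (sst / (n - 1))

# Equivalent form: 1 - (n-1)/(n-p) * (1 - R^2).
assert np.isclose(r2_adj, 1 - (n - 1) / (n - p) * (1 - r2))
```

Since $(n-1)/(n-p) \geq 1$, the adjusted R-squared never exceeds the ordinary R-squared.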
Can we test whether the regression explains anything significant? E.g., can we jointly test whether $[\beta_1, \ldots, \beta_{p-1}]^\top = 0$ (note: excluding $\beta_0$)?

Use the F-statistic:

$$F = \frac{|\tilde X\hat\beta|^2/(p-1)}{|\hat\varepsilon|^2/(n-p)} = \frac{\text{SSM}/(p-1)}{\text{SSE}/(n-p)} \sim F_{p-1,\,n-p}.$$

Under the strong assumptions, $|\tilde X\hat\beta|^2/\sigma^2 \sim \chi^2_{p-1}$ and $|\hat\varepsilon|^2/\sigma^2 \sim \chi^2_{n-p}$ are chi-squared distributed (note: $\tilde X$ is the matrix $X$ without the constant column).

Interpretation: if the regression model explains a large proportion of the variability in $y$, then $|\tilde X\hat\beta|^2$ should be large and $|\hat\varepsilon|^2$ should be small.

Hence, test $H_0: \beta_1 = \ldots = \beta_{p-1} = 0$ v.s. $H_1$: at least one $\beta_k \neq 0$.

Reject $H_0$ if $F > F_{p-1,\,n-p}(1-\alpha)$.
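The overall F-test can be sketched from the sums of squares, with scipy supplying the F-distribution tail (simulated, hypothetical data with a genuine signal):

```python
import numpy as np
from scipy import stats

# Hypothetical data where the regressors clearly matter.
rng = np.random.default_rng(3)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

sse = resid @ resid
sst = np.sum((y - y.mean()) ** 2)
ssm = sst - sse

# F = [SSM/(p-1)] / [SSE/(n-p)], compared against F_{p-1, n-p}.
F = (ssm / (p - 1)) / (sse / (n - p))
p_value = stats.f.sf(F, p - 1, n - p)   # upper tail: 1 - F_{p-1,n-p}(F)
```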
ANOVA table and sum of squares:
- SST is the total variability in the absence of knowledge of the variables $X_1, \ldots, X_{p-1}$;
- SSE is the total variability remaining after introducing the effect of $X_1, \ldots, X_{p-1}$;
- SSM is the total variability "explained" because of knowledge of $X_1, \ldots, X_{p-1}$.

This partitioning of the variability is used in ANOVA tables:

Source      Sum of squares      Degrees of freedom   Mean square      F         p-value
Regression  SSM = Σ(ŷᵢ − ȳ)²    DFM = p − 1          MSM = SSM/DFM    MSM/MSE   1 − F_DFM,DFE(F)
Error       SSE = Σ(yᵢ − ŷᵢ)²   DFE = n − p          MSE = SSE/DFE
Total       SST = Σ(yᵢ − ȳ)²    DFT = n − 1          MST = SST/DFT
Example regression output (= summary)

Error variance and standard deviation:

$$s^2 = \text{MSE} = \frac{\sum_{i=1}^n \hat\varepsilon_i^2}{n-p}, \qquad \text{CI for } \sigma^2: \left[\frac{\text{SSE}}{\chi^2_{1-\alpha/2}(n-p)},\ \frac{\text{SSE}}{\chi^2_{\alpha/2}(n-p)}\right],$$

$$s = \sqrt{s^2}, \qquad \text{CI for } \sigma: \left[\sqrt{\frac{\text{SSE}}{\chi^2_{1-\alpha/2}(n-p)}},\ \sqrt{\frac{\text{SSE}}{\chi^2_{\alpha/2}(n-p)}}\right].$$

ANOVA:

Source      Sum of squares      Degrees of freedom   Mean square      F         p-value
Regression  SSM = Σ(ŷᵢ − ȳ)²    DFM = p − 1          MSM = SSM/DFM    MSM/MSE   1 − F_DFM,DFE(F)
Error       SSE = Σ(yᵢ − ŷᵢ)²   DFE = n − p          MSE = SSE/DFE
Total       SST = Σ(yᵢ − ȳ)²    DFT = n − 1          MST = SST/DFT
Example regression output (cont.) (= summary)

$$R^2 = 1 - \frac{\text{SSE}}{\text{SST}}, \qquad R = \sqrt{R^2}, \qquad R_a^2 = 1 - \frac{\text{SSE}/(n-p)}{\text{SST}/(n-1)}, \qquad R_a = \sqrt{R_a^2}.$$

Coefficients (for each $\hat\beta_k$):

$$\hat\beta = (X^\top X)^{-1} X^\top y, \qquad se(\hat\beta_k) = \sqrt{\mathrm{Cov}(\hat\beta)_{kk}}, \qquad t = \frac{\hat\beta_k}{se(\hat\beta_k)}, \qquad \text{p-value} = 2\left(1 - t_{n-p}(|t|)\right),$$

$$\text{CI}(\beta_k): \quad \hat\beta_k - t_{1-\alpha/2}(n-p) \cdot se(\hat\beta_k), \qquad \hat\beta_k + t_{1-\alpha/2}(n-p) \cdot se(\hat\beta_k).$$

Covariance matrix:

$$\mathrm{Cov}(\hat\beta) = s^2 \cdot (X^\top X)^{-1}.$$
Exercise regression
Given is the following linear regression:

$$Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i.$$

For our sample with 20 observations we have $\sum_{i=1}^{20}(y_i - \bar y)^2 = 53.82$:

$$(X^\top X)^{-1} = \begin{bmatrix}
0.19 & -0.08 & -0.04 \\
-0.08 & 0.11 & -0.03 \\
-0.04 & -0.03 & 0.05
\end{bmatrix}, \qquad
\hat\beta = \begin{bmatrix} 0.2 \\ 0.93 \\ 0.95 \end{bmatrix}, \qquad
\sum_{i=1}^{20} \hat\varepsilon_i^2 = 11.67.$$

a. Question: What is the estimate of the variance of the residual?
b. Question: What is the 95% CI for $\beta_1$?
c. Question: What is the 95% CI for $\beta_1 - \beta_2$?
d. Question: Are $X_1$ and $X_2$ jointly significant?
Exercise regression

a. Solution: $s^2 = \sum_{i=1}^{20} \hat\varepsilon_i^2/(n-p) = 11.67/17 = 0.69$.

b. Solution: $\widehat{\mathrm{Var}}(\hat\beta_1) = s^2 \cdot c_{11} = 0.69 \cdot 0.11 = 0.076 \Rightarrow se(\hat\beta_1) = \sqrt{0.076} = 0.276$.

F&T page 163: $t_{0.975}(17) = 2.110$, thus the 95% CI for $\beta_1$ is:

$$\left(\hat\beta_1 - t_{0.975}(17) \cdot se(\hat\beta_1),\ \hat\beta_1 + t_{0.975}(17) \cdot se(\hat\beta_1)\right) = (0.35, 1.51).$$

c. Solution: $D = [0\ \ 1\ \ {-1}]$; $\widehat{\mathrm{Var}}(D\hat\beta) = s^2 \cdot D(X^\top X)^{-1}D^\top$ is:

$$\widehat{\mathrm{Var}}(D\hat\beta) = 0.69 \cdot \begin{bmatrix} 0 & 1 & -1 \end{bmatrix}
\begin{bmatrix}
0.19 & -0.08 & -0.04 \\
-0.08 & 0.11 & -0.03 \\
-0.04 & -0.03 & 0.05
\end{bmatrix}
\begin{bmatrix} 0 \\ 1 \\ -1 \end{bmatrix}
= 0.69 \cdot \begin{bmatrix} -0.04 & 0.14 & -0.08 \end{bmatrix} \begin{bmatrix} 0 \\ 1 \\ -1 \end{bmatrix}
= 0.69 \cdot 0.22 = 0.151.$$
Exercise regression
c. Solution (cont.): $se(D\hat\beta) = \sqrt{\widehat{\mathrm{Var}}(D\hat\beta)} = \sqrt{0.151} = 0.389$.

F&T page 163: $t_{0.975}(17) = 2.110$, thus the 95% CI for $\beta_1 - \beta_2$ is:

$$\left(\hat\beta_1 - \hat\beta_2 - t_{0.975}(17) \cdot se(D\hat\beta),\ \hat\beta_1 - \hat\beta_2 + t_{0.975}(17) \cdot se(D\hat\beta)\right) = (-0.84, 0.80).$$

d. Solution: SST = 53.82; SSE = 11.67; SSM = 53.82 − 11.67 = 42.15;
MSM = 42.15/2 = 21.08; MSE = 11.67/17 = 0.687; F = 21.08/0.687 = 30.68.

$F_{0.01}(2, 17) = 6.112$, thus $X_1$ and $X_2$ are jointly significant even for $\alpha = 0.01$.
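The table look-ups in this exercise can be replaced by scipy, which reproduces the numbers above to rounding (the inputs are exactly those given in the exercise):

```python
import numpy as np
from scipy import stats

# Inputs from the exercise: n = 20 observations, p = 3 parameters.
n, p = 20, 3
XtX_inv = np.array([[0.19, -0.08, -0.04],
                    [-0.08, 0.11, -0.03],
                    [-0.04, -0.03, 0.05]])
beta_hat = np.array([0.2, 0.93, 0.95])
sst, sse = 53.82, 11.67

# a. s^2 = SSE / (n - p)
s2 = sse / (n - p)                                   # about 0.69

# b. 95% CI for beta_1
t_crit = stats.t.ppf(0.975, n - p)                   # about 2.110
se1 = np.sqrt(s2 * XtX_inv[1, 1])
ci_b1 = (beta_hat[1] - t_crit * se1, beta_hat[1] + t_crit * se1)

# c. 95% CI for beta_1 - beta_2 via D = [0, 1, -1]
D = np.array([0.0, 1.0, -1.0])
se_D = np.sqrt(s2 * D @ XtX_inv @ D)
est = D @ beta_hat
ci_diff = (est - t_crit * se_D, est + t_crit * se_D)

# d. joint significance: F = MSM / MSE vs F_{2,17}(0.99)
F = ((sst - sse) / (p - 1)) / (sse / (n - p))
f_crit = stats.f.ppf(0.99, p - 1, n - p)             # about 6.112
```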
Example: Multiple Linear Regression
We use a dataset consisting of salaries of football players and someregressor variables that may influence their salaries:
1. SALARY = “player’s salary”;
2. DRAFT = “the round in which player was originally drafted”;
3. YRSEXP = “the player’s experience in years”;
4. PLAYED = “the number of games played in the previous year”;
Example: Multiple Linear Regression
Regressor variables (cont.):
5. STARTED = “the number of games started in the previousyear”;
6. CITYPOP = “the population of the city in which the player isdomiciled”;
7. OFFBACK = “an indicator of player’s position in the game”(takes value 1 = offback defensive, 0 = others), i.e., it is adummy variable.
Example: Multiple Linear Regression
Summary Statistics of Variables in the Football Players Salary Data
Count Mean Median Std Dev Minimum Maximum
SALARY 169 336809 265000 255118 75000 1500000
DRAFT 169 6.473 5 4.61 1 13
YRSEXP 169 4.077 4 3.352 0 17
PLAYED 169 10.237 14 6.999 0 16
STARTED 169 5.97 1 6.859 0 16
CITYPOP 169 4980435 2421000 5098109 1176000 18120000
OFFBACK 169 0.2367 0 0.4263 0 1
Example: Multiple Linear Regression
The Correlation Matrix

          SALARY   DRAFT   YRSEXP   PLAYED   STARTED   CITYPOP   OFFBACK
SALARY
DRAFT     -0.454
YRSEXP     0.345   -0.059
PLAYED     0.212   -0.108    0.646
STARTED    0.440   -0.253    0.557    0.633
CITYPOP    0.077   -0.126    0.129    0.193     0.178
OFFBACK    0.179   -0.209   -0.050   -0.043    -0.081    -0.067
Example: Multiple Linear Regression
ANOVA Table

Source      Degrees of freedom   Sum of Squares   Mean Squares        F-Ratio   Prob(>F)
Regression  p − 1                SSM              MSM = SSM/(p − 1)   MSM/MSE   p-value
Error       n − p                SSE              MSE = SSE/(n − p)
Total       n − 1                SST              MST = SST/(n − 1)
Example: Multiple Linear Regression
From this ANOVA table, we can derive several statistics that can be used to summarise the quality of the regression model. For example:

- The coefficient of determination is defined by:

$$R^2 = \frac{\text{SSM}}{\text{SST}}$$

and has the interpretation that it gives the proportion of the total variability that is explained by the regression equation.
Example: Multiple Linear Regression
- The adjusted coefficient of determination is defined by:

$$R_a^2 = 1 - \frac{\text{SSE}/(n-p)}{\text{SST}/(n-1)} = 1 - \frac{s^2}{S_y^2}$$

and has the same interpretation as the R-squared, except that it is adjusted for the number of regressor variables.

In multiple regression, the R-squared increases as the number of variables increases, but not necessarily so for the adjusted R-squared.

It increases only if an influential variable is added.
Example: Multiple Linear Regression
- The size of a typical error, denoted by $s$, is the square root of $s^2$ and is also the square root of the error mean square:

$$s = \sqrt{s^2} = \sqrt{\text{MSE}} = \sqrt{\frac{\text{SSE}}{n-p}}.$$

It gives the average deviation of the actual $y$ from that predicted by the regression equation.
Example: Multiple Linear Regression
- The $F$-ratio, defined by:

$$F\text{-ratio} = \frac{\text{MSM}}{\text{MSE}},$$

is the test statistic used for model adequacy.

It provides another indication of how good the model is; its corresponding p-value should be as small as possible.
Example: Multiple Linear Regression
Summary of the results of the regression of the players’ salariesagainst the regressor variables:
Regression Analysis
The regression equation is
SALARY = 361663 - 19139 DRAFT + 21301 YRSEXP - 7948 PLAYED
+ 12965 STARTED - 0.00070 CITYPOP + 82941 OFFBACK
Predictor Coef SE Coef T p
Constant 361663 43734 8.17 0.000
DRAFT -19139 3674 -5.21 0.000
YRSEXP 21301 6370 3.34 0.001
PLAYED -7948 3281 -2.42 0.017
STARTED 12965 3189 4.07 0.000
CITYPOP -0.000699 0.003176 -0.22 0.826
OFFBACK 82941 38241 2.17 0.032
S = 203817 R-sq = 38.5% R-sq(adj) = 36.2%
Example: Multiple Linear Regression
ANOVA Table:
Analysis of Variance
SOURCE DF SS MS F p
Regression 6 4.20463E+12 7.00772E+11 16.87 0.000
Error 162 6.72970E+12 41541379329
Total 168 1.09343E+13
Example: Multiple Linear Regression
Improving the Regression Model
Here we give a summary of the results of the improved regression model:
Regression Analysis
The regression equation is
LOGSAL = 11.8 + 0.0733 YRSEXP - 0.00981 PLAYED + 0.0264 STARTED
+ 0.000000 CITYPOP + 0.187 OFFBACK + 0.933 1/DRAFT
Predictor Coef SE Coef T p
Constant 11.7509 0.0814 144.42 0.000
YRSEXP 0.07332 0.01471 4.98 0.000
PLAYED -0.009815 0.007607 -1.29 0.199
STARTED 0.026380 0.007596 3.47 0.001
CITYPOP 0.00000001 0.00000001 0.70 0.482
OFFBACK 0.18741 0.08691 2.16 0.033
1/DRAFT 0.9334 0.1242 7.52 0.000
S = 0.4713 R-sq = 54.6% R-sq(adj) = 52.9%
Example: Multiple Linear Regression
New ANOVA Table:
Analysis of Variance
SOURCE DF SS MS F p
Regression 6 43.3145 7.2191 32.50 0.000
Error 162 35.9891 0.2222
Total 168 79.3035
Appendix
Simple linear regression in matrix form
For simple linear regression in matrix form we have:

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \qquad
X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}.$$

Hence:

$$X^\top X = \begin{bmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{bmatrix}$$

and:

$$(X^\top X)^{-1} = \frac{1}{\underbrace{n \cdot \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}_{= n \cdot \sum_{i=1}^n (x_i - \bar x)^2}}
\begin{bmatrix} \sum_{i=1}^n x_i^2 & -\sum_{i=1}^n x_i \\ -\sum_{i=1}^n x_i & n \end{bmatrix}.$$
Thus:

$$X^\top y = \begin{bmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n x_i y_i \end{bmatrix}.$$

Hence:

$$\hat\beta = \begin{bmatrix} \hat\beta_0 \\ \hat\beta_1 \end{bmatrix}
= (X^\top X)^{-1}(X^\top y)
= \frac{1}{n \cdot \sum_{i=1}^n (x_i - \bar x)^2}
\begin{bmatrix}
\sum_{i=1}^n x_i^2 \sum_{i=1}^n y_i - \sum_{i=1}^n x_i \sum_{i=1}^n x_i y_i \\
n \sum_{i=1}^n x_i y_i - \sum_{i=1}^n x_i \sum_{i=1}^n y_i
\end{bmatrix}.$$
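The closed-form expressions above can be checked against the generic matrix solution; the data below are simulated and hypothetical:

```python
import numpy as np

# Hypothetical simple-regression data.
rng = np.random.default_rng(5)
x = rng.normal(size=25)
y = 1.5 + 2.0 * x + rng.normal(scale=0.3, size=25)
n = len(x)

# Closed form from the appendix: common denominator n * sum((x - xbar)^2).
denom = n * np.sum((x - x.mean()) ** 2)
b0 = (np.sum(x**2) * np.sum(y) - np.sum(x) * np.sum(x * y)) / denom
b1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / denom

# Generic matrix solution: (X'X)^{-1} X'y.
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose([b0, b1], beta_hat)
```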