Lecture 1.1
Diploma in Statistics: Introduction to Regression 2010
Lecturer: Prof John Haslett
Department of Statistics
Lloyd Building, Room 146
email: [email protected]
Lecture 1.1
1. Introduction to course
– Course outline
– Learning objectives
– References
2. Illustrative examples
– Scatter plots
3. Objectives of regression
3. Introduction to SLR
– Case study: Productivity in mail sorting
4. Exercises and Readings
M.Stuart
1. Introduction to Course
Course Outline
Introduction:
• case studies
• graphical analysis, scatter plots
Review of Simple Linear Regression
• Initial analysis
• Linear model
• Prediction formula
• Estimation and Testing
• Correlation
Non-linear relationships
• the log transformation,
• and others
Course Outline
Multiple regression analysis
• Initial data analysis
• Least squares fit and interpretation
• Diagnostic analysis of residuals
• Application
Special topics
• indicator variables
• correlated explanatory variables
• model selection
Course Outline
Issues in interpreting regression
– "cause and effect"
– control of study environment
– observational studies vs controlled experiments
Specialisations and extensions
a list!
Statistical computing laboratories
Introduction to Course
Learning Objectives
Be able to
• use standard methods in straightforward applications
• explain the results of applying the methods
• outline the basis for the methods
• describe and check the assumptions underlying the methods
• identify sources of further information regarding more detailed understanding and more extensive methodology
Introduction to Course
References
Stuart, M., Introduction to Statistical Analysis for Business and Industry, Arnold, 2003, particularly Chapters 6, 8, ISBN 0340808446, Hamilton 519.5 N33, Lecky LEN 658.072 P32
Mullins, E., Statistics for the Quality Control Chemistry Laboratory, Royal Society of Chemistry, 2003, particularly Chapter 6, ISBN 0854046712, Hamilton 543 P51
Hamilton, L.C., Regression with Graphics, Duxbury Press, 1992, ISBN 0534159001, Hamilton 519.5 N27
Ryan, T.P., Modern Regression Methods (2nd ed), Wiley, 2009, ISBN 9780470081860 (Santry JL- 7-204 1st ed)
2. Illustrative Examples
Example 1: Ambient Temperature and Gas Consumption
Weekly household gas consumption (in 1000 cubic
feet) and the average outside temperature (in
degrees Celsius) were recorded for 56 weeks.
The house thermostat was set at 20°C throughout.
Ambient Temperature and Gas Consumption
Week  Temperature (°C)  Gas (1000 cu ft)   Week  Temperature (°C)  Gas (1000 cu ft)
1     -0.8              7.2                27    -0.7              4.8
2     -0.7              6.9                28    0.8               4.6
3     0.4               6.4                29    1.0               4.7
4     2.5               6.0                30    1.4               4.0
5     2.9               5.8                31    1.5               4.2
6     3.2               5.8                32    1.6               4.2
7     3.6               5.6                33    2.3               4.1
8     3.9               4.7                34    2.5               4.0
9     4.2               5.8                35    2.5               3.5
10    4.3               5.2                36    3.1               3.2
11    5.4               4.9                37    3.9               3.9
12    6.0               4.9                38    4.0               3.5
13    6.0               4.3                39    4.0               3.7
14    6.0               4.4                40    4.2               3.5
15    6.2               4.5                41    4.3               3.5
16    6.3               4.6                42    4.6               3.7
17    6.9               3.7                43    4.7               3.5
18    7.0               3.9                44    4.9               3.4
19    7.4               4.2                45    4.9               3.7
20    7.5               4.0                46    4.9               4.0
21    7.5               3.9                47    5.0               3.6
22    7.6               3.5                48    5.3               3.7
23    8.0               4.0                49    6.2               2.8
24    8.5               3.6                50    7.1               3.0
25    9.1               3.1                51    7.2               2.8
26    10.2              2.6                52    7.5               2.6
                                           53    8.0               2.7
                                           54    8.7               2.8
                                           55    8.8               1.3
                                           56    9.7               1.5
Example 1
Ambient Temperature and Gas Consumption
Regression Analysis: Gas versus Temperature
The regression equation is
Gas = 5.49 - 0.290 Temperature
Predictor Coef SE Coef T P
Constant 5.4862 0.2357 23.28 0.000
Temperature -0.29021 0.04220 -6.88 0.000
S = 0.860608
3. Objectives of Simple Linear Regression
From Base Module
• Detailing a known quantitative relation for Y and X
– Spot strength of welds and diameter
• Is there a relation?
– Dietary fat and prostate cancer levels
• Reliable prediction bounds
– Shelf life and potency
• Precision in instrument calibration
– Peak area and Dye concentration
Also:
• Simple summary
• Checking to confirm there is no (evidence of) a relationship
Galton’s Data
Galton’s heights data - 1078 pairs; Corr = 0.501
Y=Offspring
X=Mid-parent
Slope = 0.514
Reversion to the mediocre
Regression to the mean
Issues in Linear Regression
Details of a known relationship
• Symmetry: which is X? Does it matter?
– Regression is asymmetric: slopes -0.29 (Gas on Temperature), -1.60 (Temperature on Gas)
– Correlation is symmetric: -0.683
– Note: $(-0.29) \times (-1.60) \approx (-0.683)^2$
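The note on the product of the two slopes holds in general, not only for the gas data. A short derivation, assuming the usual notation $s_X$, $s_Y$ for the sample standard deviations and $r$ for the correlation (this notation is introduced here and is not used on the slides):

```latex
\[
b_{Y \mid X} = r\,\frac{s_Y}{s_X}, \qquad
b_{X \mid Y} = r\,\frac{s_X}{s_Y}
\quad\Longrightarrow\quad
b_{Y \mid X}\, b_{X \mid Y} = r^2 .
\]
\[
(-0.29) \times (-1.60) = 0.464 \approx 0.466 = (-0.683)^2
\]
```

The small discrepancy in the numerical check comes from rounding the reported slopes and correlation.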
[Figure: Scatterplot of Temperature vs Gas]
Issues in Linear Regression
Detailing a known quantitative relation for Y and X
Direction important?
Galton’s heights data - 1078 pairs; Corr = 0.501
offspring = 33.9 + 0.514 mid-parent
mid-parent = 34.1 + 0.489 offspring
Issues in Linear Regression
Details of a known relationship
• Linear relationship? Does it matter? Why?
• Are coefficients important?
Are precision details important?
Precision about what?
Coefficients
Gas = 5.49 - 0.290 Temp (-0.290 ± ?)
offspring = 33.9 + 0.514 mid-parent (0.514 ± ?)
Corr(Hts) = 0.501 (0.501 ± ?)
Issues in Linear Regression
Inverse use in calibration
Fit the forward regression model; use it inversely
Example 1: Demonstration
Using ‘Case’ variable
Example 1: Demonstration
User 1: There are two different lines!! That must be because something happened. I bet it was...
Example 1: Demonstration
User 2: Of course there are. Insulation was installed. What seems to be interesting is that they have different slopes. That’s not what we were expecting! Or is it just my eyes...?
Example 1: Demonstration
User 2: Of course there are. Insulation was installed. They don’t seem to be linear. That’s not what we were expecting. Or is it just my eyes?
Example 1: Lesson
• Omitting a key variable makes nonsense of the analysis and interpretation
• Typical of a naive researcher who
– is not quite sure what the ROLE of regression is in answering a question, OR
– is not quite sure what data will answer a question, OR
– is not quite sure what the interesting research question is
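The lesson can be illustrated numerically with the Example 1 data. The sketch below fits one least squares line to all 56 weeks and then separate lines to two groups of weeks. The split after week 26 is an assumption suggested by the layout of the data table; the slides say only that insulation was installed, not when. The helper `fit_line` is a hypothetical name introduced for this illustration.

```python
# (temperature in degrees C, gas in 1000 cubic feet), transcribed from the
# Example 1 table. ASSUMPTION: weeks 1-26 form one group and weeks 27-56
# the other; the slides do not state the week of the change.
WEEKS_1_26 = [
    (-0.8, 7.2), (-0.7, 6.9), (0.4, 6.4), (2.5, 6.0), (2.9, 5.8), (3.2, 5.8),
    (3.6, 5.6), (3.9, 4.7), (4.2, 5.8), (4.3, 5.2), (5.4, 4.9), (6.0, 4.9),
    (6.0, 4.3), (6.0, 4.4), (6.2, 4.5), (6.3, 4.6), (6.9, 3.7), (7.0, 3.9),
    (7.4, 4.2), (7.5, 4.0), (7.5, 3.9), (7.6, 3.5), (8.0, 4.0), (8.5, 3.6),
    (9.1, 3.1), (10.2, 2.6),
]
WEEKS_27_56 = [
    (-0.7, 4.8), (0.8, 4.6), (1.0, 4.7), (1.4, 4.0), (1.5, 4.2), (1.6, 4.2),
    (2.3, 4.1), (2.5, 4.0), (2.5, 3.5), (3.1, 3.2), (3.9, 3.9), (4.0, 3.5),
    (4.0, 3.7), (4.2, 3.5), (4.3, 3.5), (4.6, 3.7), (4.7, 3.5), (4.9, 3.4),
    (4.9, 3.7), (4.9, 4.0), (5.0, 3.6), (5.3, 3.7), (6.2, 2.8), (7.1, 3.0),
    (7.2, 2.8), (7.5, 2.6), (8.0, 2.7), (8.7, 2.8), (8.8, 1.3), (9.7, 1.5),
]


def fit_line(points):
    """Return (intercept, slope) of the least squares line y = a + b*x."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


pooled = fit_line(WEEKS_1_26 + WEEKS_27_56)  # ignores the grouping variable
first = fit_line(WEEKS_1_26)
second = fit_line(WEEKS_27_56)

for label, (a, b) in [("all weeks", pooled),
                      ("weeks 1-26", first),
                      ("weeks 27-56", second)]:
    print(f"{label:12s} Gas = {a:.2f} + ({b:.3f}) Temperature")
```

Comparing the three printed lines shows how a single pooled fit can hide the grouping that the ‘Case’ plot revealed.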
Example 2: CEO Compensation (US$) and Company Sales (US$m)
(Forbes Magazine, May 1994)
Total comp  Industry        Sales
28816       Financial       242
52000       ComputersComm   553
100000      Insurance       3653
102308      ComputersComm   2195
221641      Financial       238
250000      Entertainment   415
Example 2: CEO Compensation and Company Sales, logarithmic transformation
User 1: Which scale to use?
User 2: What does this tell us about the nature of the relationship?
Roughly,
$\mathrm{Comp} \approx \mathrm{const} \times \mathrm{Sales}^{b}$, where $b < 1$,
so, roughly,
$\log_{10}\mathrm{Comp} \approx \log_{10}\mathrm{const} + b\,\log_{10}\mathrm{Sales}$,
that is,
$\mathrm{Lcomp} = \mathrm{Lconst} + b\,\mathrm{LSales} = a + b\,\mathrm{LSales}$
Regression:
$\log_{10}\mathrm{Comp} = 5.28 + 0.26\,\log_{10}\mathrm{Sales}$
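One way to read the fitted log-log line is to back-transform it to the original scale, where it becomes a power law. The sketch below does this with the two reported coefficients; the function name `predicted_comp` and the sales figures of 1000 and 2000 (US$m) are illustrative assumptions, not values from the slides.

```python
import math

# Fitted intercept and slope on the log10 scale, as reported on the slide.
A, B = 5.28, 0.26


def predicted_comp(sales_musd):
    """Back-transform the fitted line: Comp = 10**A * Sales**B (Sales in US$m)."""
    return 10 ** (A + B * math.log10(sales_musd))


# Effect of doubling sales (illustrative sales values, not from the data).
ratio = predicted_comp(2000) / predicted_comp(1000)
print(f"doubling sales multiplies predicted compensation by {ratio:.3f}")
```

Because the slope is applied on the log scale, the ratio equals 2 raised to the power 0.26 whatever the starting sales figure, so doubling sales is associated with roughly a 20% increase in predicted compensation, not a doubling.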
Example 3: Multiple regression
Relating Respiratory Muscle Strength to other measures of lung function in patients suffering from cystic fibrosis, adjusting for sex and body size.
The variables
PEmax   Maximal static expiratory pressure, a measure of expiratory muscle strength
FEV1    Forced expiratory volume in 1 second
RV      Residual volume (after 1 second)
FRC     Functional residual capacity
TLC     Total lung capacity
Sex     0 = Male, 1 = Female
Height  cm
Weight  kg
BMP     Body mass (percent of median of normal cases)
The regression equation is
PEmax = 102 + 1.36 FEV1 + 0.172 RV - 0.206 FRC + 0.275 TLC
        - 0.8 Sex - 0.373 Height + 2.15 Weight - 1.39 BMP
Predictor Coef SE Coef T P
Constant 101.7 172.3 0.59 0.563
FEV1 1.3626 0.9185 1.48 0.157
RV 0.1723 0.1860 0.93 0.368
FRC -0.2056 0.4410 -0.47 0.647
TLC 0.2751 0.4614 0.60 0.559
Sex -0.76 14.04 -0.05 0.958
Height -0.3731 0.8721 -0.43 0.675
Weight 2.152 1.195 1.80 0.091
BMP -1.3868 0.9135 -1.52 0.148
Multiple vs Simple Regression
Prediction
• SLR Model: Y predicted linearly by one X variable, apart from random error
$Y = a + bX + \text{error}$
• MLR Model: Y predicted by several X variables
$Y = a + b_1 X_1 + b_2 X_2 + \dots + \text{error}$
Multiple vs Simple Regression
Prediction
• SLR Model: $Y = a + bX + \text{error}$
Y changes by b when X increases by 1,
on average, or apart from error
• MLR Model: $Y = a + b_1 X_1 + b_2 X_2 + \dots + \text{error}$
Y changes by $b_1$ when $X_1$ increases by 1,
on average, or apart from error,
when other X variables do not change
Multiple Regression
Interpretation of coefficients
Y changes by $b_1$ when $X_1$ increases by 1,
on average, or apart from error,
when other X variables do not change
Research Issue: adjusting for sex and body size.
PEmax = 102 + 1.36 FEV1 + 0.172 RV - 0.206 FRC + 0.275 TLC - 0.8 Sex - 0.373 Height + 2.15 Weight - 1.39 BMP + error
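The "other X variables do not change" reading can be checked directly from the fitted equation. The sketch below uses the coefficients from the Minitab table for Example 3; the patient values are hypothetical, chosen only for illustration, and `predict_pemax` is a name introduced here.

```python
# Coefficients copied from the Example 3 Minitab output.
INTERCEPT = 101.7
COEF = {
    "FEV1": 1.3626, "RV": 0.1723, "FRC": -0.2056, "TLC": 0.2751,
    "Sex": -0.76, "Height": -0.3731, "Weight": 2.152, "BMP": -1.3868,
}


def predict_pemax(patient):
    """Fitted PEmax for a dict of explanatory variable values (error term omitted)."""
    return INTERCEPT + sum(COEF[name] * patient[name] for name in COEF)


# HYPOTHETICAL patient: these values are illustrative, not taken from the study.
patient = {"FEV1": 30, "RV": 250, "FRC": 150, "TLC": 110,
           "Sex": 0, "Height": 150, "Weight": 40, "BMP": 80}

# Same patient, 1 kg heavier, every other variable held fixed.
heavier = dict(patient, Weight=patient["Weight"] + 1)

change = predict_pemax(heavier) - predict_pemax(patient)
print(f"change in fitted PEmax for +1 kg, others fixed: {change:.3f}")
```

The change equals the Weight coefficient whatever values the other variables take, which is exactly the "adjusting for" interpretation. Whether Weight can realistically change while Height and BMP stay fixed is a separate question.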
Part 3: Introduction to SLR
Case study: Productivity in mail sorting
– Initial analysis
– Linear model
– Prediction formula (least squares)
– Significance testing
– Confidence intervals
Mail Processing Hours (Fiscal Years 1962-63)
             Fiscal Year 1962                   Fiscal Year 1963
Four-week    Pieces of mail   Manhours          Pieces of mail   Manhours
accounting   handled          used              handled          used
period       (in millions)    (in thousands)    (in millions)    (in thousands)
1            157              572               154              569
2            161              570               157              564
3            168              645               164              573
4            186              645               188              667
5            183              645               191              700
6            184              671               180              765
7            268              1053              270              1070
8            180              675               180              637
9            175              670               172              650
10           193              710               184              655
11           184              656               179              665
12           179              640               169              599
13           164              599               160              605
Exercise 1
Discuss the reasonableness of "man hours used" as a measure of cost of handling mail.
Suggest and discuss alternatives, where appropriate.
Initial analysis
[Figure: Line plots of Manhours and Volume against four-week periods]
[Figure: Line plots of Manhours and Volume against four-week periods, Christmas excluded]
[Figure: Scatter plot of Manhours against Volume, annotated "exceptions?", "exception?" and "stable system?"]
[Figure: Scatter plot of Manhours against Volume, with curve representing return to scale]
Exercise 2
On a rough graph of y against x, plot the points
(3,4), (0,3), (9,6),
and the line with equation
y = 3 + ⅓ x
Linear model
[Figure: Scatter plot of Manhours against Volume, reduced data set]
Simple linear regression model with Normal model for chance variation
[Figure: Scatter plot of Manhours against Volume, reduced data set, illustrating the model]
Y = α + βX + ε
The simple linear regression model
• Y = α + βX + ε
Y is the Response variable
X is the Explanatory variable
ε represents chance variation
• Model parameters:
α and β are the linear parameters
hidden parameter, standard deviation σ,
measures spread of Normal curve
The estimated model
Illustration
Prediction formula: $\hat{Y} = 50 + 3.3X \pm 40$
Use the prediction formula to estimate the loss incurred through equipment breakdown in Period 6, Fiscal 1963, when Y was 765 and X was 180
X = 180 implies $\hat{Y} = 50 + 3.3 \times 180 \pm 40$
= 644 ± 40
= 604 to 684
Y = 765 considerably exceeds these limits and so is not consistent with predicted behaviour of the system
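The arithmetic of this illustration can be sketched as follows, taking the rounded prediction formula Ŷ = 50 + 3.3X ± 40 from the slide. The function name `prediction_limits` is introduced here, and treating the loss as "actual minus predicted manhours" is an assumption about what the slide intends.

```python
def prediction_limits(volume, intercept=50.0, slope=3.3, margin=40.0):
    """Return (predicted, lower, upper) manhours for a given mail volume.

    Defaults are the rounded coefficients and the +/- 40 band from the slide.
    """
    predicted = intercept + slope * volume
    return predicted, predicted - margin, predicted + margin


actual = 765  # manhours (thousands) in the breakdown period
predicted, lower, upper = prediction_limits(180)

# ASSUMPTION: the loss is read as the excess of actual over predicted manhours.
excess = actual - predicted
inside_band = lower <= actual <= upper

print(f"predicted {predicted:.0f}, limits {lower:.0f} to {upper:.0f}")
print(f"actual {actual}, excess {excess:.0f}, inside band: {inside_band}")
```

On this reading the excess is itself uncertain by about the same ± 40 as the prediction.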
Exercise 3
Use the prediction formula to predict the extra manpower requirement during Christmas period, based on the experience of Period 7, Fiscal 1962,
when Y was 1,053 and X was 268
Compare with actual.
Comment.
Homework
Exercise 3 (continued)
Use the prediction formula to predict the extra manpower requirement during Christmas period, based on the experience of Period 7, Fiscal 1963,
when Y was 1,070 and X was 270.
Compare with actual.
Comment.
Choosing a prediction formula
Find values for $\alpha$ and $\beta$ that minimise the deviations
$Y_1 - \alpha - \beta X_1,\; Y_2 - \alpha - \beta X_2,\; Y_3 - \alpha - \beta X_3,\; \dots,\; Y_n - \alpha - \beta X_n$
[Figure: Trial regression lines, with "residuals"]
The method of least squares
Find values for $\alpha$ and $\beta$ that minimise the sum of the squared deviations:
$(Y_1 - \alpha - \beta X_1)^2 + (Y_2 - \alpha - \beta X_2)^2 + (Y_3 - \alpha - \beta X_3)^2 + \dots + (Y_n - \alpha - \beta X_n)^2$
[Figure: "Least squares" regression line, with "residuals"]
The method of least squares
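Solution:
$\hat{\beta} = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$
$\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}$
For these data,
$\hat{\beta} = 3.3$
$\hat{\alpha} = 50$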
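The least squares solution can be computed directly from the mail data. The sketch below assumes the reduced data set excludes Period 7 in both years (Christmas) and Period 6 of Fiscal 1963 (the period with 765 thousand manhours); this choice of exclusions is inferred from the slides, which report 23 cases used, and is not stated explicitly.

```python
# (pieces of mail in millions, manhours in thousands), periods 1..13 in order.
FY1962 = [(157, 572), (161, 570), (168, 645), (186, 645), (183, 645),
          (184, 671), (268, 1053), (180, 675), (175, 670), (193, 710),
          (184, 656), (179, 640), (164, 599)]
FY1963 = [(154, 569), (157, 564), (164, 573), (188, 667), (191, 700),
          (180, 765), (270, 1070), (180, 637), (172, 650), (184, 655),
          (179, 665), (169, 599), (160, 605)]

# ASSUMPTION: drop both Christmas periods (7) and the 765-manhour period
# (period 6 in the Fiscal 1963 column), leaving 23 cases.
reduced = ([p for period, p in enumerate(FY1962, start=1) if period != 7]
           + [p for period, p in enumerate(FY1963, start=1)
              if period not in (6, 7)])

n = len(reduced)
x_bar = sum(x for x, _ in reduced) / n
y_bar = sum(y for _, y in reduced) / n

# Slope: sum of cross-products over sum of squares of X deviations.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in reduced)
sxx = sum((x - x_bar) ** 2 for x, _ in reduced)
beta_hat = sxy / sxx

# Intercept: the fitted line passes through (x_bar, y_bar).
alpha_hat = y_bar - beta_hat * x_bar

print(f"n = {n}, alpha_hat = {alpha_hat:.2f}, beta_hat = {beta_hat:.4f}")
```

Rounded, these agree with the values 50 and 3.3 quoted for these data.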
Interpretation
$\hat{\beta}$ is the marginal change in Y for a unit change in X.
Check the measurement units!
$\hat{\alpha}$ is overheads.
[Figure: "Least squares" regression line, with non-linear extensions]
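Using the fitted line; prediction
Prediction equation: $\hat{Y} = \hat{\alpha} + \hat{\beta}X$
Prediction equation allowing for chance variation: $\hat{Y} = \hat{\alpha} + \hat{\beta}X \pm 2\hat{\sigma}$
Original model: $Y = \alpha + \beta X + \varepsilon$, with SD($\varepsilon$) = $\sigma$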
Estimating $\sigma$
$\sigma$ measures spread of deviations from the true line;
s measures spread of deviations from the fitted line
Define
fitted values: $\hat{Y}_i = \hat{\alpha} + \hat{\beta}X_i$
residuals: $e_i = Y_i - \hat{Y}_i$
$\hat{\sigma} = s = \sqrt{\dfrac{\sum e_i^2}{n - 2}}$
N.B. no e-bar; n − 2 instead of n (or n − 1)
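A minimal sketch of this calculation for the mail data follows. The 23 pairs below are the table values with three periods removed (both Period 7 entries and the 765-manhour period); that exclusion is an assumption inferred from the slides, not something they state.

```python
import math

# (pieces of mail in millions, manhours in thousands).
# ASSUMPTION: reduced data set of 23 cases, as described in the lead-in.
reduced = [
    (157, 572), (161, 570), (168, 645), (186, 645), (183, 645), (184, 671),
    (180, 675), (175, 670), (193, 710), (184, 656), (179, 640), (164, 599),
    (154, 569), (157, 564), (164, 573), (188, 667), (191, 700), (180, 637),
    (172, 650), (184, 655), (179, 665), (169, 599), (160, 605),
]

n = len(reduced)
x_bar = sum(x for x, _ in reduced) / n
y_bar = sum(y for _, y in reduced) / n
beta_hat = (sum((x - x_bar) * (y - y_bar) for x, y in reduced)
            / sum((x - x_bar) ** 2 for x, _ in reduced))
alpha_hat = y_bar - beta_hat * x_bar

# Residuals: observed minus fitted values.
residuals = [y - (alpha_hat + beta_hat * x) for x, y in reduced]

# Least squares residuals sum to zero, so no e-bar is subtracted;
# the divisor is n - 2 because two parameters were estimated.
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print(f"sum of residuals = {sum(residuals):.6f}, s = {s:.2f}")
```

Twice this value is close to the ± 40 band used with the rounded prediction formula.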
Minitab results
Regression Analysis: Manhours versus Volume
The regression equation is
Manhours = 50.4 + 3.35 Volume
23 cases used, 3 cases contain missing values
Predictor Coef SE Coef T P
Constant 50.44 59.46 0.85 0.406
Volume 3.3454 0.3401 9.84 0.000
S = 18.9300
Standard errors of estimated regression coefficients
• Regression coefficient estimate subject to chance variation
• Normal model applies
• Standard deviation of the Normal model is the standard error of the coefficient estimate
Parameter estimation
Confidence interval for marginal change
Recall confidence interval for $\mu$: $\bar{X} \pm 2\sigma/\sqrt{n}$
or $\hat{\mu} \pm 2\,SE(\hat{\mu})$
Confidence interval for $\beta$: $\hat{\beta} \pm 2\,SE(\hat{\beta})$
Exercise 4
Calculate a 95% confidence interval for $\beta$.
Calculate a 95% CI for change in manhours corresponding to a 10m. increase in pieces of mail handled.
Testing the statistical significance of the intercept
Formal test:
H0: $\alpha = 0$
Test statistic: $Z = \dfrac{\hat{\alpha} - 0}{SE(\hat{\alpha})} = \dfrac{\hat{\alpha}}{SE(\hat{\alpha})}$
Critical value: 2
Calculated value: 0.848
Comparison: Z < 2
Conclusion: Accept H0
Testing the statistical significance of the intercept
Informal test:
$\hat{\alpha} = 50.4394$ is less than its standard error, $SE(\hat{\alpha}) = 59.46$.
Equivalently, t, their ratio, (check it!) is less than 2.
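The "check it!" can be done in a couple of lines. The sketch below uses only the estimate and standard error quoted on the slide; the variable names and the "estimate ± 2 SE" interval for the intercept are added here for illustration, using the same rule of thumb the slides use for the slope.

```python
# Intercept estimate and its standard error, as quoted on the slide.
alpha_hat = 50.4394
se_alpha = 59.46

# Ratio of estimate to standard error (the T column in the Minitab output).
t_ratio = alpha_hat / se_alpha

# Rough 95% interval using the "estimate +/- 2 SE" rule of thumb.
ci_low = alpha_hat - 2 * se_alpha
ci_high = alpha_hat + 2 * se_alpha

print(f"t = {t_ratio:.3f}")
print(f"approximate 95% CI for alpha: {ci_low:.1f} to {ci_high:.1f}")
print("zero inside interval:", ci_low < 0 < ci_high)
```

The interval and the test carry the same message: the data are consistent with a zero intercept, although the interval is wide.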
Administrative applications
Process monitoring:
– compare latest manhours with prediction, given latest volume
plot point on scatter plot with "band"
(a "regression control chart")
Administrative applications
Budgeting
– based on historical data
or
– reflecting marketing efforts
or
– reflecting local knowledge
"variance" analysis
Administrative applications
Strategic, rather than operational, changes in procedures
e.g., productivity improvement,
monitored through regression control chart
Homework
In a study of a wholesaler's distribution costs, undertaken with a view to cost control, the volume of goods handled and the overall costs were recorded for one month in each of ten depots in a distribution network. The results are presented in the following table.
Depot                  1   2   3   4   5   6   7   8   9   10
Volume (£ thousands)   48  57  49  45  50  62  58  55  38  51
Costs (£ hundreds)     20  22  19  18  20  24  21  21  15  20
Homework
The simple linear regression of costs (Y) on volume (X) was calculated, and resulted in the following numerical summary.
Dependent variable is: Costs
No Selector
R squared = 93.1% R squared (adjusted) = 92.3%
s = 0.6676 with 10 - 2 = 8 degrees of freedom
Source      Sum of Squares  df  Mean Square  F-ratio
Regression  48.4344         1   48.4344      109
Residual    3.56555         8   0.445694
Variable    Coefficient  s.e. of Coeff  t-ratio  prob
Constant    2.98160      1.646          1.81     0.1077
Volume      0.331743     0.0318         10.4     ≤ 0.0001
Reading
SA, 1.5, 6.1, 6.3
EM, 6.1, 6.2.1, 6.2.2