Part 20: Aspects of Regression 20-1/26
Statistics and Data Analysis
Professor William Greene
Stern School of Business
IOMS Department
Department of Economics
Part 20: Aspects of Regression 20-2/26
Statistics and Data Analysis
Part 20 – Aspects of Regression
Part 20: Aspects of Regression 20-3/26
Regression Models
Using the regression model to predict the value of the dependent variable.
‘Cleaning’ the data to remove what look like extreme values.
  Trimming: removing values with extreme ‘x’
  Truncation: removing values with extreme ‘y’
Part 20: Aspects of Regression 20-4/26
Prediction
Use of the model for prediction
  Use “x” to predict y based on y = α + βx + ε
Sources of uncertainty
  Predicting “x” first
  Using sample estimates of α and β (and, possibly, σ)
  Can’t predict noise, ε
  Predicting outside the range of experience: uncertainty about the reach of the regression model.
Part 20: Aspects of Regression 20-5/26
Base Case Prediction
Predict y with a given value of x*: We would use the regression equation.
  True y = α + βx* + ε
  Since α and β must be estimated, the obvious estimate is ŷ = a + bx*
  We have no prediction for ε other than 0.
Sources of prediction error
  Can never predict ε at all
  The farther from the center of experience, the greater is the uncertainty.
Part 20: Aspects of Regression 20-6/26
A Prediction Interval
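Prediction includes a range of uncertainty
Point estimate: $\hat{y} = a + bx^*$
The range of uncertainty around the prediction:
$$a + bx^* \pm 1.96\sqrt{S_e^2\left[1 + \frac{1}{N} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{N}(x_i - \bar{x})^2}\right]}$$
  1.96: the usual 95%
  The 1 inside the brackets: due to ε
  The other two terms inside the brackets: due to estimating α and β with a and b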
(Remember the empirical rule, 95% of the distribution will be within two standard deviations.)
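As a rough illustration, the interval above can be computed directly from raw data. The sketch below uses only the standard library; the function name `prediction_interval`, the toy data, and the choice of N - 2 degrees of freedom for the residual variance are assumptions made for the example, not something the slides specify.

```python
import math

def prediction_interval(x, y, x_star, z=1.96):
    """Return (point, lower, upper) for a new observation at x_star.

    Hypothetical helper: fits y = a + b*x by least squares, then applies
    point +/- z * sqrt(S_e^2 * [1 + 1/N + (x* - xbar)^2 / Sxx]).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    # Residual variance, assuming the usual N - 2 degrees of freedom
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)
    half = z * math.sqrt(s2 * (1 + 1 / n + (x_star - x_bar) ** 2 / sxx))
    point = a + b * x_star
    return point, point - half, point + half

# Made-up toy data, purely for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

center = prediction_interval(x, y, 3.0)   # x* at the sample mean of x
far = prediction_interval(x, y, 10.0)     # x* well outside the data
print("at the center:", center)
print("far from it:  ", far)
```

The interval at x* = 10 comes out wider than the one at x* = 3, which is the effect of the (x* - x̄)² term.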
Part 20: Aspects of Regression 20-7/26
Slightly Simpler Formula for Prediction
Prediction includes a range of uncertainty
Point estimate: $\hat{y} = a + bx^*$
The range of uncertainty around the prediction:
$$a + bx^* \pm 1.96\sqrt{S_e^2\left(1 + \frac{1}{N}\right) + (x^* - \bar{x})^2\,[SE(b)]^2}$$
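The simpler form is the same interval rewritten. A short derivation, assuming the standard least squares expression for the slope's standard error, $[SE(b)]^2 = S_e^2 / \sum_{i=1}^{N}(x_i - \bar{x})^2$:

```latex
\begin{aligned}
S_e^2\left[1 + \frac{1}{N} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{N}(x_i - \bar{x})^2}\right]
&= S_e^2\left(1 + \frac{1}{N}\right)
 + (x^* - \bar{x})^2\,\frac{S_e^2}{\sum_{i=1}^{N}(x_i - \bar{x})^2} \\
&= S_e^2\left(1 + \frac{1}{N}\right) + (x^* - \bar{x})^2\,[SE(b)]^2
\end{aligned}
```

The practical advantage is that S_e, N, x̄ and SE(b) all appear in ordinary regression output, so the sum of squared deviations of x is not needed separately.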
Part 20: Aspects of Regression 20-8/26
Prediction from Internet Buzz Regression
Mean of Buzz = 0.48242
Max(Buzz)= 0.79
Part 20: Aspects of Regression 20-9/26
Prediction Interval for Buzz = .8
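Predict Box Office for Buzz = .8
a + bx = -14.36 + 72.72(.8) = 43.82
$$\sqrt{s_e^2\left(1 + \frac{1}{N}\right) + (.8 - \overline{\text{Buzz}})^2\,[SE(b)]^2} = \sqrt{13.3863^2\left(1 + \frac{1}{62}\right) + (.8 - .48242)^2\,(10.94)^2} = 13.93$$
Interval = 43.82 ± 1.96(13.93) = 16.52 to 71.12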
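The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `interval_from_summary` is made up for the example, and the inputs are the summary numbers quoted on the slide (a = -14.36, b = 72.72, s_e = 13.3863, N = 62, mean Buzz = 0.48242, SE(b) = 10.94).

```python
import math

def interval_from_summary(a, b, x_star, s_e, n, x_bar, se_b, z=1.96):
    """Prediction interval from regression summary statistics.

    Uses the 'slightly simpler' form:
    sqrt(s_e^2 * (1 + 1/N) + (x* - xbar)^2 * SE(b)^2).
    """
    point = a + b * x_star
    se_pred = math.sqrt(s_e ** 2 * (1 + 1 / n) + (x_star - x_bar) ** 2 * se_b ** 2)
    return point, se_pred, point - z * se_pred, point + z * se_pred

point, se_pred, lower, upper = interval_from_summary(
    a=-14.36, b=72.72, x_star=0.8,
    s_e=13.3863, n=62, x_bar=0.48242, se_b=10.94,
)
print(f"prediction = {point:.2f}, standard error = {se_pred:.2f}")
print(f"interval   = {lower:.2f} to {upper:.2f}")
```

Because the slide rounds the prediction and the standard error before forming the interval, the limits computed here can differ from the slide's in the second decimal place.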
Part 20: Aspects of Regression 20-10/26
Predicting Using a Loglinear Equation
Predict the log first
  Prediction of the log
  Prediction interval: (Lower to Upper)
Prediction = exp(lower) to exp(upper)
This produces very wide intervals.
Part 20: Aspects of Regression 20-11/26
Interval Estimates for the Sample of Signed Monet Paintings
[Figure: Fitted line plot showing the regression line and 95% prediction interval (PI). Vertical axis: ln (US$). Horizontal axis: ln (SurfaceArea).]
Fitted Line Plot: ln (US$) = 2.825 + 1.725 ln (SurfaceArea)
S = 1.00645   R-Sq = 20.0%   R-Sq(adj) = 19.8%

Regression Analysis: ln (US$) versus ln (SurfaceArea)
The regression equation is
ln (US$) = 2.83 + 1.72 ln (SurfaceArea)

Predictor          Coef     SE Coef   T      P
Constant           2.825    1.285     2.20   0.029
ln (SurfaceArea)   1.7246   0.1908    9.04   0.000

S = 1.00645   R-Sq = 20.0%   R-Sq(adj) = 19.8%
Mean of ln (SurfaceArea) = 6.72918
Part 20: Aspects of Regression 20-12/26
Prediction for An Out of Sample Monet
Claude Monet: Bridge Over a Pool of Water Lilies. 1899. Original, 36.5” x 29”.
ln Surface = ln(36.5 × 29) = 6.96461
Prediction = 2.83 + 1.72(6.96461) = 14.809
$$\text{Uncertainty} = \pm 1.96\sqrt{1.00645^2\left(1 + \frac{1}{328}\right) + (6.96461 - 6.72918)^2\,(.1908)^2}$$
$$= \pm 1.96\sqrt{1.012942(1.003049) + (.23543)^2\,(.1908)^2}$$
= ± 1.96(1.008984)
= ± 1.977608
Prediction Interval = 14.809 ± 1.977608
= 12.83139 to 16.786608
Part 20: Aspects of Regression 20-13/26
Predicting y when the Model Describes log y
The interval predicts log price. What about the price?
Predicted Price: Mean = Exp(a + bx) = Exp(14.809) = $2,700,641.78
Upper Limit = Exp(14.809 + 1.9776) = $19,513,166.53
Lower Limit = Exp(14.809 - 1.9776) = $373,771.53
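A small sketch of the conversion from the log scale back to dollars, using the numbers on this slide (predicted log price 14.809, half-width 1.9776). The variable names are chosen for the example only.

```python
import math

log_pred = 14.809      # predicted ln(price) from the slide
half_width = 1.9776    # 1.96 * standard error of prediction, from the slide

# Build the interval on the log scale, then exponentiate each endpoint
lower_log = log_pred - half_width
upper_log = log_pred + half_width

price = math.exp(log_pred)
lower = math.exp(lower_log)
upper = math.exp(upper_log)

print(f"predicted price: ${price:,.2f}")
print(f"interval:        ${lower:,.2f} to ${upper:,.2f}")

# Symmetric in logs means symmetric in ratios, not in dollars
print(f"upper / predicted = {upper / price:.3f}")
print(f"predicted / lower = {price / lower:.3f}")
```

The interval is symmetric around the prediction in logs, so in dollars the upper limit sits much farther from the prediction than the lower limit does. That asymmetry is why the dollar interval looks so wide.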
Part 20: Aspects of Regression 20-14/26
39.5 x 39.125. Prediction by our model = $17.903M
Painting is in our data set. Sold for 16.81M on 5/6/04. Sold for 7.729M on 2/5/01.
Last sale in our data set was in May 2004.
Record sale was 6/25/08: market peak, just before the crash.
Van Gogh: Irises
Part 20: Aspects of Regression 20-15/26
Uncertainty in Prediction
$$\pm 1.96\sqrt{s_e^2\left(1 + \frac{1}{N}\right) + (x^* - \bar{x})^2\,(SE(b))^2}$$
The interval is narrowest at x* = x̄, the center of our experience. The interval widens as we move away from the center of our experience to reflect the greater uncertainty.
(1) Uncertainty about the prediction of x
(2) Uncertainty that the linear relationship will continue to exist as we move farther from the center.
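To see the widening numerically, the sketch below evaluates the half-width at three values of x*, using the summary statistics reported for the Monet regression (S = 1.00645, N = 328, mean ln area = 6.72918, SE(b) = 0.1908). The function name `half_width` is an assumption for the example.

```python
import math

# Summary statistics quoted in the Monet regression output
S_E, N, X_BAR, SE_B = 1.00645, 328, 6.72918, 0.1908

def half_width(x_star, z=1.96):
    """Half-width of the prediction interval at x_star (log-area scale)."""
    return z * math.sqrt(S_E ** 2 * (1 + 1 / N) + (x_star - X_BAR) ** 2 * SE_B ** 2)

# The sample mean, a typical painting, and a very large painting
for x_star in (X_BAR, 6.96461, 9.484):
    print(f"x* = {x_star:.5f}   half-width = {half_width(x_star):.4f}")
```

The half-width is smallest at x* = x̄ and grows as x* moves away from it. The formula captures only the estimation uncertainty; it does not account for the possibility in point (2) that the linear relationship itself fails far from the data.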
Part 20: Aspects of Regression 20-16/26
http://www.nytimes.com/2006/05/16/arts/design/16oran.html
Part 20: Aspects of Regression 20-17/26
32.1” (2 feet 8 inches)
26.2” (2 feet 2.2”)
167” (13 feet 11 inches)
78.74” (6 feet 7 inches)
"Morning", Claude Monet 1920-1926, oil on canvas 200 x 425 cm, Musée de l'Orangerie, Paris France. Left panel
Part 20: Aspects of Regression 20-18/26
Predicted Price for a Huge Painting
Regression Equation: ln $ = 2.825 + 1.725 ln Surface Area
Width = 167 Inches
Height = 78.74 Inches
Area = 13,149.58 Square inches, ln = 9.484
Predicted ln Price = 2.825 + 1.725 (9.484) = 19.185
Predicted Price = exp(19.185) = $214,785,473.40
Part 20: Aspects of Regression 20-19/26
Prediction Interval for Price
Prediction Interval for ln Price is
$$\text{Predicted ln Price} \pm 1.96\sqrt{S_e^2\left(1 + \frac{1}{N}\right) + \left(\ln \text{Area}^* - \overline{\ln \text{Area}}\right)^2\,[SE(b)]^2}$$
ln Area* = ln (167 × 78.74) = 9.484
Mean of ln Area = 6.72918 (computed from the data)
S_e = 1.00645 (from regression results)
SE(b) = 0.1908
$$19.185 \pm 1.96\sqrt{(1.00645)^2\left(1 + \frac{1}{328}\right) + (9.484 - 6.72918)^2\,(.1908)^2}$$
= 19.185 ± 2.228 = [16.957 to 21.413]
Predicted Price = exp(16.957) to exp(21.413) = $23,138,304 to $1,993,185,600
Part 20: Aspects of Regression 20-20/26
Use the Monet Model to Predict a Price for a Dali?
Hallucinogenic Toreador: 118” (9 feet 10 inches) by 157” (13 feet 1 inch)
Average Sized Monet: 26.2” (2 feet 2.2”) by 32.1” (2 feet 8 inches)
Part 20: Aspects of Regression 20-21/26
Part 20: Aspects of Regression 20-22/26
Forecasting Out of Sample
[Figure: Fitted line plot showing the regression line and 95% prediction interval (PI). Vertical axis: G. Horizontal axis: Income.]
Fitted Line Plot: G = 1.928 + 0.000179 Income
S = 0.370241   R-Sq = 88.0%   R-Sq(adj) = 87.8%
Per Capita Gasoline Consumption vs. Per Capita Income, 1953-2004.
How to predict G for 2017? You would need first to predict Income for 2017.
How should we do that?
Regression Analysis: G versus Income
The regression equation is
G = 1.93 + 0.000179 Income

Predictor   Coef         SE Coef      T       P
Constant    1.9280       0.1651       11.68   0.000
Income      0.00017897   0.00000934   19.17   0.000

S = 0.370241   R-Sq = 88.0%   R-Sq(adj) = 87.8%
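A minimal sketch of the two-step forecast: supply a value for Income, then plug it into the fitted equation. The income figure of 30,000 is invented purely for illustration (the slides do not give a 2017 value), and the range 10,000 to 27,500 is only read off the plot's axis ticks, so both are assumptions.

```python
# Fitted equation from the regression output: G = 1.928 + 0.000179 * Income
A, B = 1.928, 0.000179

# Approximate span of Income in the sample, taken from the plot axis (assumption)
INCOME_RANGE = (10000.0, 27500.0)

def predict_g(income):
    """Return (predicted G, True if income lies outside the observed range)."""
    g_hat = A + B * income
    extrapolating = not (INCOME_RANGE[0] <= income <= INCOME_RANGE[1])
    return g_hat, extrapolating

# Hypothetical 2017 income; in practice this number is itself a forecast
income_2017 = 30000.0
g_hat, extrapolating = predict_g(income_2017)
print(f"predicted G = {g_hat:.3f}, outside sample range: {extrapolating}")
```

Any error in the income forecast passes straight through to the prediction of G, on top of the uncertainty in the regression itself.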
Part 20: Aspects of Regression 20-23/26
Data Trimming
[Figure, first panel: Fitted line plot for all sales. Vertical axis: ln (US$). Horizontal axis: ln (SurfaceArea).]
Fitted Line Plot: ln (US$) = 5.290 + 1.326 ln (SurfaceArea)
S = 1.10354   R-Sq = 33.4%   R-Sq(adj) = 33.2%
[Figure, second panel: Fitted line plot for the trimmed sample. Same axes.]
Fitted Line Plot: ln (US$) = 3.068 + 1.662 ln (SurfaceArea)
S = 1.09636   R-Sq = 17.8%   R-Sq(adj) = 17.6%
All 430 Sales: 5.290 + 1.326 log area
377 Sales of area 403.4 < area < 2981.0 (log > 6 and < 8): 3.068 + 1.662 log area
The sample is restricted to particular values of X: area between 403 and 2981. Trimming is generally benign, but the regression should be understood to apply to the specified range of x. The trimming is based on a variable not related to the underlying noise in Y.
Data > Subset Worksheet > Rows that match condition.
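The claim that trimming on x is generally benign can be illustrated with simulated data. Everything in this sketch is hypothetical: the true intercept, slope and noise level, the uniform distribution for x, and the helper `ols` are assumptions made for the example, not the Monet data.

```python
import random

def ols(x, y):
    """Least squares (intercept, slope) for a single regressor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b

rng = random.Random(0)                    # fixed seed for reproducibility
TRUE_A, TRUE_B, SIGMA = 5.0, 1.3, 1.1     # made-up "true" model
x = [rng.uniform(3.0, 9.0) for _ in range(5000)]
y = [TRUE_A + TRUE_B * xi + rng.gauss(0.0, SIGMA) for xi in x]

a_full, b_full = ols(x, y)

# Trim on x only: keep observations with 6 < x < 8
kept = [(xi, yi) for xi, yi in zip(x, y) if 6.0 < xi < 8.0]
a_trim, b_trim = ols([p[0] for p in kept], [p[1] for p in kept])

print(f"full sample:    n = {len(x)}, slope = {b_full:.3f}")
print(f"trimmed sample: n = {len(kept)}, slope = {b_trim:.3f}")
```

When the linear model really holds over the whole range, the trimmed slope stays close to the true slope and only becomes noisier, because the selection rule does not depend on ε. A larger shift in real data, as between the two Monet regressions, may suggest the relationship is not exactly linear across the full range.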
Part 20: Aspects of Regression 20-24/26
Truncation
[Figure, first panel: Fitted line plot for the subsample with 500,000 < Price < 3,000,000. Vertical axis: ln (US$). Horizontal axis: ln (SurfaceArea).]
Fitted Line Plot: ln (US$) = 11.44 + 0.3821 ln (SurfaceArea)
S = 0.487426   R-Sq = 5.9%   R-Sq(adj) = 5.4%
[Figure, second panel: Fitted line plot for the entire sample. Same axes.]
Fitted Line Plot: ln (US$) = 5.290 + 1.326 ln (SurfaceArea)
S = 1.10354   R-Sq = 33.4%   R-Sq(adj) = 33.2%
Entire Sample: 5.290 + 1.326 log Area
Subsample, 500,000 < Price < 3,000,000: 11.44 + 0.3821 log Area
Truncation based on the values of the dependent variable is VERY BAD. It reduces and sometimes destroys the relationship. This is one reason we resist removing “outliers” from the sample.
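A simulated illustration of why truncating on y is harmful. As with any simulation, the setup is an assumption: the true model, the noise level, and the price band (500,000 to 3,000,000, expressed in logs) are chosen to echo the slide, but the data are artificial, not the Monet sales.

```python
import math
import random

def ols(x, y):
    """Least squares (intercept, slope) for a single regressor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b

rng = random.Random(0)                    # fixed seed for reproducibility
TRUE_A, TRUE_B, SIGMA = 5.0, 1.3, 1.1     # made-up "true" model in logs
x = [rng.uniform(3.0, 9.0) for _ in range(5000)]
y = [TRUE_A + TRUE_B * xi + rng.gauss(0.0, SIGMA) for xi in x]

a_full, b_full = ols(x, y)

# Truncate on y: keep only log prices between ln(500,000) and ln(3,000,000)
y_low, y_high = math.log(500_000), math.log(3_000_000)
kept = [(xi, yi) for xi, yi in zip(x, y) if y_low < yi < y_high]
a_trunc, b_trunc = ols([p[0] for p in kept], [p[1] for p in kept])

print(f"full sample:      n = {len(x)}, slope = {b_full:.3f}")
print(f"truncated sample: n = {len(kept)}, slope = {b_trunc:.3f}")
```

Selecting on y keeps small paintings only when their noise term is large and positive, and large paintings only when it is large and negative. That makes ε correlated with x in the retained sample, which flattens the fitted slope.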
Part 20: Aspects of Regression 20-25/26
Where Have We Been?
  Sample data: describing, display
  Probability models
    Models for random experiments
    Models for random processes underlying sample data
  Random variables
  Models for covariation of random variables
  Linear regression model for covariation of a pair of variables
Part 20: Aspects of Regression 20-26/26
Where Do We Go From Here?
  Simple linear regression
    Thus far, mostly a descriptive device
    Use for prediction and forecasting
    Yet to consider: Statistical inference, testing the relationship
  Multiple linear regression
    More than one variable to explain the variation of Y
    More elaborate model building