Statistics lecture 11 (chapter 11)
Uploaded by jillmitchell8778
1
2
• Analyze the relationship between two quantitative variables
• Correlation determines the strength and direction of the relationship between the variables
• Regression determines a mathematical equation that describes the relationship
• The equation can be used for prediction
3
• Regression Analysis
– X → independent variable
– Y → dependent variable
– The independent variable influences the dependent variable
– The sample consists of n pairs of observations
– Ascertain whether a relation exists
– Examine the nature of the relation
– Obtain an equation that relates Y to X
– Evaluate the magnitude of the change in one variable due to a change in the other
– Predict the value of Y for different values of X
4
• Regression Analysis – scatter plot
– Effective way to display the relationship
– X variable on horizontal axis
– Y variable on vertical axis
– Plot a dot for each pair of observations
– Can determine the
• Form: linear or nonlinear
• Direction: positive or negative
• Strength: dots scattered closely together indicate a strong relation; a large scatter indicates a weak relation
5
• Regression Analysis – scatter plot
– Example
– Two variables
• Cost of producing units
• Number of units produced
– Cost depends on the number of units

Number of units (x)    Cost per unit (y), R
10                     10,00
20                      8,80
30                      7,90
50                      6,20
60                      5,00
80                      4,00
100                     3,50
120                     2,00
[Figure: Relation between units produced and cost of production. Horizontal axis: Number of units; vertical axis: Cost per unit (R)]
From the graph there appears to be a negative relation between the number of units and the cost per unit: the more units produced, the lower the cost per unit
6
• Simple linear regression analysis
– Which line fits the data best?
[Figure: Relation between units produced and cost of production. Horizontal axis: Number of units; vertical axis: Cost per unit (R)]
7
• Simple linear regression analysis
– Which line fits the data best?
– Method of least squares
– y = a + b x
• b → slope
• a → y intercept
– $\sum e_i = 0$
– $\sum e_i^2$ measures the size of the set of errors
– Least squares method
• The sum of the squared errors is the smallest
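As a rough illustration of the least squares idea, the sketch below compares the sum of squared errors of the fitted line with that of a few other lines, using the units and cost data from this lecture. The helper `sse` and the three candidate lines are made up for this illustration; the fitted coefficients use the slope and intercept formulas from the slides (b = S_xy / S_xx, a = ȳ − b·x̄), without rounding.

```python
# Units produced (x) and cost per unit in rand (y) from the lecture example.
x = [10, 20, 30, 50, 60, 80, 100, 120]
y = [10.00, 8.80, 7.90, 6.20, 5.00, 4.00, 3.50, 2.00]
n = len(x)


def sse(a, b):
    """Sum of squared errors for the line y-hat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Least squares coefficients (unrounded).
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
b_ls = s_xy / s_xx
a_ls = y_bar - b_ls * x_bar

# The errors of the least squares line sum to zero (up to floating-point noise).
errors = [yi - (a_ls + b_ls * xi) for xi, yi in zip(x, y)]
error_sum = sum(errors)

# Hypothetical alternative lines, chosen only for comparison.
candidates = [(10.0, -0.07), (9.5, -0.06), (11.0, -0.08)]
best_sse = sse(a_ls, b_ls)
for a, b in candidates:
    print(f"a={a}, b={b}: SSE={sse(a, b):.4f}")
print(f"least squares: SSE={best_sse:.4f}, sum of errors={error_sum:.2e}")
```

Every candidate line should report a larger sum of squared errors than the least squares line, which is the sense in which that line "fits best".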
8
• Least squares regression model
– Population regression model
• Y = α + βx + ε
• ε random error
– Sample regression model
• ŷ = a + b x
• b → change in y due to change in x
• a → value of y when x = 0
9
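• Least squares regression model
– ŷ = a + b x

Number of units (x)    Cost per unit (y), R
10                     10,00
20                      8,80
30                      7,90
50                      6,20
60                      5,00
80                      4,00
100                     3,50
120                     2,00

∑x = 470    ∑y = 47,4
∑x² = 38300    ∑y² = 335,54
∑xy = 2033

$b = \dfrac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$, where
$S_{xx} = \sum x^2 - \frac{1}{n}\left(\sum x\right)^2$
$S_{yy} = \sum y^2 - \frac{1}{n}\left(\sum y\right)^2$
$S_{xy} = \sum xy - \frac{1}{n}\left(\sum x\right)\left(\sum y\right)$

$\bar{x} = 58{,}75$, $\bar{y} = 5{,}925$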
Number of units (x)    Cost per unit (y), R
10                     10,00
20                      8,80
30                      7,90
50                      6,20
60                      5,00
80                      4,00
100                     3,50
120                     2,00

∑x = ?    ∑y = ?
∑x² = ?    ∑y² = ?
∑xy = ?
10
• Least squares regression model
– ŷ = a + b x

$b = \dfrac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$, where
$S_{xx} = \sum x^2 - \frac{1}{n}\left(\sum x\right)^2$
$S_{yy} = \sum y^2 - \frac{1}{n}\left(\sum y\right)^2$
$S_{xy} = \sum xy - \frac{1}{n}\left(\sum x\right)\left(\sum y\right)$

Calculate Sxx, Syy, Sxy

∑x = 470    ∑y = 47,4
∑x² = 38300    ∑y² = 335,54
∑xy = 2033
11
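• Least squares regression model
– ŷ = a + b x

$\bar{x} = 58{,}75$, $\bar{y} = 5{,}925$

$S_{xx} = 38300 - \frac{1}{8}(470)^2 = 10687{,}5$
$S_{yy} = 335{,}54 - \frac{1}{8}(47{,}4)^2 = 54{,}695$
$S_{xy} = 2033 - \frac{1}{8}(470)(47{,}4) = -751{,}75$

$b = \dfrac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$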
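The sums and the three S quantities can be checked with a few lines of Python. This is a minimal sketch using only the data tabulated in the slides; the variable names are chosen for this illustration.

```python
# Units produced (x) and cost per unit in rand (y), as tabulated in the slides.
x = [10, 20, 30, 50, 60, 80, 100, 120]
y = [10.00, 8.80, 7.90, 6.20, 5.00, 4.00, 3.50, 2.00]
n = len(x)

# Raw sums.
sum_x = sum(x)
sum_y = sum(y)
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Corrected sums of squares and cross-products.
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

print(f"sum x = {sum_x}, sum y = {sum_y:.1f}")
print(f"sum x^2 = {sum_x2}, sum y^2 = {sum_y2:.2f}, sum xy = {sum_xy:.0f}")
print(f"S_xx = {s_xx:.1f}, S_yy = {s_yy:.3f}, S_xy = {s_xy:.2f}")
```

Note that S_xy comes out negative, which matches the downward trend seen in the scatter plot.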
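• Least squares regression model

$S_{xx} = 10687{,}5$, $S_{yy} = 54{,}695$, $S_{xy} = -751{,}75$
$\bar{x} = 58{,}75$, $\bar{y} = 5{,}925$

$b = \dfrac{S_{xy}}{S_{xx}} = \dfrac{-751{,}75}{10687{,}5} = -0{,}07$
$a = \bar{y} - b\bar{x} = 5{,}925 - (-0{,}07)(58{,}75) = 10{,}0375$

→ ŷ = 10,0375 – 0,07x

Note: Syy is not used here, but we will use it later!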
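A short sketch of the slope and intercept calculation follows. It takes S_xx, S_xy and the two means from the slides as given, and shows both the unrounded coefficients and the version with b rounded to two decimals, which is what the slides carry forward. The variable names are illustrative only.

```python
# Values taken from the slides (decimal commas written as points).
s_xx = 10687.5
s_xy = -751.75
x_bar = 58.75
y_bar = 5.925

# Slope: unrounded, and rounded to two decimals as in the slides.
b_exact = s_xy / s_xx
b_slide = round(b_exact, 2)

# Intercept from a = y_bar - b * x_bar, for both versions of b.
a_exact = y_bar - b_exact * x_bar
a_slide = y_bar - b_slide * x_bar

print(f"unrounded: y-hat = {a_exact:.4f} + ({b_exact:.4f})x")
print(f"slides:    y-hat = {a_slide:.4f} + ({b_slide:.2f})x")
```

Rounding b before computing a shifts the intercept slightly (about 10,06 unrounded against 10,0375), so results computed later from the rounded line can differ a little from software output.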
13
• Least squares regression
model
– ŷ = a + b x
– ŷ = 10,0375 – 0,07x
[Figure: three sketches of y against x. b > 0: positive linear relation; b < 0: negative linear relation; b = 0: no relation]
14
• Plot least squares regression model
– ŷ = 10,04 – 0,07x
[Figure: Relation between units produced and cost of production, with the regression line. Horizontal axis: Number of units; vertical axis: Cost per unit (R)]
If x = 30: ŷ = 10,04 – 0,07(30) = 7,94
If x = 90: ŷ = 10,04 – 0,07(90) = 3,74
EXAMPLE
A car manufacturing business wants to find out how the price of its car models depreciates with age. The business took a sample of 8 models and collected the following information on age (years) and price (R1000):

Age (years)      8    3    6    9    2    5    6    3
Price (R1000)   16   74   38   19  102   36   33   69

Find the equation of the regression line with price as the dependent variable and age as the independent variable.
15
Example answer
Example 11.4, textbook, part 2, page 383
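For readers without the textbook at hand, the sketch below applies the same S_xy / S_xx formulas to the age and price data. The function name `fit_line` is made up for this illustration, and the output is simply what the formulas give for this sample; it is not copied from the textbook answer.

```python
def fit_line(x, y):
    """Return (a, b) of the least squares line y-hat = a + b*x."""
    n = len(x)
    s_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    b = s_xy / s_xx
    a = sum(y) / n - b * sum(x) / n
    return a, b


age = [8, 3, 6, 9, 2, 5, 6, 3]              # years
price = [16, 74, 38, 19, 102, 36, 33, 69]   # R1000

a, b = fit_line(age, price)
print(f"price-hat = {a:.2f} + ({b:.2f}) * age")
```

The slope is negative, as expected for a price that falls with age.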
16
PREDICTIONS IN REGRESSION ANALYSIS
• A sample regression line is usually obtained for the purpose of prediction
• That is, to estimate the value of Y corresponding to a selected value of x
• Two ways to estimate y:
– Point estimate
– Confidence interval
17
18
• Prediction with regression model
– Point estimate using ŷ = 10,04 – 0,07x
– What will the estimated cost be if 60 units are produced?
– ŷ = 10,04 – 0,07(60) = R5,84
– What will the estimated cost be if 25 units are produced?
– ŷ = 10,04 – 0,07(25) = R8,29
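The point estimates can be reproduced with a one-line function. This is a minimal sketch using the rounded line ŷ = 10,04 – 0,07x from the slides; `predict_cost` is a name chosen for this illustration.

```python
A = 10.04   # intercept, rounded as in the slides
B = -0.07   # slope, rounded as in the slides


def predict_cost(units):
    """Point estimate of the cost per unit (in rand) for a given number of units."""
    return A + B * units


for units in (25, 30, 60, 90):
    print(f"x = {units}: y-hat = R{predict_cost(units):.2f}")
```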
ERRORS
• Once the regression line has been estimated, every observed value has a corresponding predicted value
• The predicted values all fall exactly on the regression line
• Not all observed values fall on the regression line
• The difference between an observed value and its predicted value is known as an ERROR and is denoted by $e_i$
19
ERRORS
• Since the observed values deviate from the predicted values, the regression equation is not a perfect predictor
• We need to assess how accurately the regression line predicts the values; this is done by analysing the errors $e_i$
• The standard deviation of the errors measures how widely the observed values are spread around the regression line
• The smaller the standard deviation, the closer the points cluster around the line
20
21
• Standard deviation of random errors
– ŷ = 10,04 – 0,07x
– ei indicate how the observed and expected values differ
– Standard deviation of errors measures spread around the line
• Smaller - points closer to line
x = 10: ŷ = 10,04 – 0,07(10) = 9,34
x = 20: ŷ = 10,04 – 0,07(20) = 8,64

Number of units (x)   Cost per unit (y)   Predicted cost per unit (ŷ)   Difference e_i = y_i – ŷ_i
10                    10,00               9,34                           0,66
20                     8,80               8,64                           0,16
30                     7,90               7,94                          -0,04
50                     6,20               6,54                          -0,34
60                     5,00               5,84                          -0,84
80                     4,00               4,44                          -0,44
100                    3,50               3,04                           0,46
120                    2,00               1,64                           0,36
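The table of differences can be regenerated as follows. This is a sketch that assumes the rounded line ŷ = 10,04 – 0,07x, as the slides do; the variable names are illustrative.

```python
# Units produced (x) and observed cost per unit (y) from the slides.
x = [10, 20, 30, 50, 60, 80, 100, 120]
y = [10.00, 8.80, 7.90, 6.20, 5.00, 4.00, 3.50, 2.00]
A, B = 10.04, -0.07  # rounded coefficients from the slides

# Predicted value on the line and error e_i = y_i - y-hat_i for each observation.
predicted = [A + B * xi for xi in x]
errors = [yi - yh for yi, yh in zip(y, predicted)]

for xi, yi, yh, ei in zip(x, y, predicted, errors):
    print(f"{xi:>4} {yi:>6.2f} {yh:>6.2f} {ei:>6.2f}")
print(f"sum of errors = {sum(errors):.2f}")
```

Because the coefficients are rounded, these differences add up to about -0,02 rather than exactly 0.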
22
• Standard deviation of random errors
– Small
– Values close to line
$s_e = \sqrt{\dfrac{S_{yy} - bS_{xy}}{n-2}} = \sqrt{\dfrac{54{,}695 - (-0{,}07)(-751{,}75)}{8-2}} = 0{,}588$
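The standard deviation of the errors can be checked numerically. The sketch below uses the S values from the slides and evaluates the formula s_e = sqrt((S_yy − b·S_xy) / (n − 2)) twice: once with the slope rounded to −0,07 as in the slides, and once with the unrounded slope S_xy / S_xx. The function name `std_error` is an assumption made for this illustration.

```python
import math

# Values taken from the slides (decimal commas written as points).
n = 8
s_yy = 54.695
s_xy = -751.75
s_xx = 10687.5


def std_error(b):
    """Standard deviation of the errors: sqrt((S_yy - b*S_xy) / (n - 2))."""
    return math.sqrt((s_yy - b * s_xy) / (n - 2))


s_e_slide = std_error(-0.07)         # slope rounded as in the slides
s_e_exact = std_error(s_xy / s_xx)   # unrounded slope

print(f"s_e with b = -0.07:     {s_e_slide:.3f}")
print(f"s_e with unrounded b:   {s_e_exact:.3f}")
```

The formula is sensitive to rounding in b because it subtracts two numbers of similar size; with the unrounded slope the result is closer to 0,55 than to the 0,588 used in the slides.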
CONFIDENCE INTERVAL FOR PREDICTION
• Different samples from the same population will
give different point estimates
• Likely that different samples from same
population will give different estimated
regression lines
• Therefore need to construct a confidence
interval for Y based on one sample that will give
a more reliable estimate of Y
• Generally called a PREDICTION INTERVAL
23
24
• Confidence interval for prediction
– Point estimate for 60 units
• ŷ = 10,04 – 0,07(60)=R5,84
– Rather calculate a confidence interval for the
mean value of y for a given x value
– Use the t-distribution
– Confidence interval for the mean of y, given x = x0
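$\mathrm{CONF}(\mu_{y|x_0})_{1-\alpha} = a + bx_0 \pm t_{n-2;\,1-\alpha/2}\; s_{\hat{y}|x_0}$
where $s_{\hat{y}|x_0} = s_e\sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}}}$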
25
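• Confidence interval for prediction

$\mathrm{CONF}(\mu_{y|x_0})_{1-\alpha} = a + bx_0 \pm t_{n-2;\,1-\alpha/2}\; s_{\hat{y}|x_0}$
where $s_{\hat{y}|x_0} = s_e\sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}}} = 0{,}588\sqrt{\dfrac{1}{8} + \dfrac{(60 - 58{,}75)^2}{10687{,}5}} = 0{,}2080$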
26
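• Confidence interval for prediction
– 95% confidence interval if x = 60

$\mathrm{CONF}(\mu_{y|x_0})_{1-\alpha} = a + bx_0 \pm t_{n-2;\,1-\alpha/2}\; s_{\hat{y}|x_0}$
$= 10{,}04 - 0{,}07(60) \pm t_{8-2;\,1-0{,}025}\,(0{,}2080)$
$= 5{,}84 \pm 2{,}447(0{,}2080)$
$= 5{,}84 \pm 0{,}508976$
$= (5{,}33\,;\,6{,}35)$

– We are 95% sure that the mean cost for 60 units will be between R5,33 and R6,35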
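The interval calculation can be sketched in Python as below. All inputs are the rounded values from the slides, and the critical value 2,447 is taken from the slides rather than computed; the function name `mean_ci` and the constant names are assumptions made for this illustration.

```python
import math

# Rounded values from the slides (decimal commas written as points).
A, B = 10.04, -0.07   # regression line y-hat = A + B*x
S_E = 0.588           # standard deviation of the errors
N = 8                 # number of observations
X_BAR = 58.75         # mean of x
S_XX = 10687.5
T_CRIT = 2.447        # t value for n - 2 = 6 degrees of freedom at 95%, from the slides


def mean_ci(x0):
    """Return (lower, upper, s) for the confidence interval of the mean of y at x0."""
    y_hat = A + B * x0
    s = S_E * math.sqrt(1 / N + (x0 - X_BAR) ** 2 / S_XX)
    margin = T_CRIT * s
    return y_hat - margin, y_hat + margin, s


lower, upper, s_pred = mean_ci(60)
print(f"s = {s_pred:.4f}, interval = ({lower:.2f} ; {upper:.2f})")
```

Calling `mean_ci` with an x0 further from the mean of x gives a wider interval, since the (x0 − x̄)² term grows.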
27
• Inferences about β (population slope)
– b is the point estimate of β
– The t-distribution is used to make inferences about β
– Confidence interval for β
$\mathrm{CONF}(\beta)_{1-\alpha} = b \pm t_{n-2;\,1-\alpha/2}\; s_b$ where $s_b = \dfrac{s_e}{\sqrt{S_{xx}}}$
– If the confidence interval includes 0: no linear relation
– If the confidence interval does not include 0: there might be a linear relation
28
• Inferences about β (population slope)
– Confidence interval for β
$\mathrm{CONF}(\beta)_{1-\alpha} = b \pm t_{n-2;\,1-\alpha/2}\; s_b$ where $s_b = \dfrac{s_e}{\sqrt{S_{xx}}} = \dfrac{0{,}588}{\sqrt{10687{,}5}} = 0{,}00569$
29
• Inferences about β (population slope)
– Confidence interval for β
$\mathrm{CONF}(\beta)_{1-\alpha} = b \pm t_{n-2;\,1-\alpha/2}\; s_b = -0{,}07 \pm 2{,}447(0{,}00569) = (-0{,}0839\,;\,-0{,}0561)$
– We are 95% sure that the population slope will be between -0,0839 and -0,0561
– The interval does not include 0
– There might be a linear relation
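A brief numerical check of the interval for β follows. It is a sketch that uses the rounded slide values for b and s_e and the tabulated t value 2,447 from the slides; the variable names are illustrative.

```python
import math

# Rounded values from the slides (decimal commas written as points).
b = -0.07
s_e = 0.588
s_xx = 10687.5
t_crit = 2.447  # t value for 6 degrees of freedom at 95%, from the slides

# Standard error of the slope and the resulting confidence interval.
s_b = s_e / math.sqrt(s_xx)
lower = b - t_crit * s_b
upper = b + t_crit * s_b
contains_zero = lower <= 0 <= upper

print(f"s_b = {s_b:.5f}")
print(f"interval for beta = ({lower:.4f} ; {upper:.4f}), contains 0: {contains_zero}")
```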
30
• Inferences about β (population slope)
– Hypothesis test concerning β

Testing H0: β = 0 for n < 30

Alternative hypothesis    Decision rule: reject H0 if
H1: β ≠ 0                 $|t| \ge t_{n-2;\,1-\alpha/2}$
H1: β > 0                 $t \ge t_{n-2;\,1-\alpha}$
H1: β < 0                 $t \le -t_{n-2;\,1-\alpha}$

Test statistic: $t = \dfrac{b}{s_b}$ with $s_b = \dfrac{s_e}{\sqrt{S_{xx}}}$
31
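• Solution
– H0 : β = 0
– H1 : β ≠ 0
– α = 0,05
– $s_b = \dfrac{s_e}{\sqrt{S_{xx}}} = \dfrac{0{,}588}{\sqrt{10687{,}5}} = 0{,}00569$ and $t = \dfrac{b}{s_b} = \dfrac{-0{,}07}{0{,}00569} = -12{,}30$
– Reject H0, since |t| = 12,30 ≥ 2,447
At α = 0,05 the slope is not zero: there is a linear relation between number of units and cost per unit
[Figure: t-distribution with critical values -2,447 and +2,447. Reject H0 in both tails; accept H0 between the critical values]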
If H1 : β > 0 - test for positive slope
If H1 : β < 0 - test for negative slope
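The decision rules for the slope test can be written as a small function. This is a sketch under the slides' rounded values (b = −0,07, s_b = 0,00569) and their two-sided critical value 2,447; the function `reject_h0` and its `alternative` labels are hypothetical names for this illustration. For a one-sided test the caller would have to supply the one-sided critical value from a t table.

```python
def reject_h0(t, t_crit, alternative="two-sided"):
    """Apply the decision rule for H0: beta = 0 given a t statistic and critical value."""
    if alternative == "two-sided":   # H1: beta != 0
        return abs(t) >= t_crit
    if alternative == "greater":     # H1: beta > 0
        return t >= t_crit
    if alternative == "less":        # H1: beta < 0
        return t <= -t_crit
    raise ValueError(f"unknown alternative: {alternative}")


b = -0.07        # rounded slope from the slides
s_b = 0.00569    # standard error of the slope from the slides
t_stat = b / s_b

# Two-sided test at alpha = 0.05 with the critical value from the slides.
decision = reject_h0(t_stat, 2.447)
print(f"t = {t_stat:.2f}, reject H0: {decision}")
```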
32
• Correlation Analysis
– Strength of linear relationship
– Direction of linear relationship
• Positive
• Negative
– Population correlation coefficient ρ (rho)
– Sample correlation coefficient r
– r always between -1 and +1
• r = 1 perfect positive
• r = -1 perfect negative
• r = 0 no relationship
• near 0 weak relationship
• near -1 or +1 strong relationship
33
Coefficient of correlation
• The coefficient of correlation is used to measure
the strength of association between two
variables.
• The coefficient values range between -1 and 1.
– If r = -1 (negative association) or r = +1
(positive association) every point falls on the
regression line.
– If r = 0 there is no linear pattern.
• The coefficient can be used to test for linear
relationship between two variables.
34
[Figure: six scatter plots of Y against X. Perfect positive (r = +1); high positive (r = +0,9); low positive (r = +0,3); perfect negative (r = -1); high negative (r = -0,8); no correlation (r = 0)]
35
• Correlation coefficient r

∑x = 470    ∑y = 47,4
∑x² = 38300    ∑y² = 335,54
∑xy = 2033
$\bar{x} = 58{,}75$, $\bar{y} = 5{,}925$

$S_{xx} = 38300 - \frac{1}{8}(470)^2 = 10687{,}5$
$S_{yy} = 335{,}54 - \frac{1}{8}(47{,}4)^2 = 54{,}695$
$S_{xy} = 2033 - \frac{1}{8}(470)(47{,}4) = -751{,}75$

$r = \dfrac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \dfrac{-751{,}75}{\sqrt{10687{,}5\,(54{,}695)}} = -0{,}98$

– Strong negative relationship
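The correlation coefficient follows directly from the three S values. The snippet below is a minimal sketch using the values from the slides; the variable names are illustrative.

```python
import math

# Values taken from the slides (decimal commas written as points).
s_xx = 10687.5
s_yy = 54.695
s_xy = -751.75

# Sample correlation coefficient r = S_xy / sqrt(S_xx * S_yy).
r = s_xy / math.sqrt(s_xx * s_yy)
print(f"r = {r:.4f}, rounded: {round(r, 2)}")
```

The sign of r always matches the sign of S_xy, and therefore the sign of the slope b.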
36
• Coefficient of determination r²
– Measures the proportion of changes in the dependent variable y that can be explained by the independent variable x
– % of total variation in y that is explained by the regression model

$r^2 = (-0{,}98)^2 = 0{,}9604 = 96{,}04\%$

– 96% of the variation in the cost per unit is explained by the variation in the number of units produced
– 4% is unexplained
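To see why r² can be read as "explained variation", the sketch below computes it two ways from the lecture data: as S_xy² / (S_xx·S_yy), and as 1 minus the ratio of the sum of squared errors to S_yy. Both use unrounded coefficients, so they differ slightly from the slides' figure, which squares the rounded r = −0,98. The variable names are chosen for this illustration.

```python
# Units produced (x) and cost per unit in rand (y) from the slides.
x = [10, 20, 30, 50, 60, 80, 100, 120]
y = [10.00, 8.80, 7.90, 6.20, 5.00, 4.00, 3.50, 2.00]
n = len(x)

s_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
s_yy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

# Unrounded least squares line and its sum of squared errors.
b = s_xy / s_xx
a = sum(y) / n - b * sum(x) / n
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r2_exact = s_xy ** 2 / (s_xx * s_yy)   # square of the unrounded r
r2_from_sse = 1 - sse / s_yy           # share of variation in y not left in the errors
r2_slide = (-0.98) ** 2                # the slides' value, from r rounded to -0.98

print(f"r^2 (unrounded) = {r2_exact:.4f}")
print(f"1 - SSE/S_yy    = {r2_from_sse:.4f}")
print(f"r^2 (slides)    = {r2_slide:.4f}")
```

The two unrounded versions agree (about 0,967), while the slides' 96,04% reflects the rounding of r.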
37
• Hypothesis test concerning the correlation coefficient ρ

Testing H0: ρ = 0 for n < 30

Alternative hypothesis    Decision rule: reject H0 if
H1: ρ ≠ 0                 $|t| \ge t_{n-2;\,1-\alpha/2}$

Test statistic: $t = \dfrac{r}{\sqrt{\dfrac{1-r^2}{n-2}}}$
38
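• Solution
– H0 : ρ = 0
– H1 : ρ ≠ 0
– α = 0,05
– $t = \dfrac{r}{\sqrt{\dfrac{1-r^2}{n-2}}} = \dfrac{-0{,}98}{\sqrt{\dfrac{1-(-0{,}98)^2}{8-2}}} = -12{,}06$
– Reject H0, since |t| = 12,06 ≥ 2,447
At α = 0,05 the correlation coefficient is not zero: there is a linear relation between number of units and cost per unit
[Figure: t-distribution with critical values -2,447 and +2,447. Reject H0 in both tails; accept H0 between the critical values]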
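The test statistic for ρ can be checked with the short sketch below, which uses the slides' rounded r = −0,98, n = 8 and the two-sided critical value 2,447. The function name `t_for_correlation` is made up for this illustration.

```python
import math


def t_for_correlation(r, n):
    """Test statistic t = r / sqrt((1 - r^2) / (n - 2)) for H0: rho = 0."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


r = -0.98        # rounded sample correlation from the slides
n = 8            # number of observations
t_crit = 2.447   # two-sided critical value from the slides (6 degrees of freedom)

t_stat = t_for_correlation(r, n)
reject = abs(t_stat) >= t_crit
print(f"t = {t_stat:.2f}, reject H0: {reject}")
```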