MA Statistics Tutorial
8/21/2019 MA Statistics Tutorial
1/61
STATISTICS TUTORIAL
FOR ECON MA STUDENTS
This tutorial offers students with a limited statistics background a concise
review of, and introduction to, fundamental topics in the MA program. It also
provides a refresher for students with more extensive statistics backgrounds.
To encourage a practical understanding, topics are presented using actual air travel data and Excel screenshots of statistical results.
There is a self-test at the end of each section to help each student evaluate their grasp of the
material.
No one will grade these self-tests; responsibility rests with the student. Students are advised to review incorrect answers and seek additional assistance in
understanding them if needed. Students may email [email protected]
with questions.
Additional concise sources of information on the topics presented are available from:
Hyperstats -- http://davidmlane.com/hyperstat/
Statsoft Electronic Textbook -- http://www.statsoft.com/textbook/stathome.html
Section I
Descriptive Statistics and
Measures of Sampling Error
Air Travel Data
For 21 cities, the following data have been
recorded or computed:
City = city identifying code
Fare = cheapest coach fare from Nashville to
city in $ on Orbitz on a given day
Distance = distance in miles for the route
Fare per Mile = Fare divided by Distance
Excel Screen Shot of Data
Distribution of Fare per Mile
The histogram has a normal (bell-shaped) distribution curve superimposed
The distribution of fare per mile is similar to
the normal after smoothing out the rectangles,
but is slightly skewed to the right
This graphic was produced by the statistical software package SPSS
Table 1-- Descriptive Statistics for Fare per Mile
Key Univariate Descriptive Statistics
Mean = average of about 28 cents per mile
Median = middle value (50th percentile); so half of the values are above 28.8 cents per mile and half are
below; the median is a better measure of the center
of the data set when the data are highly skewed
Standard deviation = typical distance of the
observations from the mean fare per mile;
in this case, the 21 observations differ from the
mean by roughly 9.6 cents per mile on average
Range = difference between the minimum and
maximum values
Skewness = degree of asymmetry; zero is perfectly
symmetric; large positive values (1.0 or larger)
indicate skew to the right (a long right tail); large negative values
indicate skew to the left; the value of 0.559
indicates a slight rightward skew as shown in the
graph on the prior page
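The statistics defined above can be sketched in a few lines of Python. This is only an illustration: the `describe` helper and the sample values are hypothetical (they are not the tutorial's air travel data), and the skewness formula is the adjusted sample skewness, which is assumed here to be the version that spreadsheet summary tables report.

```python
import statistics

def describe(values):
    """Return the descriptive statistics discussed above for a list of numbers."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 in the denominator)
    # Adjusted sample skewness (assumed to match spreadsheet-style output);
    # requires at least 3 observations and a non-zero standard deviation.
    skew = n / ((n - 1) * (n - 2)) * sum(((x - mean) / sd) ** 3 for x in values)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "std_dev": sd,
        "range": max(values) - min(values),
        "skewness": skew,
    }

# Hypothetical fare-per-mile values in dollars (NOT the tutorial's data)
sample = [0.15, 0.20, 0.25, 0.30, 0.45]
stats = describe(sample)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

A positive skewness value here reflects the single large observation (0.45) pulling the right tail out, which is the same pattern the tutorial describes for fare per mile.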
Sample Statistics v. Population Parameters
The statistics reported in Table 1 are sample statistics: they summarize the 21 observations in the sample
The full set of all possible fares between all cities of interest would
represent the population of fares and fares per mile
Population Parameter refers to a summary measure using all possible
data; for example, the population mean or population standard deviation
The sample statistics reported in Table 1 provide estimates of these
population parameters
Table 1 also provides numerical estimates of the accuracy and
reliability of the sample mean in estimating the population mean (see next slide)
Table 1-- Estimates of Sampling Error
Key Univariate Descriptive Statistics
Standard Error (of sample mean) = estimate of the likely sampling error between the sample mean
and the population mean; 0.021 implies that
repeated samples of the same size could easily find
sample means 2.1 cents higher or lower
Confidence Level (95%) = roughly two times the
standard error (for 99%, it is roughly 2.5 times the standard error); as such, it provides a figure similar
to the standard error, but with a wider margin for
error; 0.044 with a 95% Confidence level implies
that about 95 out of 100 samples of this size would
likely result in sample means within 4.4 cents of the
population mean
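As a rough check on the two figures above, the standard error of a mean is the standard deviation divided by the square root of the sample size, and the 95% margin multiplies it by a critical value close to 2. The sketch below uses the rounded values quoted in this tutorial; the critical value 2.086 is taken from a standard t-table for 20 degrees of freedom and is an assumption about how the software arrives at its figure.

```python
import math

n = 21               # number of cities in the sample
std_dev = 0.096      # standard deviation of fare per mile (9.6 cents, rounded)
mean = 0.282         # sample mean fare per mile (rounded)
t_crit_95 = 2.086    # assumed: two-sided 95% t critical value for n - 1 = 20 df

std_error = std_dev / math.sqrt(n)   # likely sampling error of the mean
margin_95 = t_crit_95 * std_error    # the "Confidence Level (95%)" figure
interval = (mean - margin_95, mean + margin_95)

print(f"standard error: {std_error:.3f}")
print(f"95% margin:     {margin_95:.3f}")
print(f"95% interval:   {interval[0]:.3f} to {interval[1]:.3f}")
```

The "roughly two times" rule in the text is this same calculation with 2.086 rounded to 2.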
How Reliable are Sample Error Estimates?
Standard Errors and Confidence Intervals estimate sampling error
Sampling Error is error arising because one is using less than the entire
population
To accurately estimate population parameters and sampling error,
samples must be representative of the population
Randomly selected samples are the best (though not foolproof) way of assuring this
Error not related to sample selection (question bias, response bias,
dishonest responses, data entry errors, etc.) must be small relative to the
size of the sampling error
This kind of error is called non-sampling error
Using Sampling Error in Testing Claims
(Hypothesis Testing)
Estimates of sampling error permit claims or conjectures (hypotheses) concerning population parameters to be tested with sample statistics while
taking into account a margin for error
Testing a claim for the population mean:
Suppose someone thinks that the mean fare per mile for the full population is 30
cents or higher
Given the sample mean (0.282) and the standard error of 0.021, it is quite likely that
another sample would yield an estimate of 30 cents or higher
If we double the standard error to get a 95% confidence interval and margin for
error of 0.042, we see that the claim of 30 cents or higher is quite plausible
In contrast, if someone were to claim that the mean is 35 cents or higher, the
standard error and confidence interval suggest that such a figure is not very likely
Testing Claims with P-values
Put briefly, a p-value shows the likelihood of obtaining a sample estimate at least this far from the hypothesized value by chance if the null hypothesis were true
Take the claim of a mean of .30 tested here (using SPSS software) given the sample mean of 0.282 and s.e. of 0.021
The estimated p-value (called Sig. (2-tailed)) is 0.425
The chance of finding a sample mean this far from 0.30 by chance is 42.5 percent
Typically, we reject the null only if this p-value is below a 5 percent threshold
Note: our test is really 1-tailed since we are testing greater than 0.30. We should cut the p-value in half to 0.2125 (21.25 percent), but this is still well above 0.05
One-Sample Test (Test Value = .30)
              t       df   Sig. (2-tailed)   Mean Difference   95% Confidence Interval of the Difference
                                                               Lower      Upper
farepermile   -.814   20   .425              -.01714           -.0611     .0268
Testing Claims with P-values
Now, test a mean of .35 or higher:
The estimated p-value (called Sig. (2-tailed)) is 0.005
The chance of finding such a value by
chance is 0.5 percent, which is far below
the 5 percent threshold even before
cutting it in half for a 1-tailed test
The p-value indicates that there is only a
0.5 percent chance of finding our mean of
0.282 if the true mean were 0.35 or
higher
The null hypothesis of a mean of 0.35 or
higher is rejected
One-Sample Test (Test Value = .35)
              t        df   Sig. (2-tailed)   Mean Difference   95% Confidence Interval of the Difference
                                                                Lower      Upper
farepermile   -3.189   20   .005              -.06714           -.1111     -.0232
Sidebar on Hypothesis Testing
In the previous slide, the proposition that the
population mean was 0.35 or higher was tested using the p-value
Any time that a p-value appears, a null hypothesis is
being tested
The proposition being examined is called the null hypothesis
Using p-values from the output of software is the simplest way of testing a hypothesis
With small data sets, especially with small effects being
tested, a p-value may not be below 0.05.
This does not mean that the null hypothesis is true
It may indicate that the test lacks Power to reject a false null
(due to lack of data); see Statsoft textbook under xxxxxxx for further information
Sidebar on Hypothesis Testing
In addition to p-values, t-statistics and confidence intervals (all derived from
standard errors) can also test a hypothesis
As a rule of thumb, t-values greater than 2 in
absolute value correspond roughly to p-values below
0.05
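To make the rule of thumb concrete, the t-statistic for a test of the mean is the gap between the sample mean and the tested value, divided by the standard error. The sketch below applies it to the two claims tested earlier (0.30 and 0.35). The sample mean of 0.28286 is back-calculated from the reported mean difference of -.01714, and the standard error is the rounded 0.021, so the results only approximate the SPSS values of -.814 and -3.189.

```python
def t_statistic(sample_mean, test_value, std_error):
    """t = (sample mean - hypothesized mean) / standard error of the mean."""
    return (sample_mean - test_value) / std_error

sample_mean = 0.28286   # assumed: 0.30 minus the reported mean difference of 0.01714
std_error = 0.021       # rounded standard error from Table 1

results = {}
for test_value in (0.30, 0.35):
    t = t_statistic(sample_mean, test_value, std_error)
    # Rule of thumb: |t| > 2 corresponds roughly to a 2-tailed p-value below 0.05
    results[test_value] = "reject" if abs(t) > 2 else "do not reject"
    print(f"test value {test_value:.2f}: t = {t:.3f} -> {results[test_value]} the null")
```

The verdicts agree with the p-value approach: the claim of 0.30 is not rejected, while the claim of 0.35 is.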
Self Test Section I
The self test uses a data set on 5K running times; the raw data appear on the next slide; variables are
Time = 5k time in minutes (decimals are fractions
of minutes)
Age = age in years
Intervals = 1 if hard interval workouts were used and 0 if not
Miles Per Week = number of miles per week in training at peak of training
Self-Test for Section I
1. The measure that provides the middle or 50th percentile observation is
a. 19.30
b. 19.50
c. 0.800
d. 19.00
2. The statistic that indicates how spread out the individual 5k times are from the average time is
a. 3.250
b. 0.160
c. 0.800
d. 0.192
3. Based on the data, you can say that the times are
a. Nearly symmetric
b. Highly skewed to the right
c. Highly skewed to the left
d. Not enough information
Self-Test for Section I
4. The likely sampling error for the mean is
a. 0.160
b. 0.192
c. 0.800
d. 3.250
5. The 95% confidence interval for the mean is computed by
a. Multiplying the standard error by about 2.0
b. Multiplying the standard deviation by 95%
c. Dividing the range by about 10
d. Dividing the mean by the sample size
6. The value for Age for the second observation is
a. 42
b. 21
c. 22
d. 44
Self-Test for Section I
7. In the output, a test of the mean is provided. The null hypothesis being tested is
a. That the population mean equals 19.3
b. That the population mean equals -4.373
c. That the population mean equals -0.700
d. That the population mean equals 20
8. The results in the table provide a 2-tailed test. To compute a 1-tailed test, you would
a. Double the p-value
b. Divide the t-statistic by two
c. Divide the p-value by two
d. Double the size of the confidence interval
9. Which of the following indicates that the null hypothesis should be rejected?
a. t = -4.373
b. p-value (Sig. 2-tailed) = 0.000
c. Both a and b
d. Neither a nor b
One-Sample Test (Test Value = 20)
       t        df   Sig. (2-tailed)   Mean Difference   95% Confidence Interval of the Difference
                                                         Lower      Upper
Time   -4.373   24   .000              -.70000           -1.0304    -.3696
Correct Answers to Self-Test Section I
1. A
2. C
3. A (the skewness statistic is very small, 0.192, indicating only a
slight amount of positive skew; 0 would be perfectly symmetric;
above or below 1.0/-1.0 would indicate substantial asymmetry)
4. A
5. A
6. C (go back to the original data sheet for this)
7. D (this is indicated by the Test Value = 20 in the SPSS output)
8. C (the test provided is 2-tailed because it tests whether the mean equals 20 or not; 1-tailed would test whether it was 20 or more)
9. C (the p-value is less than the typical 0.05 threshold for rejecting the
null hypothesis; the t-value's absolute value is greater than 2.0)
Section II
Regression Analysis
Relationships Between Variables
In economics, investigators are frequently interested in how one variable interacts with
another; Example: sales and income
Often, one of the variables causes changes in the other, such as higher incomes causing more sales.
The causal variable is referred to as the X, Independent, or Explanatory Variable
The responding variable is referred to as the Y or Dependent Variable
Sometimes the relationship is not causal but merely one of association because of links
to a third variable
Example: SAT & ACT scores, which are both caused by academic ability and achievement
The statistical technique most frequently used to examine relationships between
variables is Regression Analysis or some technique that is very similar to regression
analysis.
Regression analysis can be used for all kinds of data and relationships including
Linear relationships and Curved relationships
Quantitative data and Qualitative data
Cross-sectional and time series data
The following slides present the simplest form of Regression Analysis
A quantitative dependent variable (Air Fare) and one quantitative, independent
variable (Distance)
The relationship is treated as linear
Scatterplot for Fare & Distance
The Scatterplot presented in Figure 2
depicts the 21 Fare (Y-axis) and
Distance (X-axis) combinations in the
data set
The graph shows that as distance
increases, fare also tends to increase,
but that the relationship is not perfect;
otherwise, the points would lie on a straight line
Fig. 2 -- Scatterplot of Fare (Y) and Distance (X)
[figure: Fare (0 to 400) on the vertical axis against Distance (0 to 2500) on the horizontal axis]
Regression from a Visual Standpoint
Figure 3. Scatterplot and Regression Plot for Fare-Distance
[figure: Fare (0 to 400) on the vertical axis against Distance (0 to 2500) on the horizontal axis]
Figure 3 adds another element to the
plot: a straight line of points (a line
connecting the pink points)
These points represent the regression line that Excel chose as the straight line
that best fit the scatterplot points
Software chooses the line to minimize
the sum of the (squared) vertical distances
between the blue points and the pink line; this method is called the Least
Squares or Ordinary Least Squares
(OLS) method and is widely used
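For a single X-variable, the least squares line can be computed directly from the data using the standard textbook formulas for slope and intercept. The sketch below is a minimal illustration of that idea; the function names and the distance and fare values are hypothetical (they are not the tutorial's 21 observations), and it is not meant to reproduce Excel's internal procedure.

```python
def ols_line(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def sum_squared_errors(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and a given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical distances (miles) and fares ($), NOT the tutorial's data
distances = [200, 600, 900, 1500, 2100]
fares = [170, 250, 220, 290, 330]

b0, b1 = ols_line(distances, fares)
best = sum_squared_errors(distances, fares, b0, b1)
# Any other line, such as one with a 10% steeper slope, fits worse
worse = sum_squared_errors(distances, fares, b0, b1 * 1.1)

print(f"Fare = {b0:.1f} + {b1:.3f} * Distance")
print(f"sum of squared errors: {best:.1f} (least squares) vs {worse:.1f} (steeper line)")
```

Comparing the two sums of squared errors shows what "best fit" means here: no other straight line gives a smaller total.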
Fare-Distance Regression as Tabular Output
Regression in Table Form
Table R1 presents the same regression results as Figure 3
The Regression Statistics and ANOVA parts of the table
evaluate the overall performance of the regression in predicting
Fare to different cities
The bottom part with Coefficients for Intercept and Distance
presents the regression line as numbers that can be put into an
equation along with estimates of sampling error
The following slides break down the different parts of the table
Regression output always implies an equation written generally as
Y = b0 + b1*X
b0 = y-intercept
b1 = slope (change in Y over change in X)
b0 and b1 are referred to as regression coefficients or intercept coefficient
and slope coefficient
The pink line in Figure 3 can be written down as an equation
Recall the slope-intercept form of a line (y = mx + b) from basic algebra: if you
draw a line through the pink points in Figure 3, and extend it to where Distance
(X) = 0, the intercept should be obvious
The equation for this line is
Fare = 157 + 0.084 * Distance + Error
(157 is the Intercept; 0.084 is the Slope)
Coefficients
Intercept 157.614
Distance 0.084
Slope & Intercept Meaning
The slope indicates that for every additional mile of Distance, the Fare
increases by $0.084 (or about 8 cents).
The slope produced in regression analysis always shows the amount of
increase in Y (or decrease if negative) for a 1 unit increase in X
To correctly interpret the slope for a regression, it is critical to know the
units in which X and Y are measured; here, the units are miles and dollars
A 100 mile increase implies an $8.40 (100 x 0.084) increase in Fare
The y-intercept indicates that if distance were 0, the fare would be $157
The intercept in this case is not an economically meaningful number
because there are no flights of 0 miles
The intercept merely extends the line to the Y-axis (where X = 0) for statistical purposes
Be aware of the relevant range (min, max) of the X-variable
Regression Line Errors (Residuals)
Using the regression equation, Y-values for
given X-values can be calculated
Predicted Y = intercept + slope*(X-value)
Example: Observation 1 is Dallas with a
distance of 600 miles:
Predicted Fare = 157.6 + 0.084*(600) = 208
(Excel's prediction is 208.310; we rounded)
The regression Error (residual) =
Actual Y value - Predicted Y value
For Dallas (observation 1), the actual fare was $250, so we calculate
Residual = 250 - 208.310 = 41.690
Each observation has a predicted fare and
error associated with it
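The Dallas calculation can be written as two small functions. This is a sketch using the rounded coefficients shown in the output (157.614 and 0.084), so its results differ slightly from Excel's 208.310 and 41.690, presumably because Excel carries more decimal places in the slope than the table displays.

```python
INTERCEPT = 157.614   # intercept coefficient from the regression output
SLOPE = 0.084         # slope coefficient for Distance (rounded to 3 decimals)

def predicted_fare(distance):
    """Predicted Y = intercept + slope * X."""
    return INTERCEPT + SLOPE * distance

def residual(actual_fare, distance):
    """Residual = actual Y - predicted Y."""
    return actual_fare - predicted_fare(distance)

dallas_pred = predicted_fare(600)     # Dallas: 600 miles
dallas_resid = residual(250, 600)     # actual Dallas fare was $250

print(f"predicted fare: {dallas_pred:.3f}")
print(f"residual:       {dallas_resid:.3f}")
```

A positive residual means the actual fare is above the regression line; a negative residual means it is below.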
R Square reports the percent of the variation in the Y-variable explained by the
X-variable
In other words, it expresses (as a percent) how close the regression
line points come to predicting the actual scatterplot points
The maximum R-square is 1.0 (100%) and the minimum is 0.
In this case, Distance, by itself, can account for 48.6% of the Fare
differences between cities
In a 2-variable regression like this one, the Multiple R is the same
thing as the Correlation Coefficient between X and Y.
The R-square is the squared correlation coefficient in such cases.
The correlation coefficient ranges from -1.0 to 1.0 (perfectly correlated at either extreme), and 0 indicates no linear relationship
It can take on positive or negative values depending on the direction
of the relationship between the two variables
Multiple R 0.697
R Square 0.486
Adjusted R Square 0.459
Standard Error 43.294
Observations 21.000
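The numbers in this table are linked by simple arithmetic, which the sketch below checks. R Square is the Multiple R squared, and Adjusted R Square follows the standard textbook adjustment for sample size and number of X-variables; that formula is not stated in the tutorial and is assumed here to be the one the software uses.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Standard adjustment: penalizes R-square for the number of X-variables."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

multiple_r = 0.697   # correlation between Fare and Distance
n_obs = 21           # observations
n_predictors = 1     # Distance only

r_squared = multiple_r ** 2
adj_r_squared = adjusted_r_squared(0.486, n_obs, n_predictors)

print(f"R Square:          {r_squared:.3f}")
print(f"Adjusted R Square: {adj_r_squared:.3f}")
```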
Regression Coefficient Accuracy
Just like the sample mean, the regression coefficients are sample statistics
that are usually used to estimate what the true relationship would be if all
possible data were used
Regression coefficients, therefore, also have standard errors that estimate
their sampling error
The slope coefficient for distance (0.08) has a standard error of 0.02
This implies that the population parameter (regression coefficient using
all possible data) may easily be 2 cents higher or lower than the 0.08
coefficient estimated by this sample
For a wider (approx. 95%) margin for error, this standard error can be
multiplied by about 2.0
More on Regression Coefficient Accuracy
The t-stat and p-value are also ways of assessing the reliability of the
coefficient
They test whether the coefficient is significantly different from zero
As a rule of thumb, if the t-statistic is greater than 2.0 (or less than -2.0), this is viewed as
significantly different from zero
The t-Stat on Distance is 4.239, so it is statistically significant
The p-value estimates the likelihood of finding the coefficient of 0.084
by mere chance if the true value were zero
The p-value of 0.000 indicates that this would be very unlikely, also
showing a statistically significant result
In scientific research, p-values below 5 percent (0.05) are taken as
statistically significant
In other settings, the cutoff level for the p-value may vary
Expanded Regression Analysis
In most situations in economics, investigators look at the effects of multiple
variables on a dependent variable when using regression analysis
Example: price and income effects on sales
Such regressions are sometimes called multiple regression analysis and
involve only slight modifications of the earlier points
Also, economists widely use qualitative variables as independent variables.
When these take on only two values (male, female) they are usually coded as
(1,0) and called binary or dummy variables
In the Air Travel data, we have such a variable, Direct SWA, that indicates whether Southwest Airlines flies this route directly (1) or not (0). This
variable is added to the regression analysis, resulting in the following Excel
output:
Fare Regression with Distance and Direct SWA
The regression equation is now
Fare = 193 + 0.08*Distance - 66*Direct SWA + Residual
The slope coefficient for Distance is still about 0.08
The y-intercept coefficient was 157; it is now 193
The Direct SWA variable has these effects:
When SWA = 0 (when SWA does not fly that route), the regression equation is
Fare = 193 + 0.081*Distance, because -66*(0) = 0
When SWA = 1 (when SWA flies the route), the regression equation is
Fare = 193 + 0.081*Distance - 66*(1) = 127 + 0.081*Distance
Note that the SWA dummy variable only influences the y-intercept
The SWA variable does not influence the slope for distance (see next slide)
Coefficients Standard Error t Stat P-value
Intercept 193.032 14.411 13.395 0.000
Distance 0.081 0.012 6.698 0.000
Direct SWA -66.779 11.446 -5.834 0.000
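The effect of the dummy variable can be seen by evaluating the equation at the same distance with Direct SWA set to 0 and then to 1. The sketch below uses the coefficients from the table above; the function name and the example distances are illustrative only.

```python
# Coefficients from the regression output with Distance and Direct SWA
COEF = {"intercept": 193.032, "distance": 0.081, "direct_swa": -66.779}

def predicted_fare(distance, direct_swa):
    """Predicted Fare; direct_swa is 1 if Southwest flies the route directly, else 0."""
    return (COEF["intercept"]
            + COEF["distance"] * distance
            + COEF["direct_swa"] * direct_swa)

for miles in (500, 1000):   # illustrative distances
    without_swa = predicted_fare(miles, 0)
    with_swa = predicted_fare(miles, 1)
    print(f"{miles} miles: {without_swa:.2f} without SWA, {with_swa:.2f} with SWA, "
          f"gap {with_swa - without_swa:.2f}")
```

The gap is the same at every distance, which is what it means for the dummy variable to shift the intercept without changing the slope.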
Distance Line Fit Plot
[figure: Fare (0 to 400) on the vertical axis against Distance (0 to 2500) on the horizontal axis]
The line connecting the upper pink dots
shows the regression line when
SWA = 0
The line connecting the lower pink dots shows the regression line when SWA = 1
The Fare-Distance slope for both lines
is 0.08
Another important difference that results from adding the SWA variable is the
increase in the R-Square value
It is now 82.2% (it was about 48% when using only Distance)
The combination of Distance and Direct SWA accounts for 82.2% of the differences
in Fares across cities.
Adding SWA increased this value by about 34 percentage points
Table R2. Regression with Multiple X-
Regression Statistics
Multiple R 0.907
R Square 0.822
Adjusted R Square 0.802
Standard Error 26.161
Observations 21.000
From the regression predictions and errors, Excel (and other software) computes an Analysis of Variance or ANOVA
The F-Statistic is the most important number here; it is the ratio of the mean regression sum of squares to the mean residual sum of squares
Unlike the R-Square value, the F-statistic adjusts for the number of variables used
The Significance F is simply a p-value testing the null hypothesis that none of the X-variables helps explain Y (all slope coefficients equal zero); with these data, this null hypothesis is rejected because the p-value is very low
In effect, the F-statistic tests whether the X-variables, as a group, matter in explaining the Y-variable
The SS in the table refers to Sum of Squares.
The Residual SS simply squares the individual errors and adds them up.
MS refers to the mean sum of squares, which divides each SS by its degrees of freedom (df); for the residual, this is the number of observations minus the number of estimated coefficients
The Regression (predicted) sum of squares squares the differences between the predicted values for Fare and the mean Fare and then adds them up
The Total sum of squares adds the Regression and Residual sums of squares together
The R-Square is simply the regression sum of squares divided by the total
The Adjusted R-squared, like the F-statistic, adjusts for the number of variables used
ANOVA
df SS MS F Significance F
Regression 2.000 56977.083 28488.541 41.627 0.000
Residual 18.000 12318.727 684.374
Total 20.000 69295.810
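The relationships among the ANOVA entries can be verified with a few lines of arithmetic on the df and SS columns above. The adjusted R-square expression is the standard textbook formula and is assumed, not confirmed by the tutorial, to be what the software computes.

```python
import math

# df and SS values from the ANOVA table
df_regression, ss_regression = 2, 56977.083
df_residual, ss_residual = 18, 12318.727
df_total, ss_total = 20, 69295.810

ms_regression = ss_regression / df_regression   # mean regression sum of squares
ms_residual = ss_residual / df_residual         # mean residual sum of squares

f_statistic = ms_regression / ms_residual
r_squared = ss_regression / ss_total
adj_r_squared = 1 - ms_residual / (ss_total / df_total)   # assumed standard formula
regression_std_error = math.sqrt(ms_residual)   # "Standard Error" in Regression Statistics

print(f"MS regression: {ms_regression:.3f}   MS residual: {ms_residual:.3f}")
print(f"F = {f_statistic:.3f}")
print(f"R Square = {r_squared:.3f}, Adjusted R Square = {adj_r_squared:.3f}")
print(f"Standard Error = {regression_std_error:.3f}")
```

Each result should agree with the corresponding entry in the ANOVA table or in Table R2, up to rounding.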
Regression Pointers
Regressions that are well done have residuals that have no obvious
patterns and are roughly bell shaped; checking the residuals for these and other characteristics is called Residual Analysis
Regressions that leave out key explanatory (X) variables can yield misleading slopes; this is called the Omitted Variables Bias
Regressions leaving out key variables should be viewed as exploratory or preliminary in nature
There is no magical R-squared value to be obtained; if a model is put together well, then a low R-squared is fine; if a model has key flaws in it, then a high R-Squared value does not make it good
Only humans can determine if a regression is causal (Income-Sales) or merely associative (SAT-ACT); the software treats both cases the same
Self Test Section II
The self test again uses a data set on 5K running times shown on the next slide
Time = 5k time in minutes (decimals are
fractions of minutes)
Age = age in years
Intervals = 1 if hard interval workouts were used and 0 if not
Miles Per Week = number of miles per week in training at peak of training
For These Questions, Refer to this Output
1. The regression equation depicted by the table is
a. 5k Time = 0.731 + Age + Intervals + Residual
b. 5k Time = 17.554 + Age*Intervals + Residual
c. 5k Time = 17.554 + 0.071*(-0.863)*Age*Intervals + Residual
d. 5k Time = 17.554 + 0.071*Age - 0.863*Intervals + Residual
2. The percent of 5k time differences accounted for by Age and Intervals in the regression model is
a. 0.731
b. 17.554
c. 12.660
d. 0.535
3. The slope coefficient for Age is
a. 0.071
b. 0.731
c. 17.554
d. 0.016
4. The likely sampling error in the slope coefficient for Age is
a. 0.071
b. 0.731
c. 17.554
d. 0.016
5. The slope coefficient for Age implies that
a. For each 1 minute increase in Time, Age increases by 0.071 years
b. For each 1 year increase in Age, Time increases by 1 minute
c. For each 1 year increase in Age, Time increases by 0.071 minutes
d. For each 1 year increase in Time, Age increases by about 53%
6. The regression results imply that if Age were 0, then Time would be
a. 0.731
b. 12.660
c. 24.000
d. 17.554
7. The value in the preceding question
a. Means that a newborn baby would be predicted to run this time in a 5k
b. Means that the value is really only a hypothetical extension of the regression line because none of the actual data go back to zero years of Age
c. Means that the regression is not reliable at any values
d. Means that babies should compete in the Olympics
8. The coefficient for Intervals implies that
a. When interval equals 1, the Age slope is reduced by 0.863
b. When interval equals 0, the y-intercept value is reduced by 0.863 minutes
c. When interval equals 1, the Age slope is the same but the entire regression line shifts down by 0.863 minutes
d. When interval equals 0, the Age slope is the same but the entire regression line shifts down by 0.863 minutes
9. If you wanted to compute the effects of 10 more years of Age on the predicted 5k Time, you should multiply
a. 0.10 x 0.071
b. 10 x 0.071
c. 10 x 1.0
d. 100 x 0.071
10. The predicted value for 5k Time when a person is 47 and using intervals in
training would be found by which of the following equations?
a. Predicted 5k Time = 17.554 + 0.071*(47)
b. Predicted 5k Time = 0.731 + 0.072*(47) - 0.863*(1)
c. Predicted 5k Time = 17.554 + 0.071*(47) - 0.863*(1)
d. Predicted 5k Time = 0.071*(47) - 0.863*(1)
11. Using the data sheet provided earlier, compute the residual for the first
observation. (Note: you will first have to compute the predicted time)
a. -0.545
b. 0.631
c. -.034
d. 1.232
12. The data provided on the accuracy of the coefficients indicates that
a. All are not significantly different from zero
b. Age is significantly different from zero but not Intervals
c. Intervals is significantly different from zero but not Age
d. All are significantly different from zero
Correct Answers Section II Self Test
1. D
2. D
3. A
4. D
5. C
6. D
7. B
8. C
9. B (the slope for a 1 unit (year) change in Age is 0.071; a 10 year change is
simply 10 x slope)
10. C (the prediction must include the intercept, 17.554, along with both slope terms)
11. A (Predicted Time = 17.554 + 0.071*(21) - 0.863*(0) = 19.045;
Residual = Actual - Predicted = 18.50 - 19.045 = -0.545)
12. D (All of the p-values for the coefficients are below the 0.05 threshold for
significance; all of the t-statistics are above 2.0 in absolute value, the
rule-of-thumb value for significance)
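The arithmetic behind answers 10 and 11 can be checked with a short sketch. The coefficients come from question 1, option d; the first observation's values (Age 21, Intervals 0, Time 18.50) are taken from the worked answer to question 11.

```python
def predicted_time(age, intervals):
    """Predicted 5k Time = 17.554 + 0.071*Age - 0.863*Intervals."""
    return 17.554 + 0.071 * age - 0.863 * intervals

# Question 11: first observation (Age 21, no intervals, actual time 18.50)
first_pred = predicted_time(21, 0)
first_resid = 18.50 - first_pred

# Question 10: a 47-year-old who uses interval training
pred_47_intervals = predicted_time(47, 1)

print(f"predicted time, observation 1: {first_pred:.3f}")
print(f"residual, observation 1:       {first_resid:.3f}")
print(f"predicted time, age 47 with intervals: {pred_47_intervals:.3f}")
```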
Section III
Statistical Software
Overview
Personal computers and software make it possible for almost anyone to
complete complicated or lengthy computations needed for statistics; knowing what to do with them is the hard part
Excel contains many useful statistical and graphing capabilities; these are introduced in the next few slides
Software dedicated to statistical operations vastly expands the breadth of procedures possible and makes some much easier than in Excel. Some commonly used statistical software includes
SAS (www.sas.com); the company offers many varieties; JMP is a point-click
product; SAS is available in some places at WKU
SPSS (www.spss.com); this software is available in most computer labs on campus; it is not as widely used by economists as SAS but contains most of the same features, especially for basic purposes
Stata (www.stata.com) is widely used by economists and contains a broad and very powerful set of tools; Eviews (www.eviews.com) is also very powerful and especially useful for time series and forecasting applications; both provide point-click functionality
Excel Stat Introduction 1
Making Application
While there is no self-test with this section, you are strongly encouraged
to practice on Excel; even if you use other software in later classes, the
practice in Excel will be helpful
One of the main differences between Excel and spreadsheets in statistical
software is that Excel is address driven (each cell has an address),
whereas the stat software is variable driven: once a column of data
exists for a variable, the entire column can be manipulated simply by
referring to the name
Excel Stat Introduction 2
Click the Tools menu in Excel; if Data Analysis appears as an option you may skip to the next slide; if not, then
Select the Add-Ins option under the Tools menu
Check the box for Analysis ToolPak
The Data Analysis option should now appear under the Tools menu
(Note: If you opened Excel from your desktop, the procedures above
should work; if you happened to open Excel by opening an Excel-
based spreadsheet while browsing on the internet, it may not work)
Excel Stat Introduction 3
Take one of the data sheets, Air Travel or 5k Times, used in this tutorial and enter the data into Excel. The instructions here proceed
using the Air Travel data.
To compute descriptive statistics for a variable
Select the Tools menu
Select the Data Analysis option
Select the Descriptive Statistics option
Click on the icon next to the blank for Input Range
Highlight the column for Fare including the label
Check the Labels in the First Row box
Check the Summary Statistics box
Check the Confidence Interval for the Mean box
Click the OK button
Excel Stat Introduction 4
You should now have an output table on a new sheet
One disadvantage of Excel is that statistical output tables like this one tend
to be collapsed or condensed and need to be formatted
Formatting the output table (this is something you should always do in
Excel)
Highlight the columns with the table
Select the Format menu
Select the Column and AutoFit Selection options
Again, select the Format menu
Select the Cells option
In the Number menu, choose the Number option
Pick a number for the Decimal Places box (the number of decimal
places depends somewhat on the data; 3 will be fine here)
Make sure to do this step in Excel; tables with a lot of insignificant decimal
places are very messy to read
Excel Stat Introduction 5
Return to the original data sheet
Create a regression analysis:
Select the Tools menu and the Data Analysis option
Select the Regression Analysis option in the window
Select the icon next to the Input Y Range blank and highlight the data
containing Fare including the label
Select the icon next to the Input X Range and highlight the data
containing Distance and Direct SWA including the labels
(Note: if you try to highlight the whole columns you may get an error)
Check the Labels box, the Residuals box, and the Line Fit box
Select the OK button and reformat the output tables as before
You will also need to reformat the Line Fit plot (another small hassle in
Excel); just expand it using the mouse
Excel Stat Introduction 6
Return to the original data sheet
Charts in Excel
Excel can also be used to create scatterplots, histograms, and other types
of plots
This is an area where statistical software is much easier to use
If you want to tinker some, click on the Chart Wizard icon that should appear below the top level menus
The icon has the appearance of a bar chart
Also, under the Data menu, there is a Pivot Table and Pivot Chart
option that provides further capabilities
If you would like a hands-on introduction to other statistical software,
please contact Brian Goff at [email protected]. Also, several
other economics professors can provide assistance in becoming
acquainted with software.
Probability Distributions
A final topic briefly introduced here is that of probability distributions (PD)
A PD is a formula (often presented as a graphic or table) that links
values of a variable with the probability of those values
PDs are used in many ways; for statistics, one of the key uses is to
assess hypotheses including the use of t-statistics and p-values
Statistical software makes an extensive knowledge of PDs
unnecessary because the relevant information about the PD is stored by the computer and used as needed; however, a few basic points are
worthwhile even for basic statistics users
Probability Distributions 2
PDs have a center, dispersion, and symmetry or skew (asymmetry)
measures of location of center include mean & median
measures of dispersion include the standard deviation and range
PDs have tails (the ends), measured by the amount of kurtosis
Normal (Probability) Distribution
Most widely known due to its bell-shape
Many real life situations are approximately (though not perfectly) distributed Normal
The mother of PDs in that many other distributions are related to it or converge to it with large samples or other conditions
t-Distribution
Also bell-shaped
Is wider in its tails than the normal but converges to it with large samples
Binomial Distribution: deals with 2 outcome situations
F-Distribution, Chi-Square Distribution: commonly used distributions when the topic is variability
Excel permits PDs to be used directly if desired
Click on the function icon (the script f) just below the top menus
Select Statistical in the window and scroll to the desired distribution such as
NORMDIST for normal
We can now produce probabilities for a variable assumed to be normal or near
normal
Example: Let's assume that male height is approximately Normal with a mean of 70 inches
and a standard deviation of 2 inches. What is the probability of finding someone
taller than 74?
In the NORMDIST window, plug in 74 for X, 70 for Mean, and 2 for
Standard Deviation
In the Cumulative box, put True.
Excel will produce a number that is the probability of being 74 or less (that is, the
cumulative probability)
This number is 0.977
The probability of being taller than 74 is 1 - .977 = .023 or 2.3%
The same or similar procedures can be used for 2 outcome (binomial) problems
or many others and opens up a wide array of uses
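The same height calculation can be done outside Excel. The sketch below uses Python's standard-library `statistics.NormalDist` (available in Python 3.8 and later) in place of NORMDIST, with the example's mean of 70 inches and standard deviation of 2 inches; the variable names are illustrative.

```python
from statistics import NormalDist

# Male height assumed approximately Normal: mean 70 inches, standard deviation 2 inches
height = NormalDist(mu=70, sigma=2)

p_at_most_74 = height.cdf(74)          # cumulative probability, like NORMDIST with Cumulative = True
p_taller_than_74 = 1 - p_at_most_74    # probability of being taller than 74 inches

print(f"P(height <= 74) = {p_at_most_74:.3f}")
print(f"P(height > 74)  = {p_taller_than_74:.3f}")
```

Because 74 inches is exactly two standard deviations above the mean, the upper-tail probability is the familiar figure of roughly 2.3%.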
Clockwise from Left Corner:
Normal, t-, F-, and Chi-Square Distributions
A gallery of PDs and more background is
offered at the Engineering Statistics
Handbook: http://www.itl.nist.gov/div898/handbook/eda/section3/eda366.htm