Stat_AMBA_600_Problem Set3



Tyler Anton


Spring 2014

Problem Set #3

Hypothesis Testing

1. University of Maryland University College is concerned that out-of-state students may be receiving lower grades than Maryland students. Two independent random samples have been selected: 165 observations from population 1 (out-of-state students) and 177 from population 2 (Maryland students). The sample means obtained are X1(bar) = 86 and X2(bar) = 87. It is known from previous studies that the population variances are 8.1 and 7.3, respectively. Using a level of significance of .01, is there evidence that the out-of-state students may be receiving lower grades? Fully explain your answer.

H0: μ1 ≥ μ2

H1: μ1 < μ2 [Rejection region in lower (left) tail]

Level of significance = 0.01, one-tailed test (Appendix B.5)

*Critical value (infinite df) = -2.326. Because H1 is "less than", the critical value is negative and the rejection region is in the lower (left) tail.

Thus, reject H0 if z < -2.326.

Population variances (σ²) are known: σ1² = 8.1 and σ2² = 7.3

Z = (86 - 87) / SQRT[(8.1/165) + (7.3/177)]

Z = -1 / 0.3005558965

Z = -3.327168129

Explanation

The Z test statistic (-3.327) is lower than the critical value (-2.326), so it falls in the lower-tail rejection region. We therefore reject H0 in favor of H1: at the .01 level, there is evidence that out-of-state students receive lower grades than Maryland students.

P-value check: reject H0 if P-value < level of significance (0.01).

*P-value = [0.5 - 0.4996] = 0.0004. Since 0.0004 < 0.01, reject H0; a difference this large would be very unlikely if H0 were true.

*0.4996 derived from Appendix 3.B; it is the area under the standard normal curve between 0 and z = 3.33.
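The calculation for problem 1 can be cross-checked with a short script. This is only a sketch using Python's standard library; the function name `two_sample_z` is a hypothetical helper, and the tail area is computed directly from the normal distribution rather than read from a printed table, so the p-value may differ slightly from a table lookup.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean1, mean2, var1, var2, n1, n2):
    """z statistic for comparing two means when population variances are known."""
    standard_error = sqrt(var1 / n1 + var2 / n2)
    return (mean1 - mean2) / standard_error

alpha = 0.01
# Population 1 = out-of-state students, population 2 = Maryland students
z = two_sample_z(86, 87, 8.1, 7.3, 165, 177)

critical = NormalDist().inv_cdf(alpha)  # lower-tail critical value, about -2.326
p_value = NormalDist().cdf(z)           # area to the left of z
reject_h0 = z < critical

print(f"z = {z:.4f}, critical = {critical:.3f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

The decision is the same whether it is made by comparing z to the critical value or by comparing the p-value to the level of significance.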


Simple Regression

2. A CEO of a large pharmaceutical company would like to determine whether the company should allot more money in next year's budget for television advertising of a new drug marketed for controlling diabetes. He wonders whether there is a strong relationship between the amount of money spent on television advertising for this new drug, called DIB, and the number of orders received. The manufacturing process of this drug is very difficult and requires stability, so the CEO would prefer to generate a stable number of orders. The cost of advertising is always an important consideration in the phase I roll-out of a new drug. Data that have been collected over the past 20 months indicate the amount of money spent on television advertising and the number of orders received. The use of linear regression is a critical tool for a manager's decision-making ability. Please carefully read the example below and try to answer the questions in terms of the problem context. The results are as follows:

NOTE: If you do not have the Data Analysis option under Tools, you must install it. Go to Tools, select Add-ins, and then choose the 2 data toolpak options. It should take about a minute.

Month   Advertising Cost ($)   Number of Orders
1       74,430                 2,856,000
2       62,620                 1,800,000
3       67,580                 1,299,000
4       53,680                 1,510,000
5       69,180                 1,367,000
6       73,140                 2,611,000
7       85,370                 3,788,000
8       76,880                 2,935,000
9       66,990                 1,955,000
10      77,230                 3,634,000
11      61,380                 1,598,000
12      62,750                 1,867,000
13      63,270                 1,899,000
14      86,190                 3,245,000
15      60,030                 1,934,000
16      79,210                 2,761,000
17      67,770                 1,625,000
18      84,530                 3,778,000
19      79,760                 2,979,000
20      84,640                 3,814,000

a. Set up a scatter diagram and calculate the associated correlation coefficient. Discuss how strong you think the relationship is between the amount of money spent on television advertising and the number of orders received.

Please use the Correlation procedures within Excel under Tools > Data Analysis.

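Implication: The number of orders received is related to the advertising costs/budget.

For the correlation coefficient, the choice of dependent variable does not matter, since r is the same either way. For the scatter plot and the regression in parts b through f, the problem asks us to predict advertising costs from the number of orders, so:

Dependent Variable (y) = [Advertising Costs]

Independent Variable (x) = [Number of Orders]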

[Scatter plot: "Advertising Cost & Orders Received Comparison". x-axis: Orders Received (x), 1,000,000 to 4,000,000; y-axis: Advertising Costs (y), $0 to $100,000; fitted line y = 0.0097x + 47895, R² = 0.776]


Correlation Coefficient (r) = 0.880931435

The scatter plot and correlation coefficient (r) of 0.8809 indicate a strong positive correlation. A value of (r) near 1 indicates a direct or positive linear relationship between the two variables, advertising costs and number of orders: months with higher advertising costs tend to be months with more orders. So far, the CEO should consider increasing the advertising budget, since there is a strong, direct relationship between the amount of money spent on television advertising for this new drug, called DIB, and the number of orders received.
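As a cross-check on the Excel Correlation procedure, the coefficient can be computed from its definition. The sketch below assumes the 20 monthly observations exactly as listed in the data table; `pearson_r` is a hypothetical helper name, not an Excel or library function.

```python
from math import sqrt

# (advertising cost in $, number of orders) for months 1-20, from the data table
data = [
    (74430, 2856000), (62620, 1800000), (67580, 1299000), (53680, 1510000),
    (69180, 1367000), (73140, 2611000), (85370, 3788000), (76880, 2935000),
    (66990, 1955000), (77230, 3634000), (61380, 1598000), (62750, 1867000),
    (63270, 1899000), (86190, 3245000), (60030, 1934000), (79210, 2761000),
    (67770, 1625000), (84530, 3778000), (79760, 2979000), (84640, 3814000),
]

def pearson_r(xs, ys):
    """Pearson correlation: Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

costs = [cost for cost, _ in data]
orders = [n_orders for _, n_orders in data]

r = pearson_r(orders, costs)
print(f"r = {r:.4f}")
```

Swapping the two arguments gives the same value, which is why the correlation alone does not say which variable is the dependent one.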

b. Assuming there is a statistically significant relationship, use the least squares method to find the regression equation to predict the advertising costs based on the number of orders received. Please use the regression procedure within Excel under Tools > Data Analysis to construct this equation.

Least Squares Regression Equation: y = 0.00971950x + 47895

R² = 0.776

c. Interpret the meaning of the slope, b1, in the regression equation.

The coefficient for the 'Number of Orders Received' (x) is 0.00971950. For every one-unit increase in the firm's 'Number of Orders Received', 'Advertising Costs' are predicted to increase by $0.00971950 (just under 1 cent), or about $9.72 for every 1,000 additional orders.

B. Regression

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.880931435
R Square            0.776040194
Adjusted R Square   0.763597982
Standard Error      4704.512237
Observations        20

ANOVA
             df   SS           MS            F            Significance F
Regression   1    1380434618   1380434618    62.3715644   2.943E-07
Residual     18   398383837    22132435.39
Total        19   1778818455

               Coefficients   Standard Error   t Stat        P-value      Lower 95%    Upper 95%    Lower 95.0%   Upper 95.0%
Intercept      47894.77763    3208.26531       14.92855891   1.3962E-11   41154.4623   54635.0929   41154.4623    54635.0929
X Variable 1   0.00971951     0.0012307        7.897566989   2.943E-07    0.00713391   0.01230511   0.00713391    0.01230511

Note that R Square here is the same (.776) as we got on the chart. Also, the equation coefficients are identical (47895 and .00972).
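The slope and intercept in the summary output can be reproduced from the least squares formulas b1 = Sxy / Sxx and b0 = ȳ - b1·x̄. The following is a minimal sketch, assuming the 20 observations as listed in the data table, with orders as x and advertising cost as y; `least_squares` is a hypothetical helper name.

```python
# (advertising cost in $, number of orders) for months 1-20, from the data table
data = [
    (74430, 2856000), (62620, 1800000), (67580, 1299000), (53680, 1510000),
    (69180, 1367000), (73140, 2611000), (85370, 3788000), (76880, 2935000),
    (66990, 1955000), (77230, 3634000), (61380, 1598000), (62750, 1867000),
    (63270, 1899000), (86190, 3245000), (60030, 1934000), (79210, 2761000),
    (67770, 1625000), (84530, 3778000), (79760, 2979000), (84640, 3814000),
]

def least_squares(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

orders = [n_orders for _, n_orders in data]  # x: independent variable
costs = [cost for cost, _ in data]           # y: dependent variable

slope, intercept = least_squares(orders, costs)
print(f"y = {slope:.8f}x + {intercept:.2f}")
```

Because the line always passes through the point of means, the intercept follows directly from the slope once the two averages are known.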


d. Predict the monthly advertising cost when the number of orders is 2,300,000. (Hint: Be very careful with assigning the dependent variable for this problem)

y = dependent variable being estimated. In part d, Advertising Costs are forecasted; hence, Advertising Costs are the dependent variable.

y = 0.00971950x + 47895

y (Advertising Costs) = 0.00971950(2300000) + 47895

Monthly Advertising Cost (When x = 2,300,000 orders): $70,250
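The arithmetic behind this prediction can be written out as a small sketch. The constants are the rounded coefficients from the regression equation used in this part; `predict_advertising_cost` is a hypothetical helper name.

```python
SLOPE = 0.00971950   # b1: dollars of advertising cost per order
INTERCEPT = 47895    # b0: rounded intercept, in dollars

def predict_advertising_cost(orders):
    """Predicted monthly advertising cost (y) for a given number of orders (x)."""
    return INTERCEPT + SLOPE * orders

predicted = predict_advertising_cost(2_300_000)
print(f"${predicted:,.2f}")  # $70,249.85, which rounds to $70,250
```

The slope term contributes 0.00971950 × 2,300,000 = 22,354.85, and adding the intercept of 47,895 gives 70,249.85.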

e. Compute the coefficient of determination, r², and interpret its meaning.

R² = 0.776 = share of total variation (SS Total) explained by the regression equation (SSR).

77.6% of the total variation in Advertising Costs (y) is explained by the number of orders received (x). The data are therefore scattered around the least squares regression line, and there will be error in the predictions (actual vs. predicted y's). The remaining 22.4% of the total variation in the dependent variable is error/residual (unexplained) variation: the dispersion of actual (y)'s around the predicted (y)'s on the linear regression line.
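The coefficient of determination can be recovered from the sums of squares in the ANOVA section of the regression summary output. A minimal sketch, with the three constants copied from that output:

```python
SSR = 1_380_434_618  # regression (explained) sum of squares
SSE = 398_383_837    # residual (unexplained) sum of squares
SST = 1_778_818_455  # total sum of squares

r_squared = SSR / SST    # share of variation explained by the regression
unexplained = SSE / SST  # share left in the residuals

print(f"R^2 = {r_squared:.4f}, unexplained = {unexplained:.4f}")
```

The two shares add up to 1 because SST = SSR + SSE.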

f. Compute the standard error of estimate, and interpret its meaning.

Sy.x = standard error of estimate for y (advertising costs, the dependent variable) for a given value of x (number of orders).

Sy.x (Excel STEYX) = 4704.51, i.e. about $4,704.51, or 4.70451 when expressed in thousands of dollars.

The standard error of estimate for the predicted y-values in this regression is $4,704.51. This implies that actual monthly advertising costs typically differ from the forecasted costs by roughly $4,700.

The predicted value of the dependent variable lies on the regression line at the given x-value; however, an actual data point may be above or below that line.

Standard error of estimate (SEE): A measure of how inaccurate an estimate might be. It is essentially the standard deviation of the errors (or residuals), where the error or residual is the difference between the actual (y) and the predicted (y). It therefore measures how well the regression line represents the scattered data: the scatter/dispersion of the observed values around the line of regression for a given value of (x).

The greater the dispersion, the larger the SEE. A larger sample would not necessarily shrink the SEE, but it would allow the SEE and the regression line to be estimated more precisely.
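The standard error of estimate reported by Excel can be reproduced from the residual sum of squares, using Sy.x = sqrt(SSE / (n - 2)). A minimal sketch, with the constants copied from the ANOVA section of the regression summary output:

```python
from math import sqrt

SSE = 398_383_837  # residual sum of squares
N = 20             # number of monthly observations

# Two degrees of freedom are used up by the slope and the intercept
see = sqrt(SSE / (N - 2))

print(f"Standard error of estimate = {see:,.2f}")
```

The value inside the square root, SSE / 18, is the residual mean square (MS) shown in the same ANOVA table.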

g. Do you think that the company should use these results from the regression to base any corporate decisions on? Explain fully.

Yes. SEE and r² are the best measures to evaluate the predictive ability of the regression equation.

The scatter plot and correlation coefficient (r) of 0.8809 indicate a strong positive correlation. A value of (r) near 1 indicates a direct or positive linear relationship between the two variables, advertising costs and number of orders, which supports using the model for prediction.

As for r², 77.6% of the variation in Advertising Costs (y) is explained by the number of orders received (x). However, 22.4% of the total variation in the dependent variable is error/residual (unexplained) variation: the dispersion of actual (y)'s around the predicted (y)'s on the linear regression line.

The standard error of estimate is $4,704.51 (4.70451 in thousands of dollars). This is fairly small relative to monthly advertising costs, which range from $53,680 to $86,190 in the data, and it is consistent with the large correlation coefficient (0.8809): the scattered points tend to be close to the linear regression line. For a given data set, the correlation coefficient and SEE are inversely related. Thus, as the strength of the linear relationship between the 2 variables increases, the SEE decreases.

Due to the high correlation between the independent and dependent variables, there is less erratic scatter/dispersion, indicating the regression equation is sufficient and accounts for over three-quarters (77.6%) of total variation. A larger sample, such as 3 or 4 years of data, would make these estimates more reliable.

This regression model can be used to predict values within the observed range of orders with reasonable confidence; the relationship is highly statistically significant (Significance F = 2.943E-07).


Hypothesis Testing on Multiple Populations

3. Dr. Michaella Evans, a statistics professor at the University of Maryland University College, drives from her home to the school every weekday. She has three options to drive there. She can take the Beltway, or she can take a main highway with some traffic lights, or she can take the back road, which has no traffic lights but is a longer distance. Being as data-oriented as she is, she is interested to know if there is a difference in the time it takes to drive each route.

As an experiment she randomly selected the route on 21 different days and wrote down the time it took her for the round trip, getting to work in the morning and back home in the evening.

At the .01 significance level, can she conclude that there is a difference between the driving times using the different routes?

Time (in minutes) it took to get to work and back using:

Beltway   Main highway   Back road
88        79             86
94        86             78
91        75             79
88        83             96
98        74             97
84        72             73
90                       68
77

You can check your critical value with the following table: http://www.statsoft.com/textbook/distribution-tables (Pg 391 & 751)

H0: μ1 = μ2 = μ3

H1: The mean driving times are not all equal

Level of significance = 0.01

Test statistic = F distribution

df in numerator = (k-1) or 3-1 = 2

df in denominator = (n-k) or 21-3 = 18

Appendix B.6 @ 0.01: F = 6.013 (intersection value)

Reject H0 if computed F > 6.013 (6.0129); equivalently, reject H0 if P-value < level of significance (0.01).

According to the Anova data analysis below, F (3.068) < 6.013 and P-value (0.071) > level of significance (0.01). Thus, we fail to reject H0 and cannot conclude that there is a difference between the driving times using the different routes. The P-value means that, if H0 were true, an F value at least this large would occur about 7.1% of the time, so rejecting H0 here would carry more risk of a Type I error than the 1% we allowed.

Since 3.0683 < 6.0129, the null hypothesis H0 should not be rejected. There is not enough evidence to conclude that the mean driving times differ between the three routes.

Anova: Single Factor (Single Driver, not multiple like in Two-Factor W/O Replication on pg 402)

SUMMARY

Groups         Count   Sum   Average    Variance
Beltway        8       710   88.75      40.21429
Main highway   6       469   78.16667   30.16667
Back road      7       577   82.42857   122.9524

ANOVA

Source of Variation   SS            df   MS         F          P-value       F crit
Between Groups        398.9047619   2    199.4524   3.068373   0.071341785   6.012905
Within Groups         1170.047619   18   65.00265

Total                 1568.952381   20
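The F statistic in the ANOVA table can be reproduced by hand. The sketch below makes one assumption: the way the 21 times are assigned to routes is reconstructed from the flattened data listing, chosen so that each group's count and sum match the SUMMARY table (710, 469 and 577). The helper name `one_way_anova` is hypothetical, and the critical value is taken from the table rather than computed.

```python
# Round-trip times in minutes; grouping reconstructed to match the SUMMARY table
routes = {
    "Beltway": [88, 94, 91, 88, 98, 84, 90, 77],
    "Main highway": [79, 86, 75, 83, 74, 72],
    "Back road": [86, 78, 79, 96, 97, 73, 68],
}

def one_way_anova(groups):
    """Return (SS between, SS within, F) for a single-factor ANOVA."""
    all_values = [value for group in groups for value in group]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    means = [sum(group) / len(group) for group in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    ms_between = ss_between / (k - 1)  # numerator df = k - 1
    ms_within = ss_within / (n - k)    # denominator df = n - k
    return ss_between, ss_within, ms_between / ms_within

ss_between, ss_within, f_stat = one_way_anova(list(routes.values()))

F_CRIT = 6.013  # critical F at the .01 level with (2, 18) df, from the table
reject_h0 = f_stat > F_CRIT

print(f"SSB = {ss_between:.4f}, SSW = {ss_within:.4f}, F = {f_stat:.4f}, reject H0: {reject_h0}")
```

Since the computed F is below the critical value, the script reaches the same decision as the table: H0 is not rejected at the .01 level.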