STATISTICS FOR BUSINESS STAT 371 Course Notes

SPRING 2011

JOCK [email protected]

Statistics 371 R.J. MacKay, University of Waterloo 2009

Index

Chapter 1     The Need for Statistics in Business
Chapter 2     Models linking explanatory and response variates
Chapter 3     Making Inferences from Regression Models
Chapter 4     The Analysis of Variance
Chapter 5     Assessing Model Fit
Chapter 6     Model Building
Chapter 7     Sample Survey Issues
Chapter 8     Probability Sampling
Chapter 9     Ratio and Regression Estimation with SRS
Chapter 10    Stratified Random Sampling
Appendix 1    R
Appendix 2    Properties of vectors and matrices of random variables
Appendix 3    Gaussian Quantile-Quantile Plots
Statistical Tables
Solutions to Exercises
Old Midterms and Exams

Please email me with any errors or points of clarification. These notes are a work in progress.

Data Sets

You can download all data sets in the notes and exercises from the file stat371.zip on the Angel course web page. You can access the individual files at the same site.


Chapter 1 The Need for Statistics in Business

"There is no substitute for knowledge." - W. Edwards Deming

"The greatest obstacle to discovery is not ignorance; it is the illusion of knowledge." - Daniel Boorstin

The purpose of Stat 371 and 372 is to provide a unified set of strategies and tools to apply Statistical Method in business and industry. In particular, the goal is to learn how to:

- pose clear questions
- collect the right data efficiently and effectively (a good plan)
- provide useful conclusions
- communicate the conclusions and the method by which they are reached to a nontechnical audience

Statistics, or better Statistical Method, is a powerful, widely applicable process that we can use to learn about business processes and markets (populations). Statistical Method is empirical, that is, based on observational and experimental investigations. By collecting and analyzing the right data, we can increase our knowledge of the market, the products and services we produce (and plan to produce) and the processes we use in this production. We may then use this knowledge to make better decisions to improve the business.

Example 1 The maker of frost-free refrigerators in temperate New Zealand decided to expand its market to tropical south-east Asia. There were immediately numerous complaints about frost build-up in the fridges from the new market. The company interviewed 25 recent purchasers in each of the two markets and found that there were large differences in ambient environmental conditions (temperature and humidity) and usage (frequency of door openings, amount of food introduced at one time) in the two markets (investigation 1). They were convinced that these factors were the cause of the frost build-up in the tropical market.

To solve the problem, they decided to try to redesign the fridge to make it more robust to ambient environmental conditions and usage factors. In an experimental investigation, they built 8 prototype fridges in which four design inputs were changed simultaneously. They then tested each prototype under two conditions defined by the extremes of the environmental and usage factors. The response variate was the temperature of the cooling plate in the fridge after 30 minutes of operation; low, constant values mean that there will be no frost build-up. The experimental plan and data are

                                                      Environmental Conditions
Treatment   D1         D2         D3         D4         Normal    Extreme
1           new        new        new        new          0.7       2.1
2           new        new        original   original     2.9       4.8
3           new        original   new        original     2.4       9.6
4           new        original   original   new          3.8       5.9
5           original   new        new        original     1.9       4.0
6           original   new        original   new         -0.2       0.1
7           original   original   new        new         -0.1       3.5
8           original   original   original   original     0.2       7.2
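As a rough illustration of how the experimental data above might be screened, the following R sketch enters the cooling plate temperatures and ranks the treatments. It assumes the sixteen values are read as (Normal, Extreme) pairs for treatments 1 to 8 in order; the summaries `average` and `spread` are hypothetical choices made for this sketch, not the analysis used by the company.

```r
# Cooling plate temperatures, assuming (Normal, Extreme) pairs per treatment
fridge <- data.frame(
  treatment = 1:8,
  normal    = c(0.7, 2.9, 2.4, 3.8, 1.9, -0.2, -0.1, 0.2),
  extreme   = c(2.1, 4.8, 9.6, 5.9, 4.0,  0.1,  3.5, 7.2)
)

# "Low, constant" values are wanted: a low average across the two
# conditions and a small change between them (illustrative summaries)
fridge$average <- (fridge$normal + fridge$extreme) / 2
fridge$spread  <- abs(fridge$extreme - fridge$normal)

# Rank treatments by sensitivity to conditions, then by average temperature
ranked <- fridge[order(fridge$spread, fridge$average), ]
print(ranked)

best_by_spread  <- fridge$treatment[which.min(fridge$spread)]
best_by_average <- fridge$treatment[which.min(fridge$average)]
```

Under this reading of the data, both summaries point to the same treatment, which is the kind of agreement one would hope to see before looking at costs.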

Looking at the data, we can see that there are several promising designs (e.g. treatment 6). After further analysis and a review of the costs, the company adopted the combination in treatment 6 as the new design. The complaints about frost build-up disappeared.

Example 2 Municipal taxes in Ontario are based on the market value of the property. Where possible, the market or assessed value is determined by predicting the market value of the property using the prices from recent sales of comparable properties. A property owner may choose to appeal the assessed value. A large company felt that the assessed value of its very large property (an automobile assembly plant) was too high. To argue its case, the company collected data on 38 large plants that had been sold in the last 10 years throughout Canada and the USA. The first few records are

size (sq ft/10^6)   age (years)   percent office   build/land ratio   location   value ($/sq ft)
0.848               35            5.8              26.6               usa        4.32
1.813               37            3.2              17.3               usa        6.74
1.297               50            19.0             45.1               usa        6.36
1.747               23            10.2             13.3               usa        5.95

The idea was to predict the value of the building in question using a model constructed from the data and the known values of the explanatory variates size, age etc. Here the prediction was a failure as there were many problems with the data and how it was collected.

We use PPDAC (Problem, Plan, Data, Analysis, Conclusion) to describe Statistical Method, the process we use to learn empirically. The purpose of each stage is:

Problem:      Develop clear questions about attributes of the population/process of interest
Plan:         Develop a plan to answer the questions posed
Data:         Execute the Plan to collect the required data
Analysis:     Analyze the data based on the Plan and a model to address the question
Conclusion:   Answer the questions and report uncertainties and limitations

The following should remind you of the language of PPDAC and how we apply the process.

[Figure: PPDAC diagram with the labels Target Population, Study Population, Sample, Measured variate values, (Model-based) analysis, Conclusions]

PPDAC is a process that we use to plan and execute empirical investigations so that we get reliable conclusions at a reasonable cost. There must be a good reason to undertake the investigation in the first place and a resolve to take action and make decisions based on the Conclusions. Governments are famous for avoiding decisions by saying that another study is required.

The two courses are organized by the nature of the Plan and the models used in the analysis. In Stat 371 we concentrate on applications of regression models and sample surveys. In Stat 372, we look at issues of data collected over time (time series, control charting) and the use of experimental plans.

Exercises

1. (A true story, believe it or not) To improve the shifting of the transmission, an automobile manufacturer organizes a clinic in which about 100 people evaluate the feel of 6 transmissions on different models from low to (very) high cost. Each person is asked to rate each transmission on several dimensions. The idea is to use the data to help design a new transmission that will have good feel, to improve the perceived quality of the vehicle and hence improve sales/market share. To save money in organizing the clinic, the company uses the engineers at its development center, of which 90% are males under the age of 35. What changes to this plan would you recommend? Why?
2. Write a brief description of the 6 Sigma program. Where does Statistical Method fit in 6 Sigma? What advantages and disadvantages can you see in an organization adopting such a program?
3. What is a software usability trial? What are two key issues in the design of such a trial? How does Statistical Method fit into a usability trial?
4. Give two examples of how you might use Statistical Method in market research.


Chapter 2 Models linking explanatory and response variates

In this chapter, we look at regression models and how to fit them to a set of data. Suppose we have a set of $n$ units selected from a population and, for each unit $i$, we have the values of the response variate $y_i$ and $p$ explanatory variates $x_{i1}, x_{i2}, \ldots, x_{ip}$. The statistical problem is to fit a regression model of the form

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + r_i$$

where the parameters $\beta_0, \beta_1, \ldots, \beta_p$ and the residuals $r_i$, $i = 1, \ldots, n$ are unknown. There are many applications of such models. We give three here.

Example 1 The CAPM model is used to measure the risk of a single asset relative to that of a portfolio. For example, suppose we want to assess the risk of an IBM share relative to the S&P 500 index. The theoretical CAPM model describes the excess return (actual return - risk-free return) for an IBM share over a period of time as a constant times the excess return of the portfolio. If we model the excess IBM return as a random variable $Y$ and the portfolio excess return as a random variable $X$, then we have $Y = \beta X$ and

$$\beta = \frac{\mathrm{stdev}(Y)}{\mathrm{stdev}(X)}$$

That is, the parameter $\beta$ measures the relative volatility of the IBM excess return. The common interpretation is that $\beta > 1$ corresponds to an asset riskier than the portfolio.

In many empirical applications, percentage returns are collected for the asset $y_i$ and the portfolio $x_i$ over a number of periods (e.g. days, months) and a linear model of the form $y_i = \beta_0 + \beta_1 x_i + r_i$ is fit to the data. The risk-free return is not included in the model. The month over month returns from Jan 2001 to March 2003 for IBM and the S&P 500 are given in the file IBM.txt. The variate names are sp.ret and ibm.ret. We see a scatterplot of the data on the next page, created with the R code

    plot(sp.ret,ibm.ret,xlab='SP500 return',ylab='IBM return',main='IBM vs. S&P 500 Monthly Returns')

From the plot, the model should provide a reasonable fit to the data.
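To make the link between the CAPM parameter and the two standard deviations concrete, here is a small sketch on simulated returns. Nothing in it comes from IBM.txt: the vectors `x` and `y`, the seed and the value `beta = 1.3` are assumptions chosen for illustration. It suggests that the ratio of standard deviations recovers the constant when $Y$ is an exact multiple of $X$, while in the empirical model with residuals the least squares slope equals the correlation times that ratio.

```r
# Simulated monthly excess returns (percent); NOT the IBM.txt data
set.seed(1)
beta <- 1.3                        # assumed relative volatility
x <- rnorm(27, mean = 0, sd = 5)   # portfolio returns for 27 months

# Theoretical CAPM: Y is an exact multiple of X
y_exact <- beta * x
ratio_exact <- sd(y_exact) / sd(x) # recovers beta

# Empirical version: intercept plus residual variation
y <- 0.2 + beta * x + rnorm(27, sd = 4)
fit <- lm(y ~ x)
slope <- unname(coef(fit)["x"])

# Least squares slope written in terms of sample moments
slope_from_moments <- cor(x, y) * sd(y) / sd(x)

print(c(ratio_exact = ratio_exact,
        slope = slope,
        slope_from_moments = slope_from_moments))
```

The ratio sd(y)/sd(x) alone would overstate the slope whenever the correlation is below 1, which is one reason the empirical work fits a regression rather than comparing standard deviations directly.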
The purpose of this modeling is to estimate an attribute of the population of monthly returns. There are many issues about the time period (months) and the sampling period. Note that we could fit a model that includes the risk-free return if the data were available. Since this return is small, measured on a monthly basis, the fit will not change markedly.

Example 2 In Chapter 1, we introduced the problem of determining the market value of a property that is not sold, using known explanatory variates and the market values of similar properties that were actual sales. The data are in the file assessment.txt. There are 38 units (large sales) with 5 explanatory variates size, age, office, ratio and location and the response variate value ($ per square ft). In this example, we first fit a regression model using the data from the actual sales. We then use the model to predict the market value of the property that was not sold. The values for the explanatory variates for the unsold building are size=13,825, age=21, percent office=3.8, building/land ratio=53, location=0 (Canada). There are issues about which properties to include in the data set and which explanatory variates to include in the model. In Ontario, there is a private organization that makes extensive use of regression modeling to provide market values to municipalities for all properties; these values provide the basis for property taxes. There are many applications of regression where the object is to predict the unknown response variate for a given set of values of the explanatory variates.

Example 3 A service organization has 24 offices. In the planning of an audit, the accountant looks at the stated overhead from the current and past year for each office. He also has access to the office size and age, the number of employees and clients, and the relative cost of living in the city where the office is located. The data are in the file analytic.txt. The auditor plans to fit a model relating overhead to the explanatory variates in order to look for outliers, that is, offices for which the relationship is different. He will devote more audit resources to any such office. This is an example of an analytic method in auditing. Another similar application is to look at salaries of employees relative to the work they do. The goal is to look for exceptional cases for which the relationship between the explanatory variates and the response is very different.

Fitting the Model: Least Squares
By "fitting the model", we mean that we estimate the unknown model parameters using the data. To do so, we represent the data model in terms of vectors and matrices. Let $y$ be an $n \times 1$ column vector containing the response variate values, $x_j$ a column vector containing the values of the $j$th ($j = 1, \ldots, p$) explanatory variate and $r$ a column vector of the unknown residuals. Also let $\mathbf{1} = (1, \ldots, 1)^t$ be a column vector of $n$ 1s and $X = (\mathbf{1}, x_1, \ldots, x_p)$ an $n \times (1 + p)$ matrix with columns corresponding to the explanatory variates. Finally, let $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^t$ be a $(1 + p) \times 1$ column vector of the unknown coefficients. Then we can write the model in terms of these vectors as

$$y = \beta_0 \mathbf{1} + \beta_1 x_1 + \cdots + \beta_p x_p + r = (\mathbf{1}, x_1, \ldots, x_p)\,\beta + r$$

or more compactly as

$$y = X\beta + r$$

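We have written $y$ as the sum of two vectors. We can picture the model in $R^n$ as shown below.

[Figure: the vector $y$ drawn as the sum of $\beta_0 \mathbf{1} + \beta_1 x_1 + \cdots + \beta_p x_p$, which lies in span$(\mathbf{1}, x_1, \ldots, x_p)$, and the residual vector $r$]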

The span$(\mathbf{1}, x_1, \ldots, x_p)$ is the subspace of $R^n$ spanned by the columns of $X$. We assume that this subspace has dimension $p+1$ or, equivalently, that the columns of $X$ are linearly independent. To fit the model, we use least squares. That is, we find the value for $\beta$ that minimizes the function

$$W(\beta) = \sum_i (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2 = \|y - X\beta\|^2 = \|r\|^2$$

To minimize the squared length of $r$, we project $y$ orthogonally onto span$(\mathbf{1}, x_1, \ldots, x_p)$ as shown below.

[Figure: $y$ projected orthogonally onto span$(\mathbf{1}, x_1, \ldots, x_p)$, giving $\hat\beta_0 \mathbf{1} + \hat\beta_1 x_1 + \cdots + \hat\beta_p x_p$ in the subspace and the estimated residual vector $\hat r$ perpendicular to it]
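The least squares criterion can also be explored numerically. The sketch below, on simulated data with a single explanatory variate, minimizes the sum of squares $W$ with a general-purpose optimizer and compares the result with R's `lm()`. The data, the seed and the starting value are assumptions made for illustration only; in practice one would simply call `lm()`.

```r
# Simulated data for a model with one explanatory variate
set.seed(2)
n <- 30
x <- rnorm(n, sd = 5)
y <- 0.5 + 1.2 * x + rnorm(n, sd = 3)

# Least squares criterion: sum of squared residuals for a trial beta
W <- function(beta) sum((y - beta[1] - beta[2] * x)^2)

# Minimize W numerically from an arbitrary starting value
opt <- optim(c(0, 0), W, method = "BFGS")

# Compare with the built-in least squares fit
fit <- lm(y ~ x)
print(rbind(numerical = opt$par, lm = coef(fit)))

# The minimized value of W is the residual sum of squares
rss_lm <- sum(resid(fit)^2)
print(c(W_min = opt$value, rss_lm = rss_lm))
```

Because $W$ is a quadratic function of the coefficients, the optimizer and the projection argument lead to the same minimizer, up to numerical tolerance.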

The estimated residual vector $\hat r = y - X\hat\beta$ is orthogonal to span$(\mathbf{1}, x_1, \ldots, x_p)$ or, equivalently, to every column of $X$. That is, we have

$$\mathbf{1}^t \hat r = 0, \quad x_1^t \hat r = 0, \quad \ldots, \quad x_p^t \hat r = 0$$

We can write these equations more compactly as $X^t \hat r = 0$. Substituting for $\hat r$, we get $X^t (y - X\hat\beta) = 0$ and, after rearrangement,

$$\hat\beta = (X^t X)^{-1} X^t y$$

Note that $X^t X$ has an inverse because we assume that $X$ has full rank (i.e. $p+1$ linearly independent columns). We label the projection (called the vector of fitted values) as $\hat\mu$, so

$$\hat\mu = \hat\beta_0 \mathbf{1} + \hat\beta_1 x_1 + \cdots + \hat\beta_p x_p = X\hat\beta = X (X^t X)^{-1} X^t y = Hy$$

and the estimated residual by

$$\hat r = y - X\hat\beta = (I - H)y$$

where the matrix $H = X (X^t X)^{-1} X^t$ is called the hat-matrix and is the projection onto the subspace span$(\mathbf{1}, x_1, \ldots, x_p)$. $H$ has several interesting properties; see the exercises. Note that we have decomposed the vector $y = Hy + (I - H)y$ into two orthogonal components.

Example We use R to fit the empirical CAPM model to the IBM returns vs S&P 500 returns. The following code produces the given output.
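The sketch below is not the notes' own code for this example. It is a hedged stand-in that checks the matrix results above on simulated returns: the vectors `sp.ret` and `ibm.ret` reuse the variate names from IBM.txt, but their values are generated here, and the names `beta_hat`, `mu_hat` and `r_hat` are labels chosen for this sketch.

```r
# Simulated monthly returns; NOT the values in IBM.txt
set.seed(3)
n <- 27
sp.ret  <- rnorm(n, sd = 5)
ibm.ret <- 0.3 + 1.4 * sp.ret + rnorm(n, sd = 6)

y <- ibm.ret
X <- cbind(1, sp.ret)                  # columns: vector of 1s, explanatory variate

# Least squares estimate from the normal equations
XtX_inv  <- solve(t(X) %*% X)
beta_hat <- XtX_inv %*% t(X) %*% y

# Hat matrix, fitted values and estimated residuals
H      <- X %*% XtX_inv %*% t(X)
mu_hat <- H %*% y
r_hat  <- y - mu_hat

# Estimated residuals are orthogonal to every column of X
print(t(X) %*% r_hat)                  # zero up to rounding error

# Same coefficients as R's built-in fit
fit <- lm(ibm.ret ~ sp.ret)
print(cbind(normal_equations = drop(beta_hat), lm = coef(fit)))
```

Forming the inverse explicitly keeps the code close to the formulas; `lm()` itself uses a QR decomposition, which is numerically more stable but gives the same estimates here.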

[... text missing in the source ...]

                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 20.34243    3.00833   6.762 6.75e-08 ***
    age         -0.45498    0.09971  -4.563 5.66e-05 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 6.084 on 36 degrees of freedom
    Multiple R-Squared: 0.3664, Adjusted R-squared: 0.3488
    F-statistic: 20.82 on 1 and 36 DF, p-value: 5.665e-05

The residual sum of squares is $36(6.084)^2 = 1332.542$ with 36 degrees of freedom.

Step 3: The second estimate of $\sigma^2$, assuming the hypothesis is true, is

$$\frac{1332.542 - 1149.314}{5 - 1} = 45.81$$

Step 4: The discrepancy measure is

$$f = \frac{45.81}{35.92} = 1.275$$

Step 5: Using the R function 1-pf(1.275, 4, 32) or the Tables in the Appendix, we see that $\Pr(F_{4,32} \ge 1.275) = 0.30$. Since the p-value is so large, there is no evidence against the hypothesis. In other words, there is no evidence that the model is improved by adding all of the other explanatory variates, once age is included in the model.

Notes

1. F distribution and F tables
Mathematically, an $F_{\mathrm{num},\mathrm{den}}$ random variable with num and den degrees of freedom is defined as

$$F = \frac{K_{\mathrm{num}}^2 / \mathrm{num}}{K_{\mathrm{den}}^2 / \mathrm{den}} \left( = \frac{\chi^2_{\mathrm{num}} / \mathrm{num}}{\chi^2_{\mathrm{den}} / \mathrm{den}} \right)$$

where the numerator and denominator are independent. You may have wondered at Step 2 why we could not have used the estimate of $\sigma^2$ produced from fitting the reduced model directly, rather than the estimate in Step 3 based on the change in the residual sum of squares. The reason is that by subtraction, we get independent estimators, which are required for the F distribution. An F random variable is always positive and has mean close to 1. There are tables (in the same format as the t tables) in the Appendix. For each tail probability, there is one page of tables with a column for the numerator degrees of freedom and a row for the denominator degrees of freedom.

2. Calculations with R
We can use R to perform all of the calculations. We fit both the full model and the reduced model and then apply the anova() function. For Example 1, the code is b
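The notes' own code is cut off at this point in the source. As a hedged sketch only, the first part below reproduces Steps 3 to 5 from the sums of squares quoted in the text, and the second part shows the `anova()` comparison of a reduced and a full model on simulated data. The data frame `sim`, its variate names and the seed are hypothetical; they are not the assessment.txt data.

```r
# Part 1: Steps 3-5 from the quoted residual sums of squares
rss_reduced <- 1332.542   # model with age only, 36 degrees of freedom
rss_full    <- 1149.314   # full model, 32 degrees of freedom
df_reduced  <- 36
df_full     <- 32

num_df  <- df_reduced - df_full                 # 4 added variates
est_num <- (rss_reduced - rss_full) / num_df    # estimate from the change in RSS
est_den <- rss_full / df_full                   # estimate from the full model
f_obs   <- est_num / est_den                    # discrepancy measure
p_value <- 1 - pf(f_obs, num_df, df_full)       # upper tail of F(4, 32)
print(c(f = f_obs, p_value = p_value))

# Part 2: the same comparison via anova() on simulated (hypothetical) data
set.seed(371)
n <- 38
sim <- data.frame(age    = runif(n, 5, 60),
                  size   = runif(n, 0.5, 3),
                  office = runif(n, 2, 20))
sim$value <- 20 - 0.4 * sim$age + rnorm(n, sd = 6)

reduced <- lm(value ~ age, data = sim)
full    <- lm(value ~ age + size + office, data = sim)
tab <- anova(reduced, full)   # F test for the variates added to the reduced model
print(tab)

# anova() uses the same ratio as Steps 3 and 4
f_manual <- ((deviance(reduced) - deviance(full)) / 2) /
  (deviance(full) / df.residual(full))
```

For a fitted `lm` object, `deviance()` returns the residual sum of squares, so `f_manual` is the Step 4 ratio computed directly from the two fits.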