Regression Analysis Project



Life Expectancy by Country

Regression Analysis Project

Michael Wallace & Brandon Berube

12/12/2013

For our project, my partner, Brandon Berube, and I decided to use regression analysis to see which variables affect how long a person will live, depending on the country they inhabit. We used various variable screening methods to build an effective model that allowed us not only to make inferences about how long the people of a country will live, but also to show how accurate our model is.

We worked with 20 randomly selected countries.

- We selected the countries by numbering every country from 1 to 213 and using Random.org to pick 20 random numbers, each of which corresponds to a country.

1. Iraq
2. Mongolia
3. Lesotho
4. Canada
5. Mauritius
6. Oman
7. Samoa
8. Mali
9. Bangladesh
10. Suriname
11. Tonga
12. Qatar
13. Bulgaria
14. Micronesia, Fed. Sts.
15. Spain
16. Trinidad And Tobago
17. Papua New Guinea
18. Tanzania
19. Austria
20. Sao Tome & Principe
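The selection step can be sketched in a few lines of Python. This is only an illustration: we used Random.org for the real draw, whereas the sketch below swaps in Python's built-in pseudo-random generator, and the function name `pick_countries` and the seed value are assumptions made for the example.

```python
import random

def pick_countries(n_countries=213, sample_size=20, seed=None):
    """Draw `sample_size` distinct country numbers from 1..n_countries."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    # random.sample draws without replacement, so no country is picked twice
    return sorted(rng.sample(range(1, n_countries + 1), sample_size))

# Hypothetical seed, chosen only so the example is reproducible
picks = pick_countries(seed=2013)
print(picks)
```

Each number in `picks` would then be looked up in the numbered country list to get the sample.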

- We picked the variables that we assumed would affect life expectancy the most.

Response Variable (Y): Life Expectancy at Birth, Male and Female

Life expectancy at birth indicates the number of years a newborn infant would live if prevailing patterns of mortality at the time of its birth were to stay the same throughout its life.

http://data.worldbank.org/indicator/SP.DYN.LE00.IN

X1 : Access to improved sanitation facilities (measured as % of total population)

Percentage of the population using improved sanitation facilities, which are flush/pour facilities.

http://data.worldbank.org/indicator/SH.STA.ACSN.UR/countries/1W?display=default

X2 : Health expenditure per capita (measured in current US dollars)

Sum of public and private health expenditures as a ratio of total population.

http://data.worldbank.org/indicator/SH.XPD.PCAP

X3 : Access to improved water source (measured as % of total population)

Percentage of the population using an improved drinking water source. Improved water includes piped water, protected dug wells, and protected springs.

http://data.worldbank.org/indicator/SH.H2O.SAFE.ZS

X4 : Food Production Index

Food production index covers food crops that are considered edible and that contain nutrients.

http://data.worldbank.org/indicator/AG.PRD.FOOD.XD

X5 : Air Quality (CO2 emissions, measured in kt)

Carbon dioxide emissions are those stemming from the burning of fossil fuels and the manufacture of cement.

http://data.worldbank.org/indicator/EN.ATM.CO2E.KT/countries

Coded as 1 if greater than or equal to 62938.95, 0 if less than 62938.95.

- NOTE*: These variables are not listed in order of importance. Intuition is no substitute for a proper mathematical procedure, so until we apply stepwise regression we cannot determine which variables are useful.

- We then entered the data for our variables. Some countries did not have the most recent information, so to keep a sample size of 20 we disregarded the countries that did not have complete data and picked replacement countries randomly.

1. To carry out a complete regression analysis, we disregard any country without a complete data set.

- To clarify our variables, we write them as:

X1 = Sanitation Score
X2 = Healthcare
X3 = Water Quality
X4 = Food Production Index
X5 = Air Quality (1 if greater than or equal to 62938.95, 0 if less than 62938.95)

Model One: Linear First Order

We begin with a basic first-order model to establish a starting point.

Ho: β1 = β2 = β3 = β4 = 0 (Model Isn’t Useful)

Ha: At least one of β1, β2, β3, β4 ≠ 0 (Model Has Utility)

The fitted model is:

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4

Where

Y = Life Expectancy
β0 = Intercept
β1 = Coefficient for sanitation score
β2 = Coefficient for healthcare (health expenditure per capita)
β3 = Coefficient for access to improved water source
β4 = Coefficient for food production index

The regression equation is

Life Expectancy = 36.2 + 0.233 Sanitation Score + 0.00139 Healthcare + 0.0632 Water Quality + 0.0763 Food

Predictor          Coef        SE Coef     T      P
Constant           36.16       11.79       3.07   0.008
Sanitation Score   0.23250     0.05535     4.20   0.001
Healthcare         0.0013937   0.0006336   2.20   0.044
Water Quality      0.06319     0.08017     0.79   0.443
Food               0.07626     0.07634     1.00   0.334

S = 4.02304 R-Sq = 83.3% R-Sq(adj) = 78.9%

Analysis of Variance

Source           DF   SS        MS       F       P
Regression       4    1212.43   303.11   18.73   0.000
Residual Error   15   242.77    16.18
Total            19   1455.20
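The summary figures in the output above follow directly from the sums of squares in the Analysis of Variance table. The short sketch below recomputes F, S, and R-Sq for Model One as a check; the function name `anova_summary` is ours, chosen only for this illustration.

```python
import math

def anova_summary(ss_regression, ss_error, df_regression, df_error):
    """Recompute F, S, and R-Sq from the rows of an ANOVA table."""
    ss_total = ss_regression + ss_error
    ms_regression = ss_regression / df_regression  # mean square for regression
    ms_error = ss_error / df_error                 # mean square error (MSE)
    return {
        "F": ms_regression / ms_error,         # global F statistic
        "S": math.sqrt(ms_error),              # standard error of the regression
        "R-Sq": ss_regression / ss_total,      # share of variation explained
    }

# Values taken from the Model One Analysis of Variance table
model_one = anova_summary(1212.43, 242.77, 4, 15)
for name, value in model_one.items():
    print(name, round(value, 4))
```

The recomputed values agree with the reported F = 18.73, S = 4.02, and R-Sq = 83.3% up to rounding.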

Important observations:

1. Our overall model’s p-value is reported as .000, that is, less than .001.
2. Our adjusted R² is 78.9%.
3. We reject our H0, which indicates that our model is useful in predicting life expectancy.
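To show how the first-order equation would be used for prediction, here is a minimal sketch. The coefficients come from the Model One output; the input values for the example country are hypothetical and are not taken from our data set, and the function name is ours.

```python
# Coefficients from the Model One regression output
MODEL_ONE = {
    "intercept": 36.16,
    "sanitation": 0.23250,
    "healthcare": 0.0013937,
    "water": 0.06319,
    "food": 0.07626,
}

def predict_life_expectancy(sanitation, healthcare, water, food, coef=MODEL_ONE):
    """Point estimate of life expectancy (years) from the first-order model."""
    return (coef["intercept"]
            + coef["sanitation"] * sanitation
            + coef["healthcare"] * healthcare
            + coef["water"] * water
            + coef["food"] * food)

# Hypothetical country: 80% sanitation access, $500 health spending per capita,
# 90% improved water access, food production index of 110
example = predict_life_expectancy(sanitation=80.0, healthcare=500.0,
                                  water=90.0, food=110.0)
print(round(example, 2))
```

This gives a point estimate only; with S of about 4 years, an individual country can be expected to fall several years above or below it.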

Model Two: Higher Order Model with Qualitative variable

When we plotted the data on scatter plots, we believed we could see some possible trends, so we decided to incorporate higher-order terms.

Ho: β1 = β2 = β3 = β4 = β5 = β6 = β7 = β8 = 0 (Model Isn’t Useful)

Ha: At least one of β1, ..., β8 ≠ 0 (Model Has Utility)

The fitted model is:

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X3X4 + β6X5 + β7X3² + β8X2²

Where

Y = Life Expectancy
β0 = Intercept
β1 = Coefficient for Sanitation Score
β2 = Coefficient for Healthcare (health expenditure per capita)
β3 = Coefficient for Access to Improved Water Source
β4 = Coefficient for Food Production Index
β5 = Coefficient for Interaction of Food Production Index and Water Quality
β6 = Coefficient for Air Quality
β7 = Coefficient for Water Quality Squared
β8 = Coefficient for Healthcare Squared
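As a rough illustration of how the interaction and squared terms are built from the five original variables, the sketch below assembles the eight predictor values that go with β1 through β8. The function name and the example inputs are hypothetical.

```python
def model_two_terms(x1, x2, x3, x4, x5):
    """Return the predictor values for Model Two, ordered to match β1..β8."""
    return [
        x1,        # sanitation score
        x2,        # healthcare
        x3,        # water quality
        x4,        # food production index
        x3 * x4,   # interaction of food production index and water quality
        x5,        # air quality indicator (0 or 1)
        x3 ** 2,   # water quality squared
        x2 ** 2,   # healthcare squared
    ]

# Hypothetical country values, for illustration only
terms = model_two_terms(x1=80.0, x2=500.0, x3=90.0, x4=110.0, x5=0)
print(terms)
```

Note how large the squared healthcare term becomes compared with the other columns, which is one reason its coefficient is so small.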

Note*: We are using Air Quality as a qualitative variable.

We calculated the average CO2 emissions of the sample and decided that it would make a good cutoff for a qualitative variable. The average was 62938.95, so any country with emissions greater than or equal to the average is represented by a ‘1’, and any country with emissions lower than 62938.95 is represented by a ‘0’.

- 1 if X ≥ 62938.95
- 0 if X < 62938.95
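The coding rule above can be sketched as follows. The emission figures in the example are made up for illustration and are not our sample; only the rule itself (1 at or above the sample mean, 0 below it) comes from our method.

```python
def air_quality_indicator(co2_kt, threshold):
    """Return 1 if emissions are at or above the threshold, otherwise 0."""
    return 1 if co2_kt >= threshold else 0

def code_by_sample_mean(emissions):
    """Use the sample mean as the cutoff and code every country against it."""
    threshold = sum(emissions) / len(emissions)
    return threshold, [air_quality_indicator(v, threshold) for v in emissions]

# Hypothetical CO2 emissions in kt, for illustration only
emissions = [1500.0, 120000.0, 40.0, 98000.0]
threshold, x5 = code_by_sample_mean(emissions)
print(threshold, x5)
```

A country exactly at the cutoff is coded 1, matching the "greater than or equal to" rule.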

The regression equation is

Life Expectancy = 67.7 + 0.185 Sanitation Score + 0.00080 Healthcare - 0.927 Water Quality + 0.078 Food + 0.00045 Food*Water + 1.28 Air Quality + 0.00686 Waterquality ^2 + 0.000000 Healthcare ^2

Predictor          Coef         SE Coef      T       P
Constant           67.70        53.76        1.26    0.234
Sanitation Score   0.18515      0.06870      2.69    0.021
Healthcare         0.000800     0.004100     0.20    0.849
Water Quality      -0.9275      0.8215       -1.13   0.283
Food               0.0784       0.4459       0.18    0.864
Food*Water         0.000445     0.005229     0.09    0.934
Air Quality        1.279        4.078        0.31    0.760
Waterquality ^2    0.006860     0.004546     1.51    0.160
Healthcare ^2      0.00000000   0.00000063   0.00    0.998

S = 4.24554 R-Sq = 86.4% R-Sq(adj) = 76.5%

Analysis of Variance

Source           DF   SS        MS       F      P
Regression       8    1256.93   157.12   8.72   0.001
Residual Error   11   198.27    18.02
Total            19   1455.20

Important observations:

1. Our overall model’s p-value is .001, which is an increase from the first model.
2. Our adjusted R² is 76.5%, which is lower than that of our initial first-order linear model (78.9%).
3. We reject our H0, which indicates that our model is useful in predicting life expectancy.

Model Three: Reduced Model with Qualitative variable

We now use a nested (reduced) model that drops the two squared terms from Model Two, aiming for a more straightforward model because our adjusted R² dropped.

Ho: β1 = β2 = β3 = β4 = β5 = β6 = 0 (Model Isn’t Useful)

Ha: At least one of β1, ..., β6 ≠ 0 (Model Has Utility)

The fitted model is:

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X3X4 + β6X5

The regression equation is

Life Expectancy = 49.0 + 0.228 Sanitation Score + 0.00131 Healthcare - 0.082 Water Quality - 0.033 Food + 0.00128 Food*Water + 0.47 Air Quality

Predictor          Coef        SE Coef     T       P
Constant           49.00       53.12       0.92    0.373
Sanitation Score   0.22756     0.06163     3.69    0.003
Healthcare         0.0013147   0.0009828   1.34    0.204
Water Quality      -0.0821     0.6107      -0.13   0.895
Food               -0.0325     0.4462      -0.07   0.943
Food*Water         0.001275    0.005276    0.24    0.813
Air Quality        0.470       3.673       0.13    0.900

S = 4.30674 R-Sq = 83.4% R-Sq(adj) = 75.8%

Analysis of Variance

Source           DF   SS        MS       F       P
Regression       6    1214.08   202.35   10.91   0.000
Residual Error   13   241.12    18.55
Total            19   1455.20

Y = Life Expectancy
β0 = Intercept
β1 = Coefficient for Sanitation Score
β2 = Coefficient for Healthcare (health expenditure per capita)
β3 = Coefficient for Access to Improved Water Source
β4 = Coefficient for Food Production Index
β5 = Coefficient for Interaction of Food Production Index and Water Quality
β6 = Coefficient for Air Quality
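Because the models are nested, a partial (nested-model) F statistic is one way to check whether the terms left out of a reduced model were contributing anything. We did not report this test, so the sketch below is an added illustration that uses only the residual sums of squares and degrees of freedom from the Analysis of Variance tables; the function name is ours.

```python
def nested_f(sse_reduced, df_error_reduced, sse_complete, df_error_complete):
    """Partial F statistic for comparing a reduced model to a complete one."""
    n_tested = df_error_reduced - df_error_complete  # number of terms dropped
    mse_complete = sse_complete / df_error_complete
    f_stat = ((sse_reduced - sse_complete) / n_tested) / mse_complete
    return f_stat, n_tested, df_error_complete

# Model Three (reduced) against Model Two (complete): tests the two squared terms
f_3_vs_2, num_df, den_df = nested_f(241.12, 13, 198.27, 11)
print(round(f_3_vs_2, 3), num_df, den_df)

# Model One (reduced) against Model Three (complete): tests Food*Water and Air Quality
f_1_vs_3, _, _ = nested_f(242.77, 15, 241.12, 13)
print(round(f_1_vs_3, 3))
```

Each statistic would be compared against an F critical value with the numerator and denominator degrees of freedom returned by the function. Values near or below 1 give no evidence that the extra terms improve the model.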

Important observations:

1. Our overall model’s p-value is reported as .000, which is a decrease from the second model (.001).
2. Our adjusted R² is 75.8%, which is lower than both our initial first-order linear model (78.9%) and our second model (76.5%).
3. We reject our H0, which indicates that our model is useful in predicting life expectancy.

Conclusion:

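In conclusion, we have developed an equation with a relatively high adjusted R². The first model has the highest adjusted R² of the three (78.9%, compared with 76.5% and 75.8%) and the smallest standard error (S = 4.02), so it is our best model for determining an equation for life expectancy.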
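The comparison behind this conclusion can be reproduced from the three Analysis of Variance tables. The sketch below recomputes adjusted R² for each model from its residual sum of squares and degrees of freedom; the function and variable names are ours.

```python
def adjusted_r_squared(ss_error, df_error, ss_total, df_total):
    """Adjusted R-squared: 1 - (SSE / df_error) / (SS_total / df_total)."""
    return 1 - (ss_error / df_error) / (ss_total / df_total)

SS_TOTAL, DF_TOTAL = 1455.20, 19  # the same for all three models

# (residual sum of squares, residual degrees of freedom) from each ANOVA table
models = {
    "Model One": (242.77, 15),
    "Model Two": (198.27, 11),
    "Model Three": (241.12, 13),
}

scores = {name: adjusted_r_squared(sse, dfe, SS_TOTAL, DF_TOTAL)
          for name, (sse, dfe) in models.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(name, f"{score:.1%}")
print("Highest adjusted R-squared:", best)
```

Although Model Two has the smallest residual sum of squares, the penalty for its four extra terms leaves it below Model One once the degrees of freedom are taken into account.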