
32931 ITRM Quantitative Labs

(using R)

You will have three 1.5 hour labs to work through the following material at your own pace. The material is divided by the lecture in which each topic is presented.

Why learn R/R Studio?

R is a versatile statistical language, which provides a wide and ever-increasing suite of statistical and graphical techniques. R is a programming language, which means it is not menu-driven like SPSS. All commands are case sensitive and are written and executed in the console window at the prompt. However, there are certain tasks which can be implemented through the menus, like installing new packages. Data in R are organised as named structures: vectors and matrices. They can be numerical data (like height and weight) or categorical factors (like gender and type of diet). R treats factors and numerical data differently, and can combine them in a data frame. Each vector must contain elements of only one type, while a data frame can contain columns of different types. R Studio is an integrated user interface for R. When you open up R Studio, it automatically runs R.
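As a minimal sketch of these structures (hypothetical values, not course data):

height <- c(172, 165, 180)          # a numeric vector
sex <- factor(c("M", "F", "M"))     # a categorical factor
survey <- data.frame(height, sex)   # a data frame combining both types
str(survey)                         # shows one numeric column and one factor column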

Contents

Lecture 1
    How to enter Questionnaire Data in Excel then read in R
    Descriptive Statistics

Lecture 2
    Setting up Hypothesis Test: Type I and Type II Error
    One Sample T-Test and Non-Parametric Equivalent
    One Sample Proportion Test
    Power and Sample Size (in Minitab)

Lecture 3
    ANOVA Test
    Multiple Comparisons Test
    Nonparametric Equivalent: Kruskal-Wallis Test
    Independent 2-Sample T-Test
    Nonparametric Equivalent: Mann-Whitney U Test

Lecture 4
    Chi-Square Test
    Regression Analysis

NOTE: Although we do not cover questionnaire design in the lectures, you will be required to enter data from a questionnaire in your assignment. To give you the practice and skills you need, we have incorporated data entry for questionnaires in the section on descriptive statistics.


Lecture 1

How to enter Questionnaire Data in Excel then read in R

In our “Attitudes towards the Library” survey data there are four variables. We will enter the data so that each row represents a case (person) or questionnaire, and the columns represent variables or questions. We will enter the data into Excel first and save the file as a .csv file named “Attitudes.csv”. Question 1, labelled Sex, is a categorical variable with 2 categories. Question 2, Service, is also a categorical variable, with 5 categories. Question 3 is a numerical variable, Visits. Question 4 is a multiple response variable. The most efficient way to code this variable is to allocate a column for each of the five options in Question 4. Within each column, we have a dichotomous variable – i.e. two possibilities only, Yes or No. Code Yes as 1 and No as 0 for each of these variables.
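With this coding, the first rows of the spreadsheet would follow this pattern (illustrative rows only, echoing the first two responses shown in the head output below):

Sex,Service,Visits,ClosedReserve,StudyAreas,SearchFacilities,JournalCollections,Photocopiers
M,Agree,2,1,0,0,0,0
F,Strongly Agree,5,0,1,1,1,1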

Open R Studio and set working directory

To set the working directory, click on Session > Set Working Directory > Choose Directory. Your working directory should be the folder in which you have stored the data file “Attitudes.csv”.

32931 Autumn 2017 Page 3 of 37

Read the .csv data file into R using the read.csv function

# Get and set your working directory (this is where you saved the file; use forward slashes in the path)
getwd()
setwd("C:/Users/Mahrita/Google Drive/32931 TRM/32931 Autumn 2017")

# Read the data into R
attitudes <- read.csv("Attitudes.csv", header = T)
# The head function previews the first few lines of the dataset
head(attitudes)

##   Sex        Service Visits ClosedReserve StudyAreas SearchFacilities
## 1   M          Agree      2           Yes         No               No
## 2   F Strongly Agree      5            No        Yes              Yes
## 3   F          Agree      0            No         No               No
## 4   M        Neutral      1           Yes        Yes              Yes
## 5   F          Agree      3            No        Yes              Yes
## 6   F        Neutral      2            No        Yes              Yes
##   JournalCollections Photocopiers
## 1                 No           No
## 2                Yes          Yes
## 3                 No           No
## 4                 No          Yes
## 5                 No          Yes
## 6                 No          Yes

# The str function looks at the structure of the variables in the dataset
str(attitudes)

Notice that Question 2 has five labels. You can label these as you choose, but the convention is to order this categorical variable, since it is ordinal. By default, R orders the levels alphabetically. We change the order so that Strongly Agree is the most positive response and Strongly Disagree is the least positive response.


# Encode the variable as categorical (an ordered factor) with the specified level order
attitudes$Service <- factor(attitudes$Service,
    levels = c("Strongly Agree", "Agree", "Neutral", "Disagree", "Strongly Disagree"),
    ordered = TRUE)

# A question mark in front of a function name opens the R help documentation for that function
?factor

Question 3 is a quantitative variable, and so we have no values to code.

Descriptive Statistics

Investigate each variable separately.

For example, consider Question 2, Service, which is a categorical variable. We can produce a frequency table using the table function, a pie chart using the pie function and a bar chart using the barplot function in R. What can you say using these displays?

# Frequency table
service.freq <- table(attitudes$Service)
service.freq

##    Strongly Agree             Agree           Neutral          Disagree 
##                 1                 4                 5                 2 
## Strongly Disagree 
##                 0

# Relative frequency table
prop.table(service.freq)

##    Strongly Agree             Agree           Neutral          Disagree 
##        0.08333333        0.33333333        0.41666667        0.16666667 
## Strongly Disagree 
##        0.00000000

# Pie chart
pie(service.freq, col = c("purple", "red", "green", "orange"))
title(main = "The library offers good service")


# Bar chart
barplot(service.freq, xlab = "Satisfaction", ylab = "Frequency", main = "The library offers good service")

We see that the most common responses were Neutral and Agree. We also see that none of the people surveyed strongly disagreed with the statement.

What percentage of respondents strongly agreed that the library offers good service?


We can see that 8.3% selected “strongly agree”.
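One way to read such percentages straight off the relative frequency table (a small convenience, not required for the lab):

# Express the relative frequencies as percentages, rounded to one decimal place
round(prop.table(service.freq) * 100, 1)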

For Question 3, Visits, which is a quantitative variable, we can look at the summary statistics using the summary function, a boxplot using the boxplot function in R and a histogram using the histogram function in the lattice package in R. Describe what you see.

# Summary statistics
summary(attitudes$Visits)

##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.000   1.000   2.000   2.167   3.000   5.000

# Boxplot
boxplot(attitudes$Visits, main = "Boxplot of Visits")

# Histogram (the lattice package must be installed first to use the histogram function)
install.packages("lattice")
library(lattice)
histogram(attitudes$Visits, main = "Histogram of Visits", xlab = "How many times did you visit?")

[Figures: a boxplot titled "Boxplot of Visits" (values ranging 0 to 5) and a lattice histogram titled "Histogram of Visits", x-axis "how many times did you visit?" (0 to 5), y-axis "Percent of Total" (0 to 50)]

It appears that respondents visited the library between zero and five times, with a mean of 2.17. The distribution of responses is reasonably symmetric.

Question 4 is a multiple response question. We will deal with this question in a separate section.

Investigate relationships between variables.

We noticed from our investigation of Question 2 that two out of the twelve people in the survey disagreed with the statement “The library offers good service”. We might like to investigate whether there are any differences between the opinions of males and females.


So we are looking at the relationship between two categorical variables: Service and Sex. We can produce a cross tabulation of these two categorical variables using the table function. We can also produce a clustered bar graph to represent this relationship using the barplot function. Describe what you see in the crosstab and the graph.

# Crosstabs
mytable <- table(attitudes$Sex, attitudes$Service)  # the first argument gives the rows, the second the columns
mytable

##    
##     Strongly Agree Agree Neutral Disagree Strongly Disagree
##   F              1     3       2        0                 0
##   M              0     1       3        2                 0

prop.table(mytable) # cell percentages

##    
##     Strongly Agree      Agree    Neutral   Disagree Strongly Disagree
##   F     0.08333333 0.25000000 0.16666667 0.00000000        0.00000000
##   M     0.00000000 0.08333333 0.25000000 0.16666667        0.00000000

prop.table(mytable, 1) # row percentages out of Sex

##    
##     Strongly Agree     Agree   Neutral  Disagree Strongly Disagree
##   F      0.1666667 0.5000000 0.3333333 0.0000000         0.0000000
##   M      0.0000000 0.1666667 0.5000000 0.3333333         0.0000000

prop.table(mytable, 2) # column percentages out of Service Level

##    
##     Strongly Agree Agree Neutral Disagree Strongly Disagree
##   F           1.00  0.75    0.40     0.00
##   M           0.00  0.25    0.60     1.00

# Clustered bar chart (beside = TRUE places the bars side by side rather than stacking them)
barplot(mytable, main = "Service vs. Sex", legend.text = TRUE, beside = TRUE)


From our table and graph we can see that in general the females were more positive than the males, and that the two respondents who disagreed with the statement were males.

Multiple Response Variables

We have to set up our five columns relating to Question 4 as a multiple response set.

Produce a frequency chart and a pie graph for the multiple response question. What can you tell? What is the difference between Percent of Responses and Percent of Cases?

# The following lines of code show how to construct your frequency table when you
# have multiple response variables (use the same code for your assignment if you have to do this)
# Find the total number of responses (= 34)
b = sum(attitudes[,4:8] != 0)
# Find the totals for each option
d = colSums(attitudes[,4:8] != 0)
# Find the number of respondents who ticked at least one option (= 11)
e = sum(rowSums(attitudes[,4:8]) != 0)
# This is your overall frequency table
a <- data.frame(freq = d, percentofresponses = (d/b)*100, percentofcases = (d/e)*100)
a

                     freq percentofresponses percentofcases
1 ClosedReserve         6           17.64706       54.54545
2 StudyAreas            7           20.58824       63.63636
3 SearchFacilities      7           20.58824       63.63636
4 JournalCollections    4           11.76471       36.36364
5 Photocopiers         10           29.41176       90.90909

pie(a$percentofresponses,
    labels = c("ClosedReserve", "StudyAreas", "SearchFacilities", "JournalCollections", "Photocopiers"),
    main = "Features used")

[Figure: pie chart "Features used" with slices ClosedReserve, StudyAreas, SearchFacilities, JournalCollections and Photocopiers]

Percent of responses refers to the proportion of selections (ticks) that were made for that feature (so number of selections for closed reserve ÷ total number of selections = 6/34 = 17.6%). Percent of cases refers to the proportion of respondents who selected that particular feature (so number of selections for closed reserve ÷ total number of “valid” respondents = 6/11 = 54.5%).

Say we would like a cross tabulation of answers to Question 4 (features) and answers to Question 1 (sex). Question 4 is our multiple response set, but Question 1 is an ordinary variable.

# Cross table of Sex vs. Features (where Features is a multiple response variable)
CR <- table(attitudes$Sex, attitudes$ClosedReserve)
SA <- table(attitudes$Sex, attitudes$StudyAreas)
SF <- table(attitudes$Sex, attitudes$SearchFacilities)
JC <- table(attitudes$Sex, attitudes$JournalCollections)
P <- table(attitudes$Sex, attitudes$Photocopiers)
# The next line combines all the Yes counts for each response in a data frame
mult.resp <- cbind(CR[,2], SA[,2], SF[,2], JC[,2], P[,2])
colnames(mult.resp) <- c("ClosedReserve", "StudyAreas", "SearchFacilities", "JournalCollections", "Photocopiers")
mult.resp

  ClosedReserve StudyAreas SearchFacilities JournalCollections Photocopiers
F             1          5                4                  2            5
M             5          2                3                  2            5

What percentage of males selected search facilities? What percentage of people who stated that they used the search facilities were male?

ofwhichsex <- table(attitudes$Sex, attitudes$SearchFacilities)
colnames(ofwhichsex) <- c("No", "Yes")
ofwhichsex

    No Yes
  F  2   4
  M  3   3

prop.table(ofwhichsex, 1)  # row percentages out of gender

           No       Yes
  F 0.3333333 0.6666667
  M 0.5000000 0.5000000

prop.table(ofwhichsex, 2)  # column percentages out of whether or not they used the facilities

           No       Yes
  F 0.4000000 0.5714286
  M 0.6000000 0.4285714

50% of males selected the search facilities. 42.9% of people who selected the search facilities were male.

Selecting Cases (Selecting parts of a dataframe)

In many situations we would like to be able to select a subset of the data for analysis. For example, question 4 is only relevant if the respondent has visited the library in the last week. We might like to be able to produce tables and graphs based just on the people who did visit the library in the last week. We can select certain rows, based on logical tests on the values of one or more variables.

attitudes[attitudes$Visits>0,] # To select the rows in which Visits is greater than 0

attitudes[attitudes$Visits==2,] # To select the rows in which Visits=2

attitudes[,1:3] # To select all the rows of the first three columns

attitudes[1:10,] # To select the first 10 rows for all columns
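These selections extend naturally; a short sketch of two common variations (the column range chosen for select is illustrative):

attitudes[attitudes$Visits>0 & attitudes$Sex=="F",] # Conditions combine with & (and) and | (or)
subset(attitudes, Visits > 0, select = Sex:Visits)  # The subset function does the same job with friendlier syntax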


Lecture 2

Setting up Hypothesis Test: Type I and Type II Error

A high school student is accused of cheating so the principal must investigate whether or not the accusation is true. If she finds the student guilty of cheating she will expel the student. If she finds no evidence of cheating the student will be allowed to graduate with their current grade.

a) State your null and alternative hypotheses.
H0: The student is not cheating
H1: The student is cheating

b) Describe a Type I error in this context.
Reject a true null hypothesis: the student is not cheating but the principal decides the student is. So the innocent student is expelled.

c) Describe a Type II error in this context.
Fail to reject a false null hypothesis: the student is cheating but the principal decides the student is not. So the guilty student is allowed to graduate.

One Sample T-Test and Non-Parametric Equivalent

These questions have been taken from P. Petocz, Statistical laboratory exercises using Minitab: a guide to understanding data, Jacaranda Wiley, 1998.

The Earth's atmosphere

The composition of the Earth's atmosphere has seen some major changes in the past. One source of information about these changes is bubbles of gas inside ancient amber, a tree resin that has hardened and been trapped in rocks. Measurements on specimens from the late Cretaceous era (75 to 95 million years ago) gave these percentages of nitrogen:

63.4 65.0 64.4 63.3 54.8 64.5 60.8 49.1 51.0

(Data from Berner, R. A. and Landis, G. P. 1988, 'Gas bubbles in fossil amber as possible indicators of the major gas composition of ancient air', Science, vol. 239, pp. 1406-9.)

Open R and load the lattice package to use the histogram function. Obtain the histogram, boxplot and a normal probability plot for Nitrogen. Comment on the normality, skewness and outliers in the data.

nitrogen <- c(63.4, 65, 64.4, 63.3, 54.8, 64.5, 60.8, 49.1, 51)
install.packages("lattice")
library(lattice)
histogram(nitrogen, main = "Histogram of Nitrogen", xlab = "Nitrogen")


boxplot(nitrogen, main = "Boxplot of Nitrogen")
summary(nitrogen)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  49.10   54.80   63.30   59.59   64.40   65.00

install.packages("moments")  # install first if you have not already
library(moments)
skewness(nitrogen)
[1] -0.7588322

qqnorm(nitrogen); qqline(nitrogen, col = 2)


It is difficult to be too confident of the distribution with so few observations, but the distribution appears to be left-skewed.

We can formally test whether the data are normally distributed by doing a one-sample KS test.

install.packages("nortest")library(nortest)lillie.test(moist)lillie.test(nitrogen)

Lilliefors (Kolmogorov-Smirnov) normality test

data:  nitrogen
D = 0.27906, p-value = 0.04183

H0: data are normally distributed
H1: data are not normally distributed
K-S = 0.279
Since p = 0.042 < 0.05, we reject the null hypothesis.
We conclude that the nitrogen levels are not normally distributed.

Suppose the normality assumption in the data had not been violated. The present value for nitrogen in the air is 78.1%. Is this significantly different from the late Cretaceous air? Write down null and alternative hypotheses. Then carry out the test and answer the question using t.test, choosing the appropriate alternative.

> t.test(nitrogen, mu = 78.1, alternative = "two.sided")

One Sample t-test

data:  nitrogen
t = -8.8778, df = 8, p-value = 2.049e-05
alternative hypothesis: true mean is not equal to 78.1
95 percent confidence interval:
 54.78065 64.39713
sample estimates:
mean of x 
 59.58889

H0: µ = 78.1
H1: µ ≠ 78.1
t = -8.878
Since p = 2.049e-05 ≈ 0.000 < 0.05, we reject the null hypothesis.
We conclude that the mean nitrogen level from the late Cretaceous period is significantly different from that of present day air.

Since the normality assumption in the data has been violated, the use of the one-sample t-test is deemed inappropriate. We have to use a nonparametric test, specifically a one-sample Wilcoxon signed rank test. The present value for nitrogen in the air is 78.1%. Is this significantly different from the late Cretaceous air? Write down null and alternative hypotheses. Then carry out the test and answer the question using wilcox.test, choosing the appropriate hypothesized median (mu) and alternative.

> wilcox.test(nitrogen, mu = 78.1, alternative = "two.sided")

Wilcoxon signed rank test

data:  nitrogen
V = 0, p-value = 0.003906
alternative hypothesis: true location is not equal to 78.1

H0: median = 78.1
H1: median ≠ 78.1
Since p = 0.004 < 0.05, we reject the null hypothesis. We conclude that the median nitrogen level from the late Cretaceous period is significantly different from that of present day air.

Another source of information about changes to the Earth's atmosphere is obtained by analysing bubbles from samples of ice from beneath the Arctic and Antarctic icecaps. Data for this section have been extracted from two graphs, one in Graedel, T. E. and Crutzen, P.J. 1989, 'The changing atmosphere', Scientific American, Sept., pp. 28-36; the other in Houghton, R. A. and Woodwell, G. M. 1989, 'Global climatic change', Scientific American, April, pp. 18-26.

Get the measurements from the data disk from the worksheet “methane.sav”. Methane values are given in parts per billion (ppb) for several time periods. Here, we will concentrate on the values in the variable M1-3th (methane levels 1000 to 3000 years ago). Use an appropriate graph to comment on the shape of the data. Test whether the data is normally distributed.

methane <- c(710,600,520,540,590,670,660,630,820,660,630,630,670,700,640,600,630,670,700,800,630,620,670,700,750,710,660,700)
methane
lillie.test(methane)

Lilliefors (Kolmogorov-Smirnov) normality test

data:  methane
D = 0.12478, p-value = 0.3199

By constructing a histogram we see that visually the data appear normally distributed. We then formally test whether the data are normally distributed by doing a one-sample KS test.

H0: data are normally distributed
H1: data are not normally distributed
K-S = 0.125
Since p = 0.3199 > 0.05, we do not reject the null hypothesis.
We conclude that the methane levels 1000-3000 years ago are normally distributed.


Find a 95% confidence interval for mean methane levels 1000 to 3000 years ago. Use t.test, and enter a test value of 0. Explain exactly what this interval represents. Would a 99% confidence interval be wider or narrower than the 95% interval?

t.test(methane, mu = 0, alternative = "two.sided", conf.level = 0.95)

One Sample t-test

data:  methane
t = 52.988, df = 27, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 635.4732 686.6696
sample estimates:
mean of x 
 661.0714

95% CI = (635.47, 686.67). We are 95% confident that the mean methane levels 1000 to 3000 years ago were between 635.47 and 686.67 ppb.

We would expect the 99% confidence interval to be wider than the 95% confidence interval, since to be more confident that the interval contains the population mean we must allow a wider range of values.

99% CI = (626.50, 695.64)

t.test(methane, mu = 0, alternative = "two.sided", conf.level = 0.99)

One Sample t-test

data:  methane
t = 52.988, df = 27, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
99 percent confidence interval:
 626.5050 695.6379
sample estimates:
mean of x 
 661.0714

The present methane level is about 1500 ppb, obviously a much higher value than in the past. But it is generally held that 10 000 years ago methane levels were about 700 ppb. Do these data indicate that levels of methane dropped before they rose to present levels? Write down null and alternative hypotheses, then carry out the test and answer the question. Use t.test, choosing the appropriate test value (mu = 700), a one-sided alternative and a 95% confidence level.

t.test(methane, mu = 700, alternative = "less")

One Sample t-test

data:  methane
t = -3.1203, df = 27, p-value = 0.002135
alternative hypothesis: true mean is less than 700
95 percent confidence interval:
     -Inf 682.3213
sample estimates:
mean of x 
 661.0714

H0: µ = 700
H1: µ < 700
t = -3.12
Because we specified the one-sided alternative, R reports the one-tailed p-value directly (no halving is needed). Since p-value = 0.002 < 0.05, we reject the null hypothesis.

We conclude that the levels of methane did drop before they rose to the present level.

One sample Proportion Test

Going back to the “Attitudes towards the Library” survey that we used last week (Attitudes.csv), the library manager assumes that greater than 50% of their users use the study rooms, and that they therefore should not scrap the feature and replace it with a multimedia room to cater for their users. To test whether the library manager is right, conduct a one sample proportion test with prop.test, selecting the appropriate hypothesis and alternative.

setwd("….")
attitudes <- read.csv("Attitudes.csv")
prop.test(sum(attitudes$StudyAreas), length(attitudes$StudyAreas), p = 0.5, alternative = "greater")

1-sample proportions test with continuity correction

data:  sum(attitudes$StudyAreas) out of length(attitudes$StudyAreas), null probability 0.5
X-squared = 0.083333, df = 1, p-value = 0.3864
alternative hypothesis: true p is greater than 0.5
95 percent confidence interval:
 0.3201442 1.0000000
sample estimates:
        p 
0.5833333

H0: p = 0.5
H1: p > 0.5


Since P-value = 0.38 > 0.05, we do not reject the null hypothesis.

We do not have enough statistical evidence to conclude that the proportion of users who use the study rooms is greater than 50%. Therefore the library should scrap this feature and replace it with a multimedia room.
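With a sample of only 12 respondents, an exact binomial test is a common alternative to the approximate prop.test; a sketch using the same counts (7 of the 12 respondents used the study areas), which gives a very similar p-value (≈ 0.39):

binom.test(7, 12, p = 0.5, alternative = "greater")  # exact one-sample proportion test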

Power and Sample Size (in Minitab)

Design of Experiments is the area of statistics that examines plans on how to gather data to achieve good (or optimal) inference. Sample size determination is just one aspect of good design of experiments. Here, we will focus on the question of how large does a sample need to be so that a hypothesis test will have a low p-value if a certain alternative hypothesis is true?

When setting up a study of our own, we need to select an appropriate number of samples to collect. If we collect too few samples, the test has low power: we may fail to detect a real effect, and a non-significant result will be inconclusive. If we collect too many samples, then we are wasting time and money. The selection of sample size can be performed in several of the statistical packages, including Minitab (but base R does not draw power curves for us). We will use Minitab to select an appropriate sample size for each of these scenarios.

To open Minitab at home, go to http://www.itd.uts.edu.au/services_facilities/virtuallabs.html, and select Login Virtual Lab. Enter your student number and UTS Email password. Select the virtual lab folder, and then the Minitab icon. Minitab will then open once the client is started (which may take a while). If this does not work, then try searching for “web based power and sample size calculation” in your preferred search engine.

A researcher at a university wants to determine whether there has been an increase in the average cost per student of textbooks. Previous data suggest that the average cost per student per semester was $350 with a standard deviation of $80. The researcher considers that an increase of $50 to $400 per semester would be of practical importance, and he would like to be 90% sure of being able to detect this difference should it exist. Assuming a Type I error of 5% and a population standard deviation of $80, what sample size would he need? Go to Stat > Power and Sample Size > 1-Sample t, and enter 50 in Differences, 0.9 in Power and 80 in Standard deviation (leave Sample sizes blank). Select the Options button and select Greater than as the Alternative Hypothesis and a significance level of 0.05. Select OK twice.

n=24

Power and Sample Size

1-Sample t Test

Testing mean = null (versus > null)
Calculating power for mean = null + difference
Alpha = 0.05  Assumed standard deviation = 80

             Sample  Target
Difference     Size   Power  Actual Power
        50       24     0.9      0.907420


[Figure: Power Curve for 1-Sample t Test]

In R:

install.packages("pwr")
library(pwr)
?pwr.t.test  # syntax of this function
d = 50/80  # difference divided by standard deviation gives the effect size
pwr.t.test(d = d, power = 0.90, sig.level = 0.05, type = "one.sample", alternative = "greater")

One-sample t test power calculation

              n = 23.34488
              d = 0.625
      sig.level = 0.05
          power = 0.9
    alternative = greater
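Although the handout uses Minitab for power curves, an approximate curve can be traced in R by evaluating pwr.t.test over a range of sample sizes; a minimal sketch, reusing d from above:

n <- 5:50
power <- sapply(n, function(k)
    pwr.t.test(n = k, d = d, sig.level = 0.05, type = "one.sample", alternative = "greater")$power)
plot(n, power, type = "l", main = "Power Curve for 1-Sample t Test", xlab = "Sample Size", ylab = "Power")
abline(h = 0.9, lty = 2)  # the target power of 0.9 is reached at about n = 24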

A pollster wishes to predict the change in primary vote for a particular political party. At the last election, the party obtained 39% of the vote, and the pollster would like to detect a 2% change in vote (either way) with probability 90%. The pollster would like to have a 5% level of significance. Go to Stat > Power and Sample Size > 1 Proportion, and enter 0.41 in Alternative values of p, 0.9 in power and 0.39 in Hypothesized p. Select the Options button and select Not equal as the Alternative Hypothesis and a significance level of 0.05. Select OK twice.

n = 6291

Power and Sample Size

Test for One Proportion

Testing proportion = 0.39 (versus not = 0.39)
Alpha = 0.05

Alternative  Sample  Target
 Proportion    Size   Power  Actual Power
       0.41    6291     0.9      0.900012


[Figure: Power Curve for Test for One Proportion]

In R:

?pwr.p.test
h <- ES.h(0.39, 0.41)  # estimation of effect size
pwr.p.test(h = h, power = 0.90, sig.level = 0.05, alternative = "two.sided")

proportion power calculation for binomial distribution (arcsine transformation)

              h = 0.04082802
              n = 6303.466
      sig.level = 0.05
          power = 0.9
    alternative = two.sided

(The sample size estimate in Minitab differs slightly from R because Minitab uses a different formula for the effect size.)
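For reference, the pwr package's ES.h effect size is the difference of arcsine-transformed proportions, so the value above can be reproduced by hand:

2 * asin(sqrt(0.41)) - 2 * asin(sqrt(0.39))  # magnitude ≈ 0.0408, matching h above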


Lecture 3

These questions have been taken from P. Petocz, Statistical laboratory exercises using Minitab: a guide to understanding data, Jacaranda Wiley, 1998.

ANOVA Test

Flicker Rates

An individual's critical flicker frequency is the highest frequency at which the flicker in a flickering light source can be detected. At frequencies above the critical frequency, the light source appears to be continuous even though it is actually flickering. This investigation recorded critical flicker frequency and iris colour of the eye for 19 subjects. These results were published in “The effect of iris color on critical flicker frequency”. Journal of General Psychology (1973), 91-95.

Read in the data set flicker.csv. Obtain a multiple boxplot and numerical summaries of the critical frequency for each of the three iris colours using the boxplot and by functions. Summarise your observations.

# Flicker Rates - ANOVA test
setwd("….")
flicker <- read.csv("Flicker.csv", header = TRUE)
head(flicker)
str(flicker)
flicker$Colour <- factor(flicker$Colour)
# Multiple boxplot
levels(flicker$Colour) <- c("Brown", "Green", "Blue")
boxplot(flicker$Flicker ~ flicker$Colour, main = "Boxplot of Flicker")
?by
by(flicker$Flicker, flicker$Colour, summary)

flicker$Colour: Brown
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  23.70   24.72   25.35   25.59   26.42   27.90 
--------------------------------- 
flicker$Colour: Green
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  24.20   26.40   26.90   26.92   28.00   29.10 
--------------------------------- 
flicker$Colour: Blue
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  25.70   27.48   28.40   28.17   29.17   29.90 

by(flicker$Flicker, flicker$Colour, sd)

flicker$Colour: Brown
[1] 1.365323
--------------------------------- 
flicker$Colour: Green
[1] 1.843095
--------------------------------- 
flicker$Colour: Blue
[1] 1.527962

It appears that people with blue eyes have the highest critical frequency (mean = 28.2), followed by those with green eyes (mean = 26.9), and then brown eyes (mean = 25.6). It also appears that the three groups have similar variation in the critical frequencies observed.

Since there are more than 2 groups, if we want to compare the means between the groups we should undertake a one-way ANOVA test. Remember that the ANOVA test has three assumptions:
1. normality of group observations (K-S test on each group)
2. normality of residuals (K-S test on the residuals)
3. equality of variances (Levene’s test)

Test whether each group has normally distributed observations formally by using the Kolmogorov Smirnov test in R (lillie.test). Include the null and alternative hypothesis, p-value, decision and conclusion in your answer. Is analysis of variance appropriate in this case?

# Normality test of each group
library(nortest)
by(flicker$Flicker, flicker$Colour, lillie.test)

flicker$Colour: Brown

Lilliefors (Kolmogorov-Smirnov) normality test

data:  dd[x, ]
D = 0.16651, p-value = 0.7394

---------------------------------


flicker$Colour: Green

Lilliefors (Kolmogorov-Smirnov) normality test

data:  dd[x, ]
D = 0.18892, p-value = 0.8064

--------------------------------- 
flicker$Colour: Blue

Lilliefors (Kolmogorov-Smirnov) normality test

data:  dd[x, ]
D = 0.20144, p-value = 0.6245

For each group,

H0: The data is normally distributed
H1: The data is not normally distributed

Since p-value > α (0.05), we do not reject H0. We conclude that the critical frequencies for each colour group do not deviate significantly from normally distributed data.

We have met one of the assumptions for ANOVA.

Test whether the residuals are normally distributed. Include the null and alternative hypothesis, p-value, decision and conclusion in your answer. Is analysis of variance appropriate in this case?

# Normality test of residuals
mod <- lm(Flicker ~ Colour, data = flicker)
flicker$resids <- residuals(mod)
qqnorm(flicker$resids)
qqline(flicker$resids)
lillie.test(flicker$resids)

Lilliefors (Kolmogorov-Smirnov) normality test

data:  flicker$resids
D = 0.086228, p-value = 0.9693

H0: The residuals are normally distributed
H1: The residuals are not normally distributed
D = 0.086228
Since p-value = 0.9693 > α (0.05), we do not reject H0. We conclude that the residuals do not deviate significantly from normally distributed data.

We have met another assumption for ANOVA.


Test the homogeneity of variances using the Levene test in R (the leveneTest function in the car package). Is analysis of variance appropriate in this case?

# Test of Homogeneity of Variances
library(car)
leveneTest(Flicker ~ Colour, data = flicker, center = mean)

Levene's Test for Homogeneity of Variance (center = mean)
      Df F value Pr(>F)
group  2  0.0942 0.9106
      16

H0: The variances of the three groups are equal.
H1: The variances of the three groups are not all equal.

Since p-value = 0.9106 > α (0.05), we do not reject H0. We conclude that the variances of the critical frequencies are not significantly different between the iris colours.

We have met the final assumption for ANOVA. It is appropriate to carry out an ANOVA because all three assumptions (normality of group observations, normality of residuals, and equality of variances) have been met.

Write down the null and alterative hypotheses and carry out a test for the equality of the group means (ANOVA test). What is the result of the test?

# ANOVA test
mod <- lm(flicker$Flicker ~ flicker$Colour)
anova(mod)

Analysis of Variance Table

Response: flicker$Flicker
               Df Sum Sq Mean Sq F value  Pr(>F)  
flicker$Colour  2 22.997 11.4986  4.8023 0.02325 *
Residuals      16 38.310  2.3944

H0: μBrown = μGreen = μBlue

H1: Not all μi are equal.
Since p-value = 0.023 < α (0.05), we reject H0. We conclude that there is a significant difference in the mean critical frequencies amongst the three different iris colours.

Multiple Comparisons Test

Recall that if we reject the null hypothesis in an analysis of variance, we can use a multiple comparisons test to determine where the differences lie. Use the aov and TukeyHSD functions in R to see between which groups there are significant differences.

# Multiple Comparisons Test
aov.out <- aov(flicker$Flicker ~ flicker$Colour)
TukeyHSD(aov.out)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = flicker$Flicker ~ flicker$Colour)

$`flicker$Colour`


                 diff        lwr      upr     p adj
Green-Brown  1.332500 -0.9437168 3.608717 0.3124225
Blue-Brown   2.579167  0.4228360 4.735497 0.0183579
Blue-Green   1.246667 -1.1710626 3.664396 0.3994319

Since 0 does not lie in the 95% confidence interval for Blue-Brown (or p-value = 0.018 < 0.05), we conclude that there is a significant difference between the blue and brown eyes, but none of the other pairs of iris colours have significantly different mean critical frequencies.
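The same comparisons can also be inspected graphically; intervals that cross zero correspond to the non-significant pairs:

plot(TukeyHSD(aov.out))  # plots the three pairwise confidence intervals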

Nonparametric Equivalent: Kruskal-Wallis Test

Suppose one of the three assumptions (normality of group observations, normality of residuals, and equality of variances) had not been met. Then the ANOVA test would not be appropriate, and we should undertake the equivalent nonparametric test, the Kruskal-Wallis test.

# Nonparametric Equivalent - Kruskal-Wallis Test
kruskal.test(flicker$Flicker ~ flicker$Colour)

Kruskal-Wallis rank sum test

data:  flicker$Flicker by flicker$Colour
Kruskal-Wallis chi-squared = 7.1944, df = 2, p-value = 0.0274

H0: medBrown = medGreen = medBlue

H1: Not all medi are equal.

p-value = 0.027 < α(0.05) therefore we reject H0. We conclude that there is a significant difference in the median critical frequencies amongst the three different iris colours.

Independent 2-Sample T-Test

Crossbills data

On page 48 of the text Understanding data is a set of data on the lengths (in mm) of the left and right legs of a sample of common crossbills, some of them left-billed, some of them right-billed. The data were recorded to determine whether the method of feeding, i.e. the bill orientation, places more strain on one leg than the other.

Open the worksheet crossbill.csv to get the measurements. How do we know that the data are paired?

Each left leg length had a paired right leg length because each pair of legs belonged to a particular bird.

Compute the differences between the left and right leg lengths using the numeric expression lleg − rleg. Obtain a boxplot of diff to compare the left-billed crossbills and the right-billed crossbills. Write a brief summary of the information shown by this graph.

# Read the data into R
crossbill <- read.csv("Crossbill.csv")
head(crossbill)
crossbill$diff <- crossbill$lleg - crossbill$rleg
head(crossbill)
boxplot(crossbill$diff ~ crossbill$bill, main = "Boxplot of Crossbill")


It appears that the difference is positive for left-billed crossbills, meaning their left leg is longer than their right leg. The difference looks slightly negative for right-billed crossbills, meaning their right leg is longer than their left leg. It also appears that, with the exception of three outliers, the differences are less variable for right-billed crossbills than for the left-billed crossbills.

The first question to ask is whether the right and left legs of crossbills have significantly different lengths. Explain why this question must be answered for the two groups of crossbills separately. Then use a one-sample t-test to test whether the mean is zero for the left-billed birds, and repeat this process for the right-billed birds. Why do we use µ = 0 as the null hypothesis value? Should the alternative be one-sided or two-sided? What are your conclusions from the tests?

crossbill[crossbill$bill=="left",]
t.test(crossbill[crossbill$bill=="left",5], mu = 0, alternative = "two.sided")  # column 5 is diff

One Sample t-test

data: crossbill[crossbill$bill == "left", 5]t = 2.4864, df = 11, p-value = 0.03023alternative hypothesis: true mean is not equal to 095 percent confidence interval: 0.04305078 0.70694922

crossbill[crossbill$bill=="right",] t.test(crossbill[crossbill$bill=="right",5],mu=0,alternative="two.sided")

One Sample t-test

data: crossbill[crossbill$bill == "right", 5]t = -3.1168, df = 11, p-value = 0.009806alternative hypothesis: true mean is not equal to 095 percent confidence interval: -0.6540300 -0.1126367

We need to consider the two groups separately, since we suspect that they might have different signs for the difference. If we combine the groups, then we lose useful information. We need to use μ = 0 as the null hypothesis since μ = 0 represents the left and right legs being of equal length (left − right = 0). We choose a 2-sided alternative since we do not really have enough information about which leg we expect to be longer for which billed crossbill. The legs of the crossbills have significantly different lengths, both for the left-billed crossbills (p-value = 0.03 < α) and the right-billed crossbills (p-value = 0.0098 < α).
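Because each bird contributes a pair of leg measurements, the same one-sample results can be reproduced with a paired t-test; a sketch for the left-billed birds, assuming the column names lleg and rleg used above:

left <- crossbill[crossbill$bill == "left", ]
t.test(left$lleg, left$rleg, paired = TRUE)  # identical t, df and p-value to the one-sample test on diff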

Do the values of diff show differences between the two groups of crossbills? Write down null and alternative hypotheses for the question. Is the alternative hypothesis one-sided or two-sided? Write down the p-value, decision and conclusion.

First we need to look at Levene’s test to see whether we should assume the variances are equal or not.

leveneTest(diff ~ bill, data = crossbill, center = mean)

Levene's Test for Homogeneity of Variance (center = mean)
      Df F value Pr(>F)
group  1   0.713 0.4075

H0: σleft-billed = σright-billed

H1: σleft-billed ≠ σright-billed

F = 0.713
Since p-value = 0.408 > α (0.05), we can assume that the variances for the two groups are not significantly different. We can now carry out the two-sample t-test for the equality of the means.

t.test(diff~bill,data=crossbill)

Welch Two Sample t-test

data:  diff by bill
t = 3.8967, df = 21.144, p-value = 0.0008229
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.3537906 1.1628761

H0: μleft-billed = μright-billed

H1: μleft-billed ≠ μright-billed

t = 3.897
Since p-value = 0.001 < α (0.05), we reject H0. We can conclude that there is a significant difference between the mean values of diff for the two groups of crossbills.
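Note that R's t.test defaults to the Welch version shown above, which does not assume equal variances. Since Levene's test here supported equal variances, the classical pooled-variance test could also have been requested — a one-line variant for reference:

t.test(diff ~ bill, data = crossbill, var.equal = TRUE)  # pooled-variance two-sample t-test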

Nonparametric Equivalent: Mann-Whitney U Test

Test whether the two groups are normally distributed. Include the null and alternative hypothesis, p-value, decision and conclusion in your answer. Was the two sample independent t-test appropriate in this case? If not, use the appropriate equivalent nonparametric test:

by(crossbill$diff, crossbill$bill, lillie.test)

crossbill$bill: left

Lilliefors (Kolmogorov-Smirnov) normality test

data:  dd[x, ]
D = 0.30707, p-value = 0.002639


---------------------------------- 
crossbill$bill: right

Lilliefors (Kolmogorov-Smirnov) normality test

data:  dd[x, ]
D = 0.41087, p-value = 3.857e-06

H0: The data is normally distributed
H1: The data is not normally distributed
Since p-value = 0.003 < α (0.05), we reject H0. We conclude that the differences for the left-billed crossbills are not normally distributed.

H0: The data is normally distributed
H1: The data is not normally distributed
Since p-value ≈ 0.000 < α (0.05), we reject H0. We conclude that the differences for the right-billed crossbills are not normally distributed.

The equivalent nonparametric test to the independent t-test is the Mann-Whitney test. Use the wilcox.test function in R.

wilcox.test(diff~bill,data=crossbill)

Wilcoxon rank sum test with continuity correction

data:  diff by bill
W = 141, p-value = 7.096e-05
alternative hypothesis: true location shift is not equal to 0

H0: medianleft-billed = medianright-billed

H1: medianleft-billed ≠ medianright-billed

Since p-value ≈ 0.000 < α (0.05), we reject H0. We can conclude that there is a significant difference between the median values of diff for the two groups of crossbills.


Lecture 4

These questions have been taken from P. Petocz, Statistical laboratory exercises using Minitab: a guide to understanding data, Jacaranda Wiley, 1998.

Chi-Square Test

Aspirin use and heart attacks

The table below reports on the relationship between aspirin use and heart attacks in the Physicians' Health Study carried out in 1988. This double-blind study was carried out to determine whether aspirin taken regularly reduces mortality due to heart attacks.

          Fatal Attack   Non-Fatal Attack   No Attack
Placebo             18                171      10 845
Aspirin              5                 99      10 933

Enter the frequencies in R. See the code below for how to input two-way tables in R.

setwd("C:/Users/andy/Google Drive/32931 TRM/32931 Autumn 2017/Week 4")heart<-matrix(c(18,171,10845,5,99,10933),ncol=3,byrow=TRUE)colnames(heart)<-c("Fatal Attack","Non Fatal Attack","No Attack")rownames(heart)<-c("Placebo","Aspirin")heart <- as.table(heart)heart Fatal Attack Non Fatal Attack No AttackPlacebo 18 171 10845Aspirin 5 99 10933

Find the row and column percentages:

prop.table(heart, 1)  # % within treatment (row proportions)

        Fatal Attack Non Fatal Attack    No Attack
Placebo 0.0016313214     0.0154975530 0.9828711256
Aspirin 0.0004530217     0.0089698288 0.9905771496

prop.table(heart, 2)  # % within outcome (column proportions)

        Fatal Attack Non Fatal Attack No Attack
Placebo    0.7826087        0.6333333 0.4979796
Aspirin    0.2173913        0.3666667 0.5020204

Examine the row and column percentages, and write an informal summary of the information they tell you.

It appears that a greater proportion of people who took aspirin did not have a heart attack (99.1%) than those who took the placebo (98.3%).

In situations such as this, we need a method to compare proportions across all categories. The chi-square test can be used to investigate the association between the row and column variables. Carry out a chi-square test to investigate the association between aspirin use and heart attacks.


State your null and alternative hypotheses, give the value of the chi-square statistic, give the p-value, decision and report your conclusion.

summary(heart)

Number of cases in table: 22071 
Number of factors: 2 
Test for independence of all factors:

Chisq = 26.903, df = 2, p-value = 1.439e-06

H0: There is no association between the treatment and the outcome.
H1: There is an association between the treatment and the outcome.

Since the p-value = 1.439e-06 ≈ 0.000 < 0.05, we reject the null hypothesis. Therefore there is a statistically significant relationship between aspirin use and the incidence and severity of heart attacks.
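The same test can be run directly with the chisq.test function, which also exposes the expected counts under independence:

chisq.test(heart)            # same chi-square statistic and p-value as summary(heart)
chisq.test(heart)$expected   # expected counts under the null hypothesis of no association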


Regression Analysis

Brain and body weights of mammals

Is there a relationship between the brain weights and the body weights of mammals, and if so, what type of relationship? Data that can help answer this question were collected as part of a large study into sleep in mammals, and are found in Weisberg, S. 1985, Applied Linear Regression, 2nd edn, John Wiley & Sons. They show the average body weights (kg) and brain weights (g) for 62 mammals.

Retrieve the data from ‘mammals.csv’. Explain why it would be reasonable to use brain weight as the response and body weight as the predictor variable. Use plot to obtain a graph of brain weight versus body weight. What problem is apparent?

mammals <- read.csv("mammals.csv")
head(mammals)

        Name BrainWt  BodyWt
1 Arctic_fox  44.500   3.385
2 Owl_monkey  15.499   0.480
3     Beaver   8.100   1.350
4        Cow 423.012 464.983
5  Gray_wolf 119.498  36.328
6       Goat 114.996  27.660

plot(mammals$BodyWt, mammals$BrainWt, xlab = "Body Weight of Mammal (kg)", ylab = "Brain Weight of Mammal (g)", main = "Body Weight vs. Brain Weight of Mammals")

For a live animal, we can weigh the body but not the brain. Since we are interested in using an animal’s body weight to predict its brain weight — that is, being able to estimate the brain weight of live animals from their body weight — brain weight is the response and body weight is the predictor.


It appears that there are many points close to the origin (x = 0, y = 0) and a few points that are far away from the origin. These points far away from the origin will be very influential in determining the regression line (which is not good).

For data such as these, which are spread over several orders of magnitude, a useful transformation is the logarithm. Get the natural logarithms of the body weights, lnbod. Likewise, calculate lnbrain, the natural logarithms of the brain weights. Obtain a graph of lnbrain versus lnbod. Does a linear relationship seem reasonable?

lnbod <- log(mammals$BodyWt)
lnbrain <- log(mammals$BrainWt)
plot(lnbod, lnbrain, xlab = "Log Body Weight of Mammal (kg)", ylab = "Log Brain Weight of Mammal (g)", main = "Log Body Weight vs. Log Brain Weight of Mammals")

This is much improved. The linear relationship is very clear, and reasonably strong.

Use the lm function to obtain the least-squares regression line of lnbrain (dependent) on lnbod (independent). Look at the residual plots (at the end of the output). Do you think that this model fits the data well?

Weight <- lm(lnbrain ~ lnbod)
abline(Weight)  # adds the regression line to the scatterplot
summary(Weight)

Call:
lm(formula = lnbrain ~ lnbod)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.71550 -0.49228 -0.06162  0.43598  1.94833

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.13479    0.09604   22.23   <2e-16 ***
lnbod        0.75169    0.02846   26.41   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6943 on 60 degrees of freedom
Multiple R-squared: 0.9208,  Adjusted R-squared: 0.9195 
F-statistic: 697.4 on 1 and 60 DF, p-value: < 2.2e-16

plot(Weight)

Since the points in the normal probability plot seem to be close to the line, and the points on the residual vs. fits plot seem to be randomly scattered, we can say that the model fits the data well.

Using the output from previous question write down the equation of the regression line and interpret the coefficients.

The equation of the regression line is:
Log(Brain weight in g) = 2.135 + 0.752 × Log(Body weight in kg)
A one unit increase in log body weight will increase log brain weight by an average of 0.752 units. The intercept is the predicted log brain weight when log body weight is zero, i.e. when the body weight is 1 kg (not 0 kg, since the log of zero is undefined); in that case the predicted log brain weight is on average 2.135 units.
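Because the model is fitted on log scales, predictions must be exponentiated to return to grams; a sketch for a hypothetical 60 kg mammal (an illustrative value, not from the exercise):

# Predicted brain weight in grams for a body weight of 60 kg
exp(predict(Weight, newdata = data.frame(lnbod = log(60))))  # roughly 184 g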

Is there a significant relationship between the log of the brain weights and the log of the body weights?

H0: β = 0 (Regression is not significant)
H1: β ≠ 0 (Regression is significant)
F = 697.4
Since the p-value = 0.000 < 0.05, we reject the null hypothesis.


There is a significant relationship between the log of the mammals' body weight and the log of the mammals' brain weight.


Interpret the R2 and determine whether this model would be appropriate for making new predictions. Explain why.

92.1% of the variation in the log brain weight can be explained by the log body weight through the regression model.

This model is appropriate for making predictions within the range of the body weights sampled, since the regression is significant, the residual plots indicate that the model fits the data well, and R2 = 92.1% > 75%.
