> integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating...
Transcript of > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating...
> integrating R into > introductory statistics
Mine Çetinkaya-Rundel - Duke UniversityAndrew Bray - UCLA
JSM 2012 - August 2, 2012
> experience
> STA 101 AT DUKE
> first course in statistics for non-majors, mostly social sciences majors
> STA 101 AT DUKE
> first course in statistics for non-majors, mostly social sciences majors
> 80-120 students in lecture, 25 students / lab section
> STA 101 AT DUKE
> first course in statistics for non-majors, mostly social sciences majors
> 80-120 students in lecture, 25 students / lab section
> weekly lab sessions using R> labs designed for an interdisciplinary introductory course,
can be modified for discipline-specific courses> can also be used in a first data-analysis course for stats
majors, ideally by reducing step-by-step instructions
> STA 101 AT DUKE
> R
unlike most software designed specifically for courses at this level, R is> free and open-source> powerful and flexible> relevant beyond the introductory statistics
classroom
> WHY R?
> WHY NOT R?
> perceived challenge of teaching programming in addition to teaching statistical concepts> labs and activities that try to find the right balance
of standard and custom functions> consistent syntax highlighting helps
> WHY NOT R?
> perceived challenge of teaching programming in addition to teaching statistical concepts> labs and activities that try to find the right balance
of standard and custom functions> consistent syntax highlighting helps
> working with a command line tends to be more intimidating than traditional GUI based tools> GUI tools also have a learning curve > a user-friendly IDE (like RStudio)
> WHY NOT R?
> RSTUDIO
> what it helps resolve:> loading and viewing data > saving code> code history> workspace organization> plot history
> RSTUDIO
> what it helps resolve:> loading and viewing data > saving code> code history> workspace organization> plot history
> what still remains a challenge:> working with a command line
> RSTUDIO
> balance
> BALANCE
> teach coding as a way of introducing/reinforcing concepts, especially those that are otherwise difficult to convey without computation> simulations
> sampling distributions > confidence levels
> bootstrapping > randomization tests > ...
> BALANCE
> teach coding as a way of introducing/reinforcing concepts, especially those that are otherwise difficult to convey without computation> simulations
> sampling distributions > confidence levels
> bootstrapping > randomization tests > ...
> hide implementation issues that are outside the scope of the course
> BALANCE
> teach coding as a way of introducing/reinforcing concepts, especially those that are otherwise difficult to convey without computation> simulations
> sampling distributions > confidence levels
> bootstrapping > randomization tests > ...
> hide implementation issues that are outside the scope of the course
> minimize coding for repeated mechanics that can be unified
> BALANCE
concept: confidence levelsresample from the population many times and construct many confidence intervals (loops)
> TEACH
concept: confidence levelsresample from the population many times and construct many confidence intervals (loops)
> TEACH
Lab 4B: Foundations for statistical inference - Confidence levels
Sampling from Ames, Iowa
If you have access to data on an entire population, say the size of every house in Ames, Iowa, it’s straightforward to answer questions like, “How big is the typical house in Ames?” and “How much variation isthere in sizes of houses?”. If you have access to only a sample of the population, as is often the case, thetask becomes more complicated. What is your best guess for the typical size if you only know the sizes ofseveral dozen houses? This sort of situation requires that you use your sample to make inference on whatyour population looks like.
The data
In the previous lab we looked at the population data of houses from Ames, Iowa. Let’s start with loadingthat data set.
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
In this lab we’ll start with just a sample from the population, which is a more realistic situation. Specifically,this is a simple random sample of size 60. Note that the data set has information on many variables onthese houses, but for the first portion of the lab we’ll focus on the size of the house, represented by thevariable Gr.Liv.Area.
population <- ames$Gr.Liv.Area
sample <- sample(population, 60)
Exercise 1 Describe the distribution of your sample. What would you say is the “typical” sizewithin your sample? Also state precisely what you interpreted “typical” to mean.
Exercise 2 Now compare your distribution to your neighbor’s. Do they look similar? Are theyidentical? Why, or why not?
Confidence intervals
One of the most common ways to describe the typical or central value of a distribution is to use the mean.In this case we can calculate the mean age size our sample:
sample_mean <- mean(sample)
Return for a moment to the question that first motivated this lab: based on this sample, what can weinfer about the population? More specifically, what is our best estimate of the mean size of the houses in
This is a product of OpenIntro that is released under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported(http://creativecommons.org/licenses/by-nc-nd/3.0/). This lab was written for OpenIntro by Andrew Bray and Mine Cetinkaya-Rundel.
1
concept: confidence levelsresample from the population many times and construct many confidence intervals (loops)
> TEACHHere is the rough outline:
(1) Obtain a random sample.
(2) Calculate its means and standard deviation.
(3) Use these statistics to calculate a confidence interval.
(4) Repeat steps (1)-(3) 50 times.
But before we do all of this, we need to first create empty vectors where we can save the means andstandard deviations that will be calculated in step (2). And while we’re at it, let’s also store the desiredsample size as n.
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
Now we’re ready for the loop where we obtain 50 random samples and quickly calculate and save theirmeans and standard deviations.
for (i in 1:50) {
samp <- sample(population, n) # obtain a sample of size n = 60 from the population
samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean
samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd
}
Lastly, we construct the confidence intervals.
lower <- samp_mean - 1.96 * samp_sd/sqrt(n)
upper <- samp_mean + 1.96 * samp_sd/sqrt(n)
Lower bounds of these 50 confidence intervals are stored in lower, and the upper bounds are in upper.Let’s view the first interval.
c(lower[1], upper[1])
But browsing these intervals one-by-one will get tedious...
3
Lab 4B: Foundations for statistical inference - Confidence levels
Sampling from Ames, Iowa
If you have access to data on an entire population, say the size of every house in Ames, Iowa, it’s straightforward to answer questions like, “How big is the typical house in Ames?” and “How much variation isthere in sizes of houses?”. If you have access to only a sample of the population, as is often the case, thetask becomes more complicated. What is your best guess for the typical size if you only know the sizes ofseveral dozen houses? This sort of situation requires that you use your sample to make inference on whatyour population looks like.
The data
In the previous lab we looked at the population data of houses from Ames, Iowa. Let’s start with loadingthat data set.
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
In this lab we’ll start with just a sample from the population, which is a more realistic situation. Specifically,this is a simple random sample of size 60. Note that the data set has information on many variables onthese houses, but for the first portion of the lab we’ll focus on the size of the house, represented by thevariable Gr.Liv.Area.
population <- ames$Gr.Liv.Area
sample <- sample(population, 60)
Exercise 1 Describe the distribution of your sample. What would you say is the “typical” sizewithin your sample? Also state precisely what you interpreted “typical” to mean.
Exercise 2 Now compare your distribution to your neighbor’s. Do they look similar? Are theyidentical? Why, or why not?
Confidence intervals
One of the most common ways to describe the typical or central value of a distribution is to use the mean.In this case we can calculate the mean age size our sample:
sample_mean <- mean(sample)
Return for a moment to the question that first motivated this lab: based on this sample, what can weinfer about the population? More specifically, what is our best estimate of the mean size of the houses in
This is a product of OpenIntro that is released under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported(http://creativecommons.org/licenses/by-nc-nd/3.0/). This lab was written for OpenIntro by Andrew Bray and Mine Cetinkaya-Rundel.
1
concept: confidence levelsresample from the population many times and construct many confidence intervals (loops)
> TEACHHere is the rough outline:
(1) Obtain a random sample.
(2) Calculate its means and standard deviation.
(3) Use these statistics to calculate a confidence interval.
(4) Repeat steps (1)-(3) 50 times.
But before we do all of this, we need to first create empty vectors where we can save the means andstandard deviations that will be calculated in step (2). And while we’re at it, let’s also store the desiredsample size as n.
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
Now we’re ready for the loop where we obtain 50 random samples and quickly calculate and save theirmeans and standard deviations.
for (i in 1:50) {
samp <- sample(population, n) # obtain a sample of size n = 60 from the population
samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean
samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd
}
Lastly, we construct the confidence intervals.
lower <- samp_mean - 1.96 * samp_sd/sqrt(n)
upper <- samp_mean + 1.96 * samp_sd/sqrt(n)
Lower bounds of these 50 confidence intervals are stored in lower, and the upper bounds are in upper.Let’s view the first interval.
c(lower[1], upper[1])
But browsing these intervals one-by-one will get tedious...
3
Here is the rough outline:
(1) Obtain a random sample.
(2) Calculate its means and standard deviation.
(3) Use these statistics to calculate a confidence interval.
(4) Repeat steps (1)-(3) 50 times.
But before we do all of this, we need to first create empty vectors where we can save the means andstandard deviations that will be calculated in step (2). And while we’re at it, let’s also store the desiredsample size as n.
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
Now we’re ready for the loop where we obtain 50 random samples and quickly calculate and save theirmeans and standard deviations.
for (i in 1:50) {
samp <- sample(population, n) # obtain a sample of size n = 60 from the population
samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean
samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd
}
Lastly, we construct the confidence intervals.
lower <- samp_mean - 1.96 * samp_sd/sqrt(n)
upper <- samp_mean + 1.96 * samp_sd/sqrt(n)
Lower bounds of these 50 confidence intervals are stored in lower, and the upper bounds are in upper.Let’s view the first interval.
c(lower[1], upper[1])
But browsing these intervals one-by-one will get tedious...
3
Lab 4B: Foundations for statistical inference - Confidence levels
Sampling from Ames, Iowa
If you have access to data on an entire population, say the size of every house in Ames, Iowa, it’s straightforward to answer questions like, “How big is the typical house in Ames?” and “How much variation isthere in sizes of houses?”. If you have access to only a sample of the population, as is often the case, thetask becomes more complicated. What is your best guess for the typical size if you only know the sizes ofseveral dozen houses? This sort of situation requires that you use your sample to make inference on whatyour population looks like.
The data
In the previous lab we looked at the population data of houses from Ames, Iowa. Let’s start with loadingthat data set.
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
In this lab we’ll start with just a sample from the population, which is a more realistic situation. Specifically,this is a simple random sample of size 60. Note that the data set has information on many variables onthese houses, but for the first portion of the lab we’ll focus on the size of the house, represented by thevariable Gr.Liv.Area.
population <- ames$Gr.Liv.Area
sample <- sample(population, 60)
Exercise 1 Describe the distribution of your sample. What would you say is the “typical” sizewithin your sample? Also state precisely what you interpreted “typical” to mean.
Exercise 2 Now compare your distribution to your neighbor’s. Do they look similar? Are theyidentical? Why, or why not?
Confidence intervals
One of the most common ways to describe the typical or central value of a distribution is to use the mean.In this case we can calculate the mean age size our sample:
sample_mean <- mean(sample)
Return for a moment to the question that first motivated this lab: based on this sample, what can weinfer about the population? More specifically, what is our best estimate of the mean size of the houses in
This is a product of OpenIntro that is released under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported(http://creativecommons.org/licenses/by-nc-nd/3.0/). This lab was written for OpenIntro by Andrew Bray and Mine Cetinkaya-Rundel.
1
concept: confidence levelsresample from the population many times and construct many confidence intervals (loops)
> TEACHHere is the rough outline:
(1) Obtain a random sample.
(2) Calculate its means and standard deviation.
(3) Use these statistics to calculate a confidence interval.
(4) Repeat steps (1)-(3) 50 times.
But before we do all of this, we need to first create empty vectors where we can save the means andstandard deviations that will be calculated in step (2). And while we’re at it, let’s also store the desiredsample size as n.
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
Now we’re ready for the loop where we obtain 50 random samples and quickly calculate and save theirmeans and standard deviations.
for (i in 1:50) {
samp <- sample(population, n) # obtain a sample of size n = 60 from the population
samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean
samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd
}
Lastly, we construct the confidence intervals.
lower <- samp_mean - 1.96 * samp_sd/sqrt(n)
upper <- samp_mean + 1.96 * samp_sd/sqrt(n)
Lower bounds of these 50 confidence intervals are stored in lower, and the upper bounds are in upper.Let’s view the first interval.
c(lower[1], upper[1])
But browsing these intervals one-by-one will get tedious...
3
Here is the rough outline:
(1) Obtain a random sample.
(2) Calculate its means and standard deviation.
(3) Use these statistics to calculate a confidence interval.
(4) Repeat steps (1)-(3) 50 times.
But before we do all of this, we need to first create empty vectors where we can save the means andstandard deviations that will be calculated in step (2). And while we’re at it, let’s also store the desiredsample size as n.
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
Now we’re ready for the loop where we obtain 50 random samples and quickly calculate and save theirmeans and standard deviations.
for (i in 1:50) {
samp <- sample(population, n) # obtain a sample of size n = 60 from the population
samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean
samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd
}
Lastly, we construct the confidence intervals.
lower <- samp_mean - 1.96 * samp_sd/sqrt(n)
upper <- samp_mean + 1.96 * samp_sd/sqrt(n)
Lower bounds of these 50 confidence intervals are stored in lower, and the upper bounds are in upper.Let’s view the first interval.
c(lower[1], upper[1])
But browsing these intervals one-by-one will get tedious...
3
Here is the rough outline:
(1) Obtain a random sample.
(2) Calculate its means and standard deviation.
(3) Use these statistics to calculate a confidence interval.
(4) Repeat steps (1)-(3) 50 times.
But before we do all of this, we need to first create empty vectors where we can save the means andstandard deviations that will be calculated in step (2). And while we’re at it, let’s also store the desiredsample size as n.
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
Now we’re ready for the loop where we obtain 50 random samples and quickly calculate and save theirmeans and standard deviations.
for (i in 1:50) {
samp <- sample(population, n) # obtain a sample of size n = 60 from the population
samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean
samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd
}
Lastly, we construct the confidence intervals.
lower <- samp_mean - 1.96 * samp_sd/sqrt(n)
upper <- samp_mean + 1.96 * samp_sd/sqrt(n)
Lower bounds of these 50 confidence intervals are stored in lower, and the upper bounds are in upper.Let’s view the first interval.
c(lower[1], upper[1])
But browsing these intervals one-by-one will get tedious...
3
Lab 4B: Foundations for statistical inference - Confidence levels
Sampling from Ames, Iowa
If you have access to data on an entire population, say the size of every house in Ames, Iowa, it’s straightforward to answer questions like, “How big is the typical house in Ames?” and “How much variation isthere in sizes of houses?”. If you have access to only a sample of the population, as is often the case, thetask becomes more complicated. What is your best guess for the typical size if you only know the sizes ofseveral dozen houses? This sort of situation requires that you use your sample to make inference on whatyour population looks like.
The data
In the previous lab we looked at the population data of houses from Ames, Iowa. Let’s start with loadingthat data set.
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
In this lab we’ll start with just a sample from the population, which is a more realistic situation. Specifically,this is a simple random sample of size 60. Note that the data set has information on many variables onthese houses, but for the first portion of the lab we’ll focus on the size of the house, represented by thevariable Gr.Liv.Area.
population <- ames$Gr.Liv.Area
sample <- sample(population, 60)
Exercise 1 Describe the distribution of your sample. What would you say is the “typical” sizewithin your sample? Also state precisely what you interpreted “typical” to mean.
Exercise 2 Now compare your distribution to your neighbor’s. Do they look similar? Are theyidentical? Why, or why not?
Confidence intervals
One of the most common ways to describe the typical or central value of a distribution is to use the mean.In this case we can calculate the mean age size our sample:
sample_mean <- mean(sample)
Return for a moment to the question that first motivated this lab: based on this sample, what can weinfer about the population? More specifically, what is our best estimate of the mean size of the houses in
This is a product of OpenIntro that is released under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported(http://creativecommons.org/licenses/by-nc-nd/3.0/). This lab was written for OpenIntro by Andrew Bray and Mine Cetinkaya-Rundel.
1
> HIDE
concept: confidence levelsplot these confidence intervals and highlight those that do not contain the true population parameter (custom function)
> HIDE
concept: confidence levelsplot these confidence intervals and highlight those that do not contain the true population parameter (custom function)
On your own
1. Using the following custom function, plot all intervals. What proportion of your confidence intervalsinclude the true population mean? Is this proportion exactly equal to the confidence level? If no,explain why.†
plot_ci(lower, upper, mean(population))
2. Pick a confidence level of your choosing. What is the appropriate critical value?
3. Calculate 50 confidence intervals at this confidence level. You do not need to obtain new samples,simply calculate new intervals based on the samples you have already collected. Using the plot ci
function plot all intervals and calculate the proportion of intervals that include the true populationmean. How does this percentage compare to the confidence level you picked?
4. What concepts from the textbook are covered in this lab? What concepts, if any, are not covered inthe textbook? Have you seen these concepts elsewhere, e.g. lecture, discussion section, previous labs,or homework problems? Be specific in your answer.
†This figure should look familiar, if not, see Section 4.2.2.
4
> HIDEConfidence levels
Resample from the population many times and construct manyconfidence intervals (loops)Plot these confidence intervals and highlight those that do notcontain the true population parameter (custom function)
plot.ci(lower, upper, mean(pop))
mu = 1499.6904
Cetinkaya-Rundel, Bray Integrating R into Introductory Statistics useR! 2012 - June 13, 2012 6 / 15
concept: confidence levelsplot these confidence intervals and highlight those that do not contain the true population parameter (custom function)
On your own
1. Using the following custom function, plot all intervals. What proportion of your confidence intervalsinclude the true population mean? Is this proportion exactly equal to the confidence level? If no,explain why.†
plot_ci(lower, upper, mean(population))
2. Pick a confidence level of your choosing. What is the appropriate critical value?
3. Calculate 50 confidence intervals at this confidence level. You do not need to obtain new samples,simply calculate new intervals based on the samples you have already collected. Using the plot ci
function plot all intervals and calculate the proportion of intervals that include the true populationmean. How does this percentage compare to the confidence level you picked?
4. What concepts from the textbook are covered in this lab? What concepts, if any, are not covered inthe textbook? Have you seen these concepts elsewhere, e.g. lecture, discussion section, previous labs,or homework problems? Be specific in your answer.
†This figure should look familiar, if not, see Section 4.2.2.
4
> MINIMIZE
concept: statistical inference
> MINIMIZE
> traditional curriculum for an introductory statistics course includes various statistical inference techniques
concept: statistical inference
> MINIMIZE
> traditional curriculum for an introductory statistics course includes various statistical inference techniques
> when introduced as disconnected topic these can be overwhelming to students
concept: statistical inference
> MINIMIZE
> traditional curriculum for an introductory statistics course includes various statistical inference techniques
> when introduced as disconnected topic these can be overwhelming to students
> to help unify inferential concepts, use one function that does it all, but still requires students to think about the nature of the data and encourages them to conduct exploratory data analysis
concept: statistical inference
> EXAMPLE
concept: statistical inferencefunction: inference() - theoretical and simulation based inference
inference <- function(data, group = NULL, est = c("mean", "median", "proportion"),
success = NULL, order = NULL, nsim = 10000, conflevel = 0.95, null = NULL,
alternative = c("less", "greater", "twosided"), type = c("ci", "ht"),
method = c("theoretical", "simulation"), drawlines = "yes", simdist = FALSE){
...
}
3
> EXAMPLE
concept: statistical inferencefunction: inference() - theoretical and simulation based inference
inference <- function(data, group = NULL, est = c("mean", "median", "proportion"),
success = NULL, order = NULL, nsim = 10000, conflevel = 0.95, null = NULL,
alternative = c("less", "greater", "twosided"), type = c("ci", "ht"),
method = c("theoretical", "simulation"), drawlines = "yes", simdist = FALSE){
...
}
3
> data: response variable, quantitative or categorical
> group: explanatory variable, categorical for grouping (optional)
> type: confidence interval or hypothesis test
> method: theoretical or simulation
> ...
> EXAMPLE
concept: statistical inferencefunction: inference() - theoretical and simulation based inference
question: compare birth weights of babies born to smoker and nonsmoker mothers (source: north carolina births)
> USE IN LABS
Exercise 2 Make a side-by-side boxplot of habit and weight. What does the plot tell us aboutthe relationship between the two variables habit and weight?
The side-by-side box plots show how the medians of the two distributions compare, and we can alsocompare the means of the distributions. The following command gives the mean weights of babies bornto smoker and non-smoker mothers.
by(nc$weight, nc$habit, mean)
There is clearly an observed difference, but is this difference statistically significant? In order to answerthis question we need to conduct a hypothesis test.
Exercise 3 Check if the conditions necessary for inference are satisfied? Note that you willneed to obtain sample sizes to check the conditions.
Exercise 4 Write the hypotheses for testing if the average weights of babies born to smokerand non-smoker mothers are different.
Next, we introduce a new function, inference, that we will use for conducting hypothesis tests andconstructing confidence intervals.
inference(data = nc$weight, group = nc$habit, est = "mean", type = "ht", null = 0,
alternative = "twosided", method = "theoretical")
Let’s pause for a moment to go through the arguments of this custom function.
• The first argument is data, this is the response variable that we are interested in: weight
• The second argument is the grouping variable, group, this is the variable that we use to split the datainto two groups, smokers and nonsmokers: habit.
• The third argument (est) is the parameter we’re interested in: mean (other options are median, orproportion.)
• Next we decide on the type of inference we want: a hypothesis test (ht) or a confidence interval (ci).
• When doing a hypothesis test we also need to supply the null value, which in this case is 0, sincethe null hypothesis sets the two population means equal to each other.
• The alternative hypothesis can be less, greater, twosided.
• Lastly, the method of inference can be theoretical or simulation based.
Exercise 5 Change the type argument to ci construct a confidence interval for the differencebetween the weights of babies born to smoker and non-smoker mothers.
By default the function reports an interval for (µnonsmoker
� µsmoker
), we can easily change this order:
inference(data = nc$weight, group = nc$habit, est = "mean", type = "ci", null = 0,
alternative = "twosided", method = "theoretical", order = c("smoker", "nonsmoker"))
2
input:
question: compare birth weights of babies born to smoker and nonsmoker mothers (source: north carolina births)
> USE IN LABS
Exercise 2 Make a side-by-side boxplot of habit and weight. What does the plot tell us aboutthe relationship between the two variables habit and weight?
The side-by-side box plots show how the medians of the two distributions compare, and we can alsocompare the means of the distributions. The following command gives the mean weights of babies bornto smoker and non-smoker mothers.
by(nc$weight, nc$habit, mean)
There is clearly an observed difference, but is this difference statistically significant? In order to answerthis question we need to conduct a hypothesis test.
Exercise 3 Check if the conditions necessary for inference are satisfied? Note that you willneed to obtain sample sizes to check the conditions.
Exercise 4 Write the hypotheses for testing if the average weights of babies born to smokerand non-smoker mothers are different.
Next, we introduce a new function, inference, that we will use for conducting hypothesis tests andconstructing confidence intervals.
inference(data = nc$weight, group = nc$habit, est = "mean", type = "ht", null = 0,
alternative = "twosided", method = "theoretical")
Let’s pause for a moment to go through the arguments of this custom function.
• The first argument is data, this is the response variable that we are interested in: weight
• The second argument is the grouping variable, group, this is the variable that we use to split the datainto two groups, smokers and nonsmokers: habit.
• The third argument (est) is the parameter we’re interested in: mean (other options are median, orproportion.)
• Next we decide on the type of inference we want: a hypothesis test (ht) or a confidence interval (ci).
• When doing a hypothesis test we also need to supply the null value, which in this case is 0, sincethe null hypothesis sets the two population means equal to each other.
• The alternative hypothesis can be less, greater, twosided.
• Lastly, the method of inference can be theoretical or simulation based.
Exercise 5 Change the type argument to ci construct a confidence interval for the differencebetween the weights of babies born to smoker and non-smoker mothers.
By default the function reports an interval for (µnonsmoker
� µsmoker
), we can easily change this order:
inference(data = nc$weight, group = nc$habit, est = "mean", type = "ci", null = 0,
alternative = "twosided", method = "theoretical", order = c("smoker", "nonsmoker"))
2
input:
nonsmoker smoker
24
68
1012
-0.32 0 0.32nonsmoker smoker
24
68
1012
-0.32 0 0.32
One quantitative and one categorical variableDifference between two meansn_nonsmoker = 873 ; n_smoker = 126 Observed difference between means = 0.3155H0: mu_nonsmoker - mu_smoker = 0 HA: mu_nonsmoker - mu_smoker != 0 Standard error = 0.134 Test statistic: Z = 2.359 p-value: 0.0184
output:
question: compare birth weights of babies born to smoker and nonsmoker mothers (source: north carolina births)
> USE IN LABS
question: compare numbers of sexual partners of males and females (source: national survey of family growth)
> USE IN PROJECTS
question: compare numbers of sexual partners of males and females (source: national survey of family growth)
input:inference(data = partners, group = gender, type = "ci", est = "mean", method =
"theoretical")
3
> USE IN PROJECTS
question: compare numbers of sexual partners of males and females (source: national survey of family growth)
input:inference(data = partners, group = gender, type = "ci", est = "mean", method =
"theoretical")
3
output:
female male
01
23
45
67
One quantitative and one categorical variableDifference between two meansn_female = 12190 ; n_male = 10397 Observed difference between means = -0.432Standard error = 0.0361 95 % Confidence interval = ( -0.5 , -0.36 )
> USE IN PROJECTS
> resources
> RESOURCES
openintro.org/stat/labs.php
> LAB 2 - PROBABILITY
> LAB 8 - MULTIPLE REGRESSION
> reactions
> STUDENT REACTIONS - LABS
positive:> ``I like them. I feel like in the real world we’ll be using software
to do stats, so I’m glad we’re learning how to use it.’’
> ``I LOVE the labs. They really help cement basic statistic ideas, and I especially love that you can finish them in class.’’
> ``The labs are a lot of fun. It’s great being able to create our own simulations and watch R Studio calculate everything. I also enjoy learning some code.’’
negative:> ``The labs are alright. Sometimes I feel like I’m just plugging in
stuff and I feel disconnected from what I’m really doing. It’s also frustrating when the code doesn’t work.’’
> ``Wish other students focused more.’’
> STUDENT REACTIONS - R
positive:> ``Super useful and powerful software. It’s exciting to be introduced to it.
Once again, don’t always feel comfortable writing code/ understanding what I’m doing.’’
> ``I like it! I kind of know MATLAB, which has helped with the coding a bit, but it’s a little more intuitive/easier, and very helpful.’’
> ``I am not a computer person at all, but I find RStudio very easy to use.’’> ``I like it better than STATA which we used for [another class]. The user
interface is easy and there is plenty of help for it online. Overall, it’s pretty good.’’
negative:> ``I am not a fan of coding in general. I used Python before and RStudio is
better (for me) than Python was, but I am not a fan of either.’’> ``Easy to use, language is not too hard to understand although error
messages could be more informative.’’> ``I don’t think RStudio will have any use to me outside of this class.’’
> also...
> ADDITIONAL CONSIDERATIONS
> ADDITIONAL CONSIDERATIONS
> Labs should be fully integrated with the curriculum
> ADDITIONAL CONSIDERATIONS
> Labs should be fully integrated with the curriculum``What concepts, if any, are not covered in the textbook? Have you seen these concepts elsewhere, e.g. lecture, discussion section, previous labs, or homework problems? Be specific in your answer.’’
> ADDITIONAL CONSIDERATIONS
> Labs should be fully integrated with the curriculum``What concepts, if any, are not covered in the textbook? Have you seen these concepts elsewhere, e.g. lecture, discussion section, previous labs, or homework problems? Be specific in your answer.’’
> Works best in a classroom environment where students can collaborate with each other and get immediate support
> ADDITIONAL CONSIDERATIONS
> Labs should be fully integrated with the curriculum``What concepts, if any, are not covered in the textbook? Have you seen these concepts elsewhere, e.g. lecture, discussion section, previous labs, or homework problems? Be specific in your answer.’’
> Works best in a classroom environment where students can collaborate with each other and get immediate support
``Collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.”
> ADDITIONAL CONSIDERATIONS
> Labs should be fully integrated with the curriculum``What concepts, if any, are not covered in the textbook? Have you seen these concepts elsewhere, e.g. lecture, discussion section, previous labs, or homework problems? Be specific in your answer.’’
> Works best in a classroom environment where students can collaborate with each other and get immediate support
``Collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.”
> TAs need to be familiar and comfortable with the material
> thank you
stat.duke.edu/~mc301
openintro.org/stat/labs.php
contact :
web :
labs :