> integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating...

> integrating R into > introductory statistics

Mine Çetinkaya-Rundel - Duke UniversityAndrew Bray - UCLA

JSM 2012 - August 2, 2012

> experience

> STA 101 AT DUKE

> first course in statistics for non-majors, mostly social sciences majors

> STA 101 AT DUKE


> 80-120 students in lecture, 25 students / lab section

> STA 101 AT DUKE


> 80-120 students in lecture, 25 students / lab section

> weekly lab sessions using R> labs designed for an interdisciplinary introductory course,

can be modified for discipline-specific courses> can also be used in a first data-analysis course for stats

majors, ideally by reducing step-by-step instructions

> STA 101 AT DUKE

unlike most software designed specifically for courses at this level, R is> free and open-source> powerful and flexible> relevant beyond the introductory statistics

classroom

> WHY R?

> WHY NOT R?

> perceived challenge of teaching programming in addition to teaching statistical concepts> labs and activities that try to find the right balance

of standard and custom functions> consistent syntax highlighting helps

> WHY NOT R?

> perceived challenge of teaching programming in addition to teaching statistical concepts> labs and activities that try to find the right balance

of standard and custom functions> consistent syntax highlighting helps

> working with a command line tends to be more intimidating than traditional GUI based tools> GUI tools also have a learning curve > a user-friendly IDE (like RStudio)

> WHY NOT R?

> RSTUDIO

> what it helps resolve:> loading and viewing data > saving code> code history> workspace organization> plot history

> RSTUDIO

> what it helps resolve:> loading and viewing data > saving code> code history> workspace organization> plot history

> what still remains a challenge:> working with a command line

> RSTUDIO

> balance

> BALANCE

> teach coding as a way of introducing/reinforcing concepts, especially those that are otherwise difficult to convey without computation> simulations

> sampling distributions > confidence levels

> bootstrapping > randomization tests > ...

> BALANCE




> hide implementation issues that are outside the scope of the course

> BALANCE




> hide implementation issues that are outside the scope of the course

> minimize coding for repeated mechanics that can be unified

> BALANCE

concept: confidence levelsresample from the population many times and construct many confidence intervals (loops)

> TEACH


> TEACH

Lab 4B: Foundations for statistical inference - Confidence levels

Sampling from Ames, Iowa

If you have access to data on an entire population, say the size of every house in Ames, Iowa, it’s straightforward to answer questions like, “How big is the typical house in Ames?” and “How much variation isthere in sizes of houses?”. If you have access to only a sample of the population, as is often the case, thetask becomes more complicated. What is your best guess for the typical size if you only know the sizes ofseveral dozen houses? This sort of situation requires that you use your sample to make inference on whatyour population looks like.

The data

In the previous lab we looked at the population data of houses from Ames, Iowa. Let’s start with loadingthat data set.

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")

load("ames.RData")

In this lab we’ll start with just a sample from the population, which is a more realistic situation. Specifically,this is a simple random sample of size 60. Note that the data set has information on many variables onthese houses, but for the first portion of the lab we’ll focus on the size of the house, represented by thevariable Gr.Liv.Area.

population <- ames$Gr.Liv.Area

sample <- sample(population, 60)

Exercise 1 Describe the distribution of your sample. What would you say is the “typical” sizewithin your sample? Also state precisely what you interpreted “typical” to mean.

Exercise 2 Now compare your distribution to your neighbor’s. Do they look similar? Are theyidentical? Why, or why not?

Confidence intervals

One of the most common ways to describe the typical or central value of a distribution is to use the mean.In this case we can calculate the mean age size our sample:

sample_mean <- mean(sample)

Return for a moment to the question that first motivated this lab: based on this sample, what can weinfer about the population? More specifically, what is our best estimate of the mean size of the houses in

This is a product of OpenIntro that is released under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported(http://creativecommons.org/licenses/by-nc-nd/3.0/). This lab was written for OpenIntro by Andrew Bray and Mine Cetinkaya-Rundel.

1


> TEACHHere is the rough outline:

(1) Obtain a random sample.

(2) Calculate its means and standard deviation.

(3) Use these statistics to calculate a confidence interval.

(4) Repeat steps (1)-(3) 50 times.

But before we do all of this, we need to first create empty vectors where we can save the means andstandard deviations that will be calculated in step (2). And while we’re at it, let’s also store the desiredsample size as n.

samp_mean <- rep(NA, 50)

samp_sd <- rep(NA, 50)

n <- 60

Now we’re ready for the loop where we obtain 50 random samples and quickly calculate and save theirmeans and standard deviations.

for (i in 1:50) {

samp <- sample(population, n) # obtain a sample of size n = 60 from the population

samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean

samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd

}

Lastly, we construct the confidence intervals.

lower <- samp_mean - 1.96 * samp_sd/sqrt(n)

upper <- samp_mean + 1.96 * samp_sd/sqrt(n)

Lower bounds of these 50 confidence intervals are stored in lower, and the upper bounds are in upper.Let’s view the first interval.

c(lower[1], upper[1])

But browsing these intervals one-by-one will get tedious...

3




The data



load("ames.RData")











1










n <- 60


for (i in 1:50) {




}







3

Here is the rough outline:








n <- 60


for (i in 1:50) {




}







3




The data



load("ames.RData")











1










n <- 60


for (i in 1:50) {




}







3









n <- 60


for (i in 1:50) {




}







3









n <- 60


for (i in 1:50) {




}







3




The data



load("ames.RData")











1

> HIDE

concept: confidence levelsplot these confidence intervals and highlight those that do not contain the true population parameter (custom function)

> HIDE


On your own

1. Using the following custom function, plot all intervals. What proportion of your confidence intervalsinclude the true population mean? Is this proportion exactly equal to the confidence level? If no,explain why.†

plot_ci(lower, upper, mean(population))

2. Pick a confidence level of your choosing. What is the appropriate critical value?

3. Calculate 50 confidence intervals at this confidence level. You do not need to obtain new samples,simply calculate new intervals based on the samples you have already collected. Using the plot ci

function plot all intervals and calculate the proportion of intervals that include the true populationmean. How does this percentage compare to the confidence level you picked?

4. What concepts from the textbook are covered in this lab? What concepts, if any, are not covered inthe textbook? Have you seen these concepts elsewhere, e.g. lecture, discussion section, previous labs,or homework problems? Be specific in your answer.

†This figure should look familiar, if not, see Section 4.2.2.

4

> HIDEConfidence levels

Resample from the population many times and construct manyconfidence intervals (loops)Plot these confidence intervals and highlight those that do notcontain the true population parameter (custom function)

plot.ci(lower, upper, mean(pop))

mu = 1499.6904

Cetinkaya-Rundel, Bray Integrating R into Introductory Statistics useR! 2012 - June 13, 2012 6 / 15


On your own

1. Using the following custom function, plot all intervals. What proportion of your confidence intervalsinclude the true population mean? Is this proportion exactly equal to the confidence level? If no,explain why.†

plot_ci(lower, upper, mean(population))

2. Pick a confidence level of your choosing. What is the appropriate critical value?

3. Calculate 50 confidence intervals at this confidence level. You do not need to obtain new samples,simply calculate new intervals based on the samples you have already collected. Using the plot ci

function plot all intervals and calculate the proportion of intervals that include the true populationmean. How does this percentage compare to the confidence level you picked?

4. What concepts from the textbook are covered in this lab? What concepts, if any, are not covered inthe textbook? Have you seen these concepts elsewhere, e.g. lecture, discussion section, previous labs,or homework problems? Be specific in your answer.

†This figure should look familiar, if not, see Section 4.2.2.

4

> MINIMIZE

concept: statistical inference

> MINIMIZE

> traditional curriculum for an introductory statistics course includes various statistical inference techniques


> MINIMIZE


> when introduced as disconnected topic these can be overwhelming to students


> MINIMIZE


> when introduced as disconnected topic these can be overwhelming to students

> to help unify inferential concepts, use one function that does it all, but still requires students to think about the nature of the data and encourages them to conduct exploratory data analysis


> EXAMPLE

concept: statistical inferencefunction: inference() - theoretical and simulation based inference

inference <- function(data, group = NULL, est = c("mean", "median", "proportion"),

success = NULL, order = NULL, nsim = 10000, conflevel = 0.95, null = NULL,

alternative = c("less", "greater", "twosided"), type = c("ci", "ht"),

method = c("theoretical", "simulation"), drawlines = "yes", simdist = FALSE){

...

}

3

> EXAMPLE


inference <- function(data, group = NULL, est = c("mean", "median", "proportion"),

success = NULL, order = NULL, nsim = 10000, conflevel = 0.95, null = NULL,

alternative = c("less", "greater", "twosided"), type = c("ci", "ht"),

method = c("theoretical", "simulation"), drawlines = "yes", simdist = FALSE){

...

}

3

> data: response variable, quantitative or categorical

> group: explanatory variable, categorical for grouping (optional)

> type: confidence interval or hypothesis test

> method: theoretical or simulation

> ...

> EXAMPLE


question: compare birth weights of babies born to smoker and nonsmoker mothers (source: north carolina births)

> USE IN LABS

Exercise 2 Make a side-by-side boxplot of habit and weight. What does the plot tell us aboutthe relationship between the two variables habit and weight?

The side-by-side box plots show how the medians of the two distributions compare, and we can alsocompare the means of the distributions. The following command gives the mean weights of babies bornto smoker and non-smoker mothers.

by(nc$weight, nc$habit, mean)

There is clearly an observed difference, but is this difference statistically significant? In order to answerthis question we need to conduct a hypothesis test.

Exercise 3 Check if the conditions necessary for inference are satisfied? Note that you willneed to obtain sample sizes to check the conditions.

Exercise 4 Write the hypotheses for testing if the average weights of babies born to smokerand non-smoker mothers are different.

Next, we introduce a new function, inference, that we will use for conducting hypothesis tests andconstructing confidence intervals.

inference(data = nc$weight, group = nc$habit, est = "mean", type = "ht", null = 0,

alternative = "twosided", method = "theoretical")

Let’s pause for a moment to go through the arguments of this custom function.

• The first argument is data, this is the response variable that we are interested in: weight

• The second argument is the grouping variable, group, this is the variable that we use to split the datainto two groups, smokers and nonsmokers: habit.

• The third argument (est) is the parameter we’re interested in: mean (other options are median, orproportion.)

• Next we decide on the type of inference we want: a hypothesis test (ht) or a confidence interval (ci).

• When doing a hypothesis test we also need to supply the null value, which in this case is 0, sincethe null hypothesis sets the two population means equal to each other.

• The alternative hypothesis can be less, greater, twosided.

• Lastly, the method of inference can be theoretical or simulation based.

Exercise 5 Change the type argument to ci construct a confidence interval for the differencebetween the weights of babies born to smoker and non-smoker mothers.

By default the function reports an interval for (µnonsmoker

� µsmoker

), we can easily change this order:

inference(data = nc$weight, group = nc$habit, est = "mean", type = "ci", null = 0,

alternative = "twosided", method = "theoretical", order = c("smoker", "nonsmoker"))

2

input:


> USE IN LABS

Exercise 2 Make a side-by-side boxplot of habit and weight. What does the plot tell us aboutthe relationship between the two variables habit and weight?

The side-by-side box plots show how the medians of the two distributions compare, and we can alsocompare the means of the distributions. The following command gives the mean weights of babies bornto smoker and non-smoker mothers.

by(nc$weight, nc$habit, mean)

There is clearly an observed difference, but is this difference statistically significant? In order to answerthis question we need to conduct a hypothesis test.

Exercise 3 Check if the conditions necessary for inference are satisfied? Note that you willneed to obtain sample sizes to check the conditions.

Exercise 4 Write the hypotheses for testing if the average weights of babies born to smokerand non-smoker mothers are different.

Next, we introduce a new function, inference, that we will use for conducting hypothesis tests andconstructing confidence intervals.

inference(data = nc$weight, group = nc$habit, est = "mean", type = "ht", null = 0,

alternative = "twosided", method = "theoretical")

Let’s pause for a moment to go through the arguments of this custom function.

• The first argument is data, this is the response variable that we are interested in: weight

• The second argument is the grouping variable, group, this is the variable that we use to split the datainto two groups, smokers and nonsmokers: habit.

• The third argument (est) is the parameter we’re interested in: mean (other options are median, orproportion.)

• Next we decide on the type of inference we want: a hypothesis test (ht) or a confidence interval (ci).

• When doing a hypothesis test we also need to supply the null value, which in this case is 0, sincethe null hypothesis sets the two population means equal to each other.

• The alternative hypothesis can be less, greater, twosided.

• Lastly, the method of inference can be theoretical or simulation based.

Exercise 5 Change the type argument to ci construct a confidence interval for the differencebetween the weights of babies born to smoker and non-smoker mothers.

By default the function reports an interval for (µnonsmoker

� µsmoker

), we can easily change this order:

inference(data = nc$weight, group = nc$habit, est = "mean", type = "ci", null = 0,

alternative = "twosided", method = "theoretical", order = c("smoker", "nonsmoker"))

2

input:

nonsmoker smoker

24

68

1012

-0.32 0 0.32nonsmoker smoker

24

68

1012

-0.32 0 0.32

One quantitative and one categorical variableDifference between two meansn_nonsmoker = 873 ; n_smoker = 126 Observed difference between means = 0.3155H0: mu_nonsmoker - mu_smoker = 0 HA: mu_nonsmoker - mu_smoker != 0 Standard error = 0.134 Test statistic: Z = 2.359 p-value: 0.0184

output:


> USE IN LABS

question: compare numbers of sexual partners of males and females (source: national survey of family growth)

> USE IN PROJECTS


input:inference(data = partners, group = gender, type = "ci", est = "mean", method =

"theoretical")

3

> USE IN PROJECTS


input:inference(data = partners, group = gender, type = "ci", est = "mean", method =

"theoretical")

3

output:

female male

01

23

45

67

One quantitative and one categorical variableDifference between two meansn_female = 12190 ; n_male = 10397 Observed difference between means = -0.432Standard error = 0.0361 95 % Confidence interval = ( -0.5 , -0.36 )

> USE IN PROJECTS

> resources

> RESOURCES

openintro.org/stat/labs.php

http://www.openintro.org/stat/labs.php


> LAB 2 - PROBABILITY

> LAB 8 - MULTIPLE REGRESSION

> reactions

> STUDENT REACTIONS - LABS

positive:> ``I like them. I feel like in the real world we’ll be using software

to do stats, so I’m glad we’re learning how to use it.’’

> ``I LOVE the labs. They really help cement basic statistic ideas, and I especially love that you can finish them in class.’’

> ``The labs are a lot of fun. It’s great being able to create our own simulations and watch R Studio calculate everything. I also enjoy learning some code.’’

negative:> ``The labs are alright. Sometimes I feel like I’m just plugging in

stuff and I feel disconnected from what I’m really doing. It’s also frustrating when the code doesn’t work.’’

> ``Wish other students focused more.’’

> STUDENT REACTIONS - R

positive:> ``Super useful and powerful software. It’s exciting to be introduced to it.

Once again, don’t always feel comfortable writing code/ understanding what I’m doing.’’

> `Ì like it! I kind of know MATLAB, which has helped with the coding a bit, but it’s a little more intuitive/easier, and very helpful.’’

> `Ì am not a computer person at all, but I find RStudio very easy to use.’’> `Ì like it better than STATA which we used for [another class]. The user

interface is easy and there is plenty of help for it online. Overall, it’s pretty good.’’

negative:> `Ì am not a fan of coding in general. I used Python before and RStudio is

better (for me) than Python was, but I am not a fan of either.’’> `Èasy to use, language is not too hard to understand although error

messages could be more informative.’’> `Ì don’t think RStudio will have any use to me outside of this class.’’

> also...

> ADDITIONAL CONSIDERATIONS


> Labs should be fully integrated with the curriculum


> Labs should be fully integrated with the curriculum``What concepts, if any, are not covered in the textbook? Have you seen these concepts elsewhere, e.g. lecture, discussion section, previous labs, or homework problems? Be specific in your answer.’’



> Works best in a classroom environment where students can collaborate with each other and get immediate support




``Collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.”




``Collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.”

> TAs need to be familiar and comfortable with the material

> thank you

[email protected]

stat.duke.edu/~mc301

openintro.org/stat/labs.php

contact :

web :

labs :

mailto:[email protected]

mailto:[email protected]

http://www.stat.duke.edu/~mc301

http://www.stat.duke.edu/~mc301



> integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating...

Documents

Transcript of > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating...