
Simulation for Examining Margin of Error and Sample Size: Binomial Proportions

Acknowledgements to Mandy Kauffman (WEST, Inc.) for photos and ‘background’ slides… simulation exercise adapts pedagogy of Trumbo, Suess, and Okumura (2005)


Background – Bovine brucellosis

• Bacterial disease
  – History in the US
  – Elk, bison, cattle (humans)
  – Cattle ↔ wildlife transmission
  – Causes abortions
  – Environmental contamination
  – Potential transmission to cattle
• $$$$
• Management implications


Background – Elk Feedgrounds

• Harsh winters + development → elk starving, commingling with cattle
• 23 supplemental winter elk feedgrounds created
  – 22 WGFD
  – 1 USFWS
• Up to 84% of elk use feedgrounds
• Low winter mortality
• Costly
• 22% seroprevalence on feedgrounds vs. 3.7% elsewhere

(Photo: Preble 1911)

Background – Management

• Management strategies:
  1. Maintain cattle/elk separation
     – hazing elk
     – fencing haystacks
     – elk feedgrounds
  2. ↓ likelihood of exposed cattle experiencing abortions (RB51 vaccine)
  3. ↓ seroprevalence in elk
     – T&S (test and slaughter)
     – low-density feeding
     – elk vaccination

Background: Management

• Despite ongoing management:
  – Recent cases in cattle/bison traced back to elk
  – Affected area expanding
• Limited $$ available for management
  – No clear, scientifically sound method
  – Need for economic evaluation of available management strategies
• Groups 1 & 2 already evaluated/underway
• Evaluation of Group 3 strategies still needed
  – How to assess seroprevalence of brucellosis in elk on feedgrounds… how many elk to sample?

A Beginning

R Code:

samp <- sample(0:1, 25, rep=T, prob=c(0.78, 0.22))
samp

Let’s start by simulating brucellosis diagnoses from 25 randomly sampled elk… assume prevalence is 0.22… the goal is to estimate prevalence to within 5 percentage points… how many elk are needed?

Is there an issue with the assumption of random sampling here?

> samp

[1] 0 1 1 0 0 0 0 0 0 1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0
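A natural follow-up (not on the original slide) is to compute the estimated prevalence from this sample, along with a normal-approximation 95% interval:

R Code:

phat <- mean(samp)                            # estimated prevalence
se <- sqrt(phat * (1 - phat) / length(samp))  # normal-approximation SE
c(estimate = phat, lower = phat - 1.96 * se, upper = phat + 1.96 * se)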

Generate a Profile

Let’s observe the running proportion testing positive for a variety of sample sizes… keep in mind that this profile doesn’t display independent samples

R Code:

n <- 25
NumElk <- 1:n
p.bruc <- c(0.78, 0.22)
x <- sample(0:1, n, rep=T, prob=p.bruc)
run.tot.pos <- cumsum(x)
Proportion <- run.tot.pos / NumElk
tabresults <- round(cbind(NumElk, x, run.tot.pos, Proportion), 3)
tabresults

Generating a Profile

R Code:

plot(NumElk,Proportion,type="l", ylim= c(0,1))

abline(h=0.22, col="green", lwd=2)
abline(h=0.17, col="blue", lwd=2, lty=3)
abline(h=0.27, col="blue", lwd=2, lty=3)

Running the simulation and corresponding plots several times will provide differing versions of the profile on the next page… variation in profiles displays the instability in our statistic

Generating Multiple Profiles

R Code:

set.seed(11); n <- 25; numsamps <- 20

plot(0,pch=" ", xlim=c(0,n), ylim = c(0.1,1.0), xlab = "NumElk", ylab="Proportion")

# Loop below will produce a different profile for each of the specified
# numsamps…

for(i in 1:numsamps){
  x <- cumsum(sample(0:1, n, rep=T, prob=p.bruc)) / (1:n)
  lines(1:n, x)
}

abline(h=0.22, col="green", lwd=2)
abline(h=0.17, col="blue", lwd=2, lty=3)
abline(h=0.27, col="blue", lwd=2, lty=3)
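To quantify the instability these profiles display, one could also look at the spread of the final proportion at n = 25 across many replicates (a small sketch, not on the original slides):

R Code:

props <- replicate(1000, mean(sample(0:1, 25, rep=T, prob=p.bruc)))
sd(props)                          # simulation-based SE of the proportion at n = 25
quantile(props, c(0.025, 0.975))   # middle 95% of the simulated proportions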

Calculated margin of error:

1.96 × sqrt(0.22 × 0.78 / 25) ≈ 0.1624
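The same number can be checked directly in R (a quick sketch of the normal-approximation formula above):

R Code:

me <- 1.96 * sqrt(0.22 * 0.78 / 25)   # normal-approximation margin of error
me                                    # approximately 0.1624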

How about 263 elk?
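Where does 263 come from? Inverting the margin-of-error formula for a 0.05 target gives n = (1.96/0.05)² × 0.22 × 0.78; a quick check (not on the original slide):

R Code:

n.needed <- (1.96 / 0.05)^2 * 0.22 * 0.78
n.needed   # approximately 263.7 elk (264 if rounding up)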

Simulation for Determining Power

• Power is a concept that is often difficult to teach, regardless of the level of the course… probably because hypothesis testing is such a strange beast!

• Many practical applications involve evaluating sampling protocols in terms of the ability to detect change over a period of time

• Simulation is often quite effective for determining the power associated with a particular design

Monitoring Woody Vegetation in Cobble Bars

• Cobble bars are an important feature of many streams… home to native and non-native plants, and many species of birds nest in these areas

What is the proportion of ‘woody vegetation’ cover on cobble bars in the Great Smoky Mountains?

– 22 cobble bars exist in the BISO area… can afford to sample 9 of them
– GRTS selection of sampling units
– Rotating panel monitoring schedule
– Point-intercept counts along each transect

While there are built-in functions in the ‘pwr’ library for computing power, and websites such as Russ Lenth’s Power And Sample Size (http://homepage.stat.uiowa.edu/~rlenth/Power/), it is often the case that your study design will be more complicated than what ‘canned’ routines allow for when computing power.
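For the simple one-proportion setting from earlier, a canned routine does exist; a minimal sketch using the pwr package (pwr.p.test with Cohen’s arcsine effect size ES.h; treating 0.22 vs. 0.27 as the comparison of interest is our choice here, echoing the ±0.05 band above):

R Code:

library(pwr)
# Power to detect a shift in prevalence from 0.22 to 0.27
# with n = 264 elk, at the 0.05 significance level
pwr.p.test(h = ES.h(0.27, 0.22), n = 264, sig.level = 0.05)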

Ex 1. Suppose that we are investigating a single cobble bar and that this year we observe 34% coverage in woody vegetation. We will go out once a year to measure the % coverage. If there is a linear trend, then the % coverage can be modeled as a function of year using:

%cover_i = β0 + β1 × year_i + ε_i

Let’s scale ‘year’ to be 0, 1, 2, 3, and 4 (so a five-year study), with β0 denoting the % coverage in year 0, the slope, β1, denoting the per-annum change in % woody coverage, and the model error term, ε_i, denoting the uncertainty in the measured woody percentage value within a given year (due to measurement error, weather events, etc.)

H0: β1 = 0  vs.  HA: β1 ≠ 0

We will observe this cobble bar over time, run a linear regression, compute the estimated slope along with its estimated standard error, and see if the p-value is less than our specified α.

We can estimate power by simulating many data sets with a given slope, intercept, and standard deviation, and then keeping track of how many times we reject the null that the slope is zero. Let’s suppose we want to detect a per-annum change of 3.4% in the % woody coverage.

R Code for p-value extraction:

bzero <- 34    # intercept: % coverage in year 0
b1 <- 3.4      # slope: per-annum change in % coverage
yr <- 0:4      # years 0 through 4 (a five-year study)
N <- 5         # number of annual measurements
sd <- 7.4      # residual standard deviation
pct_mean <- bzero + b1 * yr
pct <- rnorm(N, mean = pct_mean, sd = sd)   # simulate one data set
m <- lm(pct ~ yr)                           # fit the linear trend
coef(summary(m))["yr", "Pr(>|t|)"]          # extract the slope's p-value

Below, we have the true trend (i.e. 3.4% per year) plotted vs. the estimated trend from one set of simulated data:
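The original figure is not reproduced here; a minimal sketch that draws the same comparison (simulated points, true line, fitted line; the green/blue scheme simply echoes the earlier profile plots) would be:

R Code:

plot(yr, pct, xlab = "Year", ylab = "% woody cover")
abline(a = bzero, b = b1, col = "green", lwd = 2)   # true trend
abline(m, col = "blue", lwd = 2, lty = 3)           # estimated trend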

For this particular simulation, the p-value extracted was:

> coef(summary(m))["yr", "Pr(>|t|)"]
[1] 0.1726658

Thus, although we know there is a 3.4% per-annum change, our data did not result in a small enough p-value for us to detect an actual trend.

To compute the power, we need to do what we just did many times and then record what percentage of times the null hypothesis would have been rejected. We can use the code below to do this:

R Code:

numsim <- 500
pvals <- numeric(numsim)
for(i in 1:numsim){
  pct_mean <- bzero + b1 * yr
  pct <- rnorm(N, mean = pct_mean, sd = sd)
  m <- lm(pct ~ yr)
  pvals[i] <- coef(summary(m))["yr", "Pr(>|t|)"]
}
sum(pvals < 0.05)/numsim   # estimated power
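Since the estimated power is itself a proportion based on numsim simulated tests, it carries Monte Carlo error; a quick sketch of its standard error (not on the original slides):

R Code:

p.hat <- sum(pvals < 0.05)/numsim
mc.se <- sqrt(p.hat * (1 - p.hat) / numsim)   # Monte Carlo standard error
c(power = p.hat, mc.se = mc.se)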

As a Single Function

R Code:

lin_reg_sim <- function(bzero, b1, numyear, sd, numsim){
  yr <- 0:(numyear - 1)
  pvals <- numeric(numsim)
  for (i in 1:numsim) {
    pct_mean <- bzero + b1 * yr
    pct <- rnorm(numyear, mean = pct_mean, sd = sd)
    m <- lm(pct ~ yr)
    pvals[i] <- coef(summary(m))["yr", "Pr(>|t|)"]
  } # end of i loop
  power <- sum(pvals < 0.05)/numsim
  return(power)
} # end of function lin_reg_sim

lin_reg_sim(bzero=34, b1=3.4, numyear=5, sd=7.4, numsim=500)

The code below computes power as we change the effect size (i.e., the per-annum change):

R Code:

bzero <- 34; yr <- 0:4; N <- 5; sd <- 7.4
bvec <- seq(0.85, 8.5, by=0.85)   # range of slopes (effect sizes) to examine
power.b <- numeric(length(bvec))
pvals <- numeric(500)             # storage for p-values
for (j in 1:length(bvec)){
  b1 <- bvec[j]
  for (i in 1:500) {
    pct_mean <- bzero + b1 * yr
    pct <- rnorm(N, mean = pct_mean, sd = sd)
    m <- lm(pct ~ yr)
    pvals[i] <- coef(summary(m))["yr", "Pr(>|t|)"]
  } # end of i loop
  power.b[j] <- sum(pvals < 0.05)/500
} # end of j loop
plot(bvec, power.b, xlab="slopes", ylab="power")
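Equivalently, the same power curve can be traced by reusing the lin_reg_sim function defined above (a compact alternative, same logic):

R Code:

bvec <- seq(0.85, 8.5, by=0.85)
power.b <- sapply(bvec, function(b)
  lin_reg_sim(bzero=34, b1=b, numyear=5, sd=7.4, numsim=500))
plot(bvec, power.b, xlab="slopes", ylab="power")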

Just as we would expect, power increases with larger trends.