STAT 111 Introductory Statistics

Lecture 4: Collecting Data

May 24, 2004

Today’s Topics

• Relationships between categorical variables

• Collecting Data
– Designing experiments
– Choosing a sample
– Sampling distributions

Categorical Variables

• Recall that categorical variables separate individuals into groups.

• We’ve seen that scatterplots show relationships between quantitative variables.

• Similarly, to see relationships between categorical explanatory variables and quantitative responses, side-by-side boxplots are quite useful.

• What do we use to see the relationship between two categorical variables, though?

Contingency Table

• The contingency table is a two-way table with one variable as the row variable and the other as the column variable.

• The row totals and column totals in a two-way table give the marginal distributions of two variables separately.

• The conditional distributions of the response variable, one for each category of the explanatory variable, can be used to describe the association between the two variables.

Contingency Table Example 1

• Titanic data – 2201 passengers, counts only

Row variable: SEX. Column variable: SURVIVED.

Count          SURVIVED
SEX            no       yes      Total
female         126      344      470
male           1364     367      1731
Total          1490     711      2201

Joint and Marginal Distributions

Contingency Analysis of SURVIVED By SEX (each cell: Count, Total %)

               SURVIVED
SEX            no                yes               Total
female         126 (5.72%)       344 (15.63%)      470 (21.35%)
male           1364 (61.97%)     367 (16.67%)      1731 (78.65%)
Total          1490 (67.70%)     711 (32.30%)      2201

• Joint distribution: the Total % in the four interior cells
• Marginal distribution of SURVIVED: the Total row
• Marginal distribution of SEX: the Total column

Conditional Distributions

Contingency Analysis of SURVIVED By SEX

                         SURVIVED
SEX        Statistic     no        yes       Total
female     Count         126       344       470
           Total %       5.72      15.63     21.35
           Col %         8.46      48.38
           Row %         26.81     73.19
male       Count         1364      367       1731
           Total %       61.97     16.67     78.65
           Col %         91.54     51.62
           Row %         78.80     21.20
Total      Count         1490      711       2201
           Total %       67.70     32.30

• Conditional distribution of survival given gender: Row %
• Conditional distribution of gender given survival: Col %

Example from Contingency Table 1

• Joint distribution:
– P(male and survived) = 16.67%
– P(female and survived) = 15.63%

• Marginal distribution:
– P(survived) = 32.30%
– P(male) = 78.65%

• Conditional distribution of survival:

Survival           yes        no
Given a female     73.19%     26.81%
Given a male       21.20%     78.80%

Example from Contingency Table 1

• We see that of the people on board the ship, female survivors and male survivors made up roughly the same percentage.

• But the number of females on board was substantially smaller than the number of males.

• Looking at each category, we see that the percentage of females that survived is higher than the percentage of males that survived.

• Survival and gender seem to be associated.
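
The joint, marginal, and conditional percentages above can be recomputed directly from the four Titanic counts. The following is a minimal Python sketch; the variable names are illustrative and are not part of the original lecture, which reads the percentages off a software-generated table.

```python
# Counts from the Titanic table: rows are SEX, columns are SURVIVED.
counts = {
    "female": {"no": 126, "yes": 344},
    "male": {"no": 1364, "yes": 367},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Joint distribution: each cell count divided by the grand total.
joint = {
    sex: {status: n / grand_total for status, n in row.items()}
    for sex, row in counts.items()
}

# Marginal distributions: row totals and column totals over the grand total.
marginal_sex = {sex: sum(row.values()) / grand_total for sex, row in counts.items()}
marginal_survived = {
    status: sum(counts[sex][status] for sex in counts) / grand_total
    for status in ("no", "yes")
}

# Conditional distribution of survival given sex: each cell over its row total.
survival_given_sex = {
    sex: {status: n / sum(row.values()) for status, n in row.items()}
    for sex, row in counts.items()
}

for sex in counts:
    print(f"P(survived | {sex}) = {survival_given_sex[sex]['yes']:.2%}")
```

Comparing the two conditional distributions, rather than the joint percentages, is what reveals the association: the rows are rescaled so that the unequal numbers of women and men on board no longer matter.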

Lurking Variables

• We know that lurking variables can produce nonsensical relationships between two quantitative variables.

• Does the same hold true for relationships between categorical variables?

• Example – We have the number of delayed and on-time flights for two airlines, Alaska Airlines (AA) and America West (AW). Which one has more flights that leave on-time?

Lurking Variables (cont.)

• Looking at the contingency table below, it looks like America West has a larger percentage of on-time flights. But…

               Status (Count, Row %)
Airline        delay             on-time            Total
AA             501 (13.27%)      3274 (86.73%)      3775
AW             787 (10.89%)      6438 (89.11%)      7225
Total          1288              9712               11000

Lurking Variables (cont.)

• Let’s look at the data for the individual cities.

                            Status (Count, Row %)
City            Airline     delay             on-time            Total
Los Angeles     AA          62 (11.09%)       497 (88.91%)       559
                AW          117 (14.43%)      694 (85.57%)       811
                Total       179               1191               1370
Phoenix         AA          12 (5.15%)        221 (94.85%)       233
                AW          415 (7.90%)       4840 (92.10%)      5255
                Total       427               5061               5488
San Diego       AA          20 (8.62%)        212 (91.38%)       232
                AW          65 (14.51%)       383 (85.49%)       448
                Total       85                595                680
Seattle         AA          305 (14.21%)      1841 (85.79%)      2146
                AW          61 (23.28%)       201 (76.72%)       262
                Total       366               2042               2408
San Francisco   AA          102 (16.86%)      503 (83.14%)       605
                AW          129 (28.73%)      320 (71.27%)       449
                Total       231               823                1054


• For each individual city, the percentage of flights that are on-time is higher for Alaska Airlines than it is for America West.

• On the other hand, the percentage of flights that are on-time is higher for America West than for Alaska Airlines when we look at the aggregate.

• What’s going on here?


• An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is Simpson’s paradox.

• Simpson’s paradox is an extreme form of the fact that observed associations can be misleading in the presence of lurking variables.

• Our case is an example of Simpson’s paradox, so what is the lurking variable here?


• The lurking variable here is the city, and in particular, the weather of that city.

• Of the five cities listed, Seattle has the worst weather, so flights there are delayed more often. Phoenix, on the other hand, is not plagued with bad weather, so flights there are on time more often.

• Most of Alaska Airlines’ flights involve Seattle, whereas America West’s flights mostly involve Phoenix!
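
The reversal can be checked numerically. The sketch below uses the delayed and on-time counts from the city tables in this lecture; the dictionary layout and the helper `on_time_rate` are illustrative choices, not something from the original slides.

```python
# (delayed, on-time) counts per city and airline, from the city tables.
flights = {
    "Los Angeles":   {"AA": (62, 497),   "AW": (117, 694)},
    "Phoenix":       {"AA": (12, 221),   "AW": (415, 4840)},
    "San Diego":     {"AA": (20, 212),   "AW": (65, 383)},
    "Seattle":       {"AA": (305, 1841), "AW": (61, 201)},
    "San Francisco": {"AA": (102, 503),  "AW": (129, 320)},
}


def on_time_rate(delayed, on_time):
    """Fraction of flights that were on time."""
    return on_time / (delayed + on_time)


# Within each city, does Alaska Airlines have the higher on-time rate?
aa_wins_every_city = all(
    on_time_rate(*by_airline["AA"]) > on_time_rate(*by_airline["AW"])
    for by_airline in flights.values()
)

# Combine the cities into one group per airline.
totals = {}
for airline in ("AA", "AW"):
    delayed = sum(flights[city][airline][0] for city in flights)
    on_time = sum(flights[city][airline][1] for city in flights)
    totals[airline] = (delayed, on_time)

aggregate = {airline: on_time_rate(*t) for airline, t in totals.items()}

# Share of each airline's flights in its busiest, weather-dependent city.
aa_share_seattle = sum(flights["Seattle"]["AA"]) / sum(totals["AA"])
aw_share_phoenix = sum(flights["Phoenix"]["AW"]) / sum(totals["AW"])

print("AA better in every city:", aa_wins_every_city)
print("AW better in aggregate:", aggregate["AW"] > aggregate["AA"])
print(f"AA flights in Seattle: {aa_share_seattle:.1%}")
print(f"AW flights in Phoenix: {aw_share_phoenix:.1%}")
```

The last two lines show why the aggregate is misleading: each airline's overall rate is a weighted average of its city rates, and the weights differ sharply between the airlines.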

Contingency Tables – Wrap-up

• Most often, the contingency tables you’ll see will be of categorical variables with two levels each.

• Naturally, we can extend this to categorical variables with more than two levels.

• Also, we can consider a contingency table involving three variables; what we do in this case is create a series of contingency tables involving only the first two variables, one table for each of the levels of the third variable.

Collecting Data

• We’ve discussed previously the idea of exploratory data analysis.
– “What do we see in our data?”

• Formal statistical inference is another type of data analysis.
– Here, we are more interested in answering specific questions with a known degree of confidence.

• Either way, successful statistical analysis requires our data to be both reliable and accurate.

Collecting Data (cont.)

• The reliability and accuracy of our data depend on the method we use to collect our data. This method is known as a design.

• Some popular sources of data are
– Available data from libraries and the internet (available data are data that were produced in the past for some other purpose but that may help answer a present question)
– Observational studies
– Experimental studies

Observational vs Experimental Studies

• In an observational study, we observe individuals and measure variables of interest, but we do not attempt to influence the responses.

• In an experiment, we deliberately impose some treatment on individuals in order to observe their responses.

• An observational study is generally poor at gauging the effect of an intervention, but in many situations, we have to use an observational study.

Sample Surveys

• The sample survey is one specific type of observational study.

• Why is it preferred to a census?
– Financial constraints
– Time

• A sample survey can be conducted using
– Personal interviews
– Telephone interviews
– Self-administered questionnaires

Experiments

• Experimental units: individuals on which our experiment is conducted

• Subjects: human experimental units

• Treatment: specific experimental condition applied to our units

• In principle, experiments can give good evidence of causation.

Principles in Designing Experiments

• Control the effects of lurking variables on the response; the easiest way to do this is to compare two or more treatments. This can help reduce the bias in a study.

• Randomize – use chance to assign experimental units to treatments.

• Replicate each treatment on many units to reduce chance variation in the results.

More on Experiments

• In an experiment, we hope to see a difference in the responses so large that it is unlikely to happen because of chance variation alone.

• In other words, we are looking for a statistically significant effect.

• This term frequently appears in reports of studies and tells you that the investigators found good evidence for the effect they were seeking.

• The most serious weakness of experiments, though, is their lack of realism.

Types of Experimental Designs

• Completely randomized design: experimental units are allocated at random among treatments. Simplest design for experiments.

• Block design: blocks of experimental units are formed; random assignment of units to treatments is carried out separately within each block.

• Matched pairs design: special type of block design that compares only two treatments by choosing blocks of two units that are as closely matched as possible.
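
To make the difference between the first two designs concrete, here is a small Python sketch. The function names, the unit labels, and the two blocks are hypothetical; the sketch only assumes that "allocated at random" means shuffling the units and dealing them out evenly to the treatments.

```python
import random


def completely_randomized(units, treatments, seed=None):
    """Shuffle all units, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    assignment = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        assignment[treatments[i % len(treatments)]].append(unit)
    return assignment


def randomized_block(blocks, treatments, seed=None):
    """Carry out a separate random assignment inside each block."""
    rng = random.Random(seed)
    return {
        name: completely_randomized(units, treatments, seed=rng.randrange(2**32))
        for name, units in blocks.items()
    }


# Hypothetical experiment: 12 units, two treatments A and B.
units = [f"unit{i}" for i in range(1, 13)]
crd = completely_randomized(units, ["A", "B"], seed=1)

# Hypothetical blocks, e.g. formed from a characteristic known in advance.
blocks = {"block1": units[:6], "block2": units[6:]}
rbd = randomized_block(blocks, ["A", "B"], seed=1)

print(crd)
print(rbd)
```

A matched pairs design corresponds to the special case in which every block holds exactly two units and there are two treatments, so chance decides only which unit of the pair receives which treatment.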

Review: Population vs Sample

• Population: the entire group of individuals that we want information about

• Sample: the part of the population we actually examine in order to gather information

• Parameter: a value that describes the population. It is fixed, but generally unknown.

• Statistic: a value that describes the sample. It is observed once a sample is obtained and can be used to estimate an unknown parameter.

• We generally require that the sample be representative of the population.

Sampling Designs

• Voluntary response sample
– Biased sampling scheme

• Simple random sample

• Stratified random sample

• Cluster sample (one-stage and two-stage)

Sampling Designs

• A voluntary response sample consists of people who choose themselves by responding to a general appeal.

• This type of sample is invariably biased (contains a systematic error) and is not usually representative of the general population. Why?

• The people who are willing to respond are the only ones included in this sample, and usually those are the ones with very strong opinions.

• So what we get are the extreme cases.

Sampling Designs (cont.)

• Better sampling designs choose individuals by random chance so that the bias is eliminated.

• A simple random sample (SRS) of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected.

• How do we select an SRS?
– Assign a number to each individual in the population.
– Randomly select sample numbers by using a random numbers table or software package.
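
The two steps above can be sketched with Python's standard `random` module standing in for the "software package". The population of 200 labeled individuals and the sample size of 10 are made up for illustration.

```python
import random

# Hypothetical population of 200 individuals.
population = [f"person_{i:03d}" for i in range(1, 201)]

# Step 1: assign a number to each individual.
labels = {number: person for number, person in enumerate(population, start=1)}

# Step 2: randomly select n distinct numbers; every set of n numbers
# is equally likely to be chosen.
rng = random.Random(2004)  # fixed seed only so the example is repeatable
n = 10
chosen_numbers = rng.sample(range(1, len(population) + 1), k=n)

srs = [labels[number] for number in chosen_numbers]
print(srs)
```

`random.sample` draws without replacement, which matches the definition of an SRS: no individual can appear in the sample twice.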


• A probability sample is a sample chosen by chance and is the general framework for designs that use chance to choose a sample. Possible samples and the probability of each possible sample occurring must be known.

• The SRS is the simplest type of probability sample; it gives each member of the population an equal chance of selection.

• More complex designs are better for sampling from large populations.


• To select a stratified random sample, divide the population into groups of similar individuals, called strata. Then choose a separate SRS in each stratum and combine these SRSs to form the full sample.

Sex
• Male
• Female

Age
• under 20
• 20-30
• 31-40
• 41-50

Marital status
• Married
• Single


• We typically choose the strata based on facts we know before taking the sample.

• Strata for sampling are similar to blocks in experiments.

• Overall, using a stratified random sample, we can acquire information about
– The whole population
– Each stratum
– The relationships among the strata
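
A stratified random sample can be sketched as "group, then take an SRS in each group". In the Python sketch below, the function `stratified_sample`, the made-up population, and the choice of an equal number of individuals per stratum are all assumptions for illustration; the lecture does not say how many to draw from each stratum.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, n_per_stratum, seed=None):
    """Split the population into strata, draw an SRS in each, and combine."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for individual in population:
        strata[stratum_of(individual)].append(individual)

    sample = []
    for name in sorted(strata):
        members = strata[name]
        # Separate SRS within this stratum.
        sample.extend(rng.sample(members, k=min(n_per_stratum, len(members))))
    return sample


# Hypothetical population of 90 people, stratified by sex.
population = [
    {"id": i, "sex": "Female" if i % 3 == 0 else "Male"} for i in range(1, 91)
]
sample = stratified_sample(
    population, stratum_of=lambda p: p["sex"], n_per_stratum=5, seed=111
)
print(sample)
```

Because each stratum contributes its own SRS, the result can be summarized stratum by stratum as well as for the whole population.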

Sampling Designs (cont.)

• The SRS and stratified random sample both select individuals from the population.

• On the other hand, the cluster sample selects groups or clusters of individuals from the population. A cluster is also referred to as a primary sampling unit (PSU).

• In a one-stage cluster sample, all individuals within the selected clusters are selected.

• In a two-stage cluster sample, an SRS of the individuals within each selected cluster is drawn.
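
The one-stage and two-stage versions differ only in what happens inside the selected clusters, as the following Python sketch shows. The function `cluster_sample` and the six hypothetical schools of 20 students each are illustrative; the sketch assumes the clusters themselves are chosen by an SRS.

```python
import random


def cluster_sample(clusters, n_clusters, n_within=None, seed=None):
    """Select clusters at random; one-stage if n_within is None, else two-stage."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), k=n_clusters)

    sample = []
    for name in chosen:
        members = clusters[name]
        if n_within is None:
            # One-stage: every individual in a selected cluster is included.
            sample.extend(members)
        else:
            # Two-stage: an SRS of individuals within each selected cluster.
            sample.extend(rng.sample(members, k=min(n_within, len(members))))
    return chosen, sample


# Hypothetical clusters (PSUs): six schools with 20 students each.
clusters = {
    f"school_{c}": [f"school_{c}_student_{i}" for i in range(1, 21)]
    for c in "ABCDEF"
}

one_stage = cluster_sample(clusters, n_clusters=2, seed=4)
two_stage = cluster_sample(clusters, n_clusters=2, n_within=5, seed=4)

print(one_stage[0], len(one_stage[1]))
print(two_stage[0], len(two_stage[1]))
```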


• A two-stage cluster sample is an example of a multistage sampling design.

• This is a more complex design in which, as the name suggests, a sample is obtained by sampling in multiple stages.

• Basically, any sort of combination of an SRS, stratified random sample, and cluster sample can create a multistage sample.

Errors – Non-sampling vs Sampling

• Non-sampling errors occur due to mistakes made during the process of data acquisition.

• Increasing sample size will not reduce this type of error.

• There are three types of non-sampling errors:
– Errors in data acquisition, e.g., response bias
– Nonresponse errors
– Selection bias, such as undercoverage

Error in Data Acquisition

[Figure: population and sample. If an observation in the population is wrongly recorded in the sample, the result is sampling error + data acquisition error.]

Nonresponse Error

[Figure: population and sample. No response from part of the selected sample may lead to biased results.]

Selection Bias

[Figure: population and sample. When parts of the population cannot be selected, the sample cannot represent the whole population.]

Sampling Error

• Sampling error refers to differences between the sample and the population that arise because of the specific observations that happen to be selected.

• Sampling error is expected to occur when making a statement about the population based on the sample taken.

[Figure: population and sample. The sampling error is the difference between the sample mean x̄ and the population mean.]

Sampling Distributions

• The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population.

• The bias of a statistic is the difference between the mean of its sampling distribution and the population parameter; no bias = unbiased.

• The variability is described by the spread of its sampling distribution; determined by the design and size of the sample.

[Figure: four panels labeled "High bias, low variability", "Low bias, high variability", "High bias, high variability", and "Low bias, low variability".]

More on Sampling Errors

• We are often concerned with how to manage the bias and variability of a statistic.

• To reduce the bias, we use random sampling.
– Generally speaking, estimates drawn from an SRS are unbiased (which is why the SRS is so attractive).

• To reduce the variability of a statistic from an SRS, increase the sample size.

• There is a trade-off between bias and variability, however (i.e., we cannot make both very small).
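
A sampling distribution can be approximated by simulation: draw many SRSs of the same size and record the statistic each time. The Python sketch below does this for the sample mean. The population (10,000 values from a normal distribution with mean 50 and standard deviation 10), the sample sizes, and the 2,000 repetitions are all assumptions chosen for illustration; taking "all possible samples" literally would be far too many to enumerate.

```python
import random
import statistics

rng = random.Random(111)

# Hypothetical population; in practice the population values are unknown.
population = [rng.gauss(50, 10) for _ in range(10_000)]
mu = statistics.fmean(population)  # the parameter (population mean)


def sampling_distribution(n, repetitions=2000):
    """Sample means from repeated SRSs of size n drawn from the population."""
    return [
        statistics.fmean(rng.sample(population, k=n)) for _ in range(repetitions)
    ]


results = {}
for n in (10, 100):
    means = sampling_distribution(n)
    estimated_bias = statistics.fmean(means) - mu  # center minus parameter
    spread = statistics.stdev(means)               # variability of the statistic
    results[n] = (estimated_bias, spread)
    print(f"n = {n:3d}: estimated bias = {estimated_bias:+.3f}, spread = {spread:.3f}")
```

In this simulation the estimated bias of the sample mean stays near zero for both sample sizes, while the spread of the sampling distribution is noticeably smaller for the larger samples.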
