STAT 111 Introductory Statistics
Lecture 4: Collecting Data
May 24, 2004
Today’s Topics
• Relationships between categorical variables
• Collecting Data
– Designing experiments
– Choosing a sample
– Sampling distributions
Categorical Variables
• Recall that categorical variables separate individuals into groups.
• We’ve seen that scatterplots show relationships between quantitative variables.
• Similarly, to see relationships between categorical explanatory variables and quantitative responses, side-by-side boxplots are quite useful.
• What do we use to see the relationship between two categorical variables, though?
Contingency Table
• The contingency table is a two-way table with one variable as the row variable and the other as the column variable.
• The row totals and column totals in a two-way table give the marginal distributions of two variables separately.
• The conditional distribution of the response variable for each category of the explanatory variable can be used to describe the association between the two variables.
Contingency Table Example 1
• Titanic data – 2201 passengers, only the counts
Row variable: SEX. Column variable: SURVIVED.
Count     no      yes     Total
female    126     344     470
male      1364    367     1731
Total     1490    711     2201
Joint and Marginal Distributions
Contingency Analysis of SURVIVED By SEX (each cell: Count, with Total % in parentheses)
SEX       no              yes             Total
female    126 (5.72)      344 (15.63)     470 (21.35)
male      1364 (61.97)    367 (16.67)     1731 (78.65)
Total     1490 (67.70)    711 (32.30)     2201
Joint distribution: the Total % values in the four interior cells
Marginal distribution of SURVIVED: the Total row
Marginal distribution of SEX: the Total column
Conditional Distributions
Contingency Analysis of SURVIVED By SEX (Count, Total %, Col %, Row % for each cell)
SEX       SURVIVED    Count    Total %    Col %    Row %
female    no          126      5.72       8.46     26.81
female    yes         344      15.63      48.38    73.19
male      no          1364     61.97      91.54    78.80
male      yes         367      16.67      51.62    21.20
Row totals: female 470 (21.35%), male 1731 (78.65%)
Column totals: no 1490 (67.70%), yes 711 (32.30%), overall 2201
Row %: conditional distribution of survival given gender
Col %: conditional distribution of gender given survival
Example from Contingency Table 1
• Joint distribution:
– P( Male and survived ) = 16.67%
– P( Female and survived ) = 15.63%
• Marginal distribution:
– P( Survived ) = 32.30%
– P( Male ) = 78.65%
• Conditional distribution:
Survival          yes       no
Given a female    73.19%    26.81%
Given a male      21.20%    78.80%
Example from Contingency Table 1 (cont.)
• We see that of the people on board the ship, female survivors and male survivors made up roughly the same percentage.
• But the number of females on board was substantially smaller than the number of males.
• Looking at each category, we see that the percentage of females that survived is higher than the percentage of males that survived.
• Survival and gender seem to be associated.
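The percentages in the Titanic example can be reproduced from the four cell counts alone. The following is a minimal sketch in plain Python; the variable names are illustrative and not part of the original lecture.

```python
# Cell counts from the Titanic table: counts[sex][survived]
counts = {
    "female": {"no": 126, "yes": 344},
    "male": {"no": 1364, "yes": 367},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Joint distribution: each cell divided by the grand total.
joint = {
    sex: {s: n / grand_total for s, n in row.items()}
    for sex, row in counts.items()
}

# Marginal distributions: row totals and column totals over the grand total.
marginal_sex = {sex: sum(row.values()) / grand_total for sex, row in counts.items()}
marginal_survived = {
    s: sum(counts[sex][s] for sex in counts) / grand_total for s in ("no", "yes")
}

# Conditional distribution of survival given sex: each cell over its row total.
survival_given_sex = {
    sex: {s: n / sum(row.values()) for s, n in row.items()}
    for sex, row in counts.items()
}

for sex, dist in survival_given_sex.items():
    print(sex, {s: f"{100 * p:.2f}%" for s, p in dist.items()})
```

Dividing by the grand total gives the joint and marginal percentages, while dividing by a row total gives the conditional distribution that makes the association visible.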
Lurking Variables
• We know that lurking variables can produce nonsensical relationships between two quantitative variables.
• Does the same hold true for relationships between categorical variables?
• Example – We have the number of delayed and on-time flights for two airlines, Alaska Airlines (AA) and America West (AW). Which one has more flights that leave on-time?
Lurking Variables (cont.)
• Looking at the contingency table below, it looks like America West has a larger percentage of on-time flights. But…
Airline by Status (each cell: Count, with Row % in parentheses)
Airline    delay           on-time         Total
AA         501 (13.27)     3274 (86.73)    3775
AW         787 (10.89)     6438 (89.11)    7225
Total      1288            9712            11000
Lurking Variables (cont.)
• Let’s look at the data for the individual cities.
Airline by Status within each city (each cell: Count, with Row % in parentheses)
City             Airline    delay          on-time         Total
Los Angeles      AA         62 (11.09)     497 (88.91)     559
                 AW         117 (14.43)    694 (85.57)     811
                 Total      179            1191            1370
Phoenix          AA         12 (5.15)      221 (94.85)     233
                 AW         415 (7.90)     4840 (92.10)    5255
                 Total      427            5061            5488
San Diego        AA         20 (8.62)      212 (91.38)     232
                 AW         65 (14.51)     383 (85.49)     448
                 Total      85             595             680
Seattle          AA         305 (14.21)    1841 (85.79)    2146
                 AW         61 (23.28)     201 (76.72)     262
                 Total      366            2042            2408
San Francisco    AA         102 (16.86)    503 (83.14)     605
                 AW         129 (28.73)    320 (71.27)     449
                 Total      231            823             1054
Lurking Variables (cont.)
• For each individual city, the percentage of flights that are on-time is higher for Alaska Airlines than it is for America West.
• On the other hand, the percentage of flights that are on-time is higher for America West than for Alaska Airlines when we look at the aggregate.
• What’s going on here?
Lurking Variables (cont.)
• An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is Simpson’s paradox.
• Simpson’s paradox is an extreme form of the fact that observed associations can be misleading in the presence of lurking variables.
• Our case is an example of Simpson’s paradox, so what is the lurking variable here?
Lurking Variables (cont.)
• The lurking variable here is the city, and in particular, the weather of that city.
• Of the five cities listed, Seattle has the worst weather, so flights there tend to be delayed more often. Phoenix, on the other hand, is not plagued with bad weather, so flights there tend to be on time more often.
• Most of Alaska Airlines’ flights involve Seattle, whereas America West’s flights mostly involve Phoenix!
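The reversal can be checked directly from the city-level counts. The sketch below assumes the five city tables appear in the same order as the city labels on the slide (Los Angeles, Phoenix, San Diego, Seattle, San Francisco); the function and variable names are illustrative.

```python
# (delayed, on_time) counts per city and airline, taken from the city tables.
flights = {
    "Los Angeles":   {"AA": (62, 497),   "AW": (117, 694)},
    "Phoenix":       {"AA": (12, 221),   "AW": (415, 4840)},
    "San Diego":     {"AA": (20, 212),   "AW": (65, 383)},
    "Seattle":       {"AA": (305, 1841), "AW": (61, 201)},
    "San Francisco": {"AA": (102, 503),  "AW": (129, 320)},
}


def on_time_rate(delayed, on_time):
    """Fraction of flights that were on time."""
    return on_time / (delayed + on_time)


# Within every city, does Alaska Airlines have the higher on-time rate?
aa_wins_every_city = all(
    on_time_rate(*by_airline["AA"]) > on_time_rate(*by_airline["AW"])
    for by_airline in flights.values()
)

# Aggregate over cities: add up the counts first, then compute the rate.
totals = {}
for airline in ("AA", "AW"):
    delayed = sum(city[airline][0] for city in flights.values())
    on_time = sum(city[airline][1] for city in flights.values())
    totals[airline] = (delayed, on_time)

aw_wins_overall = on_time_rate(*totals["AW"]) > on_time_rate(*totals["AA"])

print("AA better in every city:", aa_wins_every_city)
print("AW better in aggregate: ", aw_wins_overall)
for city, by_airline in flights.items():
    share = sum(by_airline["AA"]) / sum(totals["AA"])
    print(f"{city}: {100 * share:.1f}% of all AA flights")
```

The last loop shows how unevenly Alaska Airlines' flights are spread across cities, which is what lets the aggregate comparison point the other way.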
Contingency Tables – Wrap-up
• Most often, the contingency tables you’ll see will be of categorical variables with two levels each.
• Naturally, we can extend this to categorical variables with more than two levels.
• Also, we can consider a contingency table involving three variables; what we do in this case is create a series of contingency tables involving only the first two variables, one table for each of the levels of the third variable.
Collecting Data
• We’ve discussed previously the idea of exploratory data analysis.
– “What do we see in our data?”
• Formal statistical inference is another type of data analysis.
– Here, we are more interested in answering specific questions with a known degree of confidence.
• Either way, successful statistical analysis requires our data to be both reliable and accurate.
Collecting Data (cont.)
• The reliability and accuracy of our data depend on the method we use to collect our data. This method is known as a design.
• Some popular sources of data are
– Available data from libraries and the internet (Available data are data that were produced in the past for some other purpose but that may help answer a present question.)
– Observational studies
– Experimental studies
Observational vs Experimental Studies
• In an observational study, we observe individuals and measure variables of interest, but we do not attempt to influence the responses.
• In an experiment, we deliberately impose some treatment on individuals in order to observe their responses.
• An observational study is generally poor at gauging the effect of an intervention, but in many situations, we have to use an observational study.
Sample Surveys
• The sample survey is one specific type of observational study.
• Why is it preferred to a census?
– Financial constraints
– Time
• A sample survey can be conducted using
– Personal interviews
– Telephone interviews
– Self-administered questionnaires
Experiments
• Experimental units: individuals on which our experiment is conducted
• Subjects: human experimental units
• Treatment: specific experimental condition applied to our units
• In principle, experiments can give good evidence of causation.
Principles in Designing Experiments
• Control the effects of lurking variables on the response; the easiest way to do this is to compare two or more treatments. This can help reduce the bias in a study.
• Randomize – use chance to assign experimental units to treatments.
• Replicate each treatment on many units to reduce chance variation in the results.
More on Experiments
• In an experiment, we hope to see a difference in the responses so large that it is unlikely to happen because of chance variation alone.
• In other words, we are looking for a statistically significant effect.
• This term frequently appears in reports of studies and tells you that the investigators found good evidence for the effect they were seeking.
• The most serious weakness of experiments, though, is their lack of realism.
Types of Experimental Designs
• Completely randomized design: experimental units are allocated at random among treatments. Simplest design for experiments.
• Block design: blocks of experimental units are formed; random assignment of units to treatments is carried out separately within each block.
• Matched pairs design: special type of block design that compares only two treatments by choosing blocks of two units that are as closely matched as possible.
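The difference between a completely randomized design and a block design can be sketched in a few lines. This is only an illustration under assumed inputs: the unit names, the two treatments "A" and "B", and the blocking by sex are hypothetical, and dealing shuffled units out in turn is just one simple way to randomize.

```python
import random


def completely_randomized(units, treatments, rng):
    """Shuffle all units, then deal them out to the treatments in turn."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    assignment = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        assignment[treatments[i % len(treatments)]].append(unit)
    return assignment


def block_design(blocks, treatments, rng):
    """Carry out a separate random assignment inside each block."""
    return {
        name: completely_randomized(units, treatments, rng)
        for name, units in blocks.items()
    }


rng = random.Random(111)  # fixed seed so the sketch is reproducible
units = [f"unit{i}" for i in range(1, 13)]  # hypothetical experimental units

crd = completely_randomized(units, ["A", "B"], rng)

# Hypothetical blocks: the first six units are female, the last six male.
blocks = {"female": units[:6], "male": units[6:]}
rbd = block_design(blocks, ["A", "B"], rng)

print(crd)
print(rbd)
```

A matched pairs design corresponds to calling `block_design` with blocks of exactly two units and two treatments.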
Review: Population vs Sample
• Population: the entire group of individuals that we want information about
• Sample: the part of the population we actually examine in order to gather information
• Parameter: a value that describes the population. It is fixed, but generally unknown.
• Statistic: a value that describes the sample. It is observed once a sample is obtained and can be used to estimate an unknown parameter.
• We generally require that the sample be representative of the population.
Sampling Designs
• Voluntary response sample
– Biased sampling scheme
• Simple random sample
• Stratified random sample
• Cluster sample (one-stage and two-stage)
Sampling Designs
• A voluntary response sample consists of people who choose themselves by responding to a general appeal.
• This type of sample is invariably biased (contains a systematic error) and is not usually representative of the general population. Why?
• The people who are willing to respond are the only ones included in this sample, and usually those are the ones with very strong opinions.
• So what we get are the extreme cases.
Sampling Designs (cont.)
• Better sampling designs choose individuals by random chance so that the bias is eliminated.
• A simple random sample (SRS) of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected.
• How do we select an SRS?
– Assign a number to each individual in the population.
– Randomly select sample numbers by using a random numbers table or software package.
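The two steps for selecting an SRS translate directly into software. Here is a minimal sketch using Python's standard `random` module as the software package; the roster of 200 students and the sample size of 10 are hypothetical.

```python
import random

# Hypothetical population: a roster of 200 students.
population = [f"student{i:03d}" for i in range(1, 201)]

rng = random.Random(2004)  # fixed seed so the sketch is reproducible

# Step 1: assign a number to each individual in the population.
labels = {number: person for number, person in enumerate(population, start=1)}

# Step 2: randomly select n of the numbers without replacement.
# random.sample makes every set of n numbers equally likely.
n = 10
chosen_numbers = rng.sample(sorted(labels), n)
srs = [labels[k] for k in chosen_numbers]

print(srs)
```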
Sampling Designs (cont.)
• A probability sample is a sample chosen by chance and is the general framework for designs that use chance to choose a sample. Possible samples and the probability of each possible sample occurring must be known.
• The SRS is the simplest type of probability sample; it gives each member of the population an equal chance of selection.
• More complex designs are better for sampling from large populations.
Sampling Designs (cont.)
• To select a stratified random sample, divide the population into groups of similar individuals, called strata. Then choose a separate SRS in each stratum and combine these SRSs to form the full sample.
Sex
• Male
• Female
Age
• under 20
• 20-30
• 31-40
• 41-50
Marital status
• Married
• Single
Sampling Designs (cont.)
• We typically choose the strata based on facts we know prior to taking the sample.
• Strata for sampling are similar to blocks in experiments.
• Overall, using a stratified random sample, we can acquire information about
– The whole population
– Each stratum
– The relationships among the strata
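A stratified random sample is a separate SRS in each stratum, combined at the end. The sketch below assumes a hypothetical population of 300 people stratified by sex, with per-stratum sample sizes chosen only for illustration.

```python
import random


def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Group individuals into strata, then draw a separate SRS in each one."""
    strata = {}
    for person in population:
        strata.setdefault(stratum_of(person), []).append(person)
    return {
        name: rng.sample(members, n_per_stratum[name])
        for name, members in strata.items()
    }


rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Hypothetical population: 200 males and 100 females.
population = [
    {"id": i, "sex": "male" if i % 3 else "female"} for i in range(300)
]

sample = stratified_sample(
    population,
    stratum_of=lambda person: person["sex"],
    n_per_stratum={"male": 20, "female": 10},
    rng=rng,
)

# Combine the per-stratum SRSs to form the full sample.
full_sample = [person for members in sample.values() for person in members]
print(len(full_sample), {name: len(members) for name, members in sample.items()})
```

Keeping the per-stratum samples separate before combining them is what allows statements about each stratum as well as about the whole population.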
Sampling Designs (cont.)
• The SRS and stratified random sample both select individuals from the population.
• On the other hand, the cluster sample selects groups or clusters of individuals from the population. A cluster is also referred to as a primary sampling unit (PSU).
• In a one-stage cluster sample, all individuals within the selected clusters are selected.
• In a two-stage cluster sample, an SRS of the individuals within each selected cluster is drawn.
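One-stage and two-stage cluster sampling differ only in what happens inside the selected clusters. The following sketch assumes hypothetical clusters (eight schools of 30 students each) and selects the clusters themselves by SRS, which is one common choice rather than a requirement stated in the lecture.

```python
import random


def cluster_sample(clusters, n_clusters, rng, n_within=None):
    """Select clusters (PSUs) by SRS.

    One-stage (n_within is None): keep every individual in each chosen cluster.
    Two-stage: draw an SRS of size n_within inside each chosen cluster.
    """
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = {}
    for name in chosen:
        members = clusters[name]
        if n_within is None:
            sample[name] = list(members)
        else:
            sample[name] = rng.sample(members, n_within)
    return sample


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical clusters: 8 schools with 30 students each.
clusters = {
    f"school{k}": [f"s{k}-{i}" for i in range(30)] for k in range(1, 9)
}

one_stage = cluster_sample(clusters, 3, rng)
two_stage = cluster_sample(clusters, 3, rng, n_within=5)

print({name: len(members) for name, members in one_stage.items()})
print({name: len(members) for name, members in two_stage.items()})
```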
Sampling Designs (cont.)
• A two-stage cluster sample is an example of a multistage sampling design.
• This is a more complex design in which, as the name suggests, a sample is obtained by sampling in multiple stages.
• Basically, any sort of combination of an SRS, stratified random sample, and cluster sample can create a multistage sample.
Errors – Non-sampling vs Sampling
• Non-sampling errors occur due to mistakes made during the process of data acquisition.
• Increasing sample size will not reduce this type of error.
• There are three types of non-sampling errors:
– Errors in data acquisition, e.g., response bias
– Nonresponse errors
– Selection bias, such as undercoverage
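The point that a larger sample does not reduce non-sampling error can be illustrated with undercoverage. In this sketch the population (the integers 0 to 999) and the sampling frame (only the values below 500) are hypothetical, chosen so the effect is easy to see.

```python
import random
from statistics import fmean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population with true mean 499.5.
population = list(range(1000))

# Undercoverage: only the lower half of the population can be selected.
frame = [x for x in population if x < 500]

small = fmean(rng.sample(frame, 10))   # small sample from the flawed frame
large = fmean(rng.sample(frame, 400))  # much larger sample, same flawed frame

print("population mean:", fmean(population))
print("n = 10  estimate:", small)
print("n = 400 estimate:", large)
```

The larger sample settles more tightly around the mean of the frame, not the mean of the population, so the selection bias remains.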
Error in Data Acquisition
[Figure: Population and Sample. If this observation… is wrongly recorded here… the result is sampling error + data acquisition error.]
Nonresponse Error
[Figure: Population and Sample. No response here… may lead to biased results here.]
Selection Bias
[Figure: Population and Sample. When parts of the population cannot be selected… the sample cannot represent the whole population.]
Sampling Error
• Sampling error refers to differences between the sample and the population, because of the specific observations that happen to be selected.
• Sampling error is expected to occur when making a statement about the population based on the sample taken.
[Figure: Population and Sample. The sampling error is the difference between the sample mean x̄ and the population mean.]
Sampling Distributions
• The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population.
• The bias of a statistic is the difference between the mean of its sampling distribution and the population parameter; no bias = unbiased.
• The variability is described by the spread of its sampling distribution; determined by the design and size of the sample.
[Figure: four panels labeled “High bias, low variability”, “Low bias, high variability”, “High bias, high variability”, and “Low bias, low variability”.]
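For a very small population, the sampling distribution of a statistic can be listed exactly by enumerating every possible sample. The sketch below uses a hypothetical population of six values and the sample mean as the statistic; bias and variability are then computed straight from their definitions.

```python
from itertools import combinations
from statistics import mean, pstdev

# Hypothetical population of six values; its mean is the parameter.
population = [2, 4, 6, 8, 10, 12]
mu = mean(population)


def sampling_distribution(n):
    """Sample mean of every possible sample of size n (without replacement)."""
    return [mean(sample) for sample in combinations(population, n)]


dist2 = sampling_distribution(2)

# Bias: mean of the sampling distribution minus the population parameter.
bias2 = mean(dist2) - mu

# Variability: spread of the sampling distribution.
spread2 = pstdev(dist2)

print("number of possible samples:", len(dist2))
print("bias:", bias2)
print("spread:", round(spread2, 3))
```

Because each of the possible samples is equally likely under an SRS, the plain average over `dist2` is the mean of the sampling distribution.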
More on Sampling Errors
• We are often concerned with how to manage the bias and variability of a statistic.
• To reduce the bias, we use random sampling.
– Generally speaking, estimates drawn from an SRS are unbiased (which is why the SRS is so attractive).
• To reduce the variability of a statistic from an SRS, increase the sample size.
• There is often a trade-off between bias and variability, however (i.e., in practice it can be hard to make both very small).
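The effect of sample size on the variability of a statistic from an SRS can be seen in a small simulation. This sketch assumes a hypothetical population of 10,000 values (roughly normal with mean 50 and standard deviation 10) and approximates each sampling distribution with 2,000 repeated samples.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 values.
population = [rng.gauss(50, 10) for _ in range(10_000)]
mu = fmean(population)


def simulate_means(n, reps=2000):
    """Sample means from repeated SRSs of size n."""
    return [fmean(rng.sample(population, n)) for _ in range(reps)]


results = {}
for n in (10, 100):
    means = simulate_means(n)
    # (estimated bias, estimated spread) of the sample mean at this n
    results[n] = (fmean(means) - mu, pstdev(means))

for n, (bias, spread) in results.items():
    print(f"n = {n:3d}: estimated bias {bias:+.3f}, spread {spread:.3f}")
```

In this setup the estimated bias stays near zero at both sample sizes, while the spread of the sample means shrinks as n grows.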