Post on 11-Jan-2016
Sociology 5811: Lecture 7: Samples, Populations, The Sampling Distribution
Copyright © 2005 by Evan Schofer
Do not copy or distribute without permission
Announcements
• Problem Set #2 due today!
Review: Populations
• Population: The entire set of persons, objects, or events that have at least one common characteristic of interest to a researcher (Knoke, p. 15)
• Beyond literal definition, a population is the general group that we wish to study and gain insight into
• Sample: A subset of a population
• Random Sample: A sample chosen from a population such that each observation has an equal chance of being selected (Knoke, p. 77)
• Randomness is one strategy to avoid biased samples.
Review: Statistical Inference
• Statistical inference: making statistical generalizations about a population from evidence contained in a sample (Knoke, p. 77)
• When is statistical inference likely to work?
• 1. When a sample is large
• If a sample approaches the size of the population, it is likely to be a good reflection of that population
• 2. When a sample is representative of the entire population
• As opposed to a sample that is atypical in some way, and thus not reflective of the larger group.
Populations and Samples
• Population parameters (μ, σ) are constants
• There is one true value, but it is usually unknown
• Sample statistics (Y-bar, s) are variables
• Up until now we’ve treated them as constants
• But, there are many possible samples
• The values of the mean and S.D. vary depending on which sample you have
• Like any variable, the mean and S.D. have a distribution
• Called the “sampling distribution”
• Made up of the values of the statistic from all possible samples of a given population
Populations and Samples: Overview
                       Population                       Sample
Characteristics        “parameters”                     “statistics”
Characteristics are:   constant (one for population)    variables (varies for each sample)
Notation               Greek (μ, σ)                     Roman (Y-bar, s)
Estimate               “hat”: “point estimate” based on sample (e.g., σ̂)
Population and Sample Distributions
[Figure: population and sample distributions, with labels Y-bar and s]
Estimating the Mean
• Suppose we want to know the mean of a population (μ). What do we do?
• Plan A: Spend $100 million to survey our entire population
• If it is even possible to survey the whole population
• Plan B: Spend $1,000 sampling a few hundred people.
• Estimate the mean
• Simply use formulas to estimate mu: μ̂
Estimating the Mean
• Question: Given our sample, what is our best guess of the population mean?
• Answer: The sample mean: Y-bar
• Look at Y-bar, assume that it is a “good guess”
• Thus, we calculate:
$$\hat{\mu} = \bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i$$
Estimating the Mean
• Issue: There are an enormous number of possible samples that one can take from most populations
– Each possible sample has a mean, most of which are different
• Some are close to the population mean, some not
• Q: How do we know if we got a “good guess”?
• A: We can’t know for sure. We may draw incorrect conclusions about the mean
• But: We can use probability theory to determine if our guess is likely to be good!
Estimates and Sampling Distributions
• It is possible to take more than one sample
• And calculate more than one estimate of the mean
• If we took many samples (and calculated many means), we’d see a range of estimates
• We could even plot a histogram of the many estimates
• Our confidence in our guess depends on how “spread out” the range of guesses tends to be
• The “standard deviation” of that particular histogram.
Sampling Distributions
• Sampling Distribution: The distribution of estimates created by taking all possible unique samples (of a fixed size) from a population
• Example: Take every possible 10-person sample of sociology graduate students (all combinations)
• 1. Calculate the mean of each sample
• 2. Graph a histogram of all estimates
• This is called “the sampling distribution of the mean”
• Note: The sampling distribution is rarely known
• It is typically thought of as a probability distribution.
Sampling Distribution Notation
• Population mean and S.D. are: μ, σ
• Each sample has a mean and S.D.: Y-bar, s
• The sampling distribution of the mean (i.e., the distribution of mean-estimates) also has a mean
• And a S.D., aka the “standard error”
• Mean, S.D. of sampling distribution: $\mu_{\bar{Y}}$, $\sigma_{\bar{Y}}$
• Question: Why are they Greek?
• A: Because all possible samples represent a population
• Question: Why is there a sub-Y-bar?
• A: Because it is the mean of all possible Y-bars (means)
Sampling Distribution of the Mean
• It turns out that under some circumstances, the shape of the sampling distribution of the mean can be determined
– Thus allowing one to get a sense of the range of estimates of the mean one is likely to see
• If the distribution is narrow, our guess is probably good!
• If S.D. is large, our guess may be quite bad
• This provides insight into the probable location of the population mean
• Even if you only have one single sample to look at
• This “trick” lets us draw conclusions!!!
Sampling Distribution Example
• Let’s create a sampling distribution from a small population (N = 5) with μ = 52. (Sample N = 3)
Case # of CDs
1 30
2 100
3 20
4 70
5 40
• Note how the mean varies depending on the sample
• Mean of cases 1,2,3 = 50
• Mean of 2,4,5 = 70
• For this population (N=5) we can calculate all possible means based on sample size 3
Sampling Distribution Example
• First, we must calculate every possible mean
• 1,2,3 = 50
• 1,2,4 = 66.67
• 1,2,5 = 56.67
• 1,3,4 = 40
• 1,3,5 = 30
• 1,4,5 = 46.67
• 2,3,4 = 63.33
• 2,3,5 = 53.33
• 2,4,5 = 70
• 3,4,5 = 43.33
Sampling Distribution Example
• Here, you can see how the sample mean is really a variable
• This complete list of all possible means is the sampling distribution
• As a probability distribution, this tells us the probability of picking a sample with each mean
• Note: Sampling Dist mean = 52
• Same as population mean!
Sample Y-bar
1 50
2 66.67
3 56.67
4 40
5 30
6 46.67
7 63.33
8 53.33
9 70
10 43.33
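The enumeration above can be reproduced in a few lines. This is only a sketch using Python's standard library; the variable names (`cds`, `sample_means`) are illustrative and not part of the lecture.

```python
from itertools import combinations
from statistics import mean

# Number of CDs owned by each of the five cases in the example population
cds = {1: 30, 2: 100, 3: 20, 4: 70, 5: 40}

population_mean = mean(cds.values())

# Every unique sample of size 3 (order does not matter), keyed by case numbers
sample_means = {
    cases: mean(cds[c] for c in cases)
    for cases in combinations(cds, 3)
}

for cases, y_bar in sample_means.items():
    print(cases, round(y_bar, 2))

# The mean of all possible sample means equals the population mean
sampling_dist_mean = mean(sample_means.values())
print("population mean:", population_mean)
print("mean of sampling distribution:", round(sampling_dist_mean, 2))
```

With 5 cases and samples of size 3 there are exactly 10 combinations, which matches the 10 rows in the table.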
Sampling Distribution Example
• Histogram of Sampling Distribution (N=3):
[Figure: histogram of the 10 sample means; x-axis bins 17-27, 27-37, 37-47, 47-57, 57-67, 67-77, 77-87; y-axis frequency from 0 to 4; μ = 52 marked]
• Note: The distribution centers around the population mean
• And, it is roughly symmetrical
Sampling Distribution Example
• As a probability distribution, the sampling distribution gives a sense of the quality of our estimate of μ
[Figure: the same histogram rescaled to probabilities; x-axis bins 17-27 through 77-87; y-axis probability from 0 to .4; μ = 52 marked]
• Probability = Frequency / N (N = number of possible samples, here 10)
• The probability of picking a sample with a mean that is within +/- 5 of μ is p = .3 (30%)
• The probability of overestimating μ by more than 15 is about p = .1 (10%)
• Q: What is the probability of a “poor” estimate of μ?
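The two probabilities quoted on this slide can be checked by counting sample means directly. The sketch below assumes the five-case CD population from the example; the helper `probability` is a hypothetical name introduced only for illustration.

```python
from itertools import combinations

cds = [30, 100, 20, 70, 40]          # the example population
mu = sum(cds) / len(cds)             # population mean (52)

# Means of all 10 possible samples of size 3
means = [sum(sample) / 3 for sample in combinations(cds, 3)]

def probability(condition):
    """Share of all possible samples whose mean satisfies `condition`."""
    return sum(1 for m in means if condition(m)) / len(means)

# Estimate lands within +/- 5 of the population mean
p_within_5 = probability(lambda m: abs(m - mu) <= 5)

# Estimate overshoots the population mean by more than 15
p_over_15 = probability(lambda m: m - mu > 15)

print(p_within_5, p_over_15)
```

The same helper can answer the closing question once "poor" is given a concrete threshold, for example `probability(lambda m: abs(m - mu) > 15)`.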
Sampling Distribution Example
• Note: If the sampling distribution is narrow, most of our estimates of the mean will be good
• That is, they will be close to μ, the population mean
• If the sampling distribution is wide, the probability of a “bad” estimate goes up
• A measure of dispersion can help us assess the sampling distribution
• Recall: the standard deviation of a sampling distribution is called: the standard error
• It tells us the width of the sampling distribution!
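For the small CD example, the standard error can be computed directly, because every possible sample mean is known. A minimal sketch, assuming the same five-case population; `pstdev` is used because the 10 means are the complete sampling distribution rather than a sample from it.

```python
from itertools import combinations
from statistics import fmean, pstdev

cds = [30, 100, 20, 70, 40]

# All possible sample means for samples of size 3
sample_means = [fmean(sample) for sample in combinations(cds, 3)]

# Standard error = standard deviation of the sampling distribution
standard_error = pstdev(sample_means)
print(round(standard_error, 2))
```

In practice this direct calculation is rarely possible, since we usually have only one sample.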
The Central Limit Theorem
• But, how do we know the width of the sampling distribution?
• Statisticians have shown that the sampling distribution will have consistent properties, if we have a large sample
• Several of these properties constitute the “Central Limit Theorem”
• These properties provide the basis for drawing statistical inferences about the mean.
The Central Limit Theorem
• If you have a large sample (Large N):
• 1. The sampling distribution of the mean (and thus all possible estimates of the mean) clusters around the true population mean
• 2. They cluster as a normal curve
• Even if the population distribution is not normal
• 3. The estimates are dispersed around the population mean by a knowable standard deviation (sigma over root N)
The Central Limit Theorem
• Formally stated:
1. As N grows large, the sampling distribution of the mean approaches normality
2. $\mu_{\bar{Y}} = \mu_Y$
3. $\sigma_{\bar{Y}} = \frac{\sigma_Y}{\sqrt{N}}$
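These three properties can be illustrated with a small simulation. The sketch below is not from the lecture: it assumes a deliberately skewed (exponential) population with mean 1 and standard deviation 1, draws many samples of size 50, and compares the resulting sample means with what the theorem predicts.

```python
import math
import random
import statistics

rng = random.Random(7)   # fixed seed so the run is repeatable

N = 50                   # size of each sample
NUM_SAMPLES = 5000       # how many samples to draw
POP_MEAN = 1.0           # exponential population with rate 1 has mean 1
POP_SD = 1.0             # ... and standard deviation 1 (assumed population)

# Draw many samples from the skewed population and record each mean
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(N)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)
predicted_se = POP_SD / math.sqrt(N)

# In a normal curve, about 95% of values fall within 2 S.D. of the center
share_within_2se = sum(
    1 for m in sample_means if abs(m - POP_MEAN) <= 2 * predicted_se
) / NUM_SAMPLES

print(f"mean of sample means: {mean_of_means:.3f} (population mean {POP_MEAN})")
print(f"S.D. of sample means: {sd_of_means:.3f} (sigma / root N = {predicted_se:.3f})")
print(f"share within 2 standard errors: {share_within_2se:.3f}")
```

Even though the population is far from normal, the sample means center on the population mean, spread out by roughly sigma over root N, and fall within two standard errors about 95% of the time.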
Central Limit Theorem: Visually
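[Figure: distributions labeled with the sample statistics (Y-bar, s) and with a mean (μ) and standard deviation (σ)]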
Implications of the C.L.T
• What does this mean for us?
• Typically, we only have one sample, and thus only one estimate of μ
• The actual value of μ is unknown
• So we don’t know the center of the sampling distribution
• All we know for certain is that our estimate falls somewhere in the sampling distribution
• This is always true by definition
• And, later, we’ll estimate its width.
Implications of the C.L.T
• Visually: Suppose we observe mu-hat = 16
[Figure: several candidate sampling distributions, each centered on a different possible μ, with μ̂ = 16 marked in each]
• There are many possible locations of μ
• But, mu-hat always falls within the sampling distribution
Implications of the C.L.T
• We know that the mean from our sample falls somewhere in this sampling distribution
• Which has mean μ, standard deviation σ over square root N
• If we can estimate σ, we can estimate sigma over root N... The “Standard Error” of the mean
• We don’t know exactly where the sample falls
• But, laws of probability suggest that we are most likely to draw a sample with a mean from near the center
• Recall: about 68% fall +/- 1 SD, 95% +/- 2 SD in a normal curve
• So, we can determine the range around μ in which 95% (or 99%, or 99.9%) of cases will fall.
Implications of the C.L.T
• What is the relation between the Standard Error and the size of our sample (N)?
• Answer: It is an inverse relationship.
• The standard deviation of the sampling distribution shrinks as N gets larger
• Formula: $$\sigma_{\bar{Y}} = \frac{\sigma_Y}{\sqrt{N}}$$
• Conclusion: Estimates of the mean based on larger samples tend to cluster closer around the true population mean.
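A quick calculation shows how fast the standard error shrinks. The population standard deviation used here (`SIGMA = 10.0`) is a made-up value for illustration only, and `standard_error` is a hypothetical helper name.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean: sigma over root N."""
    return sigma / math.sqrt(n)

SIGMA = 10.0  # hypothetical population standard deviation

for n in (10, 50, 100, 400, 1000):
    print(f"N = {n:>4}: standard error = {standard_error(SIGMA, n):.3f}")
```

Because N sits under a square root, quadrupling the sample size only halves the standard error.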
Implications of the CLT
• Visually: The width of the sampling distribution is an inverse function of N (sample size)
– The distribution of mean estimates based on N = 10 will be more dispersed. Mean estimates based on N = 50 will cluster closer to μ.
[Figure: two sampling distributions with μ and μ̂ marked; panels labeled “Smaller sample size” and “Larger sample size”]
Confidence Intervals
• Benefits of knowing the width of the sampling distribution:
• 1. You can figure out the general range of error by which a given point estimate might miss
• based on the range around the true mean within which the estimates will fall
• 2. And, this defines the range around an estimate that is likely to hold the population mean
• A “confidence interval”
• Note: These only work if N is large!
Confidence Interval
• Confidence Interval: “A range of values around a point estimate that makes it possible to state the probability that an interval contains the population parameter between its lower and upper bounds.” (Bohrnstedt & Knoke p. 90)
• It involves a range and a probability
• Examples:
• We are 95% confident that the mean number of CDs owned by grad students is between 20 and 45
• We are 50% confident the mean rainfall this year will be between 12 and 22 inches.
Confidence Interval
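• Visually: It is probable that μ falls near mu-hat (here, μ̂ = 16)
[Figure: curve centered on μ̂ = 16, with a central region labeled “Probable values of μ” and outer regions labeled “Range where μ is unlikely to be”]
• Q: Can μ be this far from mu-hat?
• Answer: Yes, but it is very improbable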
Confidence Interval
• To figure out the range of “error” in our mean estimate, we need to know the width of the sampling distribution
– The Standard Error! (The S.D. of this distribution)
• The Central Limit Theorem provides a formula:
$$\sigma_{\bar{Y}} = \frac{\sigma_Y}{\sqrt{N}}$$
• Problem: We do not know the exact value of sigma-sub-Y, the population standard deviation!
Confidence Interval
• Question: How do we calculate the standard error if we don’t know the population S.D.?
• Answer: We estimate it using the information we have
• Formula for best estimate:
$$\hat{\sigma}_{\bar{Y}} = \frac{s_Y}{\sqrt{N}}$$
• Where N is the sample size and s-sub-Y is the sample standard deviation
95% Confidence Interval Example
• Suppose we have a sample of 100 students with a mean SAT score of 1020 and a standard deviation of 200
• How do we find the 95% Confidence Interval?
• If N is large, we know that:
• 1. The sampling distribution is roughly normal
• 2. Therefore 95% of samples will yield a mean estimate within 2 standard deviations (of the sampling distribution) of the population mean (μ)
• Thus, 95% of the time, our estimates of μ (Y-bar) are within two “standard errors” of the actual value of μ.
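The "95% of the time" claim can be checked by simulation. This sketch assumes a hypothetical normal population with μ = 1020 and σ = 200 (values borrowed from the SAT example purely for illustration), draws many samples of N = 100, and counts how often Y-bar plus or minus two estimated standard errors contains μ.

```python
import math
import random
import statistics

rng = random.Random(5811)  # fixed seed so the run is repeatable

# Hypothetical population values, chosen only for illustration
MU, SIGMA = 1020, 200
N = 100          # size of each sample
TRIALS = 2000    # number of samples drawn

covered = 0
for _ in range(TRIALS):
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    y_bar = statistics.fmean(sample)
    # Estimated standard error: sample S.D. over root N
    std_error = statistics.stdev(sample) / math.sqrt(N)
    # Does Y-bar +/- 2 standard errors capture the true mean?
    if y_bar - 2 * std_error <= MU <= y_bar + 2 * std_error:
        covered += 1

coverage = covered / TRIALS
print(f"share of intervals containing mu: {coverage:.3f}")
```

The share should come out close to 0.95, though it varies slightly from run to run if the seed is changed.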
95% Confidence Interval
• Formula for 95% confidence interval:
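$$95\%\ \text{CI}: \bar{Y} \pm 2(\sigma_{\bar{Y}})$$
• Where Y-bar is the mean estimate and sigma (Y-bar) is the standard error
• Result: Two values – an upper and lower bound
• Adding our estimate of the standard error:
$$\bar{Y} \pm 2(\hat{\sigma}_{\bar{Y}}) = \bar{Y} \pm 2\left(\frac{s_Y}{\sqrt{N}}\right)$$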
95% Confidence Interval
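• Suppose we have a sample of 100 students with a mean SAT score of 1020 and a standard deviation of 200
• Calculate:
$$95\%\ \text{CI}: \bar{Y} \pm 2\left(\frac{s_Y}{\sqrt{N}}\right) = 1020 \pm 2\left(\frac{200}{\sqrt{100}}\right) = 1020 \pm 2\left(\frac{200}{10}\right) = 1020 \pm 2(20) = 1020 \pm 40$$
• Thus, we are 95% confident that the population mean falls between 980 and 1060.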
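The same calculation as a small function. This is a sketch using the lecture's rule of thumb of 2 standard errors (the exact normal-curve multiplier for 95% is about 1.96); the function name `confidence_interval_95` is hypothetical.

```python
import math

def confidence_interval_95(y_bar, s, n):
    """Approximate 95% CI for the mean: Y-bar +/- 2 * (s / root N).

    Assumes N is large enough for the sampling distribution
    to be roughly normal.
    """
    std_error = s / math.sqrt(n)   # estimated standard error
    margin = 2 * std_error         # lecture's multiplier of 2
    return (y_bar - margin, y_bar + margin)

# SAT example: N = 100, mean = 1020, standard deviation = 200
lower, upper = confidence_interval_95(1020, 200, 100)
print(lower, upper)
```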