Sociology 5811: Lecture 7: Samples, Populations, The Sampling Distribution Copyright © 2005 by Evan...

Do not copy or distribute without permission

Announcements

• Problem Set #2 due today!

Review: Populations

• Population: The entire set of persons, objects, or events that have at least one common characteristic of interest to a researcher (Knoke, p. 15)

• Beyond literal definition, a population is the general group that we wish to study and gain insight into

• Sample: A subset of a population

• Random Sample: A sample chosen from a population such that each observation has an equal chance of being selected (Knoke, p. 77)

• Randomness is one strategy to avoid biased samples.

Review: Statistical Inference

• Statistical inference: making statistical generalizations about a population from evidence contained in a sample (Knoke, p. 77)

• When is statistical inference likely to work?

• 1. When a sample is large

• If a sample approaches the size of the population, it is likely to be a good reflection of that population

• 2. When a sample is representative of the entire population

• As opposed to a sample that is atypical in some way, and thus not reflective of the larger group.

Populations and Samples

• Population parameters (μ, σ) are constants

• There is one true value, but it is usually unknown

• Sample statistics (Y-bar, s) are variables

• Up until now we’ve treated them as constants

• But, there are many possible samples

• The values of the mean and S.D. vary depending on which sample you have

• Like any variable, the mean and S.D. have a distribution

• Called the “sampling distribution”

• Made up of the values of the statistic across all possible samples from a given population

Populations and Samples: Overview

                       Population                      Sample

Characteristics        “parameters”                    “statistics”

Characteristics are:   constant (one for population)   variables (varies for each sample)

Notation               Greek (μ, σ)                    Roman (Y-bar, s)

Estimate               “hat” (e.g., μ̂)                 “point estimate” based on sample

Population and Sample Distributions

Estimating the Mean

• Suppose we want to know the mean of a population (μ). What do we do?

• Plan A: Spend $100 million to survey our entire population

• If it is even possible to survey the whole population

• Plan B: Spend $1,000 sampling a few hundred people.

• Estimate the mean

• Simply use formulas to estimate mu: μ̂

Estimating the Mean

• Question: Given our sample, what is our best guess of the population mean?

• Answer: The sample mean: Y-bar

• Look at Y-bar, assume that it is a “good guess”

• Thus, we calculate:

$$\hat{\mu} = \bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i$$

Estimating the Mean

• Issue: There are a huge number of possible samples that one can take from any population

– Each possible sample has a mean, most of which are different

• Some are close to the population mean, some not

• Q: How do we know if we got a “good guess”?

• A: We can’t know for sure. We may draw incorrect conclusions about the mean

• But: We can use probability theory to determine if our guess is likely to be good!

Estimates and Sampling Distributions

• It is possible to take more than one sample

• And calculate more than one estimate of the mean

• If we took many samples (and calculated many means), we’d see a range of estimates

• We could even plot a histogram of the many estimates

• Our confidence in our guess depends on how “spread out” the range of guesses tends to be

• The “standard deviation” of that particular histogram.
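The idea of taking many samples and plotting the estimates can be tried out with a small simulation. The sketch below is only an illustration: the population is made up (10,000 uniform random values), and the sample size of 50 and the 1,000 draws are arbitrary assumptions, not numbers from the lecture.

```python
import random
import statistics

random.seed(5811)  # arbitrary seed so the sketch is repeatable

# Hypothetical population: 10,000 made-up values (not data from the lecture).
population = [random.uniform(0, 100) for _ in range(10_000)]
mu = statistics.mean(population)

def sample_means(pop, n, draws):
    """Draw `draws` random samples of size n and return each sample's mean."""
    return [statistics.mean(random.sample(pop, n)) for _ in range(draws)]

means = sample_means(population, n=50, draws=1_000)

# Crude text histogram of the estimates, in bins 2 units wide.
bins = {}
for m in means:
    left = int(m // 2) * 2
    bins[left] = bins.get(left, 0) + 1
for left in sorted(bins):
    print(f"{left:3d}-{left + 2:<3d} {'#' * (bins[left] // 10)}")

# The spread of the estimates is what our confidence depends on.
spread = statistics.stdev(means)
print(f"population mean       = {mu:.2f}")
print(f"mean of the estimates = {statistics.mean(means):.2f}")
print(f"S.D. of the estimates = {spread:.2f}")
```

The histogram should pile up around the population mean, and the printed S.D. of the estimates is the "spread" the last bullet refers to.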

Sampling Distributions

• Sampling Distribution: The distribution of estimates created by taking all possible unique samples (of a fixed size) from a population

• Example: Take every possible 10-person sample of sociology graduate students (all combinations)

• 1. Calculate the mean of each sample

• 2. Graph a histogram of all estimates

• This is called “the sampling distribution of the mean”

• Note: The sampling distribution is rarely known

• It is typically thought of as a probability distribution.

Sampling Distribution Notation

• Population mean and S.D. are: μ, σ

• Each sample has a mean and S.D.: Y-bar, s

• The sampling distribution of the mean (i.e., the distribution of mean-estimates) also has a mean

• And a S.D., aka the “standard error”

• Mean, S.D. of sampling distribution: $\mu_{\bar{Y}}$, $\sigma_{\bar{Y}}$

• Question: Why are they Greek?

• A: Because all possible samples represent a population

• Question: Why is there a sub-Y-bar?

• A: Because it is the mean of all possible Y-bars (means)

Sampling Distribution of the Mean

• It turns out that under some circumstances, the shape of the sampling distribution of the mean can be determined

– Thus allowing one to get a sense of the range of estimates of the mean one is likely to see

• If the distribution is narrow, our guess is probably good!

• If S.D. is large, our guess may be quite bad

• This provides insight into the probable location of the population mean

• Even if you only have one single sample to look at

• This “trick” lets us draw conclusions!!!

Sampling Distribution Example

• Let’s create a sampling distribution from a small population, μ = 52. (Sample N = 3)

Case    # of CDs

1       30

2       100

3       20

4       70

5       40

• Note how the mean varies depending on the sample

• Mean of cases 1,2,3 = 50

• Mean of 2,4,5 = 70

• For this population (N=5) we can calculate all possible means based on sample size 3

• First, we must calculate every possible mean

• 1,2,3 = 50

• 1,2,4 = 66.67

• 1,2,5 = 56.67

• 1,3,4 = 40

• 1,3,5 = 30

• 1,4,5 = 46.67

• 2,3,4 = 63.33

• 2,3,5 = 53.33

• 2,4,5 = 70

• 3,4,5 = 43.33

• Here, you can see how the sample mean is really a variable

• This complete list of all possible means is the sampling distribution

• As a probability distribution, this tells us the probability of picking a sample with each mean

• Note: Sampling Dist mean = 52

• Same as population mean!

Sample   Y-bar

1        50

2        66.67

3        56.67

4        40

5        30

6        46.67

7        63.33

8        53.33

9        70

10       43.33
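The enumeration of every possible 3-case sample can be reproduced in a few lines. This is a sketch under one loud assumption: the per-case CD counts used below (30, 100, 20, 70, 40) are the values implied by the ten sample means listed in the example, not figures quoted directly from the slide.

```python
from itertools import combinations
from statistics import mean

# ASSUMPTION: CD counts per case, chosen to match the ten listed sample means.
cds = {1: 30, 2: 100, 3: 20, 4: 70, 5: 40}

population_mean = mean(cds.values())

# Every possible unique sample of size 3, paired with its mean.
sampling_dist = {
    cases: mean(cds[c] for c in cases)
    for cases in combinations(sorted(cds), 3)
}

for cases, y_bar in sampling_dist.items():
    print(cases, round(y_bar, 2))

print("population mean:", population_mean)
print("mean of the sample means:", round(mean(sampling_dist.values()), 2))
```

With five cases there are exactly ten samples of size 3, and the mean of their means matches the population mean of 52.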

• Histogram of Sampling Distribution (N=3):

[Figure: histogram of the ten sample means. X-axis bins: 17-27, 27-37, 37-47, 47-57, 57-67, 67-77, 77-87; frequencies: 0, 1, 3, 3, 2, 1, 0]

• Note: The distribution centers around the population mean

• And, it is roughly symmetrical

Sampling Distribution Example

• As a probability distribution, the sampling distribution gives a sense of the quality of our estimate of μ

[Figure: histogram of the ten sample means. X-axis bins: 17-27, 27-37, 37-47, 47-57, 57-67, 67-77, 77-87]

• Probability = Frequency / N

• The probability of picking a sample with a mean that is within +/- 5 of μ is p = .3 (30%)

• The probability of overestimating μ by more than 15 is about p = .1 (10%)

• Q: What is the probability of a “poor” estimate of μ?
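These probabilities are just counts over the ten equally likely samples, so they can be checked directly. The sketch below uses the ten sample means listed earlier in the example; the definition of a "poor" estimate (missing μ by more than 15 in either direction) is an assumption made here only to give the closing question a concrete answer.

```python
# The ten sample means listed earlier in the example (samples of size 3).
sample_means = [50, 66.67, 56.67, 40, 30, 46.67, 63.33, 53.33, 70, 43.33]
mu = 52  # population mean from the example

def probability(condition):
    """Share of the equally likely samples whose mean satisfies `condition`."""
    hits = sum(1 for y_bar in sample_means if condition(y_bar))
    return hits / len(sample_means)

p_close = probability(lambda y: abs(y - mu) <= 5)    # within +/- 5 of mu
p_over_15 = probability(lambda y: y - mu > 15)       # overestimate by more than 15
# ASSUMPTION: "poor" means missing mu by more than 15 in either direction.
p_poor = probability(lambda y: abs(y - mu) > 15)

print(p_close, p_over_15, p_poor)  # 0.3 0.1 0.2
```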

• Note: If the sampling distribution is narrow, most of our estimates of the mean will be good

• That is, they will be close to μ, the population mean

• If the sampling distribution is wide, the probability of a “bad” estimate goes up

• A measure of dispersion can help us assess the sampling distribution

• Recall: the standard deviation of a sampling distribution is called: the standard error

• It tells us the width of the sampling distribution!

The Central Limit Theorem

• But, how do we know the width of the sampling distribution?

• Statisticians have shown that the sampling distribution will have consistent properties, if we have a large sample

• Several of these properties constitute the “Central Limit Theorem”

• These properties provide the basis for drawing statistical inferences about the mean.

• If you have a large sample (Large N):

• 1. The sampling distribution of the mean (and thus all possible estimates of the mean) clusters around the true population mean

• 2. The estimates cluster as a normal curve

• Even if the population distribution is not normal

• 3. The estimates are dispersed around the population mean by a knowable standard deviation (sigma over root N)

• Formally stated:

1. As N grows large, the sampling distribution of the mean approaches normality

2. $\mu_{\bar{Y}} = \mu_Y$

3. $\sigma_{\bar{Y}} = \dfrac{\sigma_Y}{\sqrt{N}}$
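The three properties can be checked by simulation. The sketch below is an illustration under stated assumptions: the population is a made-up, strongly skewed (exponential) set of 20,000 values, each sample has N = 100, and 2,000 random samples stand in for "all possible samples". Samples are drawn with replacement so that sigma over root N applies without a finite-population adjustment.

```python
import math
import random
import statistics

random.seed(7)  # arbitrary seed for repeatability

# Hypothetical skewed population: exponential values with mean near 10.
population = [random.expovariate(1 / 10) for _ in range(20_000)]
mu_Y = statistics.fmean(population)
sigma_Y = statistics.pstdev(population)

N = 100        # size of each sample
DRAWS = 2_000  # number of samples drawn (a stand-in for "all possible")

# random.choices samples with replacement.
y_bars = [statistics.fmean(random.choices(population, k=N)) for _ in range(DRAWS)]

mu_Ybar = statistics.fmean(y_bars)
sigma_Ybar = statistics.pstdev(y_bars)
predicted = sigma_Y / math.sqrt(N)

# Share of sample means within 2 standard errors of mu_Y (about .95 if normal).
within_2 = sum(abs(y - mu_Y) <= 2 * predicted for y in y_bars) / DRAWS

print(f"mu_Y = {mu_Y:.2f}, mean of Y-bars = {mu_Ybar:.2f}")
print(f"sigma_Y / sqrt(N) = {predicted:.3f}, S.D. of Y-bars = {sigma_Ybar:.3f}")
print(f"share within 2 standard errors = {within_2:.3f}")
```

Even though the population is far from normal, the mean of the Y-bars should land close to the population mean, their S.D. close to sigma over root N, and roughly 95% of them within two standard errors.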

Central Limit Theorem: Visually

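[Figure: normal-shaped sampling distribution of the mean, centered at $\mu_{\bar{Y}}$ with spread $\sigma_{\bar{Y}}$]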

Implications of the C.L.T

• What does this mean for us?

• Typically, we only have one sample, and thus only one estimate of μ

• The actual value of μ is unknown

• So we don’t know the center of the sampling distribution

• All we know for certain is that our estimate falls somewhere in the sampling distribution

• This is always true by definition

• And, later, we’ll estimate its width.

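Implications of the C.L.T

• Visually: Suppose we observe mu-hat = 16

[Figure: sampling distribution curves drawn at several possible locations around the observed μ̂ = 16, each centered on a different candidate μ]

• There are many possible locations for the sampling distribution

• But, mu-hat always falls within the sampling distribution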

• We know that the mean from our sample falls somewhere in this sampling distribution

• Which has mean μ, standard deviation σ over square root N

• If we can estimate σ, we can estimate sigma over root N... The “Standard Error” of the mean

• We don’t know exactly where the sample falls

• But, laws of probability suggest that we are most likely to draw a sample w/mean from near the center

• Recall: About 68% fall +/- 1 SD, about 95% +/- 2 SD in a normal curve

• So, we can determine the range around μ in which 95% (or 99%, or 99.9%) of cases will fall.
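The normal-curve shares behind this rule of thumb can be computed with Python's standard library. This is a small sketch using `statistics.NormalDist`; the helper name `share_within` is introduced here only for illustration.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal curve (mean 0, S.D. 1)

def share_within(k):
    """Proportion of a normal curve lying within +/- k S.D. of its mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"+/- {k} S.D.: {share_within(k):.4f}")

# Multipliers for common confidence levels (2 is a rounding of about 1.96).
for level in (0.95, 0.99, 0.999):
    k = z.inv_cdf(0.5 + level / 2)
    print(f"{level:.1%} of cases fall within +/- {k:.2f} S.D.")
```

The "2 SD" figure is a convenient rounding: the share within exactly 2 S.D. is a little over 95%, and the exact 95% multiplier is close to 1.96.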

• What is the relation between the Standard Error and the size of our sample (N)?

• Answer: It is an inverse relationship.

• The standard deviation of the sampling distribution shrinks as N gets larger

• Formula: $\sigma_{\bar{Y}} = \dfrac{\sigma_Y}{\sqrt{N}}$

• Conclusion: Estimates of the mean based on larger samples tend to cluster closer around the true population mean.
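A quick calculation shows how fast the standard error shrinks. In this sketch the population S.D. of 200 is an illustrative value assumed for the demonstration, and `standard_error` is a helper name introduced here.

```python
import math

def standard_error(sigma_y, n):
    """Standard error of the mean: sigma_Y / sqrt(N)."""
    return sigma_y / math.sqrt(n)

sigma_y = 200  # ASSUMPTION: illustrative population S.D.

for n in (25, 100, 400, 1600):
    print(f"N = {n:5d}  standard error = {standard_error(sigma_y, n):6.2f}")
```

Because N sits under a square root, quadrupling the sample size only halves the standard error (40, 20, 10, 5 in the loop above).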

Implications of the CLT

• Visually: The width of the sampling distribution is an inverse function of N (sample size)

– The distribution of mean estimates based on N = 10 will be more dispersed. Mean estimates based on N = 50 will cluster closer to μ.

[Figure: two sampling distributions centered on μ, a wide one labeled “Smaller sample size” and a narrow one labeled “Larger sample size”]

Confidence Intervals

• Benefits of knowing the width of the sampling distribution:

• 1. You can figure out the general range of error by which a given point estimate might miss

• Based on the range around the true mean within which the estimates will fall

• 2. And, this defines the range around an estimate that is likely to hold the population mean

• A “confidence interval”

• Note: These only work if N is large!

Confidence Interval

• Confidence Interval: “A range of values around a point estimate that makes it possible to state the probability that an interval contains the population parameter between its lower and upper bounds.” (Bohrnstedt & Knoke p. 90)

• It involves a range and a probability

• Examples:

• We are 95% confident that the mean number of CDs owned by grad students is between 20 and 45

• We are 50% confident the mean rainfall this year will be between 12 and 22 inches.

Confidence Interval

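• Visually: It is probable that μ falls near mu-hat

[Figure: sampling distribution drawn around μ̂ = 16, with the central region labeled “Probable values of μ” and the tails labeled “Range where μ is unlikely to be”]

• Q: Can μ be this far from mu-hat?

• Answer: Yes, but it is very improbable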

Confidence Interval

• To figure out the range of “error” in our mean estimate, we need to know the width of the sampling distribution

– The Standard Error! (The S.D. of this distribution)

• The Central Limit Theorem provides a formula:

$$\sigma_{\bar{Y}} = \frac{\sigma_Y}{\sqrt{N}}$$

• Problem: We do not know the exact value of sigma-sub-Y, the population standard deviation!

Confidence Interval

• Question: How do we calculate the standard error if we don’t know the population S.D.?

• Answer: We estimate it using the information we have

• Formula for best estimate:

$$\hat{\sigma}_{\bar{Y}} = \frac{s_Y}{\sqrt{N}}$$

• Where N is the sample size and s-sub-Y is the sample standard deviation
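The estimate needs nothing beyond the sample itself. The sketch below computes it from raw observations; the nine data values are made up purely to show the arithmetic (a real application would want a large N), and `estimated_standard_error` is a helper name introduced here.

```python
import math
import statistics

def estimated_standard_error(sample):
    """Estimate the standard error as s_Y / sqrt(N), using only the sample."""
    n = len(sample)
    s_y = statistics.stdev(sample)  # sample S.D. (N - 1 in the denominator)
    return s_y / math.sqrt(n)

# ASSUMPTION: hypothetical observations, made up for illustration.
sample = [12, 15, 9, 20, 14, 11, 17, 13, 15]

print("N =", len(sample))
print("Y-bar =", statistics.mean(sample))
print("s_Y =", round(statistics.stdev(sample), 3))
print("estimated standard error =", round(estimated_standard_error(sample), 3))
```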

95% Confidence Interval Example

• Suppose a sample of 100 students with mean SAT score of 1020, standard deviation of 200

• How do we find the 95% Confidence Interval?

• If N is large, we know that:

• 1. The sampling distribution is roughly normal

• 2. Therefore 95% of samples will yield a mean estimate within 2 standard deviations (of the sampling distribution) of the population mean (μ)

• Thus, 95% of the time, our estimates of μ (Y-bar) are within two “standard errors” of the actual value of μ.

95% Confidence Interval

• Formula for 95% confidence interval:

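$$95\%\ \text{CI}: \bar{Y} \pm 2(\sigma_{\bar{Y}})$$

• Where Y-bar is the mean estimate and sigma (Y-bar) is the standard error

• Result: Two values – an upper and lower bound

• Adding our estimate of the standard error:

$$\bar{Y} \pm 2(\hat{\sigma}_{\bar{Y}}) = \bar{Y} \pm 2\left(\frac{s_Y}{\sqrt{N}}\right)$$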

95% Confidence Interval

• Suppose a sample of 100 students with mean SAT score of 1020, standard deviation of 200

• Calculate:

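$$95\%\ \text{CI}: \bar{Y} \pm 2\left(\frac{s_Y}{\sqrt{N}}\right) = 1020 \pm 2\left(\frac{200}{\sqrt{100}}\right)$$

$$= 1020 \pm 2\left(\frac{200}{10}\right) = 1020 \pm 2(20) = 1020 \pm 40$$

• Thus, we are 95% confident that the population mean falls between 980 and 1060.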
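The same calculation as a short function, using the lecture's multiplier of 2 by default. The function and parameter names are introduced here for illustration; the optional `multiplier` argument is an added convenience for trying the slightly more exact value of 1.96.

```python
import math

def confidence_interval_95(y_bar, s_y, n, multiplier=2):
    """Return (lower, upper): Y-bar +/- multiplier * s_Y / sqrt(N)."""
    margin = multiplier * s_y / math.sqrt(n)
    return y_bar - margin, y_bar + margin

# SAT example: N = 100, Y-bar = 1020, s_Y = 200.
lower, upper = confidence_interval_95(y_bar=1020, s_y=200, n=100)
print(lower, upper)  # 980.0 1060.0

# Same sample with the more exact multiplier; the interval is slightly narrower.
print(confidence_interval_95(y_bar=1020, s_y=200, n=100, multiplier=1.96))
```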
