Economics 105: Statistics Review #1 due next Tuesday in class Go over GH 8 No GH’s due until next...


Economics 105: Statistics
• Review #1 due next Tuesday in class
• Go over GH 8
• No GH’s due until next Thur! GH 9 and 10 due next Thur.
• Do go to lab this week. It is due 2 weeks after your lab (so you’ll have 2 labs due that week, assuming you don’t complete it ahead of time)

Sampling
• Start with the population, which is the set of all possible persons, firms, countries, etc. for the particular frame of reference
• For each research question, define the relevant population:
– What is the average income in the United States?
– What is the average height in Krakozhia?
– Who will win the presidential election in November?
– What is the average number of volunteer hours per student?
– What percent of left-handed people have blue eyes?
• A sample is the subset of the population selected for analysis
– Must be representative of the population to avoid biased estimates
• U.S. census taken every 10 years, according to the Constitution
– First one in 1790 (3.9 million residents; today 312 million)
– http://www.archives.gov/exhibits/charters/constitution_transcript.html
– http://www.ipums.org/
– http://usa.ipums.org/usa-action/variables/group

Simple Random Sampling
• Most straightforward way to achieve representativeness is simple random sampling, where each person has an equal, and independent, chance of being selected
• Also called i.i.d. sampling, for independent and identically distributed (since drawn from the same population)
• Say we want to know how many magazines a household currently purchases. Choose 1000 names from ________?
• Suppose it is a good idea. Now we contact them … if they’re not available, we just scratch them from our list. Or we go to the next name on the list until we find someone who is available. Any problems?
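A minimal sketch of simple random sampling in Python, assuming a hypothetical sampling frame of household IDs (the frame, its size, and the function name are illustrative and not from the lecture). Note that `random.sample` draws without replacement, so the draws are only approximately independent; for a sample that is small relative to the population the difference is negligible.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the frame, each equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical sampling frame: household IDs 1..50,000 (illustrative only).
households = list(range(1, 50_001))
sample = simple_random_sample(households, 1000, seed=105)
print(len(sample), len(set(sample)))
```

Scratching unavailable households, or moving to the next available name, is not part of this procedure: it replaces randomly chosen units with units that share the trait of being easy to reach.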

Systematic Sample
• Partition the population into n groups with k members each (k = N/n)
• Randomly choose one from the first group of k
• Take every kth item after that
• Faster and easier than simple random sample
• Telephone book, class roster, items from an assembly line, etc.
• Greater chance of selection bias if there’s a pattern in the population
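The systematic procedure can be sketched in a few lines of Python. The class roster and the function name below are hypothetical, and the sketch assumes the population is held in an ordered list.

```python
import random

def systematic_sample(population, n, seed=None):
    """Pick a random start in the first group of k items, then take every kth item."""
    k = len(population) // n          # group size, k = N/n
    rng = random.Random(seed)
    start = rng.randrange(k)          # random position within the first group
    return population[start::k][:n]   # trim in case N is not an exact multiple of n

# Hypothetical class roster of N = 200 students; we want n = 20, so k = 10.
roster = [f"student_{i:03d}" for i in range(200)]
picked = systematic_sample(roster, 20, seed=1)
print(picked[:3])
```

If the roster were ordered so that every 10th entry shared some trait (a pattern with period k), every selected student would share it, which is the selection-bias risk noted above.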

Stratified Random Sampling
• Hypothetical research question: What % of students will vote in the election?
• Only have time & money to survey 100 students.
• You do so, but get only 2 political science majors in your sample.
• Problems?
• Solution: Stratified Random Sampling
– If a subgroup, or stratum, of the population is particularly relevant to the research question, one may break the population down into strata and take a simple random sample from each stratum
– Each person can belong to only one stratum
– Ensures reasonable sample size of the subpopulation of interest or concern
– Can stratify on more than one characteristic, e.g., major and gender
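As a rough sketch of stratified random sampling, the code below groups a made-up student body by major and draws a simple random sample within each group. The majors, counts, and the equal allocation of 25 per stratum are assumptions chosen for illustration only.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, n_per_stratum, seed=None):
    """Take a simple random sample of fixed size within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)   # each unit lands in exactly one stratum
    return {name: rng.sample(members, min(n_per_stratum, len(members)))
            for name, members in strata.items()}

# Hypothetical student body: (id, major) pairs; majors and counts are made up.
students = ([(i, "political science") for i in range(40)]
            + [(i, "economics") for i in range(40, 400)]
            + [(i, "other") for i in range(400, 2000)])
by_major = stratified_sample(students, lambda s: s[1], 25, seed=7)
print({major: len(members) for major, members in by_major.items()})
```

Equal allocation guarantees 25 political science majors, but it over-represents small strata, so an estimate for the whole student body would need to weight each stratum by its population share.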

Cluster Sampling
• Hypothetical survey of rural families spread over a wide area
• Hypothetical survey of homeless individuals in a large city
• Problems?
– Accurate list of population members is hard to obtain
– In-person interviews too costly
– Mail surveys might lead to really high non-response
• Solution: Cluster Sampling
– Divide the population into geographically small units, or clusters
– For example, political wards or residential blocks for a city
– Then take a simple random sample of clusters
– Each person or household in a chosen cluster is then contacted (that is, a complete census of chosen clusters), or sometimes a simple random sample of units within chosen clusters is taken
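A small sketch of one-stage cluster sampling, assuming a hypothetical city already divided into residential blocks (the block names, household labels, and sizes are invented). Only a list of clusters is needed up front, not a list of every household.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters, then take a census of each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical city: 30 residential blocks with 12 households each.
blocks = {f"block_{b:02d}": [f"hh_{b:02d}_{h:02d}" for h in range(12)]
          for b in range(30)}
surveyed = cluster_sample(blocks, 5, seed=3)
print(sorted(surveyed), sum(len(v) for v in surveyed.values()))
```

For the two-stage variant mentioned above, the census of each chosen block would be replaced by a simple random sample of households within it.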

Cluster Sampling (figure)

Sources of Error from a Survey
• Sampling Errors
– come from having info on only a subset of the population
– statistical theory is used to quantify them
• Non-sampling Errors
– can occur even with a complete census of the population
– possible sources:
  • Population sampled is not the relevant one, or the list is incomplete (coverage error, sample selection bias)
  • Measurement error
    – Inaccurate or dishonest answers
    – Halo effect
    – Poor wording of questions
  • Non-response (to whole survey or some questions)
    – try to minimize at outset & check up on some answers

Sample Statistics
• Population parameter vs. sample statistic

Sample Statistics
• Denote an i.i.d. sample by $X_1, X_2, X_3, \ldots, X_n$
• What exactly is an $X_i$?
• Actual outcomes are $x_1, x_2, x_3, \ldots, x_n$
• How many samples could we take?
• How many samples do we actually take?
• A sample statistic is formed by taking some function of the random variables $X_1, X_2, X_3, \ldots, X_n$: $A = f(X_1, X_2, X_3, \ldots, X_n)$
• Examples
• The point estimate of the population parameter is a single number rather than a range
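Two standard sample statistics of this form are the sample mean and the sample variance. The definitions below are the usual textbook ones, offered as illustrations; the symbol $S^2$ is an assumed notation.

```latex
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
S^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2
```

Each is a function of the random variables $X_1, \ldots, X_n$ only; plugging in the observed outcomes $x_1, \ldots, x_n$ gives a single number, the point estimate.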

Sampling Distribution
• Sample statistic A is a random variable! Why?
• Thus, a sample statistic has a probability distribution, known as a sampling distribution
• Example: Let S = {0,1,2,3,4,5,6}
• Graph the sampling distribution of $\bar{X}$ for n = 2
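One way to work this example is to enumerate every possible sample. The sketch below assumes each value in S is equally likely, that the two draws are independent (with replacement), and that the statistic of interest is the sample mean; none of these details is spelled out on the slide.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

S = [0, 1, 2, 3, 4, 5, 6]
n = 2

# Enumerate all 7**2 = 49 equally likely ordered samples and tally the sample mean.
counts = Counter(Fraction(sum(draw), n) for draw in product(S, repeat=n))
total = len(S) ** n
sampling_dist = {mean: Fraction(c, total) for mean, c in sorted(counts.items())}

# Text "graph": one '#' per sample that produces each value of the mean.
for mean, prob in sampling_dist.items():
    print(f"{float(mean):>4}  {prob}  {'#' * counts[mean]}")
```

Under these assumptions the distribution is triangular: it peaks at 3, the population mean, and falls off symmetrically toward 0 and 6.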

Mean & Variance
• If $X_1, X_2, \ldots, X_n$ are a set of random variables from an i.i.d. sample of size n, what are the mean and variance of the sample average?
• Things we would like to know …
1. $E[\bar{X}]$
2. $\mathrm{Var}(\bar{X})$
3. What does the p.d.f. of $\bar{X}$ look like?
• What does the p.d.f. of $\bar{X}$ look like?
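The mean and variance of the sample average follow directly from the i.i.d. assumption. A short derivation, writing $\mu$ and $\sigma^2$ for the population mean and variance (notation assumed here):

```latex
\begin{align*}
E[\bar{X}] &= E\!\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right]
            = \frac{1}{n}\sum_{i=1}^{n} E[X_i]
            = \frac{1}{n}\, n\mu = \mu \\
\operatorname{Var}(\bar{X}) &= \frac{1}{n^2}\operatorname{Var}\!\left(\sum_{i=1}^{n} X_i\right)
            = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
            = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}
\end{align*}
```

The first line uses only linearity of expectation. The second line needs independence, so that the variance of the sum equals the sum of the variances. Neither result says anything about the shape of the distribution of $\bar{X}$.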

Central Limit Theorem
• Rough statement of CLT: “Sample means are eventually, approximately normally distributed.”
• Formal statement of CLT: Let $X_1, X_2, X_3, \ldots, X_n$, where $X_i$ is a random variable denoting the outcome of the ith observation, be an i.i.d. sample from ANY population distribution with mean $\mu$ and variance $\sigma^2$. Then as n becomes large, $\bar{X}$ is approximately distributed $N(\mu, \sigma^2/n)$.

• Graphically (page 236 in BLK, 10th edition, has a nice visual)
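A quick simulation can make the CLT concrete. The sketch below assumes an exponential population with rate 1 (mean 1, variance 1), chosen only because it is strongly skewed; the function names and the number of repetitions are arbitrary.

```python
import random
import statistics

def sample_means(draw, n, reps, seed=None):
    """Return `reps` sample means, each computed from n i.i.d. draws."""
    rng = random.Random(seed)
    return [statistics.fmean([draw(rng) for _ in range(n)]) for _ in range(reps)]

def draw_exponential(rng):
    """One draw from the assumed population: exponential, rate 1 (mean 1, variance 1)."""
    return rng.expovariate(1.0)

for n in (2, 30, 200):
    means = sample_means(draw_exponential, n, reps=5000, seed=105)
    z = [(m - 1.0) / (1.0 / n ** 0.5) for m in means]    # standardize with mu = 1, sigma = 1
    inside = sum(abs(v) <= 1.96 for v in z) / len(z)     # a normal puts about 0.95 here
    print(n, round(statistics.fmean(means), 3),
          round(statistics.variance(means), 4), round(inside, 3))
```

As n grows, the variance of the simulated means should shrink toward 1/n and the share of standardized means within ±1.96 should settle near 0.95, even though the population itself is far from normal.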

CLT!

Point and Interval Estimates
• A point estimate is a single number
• A confidence interval provides additional information about variability

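[Figure: a confidence interval drawn from the Lower Confidence Limit to the Upper Confidence Limit, with the Point Estimate between them; the distance between the limits is the width of the confidence interval]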

Point Estimates
We can estimate a population parameter … with a sample statistic (a point estimate)

              Population parameter    Sample statistic
Mean          $\mu$                   $\bar{X}$
Proportion    $\pi$                   $p$

Confidence Level, $(1 - \alpha)$
• Suppose confidence level = 95%
• Also written $(1 - \alpha) = 0.95$
• A relative frequency interpretation:
– In repeated samples, 95% of all the confidence intervals that can be constructed are expected to contain the unknown true parameter
• A specific interval either will contain or will not contain the true parameter
– No probability involved in a specific interval
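The relative frequency interpretation can be checked by simulation. This sketch assumes a normal population with known standard deviation and builds each interval as the sample mean plus or minus 1.96 standard errors; the population values (mean 50, standard deviation 10), the sample size, and the function name are all made up for illustration.

```python
import random
import statistics

def coverage(mu, sigma, n, reps, z=1.96, seed=None):
    """Fraction of repeated-sample intervals x_bar +/- z*sigma/sqrt(n) that contain mu."""
    rng = random.Random(seed)
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(reps):
        x_bar = statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        hits += (x_bar - half_width <= mu <= x_bar + half_width)
    return hits / reps

# Hypothetical population: normal with mean 50 and known sigma 10.
print(coverage(mu=50, sigma=10, n=25, reps=10_000, seed=105))
```

The printed share should land close to 0.95. Each individual interval in the loop either contains 50 or does not; the 95% describes the procedure across repeated samples, not any single interval.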

[Diagram: Confidence Intervals branch into Population Mean ($\sigma$ Known, $\sigma$ Unknown) and Population Proportion]

Confidence Intervals for $\mu$
• First, assume $\sigma^2$ is known & $X \sim N(\mu, \sigma^2)$, so $\bar{X} \sim N(\mu, \sigma^2/n)$
• Things are different when these are not true.
• Random sample of n observations
• We will use $\bar{X}$ to make inferences about $\mu$

Confidence Interval for $\mu$ ($\sigma$ Known)
• Assumptions
– Population standard deviation $\sigma$ is known
– Population is normally distributed
• Confidence interval estimate: $\bar{X} \pm Z \dfrac{\sigma}{\sqrt{n}}$
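A minimal sketch of the calculation, assuming the standard known-σ interval, sample mean ± Z·σ/√n, with Z = 1.96 for 95% confidence. The function name and the numbers in the example are invented.

```python
import math

def mean_ci_known_sigma(x_bar, sigma, n, z=1.96):
    """Return (lower, upper) for x_bar +/- z * sigma / sqrt(n)."""
    half_width = z * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Made-up numbers: sample mean 50 from n = 25 observations, known sigma = 10.
lower, upper = mean_ci_known_sigma(x_bar=50.0, sigma=10.0, n=25)
print(round(lower, 2), round(upper, 2))
```

Here the standard error is 10/√25 = 2, so the half-width is 1.96 × 2 = 3.92 and the interval runs from about 46.08 to 53.92. Quadrupling n would halve the width.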

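Finding the Critical Value, Z
• Consider a 95% confidence interval:
[Figure: normal curve with critical values Z = -1.96 and Z = 1.96. In Z units the axis runs from -1.96 through 0 to 1.96; in X units it runs from the Lower Confidence Limit through the Point Estimate to the Upper Confidence Limit.]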
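The value 1.96 is the point that leaves 2.5% in each tail of the standard normal distribution. As a sketch, Python's standard library can recover it for any confidence level; the helper name `critical_z` is hypothetical.

```python
from statistics import NormalDist

def critical_z(confidence):
    """Two-sided critical value: the point leaving alpha/2 in each tail of N(0, 1)."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(level, round(critical_z(level), 3))
```

For the 95% level this gives approximately 1.96; higher confidence levels give larger critical values and therefore wider intervals.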