
Stat472/572 Sampling: Theory and Practice

Instructor: Yan Lu

Chapter 3: Stratified Sampling

Example: A population contains 1000 males and 100 females.

• Take an SRS of size 55 from the population. It is possible to obtain a sample with no females at all.

-- Most people would not consider such a sample to be representative of the population, since men and women might respond differently on the item of interest

• With a stratified sample, we can take 50 males and 5 females

-- a sample with no or few females cannot be selected, so we are protected from the possibility of obtaining a really bad sample

-- stratification increases the precision of the estimators

Stratified Sampling

• Divide population into H subpopulations, called strata. The

strata do not overlap and they constitute the whole population

• Each sampling unit belongs to exactly one stratum

• Draw an independent probability sample from each stratum

• Pool the information to obtain overall population estimates

Figure 1: Stratification

Example 3.2: Agriculture survey (Refer to Example 2.5)

• In Example 2.5, we generated a random sample. But some areas were

overrepresented, and others not represented at all

• part of the large variability arises because counties in the western United

States are larger, and thus tend to have larger values of y, than counties in

the eastern United States

• Taking a stratified sample can provide some balance in the sample on the

stratifying variable

• We use the four census regions of the United States: Northeast (NE), North

Central (NC), South (S), and West (W) strata, and sample about 10% of the

counties in each stratum.

Figure 2: Boxplot of data from example 3.2. The thick line for each region is the median of

the sample data from that region; the other horizontal lines in the boxes are the 25th and 75th

percentiles. The Northeast region has a relatively small median and small variance; the West

region, however, has a much higher median and variance. The distribution of farm acreage

appears to be positively skewed in each of the regions.

[Boxplot placeholder: x-axis "Region" with categories NC, NE, S, W; y-axis "Millions of Acres", ticks from 0.0 to 1.5]

Stratum          # of counties in stratum   # of counties in sample
Northeast                  220                        21
North Central             1054                       103
South                     1382                       135
West                       422                        41
Total                     3078                       300

Table 1: Summary statistics for each stratum

Region          Stratum size N_h   Sample size n_h   Sample average (acres)   Sample variance
Northeast              220                21                  97,629.8              7,647,472,708
North Central         1054               103                 300,504.2             29,618,183,543
South                 1382               135                 211,315.0             53,587,487,856
West                   422                41                 662,295.5            396,185,950,266

• We took an SRS in each stratum. For the Northeast region,

$$\hat t_1 = (220)(97{,}629.81) = 21{,}478{,}558.2$$

$$\hat V(\hat t_1) = (220)^2\left(1-\frac{21}{220}\right)\frac{7{,}647{,}472{,}708}{21} = 1.594316\times 10^{13}$$

Table 2: Estimates of the total number of farm acres and estimated variance of the total for each of the four strata

Region          Estimated total     Estimated variance of the total
Northeast        21,478,558.2       1.59432 × 10^13
North Central   316,731,379.4       2.88232 × 10^14
South           292,037,390.8       6.84076 × 10^14
West            279,488,706.1       1.55365 × 10^15
Total           909,736,034.4       2.5419 × 10^15
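The per-stratum arithmetic behind Table 2 can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the function name `stratum_total` and the dictionary layout are assumptions, and the inputs are the rounded summary statistics from Table 1, so the totals agree with Table 2 only up to rounding.

```python
import math

# Summary statistics from Table 1: (N_h, n_h, sample mean, sample variance).
# North Central uses N_h = 1054 and s_h^2 = 29,618,183,543, the values that
# agree with the county counts and with the totals reported in Table 2.
strata = {
    "Northeast":     (220, 21, 97_629.8, 7_647_472_708),
    "North Central": (1054, 103, 300_504.2, 29_618_183_543),
    "South":         (1382, 135, 211_315.0, 53_587_487_856),
    "West":          (422, 41, 662_295.5, 396_185_950_266),
}

def stratum_total(N_h, n_h, ybar_h, s2_h):
    """Return (estimated total, estimated variance of the total) for one stratum."""
    t_hat = N_h * ybar_h
    v_hat = N_h**2 * (1 - n_h / N_h) * s2_h / n_h  # includes the fpc (1 - n_h/N_h)
    return t_hat, v_hat

per_stratum = {name: stratum_total(*vals) for name, vals in strata.items()}

# Strata are sampled independently, so totals and variances simply add.
t_str = sum(t for t, _ in per_stratum.values())
v_str = sum(v for _, v in per_stratum.values())
se_str = math.sqrt(v_str)

for name, (t, v) in per_stratum.items():
    print(f"{name:14s} {t:16,.1f} {v:.5e}")
print(f"{'Total':14s} {t_str:16,.1f} {v_str:.5e}  SE = {se_str:,.0f}")
```

The standard error printed on the last line should be close to the 50,417,248 reported for the stratified design.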

Table 3: Comparison between SRS and stratified random sampling for agriculture data

Design           Sample size   Estimated total   SE
SRS                  300         916,927,110     58,169,381
Stratification       300         909,736,034     50,417,248

• Observations within many strata tend to be more homogeneous than observations in the

population as a whole. Reduction in variance in the individual strata often leads to a

reduced variance for the population estimate

• Ratio of the estimated variances:

$$\frac{\text{estimated variance from stratified sample, with } n = 300}{\text{estimated variance from SRS, with } n = 300} = \frac{2.5419\times 10^{15}}{3.3837\times 10^{15}} = 0.75$$

• If these were the population variances, we would expect that we would need only (300)(0.75) =

225 observations with a stratified sample to obtain the same precision as from an SRS of

300 observations.

Comments:

• Reduce variability by eliminating possible bad samples

• May want data of known precision for subgroups

• Lower cost, convenient

• Usually reduces variability when estimating quantities for the whole population

Theory of Stratified Sampling:

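Stratum             1      2      ...   H
Population size     N_1    N_2    ...   N_H     with $\sum_{h=1}^{H} N_h = N$
Sample size         n_1    n_2    ...   n_H     with $\sum_{h=1}^{H} n_h = n$
Population total    t_1    t_2    ...   t_H

• Take an SRS of size $n_h$ from stratum $h$

• Population total: $t = t_1 + t_2 + \cdots + t_H$

• Estimated total:

$$\hat t_{str} = \hat t_1 + \hat t_2 + \cdots + \hat t_H = N_1\bar y_1 + N_2\bar y_2 + \cdots + N_H\bar y_H$$

• Estimated variance of the total:

$$\hat V(\hat t_{str}) = \hat V(\hat t_1) + \hat V(\hat t_2) + \cdots + \hat V(\hat t_H) = \sum_{h=1}^{H}\left(1-\frac{n_h}{N_h}\right)N_h^2\,\frac{s_h^2}{n_h}$$

• Estimated mean:

$$\bar y_{str} = \frac{\hat t_{str}}{N} = \sum_{h=1}^{H}\frac{\hat t_h}{N} = \sum_{h=1}^{H}\frac{N_h}{N}\,\bar y_h$$

This is a weighted average of the stratum sample means.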

• Confidence intervals for stratified samples

-- If either (1) the sample sizes within each stratum are large

-- or (2) the sampling design has a large number of strata,

then, according to the central limit theorem (Krewski and Rao 1981), an approximate 100(1 − α)% confidence interval for the population mean $\bar y_U$ is

$$\bar y_{str} \pm z_{\alpha/2}\,SE(\bar y_{str})$$

Some survey software packages use the percentile of a t distribution with n − H degrees of freedom rather than the percentile of the normal distribution.
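As a small illustration of the normal-based interval, the sketch below applies the same "estimate ± z × SE" form to the stratified estimate of total farm acreage from Table 3. The helper name `normal_ci` is hypothetical, and only the normal percentile is used; the t-based variant with n − H degrees of freedom would need a t quantile from a statistics library.

```python
from statistics import NormalDist

def normal_ci(estimate, se, alpha=0.05):
    """Approximate 100(1 - alpha)% interval: estimate +/- z_{alpha/2} * SE."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * se, estimate + z * se

# Stratified estimate and SE of the total number of farm acres (Table 3).
t_hat, se_t = 909_736_034, 50_417_248
lo, hi = normal_ci(t_hat, se_t)
print(f"95% CI for the total: ({lo:,.0f}, {hi:,.0f})")
```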

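Population quantities

• $y_{hj}$: value of the jth unit in stratum h

• $t_h = \sum_{j=1}^{N_h} y_{hj}$: population total in stratum h

• $t = \sum_{h=1}^{H} t_h = \sum_{h=1}^{H}\sum_{j=1}^{N_h} y_{hj}$: population total

• $\bar y_U = \dfrac{t}{N} = \dfrac{1}{N}\sum_{h=1}^{H}\sum_{j=1}^{N_h} y_{hj}$: population mean

• $S_h^2 = \dfrac{\sum_{j=1}^{N_h}(y_{hj}-\bar y_{hU})^2}{N_h-1}$: population variance in stratum h, where $\bar y_{hU} = t_h/N_h$

Sample quantities

• $\hat t_h = \dfrac{N_h}{n_h}\sum_{j\in S_h} y_{hj} = N_h\bar y_h$

• $\hat t_{str} = \sum_{h=1}^{H}\hat t_h = \sum_{h=1}^{H}\dfrac{N_h}{n_h}\sum_{j\in S_h} y_{hj}$

• $\bar y_{str} = \dfrac{\hat t_{str}}{N}$

• $s_h^2 = \dfrac{\sum_{j\in S_h}(y_{hj}-\bar y_h)^2}{n_h-1}$: sample variance in stratum h, where $\bar y_h$ is the sample mean in stratum h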

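$$\hat t_{str} = \hat t_1 + \hat t_2 + \cdots + \hat t_H = N_1\bar y_1 + N_2\bar y_2 + \cdots + N_H\bar y_H$$

$$\hat V(\hat t_{str}) = \hat V(\hat t_1) + \hat V(\hat t_2) + \cdots + \hat V(\hat t_H) = \sum_{h=1}^{H}\left(1-\frac{n_h}{N_h}\right)N_h^2\,\frac{s_h^2}{n_h}$$

$$\bar y_{str} = \frac{\hat t_{str}}{N}$$

$$\hat V(\bar y_{str}) = \frac{1}{N^2}\hat V(\hat t_{str}) = \sum_{h=1}^{H}\left(1-\frac{n_h}{N_h}\right)\left(\frac{N_h}{N}\right)^2\frac{s_h^2}{n_h}$$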

Properties of the estimators:

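• $E[\hat t_{str}] = t$

• $E[\bar y_{str}] = \bar y_U$

• $\hat V(\hat t_{str})$ is an unbiased estimator of $V(\hat t_{str})$

• $\hat V(\bar y_{str})$ is an unbiased estimator of $V(\bar y_{str})$

The first two properties follow because each stratum is an SRS, so $E(\bar y_h) = \bar y_{hU}$:

$$E[\hat t_{str}] = E\left[\sum_{h=1}^{H} N_h\bar y_h\right] = \sum_{h=1}^{H} N_h E(\bar y_h) = \sum_{h=1}^{H} N_h\bar y_{hU} = \sum_{h=1}^{H} t_h = t$$

$$E[\bar y_{str}] = E\left[\frac{\hat t_{str}}{N}\right] = \frac{t}{N} = \bar y_U$$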

Stratified sampling for proportions

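Special case of the mean when

$$y_{hj} = \begin{cases} 1 & \text{if the unit has the characteristic} \\ 0 & \text{otherwise} \end{cases}$$

Then

$$\bar y_h = \hat p_h, \qquad s_h^2 = \frac{n_h}{n_h-1}\,\hat p_h(1-\hat p_h)$$

$$\hat p_{str} = \sum_{h=1}^{H}\frac{N_h}{N}\,\hat p_h$$

$$\hat V(\hat p_{str}) = \sum_{h=1}^{H}\left(1-\frac{n_h}{N_h}\right)\left(\frac{N_h}{N}\right)^2\frac{\hat p_h(1-\hat p_h)}{n_h-1}$$

$$\hat t_{str} = \sum_{h=1}^{H} N_h\hat p_h$$

$$\hat V(\hat t_{str}) = N^2\,\hat V(\hat p_{str})$$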

Example 3.4. The American Council of Learned Societies (ACLS)

used a stratified random sample of selected ACLS societies in

seven disciplines to study publication patterns and computer

and library use among scholars who belong to one of the mem-

ber organizations of the ACLS. The data is shown in the follow-

ing table.

Discipline          Membership N_h   # mailed   Valid returns n_h   Female members (%)
Literature               9100           915           636                 38
Classics                 1950           633           451                 27
Philosophy               5500           658           481                 18
History                 10850           855           611                 19
Linguistics              2100           667           493                 36
Political Science        5500           833           575                 13
Sociology                9000           824           588                 26
Totals                  44000          5385          3835

• Want to estimate the percentage and number of female members of the major societies in

those seven disciplines

• Ignoring the nonresponse, assume no duplicate memberships

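$$\hat p_{str} = \sum_{h=1}^{7}\frac{N_h}{N}\,\hat p_h = \frac{9100}{44000}\times .38 + \cdots + \frac{9000}{44000}\times .26 = .2465$$

$$SE(\hat p_{str}) = \sqrt{\sum_{h=1}^{7}\left(1-\frac{n_h}{N_h}\right)\left(\frac{N_h}{N}\right)^2\frac{\hat p_h(1-\hat p_h)}{n_h-1}} = .0071$$

The estimated total number of female members in the societies is

$$\hat t_{str} = 44000\times .2465 = 10847$$

$$SE(\hat t_{str}) = 44000\times .0071 = 312$$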
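The Example 3.4 calculation can be reproduced with a short script. This is a sketch under the same assumptions as the example (nonresponse ignored, no duplicate memberships); the variable names are illustrative, and because the script keeps full precision the standard error of the total comes out slightly above the 312 obtained from the rounded .0071.

```python
import math

# (discipline, membership N_h, valid returns n_h, proportion female among returns)
acls = [
    ("Literature",         9100, 636, 0.38),
    ("Classics",           1950, 451, 0.27),
    ("Philosophy",         5500, 481, 0.18),
    ("History",           10850, 611, 0.19),
    ("Linguistics",        2100, 493, 0.36),
    ("Political Science",  5500, 575, 0.13),
    ("Sociology",          9000, 588, 0.26),
]

N = sum(N_h for _, N_h, _, _ in acls)

# Stratified proportion: weighted average of the stratum proportions.
p_str = sum(N_h / N * p for _, N_h, _, p in acls)

# Estimated variance of p_str, with the fpc and (n_h - 1) in the denominator.
v_p = sum(
    (1 - n_h / N_h) * (N_h / N) ** 2 * p * (1 - p) / (n_h - 1)
    for _, N_h, n_h, p in acls
)
se_p = math.sqrt(v_p)

# Estimated number of female members and its standard error.
t_str = N * p_str
se_t = N * se_p

print(f"p_str = {p_str:.4f}, SE = {se_p:.4f}")
print(f"t_str = {t_str:.0f}, SE = {se_t:.0f}")
```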

Review: Stratified random sampling

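Stratum             1      2      ...   H
Population size     N_1    N_2    ...   N_H     with $\sum_{h=1}^{H} N_h = N$
Sample size         n_1    n_2    ...   n_H     with $\sum_{h=1}^{H} n_h = n$
Population total    t_1    t_2    ...   t_H

Population quantities

• $y_{hj}$: value of the jth unit in stratum h

• $t_h = \sum_{j=1}^{N_h} y_{hj}$

• $t = \sum_{h=1}^{H} t_h = \sum_{h=1}^{H}\sum_{j=1}^{N_h} y_{hj}$

• $\bar y_U = \dfrac{t}{N}$

• $S_h^2 = \dfrac{\sum_{j=1}^{N_h}(y_{hj}-\bar y_{hU})^2}{N_h-1}$

Sample quantities

• $y_{hj}$: same

• $\hat t_h = \dfrac{N_h}{n_h}\sum_{j\in S_h} y_{hj} = N_h\bar y_h$

• $\hat t_{str} = \sum_{h=1}^{H}\hat t_h = \sum_{h=1}^{H}\dfrac{N_h}{n_h}\sum_{j\in S_h} y_{hj}$

• $\bar y_{str} = \dfrac{\hat t_{str}}{N}$

• $s_h^2 = \dfrac{\sum_{j\in S_h}(y_{hj}-\bar y_h)^2}{n_h-1}$

Properties of the estimators:

• $E[\hat t_{str}] = t$

• $E[\bar y_{str}] = \bar y_U$

Confidence intervals for stratified samples

-- If either (1) the sample sizes within each stratum are large

-- or (2) the sampling design has a large number of strata,

then, according to the central limit theorem (Krewski and Rao 1981), an approximate 100(1 − α)% confidence interval for the population mean $\bar y_U$ is

$$\bar y_{str} \pm z_{\alpha/2}\,SE(\bar y_{str})$$

Some survey software packages use the percentile of a t distribution with n − H degrees of freedom rather than the percentile of the normal distribution.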

Using Weights

Sampling weights: the number of units in the population represented by each sample

member (h, j), h: stratum, j: elements.

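$$\hat t_{str} = \sum_{h=1}^{H}\sum_{j\in S_h} w_{hj}\,y_{hj}, \qquad \text{where } w_{hj} = \frac{N_h}{n_h}$$

$$\bar y_{str} = \frac{\sum_{h=1}^{H}\sum_{j\in S_h} w_{hj}\,y_{hj}}{\sum_{h=1}^{H}\sum_{j\in S_h} w_{hj}}$$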

Example: Suppose a population has 2000 units, 1600 of them

are males (stratum 1), and 400 are females (stratum 2). If the

sample has 400 units, 200 units from each stratum, then,

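$$\pi_{1j} = \frac{200}{1600} = \frac{1}{8} \quad\text{and}\quad w_{1j} = \frac{1}{\pi_{1j}} = 8$$

$$\pi_{2j} = \frac{200}{400} = \frac{1}{2} \quad\text{and}\quad w_{2j} = \frac{1}{\pi_{2j}} = 2$$

• each man in the sample represents 8 men in the population

• each woman in the sample represents 2 women in the population

• $\pi_{hj} = n_h/N_h$

• $w_{hj} = N_h/n_h$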

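• $\hat t_{str} = \sum_{h=1}^{H}\hat t_h = \sum_{h=1}^{H} N_h\bar y_h = \sum_{h=1}^{H}\sum_{j\in S_h}\dfrac{N_h}{n_h}\,y_{hj} = \sum_{h=1}^{H}\sum_{j\in S_h} w_{hj}\,y_{hj}$

• $\hat V(\hat t_{str}) = \sum_{h=1}^{H}\hat V(\hat t_h) = \sum_{h=1}^{H}\left(1-\dfrac{n_h}{N_h}\right)N_h^2\,\dfrac{s_h^2}{n_h}$

• $\bar y_{str} = \hat t_{str}/N = \sum_{h=1}^{H}\dfrac{N_h}{N}\,\bar y_h = \dfrac{\sum_{h=1}^{H}\sum_{j\in S_h} w_{hj}\,y_{hj}}{\sum_{h=1}^{H}\sum_{j\in S_h} w_{hj}}$

• $\hat V(\bar y_{str}) = \hat V(\hat t_{str})/N^2 = \sum_{h=1}^{H}\left(1-\dfrac{n_h}{N_h}\right)\left(\dfrac{N_h}{N}\right)^2\dfrac{s_h^2}{n_h}$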

Comments:

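• Let $\pi_{hj}$ be the probability of selecting unit j from stratum h. Then $w_{hj} = 1/\pi_{hj} = N_h/n_h$

• $\sum_{h=1}^{H}\sum_{j\in S_h} w_{hj} = \sum_{h=1}^{H}\sum_{j\in S_h}\dfrac{N_h}{n_h} = \sum_{h=1}^{H} N_h = N$

-- The whole sample represents the entire population, and the sum of the weights is equal to the population size

• $\hat t_{str} = \sum_{h=1}^{H}\sum_{j\in S_h} w_{hj}\,y_{hj}$

• $\bar y_{str} = \sum_{h=1}^{H}\sum_{j\in S_h} w_{hj}\,y_{hj} \Big/ \sum_{h=1}^{H}\sum_{j\in S_h} w_{hj}$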
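The weighted form of the estimators can be checked numerically. The sketch below uses the stratum and sample sizes from the 1600 men / 400 women example; the response values are made up with a seeded random generator, and the layout of one `(weight, value)` pair per sampled unit is an illustrative choice rather than anything prescribed by the notes.

```python
import random

rng = random.Random(0)

N_h = {"men": 1600, "women": 400}   # stratum sizes from the example
n_h = {"men": 200, "women": 200}    # sample sizes from the example

# Hypothetical responses y_hj; any numeric values would do here.
sample = {h: [rng.gauss(50, 10) for _ in range(n_h[h])] for h in N_h}

# One (weight, value) record per sampled unit, with w_hj = N_h / n_h.
records = [(N_h[h] / n_h[h], y) for h in N_h for y in sample[h]]

sum_w = sum(w for w, _ in records)        # should equal N = 2000
t_str = sum(w * y for w, y in records)    # weighted estimate of the total
ybar_str = t_str / sum_w                  # weighted estimate of the mean

# The same mean written as a weighted average of stratum sample means.
N = sum(N_h.values())
ybar_check = sum(N_h[h] / N * sum(sample[h]) / n_h[h] for h in N_h)

print(sum_w, round(ybar_str, 4), round(ybar_check, 4))
```

Storing a weight with each record is what lets survey software compute totals and means without knowing the stratum sizes separately; variance estimation still needs the stratum membership.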

Back to the previous example. Suppose a population has 2000

units, 1600 of them are males (stratum 1), and 400 are females

(stratum 2). If we randomly select 160 males from stratum 1

and 40 women from stratum 2,

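$$\pi_{1j} = \frac{160}{1600} = \frac{1}{10} \quad\text{and}\quad w_{1j} = \frac{1}{\pi_{1j}} = 10$$

$$\pi_{2j} = \frac{40}{400} = \frac{1}{10} \quad\text{and}\quad w_{2j} = \frac{1}{\pi_{2j}} = 10$$

The number of sampled units in each stratum is proportional to the size of the stratum. We call this allocation method proportional allocation.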

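Proportional Allocation: the number of sampled units in each stratum is proportional to the size of the stratum

$$\frac{n_h}{N_h} = \frac{n}{N}, \qquad n_h = N_h\,\frac{n}{N}$$

$$\pi_{hj} = \frac{n_h}{N_h} = \frac{n}{N} \quad\text{and}\quad w_{hj} = \frac{N}{n}$$

Sample is self-weighting: every sampled unit has the same weight, so

$$\bar y_{str} = \sum_{h=1}^{H}\frac{N_h}{N}\cdot\frac{1}{n_h}\sum_{j\in S_h} y_{hj} = \frac{1}{n}\sum_{h=1}^{H}\sum_{j\in S_h} y_{hj}$$

Variances:

$$V_{prop}(\bar y_{str}) = \left(1-\frac{n}{N}\right)\sum_{h=1}^{H}\frac{N_h}{N}\,\frac{S_h^2}{n}$$

$$V_{prop}(\hat t_{str}) = \left(1-\frac{n}{N}\right)\frac{N}{n}\sum_{h=1}^{H} N_h S_h^2$$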
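A proportional allocation is straightforward to compute. The sketch below, with a hypothetical helper `proportional_allocation`, reproduces the 160/40 split of the example and shows that the weights are then identical across strata. It also applies the rule to the four census regions with n = 300; simple rounding happens to give the sample sizes listed for the agriculture survey, although in general rounded allocations need not sum to n.

```python
def proportional_allocation(N_h, n):
    """Return n_h = n * N_h / N for each stratum (not rounded)."""
    N = sum(N_h)
    return [n * N_k / N for N_k in N_h]

# Example from the notes: 1600 men, 400 women, total sample of 200.
sizes = [1600, 400]
alloc = proportional_allocation(sizes, 200)
weights = [N_k / n_k for N_k, n_k in zip(sizes, alloc)]
print(alloc, weights)

# Agriculture survey strata (NE, NC, S, W) with n = 300.
counties = [220, 1054, 1382, 422]
agri = [round(x) for x in proportional_allocation(counties, 300)]
print(agri, sum(agri))
```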

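ANOVA Table

Source                  df       Sum of Squares
Between strata          H − 1    $SSB = \sum_{h=1}^{H}\sum_{j=1}^{N_h}(\bar y_{hU}-\bar y_U)^2 = \sum_{h=1}^{H} N_h(\bar y_{hU}-\bar y_U)^2$
Within strata           N − H    $SSW = \sum_{h=1}^{H}\sum_{j=1}^{N_h}(y_{hj}-\bar y_{hU})^2 = \sum_{h=1}^{H}(N_h-1)S_h^2$
Total (corrected)       N − 1    $SSTO = \sum_{h=1}^{H}\sum_{j=1}^{N_h}(y_{hj}-\bar y_U)^2 = (N-1)S^2$

SSTO = SSB + SSW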

Comparison between SRS and proportional allocation

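Under proportional allocation, using $\sum_h N_h S_h^2 = \sum_h (N_h-1)S_h^2 + \sum_h S_h^2$:

$$V_{prop}(\hat t_{str}) = \sum_{h=1}^{H}\left(1-\frac{n_h}{N_h}\right)N_h^2\,\frac{S_h^2}{n_h} = \left(1-\frac{n}{N}\right)\frac{N}{n}\sum_{h=1}^{H} N_h S_h^2 = \left(1-\frac{n}{N}\right)\frac{N}{n}\left[SSW + \sum_{h=1}^{H} S_h^2\right]$$

For an SRS of the same size:

$$V(\hat t_{srs}) = \left(1-\frac{n}{N}\right)N^2\,\frac{S^2}{n} = \left(1-\frac{n}{N}\right)\frac{N^2}{n}\cdot\frac{1}{N-1}\,(SSW + SSB) \approx \left(1-\frac{n}{N}\right)\frac{N}{n}\,(SSW + SSB)$$

Proportional stratification is more efficient if

$$\sum_{h=1}^{H} S_h^2 < SSB, \qquad \text{where } SSB = \sum_{h=1}^{H} N_h(\bar y_{hU}-\bar y_U)^2.$$

This is usually true, since the large population sizes of the strata will force $N_h(\bar y_{hU}-\bar y_U)^2 > S_h^2$.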
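The decomposition and the comparison can be verified on a toy population. Everything in the sketch below is hypothetical: the two strata, their values, and the sample size n = 5 are chosen only so that the arithmetic is easy to follow by hand.

```python
from statistics import mean, variance

# Hypothetical population: two strata with clearly different means.
population = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "B": [11.0, 12.0, 13.0, 14.0],
}
n = 5  # total sample size; proportional allocation gives n_A = 3, n_B = 2

all_y = [y for ys in population.values() for y in ys]
N = len(all_y)
ybar_U = mean(all_y)

# ANOVA sums of squares computed from the full population.
SSB = sum(len(ys) * (mean(ys) - ybar_U) ** 2 for ys in population.values())
SSW = sum((y - mean(ys)) ** 2 for ys in population.values() for y in ys)
SSTO = sum((y - ybar_U) ** 2 for y in all_y)

# Variance of the estimated total under SRS and under proportional allocation.
S2 = variance(all_y)  # divisor N - 1, so S2 = SSTO / (N - 1)
V_srs = (1 - n / N) * N**2 * S2 / n
V_prop = (1 - n / N) * (N / n) * sum(len(ys) * variance(ys) for ys in population.values())

sum_S2h = sum(variance(ys) for ys in population.values())
print(f"SSB={SSB:.1f} SSW={SSW:.1f} SSTO={SSTO:.1f}")
print(f"V_srs={V_srs:.2f} V_prop={V_prop:.2f} sum S_h^2={sum_S2h:.3f}")
```

Here the stratum means are far apart relative to the within-stratum spread, so SSB dominates and the stratified design has a much smaller variance.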

Comments

• In general, the variance of the estimator of t from a stratified sample with proportional

allocation will be smaller than the variance of the estimator of t from SRS with the same

number of observations

• The more unequal the stratum means yhU , the more homogeneous the within stratum

units, the more precision you will gain by using proportional allocation.

Optimal Allocation

Example: We want to take a sample of American corporations to estimate the amount of trade with Europe

• The variation among large corporations would be greater than the variation among small corporations

-- often, large units are more variable than small units

• Need to sample a higher percentage of the large corporations

• Proportional allocation won't work well in this situation

-- Proportional allocation uses the same sampling fraction within each stratum

-- If the variances $S_h^2$ are similar, proportional allocation is a good choice

-- If the variances $S_h^2$ vary substantially, we may want to take more observations from the strata with larger variances

Cost function

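$$c = c_0 + \sum_{h=1}^{H} c_h n_h$$

where $c_0$ is the overhead cost, such as maintaining an office, and $c_h$ is the cost of sampling an observation in stratum h

• Want to minimize $V(\hat t_{str})$ for a fixed cost c, or minimize c for a fixed $V(\hat t_{str})$

$$V(\hat t_{str}) = \sum_{h=1}^{H}\left(1-\frac{n_h}{N_h}\right)N_h^2\,\frac{S_h^2}{n_h} = \sum_{h=1}^{H}\frac{N_h^2 S_h^2}{n_h} - \sum_{h=1}^{H} N_h S_h^2$$

-- Same as minimizing $\sum_{h=1}^{H} N_h^2 S_h^2/n_h$, since the second sum does not depend on the $n_h$

With a Lagrange multiplier λ for the cost constraint:

$$f = \sum_{h=1}^{H}\frac{N_h^2 S_h^2}{n_h} + \lambda\left(c_0 + \sum_{h=1}^{H} c_h n_h - c\right)$$

$$\frac{\partial f}{\partial n_h} = -\frac{N_h^2 S_h^2}{n_h^2} + \lambda c_h = 0$$

$$n_h = \frac{N_h S_h}{\sqrt{\lambda}\,\sqrt{c_h}}$$

By the fact that $\sum_h n_h = n$ we have

$$\frac{1}{\sqrt{\lambda}} = \frac{n}{\sum_{l=1}^{H} N_l S_l/\sqrt{c_l}}$$

$$n_{h,opt} = n\times\left(\frac{N_h S_h/\sqrt{c_h}}{\sum_{l=1}^{H} N_l S_l/\sqrt{c_l}}\right)$$

$$n_{h,opt} \propto \frac{N_h S_h}{\sqrt{c_h}}$$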

We take a larger sample from stratum h if

• The stratum size Nh is large

• The variance within the stratum Sh is large

• The sampling within the stratum ch is inexpensive

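$$n_{h,opt} = n\times\left(\frac{N_h S_h/\sqrt{c_h}}{\sum_{l=1}^{H} N_l S_l/\sqrt{c_l}}\right)$$

Neyman allocation: the $c_h$'s are all equal

$$n_{h,Neyman} = n\times\left(\frac{N_h S_h}{\sum_{l=1}^{H} N_l S_l}\right)$$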
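The optimal and Neyman allocation rules can be sketched as a single function. The function name, the two-stratum inputs, and the cost values below are all hypothetical; the sketch returns unrounded sizes and does not guard against an allocation exceeding the stratum size, which a real design would have to handle.

```python
import math

def optimal_allocation(n, N_h, S_h, c_h=None):
    """Allocate n across strata in proportion to N_h * S_h / sqrt(c_h).

    With c_h omitted (equal costs) this reduces to Neyman allocation.
    """
    if c_h is None:
        c_h = [1.0] * len(N_h)
    scores = [N * S / math.sqrt(c) for N, S, c in zip(N_h, S_h, c_h)]
    total = sum(scores)
    return [n * s / total for s in scores]

# Hypothetical strata: a large homogeneous one and a small variable one.
N_h = [1000, 500]
S_h = [10.0, 40.0]

neyman = optimal_allocation(100, N_h, S_h)
# Sampling in the second stratum is four times as expensive per unit.
with_costs = optimal_allocation(100, N_h, S_h, c_h=[1.0, 4.0])

print([round(x, 2) for x in neyman])      # more units go to the variable stratum
print([round(x, 2) for x in with_costs])  # higher cost pulls units back out
```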

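Let $a = \dfrac{n}{\sum_{l=1}^{H} N_l S_l}$

Recall

$$n_{h,Neyman} = n\times\left(\frac{N_h S_h}{\sum_{l=1}^{H} N_l S_l}\right)$$

so that $n_{h,Neyman} = a\,N_h S_h$. Then

$$V(\hat t_{str,Neyman}) = \sum_{h=1}^{H}\left(1-\frac{n_h}{N_h}\right)N_h^2\,\frac{S_h^2}{n_h} = \sum_{h=1}^{H}\left(1-\frac{aN_hS_h}{N_h}\right)\frac{N_h^2 S_h^2}{aN_hS_h} = \frac{1}{a}\sum_{h=1}^{H}(1-aS_h)\,N_hS_h$$

$$= \sum_{h=1}^{H}\left(1-\frac{nS_h}{\sum_{l=1}^{H} N_lS_l}\right)\frac{\sum_{l=1}^{H} N_lS_l}{n}\,N_hS_h = \frac{1}{n}\left(\sum_{h=1}^{H} N_hS_h\right)^2 - \sum_{h=1}^{H} N_hS_h^2$$

$$V(\hat t_{str,Prop}) = \sum_{h=1}^{H}\left(1-\frac{n_h}{N_h}\right)N_h^2\,\frac{S_h^2}{n_h} = \left(1-\frac{n}{N}\right)\frac{N}{n}\sum_{h=1}^{H} N_hS_h^2 = \frac{N}{n}\sum_{h=1}^{H} N_hS_h^2 - \sum_{h=1}^{H} N_hS_h^2$$

To compare the two, expand

$$\left(\sum_{l=1}^{H} N_lS_l\right)^2 = \sum_{h=1}^{H} N_h^2S_h^2 + 2\sum_{i=1}^{H}\sum_{j>i} N_iN_jS_iS_j$$

$$N\sum_{h=1}^{H} N_hS_h^2 = \sum_{h=1}^{H} N_h^2S_h^2 + \sum_{i=1}^{H}\sum_{j>i} N_iN_j\,(S_i^2 + S_j^2)$$

Since $S_i^2 + S_j^2 \ge 2S_iS_j$,

$$V(\hat t_{str,Neyman}) \le V(\hat t_{str,Prop})$$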
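A quick numeric check of this inequality, using the same kind of hypothetical two-stratum population (the sizes, standard deviations, and n = 100 are illustrative assumptions), compares the general variance formula with the two closed forms derived above.

```python
def var_total(N_h, S_h, n_h):
    """V(t_str) = sum over strata of (1 - n_h/N_h) * N_h^2 * S_h^2 / n_h."""
    return sum(N**2 * (1 - m / N) * S**2 / m for N, S, m in zip(N_h, S_h, n_h))

# Hypothetical strata and total sample size.
N_h = [1000, 500]
S_h = [10.0, 40.0]
n = 100
N = sum(N_h)

sum_NS = sum(Nk * Sk for Nk, Sk in zip(N_h, S_h))
sum_NS2 = sum(Nk * Sk**2 for Nk, Sk in zip(N_h, S_h))

prop_alloc = [n * Nk / N for Nk in N_h]
neyman_alloc = [n * Nk * Sk / sum_NS for Nk, Sk in zip(N_h, S_h)]

v_prop = var_total(N_h, S_h, prop_alloc)
v_neyman = var_total(N_h, S_h, neyman_alloc)

# Closed forms from the derivation.
v_prop_closed = (N / n) * sum_NS2 - sum_NS2
v_neyman_closed = sum_NS**2 / n - sum_NS2

print(f"Neyman: {v_neyman:,.0f}   proportional: {v_prop:,.0f}")
```

The gap between the two grows as the stratum standard deviations become more unequal; with equal $S_h$ the two allocations coincide.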

Relative precision of stratification and SRS

$$V(\hat t_{str,Neyman}) \le V(\hat t_{str,Prop}) \le V_{srs}(\hat t)$$

Example 3.9: Dollar stratification is often used in accounting. The recorded book amounts are used to stratify the population. When auditing the loan amounts for a financial institution:

• stratum 1 might consist of all loans of more than $1 million; the within-stratum variance will be much larger in this stratum, so it needs a higher sampling fraction

• stratum 2 might consist of loans between $500,000 and $999,999

• and so on, down to the smallest stratum of loans less than $10,000

• Optimal allocation is often an efficient strategy for such a stratification

-- If the goal of the audit is to estimate the dollar discrepancy between the audited amounts and the amounts in the institution's books, an error in the recorded amount of one of the $3,000,000 loans is likely to contribute more to the audited difference than an error in the recorded amount of one of the $3,000 loans. In a survey such as this, you may even want to use sample size N_1 in stratum 1, that is, audit every loan in that stratum.

Some design issues of stratified random sampling

• Allocating observations to strata

-- Proportional allocation: $\dfrac{n_h}{n} = \dfrac{N_h}{N}$

-- Optimal allocation. Neyman allocation (the $c_h$'s are all equal):

$$n_{h,Neyman} = n\,\frac{N_hS_h}{\sum_{l=1}^{H} N_lS_l}$$

• Sample size

• Defining strata: variables and number of strata

Determining sample size

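$$V(\hat t_{str}) = \sum_{h=1}^{H}\left(1-\frac{n_h}{N_h}\right)N_h^2\,\frac{S_h^2}{n_h} \le \sum_{h=1}^{H} N_h^2\,\frac{S_h^2}{n_h} = \frac{1}{n}\sum_{h=1}^{H}\frac{n}{n_h}\,N_h^2S_h^2 = \frac{v}{n}, \qquad v = \sum_{h=1}^{H}\frac{n}{n_h}\,N_h^2S_h^2$$

• v depends on the stratum sizes $N_h$, the variances $S_h^2$, and the relative sample sizes $n_h/n$

• v can be thought of as the "average" variability per observation unit in a stratified random sample with the specified allocation

95% CI (α = 0.05): $\hat t_{str} \pm z_{\alpha/2}\sqrt{v/n}$

Setting the margin of error $z_{\alpha/2}\sqrt{v/n} = e$ gives

$$n = \frac{z_{\alpha/2}^2\,v}{e^2}$$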
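The sample-size rule can be turned into a small calculator. This is a sketch under stated assumptions: the helper name, the two strata, the proportional relative allocation, and the target margin of error e = 5000 are all hypothetical, and because the bound ignores the finite population corrections the resulting n is conservative.

```python
import math
from statistics import NormalDist

def required_sample_size(N_h, S_h, rel_alloc, e, alpha=0.05):
    """Return (n, v) with n = ceil(z^2 * v / e^2) and v = sum (n/n_h) N_h^2 S_h^2.

    rel_alloc holds the relative sample sizes n_h / n and should sum to 1.
    """
    v = sum(N**2 * S**2 / a for N, S, a in zip(N_h, S_h, rel_alloc))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil(z**2 * v / e**2), v

# Hypothetical design: proportional allocation across two strata.
N_h = [1000, 500]
S_h = [10.0, 40.0]
rel_alloc = [2 / 3, 1 / 3]

n_req, v = required_sample_size(N_h, S_h, rel_alloc, e=5000)
print(f"v = {v:.3e}, required n = {n_req}")
```

Once n is fixed, the stratum sample sizes follow from the chosen relative allocation, $n_h = n \times (n_h/n)$.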

Defining Strata:

1. Variables for stratification

• Highly associated with variables of interest

—–For estimating total business expenditures on advertising,

we might stratify by number of employees or size of the busi-

ness and by the type of product or service

—–For farm income, we might use the size of the farm as a

stratifying variable, since we expect that larger farms would

have higher incomes

• Known for all sampling units in the population

2. Number of strata:

• Depends upon many factors such as the difficulty in construct-

ing a sampling frame with stratifying information, and the cost

of stratifying

• Formulas in literature

• Pilot study

• General rule: the more information you have about the pop-

ulation, the more strata you should use. You should use an

SRS when little prior information about the target population is

available.

Recall: Relative precision of stratification and SRS

$$V(\hat t_{str,opt}) \le V(\hat t_{str,prop}) \le V_{srs}(\hat t)$$

1. Stratified sampling provides higher precision than SRS, so why conduct an SRS?

• Stratification adds complexity to the survey, which may not be worth a small gain in precision

• Stratification requires knowing which units, and how many units, belong to each stratum

2. When is stratified sampling efficient?

• SSB is large (strata means differ greatly)

• SSW is small (variability within stratum is small)

Example: National Pesticide Survey (NPS)

US Environmental Protection Agency (EPA) sampled drinking wells to esti-

mate the prevalence of pesticides and nitrate between 1988 and 1990.

• Want a sample that was representative of drinking water wells in the United

States

• Want to guarantee that wells in the sample would have a wide range of

levels of pesticide use and susceptibility to ground-water pollution

• Want to study two categories of wells: (1)Community water systems (CWS)

—systems of piped drinking water with at least 15 connections and/or 25 or

more permanent residents with at least one working well

and (2) rural domestic wells

—supplying occupied housing in rural areas, not on government property

1. Frame issue: how many drinking wells exist in the United States?

• For CWS, list with addresses is in the Federal Reporting Data

System (FRDS), maintained by EPA, There are approximately

51,000 CWSs.

• The 1980 census data is used to estimate number of rural do-

mestic wells. There are about 13 million rural domestic wells.

2. Stratification issue: EPA chose a stratified design. Which variables are used to construct the strata?

• EPA developed criteria for separating the population of CWS wells and

rural domestic wells into four categories of pesticide use and three relative

ground-water vulnerability measures. This design ensures that the range of

variability that exists nationally with respect to the agricultural use of pesti-

cides and ground-water vulnerability is reflected in the sample of wells.

• Pesticide use obtained from

—marketing research

—proportion of county in agricultural use

• Ground-water vulnerability measures (by DRASTIC)

• Four categories of pesticide use (high, moderate, low, uncommon) crossed with three categories of groundwater vulnerability (high, moderate, low) give 12 strata

Table 4: Strata for National Pesticide Survey

Stratum   Pesticide use   Groundwater vulnerability    Number of counties
                          (estimated by DRASTIC)
   1      high            high                               106
   2      high            moderate                           234
   3      high            low                                129
   4      moderate        high                               110
   5      moderate        moderate                           204
   6      moderate        low                                267
   7      low             high                               193
   8      low             moderate                           375
   9      low             low                                404
  10      uncommon        high                               186
  11      uncommon        moderate                           513
  12      uncommon        low                                416

3. Design considerations

—For CWS, assume 0.5% of wells contain pesticides; choose

n so that the probability of detection is 90%.

—For rural wells, there were some subgroups of particular in-

terest; assume a 1% rate and 97% probability of detection.

—n = 564 public, 734 private Rural wells

4. Rural wells

—-Each county (N = 3137) categorized according to the strati-

fication variables.

—-Sample counties;

—-Characterize pesticide use and groundwater vulnerability for

subcounty areas.

—-No subcounty areas selection for CWS wells

Model-based inference for stratified sampling

• The one-way ANOVA model with fixed effects provides an un-

derlying structure for stratified sampling.

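$$Y_{hj} = \mu_h + \varepsilon_{hj} \qquad (1)$$

where the $\varepsilon_{hj}$ are independent with mean 0 and variance $\sigma_h^2$.

• The least squares estimator of $\mu_h$ is $\bar y_h$, the average in stratum h

Estimators and Properties:

• $T_h = \sum_{j=1}^{N_h} Y_{hj}$: the total in stratum h

• $T = \sum_{h=1}^{H} T_h$: the overall total

• Note that both $T_h$ and $T$ are random variables

• The best linear unbiased estimator for $T_h$ is $\hat T_h = \dfrac{N_h}{n_h}\sum_{j\in S_h} Y_{hj}$

• $E_M[\hat T_h - T_h] = 0$

• $E_M[(\hat T_h - T_h)^2] = N_h^2\left(1-\dfrac{n_h}{N_h}\right)\dfrac{\sigma_h^2}{n_h}$

By the fact that observations in different strata are independent under the model, the cross terms have expectation 0:

$$E_M[(\hat T - T)^2] = E_M\left[\left(\sum_{h=1}^{H}(\hat T_h - T_h)\right)^2\right] = E_M\left[\sum_{h=1}^{H}(\hat T_h - T_h)^2 + \sum_{h=1}^{H}\sum_{k\ne h}(\hat T_h - T_h)(\hat T_k - T_k)\right]$$

$$= E_M\left[\sum_{h=1}^{H}(\hat T_h - T_h)^2\right] = \sum_{h=1}^{H} N_h^2\left(1-\frac{n_h}{N_h}\right)\frac{\sigma_h^2}{n_h}$$

Comments:

• The theoretical variance $\sigma_h^2$ can be estimated by $s_h^2$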

• Adopting the model in (1) results in the same estimation for t

and its standard error as found under randomization theory.

• If a different model is used, however, then different estimators

are obtained.
