Investigation of Treatment of Influential Values Mary H. Mulry Roxanne M. Feldpausch.

Post on 27-Mar-2015


Investigation of Treatment of Influential Values

Mary H. Mulry

Roxanne M. Feldpausch

Outline

• Current practices

• Methods investigated

• Results

• Next steps

Influential Observation

An observation is considered influential if its weighted contribution has an excessive effect on the estimate of the total (Chambers et al 2000)

The Data - U.S. Monthly Retail Trade Survey

• Collect sales and inventories

• Monthly survey of about 12,500 retail businesses with paid employees

• Sample selected every 5 years

– Sample is stratified based on industry and sales

– Quarterly sample of births

– Deaths are removed

The Data

• Analysis done at published NAICS level

• Hidiroglou-Berthelot algorithm was run on the data before looking for influential values

• Horvitz-Thompson estimator

Causes of Influential Units

• One time or rare event

• Erroneous measure of size

• Change in the make-up of the unit

• Seasonal Businesses

Current Practices

• Analysts review an effect listing of micro level data and investigate units that may be influential

• When the analyst determines a correctly reporting unit may be influential, the case is referred to a statistician

Current Practices

• One time influential value

– Imputation

• Recurring influential value

– Weight adjustment based on the principles of representativeness

– Moving the unit to a different industry when the nature of the business changes

Goals

• To improve upon current methodology by making it more objective and rigorous

• To find methodology that uses the observation but in a manner that assures its contribution does not have an excessive effect on the total

Assumptions

• Influential observations occur infrequently, but are problematic when they appear.

• The influential observation is true, although unusual. It is not the result of a reporting or coding error.

Strategy

Identify candidate methodologies and test with real data from one industry (about 700 businesses) for a month that contains an influential value

Evaluation Criteria

• Number of influential observations detected, including the number of true and false detections made

• Estimate of bias

• Impact on month-to-month change

Notation

$$\hat{Y} = \sum_{i=1}^{n} w_i Y_i$$

where

$Y_i$ is the sales for the i-th business in a survey sample of size $n$

$w_i$ is the sample weight for the i-th unit

$X_i$ is the previous month's sales for the i-th business
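As a minimal sketch of the weighted total defined in this notation, the Python snippet below sums each unit's sales times its sample weight. The function name and the three-unit sample are hypothetical and chosen only for illustration; they are not survey data.

```python
def weighted_total(values, weights):
    """Estimate the total as the sum of w_i * Y_i over the sample."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * y for y, w in zip(values, weights))


# Hypothetical sample: sales in millions and sample weights (not survey data).
sales = [0.5, 1.2, 7.5]
weights = [10, 10, 55]

total = weighted_total(sales, weights)
print(total)
```

The last unit contributes 55 × 7.5 to the total, which shows how a single large value with a large weight can dominate the estimate.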

Methods Examined

• Weight trimming

• Reverse calibration

• Winsorization

• Generalized M-estimation

Weight Trimming

• Does not identify influential units

• Adjusts the weight of the observation

Weight Trimming

• Truncate the weight of the influential observation

• Adjust the weights of the non-influential observations to account for the remainder of the truncated weight

• Sum of the new weights is the same as the sum of the original weights

(Potter 1990)

Weight Trimming Notes

• Calculations were done within sample stratum.

• Choice of correction factor could be investigated. We arbitrarily chose ci=wi/3.
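The trimming steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in the study: the removed weight is spread proportionally over the other units in the stratum (the slides do not say how it was redistributed), and the stratum of 19 equally weighted units is hypothetical. The cutoff follows the ci = wi/3 choice noted above.

```python
def trim_weights(weights, influential, cutoff):
    """Truncate the weights of flagged units at `cutoff`, then scale up the
    remaining weights so the sum of weights is unchanged.

    Assumption: the excess weight is redistributed proportionally.
    """
    flagged = set(influential)
    trimmed = list(weights)
    excess = 0.0
    for i in flagged:
        if trimmed[i] > cutoff:
            excess += trimmed[i] - cutoff
            trimmed[i] = cutoff
    others = [i for i in range(len(weights)) if i not in flagged]
    other_sum = sum(weights[i] for i in others)
    if excess > 0 and other_sum > 0:
        factor = 1 + excess / other_sum
        for i in others:
            trimmed[i] = weights[i] * factor
    return trimmed


# Hypothetical stratum: 19 units with weight 55; unit 0 is the influential one.
stratum_weights = [55.0] * 19
cutoff = stratum_weights[0] / 3  # c_i = w_i / 3
new_weights = trim_weights(stratum_weights, [0], cutoff)
print(new_weights[0], new_weights[1], sum(new_weights))
```

Because the calculation is done within one stratum, only the weights of that stratum's other units change.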

Reverse Calibration

• Does not identify influential units

• Adjusts the value of the observation

Reverse Calibration

1. Use a robust estimation method to estimate the total

2. Modify the influential observations to achieve that total

(Chambers and Ren 2004)
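The second step can be illustrated with a deliberately simplified Python sketch. It assumes the robust total is already available from step 1 and that the flagged values are rescaled by a common factor so the weighted total matches it. The proportional rescaling, the function name, and all numbers are assumptions for illustration; the procedure in Chambers and Ren (2004) is more general.

```python
def reverse_calibrate(values, weights, influential, robust_total):
    """Rescale the flagged values so that sum(w_i * Y_i) equals robust_total.

    Assumption: all flagged values share one scaling factor.
    """
    flagged = set(influential)
    rest = sum(
        w * y
        for i, (y, w) in enumerate(zip(values, weights))
        if i not in flagged
    )
    flagged_total = sum(weights[i] * values[i] for i in flagged)
    if flagged_total == 0:
        raise ValueError("flagged units contribute nothing to the total")
    scale = (robust_total - rest) / flagged_total
    return [y * scale if i in flagged else y for i, y in enumerate(values)]


# Hypothetical data: unit 2 is the specified influential observation.
sales = [0.5, 1.2, 7.5]
weights = [10, 10, 55]
robust_total = 237.0  # hypothetical output of a robust estimator

adjusted = reverse_calibrate(sales, weights, [2], robust_total)
new_total = sum(w * y for y, w in zip(adjusted, weights))
print(adjusted, new_total)
```

Like weight trimming, this approach needs the influential unit to be specified in advance.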

Winsorization

• Identifies influential units

• Adjusts the value of the observation

Winsorization

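Type I

$$Y_i^* = \begin{cases} K, & Y_i > K \\ Y_i, & \text{otherwise} \end{cases}$$

Type II

$$Y_i^* = \begin{cases} K + \frac{1}{w_i}(Y_i - K), & Y_i > K \\ Y_i, & \text{otherwise} \end{cases}$$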
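A short Python sketch of the two rules, reading Type I as replacing a value above the cutoff K with K itself, and Type II as replacing it with K plus 1/wi of the excess over K. The function names and the cutoff used in the example are hypothetical.

```python
def winsorize_type1(y, k):
    """Type I: a value above the cutoff K is replaced by K."""
    return k if y > k else y


def winsorize_type2(y, w, k):
    """Type II: a value above K is replaced by K + (Y - K) / w, so the unit
    keeps its full excess for itself but does not represent other units
    with it."""
    return k + (y - k) / w if y > k else y


# Hypothetical cutoff; the value and weight mirror the scale of the example unit.
y, w, k = 7.5, 55, 4.0
print(winsorize_type1(y, k))
print(winsorize_type2(y, w, k))
print(winsorize_type2(3.0, w, k))  # below the cutoff, left unchanged
```

Under Type II the weighted contribution becomes wi·K + (Yi − K), so the larger the weight, the closer the replacement value is to K.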

Winsorization – Defining K

• Define a separate Kh for each stratum in a manner that minimizes the MSE (Kokic and Bell 1994)

• Define a separate Ki for each observation in a manner that minimizes the MSE (Clarke 1995)

Winsorization – Defining K

• Use unweighted data to define Kh for each stratum, where $K_h = \bar{Y}_h + 2s_h$

• Use weighted data to define Kh for each stratum, where $K_h = \bar{Y}_h + 2s_h$ and $\bar{Y}_h$ and $s_h$ are based on the weighted data
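The two "+2s" cutoffs can be sketched in Python as below. This assumes the location term in Kh is the stratum mean and sh is the stratum standard deviation. The weighted variance formula (weights normalized by their sum) is one common choice and is also an assumption, since the slides do not define it. The data are hypothetical.

```python
import math


def cutoff_unweighted(values):
    """K_h = mean + 2 * sample standard deviation of the stratum values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((y - mean) ** 2 for y in values) / (n - 1)
    return mean + 2 * math.sqrt(var)


def cutoff_weighted(values, weights):
    """K_h = weighted mean + 2 * weighted standard deviation.

    Assumption: the variance is the weight-normalized average of squared
    deviations from the weighted mean.
    """
    total_w = sum(weights)
    mean = sum(w * y for y, w in zip(values, weights)) / total_w
    var = sum(w * (y - mean) ** 2 for y, w in zip(values, weights)) / total_w
    return mean + 2 * math.sqrt(var)


# Hypothetical stratum values and weights.
stratum_values = [1.0, 2.0, 3.0, 4.0]
stratum_weights = [1.0, 1.0, 1.0, 1.0]

k_unweighted = cutoff_unweighted(stratum_values)
k_weighted = cutoff_weighted(stratum_values, stratum_weights)
print(k_unweighted, k_weighted)
```

A large value inflates both the mean and the standard deviation, which raises the cutoff and can keep the value from being flagged.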

Winsorization – Our Implementation

Used a robust regression in SAS to estimate the parameters needed in the calculations

M-estimation

M-estimators are robust estimators that come from a generalization of maximum likelihood estimation
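To illustrate the general idea, here is a sketch of a classic Huber M-estimator of location, computed by iteratively reweighted averaging. It is a generic textbook example, not the weighted technique used in this study. The tuning constant 1.345, the MAD-based scale, and the data are conventional or hypothetical choices.

```python
import statistics


def huber_location(values, c=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location via iteratively reweighted averaging."""
    mu = statistics.median(values)
    # Robust scale: median absolute deviation, scaled for normal consistency.
    mad = statistics.median([abs(y - mu) for y in values])
    scale = 1.4826 * mad
    if scale == 0:
        return mu
    for _ in range(max_iter):
        # Observations within c scaled units keep full weight; others are
        # downweighted in proportion to their distance from the estimate.
        weights = []
        for y in values:
            r = abs(y - mu) / scale
            weights.append(1.0 if r <= c else c / r)
        new_mu = sum(w * y for w, y in zip(weights, values)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


# Hypothetical data with one unusually large value.
data = [0.5, 0.6, 0.7, 0.6, 7.5]
estimate = huber_location(data)
print(estimate, statistics.mean(data))
```

The large value still contributes to the estimate, but with a reduced weight, so the result stays close to the bulk of the data.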

M-estimation

• Identifies influential units

• Adjusts either the weight or the value of the influential observation

M-estimation

Used a weighted M-estimation technique that is able to modify the weights or the values of the influential observations (Beaumont and Alavi 2004)

Results

Number of Outliers Detected

Method                  Outliers detected
Weight trimming         1*
Winsor by stratum       51
Winsor by obs           1
Winsor +2s              0
Winsor wgt +2s          4
Reverse Calibration     1*
M-estimation obs        1
M-estimation wgt        1

*Method does not detect outliers; one outlier was specified

Replacement Values (in Millions)

                        Value    Weight    Weighted Value
previous month          0.6      55        31
current month           7.5      55        413
Weight trimming*        7.5      18        135
Winsor by obs           4.0      55        220
Winsor wgt +2s**        1.6      55        87
M-estimation obs        4.3      55        234
M-estimation wgt        7.5      30        225

*Weight trimming adjusts the other 18 weights in the stratum

**Winsor wgt +2s identified 3 other values

Total Sales for the Industry

                        Total (billions)    Month-to-month percent change
previous month          42.4
current month           38.6                -9.1
weight trimming         38.3                -9.7
Winsor by obs           38.5                -9.5
Winsor wgt +2s          38.2                -9.9
M-estimation obs        38.4                -9.5
M-estimation wgt        38.4                -9.5

Chosen for Further Study

• Winsorization by each observation

• M-estimation by observation

• M-estimation by weight

Contact Information

Mary.H.Mulry@census.gov

Roxanne.Feldpausch@census.gov