Missing Values Raymond Kim Pink Preechavanichwong Andrew Wendel October 27, 2015.
Missing Values
Raymond Kim, Pink Preechavanichwong, Andrew Wendel
October 27, 2015
I. Intro: Missing Values and Bias
II. Simulations and Imputation
III. Deletion Methodology
IV. Not Missing at Random
Initial Steps
Why is our data missing?
What are the characteristics of our missing data?
How will that affect the bias of the mean and of the standard deviation?
https://www.utexas.edu/cola/prc/_files/cs/Missing-Data.pdf
OLS Unbiased Estimator
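As a reminder of why OLS is unbiased, here is the standard argument in sketch form. The notation is an assumption made for illustration: a linear model $y = X\beta + \varepsilon$ with $E[\varepsilon \mid X] = 0$ and $X$ of full column rank.

```latex
\hat{\beta} = (X^\top X)^{-1} X^\top y
            = \beta + (X^\top X)^{-1} X^\top \varepsilon
\quad\Longrightarrow\quad
E[\hat{\beta} \mid X] = \beta + (X^\top X)^{-1} X^\top E[\varepsilon \mid X] = \beta
```

Missing values become a problem for this result when the rows that remain no longer satisfy $E[\varepsilon \mid X] = 0$, for example when rows drop out because of the value of $y$ itself.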
Initial Steps
1. Identify the reason for missing data
– Marriage, graduation, death, etc.
2. Understand the distribution of missing data
– Certain groups are more likely to have missing values
3. Decide on the best method of analysis
– Deletion methods: listwise, pairwise deletion
– Single imputation methods: mean substitution, dummy variable, single regression
– Model-based methods: maximum likelihood and multiple imputation
4. Power and bias
– Too many missing values reduce power
– Introduction of bias in your estimator
https://www.utexas.edu/cola/prc/_files/cs/Missing-Data.pdf
Missing Values and Bias
Are missing values moving us away from or closer to the true DGP (data-generating process)?
$\operatorname{Bias}_\theta[\hat{\theta}] = E_\theta[\hat{\theta}] - \theta = E_\theta[\hat{\theta} - \theta]$
Conditional Distribution
MCAR (missing completely at random)
Probability(Y = Missing | X, Y) = Probability(Y = Missing)
Probability that Y is missing does not depend on X or Y
MAR (missing at random)
Probability(Y = Missing | X, Y) = Probability(Y = Missing | X)
Probability that Y is missing depends on X but not Y
NMAR (not missing at random)
Probability(Y = Missing | X, Y) = Probability(Y = Missing | X, Y)
Probability that Y is missing depends on Y and possibly on X
A.C. Davison, Statistical Models, Cambridge University Press
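The three mechanisms can be illustrated with a small simulation. This is only a sketch under assumed settings (standard normal X, Y = X + noise, a 30% MCAR rate, and a cutoff of 0.5 for the MAR and NMAR rules); none of these numbers come from the slides. It compares the mean of the observed Y (true value 0) and the slope of Y on X (true value 1) after each kind of missingness.

```python
import random
import statistics

def ols_slope(pairs):
    """Slope of the least-squares line of y on x."""
    mx = statistics.mean([x for x, _ in pairs])
    my = statistics.mean([y for _, y in pairs])
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx

def simulate(n=20000, seed=0):
    """Generate (x, y) pairs, then keep the rows that survive each missingness rule."""
    rng = random.Random(seed)
    full = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        full.append((x, x + rng.gauss(0, 1)))  # true mean of y is 0, true slope is 1
    return {
        "full": full,
        # MCAR: every y has the same 30% chance of being missing.
        "MCAR": [p for p in full if rng.random() > 0.3],
        # MAR: y is missing whenever x is large (depends on X only).
        "MAR": [p for p in full if p[0] <= 0.5],
        # NMAR: y is missing whenever y itself is large.
        "NMAR": [p for p in full if p[1] <= 0.5],
    }

samples = simulate()
summary = {
    name: (statistics.mean([y for _, y in pairs]), ols_slope(pairs))
    for name, pairs in samples.items()
}
for name, (mean_y, slope) in summary.items():
    print(f"{name:5s} mean(y) = {mean_y:+.3f}   slope = {slope:.3f}")
```

In this setup the MCAR sample should stay close to the full data, the MAR sample should show a shifted unconditional mean but a slope near 1 (unbiased conditional on X), and the NMAR sample should distort both.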
Example: Sea Level
[Figure: four panels: Normal Data, MCAR, NMAR, MAR]
A.C. Davison, Statistical Models, Cambridge University Press
Bias Matrix – Does Bias Exist?
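| Mechanism | Deletion: mean | Deletion: std | Mean imputation: mean | Mean imputation: std |
| --- | --- | --- | --- | --- |
| MCAR | None (but reduced power) | None (but reduced power) | None | < 0 |
| MAR | Conditional: None; Unconditional: Yes | Conditional: None; Unconditional: Yes | Conditional: None; Unconditional: Yes | Conditional: Yes, < 0; Unconditional: Yes |
| NMAR | Yes | Yes | Yes | Yes |

A.C. Davison, Statistical Models, Cambridge University Press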
Working with Missing Data
MCAR
• Deletion
• Maximum Likelihood
• Multiple Imputation
• Single Imputation
MAR
• Maximum Likelihood
• Multiple Imputation
• Single Imputation
NMAR
• Sensitivity Analysis
• Pattern Mixture Models
• Selection Model
• Maximum Entropy
https://www.utexas.edu/cola/prc/_files/cs/Missing-Data.pdf
Listwise and Pairwise Deletion
Missing values are MCAR, MAR, or NMAR:
• UNBIASED: MCAR, MAR (conditional)
• BIASED: NMAR
Single Imputation
Mean/Mode Substitution
• Replace missing data with mean or mode
• Introduces bias in estimated variance
Dummy Variable Control
• Create indicator (1=missing, 0=not missing)
• Impute missing values to a constant
Conditional Mean Substitution
• Replace missing values with predicted score from a regression
• Overestimates model fit
https://www.utexas.edu/cola/prc/_files/cs/Missing-Data.pdf
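The dummy variable approach on this slide can be sketched in a few lines. The function name, the use of `None` for a missing value, the fill constant of 0.0, and the income figures are all assumptions made for illustration.

```python
def dummy_variable_adjust(values, fill=0.0):
    """Return (filled_values, missing_indicator) for a list that uses None for missing."""
    indicator = [1 if v is None else 0 for v in values]   # 1 = missing, 0 = not missing
    filled = [fill if v is None else v for v in values]   # impute a constant
    return filled, indicator

income = [52.0, None, 61.5, None, 48.0]  # made-up values
filled, is_missing = dummy_variable_adjust(income)
print(filled)      # [52.0, 0.0, 61.5, 0.0, 48.0]
print(is_missing)  # [0, 1, 0, 1, 0]
```

Both columns would then enter the model together, so the indicator absorbs the effect of the arbitrary constant.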
Simulations and Imputation
Imputing Values
• Deal with missing data by generating values for those that are missing.
• A variety of methods can impute these values; they vary in accuracy and complexity.
• We will focus on single imputation methods and a few multiple imputation methods.
Mean Imputation
• We can use the mean in place of the missing values
• This preserves the mean of the observed data
• It also causes a negative bias in the variance, because every imputed value sits exactly at the mean
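A tiny example makes both effects visible. It is a sketch with made-up numbers, using `None` to mark a missing value (an assumed convention) and the sample variance with the n - 1 divisor.

```python
import statistics

def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

data = [2.0, 4.0, None, 6.0, None, 8.0]  # made-up values
observed = [v for v in data if v is not None]
imputed = mean_impute(data)

print("mean:    ", statistics.mean(observed), "->", statistics.mean(imputed))
print("variance:", statistics.variance(observed), "->", statistics.variance(imputed))
```

The mean is unchanged, while the variance drops because the sum of squared deviations stays the same and the number of observations grows.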
Regression Mean Imputation
• Instead of using the mean, we can use regression to give us predicted values for those missing.
• When the predictors are related to the variable with missing values, this can give better estimates than the overall mean.
http://missingdata.lshtm.ac.uk/
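A minimal sketch of regression imputation with a single predictor, assuming `None` marks a missing Y and that X is fully observed. The function name and data are hypothetical.

```python
import statistics

def regression_impute(xs, ys):
    """Fill None entries of ys with predictions from a least-squares line fitted on complete pairs."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    mx = statistics.mean([x for x, _ in complete])
    my = statistics.mean([y for _, y in complete])
    slope = (sum((x - mx) * (y - my) for x, y in complete)
             / sum((x - mx) ** 2 for x, _ in complete))
    intercept = my - slope * mx
    return [intercept + slope * x if y is None else y for x, y in zip(xs, ys)]

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, None, 8.0, None]  # made-up values
filled = regression_impute(xs, ys)
print(filled)
```

Every imputed value lands exactly on the fitted line, with no residual scatter, which is one way to see why single regression imputation tends to overstate model fit.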
Multiple Imputations
• A more complex way to impute missing values.
• Imputes the missing values several times to create several completed data sets, analyzes each one, and combines the results.
http://www.stefvanbuuren.nl/mi/MI.html
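The combining step is usually done with the pooling formulas commonly known as Rubin's rules, which the slides do not spell out. The sketch below assumes each imputed data set has already produced an estimate and its squared standard error; the function name and numbers are hypothetical.

```python
import statistics

def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their squared standard errors."""
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)        # average sampling variance
    between = statistics.variance(estimates)   # spread of estimates across imputations
    total = within + (1 + 1 / m) * between     # total variance of the pooled estimate
    return pooled, total

estimates = [10.0, 12.0, 11.0]  # made-up estimates from m = 3 imputed data sets
variances = [4.0, 4.0, 4.0]     # made-up squared standard errors
pooled, total_var = pool_rubin(estimates, variances)
print(pooled, total_var)
```

The between-imputation term is what single imputation leaves out: it carries the extra uncertainty that comes from not knowing the missing values.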
A Few R Methods
How can we do this in R?
• Amelia
• mi
There are many others, and some can be used to treat specific conditions for certain data sets.
Amelia
Amelia is an R package whose algorithm bootstraps the data and uses the bootstrapped samples in a multiple imputation process.
http://gking.harvard.edu/files/gking/files/amelia_jss.pdf?m=1360040717
mi
“mi” imputes missing values using Bayesian regression methods, which are run a number of times and analyzed for convergence.
This method is very customizable, but is also computationally costly.
https://cran.r-project.org/web/packages/mi/mi.pdf
Additional Resources
Additional packages that can be used in R can be found here:
http://www.stefvanbuuren.nl/mi/Software.html
Imputation Summary
In order to use imputation-based methods we first need to understand the data and the reason for the “missingness” of the data.
Knowing this, we can choose the method that is most appropriate for our data set.
Single imputation methods can give us quick and easy answers to our missing values, but they also bias statistics like the variance.
Multiple imputation methods can handle the bias better but are complex and require more specialized R packages or software.
Deletion Methodology
Bias
• Bias = 0 means no bias
• Bias > 0: there is a systematic tendency for the estimate to be larger than the parameter it is estimating.
• Bias < 0: there is a systematic tendency for the estimate to be smaller than the parameter it is estimating.
Credit: email from Dr. Westfall
Listwise Vs Pairwise Deletion
What are they?
• They are methods that discard data.
How do they work?
• Listwise (complete-case analysis): excludes all units for which the outcome or any of the inputs are missing.
• Pairwise (available-case analysis): excludes a pair of values only when one or both of them are missing, so each analysis uses every case available for the variables involved.
What is the difference?
• Pairwise attempts to minimize the loss that occurs in listwise deletion.
Credit: http://www.stat.columbia.edu/~gelman/arm/missing.pdf
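The difference is easiest to see by counting how many rows each method keeps. This is a sketch on a made-up four-row data set, with `None` marking a missing value (an assumed convention) and hypothetical column names.

```python
rows = [
    {"a": 1.0, "b": 2.0, "c": 3.0},
    {"a": 2.0, "b": None, "c": 5.0},
    {"a": 3.0, "b": 4.0, "c": None},
    {"a": 4.0, "b": 6.0, "c": 9.0},
]

def listwise(rows):
    """Keep only rows with no missing value in any column."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, col1, col2):
    """Keep rows where the two columns used by this analysis are both observed."""
    return [r for r in rows if r[col1] is not None and r[col2] is not None]

print(len(listwise(rows)))            # 2
print(len(pairwise(rows, "a", "b")))  # 3
print(len(pairwise(rows, "a", "c")))  # 3
print(len(pairwise(rows, "b", "c")))  # 2
```

Pairwise deletion keeps more data per analysis, but each pair of variables is then summarized on a different subset of rows.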
Listwise Vs Pairwise Deletion (Cont’)
[Figure: two panels: Listwise deletion, Pairwise deletion]
Listwise Vs Pairwise Deletion (Cont’)
Pros and cons of listwise and pairwise deletion:
• Listwise:
– The sample after deletion may not be representative of the full sample.
– Power is reduced, so Type II error rates increase.
– Tendency to get biased results.
• Pairwise:
– Preserves or increases statistical power in the analyses, relative to listwise deletion.
– The result will be the same as listwise deletion if the data has only two variables (columns).
– Bias (over- or underestimated).
Credit: https://www.statisticssolutions.com/missing-data-listwise-vs-pairwise/
Credit: http://files.eric.ed.gov/fulltext/ED281854.pdf
Not Missing at Random
Case of NMAR
Why are our values missing?
• High-income individuals don’t report income
What is the characteristic of the missing data?
• Missing values are NMAR
[Figure: Our sample]
Meboot Package
Our NMAR missing values introduce the estimator bias that is hardest to correct.
We don’t know the true distribution, but we can infer a similar distribution for imputation.
Maximum entropy is for time series statistical inference when traditional assumptions are unreliable.
For the worst-case scenario:
• Missing values are NMAR
• Missing values follow a different distribution
• Extraction of this distribution is not available from historical data
– e.g. company stock enters bankruptcy
– Company stock trading is halted
– Your client is calling and wants to know whether they should sell or hold
– This is a methodology for a “best guess” in the worst possible case
[Figure: Our sample]
https://cran.r-project.org/web/packages/meboot/index.html
Evaluation of a Fund Manager
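Yearly Returns

| Year | Bond Fund | Equity Fund |
| --- | --- | --- |
| 2007 | 8.54% | 7.58% |
| 2008 | 9.58% | NA |
| 2009 | -1.87% | 23.14% |
| 2010 | 5.46% | 13.44% |

Yearly Returns

| Year | 10 Year Treasury | S&P 500 |
| --- | --- | --- |
| 2007 | 10.21% | 5.48% |
| 2008 | 20.10% | -36.55% |
| 2009 | -11.12% | 25.94% |
| 2010 | 8.46% | 14.82% |
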
• While evaluating a fund manager for investment you notice that the fund did not include 2008 returns for its equity fund
• You highly suspect it is NMAR – It was left out because returns were bad
Evaluation of a Fund Manager
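Sector Breakdown

| Sector | Equity Fund '07, '09, '10 | US Markets |
| --- | --- | --- |
| IT | 20.7% | 20.4% |
| Financials | 17.5% | 16.5% |
| Health Care | 15.8% | 14.7% |
| Cons. D | 15.3% | 13.1% |
| Industrials | 10.8% | 10.1% |
| Cons. S | 10.1% | 9.9% |
| Energy | 8.2% | 6.9% |
| Utilities | 3.5% | 3.1% |
| Materials | 3.0% | 2.8% |
| Telecom | 2.8% | 2.4% |
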
• You find out that the equity fund normally held stocks representative of the entire stock market
• Distribution of the missing data may follow the overall US equity market
Meboot Maximum Entropy
• Data dependent nonstandard bootstrap
• Creates a population of time series that is non-stationary (i.e. mean changes over time)
• Creates a large number of replicates based on your provided ensemble:
1. Sort the provided data in increasing order
2. Compute intermediate points of the sorted data
3. Compute min/max
4. Compute mean-preserving constraints
5. Generate random U[0,1] draws over the intervals
6. Repeat
https://cran.r-project.org/web/packages/meboot/index.html
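The steps above can be sketched in Python. This is a simplified illustration in the spirit of the maximum entropy bootstrap, not the meboot implementation: it skips the mean-preserving constraints of step 4, extends the min/max by a plain mean of absolute changes (an assumption), and uses a hypothetical function name. The input series is the S&P 500 column from the yearly returns table.

```python
import random

def me_bootstrap_replicate(series, rng):
    """One simplified maximum-entropy-style replicate of a time series."""
    n = len(series)
    order = sorted(range(n), key=lambda i: series[i])  # time indices, by increasing value
    xs = [series[i] for i in order]                    # step 1: sorted data

    # Step 2: intermediate points between neighbouring sorted values.
    mids = [(xs[k] + xs[k + 1]) / 2 for k in range(n - 1)]

    # Step 3: lower and upper limits, extended by the mean absolute change.
    spread = sum(abs(series[t] - series[t - 1]) for t in range(1, n)) / (n - 1)
    z = [xs[0] - spread] + mids + [xs[-1] + spread]

    # Step 5: uniform draws mapped onto the n intervals [z[k], z[k+1]],
    # each interval carrying probability 1/n.
    draws = []
    for _ in range(n):
        u = rng.random() * n
        k = min(int(u), n - 1)
        draws.append(z[k] + (u - k) * (z[k + 1] - z[k]))
    draws.sort()

    # Put the sorted draws back in the original time order, so ranks are preserved.
    replicate = [0.0] * n
    for rank, t in enumerate(order):
        replicate[t] = draws[rank]
    return replicate

rng = random.Random(42)
returns = [5.48, -36.55, 25.94, 14.82]  # S&P 500 yearly returns, 2007-2010
ensemble = [me_bootstrap_replicate(returns, rng) for _ in range(200)]  # step 6: repeat
print(ensemble[0])
```

Each replicate keeps the rank pattern of the original series (the worst year stays the worst year), which is what lets the method work without assuming stationarity.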
Meboot Maximum Entropy
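Yearly Returns

| Year | Bond Fund | Equity Fund |
| --- | --- | --- |
| 2007 | 8.54% | 7.58% |
| 2008 | 9.58% | -35.47% |
| 2009 | -1.87% | 23.14% |
| 2010 | 5.46% | 13.44% |
| μ | 5.43% | 14.72% |
| σ | 5.17% | 7.86% |
| μ ME | | 2.17% |
| σ ME | | 25.90% |
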
• NMAR missing values require the most assumptions
• Minimizing bias for NMAR depends heavily on your model setup
• There is no “right” answer; we do not know the true DGP
• All we can do is minimize bias with well-grounded assumptions
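The summary figures on this slide can be reproduced from the yearly returns. The sketch below rests on a reading inferred from the arithmetic rather than stated on the slide: the equity fund's 14.72% and 7.86% use only the three reported years, while 2.17% and 25.90% include the imputed 2008 value of -35.47%. Standard deviations use the sample (n - 1) divisor.

```python
import statistics

bond = [8.54, 9.58, -1.87, 5.46]
equity_observed = [7.58, 23.14, 13.44]          # 2007, 2009, 2010 (2008 missing)
equity_imputed = [7.58, -35.47, 23.14, 13.44]   # 2008 filled with the imputed value

def summarize(returns):
    """Sample mean and sample standard deviation (n - 1 divisor), in percent."""
    return statistics.mean(returns), statistics.stdev(returns)

results = {
    "Bond": summarize(bond),
    "Equity, 2008 dropped": summarize(equity_observed),
    "Equity, 2008 imputed": summarize(equity_imputed),
}
for label, (mu, sigma) in results.items():
    print(f"{label:22s} mean = {mu:6.2f}%  std = {sigma:6.2f}%")
```

Dropping the missing year makes the equity fund look both more profitable and far less volatile than it does once a plausible 2008 value is filled in.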
Questions?
THANK YOU!