Missing data handling
Location:
Boston Data Festival
September 23rd 2016
What’s Missing? Methods in missing data analysis
2016 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy, CFA, CAP
www.QuantUniversity.com
Slides and code
• Will be posted on the QuantUniversity Meetup page.
• If you are not a member, sign up here: https://www.meetup.com/QuantUniversity-Meetup/
- Analytics advisory services
- Custom training programs
- Architecture assessments, advice and audits
• Founder of QuantUniversity LLC. and
www.analyticscertificate.com
• Advisory and Consultancy for Financial
Analytics
• Prior Experience at MathWorks, Citigroup
and Endeca and 25+ financial services and
energy customers
• Regular Columnist for the Wilmott
Magazine
• Author of forthcoming book
“Financial Modeling: A case study
approach” published by Wiley
• Chartered Financial Analyst and Certified
Analytics Professional
• Teaches Analytics in the Babson College
MBA program and at Northeastern
University, Boston
Sri Krishnamurthy
Founder and CEO
Quantitative Analytics and Big Data Analytics Onboarding
• Trained more than 500 students in Quantitative methods, Data Science and Big Data Technologies using MATLAB, Python and R
• Launching the Analytics Certificate Program in 2016
(MATLAB version also available)
www.analyticscertificate.com/SparkWorkshop
Early bird pricing ending
today!
Definition
Assumptions
Work flow
What does a missing data problem look like?
Missing data
• Dealing with missing data has always been a challenge in data analysis.
• We need methods for missing data analysis that:
▫ Minimize bias
▫ Maximize the use of available information
▫ Give good estimates of uncertainty (e.g., p-values, confidence intervals)
[Figure: a data matrix of records (Rec No 1 to 4) by variables (up to Variable n), with regions labelled unit non-response, item non-response, missing data, and unobserved/latent variable]
Assumptions (MCAR, MAR, MNAR)
• Data are missing completely at random (MCAR) when the probability of missingness is unrelated to the values of any variable, observed or missing.
• Data are missing at random (MAR) if missingness is unrelated to the value that is missing but is related to the values of other, observed variables.
▫ E.g., a question about the typical number of hours spent browsing the internet might be missing more often for married than for unmarried participants, but within the married subset missingness is completely random: it is not related to how many hours the person browses.
• Data are missing not at random (MNAR, also written NMAR) if missingness is related to the value that is missing, and often to the values of other variables as well.
▫ E.g., missing values are more prevalent among those who typically browse more than among those who browse less.
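The three assumptions can be made concrete with a small simulation. This is a minimal sketch using made-up survey data; the sample size, the missingness probabilities and the variable names `married` and `hours` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical survey: marital status and weekly hours spent browsing.
married = rng.random(n) < 0.5
hours = rng.gamma(shape=2.0, scale=5.0, size=n)

# MCAR: every value has the same 20% chance of being missing.
mcar = rng.random(n) < 0.20

# MAR: missingness depends only on an observed variable (marital status).
mar = rng.random(n) < np.where(married, 0.40, 0.10)

# MNAR: missingness depends on the value that goes missing (heavy browsers).
mnar = rng.random(n) < np.where(hours > np.median(hours), 0.40, 0.10)

for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(f"{name}: mean of observed hours = {hours[~mask].mean():.2f} "
          f"(mean of all hours = {hours.mean():.2f})")
```

Under MNAR the mean of the observed values is pulled downward because heavy browsers drop out more often, which is the kind of bias the later methods try to limit.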
Methods of handling missing data
• Deletion methods: delete cases or variables that have missing values
▫ Listwise deletion
▫ Pairwise deletion
▫ Variable deletion
• Imputation methods: substitute a value for each missing value
▫ Single imputation
  - Mean imputation
  - Conditional mean imputation
  - Case mean imputation
  - Regression imputation
  - Last observation carried forward
  - Worst case imputation
  - Best case imputation
  - EM imputation
▫ Multiple imputation
Listwise deletion
• A good method when the proportion of missing data is less than 15%.
• Advantages:
▫ It can be used for any type of statistical analysis.
▫ No special computations are required.
▫ The parameter estimates are unbiased when the data are MCAR.
▫ The standard errors are appropriate for the reduced sample.
• Disadvantages:
▫ May remove a considerable fraction of the data
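As a rough illustration, listwise deletion is a one-liner in pandas. The sketch below reuses the 10-case example table from the imputation slides and assumes that Var1 for case 2 and Var3 for case 9 are the missing values.

```python
import numpy as np
import pandas as pd

# Example table from the imputation slides; NaN marks a missing value
# (assumed: Var1 of case 2 and Var3 of case 9 are missing).
df = pd.DataFrame(
    {
        "Var1": [9, np.nan, 8, 7, 9, 8, 6, 5, 7, 8],
        "Var2": [8, 7, 5, 4, 5, 8, 7, 9, 8, 8],
        "Var3": [8, 6, 6, 5, 7, 9, 6, 7, np.nan, 7],
    },
    index=pd.RangeIndex(1, 11, name="Case"),
)

# Listwise deletion: keep only the cases with no missing value at all.
complete_cases = df.dropna()
print(f"{len(df)} cases before, {len(complete_cases)} after listwise deletion")
```

Two missing cells cost two whole cases here, which is the disadvantage the slide points out.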
Pairwise deletion
• Pairwise deletion involves dropping cases with missing values on an analysis-by-analysis basis
• Advantages:
▫ Uses all available non-missing data
• Disadvantages:
▫ Each statistic is computed on a different subset of cases, so estimated standard errors and test statistics can be biased
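A correlation matrix is a common place where pairwise deletion shows up. The sketch below relies on the documented behaviour of pandas `DataFrame.corr()`, which excludes missing values pair by pair; the data are the slide example table under the assumption that Var1 of case 2 and Var3 of case 9 are missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Var1": [9, np.nan, 8, 7, 9, 8, 6, 5, 7, 8],
        "Var2": [8, 7, 5, 4, 5, 8, 7, 9, 8, 8],
        "Var3": [8, 6, 6, 5, 7, 9, 6, 7, np.nan, 7],
    },
    index=pd.RangeIndex(1, 11, name="Case"),
)

# Each correlation uses every case in which both variables are observed.
corr = df.corr()

# Number of cases that contribute to each pair of variables.
present = df.notna().astype(int)
n_pairs = present.T @ present
print(n_pairs)
```

The counts differ from cell to cell (10, 9 or 8 cases), which is why the resulting statistics do not share a single sample size.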
Variable deletion
• Variable deletion involves dropping variables with missing values on a case-by-case basis
• Advantages:
▫ Makes sense when a variable has a lot of missing values and is of relatively low importance
• Disadvantages:
▫ Loss of information regarding the variable
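One possible way to apply variable deletion is to drop any column whose share of missing values exceeds a cut-off. Everything in this sketch is hypothetical: the column names, the data and the 50% threshold are assumptions for illustration, not values from the talk.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "notes_score" is mostly missing and assumed unimportant.
df = pd.DataFrame(
    {
        "age": [34, 51, 29, 42, 38],
        "income": [52.0, np.nan, 61.0, 48.0, 75.0],
        "notes_score": [np.nan, np.nan, 3.0, np.nan, np.nan],
    }
)

# Share of missing values per variable.
missing_share = df.isna().mean()

# Assumed cut-off: drop variables with more than half of their values missing.
max_missing = 0.5
keep = missing_share[missing_share <= max_missing].index
reduced = df[keep]
print(list(reduced.columns))
```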
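Mean imputation
• Replace missing values with the mean of that variable
• In the example, Var1 for case 2 is imputed as 7.44, the mean of the nine observed Var1 values; "?" marks a value that is still missing

Case  Var1  Var2  Var3
1     9     8     8
2     7.44  7     6
3     8     5     6
4     7     4     5
5     9     5     7
6     8     8     9
7     6     7     6
8     5     9     7
9     7     8     ?
10    8     8     7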
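The imputed 7.44 in the table can be reproduced with pandas. This is a minimal sketch that assumes Var1 of case 2 and Var3 of case 9 are the missing cells; unlike the slide, it fills both of them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Var1": [9, np.nan, 8, 7, 9, 8, 6, 5, 7, 8],
        "Var2": [8, 7, 5, 4, 5, 8, 7, 9, 8, 8],
        "Var3": [8, 6, 6, 5, 7, 9, 6, 7, np.nan, 7],
    },
    index=pd.RangeIndex(1, 11, name="Case"),
)

# df.mean() skips NaN, so each column is filled with the mean of its observed values.
imputed = df.fillna(df.mean())
print(round(imputed.loc[2, "Var1"], 2))  # 7.44, as on the slide
```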
Conditional mean imputation
• Replace missing values with the mean of the variable for a relevant subgroup
• In the example, Var1 for case 2 is imputed as 8.25, the mean of Var1 among the other female cases

Case  Var1  Sex  Var2  Var3
1     9     F    8     8
2     8.25  F    7     6
3     8     F    5     6
4     7     F    4     5
5     9     F    5     7
6     8     M    8     9
7     6     M    7     6
8     5     M    9     7
9     7     M    8     ?
10    8     M    8     7
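A group-wise mean can be computed with `groupby(...).transform("mean")`. The sketch below assumes Var1 of case 2 is the missing cell and uses Sex as the subgroup, as in the table.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Var1": [9, np.nan, 8, 7, 9, 8, 6, 5, 7, 8],
        "Sex": ["F", "F", "F", "F", "F", "M", "M", "M", "M", "M"],
        "Var2": [8, 7, 5, 4, 5, 8, 7, 9, 8, 8],
    },
    index=pd.RangeIndex(1, 11, name="Case"),
)

# Mean of Var1 within each Sex group, aligned back to the original rows.
group_mean = df.groupby("Sex")["Var1"].transform("mean")

# Fill each missing Var1 with the mean of its own subgroup.
df["Var1"] = df["Var1"].fillna(group_mean)
print(df.loc[2, "Var1"])  # 8.25, as on the slide
```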
Case mean imputation
• Replace a missing value using the values of the other variables for the same case
• In the example, Var1 for case 2 is imputed as 6.50, the mean of that case's Var2 and Var3

Case  Var1  Var2  Var3
1     9     8     8
2     6.50  7     6
3     8     5     6
4     7     4     5
5     9     5     7
6     8     8     9
7     6     7     6
8     5     9     7
9     7     8     ?
10    8     8     7
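Case mean imputation works across a row rather than down a column. The sketch below assumes Var1 of case 2 and Var3 of case 9 are missing, and it only makes sense under the further assumption that the variables share a common scale (for example, items of one questionnaire).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Var1": [9, np.nan, 8, 7, 9, 8, 6, 5, 7, 8],
        "Var2": [8, 7, 5, 4, 5, 8, 7, 9, 8, 8],
        "Var3": [8, 6, 6, 5, 7, 9, 6, 7, np.nan, 7],
    },
    index=pd.RangeIndex(1, 11, name="Case"),
)

# Mean of the observed values within each case (NaN is skipped).
case_mean = df.mean(axis=1)

# Fill each column's gaps with the mean of the same case.
imputed = df.apply(lambda col: col.fillna(case_mean))
print(imputed.loc[2, "Var1"])  # 6.5, as on the slide
```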
Regression imputation
• Replace missing values using information from complete cases to "predict" the missing value, based on a regression equation fitted to the cases with no missing values

Case  Var1  Var2  Var3
1     9     8     8
2     6.32  7     6
3     8     5     6
4     7     4     5
5     9     5     7
6     8     8     9
7     6     7     6
8     5     9     7
9     7     8     ?
10    8     8     7

VAR1′ = 4.621 - (0.734 * VAR2) + (1.139 * VAR3)
For case 2: 4.621 - (0.734 * 7) + (1.139 * 6) ≈ 6.32
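The regression equation can be recovered from the table with ordinary least squares on the eight complete cases. This is a sketch with NumPy, assuming Var1 of case 2 and Var3 of case 9 are the missing cells.

```python
import numpy as np

# Columns: Var1, Var2, Var3 (rows are cases 1 to 10; NaN marks a missing value).
data = np.array(
    [
        [9, 8, 8], [np.nan, 7, 6], [8, 5, 6], [7, 4, 5], [9, 5, 7],
        [8, 8, 9], [6, 7, 6], [5, 9, 7], [7, 8, np.nan], [8, 8, 7],
    ],
    dtype=float,
)

# Fit Var1 ~ Var2 + Var3 on the complete cases only.
complete = ~np.isnan(data).any(axis=1)
X = np.column_stack([np.ones(complete.sum()), data[complete, 1], data[complete, 2]])
y = data[complete, 0]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b2, b3 = coef

# Predict the missing Var1 of case 2 from its Var2 = 7 and Var3 = 6.
pred = b0 + b2 * data[1, 1] + b3 * data[1, 2]
print(f"VAR1' = {b0:.3f} + ({b2:.3f} * VAR2) + ({b3:.3f} * VAR3)")
print(f"imputed Var1 for case 2: {pred:.2f}")
```

The fitted coefficients match the slide's equation to three decimals, and the prediction rounds to 6.32.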
Last observation carried forward
• Imputes a missing value with the most recent observed value of the same outcome
• Variants:
▫ Average of T1 and T2

Case  T1  T2  T3
1     9   8   8
2     ?   7   6
3     8   5   6
4     7   4   5
5     9   5   7
6     8   8   9
7     6   7   6
8     5   9   7
9     7   8   8
10    8   8   7
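In pandas, carrying the last observation forward across time-ordered columns can be sketched with `ffill(axis=1)`. The table below is hypothetical (it is not the slide's table) and is chosen so that both a fillable gap and a gap at the first time point appear.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures: one row per case, columns in time order.
visits = pd.DataFrame(
    {
        "T1": [9.0, 7.0, 8.0, np.nan],
        "T2": [8.0, np.nan, 5.0, 7.0],
        "T3": [np.nan, np.nan, 6.0, 6.0],
    },
    index=pd.Index([1, 2, 3, 4], name="Case"),
)

# Carry the most recent observed value forward along each row.
locf = visits.ffill(axis=1)
print(locf)
```

A value missing at the first time point, like T1 for case 4 here, has nothing before it to carry forward and stays missing.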
Interpolation
• Use interpolation to fill in missing values
• Useful for longitudinal datasets
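For an evenly spaced series, linear interpolation is available as `Series.interpolate`. The measurements below are hypothetical and only serve to show the call.

```python
import numpy as np
import pandas as pd

# Hypothetical longitudinal series with two consecutive gaps.
s = pd.Series([10.0, np.nan, np.nan, 16.0, 18.0], index=[1, 2, 3, 4, 5])

# Linear interpolation treats the observations as equally spaced.
filled = s.interpolate(method="linear")
print(filled.tolist())  # [10.0, 12.0, 14.0, 16.0, 18.0]
```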
Worst case and Best case imputation
• Worst case replaces a missing value with the worst case scenario for a categorical outcome
• Best case replaces a missing value with the best case scenario for a categorical outcome
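Filling the gaps both ways gives a lower and an upper bound on a summary such as a success rate. The sketch below uses a hypothetical binary outcome and assumes that 1 is the favourable category.

```python
import numpy as np
import pandas as pd

# Hypothetical binary outcome: 1 = favourable, 0 = unfavourable, NaN = missing.
outcome = pd.Series([1, 0, np.nan, 1, np.nan, 0], name="responded")

worst = outcome.fillna(0)  # every missing case counted as unfavourable
best = outcome.fillna(1)   # every missing case counted as favourable

print(f"success rate between {worst.mean():.2f} and {best.mean():.2f}")
```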
Expectation-Maximization
• Substitutes missing values using maximum-likelihood (ML) estimation
• In the E-step, expected values of the missing data are calculated from the observed data and the current parameter estimates
• In the M-step, the procedure imputes the expected values from the E-step and then maximizes the likelihood function to obtain new parameter estimates
• The two steps are repeated until the parameter estimates converge
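The E-step and M-step can be sketched for one specific model. The code below assumes the data follow a multivariate normal distribution, which the talk does not state, and `em_impute` is a hypothetical helper name. It is run on the slide example table with Var1 of case 2 and Var3 of case 9 treated as missing.

```python
import numpy as np

def em_impute(X, n_iter=200):
    """EM under an assumed multivariate normal model.

    Returns (filled data, mean, covariance). Every row is assumed to
    have at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)              # starting values from observed data
    sigma = np.diag(np.nanvar(X, axis=0))
    filled = np.where(miss, mu, X)
    for _ in range(n_iter):
        extra = np.zeros((p, p))            # conditional covariance of missing parts
        for i in np.flatnonzero(miss.any(axis=1)):
            m, o = miss[i], ~miss[i]
            # E-step: expected missing values given observed ones and current parameters.
            coef = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
            filled[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
            extra[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ sigma[np.ix_(o, m)]
        # M-step: maximum-likelihood update of the mean and covariance.
        mu = filled.mean(axis=0)
        centered = filled - mu
        sigma = (centered.T @ centered + extra) / n
    return filled, mu, sigma

data = np.array(
    [
        [9, 8, 8], [np.nan, 7, 6], [8, 5, 6], [7, 4, 5], [9, 5, 7],
        [8, 8, 9], [6, 7, 6], [5, 9, 7], [7, 8, np.nan], [8, 8, 7],
    ],
    dtype=float,
)
filled, mu, sigma = em_impute(data)
print(np.round(filled[[1, 8]], 2))  # the two cases that had a missing value
```

A fixed number of iterations stands in for a proper convergence check to keep the sketch short.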
Multiple imputation
• Multiple imputation is quickly becoming the "gold standard" approach to handling missing values
• Computationally complex
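The idea behind multiple imputation is to create several completed datasets, analyze each one, and pool the results with Rubin's rules. The sketch below is deliberately simplified: the data are simulated, the imputation model is a plain regression with added noise, and the regression parameters are not redrawn for each imputation as a fully proper procedure would do.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: y depends on x, and about 30% of y is missing (MCAR).
n = 200
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=n)
missing = rng.random(n) < 0.3
y_obs = np.where(missing, np.nan, y)

# Imputation model fitted on the complete cases.
obs = ~missing
X_obs = np.column_stack([np.ones(obs.sum()), x[obs]])
beta, *_ = np.linalg.lstsq(X_obs, y_obs[obs], rcond=None)
resid_sd = np.std(y_obs[obs] - X_obs @ beta, ddof=2)

m = 20  # number of imputed datasets (assumed)
estimates, variances = [], []
for _ in range(m):
    y_imp = y_obs.copy()
    # Stochastic regression imputation: prediction plus random residual noise.
    noise = rng.normal(scale=resid_sd, size=missing.sum())
    y_imp[missing] = beta[0] + beta[1] * x[missing] + noise
    estimates.append(y_imp.mean())             # analysis step: mean of y
    variances.append(y_imp.var(ddof=1) / n)    # its squared standard error

# Rubin's rules: pool the m analyses.
q_bar = np.mean(estimates)
within = np.mean(variances)
between = np.var(estimates, ddof=1)
total_var = within + (1 + 1 / m) * between
print(f"pooled mean = {q_bar:.3f}, pooled SE = {np.sqrt(total_var):.3f}")
```

The between-imputation term is what single imputation leaves out, so the pooled standard error reflects the uncertainty caused by the missing values.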
Summary
We have covered:
• Introduction: missing data definition, assumptions and work flow
• Deletion methods
▫ Listwise deletion
▫ Pairwise deletion
▫ Variable deletion
• Imputation methods
▫ Single imputation: mean imputation, conditional mean imputation, case mean imputation, regression imputation, last observation carried forward, interpolation, worst case imputation, best case imputation, EM imputation
▫ Multiple imputation
Answer this question on the Eventbrite site
promo code section and take an additional
25% off! Only today!
What package in Python do you need to
use DataFrame functionality?
Thank you!
Sri Krishnamurthy, CFA, CAP
Founder and CEO
srikrishnamurthy
Contact
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and
shall not be distributed or used in any other publication without the prior written consent of QuantUniversity LLC.