Missing data handling

30
Location: Boston Data Festival September 23 rd 2016 What’s Missing ? Methods in missing data analysis 2016 Copyright QuantUniversity LLC. Presented By: Sri Krishnamurthy, CFA, CAP www.QuantUniversity.com [email protected]

Transcript of Missing data handling

Page 1: Missing data handling

Location:

Boston Data Festival

September 23rd 2016

What’s Missing ? Methods in missing data analysis

2016 Copyright QuantUniversity LLC.

Presented By:

Sri Krishnamurthy, CFA, CAP

www.QuantUniversity.com

[email protected]

Page 2: Missing data handling

• Will be on the QuantUniversity Meetup page.

• If you are not a member signup here: https://www.meetup.com/QuantUniversity-Meetup/

Slides and code

Page 3: Missing data handling

- Analytics Advisory services- Custom training programs- Architecture assessments, advice and audits

Page 4: Missing data handling

• Founder of QuantUniversity LLC. and

www.analyticscertificate.com

• Advisory and Consultancy for Financial

Analytics

• Prior Experience at MathWorks, Citigroup

and Endeca and 25+ financial services and

energy customers

• Regular Columnist for the Wilmott

Magazine

• Author of forthcoming book

“Financial Modeling: A case study

approach” published by Wiley

• Charted Financial Analyst and Certified

Analytics Professional

• Teaches Analytics in the Babson College

MBA program and at Northeastern

University, Boston

Sri Krishnamurthy

Founder and CEO

4

Page 5: Missing data handling

5

Quantitative Analytics and Big Data Analytics Onboarding

• Trained more than 500 students in Quantitative methods, Data Science and Big Data Technologies using MATLAB, Python and R

• Launching the Analytics Certificate Program in 2016

Page 6: Missing data handling

(MATLAB version also available)

Page 7: Missing data handling
Page 8: Missing data handling

8

www.analyticscertificate.com/SparkWorkshop

Early bird pricing ending

today!

Page 9: Missing data handling

Definition

Assumptions

Work flow

Page 10: Missing data handling

What does a missing data problem look like?

Page 11: Missing data handling

Missing data

• Dealing with missing data, has been always a challenge in data analysis context.• We need methods in missing data analysis that:▫ Minimize the bias▫ Maximize use of available information, and▫ Get good estimates of uncertainty e.g., p-value, confidence interval, etc.

Rec No Variable n

1 Unit non-response Unobserved/Latent

variable

2

3 Missing data

4 Item-non-

response

Page 12: Missing data handling

Assumptions (MCAR, MAR, NMAR)• When values are missing completely at random (MCAR), the probability of

missingness is unrelated to the values of any variable

• Data are missing at random (MAR) if missingness is unrelated to the value that is missing, but is related to the values of other variables• E.g., a question about typical number of hours spent browsing the internet

might be missing more often for married than unmarried participants; BUT among the married subset, missingness is completely random—Not related to how many hours the person browses

• Data are missing not at random (MNAR) if missingness is related to the value that is missing, and often to the values of other variables as well• E.g. missing values are more prevalent among those who typically browse

more than among those who browse less

Page 13: Missing data handling
Page 14: Missing data handling

• Deletion methods : Delete cases or variables that are missing

▫ Listwise methods

▫ Pairwise deletion

▫ Variable deletion

• Imputation methods : Substitution methods

▫ Single imputation

Mean imputation

Conditional mean imputation

Case mean imputation

Regression imputation

Last observation carried forward

Worst case imputation

Best case imputation

EM imputation

▫ Multiple imputation

Methods of handling missing data

Page 15: Missing data handling

List wise deletion• A good method when the proportion of

missing data is less than 15%.

• Advantages:▫ It can be used for any type of statistical

analysis.

▫ No special computations are required.

▫ The parameters estimations areunbiased.

▫ The standard errors are appropriatecompare to original data.

• Disadvantages:▫ May remove a considerable fraction of

data

Page 16: Missing data handling

Pair wise deletion• Pairwise deletion involves dropping cases

with missing values on an analysis-by-analysis basis

• Advantages:▫ Using all available non-missing data

• Disadvantages:▫ Estimated standard errors and test

statistics are biased

Page 17: Missing data handling

Variable deletion• Variable deletion involves dropping

variables with missing values on an case -by-case basis

• Advantages:▫ Makes sense when lot of missing values in

a variable and if the variable is of relativelyless importance

• Disadvantages:▫ Loss of information regarding the variable

Page 18: Missing data handling

Mean imputation• Replace missing values with the mean of

that variable Case Var1 Var2 Var3

1 9 8 8

2 7.44 7 6

3 8 5 6

4 7 4 5

5 9 5 7

6 8 8 9

7 6 7 6

8 5 9 7

9 7 8 ?

10 8 8 7

Page 19: Missing data handling

Conditional Mean imputation• Replace missing values with value of the

variable mean for a relevant subgroupCase Var1 Sex Var2 Var3

1 9 F 8 8

2 8.25 F 7 6

3 8 F 5 6

4 7 F 4 5

5 9 F 5 7

6 8 M 8 9

7 6 M 7 6

8 5 M 9 7

9 7 M 8 ?

10 8 M 8 7

Page 20: Missing data handling

Case Mean imputation• Replace missing values using information

from other variables for the same case to impute the missing value

Case Var1 Var2 Var3

1 9 8 8

2 6.50 7 6

3 8 5 6

4 7 4 5

5 9 5 7

6 8 8 9

7 6 7 6

8 5 9 7

9 7 8 ?

10 8 8 7

Page 21: Missing data handling

Regression imputation• Replace missing values using information

from complete cases to “predict” the value of the missing data, based on a regression equation for cases with nonmissing values

Case Var1 Var2 Var3

1 9 8 8

2 6.32 7 6

3 8 5 6

4 7 4 5

5 9 5 7

6 8 8 9

7 6 7 6

8 5 9 7

9 7 8 ?

10 8 8 7

VAR1′ = 4.621 – (.734 * VAR2) + (1.139 * VAR3)

Page 22: Missing data handling

• Imputes the missing value as a value on the same outcome the most recent time it was observed

• Variants :

• Average of T1 and T2

Last observation carried forward

Case T1 T2 T3

1 9 8 8

2 ? 7 6

3 8 5 6

4 7 4 5

5 9 5 7

6 8 8 9

7 6 7 6

8 5 9 7

9 7 8 8

10 8 8 7

Page 23: Missing data handling

• Use interpolation to fill in missing values

• Useful for longitudinal datasets

Interpolation

Page 24: Missing data handling

• Worst case replaces a missing value with the worst case scenario for a categorical outcome

• Best case replaces a missing value with the best case scenario for a categorical outcome

Worst case and Best case imputation

Page 25: Missing data handling

• Substitute best missing values using a ML imputation

• In the E-step, expected values are calculated based on all complete data points

• In the M-step, the procedure imputes the expected values from the E-step and then maximizes the likelihood function to obtain new parameter estimates

Expectation-Maximization

Page 26: Missing data handling

• Multiple imputation is quickly becoming the “gold standard” approach to handling missing values

• Computationally complex

Multiple imputation

Page 27: Missing data handling

SummaryWe have covered Missing data

Introduction Missing data definition, assumptions and work flow

Deletion methods Listwise methods Pairwise deletion Variable deletion

Imputation methods Single imputation Mean imputation Conditional mean imputation Case mean imputation Regression imputation Last observation carried forward Worst case imputation Interpolation Best case imputation EM imputation

Multiple imputation

References

Page 28: Missing data handling
Page 29: Missing data handling

29

www.analyticscertificate.com/SparkWorkshop

Early bird pricing ending

today!

Answer this question on the Eventbrite site

promo code section and take an additional

25% off! Only today!

What package in Python do you need to

use DataFrame functionality?

Page 30: Missing data handling

Thank you!

Sri Krishnamurthy, CFA, CAPFounder and CEO

srikrishnamurthy

Contact

Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and

shall not be distributed or used in any other publication without the prior written consent of QuantUniversity LLC.