Copyright © [year] UK Data Service. Created by [Organisation], [Institution]
Analysing and Integrating Multiple
datasets Workshop
Dr Ana Morales
Prof Mark Elliot
Manchester 12-13 September 2019
Integrating and analysing multiple datasets
Feedback
We really value your feedback, please
complete the survey at:
www.ncrm.ac.uk/surveys/man/
Day 1
9:45-10:00 - Registration
10:00-11:00 - Introduction to Data Analysis: Exploratory data analysis and visualisations
11:00 - 12:00 - Practical session
12:00 - 13:00 - Lunch
13:00 - 13:45 - Linking data and data manipulation
13:45 -14:30 - Practical session
14:30 - 15:00 - Coffee
15:00 -15:45 - Handling missing data
15:45 - 16:45 - Practical session
16:45 - 17:00 - Summary
Day 2
9:00 - 9:30 - Introduction to the day
9:30 - 11:00 - Data Hack Session 1
11:00 - 11:30 - Coffee
11:30 - 13:00 - Data Hack Session 2
13:00 - 14:00 - Lunch
14:00 - 15:15 - Group presentations
15:15 - 15:30 - Summary
Housekeeping
Who am I?
What is the UK Data Service?
• A comprehensive resource
funded by the ESRC
• A single point of access to a
wide range of secondary
social science data
• Support, training and
guidance
ukdataservice.ac.uk
What data do we hold?
Surveys: Large-scale government funded UK and cross-national surveys
Longitudinal: Major UK surveys following individuals over time
International: Multi-nation aggregate databanks and survey data
Qualitative: Range of multimedia qualitative data sources
Census: Census data 1971 – 2011
Business: Microdata and administrative data
Other resources and support
Webinars & Workshops
• See events pages
Guides & Video tutorials
• Topic
• Dataset
• Methods and software
Helpdesk
• Individual support by e-mail
Teaching Resources
• Training and e-books
• Worksheets and teaching ideas
Introduction to data analysis and
visualisations
Ana Morales
Analysing and Integrating Multiple datasets
Workshop
Manchester, 12 and 13 September, 2019
What is data?
• Information, especially facts or numbers, collected to be examined and
considered and used to help decision-making, or information in an
electronic form that can be stored and used by a computer (Cambridge
dictionary)
• Numeric
• Images
• Attributes (characters)
World Happiness index
Where are data
coming from?
Survey
Administrative data
Bank transactions
Mobile data (apps)
How are data
stored?
Structured data
• Organised
• Tabular form
Unstructured data
• Text
• Images
• Audio files
The data Science process (1):
Ask an interesting question
A topic of interest:
Crime
Health inequalities
Pollution
Specific goal
1. Confidence in the Criminal
Justice System in England
2. Antisocial behaviour in
Manchester
What do you want to
predict or estimate?
Local indicators of
antisocial behaviours
The data Science process (2):
Get the data
Police recorded crime
data:
https://data.police.uk/data/
Coverage
• Date range
• Different Police unit
(spatial location)
Categories of crime
• Crime data
• Outcome data
• Stop and search data
Format
• Csv files
The data Science process (3):
Explore the data
What data do we have?
Variables
• Crime ID
• Date of crime
• Type of crime
• LSOA (geographic
location)
Type of data
• Numeric?
• Attribute (character)
• Spatial (coordinates)
Is it ready to analyse?
• Data cleaning
• Manipulation
Descriptive statistics
Central tendency measures
• Any correlations?
• Anomalies?
Plot the data
• Anomalies?
• Patterns?
More questions
• Are the data enough for my RQ?
• Do we need more data?
• Is there more data?
• Change RQs?
The data Science process (4):
Model the data
What is the best approach to understand the data we have? Depends on…
• Our research questions
• Our data available
Example: Hotspots of crime in Greater Manchester
The data Science process (5):
Communicate and visualise
the results
Visualise the results
Tables
Figures
Plots
Maps
Communicate the results
Know your audience
• Effective
• The right details for each
audience
• Academic ≠ Local
Government officers
Brexit
Referendum
Exploratory Data
Analysis
Exploratory data analysis
Flowchart for data preparation
https://r4ds.had.co.nz/explore-intro.html
From R for data science
Describe the data
To understand:
data availability,
types,
quality,
data complexity (e.g. non-linearity, need for transformation, etc.)
Guided by two types of questions (Grolemund and Wickham,
2016):
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
How to describe the data (1)
Distribution of numerical variables:
Extreme values (outliers)
Shape of the distribution
Missing cases
Unusual patterns
Distribution of categorical variables
Missing cases
Odd values
Unusual patterns
Most common values
How to describe the data (2)
Central tendency measures
Mean
median
mode
Measures of spread
Variance and standard deviation
Range:
Interquartile range (IQR)
Visualisations
Histograms, boxplots, bar plots, scatterplots
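The measures of spread listed above can be sketched in a few lines. This is a plain Python illustration (the workshop practicals use R), and the ages are invented for the example.

```python
import statistics

# Hypothetical ages, invented for illustration only.
ages = [15, 32, 38, 57, 41, 29, 35]

variance = statistics.variance(ages)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(ages)       # sample standard deviation
value_range = max(ages) - min(ages)    # range = largest value - smallest value

# With n=4, statistics.quantiles returns the three quartile cut points.
q1, q2, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1                          # interquartile range

print(variance, std_dev, value_range, iqr)
```

Note that different software packages use slightly different quartile definitions, so the IQR may not match R exactly on small samples.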
How to describe the data (3)
Mean (sample vs. population):
The "average" number; found by adding all data points and dividing by the number of
data points
Median
Middle value if odd number of values, or average of the middle two values otherwise
Mode
Value that occurs most frequently in the data
• Unimodal, bimodal, trimodal
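As a minimal sketch of the three definitions above, using Python's standard library and made-up scores (an even number of values, so the median is the average of the two middle ones):

```python
import statistics

# Hypothetical scores, invented for illustration only.
scores = [2, 3, 3, 5, 7, 8, 8, 9]

mean = statistics.mean(scores)          # sum of the values divided by their count
median = statistics.median(scores)      # average of the two middle values (5 and 7)
modes = statistics.multimode(scores)    # every value tied for most frequent

print(mean, median, modes)
```

Here two values are tied for most frequent, so this small example is bimodal.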
Is the mean always
the best central
tendency measure?
The problem with the mean
“There are two pieces of bread. You eat two.
I eat none. Average consumption: one bread
per person.”
Nicanor Parra, (Anti)Poet, Mathematician and
Physicist
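The bread example can be checked directly. The second part of this sketch uses invented incomes to show how one extreme value pulls the mean away from the typical case while the median barely moves.

```python
import statistics

# The bread example: two people, one eats both pieces.
bread_eaten = [2, 0]
mean_bread = statistics.mean(bread_eaten)   # an "average" that describes nobody

# Hypothetical weekly incomes with one extreme value (invented numbers).
incomes = [18, 20, 21, 22, 24, 250]
mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(mean_bread, mean_income, median_income)
```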
More about visualisations
Two types:
1. Exploring and getting to know the data
   1. Assess the data: decide what to do next
   2. Accurate
   3. Internal, never reach the wider audience
2. Communication
   1. Present data and ideas
   2. Accurate: provide evidence
   3. Easy to understand
   4. Effective
   5. It would depend on the audience
Images: https://r4ds.had.co.nz/exploratory-data-analysis.html
https://towardsdatascience.com/5-quick-and-easy-data-visualizations-in-python-with-code-a2284bae952f
Effective visualisations for communication
Simple but effective (don’t overdo it!)
Easy to understand
Use the right type of graph or figure
No single graph fits every purpose
Appropriate use of colours (consider colour-blind readers)
Know your audience
Effective visualisations for communication
https://extremepresentation.typepad.com/blog/files/choosing_a_good_chart.pdf
Effective visualisations for communication: Use
the right display
Trends over time
Lines
Scatterplots
Distributions
Density plots
Histograms
Correlations
Scatterplots
Comparisons:
Bars
lines
Proportions
Pie charts
Stacked charts
Bad practice in data visualisations
Lack of clear axis labelling
Truncation of an axis so it ends or starts in a misleading place
Numbers not adding up correctly
Visual deceptions
Inappropriate use of white space, or comparative elements that don’t seem to correspond to the data they represent
Poor use of a graph type
Making a series of bar graphs when a line graph would be clearer
Cherry-picking data,
Including limited years or responses chosen to support a specific argument
Bad graphic choices,
Low-contrast colour choices
distorting perspective
No title
No key for the information in the graph
Bad practice in data visualisations: Visual
deceptions
Bad practice in data visualisations: Numbers not
adding up correctly
Bad practice in data visualisations: Truncation of an axis
so it ends or starts in a misleading place
Figure: Delinquency. Households victim of a crime in the last 12 months (chart drawn with a truncated axis).
Figure: the same series redrawn with the axis running from 0 to 35: 30.7, 22.8 and 27.3 for 2010, 2013 and 2016.
Same values!
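A rough way to see why a truncated axis misleads is to compare the drawn bar heights. In this sketch the three values come from the slide, their pairing with the years follows reading order (an assumption), and the truncated axis start of 20 is hypothetical.

```python
# Values from the slide; the year pairing is assumed from reading order.
values = {2010: 30.7, 2013: 22.8, 2016: 27.3}

def drawn_height_ratio(a, b, axis_start):
    """Ratio of the drawn heights of two bars when the axis starts at axis_start."""
    return (a - axis_start) / (b - axis_start)

# Axis starting at zero: bar heights are proportional to the data.
honest = drawn_height_ratio(values[2010], values[2013], axis_start=0)

# Hypothetical truncated axis starting at 20: the same gap looks far larger.
truncated = drawn_height_ratio(values[2010], values[2013], axis_start=20)

print(round(honest, 2), round(truncated, 2))
```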
Data manipulation and
linking data
Ana Morales
Analysing and Integrating Multiple datasets
Workshop
Manchester, 12 -13 September, 2019
Data manipulation
Tidy Data
Data Transformation
Linking data
Tidying data
Issues around data: ‘Untidy’ data
Messy data (common causes of messiness)
Multiple variables stored in one column
Missing cases
Variables stored as both rows and columns
Duplicate data
Reasons for ‘untidy’ data
• Data organised and stored for purposes other than analysis
Administrative data
Customer data
People and data producers are not always aware of what a tidy dataset should
look like
No standard procedures guiding data entry and storage of variables
Inconsistent coding, e.g. data entry carried out by different people
Recall from the exploratory data analysis process:
Flowchart for data preparation
https://r4ds.had.co.nz/explore-intro.html
Tidy dataset (Grolemund and Wickham, 2016)
• Each variable must have its own column.
• Each observation must have its own row.
• Each value must have its own cell.
Recall of useful terms
• Variables: a quantity, quality, or
property that you can measure
• Categorical (name, gender, etc)
• Ordinal (low, medium and high)
• Continuous (dates, age, etc.)
• Observations: set of measurements
made under similar conditions
• An observation will contain several values,
each associated with a different variable.
• Values: the state of a variable when
you measure it. The value of a variable
may change from measurement to
measurement.
casenumber sex age SES
1 female 15 Low
2 female 32 Medium
3 male 38 Low
4 female 57 High
Tidy dataset: examples
What is the problem with this data?
• The column ‘type’ stores variable names rather than values.
• The values of ‘type’ (cases and population) are themselves variables, so each observation is spread over two rows.
Untidy:
country year type count
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
Tidy:
country year cases population
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
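The reshaping from the type/count layout to one row per country and year can be sketched in plain Python. This only illustrates the logic; it is not the R function used in the practical, and only the Afghanistan rows are included to keep it short.

```python
# Rows in the untidy layout: (country, year, type, count).
long_rows = [
    ("Afghanistan", 1999, "cases", 745),
    ("Afghanistan", 1999, "population", 19987071),
    ("Afghanistan", 2000, "cases", 2666),
    ("Afghanistan", 2000, "population", 20595360),
]

# One dictionary per (country, year); each 'type' value becomes a column.
tidy = {}
for country, year, kind, count in long_rows:
    row = tidy.setdefault((country, year), {"country": country, "year": year})
    row[kind] = count

tidy_rows = list(tidy.values())
print(tidy_rows)
```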
Tidy dataset: examples
What is the problem with this data?
• Two columns storing
one variable.
• The column names ‘1999’ and
‘2000’ are values of the
year variable
• The values in the ‘1999’ and
‘2000’ columns are values of the
variable cases for each
year and country.
country 1999 2000
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
country year cases
1 Afghanistan 1999 745
2 Brazil 1999 37737
3 China 1999 212258
4 Afghanistan 2000 2666
5 Brazil 2000 80488
6 China 2000 213766
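A minimal Python sketch of the same reshaping, turning the two year columns into a year variable and a cases variable. It mirrors the logic only and is not the R code from the practical.

```python
# The untidy layout: one column per year.
wide_rows = [
    {"country": "Afghanistan", "1999": 745, "2000": 2666},
    {"country": "Brazil", "1999": 37737, "2000": 80488},
    {"country": "China", "1999": 212258, "2000": 213766},
]
year_columns = ["1999", "2000"]

# Each (country, year column) pair becomes one tidy row.
tidy_rows = [
    {"country": row["country"], "year": int(year), "cases": row[year]}
    for year in year_columns
    for row in wide_rows
]

for row in tidy_rows:
    print(row)
```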
Tidy dataset: examples
What is the problem with this data?
• The column ‘rate’ stores two variables,
written as a division:
cases and population
• We need two new
variables:
• Cases
• Population
country year rate
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
country year cases population
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
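Splitting the combined column can be sketched as follows. The helper name `separate_rate` is hypothetical, and the example is plain Python rather than the R function used in the workshop.

```python
rows = [
    {"country": "Afghanistan", "year": 1999, "rate": "745/19987071"},
    {"country": "Brazil", "year": 1999, "rate": "37737/172006362"},
]

def separate_rate(row):
    """Split the 'cases/population' string into two integer variables."""
    cases, population = row["rate"].split("/")
    return {
        "country": row["country"],
        "year": row["year"],
        "cases": int(cases),
        "population": int(population),
    }

tidy_rows = [separate_rate(row) for row in rows]
print(tidy_rows)
```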
Is ‘untidy’ data really a bad thing?
There are some cases where a tidy dataset, in the form explained before,
might not be the best format
Some analyses in other software packages might be optimised for different
formats
Some analyses based on different R packages might require other formats
Problem forward, not solution backward
https://simplystatistics.org/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward/
Data transformation
Recall from the exploratory data analysis process:
Flowchart for data preparation
https://r4ds.had.co.nz/explore-intro.html
Overview
• Data is rarely in the format that we need
• New variables
• Edit names
• Summarise
• Select subsamples
• Cases
• Variables
• Reshape (as seen in ‘tidy data’)
Subsetting: Making big datasets small
• Why subset?
• Useful when we have a lot of data, but not all of it is relevant
• Big datasets (rows and columns) are more difficult to manage
• They may slow down the analysis (more computing power needed)
• Example:
• Police data for Greater Manchester
• Different types of offences, but we only want to focus on burglary in Bolton
• What subset?
• Variables (columns): exclude what we don’t need, e.g. Lat and Lon if not mapping
• Cases (rows): Only selected cases, e.g. those in Bolton
Select: Pick variables by their names
• Database of 50 most rated films
Movie Name Rating Years Released Duration
The Shawshank Redemption 9.3 -1994 142 min
The Godfather 9.2 -1972 175 min
The Dark Knight 9 -2008 152 min
The Godfather: Part II 9 -1974 202 min
Pulp Fiction 8.9 -1994 154 min
The Lord of the Rings: The Return of the King 8.9 -2003 201 min
Schindler's List 8.9 -1993 195 min
12 Angry Men 8.9 -1957 96 min
Il buono, il brutto, il cattivo 8.9 -1966 161 min
The Lord of the Rings: The Fellowship of the Ring 8.8 -2001 178 min
Inception 8.8 -2010 148 min
Movie Name Rating
The Shawshank Redemption 9.3
The Godfather 9.2
The Dark Knight 9
The Godfather: Part II 9
Pulp Fiction 8.9
The Lord of the Rings: The Return of the King 8.9
Schindler's List 8.9
12 Angry Men 8.9
Il buono, il brutto, il cattivo 8.9
The Lord of the Rings: The Fellowship of the Ring 8.8
Inception 8.8
Pick based on:
• Variable names
• Contains (part of variable name)
• Ends (with)
• Starts (with)
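The four ways of picking variables can be sketched with a small helper. `select` here is a hypothetical Python function written for this illustration, and the film rows use cleaned column values (year and duration as plain numbers), which is an assumption about how the data would look after cleaning.

```python
films = [
    {"Movie Name": "The Shawshank Redemption", "Rating": 9.3, "Year Released": 1994, "Duration": 142},
    {"Movie Name": "The Godfather", "Rating": 9.2, "Year Released": 1972, "Duration": 175},
]

def select(rows, keep):
    """Keep only the columns whose name satisfies the predicate `keep`."""
    return [{name: value for name, value in row.items() if keep(name)} for row in rows]

by_name = select(films, lambda name: name in ("Movie Name", "Rating"))
starts = select(films, lambda name: name.startswith("Movie"))
ends = select(films, lambda name: name.endswith("Released"))
contains = select(films, lambda name: "at" in name)   # matches Rating and Duration

print(by_name)
```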
Filter: Pick observations based on their values
• Database of 50 most rated films
Movie Name Rating Years Released Duration
The Shawshank Redemption 9.3 -1994 142 min
The Godfather 9.2 -1972 175 min
The Dark Knight 9 -2008 152 min
The Godfather: Part II 9 -1974 202 min
Pulp Fiction 8.9 -1994 154 min
The Lord of the Rings: The Return of the King 8.9 -2003 201 min
Schindler's List 8.9 -1993 195 min
12 Angry Men 8.9 -1957 96 min
Il buono, il brutto, il cattivo 8.9 -1966 161 min
The Lord of the Rings: The Fellowship of the Ring 8.8 -2001 178 min
Inception 8.8 -2010 148 min
Filter:
• List of movies released before 2000
• Based on values of any variable
• Variables can be of any type
(numeric, character, etc.)
• Based on multiple conditions
• >, <, <=, >=, ==
• Boolean operators
Movie Name Rating Years Released Duration
The Shawshank Redemption 9.3 -1994 142 min
The Godfather 9.2 -1972 175 min
The Godfather: Part II 9 -1974 202 min
Pulp Fiction 8.9 -1994 154 min
Schindler's List 8.9 -1993 195 min
12 Angry Men 8.9 -1957 96 min
Il buono, il brutto, il cattivo 8.9 -1966 161 min
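Filtering rows on their values, including a combined condition, might look like this in plain Python. The result table above keeps every film released before 2000, so that is the cut-off used here; the rating threshold in the second filter is an invented example of a multiple condition.

```python
# (title, rating, year released) for a few of the films in the table.
films = [
    ("The Shawshank Redemption", 9.3, 1994),
    ("The Godfather", 9.2, 1972),
    ("The Dark Knight", 9.0, 2008),
    ("Pulp Fiction", 8.9, 1994),
    ("12 Angry Men", 8.9, 1957),
    ("Inception", 8.8, 2010),
]

# Single condition: released before 2000.
before_2000 = [film for film in films if film[2] < 2000]

# Multiple conditions joined with a Boolean operator (hypothetical threshold).
old_and_top = [film for film in films if film[2] < 2000 and film[1] >= 9.0]

print([title for title, _, _ in before_2000])
```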
Mutate: Creating new variables as a function of
existing columns.
country year rate
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
Using two functions:
• separate cases and
population into two
variables
• Create a new variable ‘rate’
based on the values of
cases and population
country year cases population rate
Afghanistan 1999 745 19987071 0.00003727
Afghanistan 2000 2666 20595360 0.000129447
Brazil 1999 37737 172006362 0.000219393
Brazil 2000 80488 174504898 0.000461236
China 1999 212258 1272915272 0.00016675
China 2000 213766 1280428583 0.000166949
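Creating the new variable is a one-line calculation per row. A minimal Python sketch, using two rows from the table above:

```python
rows = [
    {"country": "Afghanistan", "year": 1999, "cases": 745, "population": 19987071},
    {"country": "Brazil", "year": 2000, "cases": 80488, "population": 174504898},
]

# Add 'rate' as a function of two existing columns, leaving the originals in place.
with_rate = [{**row, "rate": row["cases"] / row["population"]} for row in rows]

for row in with_rate:
    print(row["country"], row["year"], row["rate"])
```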
Summarise: Creating aggregated variables
Useful to create summaries of
variables
In R, the function summarise
is most useful when combined
with other functions, such as group_by
In this example, we grouped
observations by country and
created a new variable
‘mean_cases’ that represents
mean cases per country
country year cases population rate
Afghanistan 1999 745 19987071 0.00003727
Afghanistan 2000 2666 20595360 0.000129447
Brazil 1999 37737 172006362 0.000219393
Brazil 2000 80488 174504898 0.000461236
China 1999 212258 1272915272 0.00016675
China 2000 213766 1280428583 0.000166949
country mean_cases
Afghanistan 1705.5
Brazil 59112.5
China 213012
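The group-then-summarise pattern can be sketched in plain Python as follows; it reproduces the mean_cases table above and is an illustration of the logic rather than the R code.

```python
from collections import defaultdict
from statistics import mean

rows = [
    ("Afghanistan", 1999, 745), ("Afghanistan", 2000, 2666),
    ("Brazil", 1999, 37737), ("Brazil", 2000, 80488),
    ("China", 1999, 212258), ("China", 2000, 213766),
]

# Step 1: group the observations by country.
cases_by_country = defaultdict(list)
for country, year, cases in rows:
    cases_by_country[country].append(cases)

# Step 2: summarise each group with one aggregated value.
mean_cases = {country: mean(values) for country, values in cases_by_country.items()}
print(mean_cases)
```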
Linking data: working
with relational datasets
Relational data
Typically, more than a single data table or dataset to work with
Need to combine them to answer the questions of your project
Multiple datasets with relations in common are called: relational data
Example: We want to calculate the arrest rate of burglary per 100,000
inhabitants in Bolton using open Police Data:
Relational data
• Crime rate for area:
• We are missing the total population
• Our data for Bolton are made up of several LSOAs
• We need total population estimates at LSOA level
• ONS mid-year population statistics
$$\text{Crime rate} = \frac{N_{\text{crimes}}}{\text{Total pop}} \times 100000$$
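As a worked sketch of the formula, assume two LSOAs with invented burglary counts and populations. The codes `LSOA_A` and `LSOA_B` and all the numbers are hypothetical; in practice the counts would come from the police data and the populations from the ONS estimates.

```python
# Hypothetical counts and populations for two LSOAs (invented numbers).
burglaries = {"LSOA_A": 12, "LSOA_B": 30}
population = {"LSOA_A": 1500, "LSOA_B": 2000}

def crime_rate(n_crimes, total_pop, per=100_000):
    """Number of crimes per `per` inhabitants."""
    return n_crimes * per / total_pop

# Rate for each LSOA: the two sources are combined through the shared LSOA code.
rates = {code: crime_rate(burglaries[code], population[code]) for code in burglaries}

# Rate for the whole area: sum the counts and the populations first.
area_rate = crime_rate(sum(burglaries.values()), sum(population.values()))

print(rates, area_rate)
```

The area rate is computed from the totals, not by averaging the LSOA rates, because the LSOAs have different populations.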
Linking data:
The formula takes the following form:
f(x,y)
Where f is the function
x and y are the datasets
- We need at least one common variable that identifies each case in both
datasets.
- The datasets do not necessarily have the same number of cases,
- but there have to be some cases in common
- They might have repeated variables (the join may duplicate them!), but this is not a
problem
- They may have the same variable under a different name
Types of join: Mutating joins
Combine variables from two sets of data
Observations are matched, and variables are copied from one dataset
to the other
New variables are added to the right of the x (reference) dataset
• Inner join: keeps only matched observation that are in both datasets (AND)
• Left join: keeps all observations of dataset x
• Right join: keeps all observations of dataset y
• full join: Keeps all observation in both datasets (OR)
Types of Join: Inner join
Inner join: matches pairs of observations (cases) by the key variable/s, preserving only
those available in both datasets
It is the simplest type of join
Not always convenient, as we might lose too many observations
Data 1
id sex occupation
102 male baker
103 female banker
104 female teacher
105 male teacher
Data 2
id age
101 25
102 34
104 51
105 38
id sex occupation age
102 male baker 34
104 female teacher 51
105 male teacher 38
Combined data
Code in R: inner_join(data1, data2, by = "id")
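To make the matching rule explicit, here is a plain Python sketch of the same inner join. It is not how dplyr implements the join, and it assumes each id appears only once in Data 2.

```python
data1 = [
    {"id": 102, "sex": "male", "occupation": "baker"},
    {"id": 103, "sex": "female", "occupation": "banker"},
    {"id": 104, "sex": "female", "occupation": "teacher"},
    {"id": 105, "sex": "male", "occupation": "teacher"},
]
data2 = [
    {"id": 101, "age": 25},
    {"id": 102, "age": 34},
    {"id": 104, "age": 51},
    {"id": 105, "age": 38},
]

# Index the second dataset by the key variable (assumes unique ids).
y_by_id = {row["id"]: row for row in data2}

# Keep a row of data1 only if its id is also present in data2.
inner = [{**x, **y_by_id[x["id"]]} for x in data1 if x["id"] in y_by_id]

for row in inner:
    print(row)
```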
Types of Join: left join
Left join: matches pairs of observations (cases) by the key variable/s, preserving all
observations in the x (first) dataset
Returns all rows from x, and all columns from x and y
Rows in x with no match in y will have missing values (NA) in the columns imported from y
Data 1
id sex occupation
102 male baker
103 female banker
104 female teacher
105 male teacher
Data 2
id age
101 25
102 34
104 51
105 38
Combined data
Code in R: left_join(data1, data2, by = "id")
id sex occupation age
102 male baker 34
103 female banker NA
104 female teacher 51
105 male teacher 38
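The same left join written out in plain Python, as a sketch of the logic only. `None` stands in for R's NA, and ids are assumed unique in Data 2.

```python
data1 = [
    {"id": 102, "sex": "male", "occupation": "baker"},
    {"id": 103, "sex": "female", "occupation": "banker"},
    {"id": 104, "sex": "female", "occupation": "teacher"},
    {"id": 105, "sex": "male", "occupation": "teacher"},
]
data2 = [
    {"id": 101, "age": 25},
    {"id": 102, "age": 34},
    {"id": 104, "age": 51},
    {"id": 105, "age": 38},
]

y_by_id = {row["id"]: row for row in data2}
y_columns = ["age"]   # columns imported from data2

left = []
for x in data1:                      # every row of x is kept
    match = y_by_id.get(x["id"])     # None when there is no match in y
    extra = {col: (match[col] if match else None) for col in y_columns}
    left.append({**x, **extra})

for row in left:
    print(row)
```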
Types of Join: Right join
Right join
Returns all rows from y, and all columns from x and y
Rows in y with unmatched rows in x will have missing values (NA) for the
columns contained in x.
Data 1
id sex occupation
102 male baker
103 female banker
104 female teacher
105 male teacher
Data 2
id age
101 25
102 34
104 51
105 38
Combined data
Code in R: right_join(data1, data2, by = "id")
id sex occupation age
101 NA NA 25
102 male baker 34
104 female teacher 51
105 male teacher 38
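A right join mirrors the left join with the roles of x and y swapped. In this plain Python sketch `None` plays the role of NA, ids are assumed unique in Data 1, and rows come out in the order of Data 2.

```python
data1 = [
    {"id": 102, "sex": "male", "occupation": "baker"},
    {"id": 103, "sex": "female", "occupation": "banker"},
    {"id": 104, "sex": "female", "occupation": "teacher"},
    {"id": 105, "sex": "male", "occupation": "teacher"},
]
data2 = [
    {"id": 101, "age": 25},
    {"id": 102, "age": 34},
    {"id": 104, "age": 51},
    {"id": 105, "age": 38},
]

x_by_id = {row["id"]: row for row in data1}
x_columns = ["sex", "occupation"]   # columns imported from data1

right = []
for y in data2:                      # every row of y is kept
    match = x_by_id.get(y["id"])     # None when there is no match in x
    row = {"id": y["id"]}
    row.update({col: (match[col] if match else None) for col in x_columns})
    row["age"] = y["age"]
    right.append(row)

for row in right:
    print(row)
```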
Types of Join: full join
full join
Returns all rows and all columns from both x and y
Where there are unmatched values, returns NA for the ones missing
Data 1
id sex occupation
102 male baker
103 female banker
104 female teacher
105 male teacher
Data 2
id age
101 25
102 34
104 51
105 38
Combined data
Code in R: full_join(data1, data2, by = "id")
id sex occupation age
101 NA NA 25
102 male baker 34
103 female banker NA
104 female teacher 51
105 male teacher 38
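A full join keeps the union of the ids. The sketch below is plain Python with `None` in place of NA; sorting the ids is a choice made here to match the table above, not something a join has to do.

```python
data1 = [
    {"id": 102, "sex": "male", "occupation": "baker"},
    {"id": 103, "sex": "female", "occupation": "banker"},
    {"id": 104, "sex": "female", "occupation": "teacher"},
    {"id": 105, "sex": "male", "occupation": "teacher"},
]
data2 = [
    {"id": 101, "age": 25},
    {"id": 102, "age": 34},
    {"id": 104, "age": 51},
    {"id": 105, "age": 38},
]

x_by_id = {row["id"]: row for row in data1}
y_by_id = {row["id"]: row for row in data2}

# Union of the keys: an id needs to appear in only one dataset (OR).
all_ids = sorted(set(x_by_id) | set(y_by_id))

full = [
    {
        "id": key,
        "sex": x_by_id.get(key, {}).get("sex"),
        "occupation": x_by_id.get(key, {}).get("occupation"),
        "age": y_by_id.get(key, {}).get("age"),
    }
    for key in all_ids
]

for row in full:
    print(row)
```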
Types of join: Filtering joins
Match observations between two datasets, but affect the
observations, not the variables.
They do not add new variables!
• semi join: keeps all rows from x where there are matched values in y,
keeping just columns from x
• anti join: keeps only observations of dataset x that are NOT in y
Types of Join: semi join
semi join
Returns all rows from x where there are matching values in y
Similar to inner join, but only keeps columns from x
Unlike an inner join, it never duplicates rows of x: an observation with several matches in y is returned only once
Data 1
id sex occupation
102 male baker
103 female banker
104 female teacher
105 male teacher
Data 2
id age
101 25
102 34
104 51
105 38
Combined data
Code in R: semi_join(data1, data2, by = "id")
id sex occupation
102 male baker
104 female teacher
105 male teacher
Types of Join: anti join
anti join
keeps the rows in x that don’t have a match in y
Keeps only variables from dataset x
Data 1
id sex occupation
102 male baker
103 female banker
104 female teacher
105 male teacher
Data 2
id age
101 25
102 34
104 51
105 38
Combined data
Code in R: anti_join(data1, data2, by = "id")
id sex occupation
103 female banker
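Both filtering joins reduce to a membership test on the key, which this plain Python sketch makes explicit. It illustrates the logic and is not the dplyr implementation.

```python
data1 = [
    {"id": 102, "sex": "male", "occupation": "baker"},
    {"id": 103, "sex": "female", "occupation": "banker"},
    {"id": 104, "sex": "female", "occupation": "teacher"},
    {"id": 105, "sex": "male", "occupation": "teacher"},
]
data2 = [
    {"id": 101, "age": 25},
    {"id": 102, "age": 34},
    {"id": 104, "age": 51},
    {"id": 105, "age": 38},
]

ids_in_y = {row["id"] for row in data2}

# Semi join: rows of x with a match in y; no columns from y are added.
semi = [x for x in data1 if x["id"] in ids_in_y]

# Anti join: rows of x with no match in y.
anti = [x for x in data1 if x["id"] not in ids_in_y]

print(semi)
print(anti)
```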
Issues around linking data
• Identify key variables
• Check that there are no missing values in the key variable. Comparing the number of rows in both
datasets is not sufficient… be aware of duplicates!
• Check that there are no spelling mistakes (entry errors) in the key variable. An anti join can
be useful
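These checks can be sketched on the key variable alone before any join is run. The key values below are hypothetical and deliberately include a missing key, a duplicate and a misspelling.

```python
from collections import Counter

# Hypothetical key values (invented), with typical problems built in.
x_keys = ["Bolton 001A", "Bolton 001B", "Bolton 001B", None, "Boltn 002A"]
y_keys = ["Bolton 001A", "Bolton 001B", "Bolton 002A"]

# 1. Missing values in the key variable.
n_missing = sum(key is None for key in x_keys)

# 2. Duplicated keys: row counts alone would not reveal these.
key_counts = Counter(key for key in x_keys if key is not None)
duplicates = [key for key, n in key_counts.items() if n > 1]

# 3. Keys in x with no match in y (the anti join idea), e.g. spelling mistakes.
unmatched = sorted({key for key in x_keys if key is not None} - set(y_keys))

print(n_missing, duplicates, unmatched)
```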
Missing Data
Ana Morales
Analysing and Integrating Multiple datasets
Workshop
Manchester, 12 -13 September, 2019
Overview
• Causes, patterns, and mechanisms of missing data
• Problems caused by missing data
• Dealing with missing data
What are missing data
A data value that is not stored (missing) for a variable in the observation of
interest.
• Systematic missing: Certain types of people didn’t want to, couldn’t, or
preferred not to answer certain types of questions
• Values can be explicitly missing,
• Different software conventions, e.g. ‘NA’ in R; codes such as 9, 99 or -9 in SPSS
• Values can be missing without this being made explicit, e.g. empty cells
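Before analysis, the different conventions are usually recoded to a single missing marker. A minimal Python sketch, assuming the values arrive as text and that the codebook for this variable (hypothetical here) lists which codes mean missing:

```python
# Raw answers as read from a file (invented values).
raw_ages = ["34", "", "NA", "-9", "51", "99"]

# Hypothetical codebook entry: codes that mean "missing" for this variable.
# A code such as 99 can be a valid value for another variable, so the set
# should be defined per variable rather than globally.
age_missing_codes = {"", "NA", "-9", "99"}

# Recode every missing convention to None; convert the rest to numbers.
ages = [None if value in age_missing_codes else int(value) for value in raw_ages]

print(ages)
```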
Causes
• Unit non-response
- attrition (drop-out)
- wave non-response
• Item non-response
• Data entry problems
• Random error: Someone forgot to write down a number
Missing mechanisms: Missing Completely at
Random (MCAR)
Probability of missing data on a variable Y is unrelated to other measured
variables and is unrelated to the values of Y itself. Missing not related to any
variable
Example:
Blood measurements were lost due to equipment malfunction in the laboratory.
Missing mechanisms: Missing at Random (MAR)
The probability that a variable is missing depends only on available information
Example:
A child did not take an assessment because he was ill and we know that he
was ill from other variables (he was truly ill).
We know the reason for the missingness, and we know that the missingness is
not related to the score that the child would have obtained had he not been ill
Missing mechanisms: Missing not at Random
(MNAR)
The probability of missing data on a variable is specifically related to the values
of what is missing itself
Example:
a person does not attend a drug test because the person took drugs the night
before
Problems with missing data
• Selection bias:
• cases with complete information might be different from those cases with missing values
• Precision:
• We might find associations where there are none, or the other way round
• Loss of observations
Dealing with missing data
• Missing data methods that discard data
• Complete case analysis: Listwise
• Pairwise
• Nonresponse weighting
• Missing data method that retains all the data
• Imputation based
• Mode, mean, median imputation
• Regression methods
• Multiple imputation
Missing data methods that discard data
• Complete case analysis: Listwise
Removes all data for an observation that has one or more missing values.
Pros:
Easy to implement
Unbiased under MCAR
Cons:
Loss of data leads to a loss of precision
Biased results for MAR and MNAR
Remaining observations might not be representative of the full sample
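Listwise deletion is simple to express: drop any observation with at least one missing value. A plain Python sketch with invented rows, where `None` marks a missing value:

```python
# Invented rows; None marks a missing value.
rows = [
    {"id": 1, "age": 25, "income": 21000},
    {"id": 2, "age": None, "income": 30000},
    {"id": 3, "age": 51, "income": None},
    {"id": 4, "age": 38, "income": 27000},
]

# Keep only observations with no missing value in any variable.
complete_cases = [row for row in rows if all(v is not None for v in row.values())]

print(len(rows), len(complete_cases))
```

Half of the observations are lost here even though only two of the eight measured values are missing.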
Missing data methods that discard data
• Pairwise deletion
Analyses all cases in which the variables of interest are present by pairs
Pros:
Easy to implement
Unbiased under MCAR
Cons:
Different observations contributing to different parts of the model
Loss of data leads to a loss of precision
Biased results for MAR and MNAR
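To illustrate how different observations feed different parts of the analysis, this sketch computes pairwise correlations with a hand-written Pearson function. The data are invented and the helper `pearson` is written for this example.

```python
from itertools import combinations
from statistics import mean

# Invented rows; None marks a missing value.
rows = [
    {"age": 25, "income": 21, "score": 4},
    {"age": None, "income": 30, "score": 7},
    {"age": 51, "income": None, "score": 6},
    {"age": 38, "income": 27, "score": None},
    {"age": 44, "income": 33, "score": 9},
]

def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

pairwise = {}
for a, b in combinations(["age", "income", "score"], 2):
    # Use every row where BOTH variables of this pair are present.
    usable = [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]
    xs, ys = zip(*usable)
    pairwise[(a, b)] = {"n": len(usable), "r": pearson(xs, ys)}

# For comparison: listwise deletion would keep only fully observed rows.
n_listwise = sum(all(v is not None for v in r.values()) for r in rows)

print(pairwise, n_listwise)
```

Each pair uses three rows, but not the same three, while listwise deletion would leave only two.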
Missing data methods that discard data
• Non-response weighting
A model that predicts non-response in a variable.
The inverse of the predicted probability of response is used as a survey weight
Pros:
Same sample size as complete case analysis, but the weights make the complete cases more representative
Relatively easy to implement.
Cons:
Still a small sample
Increasingly problematic with more than one variable missing
Increases variance
Biased under MNAR
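A very simple stand-in for the response model is the response rate within groups (weighting classes). The sketch below uses that simplification, with invented data; a real application would usually fit a model such as a logistic regression instead.

```python
from collections import defaultdict

# Invented sample: y is only observed for respondents.
sample = [
    {"group": "young", "responded": True, "y": 10},
    {"group": "young", "responded": False, "y": None},
    {"group": "young", "responded": False, "y": None},
    {"group": "young", "responded": True, "y": 12},
    {"group": "old", "responded": True, "y": 20},
    {"group": "old", "responded": True, "y": 22},
    {"group": "old", "responded": True, "y": 24},
    {"group": "old", "responded": False, "y": None},
]

# Estimate the probability of response within each group.
counts = defaultdict(lambda: [0, 0])          # group -> [respondents, total]
for person in sample:
    counts[person["group"]][0] += person["responded"]
    counts[person["group"]][1] += 1
response_prob = {group: r / n for group, (r, n) in counts.items()}

# Each respondent is weighted by the inverse of their response probability.
respondents = [p for p in sample if p["responded"]]
weights = [1 / response_prob[p["group"]] for p in respondents]

weighted_mean = sum(w * p["y"] for w, p in zip(weights, respondents)) / sum(weights)
unweighted_mean = sum(p["y"] for p in respondents) / len(respondents)

print(unweighted_mean, weighted_mean)
```

The group that responded less often counts for more, which pulls the estimate back towards the composition of the full sample.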
Missing data method that retains all the data
• Mode, mean, median imputation
Computes the overall mean, median or mode and uses it to replace the
missing values
Pros:
Easy to implement
Preserve sample size (no data loss)
Cons:
Reduce variability of the data
Biased estimates especially when MAR and MNAR
Not recommended
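The loss of variability is easy to demonstrate. In this sketch with invented incomes, the mean is unchanged after imputation but the standard deviation shrinks.

```python
import statistics

# Invented incomes; None marks a missing value.
incomes = [18, 22, None, 30, None, 26]

observed = [v for v in incomes if v is not None]
fill_value = statistics.mean(observed)

# Replace every missing value with the mean of the observed values.
imputed = [fill_value if v is None else v for v in incomes]

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(imputed)

print(fill_value, sd_observed, sd_imputed)
```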
Missing data method that retains all the data
• Regression methods
Existing variables are used to make a prediction, and then the predicted value is used to replace the missing values
Create model using auxiliary variables (the same that would form part of the model)
Pros:
Easy to implement
Preserve sample size (no data loss)
Cons:
Reduce variability of the data
As the replaced values were predicted from other variables they tend to fit together “too well”
Not recommended
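A minimal sketch of regression imputation with one predictor, fitting the line by ordinary least squares on the complete cases. The data are invented; note that every imputed value sits exactly on the fitted line, which is the "fits too well" problem.

```python
from statistics import mean

# Invented (age, income) pairs; None marks a missing income.
rows = [(25, 21.0), (32, 24.0), (38, None), (44, 30.0), (51, 33.0), (57, None)]

# Fit income = intercept + slope * age on the complete cases only.
observed = [(age, income) for age, income in rows if income is not None]
xs = [age for age, _ in observed]
ys = [income for _, income in observed]
mx, my = mean(xs), mean(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# Replace each missing income with its predicted value.
imputed = [
    (age, income if income is not None else intercept + slope * age)
    for age, income in rows
]

print(slope, intercept, imputed)
```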
Missing data method that retains all the data
Multiple imputation
• Create model using auxiliary variables (the same that would form part of the model)
• Missing values are replaced with a set of plausible values which contain the natural variability and uncertainty of the “true” values.
• Create multiple datasets with different plausible imputed values
• Analyse each dataset and pool the results
Pros:
• Produces unbiased estimates under MCAR and MAR and sometimes under MNAR
• Improves validity
• Increases precision and power
Cons:
Computationally demanding
It needs a good imputation model (good predictors)
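The pooling step is commonly done with Rubin's rules. The sketch below shows only that step: the five estimates and their variances are invented and stand for the result of running the same analysis on five imputed datasets.

```python
from statistics import mean, variance

# Hypothetical results from m = 5 imputed datasets (invented numbers):
# one estimate and one squared standard error per dataset.
estimates = [10.2, 9.8, 10.5, 10.1, 9.9]
within_variances = [0.40, 0.42, 0.39, 0.41, 0.43]

m = len(estimates)
pooled_estimate = mean(estimates)            # average of the m estimates
within = mean(within_variances)              # average within-imputation variance
between = variance(estimates)                # variance of the estimates across imputations
total_variance = within + (1 + 1 / m) * between
pooled_se = total_variance ** 0.5

print(pooled_estimate, pooled_se)
```

The between-imputation term is what carries the extra uncertainty due to the missing values; single imputation methods leave it out.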
Considerations
• Ignoring missing data leads to selection bias and loss of power
• Many statistical methods to deal with missing data
• Poor missing value imputation might destroy the correlations between your
variables - BIAS
• There is no “one size fits all” answer
• Inspect the data
• Establish the missing data mechanism
• Deal with missing data accordingly
• Make explicit the implications of the chosen method to deal with missing cases
• If using an imputation method, make it explicit.