Integrating and analysing multiple datasets

Copyright © [year] UK Data Service. Created by [Organisation], [Institution]. Analysing and Integrating Multiple Datasets Workshop. Dr Ana Morales, Prof Mark Elliot. Manchester, 12-13 September 2019.

Transcript of Integrating and analysing multiple datasets

Page 1:

Analysing and Integrating Multiple Datasets Workshop

Dr Ana Morales

Prof Mark Elliot

Manchester 12-13 September 2019

Page 2:

Integrating and analysing multiple datasets

Feedback

We really value your feedback. Please complete the survey at: www.ncrm.ac.uk/surveys/man/

Day 1

9:45 - 10:00 - Registration

10:00 - 11:00 - Introduction to Data Analysis: Exploratory data analysis and visualisations

11:00 - 12:00 - Practical session

12:00 - 13:00 - Lunch

13:00 - 13:45 - Linking data and data manipulation

13:45 - 14:30 - Practical session

14:30 - 15:00 - Coffee

15:00 - 15:45 - Handling missing data

15:45 - 16:45 - Practical session

16:45 - 17:00 - Summary

Day 2

9:00 - 9:30 - Introduction to the day

9:30 - 11:00 - Data Hack Session 1

11:00 - 11:30 - Coffee

11:30 - 13:00 - Data Hack Session 2

13:00 - 14:00 - Lunch

14:00 - 15:15 - Group presentations

15:15 - 15:30 - Summary

Page 3:

Today's schedule

• Day 1

• 9:45 - 10:00 - Registration

• 10:00 - 11:00 - Introduction to Data Analysis: Exploratory data analysis and visualisations

• 11:00 - 12:00 - Practical session

• 12:00 - 13:00 - Lunch

• 13:00 - 13:45 - Linking data and data manipulation

• 13:45 - 14:30 - Practical session

• 14:30 - 15:00 - Coffee

• 15:00 - 15:45 - Handling missing data

• 15:45 - 16:45 - Practical session

• 16:45 - 17:00 - Summary

Page 4:

Housekeeping

Page 5:

Who am I?

Page 6:

What is the UK Data Service?

• A comprehensive resource funded by the ESRC

• A single point of access to a wide range of secondary social science data

• Support, training and guidance

Page 7:

ukdataservice.ac.uk

Page 8:

What data do we hold?

Surveys: Large-scale government funded UK and cross-national surveys

Longitudinal: Major UK surveys following individuals over time

International: Multi-nation aggregate databanks and survey data

Qualitative: Range of multimedia qualitative data sources

Census: Census data 1971 – 2011

Business: Microdata and administrative data

Page 9:

Other resources and support

Webinars & Workshops

• See events pages

Guides & Video tutorials

• Topic

• Dataset

• Methods and software

Helpdesk

• Individual support by e-mail

Teaching Resources

• Training and e-books

• Worksheets and teaching ideas

Page 10:

Introduction to data analysis and visualisations

Ana Morales

Analysing and Integrating Multiple Datasets Workshop

Manchester, 12-13 September 2019

Page 11:

What is data?

• Information, especially facts or numbers, collected to be examined and considered and used to help decision-making, or information in an electronic form that can be stored and used by a computer (Cambridge Dictionary)

• Numeric

• Images

• Attributes (characters)

Page 12:

World Happiness index

Page 13:

Where are data coming from?

Survey

Administrative data

Bank transactions

Mobile data (apps)

How are data stored?

Structured data

• Organised

• Tabular form

Unstructured data

• Text

• Images

• Audio files

Page 14:

Page 15:

The data science process (1):

Ask an interesting question

A topic of interest:

Crime

Health inequalities

Pollution

Specific goal

1. Confidence in the Criminal Justice System in England

2. Antisocial behaviour in Manchester

What do you want to predict or estimate?

Local indicators of antisocial behaviour

Page 16:

The data science process (2):

Get the data

Police recorded crime data:

https://data.police.uk/data/

Page 17:

The data science process (2):

Get the data

Police recorded crime data:

https://data.police.uk/data/

Coverage

• Date range

• Different police units (spatial location)

Categories of crime

• Crime data

• Outcome data

• Stop and search data

Format

• CSV files

Page 18:

The data science process (3):

Explore the data

What data do we have?

Variables

• Crime ID

• Date of crime

• Type of crime

• LSOA (geographic location)

Type of data

• Numeric?

• Attribute (character)

• Spatial (coordinates)

Is it ready to analyse?

• Data cleaning

• Manipulation

Page 19:

The data science process (3):

Explore the data

What data do we have?

Variables

• Crime ID

• Date of crime

• Type of crime

• LSOA (geographic location)

Type of data

• Numeric?

• Attribute (character)

• Spatial (coordinates)

Is it ready to analyse?

• Data cleaning

• Manipulation

Descriptive statistics

Central tendency measures

• Any correlations?

• Anomalies?

Plot the data

• Anomalies?

• Patterns?

More questions

• Are the data enough for my RQ?

• Do we need more data?

• Is there more data?

• Change RQs?

Page 20:

The data science process (4):

Model the data

What is the best approach to understand the data we have? It depends on…

• Our research questions

• The data available to us

Example: Hotspots of crime in Greater Manchester

Page 21:

The data science process (5):

Communicate and visualise the results

Visualise the results

Tables

Figures

Plots

Maps

Communicate the results

Know your audience

• Effective

• The right details for each audience

• Academic ≠ Local Government officers

Brexit Referendum

Page 22:

Exploratory Data Analysis

Page 23:

Exploratory data analysis

Flowchart for data preparation

https://r4ds.had.co.nz/explore-intro.html

From R for Data Science

Page 24:

Describe the data

To understand:

data availability,

types,

quality,

data complexity (e.g. nonlinearity, need for transformation)

Guided by two types of questions (Grolemund and Wickham, 2016):

What type of covariation occurs between my variables?

What type of variation occurs within my variables?

Page 25:

How to describe the data (1)

Distribution of numerical variables:

Extreme values (outliers)

Shape of the distribution

Missing cases

Unusual patterns

Distribution of categorical variables

Missing cases

Odd values

Unusual patterns

Most common values

Page 26:

How to describe the data (2)

Central tendency measures

Mean

Median

Mode

Measures of spread

Variance and standard deviation

Range

Interquartile range (IQR)

Visualisations

Histograms, boxplots, bar plots, scatterplots
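The measures listed on this slide are all available in base R. The sketch below uses a made-up vector of ages (the values are an assumption for illustration only) and needs no extra packages.

```r
# Hypothetical sample: ages of ten survey respondents (made-up values)
age <- c(15, 22, 27, 32, 38, 41, 45, 57, 63, 90)

# Central tendency
mean_age   <- mean(age)     # arithmetic mean
median_age <- median(age)   # middle value of the sorted data

# Spread
var_age   <- var(age)       # sample variance (divides by n - 1)
sd_age    <- sd(age)        # sample standard deviation
range_age <- range(age)     # minimum and maximum
iqr_age   <- IQR(age)       # third quartile minus first quartile

print(c(mean = mean_age, median = median_age, sd = sd_age, IQR = iqr_age))
```

For a first visual check of the same variable, `hist(age)` and `boxplot(age)` are the base R counterparts of the histograms and boxplots mentioned above.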

Page 27:

How to describe the data (3)

Mean (sample vs. population):

The "average" number; found by adding all data points and dividing by the number of data points

Median

Middle value if odd number of values, or average of the middle two values otherwise

Mode

Value that occurs most frequently in the data

• Unimodal, bimodal, trimodal

Is the mean always the best central tendency measure?
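One way to explore that question is to compare the three measures on data with an extreme value. The sketch below uses made-up incomes, and `stat_mode` is a hypothetical helper written for this example (base R has no function for the statistical mode).

```r
# Hypothetical incomes (in thousands); the last value is an extreme outlier
income <- c(18, 20, 20, 22, 25, 27, 30, 400)

# Helper (not part of base R): return the most frequent value(s);
# a bimodal variable would return two values
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

mean_all   <- mean(income)       # pulled upwards by the outlier
median_all <- median(income)     # barely affected by the outlier
mode_all   <- stat_mode(income)  # most frequent value

# Dropping the outlier changes the mean a lot and the median very little
mean_trimmed   <- mean(income[income < 400])
median_trimmed <- median(income[income < 400])

print(c(mean = mean_all, median = median_all, mode = mode_all))
print(c(mean = mean_trimmed, median = median_trimmed))
```

With skewed data like this, the median is usually the safer summary of a "typical" case.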

Page 28:

The problem with the mean

“There are two pieces of bread. You eat two. I eat none. Average consumption: one bread per person.”

Nicanor Parra, (Anti)Poet, Mathematician and Physicist

Page 29:

More about visualisations

Two types:

1. Exploring and getting to know the data

   1. Assess the data: decide what to do next

   2. Accurate

   3. Internal, never reach the wider audience

2. Communication

   1. Present data and ideas

   2. Accurate: provide evidence

   3. Easy to understand

   4. Effective

   5. It would depend on the audience

Images: https://r4ds.had.co.nz/exploratory-data-analysis.html

https://towardsdatascience.com/5-quick-and-easy-data-visualizations-in-python-with-code-a2284bae952f

Page 30:

Effective visualisations for communication

Simple but effective (don’t overdo it!)

Easy to understand

Use the right type of graph or figure

There is no one-size-fits-all graph

Appropriate use of colours (consider colour-blind readers)

Know your audience

Page 31:

Effective visualisations for communication

https://extremepresentation.typepad.com/blog/files/choosing_a_good_chart.pdf

Page 32:

Effective visualisations for communication: Use the right display

Trends over time

Lines

Scatterplots

Distributions

Density plots

Histograms

Correlations

Scatterplots

Comparisons

Bars

Lines

Proportions

Pie charts

Stacked charts

Page 33:

Bad practice in data visualisations

Lack of clear axis labelling

Truncation of an axis so it ends or starts in a misleading place

Numbers not adding up correctly

Visual deceptions

• Inappropriate use of white space

• Comparative elements that don’t seem to correspond to the data they represent

Poor use of a graph type

• Making a series of bar graphs when a line graph would be clearer

Cherry-picking data

• Including limited years or responses chosen to support a specific argument

Bad graphic choices

• Low-contrast colour choices

• Distorting perspective

No title

No key for the information in the graph

Page 34:

Bad practice in data visualisations: Visual deceptions

Page 35:

Bad practice in data visualisations: Numbers not adding up correctly

Page 36:

Bad practice in data visualisations: Truncation of an axis so it ends or starts in a misleading place

Delinquency: households that were victims of a crime in the last 12 months.

Page 37:

Bad practice in data visualisations: Truncation of an axis so it ends or starts in a misleading place

[Figure: bar chart with data labels 30.7, 22.8 and 27.3; y-axis from 0 to 35; x-axis 2010, 2013, 2016]

Same values!
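A rough way to see why a truncated axis misleads is to compare bar heights as drawn. The sketch below assumes the three values pair with the three years in the order shown (the slide does not confirm this) and uses a hypothetical truncated baseline of 20.

```r
# Values read from the slide; pairing them with the years in this order is an assumption
values <- c("2010" = 30.7, "2013" = 22.8, "2016" = 27.3)

# Ratio of the tallest to the shortest bar as it appears on the page:
# a drawn bar has height (value - baseline)
apparent_ratio <- function(x, baseline) {
  heights <- x - baseline
  max(heights) / min(heights)
}

ratio_full      <- apparent_ratio(values, baseline = 0)   # axis starts at 0
ratio_truncated <- apparent_ratio(values, baseline = 20)  # axis starts at 20 (hypothetical)

print(round(c(full = ratio_full, truncated = ratio_truncated), 2))
```

The data are identical in both cases; only the baseline changes, yet the tallest bar looks several times larger than the shortest once the axis is cut. In base R, `barplot(values, ylim = c(0, 35))` draws the version with the full axis.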

Page 38:

Data manipulation and linking data

Ana Morales

Analysing and Integrating Multiple Datasets Workshop

Manchester, 12-13 September 2019

Page 39:

Data manipulation

Tidy Data

Data Transformation

Linking data

Page 40:

Tidying data

Page 41:

Issues around data: ‘Untidy’ data

Messy data (common causes of messiness)

Multiple variables stored in one column

Missing cases

Variables stored as both rows and columns

Duplicate data

Reasons for ‘untidy’ data

• Data organised and stored for purposes other than analysis

Administrative data

Customer data

People and data producers are not always aware of what a tidy dataset should look like

No standard procedures guiding data entry and the storage of variables

Inconsistent coding, e.g. data entry carried out by different people

Page 42:

Recall from the exploratory data analysis process:

Flowchart for data preparation

https://r4ds.had.co.nz/explore-intro.html

Page 43:

Tidy dataset (Grolemund and Wickham, 2016)

• Each variable must have its own column.

• Each observation must have its own row.

• Each value must have its own cell.

Page 44:

Recall of useful terms

• Variables: a quantity, quality, or property that you can measure

• Categorical (name, gender, etc.)

• Ordinal (low, medium and high)

• Continuous (dates, age, etc.)

• Observations: a set of measurements made under similar conditions

• An observation will contain several values, each associated with a different variable.

• Values: the state of a variable when you measure it. The value of a variable may change from measurement to measurement.

casenumber  sex     age  SES
1           female  15   Low
2           female  32   Medium
3           male    38   Low
4           female  57   High

Page 45:

Tidy dataset: examples

What is the problem with this data?

• The column ‘type’ stores variable names rather than values.

• The values of ‘type’ are themselves variables: cases and population.

Untidy:

    country      year  type        count
1   Afghanistan  1999  cases       745
2   Afghanistan  1999  population  19987071
3   Afghanistan  2000  cases       2666
4   Afghanistan  2000  population  20595360
5   Brazil       1999  cases       37737
6   Brazil       1999  population  172006362
7   Brazil       2000  cases       80488
8   Brazil       2000  population  174504898
9   China        1999  cases       212258
10  China        1999  population  1272915272
11  China        2000  cases       213766
12  China        2000  population  1280428583

Tidy:

   country      year  cases   population
1  Afghanistan  1999  745     19987071
2  Afghanistan  2000  2666    20595360
3  Brazil       1999  37737   172006362
4  Brazil       2000  80488   174504898
5  China        1999  212258  1272915272
6  China        2000  213766  1280428583
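The reshaping shown on this slide can be sketched with the tidyr package. This is one possible approach, not necessarily the one used in the workshop practical; it assumes tidyr 1.0.0 or later, where `pivot_wider()` is available (older code uses `spread()` for the same job).

```r
library(tidyr)

# The 'untidy' table from the slide: 'type' holds variable names
untidy <- data.frame(
  country = rep(c("Afghanistan", "Brazil", "China"), each = 4),
  year    = rep(c(1999, 1999, 2000, 2000), times = 3),
  type    = rep(c("cases", "population"), times = 6),
  count   = c(745, 19987071, 2666, 20595360,
              37737, 172006362, 80488, 174504898,
              212258, 1272915272, 213766, 1280428583),
  stringsAsFactors = FALSE
)

# Give each value of 'type' its own column, filled from 'count'
tidy <- pivot_wider(untidy, names_from = type, values_from = count)
print(tidy)
```

Each country and year combination now occupies a single row, with cases and population as separate variables.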

Page 46:

Tidy dataset: examples

What is the problem with this data?

• One variable is spread across two columns.

• The column names ‘1999’ and ‘2000’ are values of the variable year.

• The values in the ‘1999’ and ‘2000’ columns are values of the variable cases for each year and country.

Untidy:

   country      1999    2000
1  Afghanistan  745     2666
2  Brazil       37737   80488
3  China        212258  213766

Tidy:

   country      year  cases
1  Afghanistan  1999  745
2  Brazil       1999  37737
3  China        1999  212258
4  Afghanistan  2000  2666
5  Brazil       2000  80488
6  China        2000  213766
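Moving the year columns into rows can be sketched with `pivot_longer()` from tidyr (version 1.0.0 or later is assumed; `gather()` is the older equivalent). Note that the row order of the result may differ from the table on the slide.

```r
library(tidyr)

# The wide table from the slide; check.names = FALSE keeps "1999" and "2000" as column names
wide <- data.frame(
  country = c("Afghanistan", "Brazil", "China"),
  `1999`  = c(745, 37737, 212258),
  `2000`  = c(2666, 80488, 213766),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Every column except 'country' becomes a (year, cases) pair
long <- pivot_longer(
  wide,
  cols      = -country,
  names_to  = "year",
  values_to = "cases"
)

# Column names arrive as text, so convert the year to a number
long$year <- as.integer(long$year)
print(long)
```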

Page 47:

Tidy dataset: examples

What is the problem with this data?

• The column ‘rate’ stores two variables, cases and population, written as a division.

• We need two new variables:

• Cases

• Population

Untidy:

   country      year  rate
1  Afghanistan  1999  745/19987071
2  Afghanistan  2000  2666/20595360
3  Brazil       1999  37737/172006362
4  Brazil       2000  80488/174504898
5  China        1999  212258/1272915272
6  China        2000  213766/1280428583

Tidy:

   country      year  cases   population
1  Afghanistan  1999  745     19987071
2  Afghanistan  2000  2666    20595360
3  Brazil       1999  37737   172006362
4  Brazil       2000  80488   174504898
5  China        1999  212258  1272915272
6  China        2000  213766  1280428583

Page 48:

Is ‘untidy’ data really a bad thing?

There are some cases where a tidy dataset, in the form explained before, might not be the best format

Some analyses in other software packages might be optimised for different formats

Some analyses based on different R packages might require other formats

Problem forward, not solution backward

https://simplystatistics.org/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward/

Page 49:

Data transformation

Page 50:

Recall from the exploratory data analysis process:

Flowchart for data preparation

https://r4ds.had.co.nz/explore-intro.html

Page 51:

Overview

• Data is rarely in the format that we need

• New variables

• Edit names

• Summarise

• Select subsamples

• Cases

• Variables

• Reshape (as seen in ‘tidy data’)

Page 52:

Subsetting: Making big datasets small

• Why subset?

• Useful when we have a lot of data, but not all of it is relevant

• Big datasets (many rows and columns) are more difficult to manage

• They may slow down the analysis (more computing power is needed)

• Example:

• Police data for Greater Manchester

• Different types of offences, but we only want to focus on burglary in Bolton

• What to subset?

• Variables (columns): exclude what we don’t need, e.g. Lat and Lon if not mapping

• Cases (rows): only selected cases, e.g. those in Bolton

Page 53:

Select: Pick variables by their names

• Database of the 50 top-rated films

Movie Name                                         Rating  Years Released  Duration
The Shawshank Redemption                           9.3     -1994           142 min
The Godfather                                      9.2     -1972           175 min
The Dark Knight                                    9       -2008           152 min
The Godfather: Part II                             9       -1974           202 min
Pulp Fiction                                       8.9     -1994           154 min
The Lord of the Rings: The Return of the King      8.9     -2003           201 min
Schindler's List                                   8.9     -1993           195 min
12 Angry Men                                       8.9     -1957           96 min
Il buono, il brutto, il cattivo                    8.9     -1966           161 min
The Lord of the Rings: The Fellowship of the Ring  8.8     -2001           178 min
Inception                                          8.8     -2010           148 min

After selecting Movie Name and Rating:

Movie Name                                         Rating
The Shawshank Redemption                           9.3
The Godfather                                      9.2
The Dark Knight                                    9
The Godfather: Part II                             9
Pulp Fiction                                       8.9
The Lord of the Rings: The Return of the King      8.9
Schindler's List                                   8.9
12 Angry Men                                       8.9
Il buono, il brutto, il cattivo                    8.9
The Lord of the Rings: The Fellowship of the Ring  8.8
Inception                                          8.8

Pick based on:

• Variable names

• Contains (part of variable name)

• Ends (with)

• Starts (with)
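The four ways of picking variables correspond to `select()` and its helper functions in dplyr. The sketch below uses three rows of the films table; the column names are shortened and the years written without the minus sign, both assumptions made for illustration.

```r
library(dplyr)

# First three rows of the films table; column names are illustrative
films <- data.frame(
  movie_name    = c("The Shawshank Redemption", "The Godfather", "The Dark Knight"),
  rating        = c(9.3, 9.2, 9.0),
  year_released = c(1994, 1972, 2008),
  duration_min  = c(142, 175, 152),
  stringsAsFactors = FALSE
)

by_name     <- select(films, movie_name, rating)    # pick by exact variable name
by_contains <- select(films, contains("name"))      # names containing "name"
by_start    <- select(films, starts_with("year"))   # names starting with "year"
by_end      <- select(films, ends_with("_min"))     # names ending in "_min"

print(names(by_name))
```

The helpers match on the variable name only, never on the values stored in the column.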

Page 54:

Filter: Pick observations based on their values

• Database of the 50 top-rated films

Movie Name                                         Rating  Years Released  Duration
The Shawshank Redemption                           9.3     -1994           142 min
The Godfather                                      9.2     -1972           175 min
The Dark Knight                                    9       -2008           152 min
The Godfather: Part II                             9       -1974           202 min
Pulp Fiction                                       8.9     -1994           154 min
The Lord of the Rings: The Return of the King      8.9     -2003           201 min
Schindler's List                                   8.9     -1993           195 min
12 Angry Men                                       8.9     -1957           96 min
Il buono, il brutto, il cattivo                    8.9     -1966           161 min
The Lord of the Rings: The Fellowship of the Ring  8.8     -2001           178 min
Inception                                          8.8     -2010           148 min

Filter:

• List of movies released before 2000 (the rows kept in the table below)

• Based on the values of any variable

• Variables can be of any type (numeric, character, etc.)

• Based on multiple conditions

• Comparison operators: >, <, <=, >=, == (equal to)

• Boolean operators

Movie Name                                         Rating  Years Released  Duration
The Shawshank Redemption                           9.3     -1994           142 min
The Godfather                                      9.2     -1972           175 min
The Godfather: Part II                             9       -1974           202 min
Pulp Fiction                                       8.9     -1994           154 min
Schindler's List                                   8.9     -1993           195 min
12 Angry Men                                       8.9     -1957           96 min
Il buono, il brutto, il cattivo                    8.9     -1966           161 min
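Filtering on one or several conditions can be sketched with `filter()` from dplyr. The sketch uses five rows of the films table with illustrative column names and positive years (both assumptions); the cutoff year is only an example value.

```r
library(dplyr)

# Five rows of the films table; column names are illustrative
films <- data.frame(
  movie_name    = c("The Shawshank Redemption", "The Godfather", "The Dark Knight",
                    "12 Angry Men", "Inception"),
  rating        = c(9.3, 9.2, 9.0, 8.9, 8.8),
  year_released = c(1994, 1972, 2008, 1957, 2010),
  stringsAsFactors = FALSE
)

cutoff <- 2000  # illustrative cutoff year

before_cutoff <- filter(films, year_released < cutoff)                # one condition
old_and_top   <- filter(films, year_released < cutoff, rating >= 9)   # AND: both must hold
either        <- filter(films, year_released < 1960 | rating > 9.2)   # OR: at least one holds
excluded      <- filter(films, !(movie_name %in% c("Inception")))     # NOT: drop listed titles

print(before_cutoff$movie_name)
```

Conditions separated by commas are combined with AND; `|` and `!` give OR and NOT.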

Page 55:

Mutate: Creating new variables as a function of existing columns.

   country      year  rate
1  Afghanistan  1999  745/19987071
2  Afghanistan  2000  2666/20595360
3  Brazil       1999  37737/172006362
4  Brazil       2000  80488/174504898
5  China        1999  212258/1272915272
6  China        2000  213766/1280428583

Using two functions:

• Separate cases and population into two variables

• Create a new variable ‘rate’ based on the values of cases and population

country      year  cases   population  rate
Afghanistan  1999  745     19987071    0.00003727
Afghanistan  2000  2666    20595360    0.000129447
Brazil       1999  37737   172006362   0.000219393
Brazil       2000  80488   174504898   0.000461236
China        1999  212258  1272915272  0.00016675
China        2000  213766  1280428583  0.000166949
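The two steps on this slide can be sketched as one pipeline, using `separate()` from tidyr and `mutate()` from dplyr. This is a sketch under the assumption that the rate column is stored as text, as it appears in the first table.

```r
library(dplyr)
library(tidyr)

# The table from the slide, with 'rate' stored as text
raw <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Brazil", "Brazil", "China", "China"),
  year    = c(1999, 2000, 1999, 2000, 1999, 2000),
  rate    = c("745/19987071", "2666/20595360", "37737/172006362",
              "80488/174504898", "212258/1272915272", "213766/1280428583"),
  stringsAsFactors = FALSE
)

result <- raw %>%
  # Step 1: split the text column at "/" into two new columns
  separate(rate, into = c("cases", "population"), sep = "/") %>%
  # Step 2: convert the pieces to numbers and compute the rate
  mutate(
    cases      = as.numeric(cases),
    population = as.numeric(population),
    rate       = cases / population
  )

print(result)
```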

Page 56:

Summarise: Creating aggregated variables

Useful to create summaries of variables

In R, the function summarise works best when combined with other functions

In this example, we grouped observations by country and created a new variable ‘mean_cases’ that represents mean cases per country

country      year  cases   population  rate
Afghanistan  1999  745     19987071    0.00003727
Afghanistan  2000  2666    20595360    0.000129447
Brazil       1999  37737   172006362   0.000219393
Brazil       2000  80488   174504898   0.000461236
China        1999  212258  1272915272  0.00016675
China        2000  213766  1280428583  0.000166949

country      mean_cases
Afghanistan  1705.5
Brazil       59112.5
China        213012
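The grouping and averaging described on this slide can be sketched with `group_by()` and `summarise()` from dplyr, using the cases column from the table above.

```r
library(dplyr)

tb <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Brazil", "Brazil", "China", "China"),
  year    = c(1999, 2000, 1999, 2000, 1999, 2000),
  cases   = c(745, 2666, 37737, 80488, 212258, 213766),
  stringsAsFactors = FALSE
)

by_country <- tb %>%
  group_by(country) %>%                  # one group per country
  summarise(mean_cases = mean(cases))    # one output row per group

print(by_country)
```

Without the `group_by()` step, `summarise()` would return a single row with the mean over all six observations.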

Page 57:

Linking data: working with relational datasets

Page 58:

Relational data

Typically, there is more than a single data table or dataset to work with

We need to combine them to answer the questions of our project

Multiple datasets that are related to one another are called relational data

Example: We want to calculate the arrest rate for burglary per 100,000 inhabitants in Bolton using open Police Data:

Page 59:

Relational data

• Crime rate for area:

• We are missing the total population

• Our data for Bolton are made up of several LSOAs

• We need total population estimates at LSOA level

• ONS mid-year population statistics

$$\text{Crime rate} = \frac{N_{\text{crimes}}}{\text{Total pop}} \times 100000$$
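As a sketch of how the rate could be computed once crime counts and populations are linked, the example below uses made-up LSOA codes, counts and populations (all hypothetical, not real police or ONS figures).

```r
library(dplyr)

# Hypothetical burglary counts per LSOA (codes and numbers are made up)
crimes <- data.frame(
  lsoa_code = c("LSOA_A", "LSOA_B", "LSOA_C"),
  n_crimes  = c(12, 30, 7),
  stringsAsFactors = FALSE
)

# Hypothetical mid-year population estimates for the same LSOAs
population <- data.frame(
  lsoa_code = c("LSOA_A", "LSOA_B", "LSOA_C"),
  total_pop = c(1500, 2000, 1750),
  stringsAsFactors = FALSE
)

# Link the two tables, then apply the formula LSOA by LSOA
rates <- crimes %>%
  left_join(population, by = "lsoa_code") %>%
  mutate(crime_rate = n_crimes / total_pop * 100000)

# Rate for the whole area: sum counts and populations first, then divide
area_rate <- sum(rates$n_crimes) / sum(rates$total_pop) * 100000

print(rates)
print(area_rate)
```

Summing before dividing matters: the area rate is not the simple average of the LSOA rates unless every LSOA has the same population.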

Page 60:

Linking data:

The formula takes the following form:

f(x, y)

where f is the function, and x and y are the datasets

- We need at least one common variable that identifies each case in both datasets.

- The datasets do not necessarily have the same number of cases,

- but there have to be some cases in common.

- They might have repeated variables (we may duplicate them!), but this is not a problem.

- They may have the same variable under a different name.
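When the key variable has a different name in each dataset, dplyr joins accept a named vector in `by`. The sketch below uses hypothetical column names (`LSOA.code`, `lsoa11cd`) and made-up values; checking the overlap between the keys first shows how many cases the two datasets share.

```r
library(dplyr)

# Hypothetical tables: the same key is called 'LSOA.code' in one and 'lsoa11cd' in the other
crimes <- data.frame(
  LSOA.code = c("LSOA_A", "LSOA_B", "LSOA_D"),
  n_crimes  = c(12, 30, 5),
  stringsAsFactors = FALSE
)
population <- data.frame(
  lsoa11cd  = c("LSOA_A", "LSOA_B", "LSOA_C"),
  total_pop = c(1500, 2000, 1750),
  stringsAsFactors = FALSE
)

# Cases present in both datasets
common_keys <- intersect(crimes$LSOA.code, population$lsoa11cd)

# Named vector: the left-hand name is the key in x, the right-hand name the key in y
linked <- left_join(crimes, population, by = c("LSOA.code" = "lsoa11cd"))

print(common_keys)
print(linked)
```

The result keeps the key name from x; cases in x without a match in y get NA in the columns that come from y.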

Page 61:

Types of join: Mutating joins

Combine variables from two sets of data

Match observations and copy variables from one dataset to the other

Add new variables to the right of the x (reference) dataset

• Inner join: keeps only matched observations that are in both datasets (AND)

• Left join: keeps all observations of dataset x

• Right join: keeps all observations of dataset y

• Full join: keeps all observations in both datasets (OR)
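The four mutating joins can be compared side by side on two small tables. In this sketch `data1` and `data2` are small illustrative data frames that share the key `id`; the row order of the results can vary between dplyr versions, so only the row counts are printed.

```r
library(dplyr)

data1 <- data.frame(
  id         = c(102, 103, 104, 105),
  sex        = c("male", "female", "female", "male"),
  occupation = c("baker", "banker", "teacher", "teacher"),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  id  = c(101, 102, 104, 105),
  age = c(25, 34, 51, 38)
)

inner <- inner_join(data1, data2, by = "id")  # ids in both tables: 102, 104, 105
left  <- left_join(data1, data2, by = "id")   # every id in data1; age is NA for 103
right <- right_join(data1, data2, by = "id")  # every id in data2; sex and occupation are NA for 101
full  <- full_join(data1, data2, by = "id")   # ids in either table: 101 to 105

# Number of rows kept by each join
print(sapply(list(inner = inner, left = left, right = right, full = full), nrow))
```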

Page 62:

Types of join: Inner join

An inner join matches pairs of observations (cases) by the key variable/s, preserving only those available in both datasets

It is the simplest type of join

It is not always convenient, as we might lose too many observations

Data 1

id   sex     occupation
102  male    baker
103  female  banker
104  female  teacher
105  male    teacher

Data 2

id   age
101  25
102  34
104  51
105  38

Combined data

id   sex     occupation  age
102  male    baker       34
104  female  teacher     51
105  male    teacher     38

Code in R: inner_join(data1, data2, by = "id")

Page 63: Integrating and analysing multiple datasetsMore about visualisations Two types: 1. Exploring and getting to know the data 1. Assess the data: decide what to do next 2. Accurate 3.

Types of Join: left join

Left join Matches pairs of observations (cases) by the key variable/s preserving only those

available in x (first) data

Returns all rows from x, and all columns from x and y

Rows in x with unmatched rows in y will have missing values (NA) for the columns imported from y

Data 1

id sex occupation

102 male baker

103 female banker

104 female teacher

105 male teacher

Data 2

id age

101 25

102 34

104 51

105 38

Combined data

Code in R: left_join(data1, data2, by = "id")

id sex occupation age

102 male baker 34

103 female banker NA

104 female teacher 51

105 male teacher 38

Types of Join: Right join

Right join

Returns all rows from y, and all columns from x and y

Rows in y with no match in x will have missing values (NA) in the columns that come from x.

Data 1

id sex occupation

102 male baker

103 female banker

104 female teacher

105 male teacher

Data 2

id age

101 25

102 34

104 51

105 38

Combined data

Code in R: right_join(data1, data2, by = "id")

id sex occupation age

101 NA NA 25

102 male baker 34

104 female teacher 51

105 male teacher 38

Types of Join: full join

full join

Returns all rows and all columns from both x and y

Where an observation has no match in the other dataset, returns NA for the missing values

Data 1

id sex occupation

102 male baker

103 female banker

104 female teacher

105 male teacher

Data 2

id age

101 25

102 34

104 51

105 38

Combined data

Code in R: full_join(data1, data2, by = "id")

id sex occupation age

101 NA NA 25

102 male baker 34

103 female banker NA

104 female teacher 51

105 male teacher 38
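The three joins that keep unmatched rows can be compared side by side. This is a sketch assuming dplyr is installed and using the example tables from the slides. The row order returned by dplyr may differ from the order shown on the slides, so the full join is sorted by id here.

```r
library(dplyr)

# Example tables copied from the slides
data1 <- data.frame(
  id = c(102, 103, 104, 105),
  sex = c("male", "female", "female", "male"),
  occupation = c("baker", "banker", "teacher", "teacher")
)
data2 <- data.frame(
  id = c(101, 102, 104, 105),
  age = c(25, 34, 51, 38)
)

left  <- left_join(data1, data2, by = "id")   # all rows of data1; age is NA for id 103
right <- right_join(data1, data2, by = "id")  # all rows of data2; sex and occupation are NA for id 101
full  <- full_join(data1, data2, by = "id")   # every id found in either table

# Sort so the result is easy to compare with the slide
full <- arrange(full, id)
print(full)

# Count the missing values each join introduces, per column
colSums(is.na(left))
colSums(is.na(right))
colSums(is.na(full))
```

Counting the NA values after a join is a quick way to see how many observations had no partner in the other dataset.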

Types of join: Filtering joins

Match observations between two datasets, but affect the observations, not the variables.

They do not add new variables!

• Semi join: keeps all rows from x that have a match in y, keeping just the columns from x

• Anti join: keeps only the observations of dataset x that are NOT in y

Types of Join: semi join

semi join

Returns all rows from x where there are matching values in y

Similar to inner join, but only keeps columns from x

Useful when y contains duplicated observations: each matching row of x is returned only once

Data 1

id sex occupation

102 male baker

103 female banker

104 female teacher

105 male teacher

Data 2

id age

101 25

102 34

104 51

105 38

Combined data

Code in R: semi_join(data1, data2, by = "id")

id sex occupation

102 male baker

104 female teacher

105 male teacher

Types of Join: anti join

anti join

Keeps the rows in x that don’t have a match in y

Keeps only variables from dataset x

Data 1

id sex occupation

102 male baker

103 female banker

104 female teacher

105 male teacher

Data 2

id age

101 25

102 34

104 51

105 38

Combined data

Code in R: anti_join(data1, data2, by = "id")

id sex occupation

103 female banker
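The two filtering joins can be sketched as follows, assuming dplyr is installed. The table data2_dup is a hypothetical variant of Data 2 in which id 102 appears twice. It is not part of the slides and is only there to illustrate how a semi join behaves when y holds duplicates.

```r
library(dplyr)

# Example tables copied from the slides
data1 <- data.frame(
  id = c(102, 103, 104, 105),
  sex = c("male", "female", "female", "male"),
  occupation = c("baker", "banker", "teacher", "teacher")
)
data2 <- data.frame(
  id = c(101, 102, 104, 105),
  age = c(25, 34, 51, 38)
)

# Rows of data1 with a match in data2; no columns from data2 are added
semi <- semi_join(data1, data2, by = "id")

# Rows of data1 without a match in data2 (only id 103)
anti <- anti_join(data1, data2, by = "id")

# Hypothetical: id 102 is duplicated in the second table
data2_dup <- rbind(data2, data.frame(id = 102, age = 34))

# The semi join still returns id 102 once, because it only filters data1
semi_dup <- semi_join(data1, data2_dup, by = "id")

print(semi)
print(anti)
print(semi_dup)
```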

Issues around linking data

• Identify the key variables

• Check that there are no missing values in the key variable. Comparing the number of rows in both datasets is not sufficient… be aware of duplicates!

• Check that there are no spelling mistakes (entry errors) in the key variable. An anti join can be useful
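The checks listed above could be run before joining along these lines. This is a sketch using the example tables from the earlier slides and assuming dplyr is installed. The object names are illustrative.

```r
library(dplyr)

# Example tables copied from the earlier slides
data1 <- data.frame(
  id = c(102, 103, 104, 105),
  sex = c("male", "female", "female", "male"),
  occupation = c("baker", "banker", "teacher", "teacher")
)
data2 <- data.frame(
  id = c(101, 102, 104, 105),
  age = c(25, 34, 51, 38)
)

# 1. Missing values in the key variable
missing_keys <- c(data1 = anyNA(data1$id), data2 = anyNA(data2$id))

# 2. Duplicated keys (equal row counts would not reveal these)
duplicated_keys <- c(data1 = sum(duplicated(data1$id)),
                     data2 = sum(duplicated(data2$id)))

# 3. Keys without a partner in the other table; with text keys,
#    entry errors such as misspellings would show up here
only_in_data1 <- anti_join(data1, data2, by = "id")
only_in_data2 <- anti_join(data2, data1, by = "id")

print(missing_keys)
print(duplicated_keys)
print(only_in_data1$id)
print(only_in_data2$id)
```

Both tables have four rows, yet each contains one id that the other lacks, which is why comparing row counts alone is not enough.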

Missing Data

Ana Morales

Analysing and Integrating Multiple datasets

Workshop

Manchester, 12-13 September 2019

Overview

• Causes, patterns, and mechanisms of missing data

• Problems caused by missing data

• Dealing with missing data

What are missing data

A data value that is not stored (is missing) for a variable in the observation of interest.

• Systematic missing: certain types of people didn’t want to, couldn’t, or preferred not to answer certain types of questions

• Values can be explicitly missing

• Different software conventions, e.g. ‘NA’ in R; 9, 99 or -9 in SPSS

• Values can be missing without this being made explicit, e.g. empty cells
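Before any analysis in R, coded or implicit missing values need to be converted to NA. A minimal base R sketch follows. The survey data frame and the choice of -9 and 99 as missing codes for age are hypothetical, and the real codes should always be taken from the dataset's documentation.

```r
# Hypothetical survey extract: -9 and 99 stand for "missing" in age,
# and an empty string stands for a missing occupation
survey <- data.frame(
  id = 1:5,
  age = c(25, -9, 51, 99, 38),
  occupation = c("baker", "", "teacher", "banker", ""),
  stringsAsFactors = FALSE
)

# Recode the numeric missing codes to NA
survey$age[survey$age %in% c(-9, 99)] <- NA

# Recode empty cells to NA
survey$occupation[survey$occupation == ""] <- NA

# Number of missing values per variable
colSums(is.na(survey))

# Without the recoding, -9 and 99 would be treated as real ages
mean(survey$age, na.rm = TRUE)
```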

Causes

• Unit non-response

- attrition (drop-out)

- wave non-response

• Item non-response

• Data entry problems

• Random error: Someone forgot to write down a number

Missing mechanisms: Missing Completely at Random (MCAR)

The probability of missing data on a variable Y is unrelated to other measured variables and is unrelated to the values of Y itself. Missingness is not related to any variable

Example:

Blood measurements were lost due to equipment malfunction in the laboratory.

Missing mechanisms: Missing at Random (MAR)

The probability that a value is missing depends only on available (observed) information

Example:

A child did not take an assessment because he was ill and we know that he was ill from other variables (he was truly ill).

We know the reason for the missingness, and we know that the missingness is not related to the score that the child would have obtained had he not been ill

Missing mechanisms: Missing not at Random (MNAR)

The probability of missing data on a variable is related to the missing values themselves

Example:

A person does not attend a drug test because the person took drugs the night before
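The three mechanisms can be illustrated with a small simulation in base R. Everything here is an assumption made for illustration: the sample size, the coefficients and the logistic form of the missingness probabilities are arbitrary choices, not taken from the slides.

```r
set.seed(2019)
n <- 10000
x <- rnorm(n)              # covariate observed for everyone
y <- 0.7 * x + rnorm(n)    # variable that will receive missing values

# MCAR: every value has the same 30% chance of being missing
y_mcar <- ifelse(runif(n) < 0.3, NA_real_, y)

# MAR: the chance of being missing depends only on the observed x
y_mar <- ifelse(runif(n) < plogis(x - 1), NA_real_, y)

# MNAR: the chance of being missing depends on the value of y itself
y_mnar <- ifelse(runif(n) < plogis(y - 1), NA_real_, y)

# Mean of y in the full data and among the observed cases
means <- c(
  true = mean(y),
  mcar = mean(y_mcar, na.rm = TRUE),
  mar  = mean(y_mar, na.rm = TRUE),
  mnar = mean(y_mnar, na.rm = TRUE)
)
print(round(means, 2))
```

Under MCAR the observed mean stays close to the true mean. Under MAR and MNAR the observed mean is pulled down because larger values are more likely to be missing. The difference is that under MAR the information needed to correct this (x) is available, whereas under MNAR it is not.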

Problems with missing data

• Selection bias:

• cases with complete information might be different from those cases with missing values

• Precision:

• We might find associations where there are none, or the other way round

• Loss of observations

Dealing with missing data

• Missing data methods that discard data

• Complete case analysis: Listwise

• Pairwise

• Non-response weighting

• Missing data method that retains all the data

• Imputation based

• Mode, mean, median imputation

• Regression methods

• Multiple imputation

Missing data methods that discard data

• Complete case analysis: Listwise

Removes all data for an observation that has one or more missing values.

Pros:

Easy to implement

Unbiased under MCAR

Cons:

Loss of data leads to a loss of precision

Biased results for MAR and MNAR

Remaining observations might not be representative of the full sample
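In base R, listwise deletion can be sketched as below. The data frame and its income values are hypothetical.

```r
# Hypothetical data with missing values in two variables
df <- data.frame(
  id = 101:105,
  age = c(25, 34, NA, 51, 38),
  income = c(21000, NA, 30000, 28000, NA)
)

# Drop every row that has at least one missing value
complete <- na.omit(df)   # same rows as df[complete.cases(df), ]

nrow(df)         # 5 rows before deletion
nrow(complete)   # 2 rows after deletion (ids 101 and 104)
```

Three of the five observations are lost even though each of them has only one missing value. Many modelling functions in R, such as lm(), apply this kind of deletion by default.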

Missing data methods that discard data

• Pairwise deletion

Analyses, for each pair of variables, all cases in which both variables are present

Pros:

Easy to implement

Unbiased under MCAR

Cons:

Different observations contributing to different parts of the model

Loss of data leads to a loss of precision

Biased results for MAR and MNAR
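A correlation matrix is a common place where pairwise deletion is applied. The sketch below uses base R and a hypothetical data frame to compare it with listwise deletion and to show how many cases stand behind each pair.

```r
# Hypothetical data: each variable has one missing value, in different rows
df <- data.frame(
  a = c(1, 2, 3, 4, 5, NA),
  b = c(2, 1, 4, NA, 5, 6),
  c = c(NA, 3, 2, 5, 4, 6)
)

# Listwise: only the rows that are complete on all three variables
cor_listwise <- cor(df, use = "complete.obs")

# Pairwise: each correlation uses every row where both variables are present
cor_pairwise <- cor(df, use = "pairwise.complete.obs")

# Number of cases contributing to each pair
observed <- 1 * (!is.na(df))
n_pairs <- crossprod(observed)

sum(complete.cases(df))   # 3 rows available to the listwise analysis
print(n_pairs)            # 4 rows available to each pair of variables
print(round(cor_pairwise, 2))
```

Each pairwise correlation here is based on a different set of four rows, which illustrates the point that different observations contribute to different parts of the analysis.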

Missing data methods that discard data

• Non-response weighting

A model predicts response (versus non-response) on a variable.

The inverse of the predicted probability of response is used as a survey weight

Pros:

Same sample size as complete case analysis, but makes it more representative of the full sample

Relatively easy to implement.

Cons:

Still a small sample

Increasingly problematic with more than one variable missing

Increases variance

Biased under MNAR
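A minimal sketch of non-response weighting in base R, using simulated data. The data-generating values are arbitrary assumptions chosen for illustration. Response depends only on the fully observed variable x (a MAR situation), a logistic regression estimates each case's probability of responding, and respondents are weighted by the inverse of that probability.

```r
set.seed(2019)
n <- 5000
x <- rbinom(n, 1, 0.5)            # group indicator, observed for everyone
y <- 10 + 5 * x + rnorm(n)        # outcome of interest

# Group x = 1 responds less often (40%) than group x = 0 (80%)
resp <- rbinom(n, 1, ifelse(x == 1, 0.4, 0.8))
y_obs <- ifelse(resp == 1, y, NA_real_)

# Model the probability of response from the observed variable
resp_model <- glm(resp ~ x, family = binomial)
p_hat <- fitted(resp_model)

# Weight each respondent by the inverse of the predicted response probability
w <- 1 / p_hat[resp == 1]

true_mean     <- mean(y)
naive_mean    <- mean(y_obs, na.rm = TRUE)
weighted_mean <- weighted.mean(y_obs[resp == 1], w)

print(round(c(true = true_mean, naive = naive_mean, weighted = weighted_mean), 2))
```

The unweighted mean of the respondents is too low because the under-responding group has the higher values. The weights give those respondents more influence and bring the estimate back towards the full-sample mean.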

Missing data method that retains all the data

• Mode, mean, median imputation

Compute the overall mean, median or mode and use it to replace the missing values

Pros:

Easy to implement

Preserve sample size (no data loss)

Cons:

Reduce variability of the data

Biased estimates, especially under MAR and MNAR

Not recommended
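The loss of variability can be seen in a few lines of base R. The ages below are hypothetical.

```r
# Hypothetical ages with two missing values
age <- c(25, 34, NA, 51, 38, NA, 42, 29)

# Replace each missing value with the mean of the observed values
age_imputed <- ifelse(is.na(age), mean(age, na.rm = TRUE), age)

mean(age, na.rm = TRUE)   # mean of the observed values
mean(age_imputed)         # unchanged by the imputation
var(age, na.rm = TRUE)    # variance of the observed values
var(age_imputed)          # smaller: the imputed values add no spread
```

The mean is preserved, but every imputed value sits exactly at the centre of the distribution, so the variance shrinks and standard errors computed from the imputed data will be too small.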

Missing data method that retains all the data

• Regression methods

Existing variables are used to make a prediction, and then the predicted value is used to replace the missing values

Create model using auxiliary variables (the same that would form part of the model)

Pros:

Easy to implement

Preserve sample size (no data loss)

Cons:

Reduce variability of the data

As the replaced values were predicted from other variables they tend to fit together “too well”

Not recommended
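A sketch of single regression imputation in base R with simulated data shows why the imputed values fit "too well". The sample size, coefficients and number of deleted values are arbitrary assumptions.

```r
set.seed(2019)
n <- 500
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)

# Delete 150 values of y completely at random
y_obs <- y
y_obs[sample(n, 150)] <- NA
miss <- is.na(y_obs)

# Fit the imputation model on the complete cases
fit <- lm(y_obs ~ x)

# Replace each missing y with its predicted value
y_imp <- y_obs
y_imp[miss] <- predict(fit, newdata = data.frame(x = x[miss]))

# Compare the observed cases with the imputed data
cor_observed <- cor(x[!miss], y_obs[!miss])
cor_imputed  <- cor(x, y_imp)
print(round(c(observed = cor_observed, imputed = cor_imputed), 2))
print(round(c(observed = var(y_obs, na.rm = TRUE), imputed = var(y_imp)), 2))
```

The imputed values lie exactly on the regression line, so the correlation between x and y is overstated and the variance of y is understated.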

Missing data method that retains all the data

Multiple imputation

• Create model using auxiliary variables (the same that would form part of the model)

• Missing values are replaced with a set of plausible values which contain the natural variability and uncertainty of the “true” values.

• Create multiple datasets with different plausible imputed values

• Analyse each imputed dataset and pool the results

Pros:

• Produces unbiased estimates under MCAR and MAR and sometimes under MNAR

• Improves validity

• Increases precision and power

Cons:

Computationally demanding

It needs a good imputation model (good predictors)
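One way to carry out these steps in R is the mice package. The slides do not name a package, so this choice is an assumption, as are the imputation method, the number of imputations and the analysis model. The sketch uses nhanes, a small example dataset shipped with mice.

```r
library(mice)   # assumed to be installed; not mentioned in the slides

# nhanes: 25 cases with missing values in bmi, hyp and chl
head(nhanes)

# Step 1: create 5 imputed datasets using predictive mean matching
imp <- mice(nhanes, m = 5, method = "pmm", seed = 2019, printFlag = FALSE)

# Step 2: fit the analysis model separately in each imputed dataset
fits <- with(imp, lm(chl ~ age + bmi))

# Step 3: pool the 5 sets of estimates into one result
pooled <- pool(fits)
summary(pooled)
```

The pooled standard errors combine the variability within each imputed dataset with the variability between them, which is how the uncertainty about the missing values is carried into the final results.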

Considerations

• Ignoring missing data leads to selection bias and loss of power

• Many statistical methods to deal with missing data

• Poor missing value imputation might destroy the correlations between your

variables - BIAS

• There is no “one size fits all” answer

• Inspect the data

• Establish the missing data mechanism

• Deal with missing data accordingly

• Make explicit the implications of the chosen method to deal with missing cases

• If using an imputation method, make it explicit.

Questions

Dr Ana Morales

[email protected]