EDA in the Data Analysis Process

  • 7/29/2019 EDA in the Data Analysis Process

    SADC Course in Statistics

    Exploratory Data Analysis (EDA) in the data analysis process

    Module B2 Session 13

    Learning Objectives

    Students should be able to:

    Construct a dot plot for a numeric variable split by a categorical variable

    Apply EDA concepts to a large dataset

    Explain the use of Excel's pivot tables and filters in the EDA process

    Explain the importance of EDA for data checking and at the start of the analysis

    Relate EDA to the principles of official statistics

    EDA with small and large data sets

    Session 12:

    Stressed the importance of EDA

    Introduced 2 new tools (dot plots and stem and leaf plots)

    Practised with small data sets

    In this session we scale up:

    Look at large data sets

    The tools do not scale up easily

    But the concepts do scale up

    EDA becomes even more crucial

    Most data sets are large, at least compared with teaching examples!

    The essence of a stem and leaf plot

    Figure panels: stem and leaf plot; stacked dot plot

    The leaf shows the next digit. This can be useful in the exploration phase.

    Data: 5.3, 5.4, 6.0, ..., 11.1, 11.9
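    The idea that the leaf shows the next digit can be sketched in a few lines of Python. This is an illustration only, not part of the course material: the function name stem_and_leaf is hypothetical, it assumes non-negative values recorded to one decimal place, and it uses only the values legible on the slide (the elided ones are not reconstructed).

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem-and-leaf rows for non-negative values with one decimal place.

    The stem is the whole-number part; the leaf is the first decimal digit.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)              # 5.3 -> 53, avoids float truncation
        stem, leaf = divmod(tenths, 10)
        rows[stem].append(str(leaf))
    # Print empty stems too, so gaps in the data stay visible.
    lines = []
    for stem in range(min(rows), max(rows) + 1):
        lines.append(f"{stem:>2} | {''.join(rows.get(stem, []))}")
    return lines

# Only the values legible on the slide; the elided values are left out.
visible = [5.3, 5.4, 6.0, 11.1, 11.9]
for line in stem_and_leaf(visible):
    print(line)
```

    Every leaf is an actual digit from the data, so the display keeps the numbers themselves rather than a summary of them.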

    What are the key points?

    We look at individual data points, not summaries, at this stage

    This is general for EDA

    The stem and leaf plot in particular keeps the actual numbers as far as possible

    This can be important

    An example uses the Tanzania survey

    Tanzania agriculture survey

    This is the variable we wish to explore.

    It is a value between 0 and 100

    The data in Excel

    The variable to explore before analysis

    How to explore this value

    Can we do a stem and leaf plot?

    By hand in Excel? But there are 16628 values!

    Even if automated, that is too many!

    The essence of a stem and leaf plot is to look at all the possible values

    Try a pivot table, a powerful feature in Excel, used previously on categorical data
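    Outside Excel, the same "count every distinct value" idea can be sketched with Python's standard library. This is a rough equivalent of the pivot table, not the course's method; the function name frequency_table is hypothetical and the percentages are invented for illustration, not taken from the Tanzania survey.

```python
from collections import Counter

def frequency_table(values):
    """Count how often each distinct value occurs, sorted by value."""
    counts = Counter(values)
    return sorted(counts.items())

# Hypothetical percentages, invented for illustration only.
percentages = [0, 2, 5, 5, 10, 10, 10, 25, 33, 50, 50, 100]
for value, count in frequency_table(percentages):
    print(f"{value:>3}  {count}")
```

    With thousands of records the table stays short, because it has one row per distinct value rather than one per record.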

    The pivot table

    Some results

    What do you deduce?

    There are oddities in rounding

    Perhaps enumerator differences

    Can this question be answered to 1%?

    So what should be done before analysis?

    First look further at the data

    Excel can help: it can drill down to examine individual records

    The concept: use the table to look for oddities, then examine them in more detail

    Drilling down: an example

    Make the 6 corresponding to 2% the active cell

    Then double click to give the detail

    4 of these values are from the same village, so the same enumerator
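    The drill-down step (pick an odd cell, list its records, see what they share) can be sketched in Python as follows. The function drill_down, the field names and every record below are hypothetical; the counts are chosen only to mirror the pattern described on the slide.

```python
from collections import Counter

def drill_down(records, column, value):
    """Return the full records whose `column` equals `value`."""
    return [r for r in records if r[column] == value]

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"village": "V1", "household": 1, "percent": 2},
    {"village": "V1", "household": 2, "percent": 2},
    {"village": "V1", "household": 3, "percent": 2},
    {"village": "V1", "household": 4, "percent": 2},
    {"village": "V2", "household": 5, "percent": 2},
    {"village": "V3", "household": 6, "percent": 2},
    {"village": "V2", "household": 7, "percent": 10},
    {"village": "V3", "household": 8, "percent": 50},
]

detail = drill_down(records, "percent", 2)
by_village = Counter(r["village"] for r in detail)
print(len(detail), by_village.most_common())
```

    Counting the selected records by a related column, here the village, is what points towards a common cause such as a single enumerator.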

    What do you conclude? Technique and results

    Technique

    Stem and leaf plots when looking at small datasets

    Pivot tables when datasets are large

    But the principle is general

    Numbers must be looked at carefully!

    The principle can be adapted for the data and explored effectively in Excel

    Results

    Did enumerators have different interpretations of the precision required in the percentages?

    This needs further exploration, and the analysis needs to take account of this

    Another new element in this session

    Exploratory analysis includes looking for oddities in the data

    Unexplained oddities cause variation that can make it difficult to detect the pattern, because they add unnecessary noise to the data

    How do you tame the variation?

    One way is to examine related variables

    This is important in the analysis: the next slide is a repeat from Session 3

    It is also a key weapon in data exploration and is covered in the practical

    Slide from Module B2 Session 3

    To do good statistics you must fight the curse of variation

    Two main strategies to overcome variation:

    1. Take enough observations

    In the Tanzania survey there were 3223 households just from this one region

    2. Measure characteristics that explain variation

    Variation itself is not necessarily the problem

    Variation you do not understand is the problem

    Here we start understanding variation at the exploration stage
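    One way to see what "measuring characteristics that explain variation" means is to compare the variation around the overall mean with the variation left around each group's own mean. The sketch below is an illustration under stated assumptions, not the course's procedure: the function names are hypothetical, the numbers and group labels are invented, and sums of squares are used as one possible measure of variation.

```python
from collections import defaultdict
from statistics import mean

def sum_of_squares(values):
    """Sum of squared deviations from the mean of `values`."""
    centre = mean(values)
    return sum((v - centre) ** 2 for v in values)

def within_group_ss(values, groups):
    """Variation left over once each group is measured from its own mean."""
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    return sum(sum_of_squares(vs) for vs in by_group.values())

# Hypothetical measurements and a related categorical variable.
values = [10, 12, 11, 30, 32, 31]
groups = ["A", "A", "A", "B", "B", "B"]

total = sum_of_squares(values)                # variation ignoring the groups
unexplained = within_group_ss(values, groups) # variation the groups do not explain
print(total, unexplained)
```

    In this invented example almost all of the variation disappears once the group is taken into account, which is the sense in which a related variable "explains" variation.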

    Practical: three parts

    Tanzania data: practise what has been done in these slides

    Dot plots split by a factor: demonstration and practice

    Swaziland data: apply the concepts, checking factors as well as numeric columns

    Then the key points are reviewed

    Points for review after the practical

    Looking for individual problems, and surprising patterns

    Exploratory graphics need to help the analyst and data checker (see dot plots on next slide)

    Tables are also useful, especially with the facility to drill down

    Look at individual variables, and at records as a whole

    Trust your common sense

    It is useful to estimate results, and question the computer if they are very different

    Dot plots - yield by variety

    Outliers (typing errors) are clear, but only because of the 2nd variable

    They are not outliers overall
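    A crude text version of a dot plot split by a factor shows why the second variable matters. This is a sketch only: the function dot_rows is hypothetical, it assumes integer values, and the yields and variety names are invented so that one value is ordinary overall but odd within its own group.

```python
from collections import Counter, defaultdict

def dot_rows(values, lo, hi):
    """Return the rows of a stacked text dot plot on the integer axis lo..hi."""
    counts = Counter(values)
    height = max(counts.values())
    rows = []
    for level in range(height, 0, -1):      # top row of the stack first
        row = "".join("o" if counts[v] >= level else " " for v in range(lo, hi + 1))
        rows.append(row.rstrip())
    return rows

# Hypothetical yields; the 30 in "local" stands in for a typing error.
yields = [10, 11, 11, 12, 30, 28, 29, 30, 30, 31, 32]
variety = ["local"] * 5 + ["improved"] * 6

by_variety = defaultdict(list)
for y, v in zip(yields, variety):
    by_variety[v].append(y)

lo, hi = min(yields), max(yields)           # shared axis so groups are comparable
for name in sorted(by_variety):
    print(name)
    for row in dot_rows(by_variety[name], lo, hi):
        print("  " + row)
```

    Plotted together, a value of 30 sits among the other values; plotted by variety, it stands alone at the far end of the "local" row.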

    EDA is a continuous process

    EDA effectively is a continuation of the data checking process

    The example on the previous slide shows how some oddities only become clear once the analysis is undertaken

    This continues into the formal analysis, where it involves looking at the residuals

    They are the unexplained variation, as discussed in Session 3!

    So analysis is not just a set of rules

    It is a thoughtful process, where you become the data detective!

    Swaziland data was for checking

    Investigating the column called Presence

    What does 0 mean?

    Why are there blanks?

    Next steps:

    1. Look at the questionnaire

    2. Select these records

    You are becoming detectives!

    Codes for the column

    Seems clear enough. Zeros and blanks are still a puzzle

    Selecting the blank records

    Missing also

    Too young, and all the same

    Crop code not recognised

    Areas too large

    i.e. serious problems with the whole record
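    Selecting the records with a blank or unexpected code, so that the whole record can be inspected, might look like this in Python. The function select_suspect, the field names, the set of valid codes and the records are all hypothetical; the real codes are those in the questionnaire.

```python
def select_suspect(records, column, valid_codes):
    """Return records whose `column` is blank or holds a code not in valid_codes."""
    return [r for r in records if r.get(column) not in valid_codes]

# Hypothetical codes and records, invented for illustration only.
VALID_PRESENCE = {1, 2, 3}
records = [
    {"id": 1, "presence": 1, "area": 0.5},
    {"id": 2, "presence": None, "area": 250.0},   # blank
    {"id": 3, "presence": 0, "area": 1.2},        # zero is not a listed code
    {"id": 4, "presence": 2, "area": 0.8},
    {"id": 5, "presence": "", "area": 300.0},     # blank read as empty text
]

suspect = select_suspect(records, "presence", VALID_PRESENCE)
for r in suspect:
    print(r)
```

    Printing the full record, not just the odd column, is what lets the other problems in the same rows show up.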

    Dot plot of area by Presence

    Odd crop areas were ALL associated with odd codes for the column PRESENCE

    It was found to be a data transfer problem, with one byte missing in these records

    Checking data quality and EDA

    Where               Why                                    How                     By Whom
    Before data entry   To ensure complete data set received   Manual check            Supervisor
    During data entry   To highlight anomalies                 Filter, dot plots etc   Supervisor and helpers
    Before analysis     Double check                           As above                Analyst/statistician
    During analysis     Remain critical                        Residuals               Analyst/statistician

    Importance: principles of official statistics

    Principle 2: Professional standards

    It is unprofessional to analyse the data and report results without exploring critically at all stages

    Principle 4: Prevention of misuse

    We risk misusing the data unless we explore the data critically

    Principle 5: Sources of statistics

    Includes a requirement to avoid undue burden on respondents

    We must process the data fully and effectively. This needs EDA

    Otherwise the burden imposed on respondents is to some extent wasted

    Can you now:

    Apply EDA concepts to a large dataset

    Explain the importance of EDA for data checking and at the start of the analysis

    Relate EDA to the principles of official statistics

    Now you can organise the data for analysis

    And then do an exploratory analysis

    We show next how the analysis is easy IF your objectives are clear