8/7/2019 Presentation Fbook Version
1/22
2-1
Chapter 2
Examining YourData
8/7/2019 Presentation Fbook Version
2/22
2-2
LEARNING OBJECTIVESLEARNING OBJECTIVES
Upon completing this chapter, you should be able to do theUpon completing this chapter, you should be able to do the
following:following:
Select the appropriate graphical method to examine theSelect the appropriate graphical method to examine thecharacteristics of the data or relationships of interest.characteristics of the data or relationships of interest. Assess the type and potential impact of missing data.Assess the type and potential impact of missing data. Understand the different types of missing data processes.Understand the different types of missing data processes.
Explain the advantages and disadvantages of theExplain the advantages and disadvantages of theapproaches available for dealing with missing data.approaches available for dealing with missing data.
Chapter 2 Examining Your DataChapter 2 Examining Your DataChapter 2 Examining Your DataChapter 2 Examining Your Data
8/7/2019 Presentation Fbook Version
3/22
2-3
LEARNING OBJECTIVES continued . . .LEARNING OBJECTIVES continued . . .
Upon completing this chapter, you should be able to do theUpon completing this chapter, you should be able to do the
following:following:
IdentifyIdentify univariateunivariate,, bivariatebivariate, and multivariate outliers., and multivariate outliers. Test your data for the assumptions underlying mostTest your data for the assumptions underlying most
multivariate techniques.multivariate techniques.
Understand Transformation.Understand Transformation.
Chapter 2 Examining Your DataChapter 2 Examining Your DataChapter 2 Examining Your DataChapter 2 Examining Your Data
8/7/2019 Presentation Fbook Version
4/22
2-4
Examination PhasesExamination Phases
Graphical examination.Graphical examination.
Identify and evaluate missing values.Identify and evaluate missing values.
Identify and deal with outliers.Identify and deal with outliers.
Check whether statistical assumptions are met.Check whether statistical assumptions are met.
Develop a preliminary understanding of your data.Develop a preliminary understanding of your data.
8/7/2019 Presentation Fbook Version
5/22
2-5
Shape:Shape:
HistogramHistogram
Bar ChartBar Chart Box & Whisker plotBox & Whisker plot
Stem and Leaf plotStem and Leaf plot
Relationships:Relationships: ScatterplotScatterplot
OutliersOutliers
Graphical ExaminationGraphical Examination
8/7/2019 Presentation Fbook Version
6/22
2-6
Missing DataMissing Data
Missing Data = information not available for a subject (orMissing Data = information not available for a subject (orcase) about whom other information is available. Typicallycase) about whom other information is available. Typicallyoccurs when respondent fails to answer one or moreoccurs when respondent fails to answer one or more
questions in a survey.questions in a survey.
Systematic?Systematic?
Random?Random?
Researchers Concern =Researchers Concern = to identify the patterns andto identify the patterns andrelationships underlying the missing data in order torelationships underlying the missing data in order to
maintain as close as possible to the original distribution ofmaintain as close as possible to the original distribution of
values when any remedy is applied.values when any remedy is applied.
Impact . . .Impact . . . Reduces sample size available for analysis.Reduces sample size available for analysis.
Can distort results.Can distort results.
8/7/2019 Presentation Fbook Version
7/22
2-7
FourFour--Step Process forStep Process for
IdentifyingMissing DataIdentifyingMissing Data
Step 1: Determine the Type of Missing DataStep 1: Determine the Type of Missing Data
Step 2: Determine the Extent of Missing DataStep 2: Determine the Extent of Missing Data
Step 3: Diagnose the Randomness of the MissingStep 3: Diagnose the Randomness of the Missing
Data ProcessesData Processes
Step 4: Select the Imputation MethodStep 4: Select the Imputation Method
8/7/2019 Presentation Fbook Version
8/22
2-8
Missing Data
Missing Data
Strategies for handling missing data . . .Strategies for handling missing data . . .
use observations with complete datause observations with complete data
only;only;
delete case(s) and/or variable(s);delete case(s) and/or variable(s);
estimate missing values.estimate missing values.
8/7/2019 Presentation Fbook Version
9/22
2-9
Rules of Thumb 2Rules of Thumb 211
How Much Missing Data Is Too Much?How Much Missing Data Is Too Much?
yy Missing data under 10% for an individual case orMissing data under 10% for an individual case or
observation can generally be ignored, exceptobservation can generally be ignored, except
when the missing data occurs in a specificwhen the missing data occurs in a specific
nonrandom fashion (e.g., concentration in anonrandom fashion (e.g., concentration in a
specific set of questions, attrition at the end ofspecific set of questions, attrition at the end of
the questionnaire, etc.).the questionnaire, etc.).
yy The number of cases with no missing data mustThe number of cases with no missing data must
be sufficient for the selected analysis technique ifbe sufficient for the selected analysis technique if
replacement values will not be substitutedreplacement values will not be substituted
(imputed) for the missing data.(imputed) for the missing data.
8/7/2019 Presentation Fbook Version
10/22
2-10
Rules of Thumb 2Rules of Thumb 233
Imputation of Missing DataImputation of Missing Data
yy Under 10%Under 10% Any of the imputation methods can be applied whenAny of the imputation methods can be applied when
missing data is this low, although the complete casemissing data is this low, although the complete case
method has been shown to be the least preferred.method has been shown to be the least preferred.
yy 10 to 20%10 to 20% The increased presence of missing data makes the allThe increased presence of missing data makes the allavailable, hot deck case substitution and regressionavailable, hot deck case substitution and regression
methods most preferred for MCAR data, while modelmethods most preferred for MCAR data, while model--
based methods are necessary with MAR missing databased methods are necessary with MAR missing data
processesprocesses
yy Over 20%Over 20% If it is necessary to impute missing data when the level isIf it is necessary to impute missing data when the level isover 20%, the preferred methods are:over 20%, the preferred methods are:
oo the regression method for MCAR situations, andthe regression method for MCAR situations, and
oo modelmodel--based methods when MAR missing data occurs.based methods when MAR missing data occurs.
8/7/2019 Presentation Fbook Version
11/22
2-11
Outlier = an observation/response with a uniqueOutlier = an observation/response with a unique
combination of characteristics identifiablecombination of characteristics identifiable
as distinctly different from the otheras distinctly different from the other
observations/responses.observations/responses.
Issue: Is the observation/response representativeIssue: Is the observation/response representative
of the population?of the population?
OutlierOutlier
8/7/2019 Presentation Fbook Version
12/22
2-12
Why Do OutliersOccur?Why Do OutliersOccur?
Procedural Error.Procedural Error.
Extraordinary Event.Extraordinary Event.
Extraordinary Observations.Extraordinary Observations.
Observations unique in theirObservations unique in their
combination of values.combination of values.
8/7/2019 Presentation Fbook Version
13/22
2-13
Dealing with OutliersDealing with Outliers
Identify outliers.Identify outliers.
Describe outliers.Describe outliers.
Delete or Retain?Delete or Retain?
8/7/2019 Presentation Fbook Version
14/22
2-14
IdentifyingOutliersIdentifyingOutliers
Standardize data and then identify outliers in termsStandardize data and then identify outliers in termsof number of standard deviations.of number of standard deviations.
Examine data using Box Plots, Stem & Leaf, andExamine data using Box Plots, Stem & Leaf, andScatterplots.Scatterplots.
Multivariate detection (DMultivariate detection (D22).).
8/7/2019 Presentation Fbook Version
15/22
2-15
Rules of Thumb 2Rules of Thumb 244
Outlier DetectionOutlier Detectionyy Univariate methodsUnivariate methods examine all metric variables to identify unique or extremeexamine all metric variables to identify unique or extreme
observations.observations.
yy For small samples (80 or fewer observations), outliers typically are defined as casesFor small samples (80 or fewer observations), outliers typically are defined as cases
with standard scores of 2.5 or greater.with standard scores of 2.5 or greater.
yy For larger sample sizes, increase the threshold value of standard scores up to 4.For larger sample sizes, increase the threshold value of standard scores up to 4.
yy If standard scores are not used, identify cases falling outside the ranges of 2.5If standard scores are not used, identify cases falling outside the ranges of 2.5
versus 4 standard deviations, depending on the sample size.versus 4 standard deviations, depending on the sample size.
yy Bivariate methodsBivariate methods focus their use on specific variable relationships, such as thefocus their use on specific variable relationships, such as the
independent versus dependent variables:independent versus dependent variables:
oo use scatterplots with confidence intervals at a specified Alpha level.use scatterplots with confidence intervals at a specified Alpha level.
yy Multivariate methodsMultivariate methods best suited for examining a complete variate, such as thebest suited for examining a complete variate, such as the
independent variables in regression or the variables in factor analysis:independent variables in regression or the variables in factor analysis:
oo threshold levels for the D2/df measure should be very conservative (.005 orthreshold levels for the D2/df measure should be very conservative (.005 or
.001), resulting in values of 2.5 (small samples) versus 3 or 4 in larger samples..001), resulting in values of 2.5 (small samples) versus 3 or 4 in larger samples.
8/7/2019 Presentation Fbook Version
16/22
2-16
Multivariate Assumptions
Multivariate Assumptions
NormalityNormality
LinearityLinearity
HomoscedasticityHomoscedasticity
NonNon--correlated Errorscorrelated Errors
Data Transformations?Data Transformations?
8/7/2019 Presentation Fbook Version
17/22
2-17
Testing AssumptionsTesting Assumptions
Normality assumptionsNormality assumptions Visual check of histogram.Visual check of histogram.
Kurtosis.Kurtosis.
Normal probability plot.Normal probability plot.
HomoscedasticityHomoscedasticity Equal variances across independentEqual variances across independent
variables.variables.
LeveneLevene test (test (univariateunivariate).).
Boxs M (multivariate).Boxs M (multivariate).
8/7/2019 Presentation Fbook Version
18/22
2-18
Rules of Thumb 2Rules of Thumb 255
Testing Statistical AssumptionsTesting Statistical Assumptionsyy Normality can have serious effects in small samples (less than 50cases), butNormality can have serious effects in small samples (less than 50cases), but
the impact effectively diminishes when sample sizes reach 200 cases orthe impact effectively diminishes when sample sizes reach 200 cases or
more.more.
yy Most cases of heteroscedasticity are a result of nonMost cases of heteroscedasticity are a result of non--normality in one ornormality in one or
more variables. Thus, remedying normality may not be needed due tomore variables. Thus, remedying normality may not be needed due to
sample size, but may be needed to equalize the variance.sample size, but may be needed to equalize the variance.
yy Nonlinear relationships can be very well defined, but seriously understatedNonlinear relationships can be very well defined, but seriously understated
unless the data is transformed to a linear pattern or explicit modelunless the data is transformed to a linear pattern or explicit model
components are used to represent the nonlinear portion of the relationship.components are used to represent the nonlinear portion of the relationship.
yy Correlated errors arise from a process that must be treated much likeCorrelated errors arise from a process that must be treated much like
missing data. That is, the researcher must first define the causes amongmissing data. That is, the researcher must first define the causes amongvariables either internal or external to the dataset. If they are not found andvariables either internal or external to the dataset. If they are not found and
remedied, serious biases can occur in the results, many times unknown toremedied, serious biases can occur in the results, many times unknown to
the researcher.the researcher.
8/7/2019 Presentation Fbook Version
19/22
2-19
Data Transformations ?Data Transformations ?
Data transformations . . . provide a means ofData transformations . . . provide a means of
modifying variables for one of two reasons:modifying variables for one of two reasons:
1.1. To correct violations of the statisticalTo correct violations of the statisticalassumptions underlying the multivariateassumptions underlying the multivariate
techniques, ortechniques, or
2.2. To improve the relationship (correlation)To improve the relationship (correlation)
between the variables.between the variables.
8/7/2019 Presentation Fbook Version
20/22
2-20
Examining DataExamining Data
Learning CheckpointLearning Checkpoint
1.1. Why examine your data?Why examine your data?
2.2. What are the principal aspects of dataWhat are the principal aspects of datathat need to be examined?that need to be examined?
3.3. What approaches would you use?What approaches would you use?
8/7/2019 Presentation Fbook Version
21/22
SPSS
CLASS 2: EXPLORING DATA
8/7/2019 Presentation Fbook Version
22/22
Exploring data
Step 1: Check for Missing Data
Step 2: Check for Outliers
Step 3: Check Assumptions
Top Related