Download - Presentation Fbook Version

8/7/2019 Presentation Fbook Version

1/22

2-1

Chapter 2

Examining YourData


2/22

2-2

LEARNING OBJECTIVESLEARNING OBJECTIVES

Upon completing this chapter, you should be able to do theUpon completing this chapter, you should be able to do the

following:following:

Select the appropriate graphical method to examine theSelect the appropriate graphical method to examine thecharacteristics of the data or relationships of interest.characteristics of the data or relationships of interest. Assess the type and potential impact of missing data.Assess the type and potential impact of missing data. Understand the different types of missing data processes.Understand the different types of missing data processes.

Explain the advantages and disadvantages of theExplain the advantages and disadvantages of theapproaches available for dealing with missing data.approaches available for dealing with missing data.

Chapter 2 Examining Your DataChapter 2 Examining Your DataChapter 2 Examining Your DataChapter 2 Examining Your Data


3/22

2-3

LEARNING OBJECTIVES continued . . .LEARNING OBJECTIVES continued . . .

Upon completing this chapter, you should be able to do theUpon completing this chapter, you should be able to do the

following:following:

IdentifyIdentify univariateunivariate,, bivariatebivariate, and multivariate outliers., and multivariate outliers. Test your data for the assumptions underlying mostTest your data for the assumptions underlying most

multivariate techniques.multivariate techniques.

Understand Transformation.Understand Transformation.

Chapter 2 Examining Your DataChapter 2 Examining Your DataChapter 2 Examining Your DataChapter 2 Examining Your Data


4/22

2-4

Examination PhasesExamination Phases

Graphical examination.Graphical examination.

Identify and evaluate missing values.Identify and evaluate missing values.

Identify and deal with outliers.Identify and deal with outliers.

Check whether statistical assumptions are met.Check whether statistical assumptions are met.

Develop a preliminary understanding of your data.Develop a preliminary understanding of your data.


5/22

2-5

Shape:Shape:

HistogramHistogram

Bar ChartBar Chart Box & Whisker plotBox & Whisker plot

Stem and Leaf plotStem and Leaf plot

Relationships:Relationships: ScatterplotScatterplot

OutliersOutliers

Graphical ExaminationGraphical Examination


6/22

2-6

Missing DataMissing Data

Missing Data = information not available for a subject (orMissing Data = information not available for a subject (orcase) about whom other information is available. Typicallycase) about whom other information is available. Typicallyoccurs when respondent fails to answer one or moreoccurs when respondent fails to answer one or more

questions in a survey.questions in a survey.

Systematic?Systematic?

Random?Random?

Researchers Concern =Researchers Concern = to identify the patterns andto identify the patterns andrelationships underlying the missing data in order torelationships underlying the missing data in order to

maintain as close as possible to the original distribution ofmaintain as close as possible to the original distribution of

values when any remedy is applied.values when any remedy is applied.

Impact . . .Impact . . . Reduces sample size available for analysis.Reduces sample size available for analysis.

Can distort results.Can distort results.


7/22

2-7

FourFour--Step Process forStep Process for

IdentifyingMissing DataIdentifyingMissing Data

Step 1: Determine the Type of Missing DataStep 1: Determine the Type of Missing Data

Step 2: Determine the Extent of Missing DataStep 2: Determine the Extent of Missing Data

Step 3: Diagnose the Randomness of the MissingStep 3: Diagnose the Randomness of the Missing

Data ProcessesData Processes

Step 4: Select the Imputation MethodStep 4: Select the Imputation Method


8/22

2-8

Missing Data

Missing Data

Strategies for handling missing data . . .Strategies for handling missing data . . .

use observations with complete datause observations with complete data

only;only;

delete case(s) and/or variable(s);delete case(s) and/or variable(s);

estimate missing values.estimate missing values.


9/22

2-9

Rules of Thumb 2Rules of Thumb 211

How Much Missing Data Is Too Much?How Much Missing Data Is Too Much?

yy Missing data under 10% for an individual case orMissing data under 10% for an individual case or

observation can generally be ignored, exceptobservation can generally be ignored, except

when the missing data occurs in a specificwhen the missing data occurs in a specific

nonrandom fashion (e.g., concentration in anonrandom fashion (e.g., concentration in a

specific set of questions, attrition at the end ofspecific set of questions, attrition at the end of

the questionnaire, etc.).the questionnaire, etc.).

yy The number of cases with no missing data mustThe number of cases with no missing data must

be sufficient for the selected analysis technique ifbe sufficient for the selected analysis technique if

replacement values will not be substitutedreplacement values will not be substituted

(imputed) for the missing data.(imputed) for the missing data.


10/22

2-10


Imputation of Missing DataImputation of Missing Data

yy Under 10%Under 10% Any of the imputation methods can be applied whenAny of the imputation methods can be applied when

missing data is this low, although the complete casemissing data is this low, although the complete case

method has been shown to be the least preferred.method has been shown to be the least preferred.

yy 10 to 20%10 to 20% The increased presence of missing data makes the allThe increased presence of missing data makes the allavailable, hot deck case substitution and regressionavailable, hot deck case substitution and regression

methods most preferred for MCAR data, while modelmethods most preferred for MCAR data, while model--

based methods are necessary with MAR missing databased methods are necessary with MAR missing data

processesprocesses

yy Over 20%Over 20% If it is necessary to impute missing data when the level isIf it is necessary to impute missing data when the level isover 20%, the preferred methods are:over 20%, the preferred methods are:

oo the regression method for MCAR situations, andthe regression method for MCAR situations, and

oo modelmodel--based methods when MAR missing data occurs.based methods when MAR missing data occurs.


11/22

2-11

Outlier = an observation/response with a uniqueOutlier = an observation/response with a unique

combination of characteristics identifiablecombination of characteristics identifiable

as distinctly different from the otheras distinctly different from the other

observations/responses.observations/responses.

Issue: Is the observation/response representativeIssue: Is the observation/response representative

of the population?of the population?

OutlierOutlier


12/22

2-12

Why Do OutliersOccur?Why Do OutliersOccur?

Procedural Error.Procedural Error.

Extraordinary Event.Extraordinary Event.

Extraordinary Observations.Extraordinary Observations.

Observations unique in theirObservations unique in their

combination of values.combination of values.


13/22

2-13

Dealing with OutliersDealing with Outliers

Identify outliers.Identify outliers.

Describe outliers.Describe outliers.

Delete or Retain?Delete or Retain?


14/22

2-14

IdentifyingOutliersIdentifyingOutliers

Standardize data and then identify outliers in termsStandardize data and then identify outliers in termsof number of standard deviations.of number of standard deviations.

Examine data using Box Plots, Stem & Leaf, andExamine data using Box Plots, Stem & Leaf, andScatterplots.Scatterplots.

Multivariate detection (DMultivariate detection (D22).).


15/22

2-15


Outlier DetectionOutlier Detectionyy Univariate methodsUnivariate methods examine all metric variables to identify unique or extremeexamine all metric variables to identify unique or extreme

observations.observations.

yy For small samples (80 or fewer observations), outliers typically are defined as casesFor small samples (80 or fewer observations), outliers typically are defined as cases

with standard scores of 2.5 or greater.with standard scores of 2.5 or greater.

yy For larger sample sizes, increase the threshold value of standard scores up to 4.For larger sample sizes, increase the threshold value of standard scores up to 4.

yy If standard scores are not used, identify cases falling outside the ranges of 2.5If standard scores are not used, identify cases falling outside the ranges of 2.5

versus 4 standard deviations, depending on the sample size.versus 4 standard deviations, depending on the sample size.

yy Bivariate methodsBivariate methods focus their use on specific variable relationships, such as thefocus their use on specific variable relationships, such as the

independent versus dependent variables:independent versus dependent variables:

oo use scatterplots with confidence intervals at a specified Alpha level.use scatterplots with confidence intervals at a specified Alpha level.

yy Multivariate methodsMultivariate methods best suited for examining a complete variate, such as thebest suited for examining a complete variate, such as the

independent variables in regression or the variables in factor analysis:independent variables in regression or the variables in factor analysis:

oo threshold levels for the D2/df measure should be very conservative (.005 orthreshold levels for the D2/df measure should be very conservative (.005 or

.001), resulting in values of 2.5 (small samples) versus 3 or 4 in larger samples..001), resulting in values of 2.5 (small samples) versus 3 or 4 in larger samples.


16/22

2-16

Multivariate Assumptions

Multivariate Assumptions

NormalityNormality

LinearityLinearity

HomoscedasticityHomoscedasticity

NonNon--correlated Errorscorrelated Errors

Data Transformations?Data Transformations?


17/22

2-17

Testing AssumptionsTesting Assumptions

Normality assumptionsNormality assumptions Visual check of histogram.Visual check of histogram.

Kurtosis.Kurtosis.

Normal probability plot.Normal probability plot.

HomoscedasticityHomoscedasticity Equal variances across independentEqual variances across independent

variables.variables.

LeveneLevene test (test (univariateunivariate).).

Boxs M (multivariate).Boxs M (multivariate).


18/22

2-18


Testing Statistical AssumptionsTesting Statistical Assumptionsyy Normality can have serious effects in small samples (less than 50cases), butNormality can have serious effects in small samples (less than 50cases), but

the impact effectively diminishes when sample sizes reach 200 cases orthe impact effectively diminishes when sample sizes reach 200 cases or

more.more.

yy Most cases of heteroscedasticity are a result of nonMost cases of heteroscedasticity are a result of non--normality in one ornormality in one or

more variables. Thus, remedying normality may not be needed due tomore variables. Thus, remedying normality may not be needed due to

sample size, but may be needed to equalize the variance.sample size, but may be needed to equalize the variance.

yy Nonlinear relationships can be very well defined, but seriously understatedNonlinear relationships can be very well defined, but seriously understated

unless the data is transformed to a linear pattern or explicit modelunless the data is transformed to a linear pattern or explicit model

components are used to represent the nonlinear portion of the relationship.components are used to represent the nonlinear portion of the relationship.

yy Correlated errors arise from a process that must be treated much likeCorrelated errors arise from a process that must be treated much like

missing data. That is, the researcher must first define the causes amongmissing data. That is, the researcher must first define the causes amongvariables either internal or external to the dataset. If they are not found andvariables either internal or external to the dataset. If they are not found and

remedied, serious biases can occur in the results, many times unknown toremedied, serious biases can occur in the results, many times unknown to

the researcher.the researcher.


19/22

2-19

Data Transformations ?Data Transformations ?

Data transformations . . . provide a means ofData transformations . . . provide a means of

modifying variables for one of two reasons:modifying variables for one of two reasons:

1.1. To correct violations of the statisticalTo correct violations of the statisticalassumptions underlying the multivariateassumptions underlying the multivariate

techniques, ortechniques, or

2.2. To improve the relationship (correlation)To improve the relationship (correlation)

between the variables.between the variables.


20/22

2-20

Examining DataExamining Data

Learning CheckpointLearning Checkpoint

1.1. Why examine your data?Why examine your data?

2.2. What are the principal aspects of dataWhat are the principal aspects of datathat need to be examined?that need to be examined?

3.3. What approaches would you use?What approaches would you use?


21/22

SPSS

CLASS 2: EXPLORING DATA


22/22

Exploring data

Step 1: Check for Missing Data

Step 2: Check for Outliers

Step 3: Check Assumptions