PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008

Transcript
Page 1

PSY 1950
Outliers, Missing Data, and Transformations
September 22, 2008

Page 2

On Suspecting Fishiness
• Looking for outliers, gaps, and dips
  – e.g., tests of clairvoyance
• When gaps or dips are hypothesized
  – e.g., is dyslexia a distinct entity
• Cliffs
  – e.g., differences between rating of ingroup and outgroup
• Peaks
  – e.g., the blackout and baby boom
• The occurrence of impossible scores

Page 3

Visualize your data!
• “make friends with your data”
  – Rosenthal
• “don’t become lovers with your data”
  – Me
• Statistics condense data
• View raw data graphically
  – Frequency distribution graphs
  – Scatter plots

Page 4
Page 5

Outliers
• Extreme scores
• Come from samples other than those of interest
• Can lead to Type I and II errors

Page 6

Outlier Detection
• Graph
  – Box plots
  – Scatter plots
• Numerical criterion
  – Extremity (central tendency +/- spread)
    • Outside fences
      – lower: Q1 - 3(Q3 - Q1)
      – upper: Q3 + 3(Q3 - Q1)
    • z-score
  – Probability (Extremity + # measurements)
    • Chauvenet’s/Peirce’s criterion, Grubbs’ test
  – Absolute cutoff
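The two extremity criteria above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own procedure: the RT values are hypothetical (the 1200 ms score echoes the simple-RT example used elsewhere in the deck), the quartiles use the "inclusive" interpolation rule, and the |z| > 2.5 threshold is an arbitrary choice.

```python
import statistics

def outer_fences(scores):
    """Return (lower, upper) fences: Q1 - 3(Q3 - Q1) and Q3 + 3(Q3 - Q1)."""
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 3 * iqr, q3 + 3 * iqr

def z_scores(scores):
    """Number of sample standard deviations each score lies from the mean."""
    m = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return [(x - m) / sd for x in scores]

# Hypothetical simple RTs in ms (assumption, not data from the slides).
rts = [310, 325, 340, 355, 360, 372, 380, 395, 410, 1200]

lower, upper = outer_fences(rts)
flagged_by_fence = [x for x in rts if x < lower or x > upper]

# The 2.5 threshold is an arbitrary illustration, not a recommended value.
flagged_by_z = [x for x, z in zip(rts, z_scores(rts)) if abs(z) > 2.5]

print(lower, upper)
print(flagged_by_fence)
print(flagged_by_z)
```

Note that the extreme score inflates the mean and SD used to compute its own z-score, so in small samples a z-score criterion can be less sensitive than a quartile-based fence.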

Page 7

Outlier Analysis
• Determine nature of impact
  – Quantitative
    • Changes numbers, not inferences
  – Qualitative
    • Changes numbers and inferences
• Consider source of outlier
  – Quantitative
    • Same underlying mechanism/sample
  – Qualitative
    • Different underlying mechanisms/samples
      – e.g., digit span = 107, simple RT = 1200 ms

Page 8

Outlier Coping
• Options
  – Retain
  – Remove
  – Reduce
    • Winsorize
    • Normalizing transformation
• Considerations
  – Impact/Source
  – Convention
  – Believability
    • Justification
    • Replication
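As a sketch of the "reduce" option, one common form of Winsorizing replaces the k most extreme scores in each tail with the nearest retained score. The function name, the choice of k, and the data are assumptions made for illustration; other variants clamp at a percentile or at a fixed number of SDs instead.

```python
def winsorize(scores, k=1):
    """Replace the k lowest and k highest scores with their nearest retained neighbors."""
    if k < 0 or 2 * k >= len(scores):
        raise ValueError("k must satisfy 0 <= 2*k < len(scores)")
    ordered = sorted(scores)
    low, high = ordered[k], ordered[-k - 1]
    # Clamp every score into [low, high]; the original order is preserved.
    return [min(max(x, low), high) for x in scores]

# Hypothetical simple RTs in ms (assumption, not data from the slides).
rts = [310, 325, 340, 355, 360, 372, 380, 395, 410, 1200]
print(winsorize(rts, k=1))
```

Unlike removal, this keeps the sample size intact while limiting how far any single score can pull the mean.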


Page 9

Transformations
• Linear “rescaling”
  – unit conversion
    • e.g., # items correct, # items wrong
    • e.g., standardization
• Curvilinear “reexpression”
  – variable conversion
    • e.g., time (sec/trial) to speed (trials/sec)
    • e.g., normalization

Page 10

Standardization
• Why standardize data?
  – Intra-distribution statistics
    • You got 8 questions wrong on one exam
    • You were one standard deviation below the mean
  – Inter-distribution statistics
    • You got 8 questions wrong on the midterm and 5 questions wrong on the final
    • Aggregation: Overall, you were one standard deviation below the mean
    • Comparison: You did better on the midterm than the final

Page 11

z-score
• # standard deviations above/below the mean

Page 12
Page 13

Test Performance

      raw-score   z-score   IQ-scale
         20        -2.48       75.2
         25        -1.01       89.9
         25        -1.01       89.9
         26        -0.71       92.9
         27        -0.42       95.8
         27        -0.42       95.8
         27        -0.42       95.8
         28        -0.12       98.8
         28        -0.12       98.8
         29         0.17      101.7
         30         0.47      104.7
         31         0.76      107.6
         31         0.76      107.6
         31         0.76      107.6
         32         1.06      110.6
         33         1.35      113.5
         33         1.35      113.5

M        28.41      0.00      100.0
SD        3.39      1.00       10.0
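The z-score and IQ-scale columns of the Test Performance table can be reproduced from the raw scores. The sketch below assumes the sample SD (n - 1 denominator), which is what yields the table's SD of 3.39, and assumes the "IQ-scale" column is simply 100 + 10z, as implied by its stated mean of 100 and SD of 10.

```python
import statistics

# Raw scores taken from the Test Performance table.
raw = [20, 25, 25, 26, 27, 27, 27, 28, 28, 29, 30, 31, 31, 31, 32, 33, 33]

mean = statistics.mean(raw)
sd = statistics.stdev(raw)  # sample SD (n - 1 denominator)

# Linear rescaling 1: standardize to mean 0, SD 1.
z = [(x - mean) / sd for x in raw]

# Linear rescaling 2: move the z-scores to a scale with mean 100, SD 10.
rescaled = [100 + 10 * zi for zi in z]

for x, zi, ri in zip(raw, z, rescaled):
    print(f"{x:>3} {zi:6.2f} {ri:6.1f}")
print(f"M  {mean:.2f}  SD {sd:.2f}")
```

Both steps are linear, so the shape of the distribution and the rank order of the scores are unchanged; only the units differ.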

Page 14

Normal Distributions
• “…normality is a myth; there never was, and never will be, a normal distribution.”
  – Geary (1947)
• “Experimentalists think that it is a mathematical theorem while the mathematicians believe it to be an experimental fact.”
  – Lippmann (1917)

Page 15

Normalization
• Why normalize DV?
  – Meet statistical assumption of normality in situations when it matters
    • Small n
    • Unequal n
    • One-sample t and z tests
  – Increase power
• Why NOT normalize DV?
  – Interpretability
  – Affects measurement scale

Page 16

Tests of Normality
• Frequency distribution
• Skew/kurtosis statistics
• Kolmogorov-Smirnov test
• Probability plots (e.g., P-P plot)

Page 17

Types of Curvilinear Transformations
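The figures for this slide did not survive extraction. As a rough stand-in, the sketch below applies three commonly used curvilinear reexpressions (square root, log, and negative inverse) to hypothetical positively skewed scores and reports a skew statistic for each. The data, the seed, and the particular set of transformations are assumptions, not a reconstruction of the lost figures.

```python
import math
import random
import statistics

def skewness(values):
    """Skewness as the mean cubed z-score (population form, no small-sample correction)."""
    m = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(((x - m) / sd) ** 3 for x in values) / len(values)

random.seed(1950)  # arbitrary seed so the sketch is repeatable

# Hypothetical positively skewed scores (log-normal), e.g. standing in for RTs.
scores = [random.lognormvariate(6.0, 0.5) for _ in range(500)]

reexpressions = {
    "raw": scores,
    "square root": [math.sqrt(x) for x in scores],
    "log": [math.log(x) for x in scores],
    "negative inverse": [-1.0 / x for x in scores],  # negated so rank order is preserved
}

skews = {name: skewness(vals) for name, vals in reexpressions.items()}
for name, value in skews.items():
    print(f"{name:>16}: skew = {value:5.2f}")
```

Each step down this list pulls in the right tail more strongly, so a transformation that is too strong for the data can overshoot and produce negative skew.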

Page 18

Does normalization help?
• Games & Lucas (1966): Normalizing transformations hurt
  – Reduce interpretability, power
• Levine & Dunlap (1982): Transformations help
  – Increase power
• Games (1983): It depends, Levine and Dunlap are stupid
• Levine & Dunlap (1984): It depends, Games is stupid
• Games (1984): This debate is stupid

Page 19

Does non-normality hurt?

Page 20

Normalize If and Only If
• It matters
  – In theory: Got robust?
  – In practice: Got change?
• Must assume normality (i.e., no non-parametric test available)

Page 21

Missing Data

Page 22

Why are they missing?
– MCAR (missing completely at random)
  • Variable’s missingness unrelated to both its value and other variables’ values
  • e.g., equipment malfunction
  • No bias
– MAR (missing at random)
  • Variable’s missingness unrelated to its value after controlling for its relation to other variables
  • e.g., depression and income
  • Bias
– MNAR (missing not at random)
  • Variable’s missingness related to its value after controlling for its relation to other variables
  • e.g., income reporting
  • Bias
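A small simulation can make the bias claims concrete. The sketch below is an illustration under stated assumptions: the income distribution, the 30% MCAR rate, and the `p_missing` function (a hypothetical rule in which lower incomes are more likely to go unreported) are all invented for the example.

```python
import random
import statistics

random.seed(22)  # arbitrary seed for repeatability

# Hypothetical incomes (in $1000s) for 10,000 respondents.
incomes = [random.gauss(50, 15) for _ in range(10_000)]

# MCAR: every income has the same 30% chance of being missing.
mcar = [x for x in incomes if random.random() > 0.30]

def p_missing(income):
    """Hypothetical MNAR rule: the lower the income, the likelier it is unreported."""
    return min(max((80 - income) / 100, 0.0), 0.9)

# MNAR: missingness depends on the value of the variable itself.
mnar = [x for x in incomes if random.random() > p_missing(x)]

true_mean = statistics.mean(incomes)
mcar_mean = statistics.mean(mcar)
mnar_mean = statistics.mean(mnar)

print(f"full sample: {true_mean:.1f}")
print(f"MCAR:        {mcar_mean:.1f}")
print(f"MNAR:        {mnar_mean:.1f}")
```

Under MCAR the observed mean stays close to the full-sample mean, because the observed cases are a random subsample. Under this MNAR rule the observed mean is too high, because low earners are underrepresented.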

Page 23

Diagnosing Missing Data
• How much?
• How concentrated?
• How essential?
• MCAR, MAR, MNAR?
• How influential?

Page 24

Dealing with Missing Data
– Treat missing data as data
– Note bias
  • “lower income individuals are underrepresented”
– Delete variables
– Delete cases
  • Listwise
  • Pairwise
– Estimation
  • Prior knowledge
  • Mean substitution
  • Regression substitution
  • Expectation-maximization (EM)
  • Hot decking
  • Multiple imputation (MI)
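To illustrate one trade-off among the estimation options, the sketch below compares deleting the missing cases with mean substitution on a tiny hypothetical variable (missing values coded as `None`). It is only a toy example. Mean substitution keeps the sample size and the mean but shrinks the variability, which is one reason more elaborate methods such as EM and MI exist.

```python
import statistics

# Hypothetical scores; None marks a missing value (assumption for illustration).
scores = [4, None, 7, 9, None, 5, 10]

# Deletion: analyze only the cases that were observed.
observed = [x for x in scores if x is not None]

# Mean substitution: fill each gap with the mean of the observed scores.
fill = statistics.mean(observed)
imputed = [fill if x is None else x for x in scores]

print(len(observed), statistics.mean(observed), round(statistics.stdev(observed), 2))
print(len(imputed), statistics.mean(imputed), round(statistics.stdev(imputed), 2))
```

The imputed scores sit exactly at the mean, so they add cases without adding any spread.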

Page 25

Missing Data: Conclusions
• Avoid missing data!
• If rare (<5%), MCAR, nonessential, concentrated, or impotent, delete appropriately
• If frequent, patterned, essential, diffuse, influential, use MI
• If MNAR, treat missingness as DV

Page 26

• Question: What’s the best method for identifying and removing RT outliers?
• Alternatives
  – RT cutoff (5 values)
  – z-score cutoff (1, 1.5)
  – Transformation (log, inverse)
  – Trimming
  – Medians
  – Winsorizing (2 SD)
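Several of these alternatives are easy to express as small functions. The sketch below covers the absolute cutoff, the SD (z-score) cutoff, the inverse transformation, and Winsorizing at k SDs. Trimming and the log transformation are omitted for brevity, and the median comes directly from `statistics.median`. The function names, the example RTs, and the 1000 ms cutoff are assumptions for illustration, not the values used in the study being summarized.

```python
import statistics

def absolute_cutoff(rts, upper_ms):
    """Drop RTs above a fixed cutoff (the cutoff value is the analyst's choice)."""
    return [x for x in rts if x <= upper_ms]

def sd_cutoff(rts, k):
    """Drop RTs more than k sample SDs from the mean."""
    m, sd = statistics.mean(rts), statistics.stdev(rts)
    return [x for x in rts if abs(x - m) <= k * sd]

def winsorize_sd(rts, k=2.0):
    """Pull RTs beyond mean +/- k SDs back to those bounds."""
    m, sd = statistics.mean(rts), statistics.stdev(rts)
    return [min(max(x, m - k * sd), m + k * sd) for x in rts]

def inverse_mean(rts):
    """Average in speed units (1/RT), then convert the result back to ms."""
    return 1 / statistics.mean([1 / x for x in rts])

# Hypothetical RTs (ms) for one participant in one condition.
rts = [412, 455, 468, 490, 503, 531, 1650]

print(round(statistics.mean(rts), 1))                         # untreated mean
print(round(statistics.mean(absolute_cutoff(rts, 1000)), 1))  # absolute cutoff
print(round(statistics.mean(sd_cutoff(rts, 1.5)), 1))         # 1.5 SD cutoff
print(round(statistics.mean(winsorize_sd(rts, 2.0)), 1))      # Winsorized at 2 SD
print(round(inverse_mean(rts), 1))                            # inverse transformation
print(statistics.median(rts))                                 # median
```

Every treatment lowers the estimate relative to the untreated mean, but by different amounts, which is why the choice of method can affect the outcome of a test.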

Page 27

Method
• Conduct series of simulations
  – DV: power (# sig simulations/1000)
• 2 x 2 ANOVA
  – One main effect (20, 30, 40 ms)
• 7 observations/condition
  – 10% outlier probability
  – Outliers 0-2000 ms
• 32 participants
• Between-participants variability

Page 28

Spread
Drift

ex-Gaussian distribution
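An ex-Gaussian RT is the sum of a normal component and an independent exponential component. The sketch below draws such RTs and contaminates them with outliers at the 10% rate from the Method slide. The parameter values (mu, sigma, tau) are hypothetical, and treating an outlier as an added uniform 0-2000 ms delay is an assumption about what "Outliers 0-2000 ms" means, not something the slides state.

```python
import random

def ex_gaussian(mu, sigma, tau, rng):
    """One ex-Gaussian draw: normal(mu, sigma) plus an exponential with mean tau."""
    return rng.gauss(mu, sigma) + rng.expovariate(1 / tau)

def simulate_condition(n, mu, sigma, tau, rng, p_outlier=0.10, outlier_max=2000):
    """n RTs for one condition; each has a p_outlier chance of an added delay.

    Assumption: an outlier is modeled as the regular RT plus a uniform
    0..outlier_max ms delay.
    """
    rts = []
    for _ in range(n):
        rt = ex_gaussian(mu, sigma, tau, rng)
        if rng.random() < p_outlier:
            rt += rng.uniform(0, outlier_max)
        rts.append(rt)
    return rts

rng = random.Random(1950)  # arbitrary seed so the sketch is repeatable

# Hypothetical parameters (ms): 7 observations per condition, as in the Method slide.
baseline = simulate_condition(7, mu=400, sigma=40, tau=200, rng=rng)
shifted = simulate_condition(7, mu=430, sigma=40, tau=200, rng=rng)  # 30 ms effect on mu

print([round(x) for x in baseline])
print([round(x) for x in shifted])
```

A power simulation would repeat this for every participant and condition, apply one outlier treatment, run the ANOVA, and count how often the effect reaches significance.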

Page 29
Page 30
Page 31
Page 32

Inferences
• Absolute cutoffs resulted in greatest power
• Best cutoff values depended on type of effect
  – Shift: 10-15% cutoff
  – Spread: 5% cutoff
• Inverse transformation good, too
• With high between-participant variability, SD cutoff becomes effective

Page 33

Recommendations
• Try range of cutoffs to examine robustness
• Replicate with inverse transformation (or SD cutoff)
• Replicate novel, unexpected, or important effects
• Choose method before analyzing data