Data Mind Traps
Transcript of Data Mind Traps
Slide 1
Data Mind Traps
September 2009, by Guy Lion
Slide 2
Introduction
A data mind trap occurs when you apply a quantitative method that does not fit the structure of the data and, as a result, draw the wrong conclusion.
We have come across several data mind traps when conducting modeling, data analysis, and hypothesis testing. Today, we will cover the following ones:
• Testing many hypotheses using the same data set;
• Assuming data is normally distributed when it is not, within an unpaired hypothesis-testing framework;
• Confusing “statistically significant” with material; and
• Dealing with the causation issue.
Slide 3
Testing many hypotheses with the same data set raises the probability that a purely random difference will come out “statistically significant.” At a 5% significance level, that probability jumps to 40% if you test 10 hypotheses and 64% if you test 20 hypotheses!
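Under independence, the chance of at least one false positive across n tests at significance level alpha is 1 − (1 − alpha)^n. A minimal sketch in plain Python (the function name is ours):

```python
# Family-wise error rate: probability that at least one of n independent
# tests at significance level alpha comes out "significant" purely by chance.
def familywise_error_rate(n, alpha=0.05):
    return 1 - (1 - alpha) ** n

print(f"{familywise_error_rate(10):.0%}")  # 40%
print(f"{familywise_error_rate(20):.0%}")  # 64%
```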
Slide 4
Testing many hypotheses
This is a mock-up of an observational study testing the impact of 12 different foods on cholesterol (good or bad). Observing 2,400 individuals, we notice that the 200 who ate more tomatoes than the other 2,200 have a slightly reduced cholesterol level (197.7 mg/dL vs. 203.5, respectively). The p-value is < 5% and the confidence level > 95% that eating tomatoes does reduce cholesterol. WHAT’S WRONG WITH THIS?
Medical observational study researching the impact of various foods on cholesterol level (two-sample unpaired t-test). Each food group: sample size 200 (2,400 total), standard deviation 40 mg/dL, standard error 2.8, group standard error 2.95, 2,398 degrees of freedom. Overall average: 203 mg/dL. The p-value (two-tail) is the probability that the two samples come from the same population.

Food         Cholesterol (mg/dL)   Avg. difference   t statistic   p-value (two-tail)
Strawberry         199.9                -3.4              1.1            25.2%
Tomato             197.7                -5.8              2.0             5.0%
Carrot             200.9                -2.3              0.8            43.8%
Potato             206.1                 3.4              1.1            25.2%
Almond             202.3                -0.8              0.3            79.6%
Olive              201.0                -2.2              0.7            46.0%
Pork               205.0                 2.2              0.7            46.0%
Beef               204.9                 2.1              0.7            48.3%
Chicken            205.0                 2.2              0.7            46.0%
Fish               202.6                -0.4              0.1            88.3%
Milk               207.2                 4.6              1.6            12.1%
Coffee             202.9                -0.1              0.0            97.1%
Slide 5
The Zodiac Paradox
“… the more we look for patterns, the more likely we are to find them, particularly when we don’t begin with a particular question… Then we leap to conclusions… for why we saw the results we did.”
Peter Austin, PhD, Clinical Evaluation Science.
The same observational study, with the 2,400 individuals grouped by zodiac sign instead of food (two-sample unpaired t-test). Each group: sample size 200, standard deviation 40 mg/dL, standard error 2.8, group standard error 2.95, 2,398 degrees of freedom. Overall average: 203 mg/dL.

Sign          Cholesterol (mg/dL)   Avg. difference   t statistic   p-value (two-tail)
Aries               201.6                -1.5              0.5            60.5%
Taurus              202.5                -0.5              0.2            85.4%
Gemini              202.3                -0.8              0.3            79.6%
Cancer              206.1                 3.4              1.1            25.2%
Leo                 202.3                -0.8              0.3            79.6%
Virgo               199.9                -3.4              1.1            25.2%
Libra               197.7                -5.8              2.0             5.0%
Scorpio             204.9                 2.1              0.7            48.3%
Sagittarius         204.0                 1.1              0.4            71.2%
Capricorn           203.1                 0.1              0.0            97.1%
Aquarius            207.2                 4.6              1.6            12.1%
Pisces              203.9                 1.0              0.3            74.0%
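To see the paradox numerically, one can simulate the study with pure noise: readings drawn from the same N(203, 40) distribution, split into 12 arbitrary "zodiac" groups, each tested against the rest. This sketch (our illustration, assuming numpy and scipy are available) counts how often at least one group looks significant by chance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_groups, group_size = 200, 12, 200

hits = 0
for _ in range(n_trials):
    # Pure noise: every reading is N(203, 40); no group has any real effect
    data = rng.normal(203, 40, size=(n_groups, group_size))
    pvals = [stats.ttest_ind(data[g], np.delete(data, g, axis=0).ravel()).pvalue
             for g in range(n_groups)]
    if min(pvals) < 0.05:
        hits += 1

print(f"Trials with at least one 'significant' group: {hits / n_trials:.0%}")
```

The fraction comes out in the neighborhood of 1 − 0.95^12 ≈ 46%, even though nothing real is going on.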
Slide 6
How to fix the Zodiac Paradox?
You have to adjust the p-value threshold downward to reflect the number of hypotheses you are testing.
To keep a 95% family-wise confidence level, adjust the threshold with the formula below, where n is the number of hypotheses:

Adjusted p-value threshold = 1 − (1 − p-value threshold)^(1/n)

As a close approximation, you can also divide the p-value threshold by the number of hypotheses (the Bonferroni correction).
Another way to avoid this issue is to use a second data set and test a single hypothesis: does eating tomatoes reduce cholesterol?
Calculating the adjusted p-value threshold (confidence level 95%, unadjusted p-value threshold 5.00%):

# of hypotheses   Adjusted p-value threshold
       2                  2.53%
       3                  1.70%
       4                  1.27%
       5                  1.02%
       6                  0.85%
       7                  0.73%
       8                  0.64%
      10                  0.51%
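The table of adjusted thresholds can be reproduced in a few lines (the function name is ours):

```python
# Sidak-adjusted per-test threshold that keeps the family-wise confidence
# level at 95% across n hypotheses, with the simpler Bonferroni
# approximation (alpha / n) shown for comparison.
def sidak_threshold(alpha, n):
    return 1 - (1 - alpha) ** (1 / n)

for n in (2, 3, 4, 5, 6, 7, 8, 10):
    print(f"{n:2d} hypotheses: Sidak {sidak_threshold(0.05, n):.2%}, "
          f"Bonferroni {0.05 / n:.2%}")
```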
Slide 7
Assuming data is normally distributed when it is not can lead to wrong conclusions within an unpaired hypothesis-testing framework. Always watch the shape of your data distribution.
[Histogram: Distribution of salary levels for Dept. A & B; x-axis $29,000 to $62,000 in $3,000 bins, y-axis 0% to 40%.]
Slide 8
Looking at salary levels of two departments
If we assume the data is normally distributed, those two departments (Dept. A & Dept. B) are deemed to have the same salary level. Under the unpaired t-test, they look like two samples that come from populations with near-identical distributions (p-value 98%). But, as we’ll find out, the data is not normally distributed, which invalidates our conclusion.
Two-sample unpaired t-test:
                         Dept. A    Dept. B
Sample size                   10         10
Average                  $37,022    $37,100
Standard deviation        $4,933     $8,576
Standard error            $1,560     $2,712
Average difference          ($78)

Output:
Group standard error      $3,129
t statistic                0.025
Degrees of freedom            18
P value (two-tail test)    98.0%

Salary data:
Dept. A: $36,400  $43,500  $29,000  $41,000  $39,000  $37,200  $42,250  $29,950  $33,420  $38,500
Dept. B: $29,200  $34,000  $39,651  $31,852  $35,916  $36,712  $38,500  $30,167  $35,505  $59,500
Slide 9
… the data is not normally distributed!
[Histogram repeated: Distribution of salary levels for Dept. A & B.]
Jarque-Bera test of whether the probability distribution is normal:
            Dept. A   Dept. B   Combined
n               10        10        20
Skewness      -0.5       2.3       1.8
Kurtosis      -0.8       6.1       5.5
JB score       0.7      24.0      36.7
DF               2         2         2
p-value        0.7       0.0       0.0
Using the Jarque-Bera test, the probability that either Dept. B or the combined data set is normally distributed is essentially zero. This is obvious when looking at the histogram.
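The same check runs directly in scipy, whose `jarque_bera` uses the same skewness/kurtosis formula (small-sample results may differ slightly from the slide's spreadsheet):

```python
import numpy as np
from scipy import stats

dept_a = np.array([36400, 43500, 29000, 41000, 39000,
                   37200, 42250, 29950, 33420, 38500])
dept_b = np.array([29200, 34000, 39651, 31852, 35916,
                   36712, 38500, 30167, 35505, 59500])

for name, salaries in (("Dept. A", dept_a), ("Dept. B", dept_b)):
    jb, p = stats.jarque_bera(salaries)
    # Low p-value -> reject normality; Dept. B's $59,500 outlier drives it
    print(f"{name}: JB = {jb:.1f}, p-value = {p:.3f}")
```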
Slide 10
Mann-Whitney test
Instead of dealing with the average value, the Mann-Whitney (MW) test deals with the average rank. Salaries are ranked in ascending order. The MW test neutralizes the impact of the salary outlier ($59,500). This test suggests there is only a 47% probability that those two departments have the same salary level. If they are different, Dept. A is deemed to have the higher salary level (higher average rank). A very different conclusion vs. the unpaired t-test…
Salaries with their rankings (ascending):
 Dept. A    Dept. B    Rank A   Rank B
 $36,400    $29,200      10        2
 $43,500    $34,000      19        7
 $29,000    $39,651       1       16
 $41,000    $31,852      17        5
 $39,000    $35,916      15        9
 $37,200    $36,712      12       11
 $42,250    $38,500      18       13
 $29,950    $30,167       3        4
 $33,420    $35,505       6        8
 $38,500    $59,500      13       20
Mann-Whitney test (using average rank):
                          Dept. A   Dept. B
Sample size                   10        10
Average rank                11.4       9.5
Difference in avg. rank      1.9

Standard error = (n1 + n2) × SQRT((n1 + n2 + 1)/(12 × n1 × n2)):
n1 + n2               20
n1 + n2 + 1           21
12 × n1 × n2        1200
Standard error       2.6

Test statistic = difference in avg. rank / standard error:
Diff. in avg. rank           1.9
Standard error               2.6
Test statistic (Z value)    0.72   (number of standard errors)
P value (two-tail test)    47.3%   (using NORMSDIST)
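scipy's `mannwhitneyu` runs the equivalent test on the U statistic rather than the average rank; the two formulations lead to the same conclusion:

```python
from scipy import stats

dept_a = [36400, 43500, 29000, 41000, 39000,
          37200, 42250, 29950, 33420, 38500]
dept_b = [29200, 34000, 39651, 31852, 35916,
          36712, 38500, 30167, 35505, 59500]

# Two-sided Mann-Whitney U test; ranking neutralizes the $59,500 outlier
u_stat, p = stats.mannwhitneyu(dept_a, dept_b, alternative="two-sided")
print(f"U = {u_stat}, p-value = {p:.3f}")
```

The p-value lands near 0.5, in line with the slide's 47.3% normal approximation.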
Slide 11
“Statistically significant” vs material
Example: A rep is marketing a math SAT prep course for $1,000. The rep tells you the course has improved students’ scores, with a p-value of 5%, i.e., a 95% confidence level. Should you buy the course?
Statistically significant does not necessarily mean consequential or material for your business. Sometimes completely trivial differences can be statistically significant simply because you have very large sample sizes.
Slide 12
Assessing an SAT math course with an Effect Size measure: Cohen’s d value
Effect Size Cohen’s d value is set up similarly to the unpaired t test. The main difference is that it measures the statistical distance in standard deviations instead of standard errors.
Effect size (Cohen's d value) interpretation:
Cohen's d   Interpretation
0.0         < Small
0.2         Small
0.5         Medium
0.8         Large
Two-sample unpaired t-test:                  Tests    Controls
Sample size                                    278       278
SAT math score avg.                            517       500
Standard deviation                             100       100
Standard error                                 6.0       6.0
Average difference                              17
Output:
Group standard error                          8.48
t statistic                                   1.97
Degrees of freedom                             554
P value (2-tail), prob. samples are same:     5.0%

Effect size, Cohen's d value:                Tests    Controls
Sample size                                    278       278
SAT math score avg.                            517       500
Standard deviation                             100       100
Average difference                              17
Output:
Pooled standard deviation                    100.0
Cohen's d value                               0.17
Percentile standing of Tests avg.            56.6%
Effect size strength                       < Small
Effect size midpoint                          0.08
Area under curve                             53.3%
Area beyond curve                            46.7%
Overlap area                                 93.4%
Nonoverlap area                               6.6%
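These quantities follow from the normal CDF alone; a sketch using only the Python standard library:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal

# Cohen's d: distance between the two means in pooled standard deviations
d = (517 - 500) / 100.0                  # 0.17
percentile = z.cdf(d)                    # percentile standing of Tests avg.
midpoint = d / 2                         # effect size midpoint
area_under = z.cdf(midpoint)
overlap = 2 * (1 - area_under)           # overlap of the two distributions
print(f"d = {d:.2f}, percentile = {percentile:.1%}, overlap = {overlap:.1%}")
```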
Slide 13
Cohen’s d information
[Chart: two overlapping normal density curves illustrating effect size; x-axis Z value from -2.50 to 2.30, y-axis 0.0% to 4.5%. Annotations: Effect Size, Midpoint, Overlap Area, Nonoverlap Area, Pilot Avg. Percentile Standing.]
                                        A            B             C = 2B    D = 1 - C
Effect size   Percentile                Area under   Area beyond   Overlap   Nonoverlap
(Cohen's d)   standing     Midpoint     curve        curve         area      area
0.17          56.6%        0.08         53.3%        46.7%         93.4%      6.6%
0.20          57.9%        0.10         54.0%        46.0%         92.0%      8.0%
0.50          69.1%        0.25         59.9%        40.1%         80.3%     19.7%
0.80          78.8%        0.40         65.5%        34.5%         68.9%     31.1%
1.96          97.5%        0.98         83.6%        16.4%         32.7%     67.3%
Effect size, Cohen's d value:          Tests    Controls
Sample size                              278       278
SAT math score avg.                      517       500
Standard deviation                       100       100
Average difference                        17
Output:
Pooled standard deviation              100.0
Cohen's d value                         0.17
Percentile standing of Tests avg.      56.6%
Effect size strength                 < Small
Effect size midpoint                    0.08
A: Area under curve                    53.3%
B: Area beyond curve                   46.7%
C = 2B: Overlap area                   93.4%
D = 1 - C: Nonoverlap area              6.6%
Slide 14
Cohen’s d Confidence Interval
The effect size standard deviation formula allows you to build confidence intervals around effect size values.
Effect size, Cohen's d value:          Tests    Controls
Sample size                              278       278
SAT math score avg.                      517       500
Standard deviation                       100       100
Average difference                        17
Output:
Pooled standard deviation              100.0
Cohen's d value                         0.17
Percentile standing of Tests avg.      56.6%
Effect size strength                 < Small
Effect size midpoint                    0.08
Area under curve                       53.3%
Area beyond curve                      46.7%
Overlap area                           93.4%
Nonoverlap area                         6.6%

Confidence interval (effect size standard deviation 0.08):
a) In standardized units:
   C.I.     Z value   Low    High
   95.0%    1.96      0.00   0.33
b) In regular units (SAT math points):
   C.I.     Z value   Low    High
   95.0%    1.96      0.04   33.3
The “n’s” just stand for the sample sizes.
Regarding this SAT prep course, stating that its 17-point math effect is small, with a 95% confidence interval of 0 to 33 math points, provides better information than simply stating it is statistically significant with a p-value of 5%.
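The effect size standard deviation used above is commonly approximated as sqrt((n1 + n2)/(n1·n2) + d²/(2(n1 + n2))) (a Hedges-Olkin-style formula); a sketch under that assumption:

```python
import math

def cohens_d_ci(d, n1, n2, z=1.96):
    # Approximate standard deviation of Cohen's d (Hedges & Olkin style)
    sd = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * sd, d + z * sd

low, high = cohens_d_ci(0.17, 278, 278)
print(f"95% CI in standardized units: [{low:.2f}, {high:.2f}]")
# Multiply by the pooled standard deviation (100 SAT points) for regular units
print(f"95% CI in SAT math points: [{low * 100:.0f}, {high * 100:.0f}]")
```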
Slide 15
Gamma Index (staying with SAT example)
This is a nonparametric alternative to Cohen’s d for when the variables are not normally distributed.
I recalculated the Gamma Index so it is consistent in sign with Cohen’s d: a positive effect size (Z value) now denotes the Tests values being higher than the Controls’.
Gamma Index data:
Tests   Controls
 617      402
 401      594
 517      415
 406      610
 550      400
 461      600
 602      360
 424      651
 687      420
 522      561
 442      426
 666      566
 427      500

Median       517      500
Difference    17

Output, effect size, Gamma Index:
# of Tests < Control median      6
Proportion                   46.2%
Z value (left tail)          -0.10

Adjusted effect size, Gamma Index:
# of Tests > Control median      7
Proportion                   53.8%
Z value (right tail)          0.10
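The adjusted Gamma Index computation: take the proportion of Test scores above the Control median and map it through the inverse normal CDF. A sketch with the slide's data, using only the standard library:

```python
from statistics import NormalDist, median

tests = [617, 401, 517, 406, 550, 461, 602, 424, 687, 522, 442, 666, 427]
controls = [402, 594, 415, 610, 400, 600, 360, 651, 420, 561, 426, 566, 500]

control_median = median(controls)                  # 500
above = sum(t > control_median for t in tests)     # 7 of 13 test scores
proportion = above / len(tests)                    # 53.8%
gamma = NormalDist().inv_cdf(proportion)           # z value, right tail
print(f"Gamma Index: {gamma:.2f}")
```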
Slide 16
Dealing with the causation issue
Causation is elusive to prove. We’ll cover a couple of methods that get you somewhat closer. Causation ultimately depends on your logical flow; the stats can support the logic much more than the other way around.
Slide 17
Causation Part I: Granger Causality
Slide 18
Granger Causality
Testing whether A Granger-causes B (A is the independent variable being tested):

Base case model (autoregressive):      Test model (multivariate):
  Y = B                                  Y = B
  X1 = lag B                             X1 = lag B
                                         X2 = lag A

1. Fit each linear regression.
2. Square the residuals of each model.
3. Hypothesis testing (unpaired t-test): do the two samples of squared residuals come from the same population?
Slide 19
Granger Causality: the whole picture

A Granger-causes B:                        B Granger-causes A:
  Base case (autoregressive):                Base case (autoregressive):
    Y = B, X1 = lag B                          Y = A, X1 = lag A
  Test model (multivariate):                 Test model (multivariate):
    Y = B, X1 = lag B, X2 = lag A              Y = A, X1 = lag A, X2 = lag B

For each direction: fit the linear regressions, square the residuals, then run an unpaired t-test on whether the two samples of squared residuals come from the same population.

Granger causality: compare the unpaired t-test p-values to find out whether A causes B more than B causes A.
This method still does not fully demonstrate causality: the rooster’s crow “Granger-causes” sunrise.
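The procedure can be sketched on synthetic data where A genuinely drives B. This follows the deck's unpaired-t-on-squared-residuals formulation (a standard Granger test would instead run an F-test on the same two regressions); numpy and scipy are assumed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 300
a = rng.normal(size=n)
b = np.zeros(n)
for t in range(1, n):
    # B depends on its own lag AND on lagged A: A Granger-causes B
    b[t] = 0.3 * b[t - 1] + 0.8 * a[t - 1] + rng.normal()

def squared_residuals(y, *regressors):
    # OLS fit with an intercept, returning the squared residuals
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return (y - X @ beta) ** 2

base = squared_residuals(b[1:], b[:-1])            # autoregressive: lag B only
test = squared_residuals(b[1:], b[:-1], a[:-1])    # multivariate: add lag A

# Do the two samples of squared residuals come from the same population?
t_stat, p = stats.ttest_ind(base, test)
print(f"t = {t_stat:.2f}, p-value = {p:.4f}")
```

Adding lagged A materially shrinks the residuals, so the t-test rejects at conventional levels.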
Slide 20
Causation Part II: Path Analysis
We’ll illustrate Path Analysis through an example inspired by a book.
The author’s theory is that human capital causes the homeownership rate to decline.
Slide 21
Path Analysis: Direct and Indirect Effects
The correlation of the independent variable with the dependent variable can be decomposed into its Direct Effect and its Indirect Effect. The Indirect Effect is derived from the intermediary variables sitting between the independent and dependent variables. The Causal Effect is the sum of the two Effects and should equal the correlation:

Correlation = Causal Effect = Direct Effect + Indirect Effect
Slide 22
The Path Analysis Diagram
The Path Analysis Diagram defines our hypothesis. Human Capital has an effect on:
• Housing Affordability (-), as highly educated wage earners bid up the prices of homes;
• Demographics/Youth (-); and
• Unemployment (-), as Human Capital lowers unemployment.
In turn, those intermediary variables have an effect on the Homeownership rate:
• Housing Affordability (+): if homes are more affordable, homeownership goes up;
• Demographics/Youth (% of population between 20 and 29) (-), as younger people starting out can ill afford homes; and
• Unemployment (-), as the unemployed lack the income to buy homes.
Independent variable -> Intermediary variables -> Dependent variable:

Human Capital --(-)--> Housing affordability --(+)--> Homeownership
Human Capital --(-)--> Demographic (youth)   --(-)--> Homeownership
Human Capital --(-)--> Unemployment          --(-)--> Homeownership
Slide 23
The Actual Correlations
We embedded the correlations within the diagram. We also added a correlation directly from Human Capital to Homeownership. Most correlation signs support the hypothesis; the exception is Unemployment.
Correlations:
Human Capital --(-0.181)--> Housing affordability --(0.573)-->  Homeownership
Human Capital --(-0.064)--> Youth                 --(-0.249)--> Homeownership
Human Capital --(-0.250)--> Unemployment          --(0.064)-->  Homeownership
Human Capital --(-0.176)--> Homeownership (direct correlation)
Correlation matrix using standardized variables:
                        Human cap.  Housing aff.  Young pop.  Unemploy.  Homeownership
Human capital               1         -0.181        -0.064      -0.250      -0.176
Housing affordability    -0.181          1          -0.102       0.301       0.573
Young population         -0.064       -0.102           1        -0.167      -0.249
Unemployment rate        -0.250        0.301        -0.167         1         0.064
Homeownership rate       -0.176        0.573        -0.249       0.064         1
Slide 24
Path Analysis
With standardized variables, within a single bivariate relationship the correlation is equal to the slope:

Correlation = COVAR(X, Y) / (sX × sY)
Slope = COVAR(X, Y) / VAR(X)
If both s = 1, then Correlation = Slope.
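A quick numerical check of this identity on synthetic data (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
y = 0.6 * x + 0.8 * rng.normal(size=1000)

# Standardize both variables to mean 0, standard deviation 1
xs = (x - x.mean()) / x.std()
ys = (y - y.mean()) / y.std()

correlation = np.corrcoef(xs, ys)[0, 1]
slope = np.cov(xs, ys, bias=True)[0, 1] / xs.var()   # VAR(xs) = 1
print(np.isclose(correlation, slope))  # True
```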
Slide 25
The Path Coefficients
SUMMARY OUTPUT
Regression statistics:
Multiple R           0.633
R Square             0.401
Adjusted R Square    0.375
Standard Error       0.785
Observations           111

                        Coefficient   Std. Error   t Stat    P-value
Human capital             -0.130        0.078      -1.665     0.099
Housing affordability      0.581        0.079       7.334     0.000
Young population          -0.228        0.077      -2.976     0.004
Unemployment              -0.181        0.081      -2.227     0.028
Given that the variables are standardized, the bivariate correlations on the left side of the diagram already serve as path coefficients. We calculate the remaining path coefficients (the links into Homeownership) with a regression model whose dependent variable is the Homeownership rate.
Path coefficients:
Human Capital --(-0.181)--> Housing affordability --(0.581)-->  Homeownership
Human Capital --(-0.064)--> Youth                 --(-0.228)--> Homeownership
Human Capital --(-0.250)--> Unemployment          --(-0.181)--> Homeownership
Human Capital --(-0.130)--> Homeownership (direct path)
Slide 26
Human Capital Direct and Indirect Effects
Human Capital’s causal effect (-0.176) on Homeownership equals its correlation.
Decomposing correlations into indirect and direct effects:

Human Capital indirect effect on Homeownership               A        B       A × B
Human Cap. -> Housing affordability -> Homeownership      -0.181    0.581    -0.105
Human Cap. -> Youth -> Homeownership                      -0.064   -0.228     0.015
Human Cap. -> Unemployment -> Homeownership               -0.250   -0.181     0.045
Total indirect effect                                                        -0.045

Human Capital direct effect on Homeownership                                 -0.130

Human Capital causal effect on Homeownership:
a) Indirect effect    -0.045
b) Direct effect      -0.130
Total causal effect   -0.176
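With standardized variables, the path coefficients are the solution of the normal equations R_xx · β = r_xy built from the correlation matrix on slide 23, and the direct + indirect decomposition recovers the raw correlation exactly. A numpy check using the deck's numbers:

```python
import numpy as np

# Correlation matrix of the predictors (standardized variables), slide 23:
# order: human capital, housing affordability, young population, unemployment
r_xx = np.array([
    [ 1.000, -0.181, -0.064, -0.250],
    [-0.181,  1.000, -0.102,  0.301],
    [-0.064, -0.102,  1.000, -0.167],
    [-0.250,  0.301, -0.167,  1.000],
])
r_xy = np.array([-0.176, 0.573, -0.249, 0.064])  # correlations with homeownership

# Path coefficients = standardized regression coefficients
beta = np.linalg.solve(r_xx, r_xy)
print(np.round(beta, 2))            # close to [-0.13, 0.58, -0.23, -0.18]

# Human Capital: direct effect plus effects routed through the intermediaries
direct = beta[0]
indirect = sum(r_xx[0, j] * beta[j] for j in (1, 2, 3))
print(round(direct + indirect, 3))  # -0.176, the raw correlation
```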
Slide 27
Take-Aways
• Testing many hypotheses -> Adjust the p-value threshold
• Dealing with non-normal data in hypothesis testing -> Mann-Whitney test
• Evaluating statistical significance -> Effect size: Cohen’s d value, Gamma Index
• Tackling causality -> Granger Causality, Path Analysis