Data Mind Traps
Transcript of Data Mind Traps
Slide 1
Data Mind Traps
September 2009, by Guy Lion
Slide 2
Introduction
A data mind trap occurs when you apply a quantitative method that does not fit the structure of the data and, as a result, draw the wrong conclusion.
We have come across several data mind traps when conducting modeling, data analysis, and hypothesis testing. Today, we will cover the following ones:
• Testing many hypotheses using the same data set;
• Assuming data is normally distributed when it is not, within an unpaired hypothesis-testing framework;
• Confusing “statistically significant” with material; and
• Dealing with the causation issue.
Slide 3
Testing many hypotheses with the same data set raises the probability that a purely random difference will come out “statistically significant.” At a 5% significance level, that probability jumps to 40% if you test 10 hypotheses and 64% if you test 20 hypotheses!
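Under independence, the chance of at least one false positive across n tests at significance level alpha is 1 − (1 − alpha)^n. A minimal sketch in plain Python (the function name is ours):

```python
# Family-wise error rate: probability that at least one of n independent
# tests at significance level alpha comes out "significant" purely by chance.
def familywise_error_rate(n, alpha=0.05):
    return 1 - (1 - alpha) ** n

print(f"{familywise_error_rate(10):.0%}")  # 40%
print(f"{familywise_error_rate(20):.0%}")  # 64%
```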
Slide 4
Testing many hypotheses
This is a mock-up of an observational study testing the impact of 12 different foods on cholesterol (good or bad). Observing 2,400 individuals, we notice that the 200 who ate more tomatoes than the other 2,200 have a slightly reduced cholesterol level (197.7 mg/dL vs. 203.5, respectively). The p-value is < 5% and the confidence level > 95% that eating tomatoes does reduce cholesterol. WHAT’S WRONG WITH THIS?
Medical observational study researching the impact of various foods on cholesterol level (two-sample unpaired t-test). Each food group: sample size 200 (2,400 total), standard deviation 40 mg/dL, standard error 2.8, group standard error 2.95, 2,398 degrees of freedom. Overall average: 203 mg/dL. The p-value (two-tail) is the probability that the two samples come from the same population.

Food         Cholesterol (mg/dL)   Avg. difference   t statistic   p-value (two-tail)
Strawberry         199.9                -3.4              1.1            25.2%
Tomato             197.7                -5.8              2.0             5.0%
Carrot             200.9                -2.3              0.8            43.8%
Potato             206.1                 3.4              1.1            25.2%
Almond             202.3                -0.8              0.3            79.6%
Olive              201.0                -2.2              0.7            46.0%
Pork               205.0                 2.2              0.7            46.0%
Beef               204.9                 2.1              0.7            48.3%
Chicken            205.0                 2.2              0.7            46.0%
Fish               202.6                -0.4              0.1            88.3%
Milk               207.2                 4.6              1.6            12.1%
Coffee             202.9                -0.1              0.0            97.1%
Slide 5
The Zodiac Paradox
“… the more we look for patterns, the more likely we are to find them, particularly when we don’t begin with a particular question… Then we leap to conclusions… for why we saw the results we did.”
Peter Austin, PhD, Clinical Evaluation Science.
The same observational study, with the 2,400 individuals grouped by zodiac sign instead of food (two-sample unpaired t-test). Each group: sample size 200, standard deviation 40 mg/dL, standard error 2.8, group standard error 2.95, 2,398 degrees of freedom. Overall average: 203 mg/dL.

Sign          Cholesterol (mg/dL)   Avg. difference   t statistic   p-value (two-tail)
Aries               201.6                -1.5              0.5            60.5%
Taurus              202.5                -0.5              0.2            85.4%
Gemini              202.3                -0.8              0.3            79.6%
Cancer              206.1                 3.4              1.1            25.2%
Leo                 202.3                -0.8              0.3            79.6%
Virgo               199.9                -3.4              1.1            25.2%
Libra               197.7                -5.8              2.0             5.0%
Scorpio             204.9                 2.1              0.7            48.3%
Sagittarius         204.0                 1.1              0.4            71.2%
Capricorn           203.1                 0.1              0.0            97.1%
Aquarius            207.2                 4.6              1.6            12.1%
Pisces              203.9                 1.0              0.3            74.0%
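To see the paradox numerically, one can simulate the study with pure noise: readings drawn from the same N(203, 40) distribution, split into 12 arbitrary "zodiac" groups, each tested against the rest. This sketch (our illustration, assuming numpy and scipy are available) counts how often at least one group looks significant by chance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_groups, group_size = 200, 12, 200

hits = 0
for _ in range(n_trials):
    # Pure noise: every reading is N(203, 40); no group has any real effect
    data = rng.normal(203, 40, size=(n_groups, group_size))
    pvals = [stats.ttest_ind(data[g], np.delete(data, g, axis=0).ravel()).pvalue
             for g in range(n_groups)]
    if min(pvals) < 0.05:
        hits += 1

print(f"Trials with at least one 'significant' group: {hits / n_trials:.0%}")
```

The fraction comes out in the neighborhood of 1 − 0.95^12 ≈ 46%, even though nothing real is going on.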
Slide 6
How to fix the Zodiac Paradox?
You have to adjust the p-value threshold downward to reflect the number of hypotheses you are testing.
To keep a 95% family-wise confidence level, adjust the threshold with the formula below, where n is the number of hypotheses:

Adjusted p-value threshold = 1 − (1 − p-value threshold)^(1/n)

As a close approximation, you can also divide the p-value threshold by the number of hypotheses (the Bonferroni correction).
Another way to avoid this issue is to use a second data set and test a single hypothesis: does eating tomatoes reduce cholesterol?
Calculating the adjusted p-value threshold (confidence level 95%, unadjusted p-value threshold 5.00%):

# of hypotheses   Adjusted p-value threshold
       2                  2.53%
       3                  1.70%
       4                  1.27%
       5                  1.02%
       6                  0.85%
       7                  0.73%
       8                  0.64%
      10                  0.51%
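The table of adjusted thresholds can be reproduced in a few lines (the function name is ours):

```python
# Sidak-adjusted per-test threshold that keeps the family-wise confidence
# level at 95% across n hypotheses, with the simpler Bonferroni
# approximation (alpha / n) shown for comparison.
def sidak_threshold(alpha, n):
    return 1 - (1 - alpha) ** (1 / n)

for n in (2, 3, 4, 5, 6, 7, 8, 10):
    print(f"{n:2d} hypotheses: Sidak {sidak_threshold(0.05, n):.2%}, "
          f"Bonferroni {0.05 / n:.2%}")
```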
Slide 7
Assuming data is normally distributed when it is not can lead to wrong conclusions within an unpaired hypothesis-testing framework. Always watch the shape of your data distribution.
[Histogram: Distribution of salary levels for Dept. A & B; x-axis $29,000 to $62,000 in $3,000 bins, y-axis 0% to 40%.]
Slide 8
Looking at salary levels of two departments
If we assume the data is normally distributed, those two departments (Dept. A & Dept. B) are deemed to have the same salary level. Under the unpaired t-test, they look like two samples that come from populations with near-identical distributions (p-value 98%). But, as we’ll find out, the data is not normally distributed, which invalidates our conclusion.
Two-sample unpaired t-test:
                         Dept. A    Dept. B
Sample size                   10         10
Average                  $37,022    $37,100
Standard deviation        $4,933     $8,576
Standard error            $1,560     $2,712
Average difference          ($78)

Output:
Group standard error      $3,129
t statistic                0.025
Degrees of freedom            18
P value (two-tail test)    98.0%

Salary data:
Dept. A: $36,400  $43,500  $29,000  $41,000  $39,000  $37,200  $42,250  $29,950  $33,420  $38,500
Dept. B: $29,200  $34,000  $39,651  $31,852  $35,916  $36,712  $38,500  $30,167  $35,505  $59,500
Slide 9
… the data is not normally distributed!
[Histogram repeated: Distribution of salary levels for Dept. A & B.]
Jarque-Bera test of whether the probability distribution is normal:
            Dept. A   Dept. B   Combined
n               10        10        20
Skewness      -0.5       2.3       1.8
Kurtosis      -0.8       6.1       5.5
JB score       0.7      24.0      36.7
DF               2         2         2
p-value        0.7       0.0       0.0
Using the Jarque-Bera test, the probability that either Dept. B or the combined data set is normally distributed is essentially zero. This is obvious when looking at the histogram.
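The same check runs directly in scipy, whose `jarque_bera` uses the same skewness/kurtosis formula (small-sample results may differ slightly from the slide's spreadsheet):

```python
import numpy as np
from scipy import stats

dept_a = np.array([36400, 43500, 29000, 41000, 39000,
                   37200, 42250, 29950, 33420, 38500])
dept_b = np.array([29200, 34000, 39651, 31852, 35916,
                   36712, 38500, 30167, 35505, 59500])

for name, salaries in (("Dept. A", dept_a), ("Dept. B", dept_b)):
    jb, p = stats.jarque_bera(salaries)
    # Low p-value -> reject normality; Dept. B's $59,500 outlier drives it
    print(f"{name}: JB = {jb:.1f}, p-value = {p:.3f}")
```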
Slide 10
Mann-Whitney test
Instead of dealing with the average value, the Mann-Whitney (MW) test deals with the average rank. Salaries are ranked in ascending order. The MW test neutralizes the impact of the salary outlier ($59,500). This test suggests there is only a 47% probability that those two departments have the same salary level. If they are different, Dept. A is deemed to have the higher salary level (higher average rank). A very different conclusion vs. the unpaired t-test…
Salaries with their rankings (ascending):
 Dept. A    Dept. B    Rank A   Rank B
 $36,400    $29,200      10        2
 $43,500    $34,000      19        7
 $29,000    $39,651       1       16
 $41,000    $31,852      17        5
 $39,000    $35,916      15        9
 $37,200    $36,712      12       11
 $42,250    $38,500      18       13
 $29,950    $30,167       3        4
 $33,420    $35,505       6        8
 $38,500    $59,500      13       20
Mann-Whitney test (using average rank):
                          Dept. A   Dept. B
Sample size                   10        10
Average rank                11.4       9.5
Difference in avg. rank      1.9

Standard error = (n1 + n2) × SQRT((n1 + n2 + 1)/(12 × n1 × n2)):
n1 + n2               20
n1 + n2 + 1           21
12 × n1 × n2        1200
Standard error       2.6

Test statistic = difference in avg. rank / standard error:
Diff. in avg. rank           1.9
Standard error               2.6
Test statistic (Z value)    0.72   (number of standard errors)
P value (two-tail test)    47.3%   (using NORMSDIST)
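scipy's `mannwhitneyu` runs the equivalent test on the U statistic rather than the average rank; the two formulations lead to the same conclusion:

```python
from scipy import stats

dept_a = [36400, 43500, 29000, 41000, 39000,
          37200, 42250, 29950, 33420, 38500]
dept_b = [29200, 34000, 39651, 31852, 35916,
          36712, 38500, 30167, 35505, 59500]

# Two-sided Mann-Whitney U test; ranking neutralizes the $59,500 outlier
u_stat, p = stats.mannwhitneyu(dept_a, dept_b, alternative="two-sided")
print(f"U = {u_stat}, p-value = {p:.3f}")
```

The p-value lands near 0.5, in line with the slide's 47.3% normal approximation.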
Slide 11
“Statistically significant” vs material
Example: A rep is marketing a math SAT prep course for $1,000. The rep tells you the course has improved students’ scores, with a p-value of 5%, i.e., a 95% confidence level. Should you buy the course?
Statistically significant does not necessarily mean consequential or material for your business. Sometimes completely trivial differences can be statistically significant simply because you have very large sample sizes.
Slide 12
Assessing an SAT math course with an Effect Size measure: Cohen’s d value
Effect Size Cohen’s d value is set up similarly to the unpaired t test. The main difference is that it measures the statistical distance in standard deviations instead of standard errors.
Effect size (Cohen's d value) interpretation:
Cohen's d   Interpretation
0.0         < Small
0.2         Small
0.5         Medium
0.8         Large
Two-sample unpaired t-test:                  Tests    Controls
Sample size                                    278       278
SAT math score avg.                            517       500
Standard deviation                             100       100
Standard error                                 6.0       6.0
Average difference                              17
Output:
Group standard error                          8.48
t statistic                                   1.97
Degrees of freedom                             554
P value (2-tail), prob. samples are same:     5.0%

Effect size, Cohen's d value:                Tests    Controls
Sample size                                    278       278
SAT math score avg.                            517       500
Standard deviation                             100       100
Average difference                              17
Output:
Pooled standard deviation                    100.0
Cohen's d value                               0.17
Percentile standing of Tests avg.            56.6%
Effect size strength                       < Small
Effect size midpoint                          0.08
Area under curve                             53.3%
Area beyond curve                            46.7%
Overlap area                                 93.4%
Nonoverlap area                               6.6%
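These quantities follow from the normal CDF alone; a sketch using only the Python standard library:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal

# Cohen's d: distance between the two means in pooled standard deviations
d = (517 - 500) / 100.0                  # 0.17
percentile = z.cdf(d)                    # percentile standing of Tests avg.
midpoint = d / 2                         # effect size midpoint
area_under = z.cdf(midpoint)
overlap = 2 * (1 - area_under)           # overlap of the two distributions
print(f"d = {d:.2f}, percentile = {percentile:.1%}, overlap = {overlap:.1%}")
```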
Slide 13
Cohen’s d information
[Chart: two overlapping normal density curves illustrating effect size; x-axis Z value from -2.50 to 2.30, y-axis 0.0% to 4.5%. Annotations: Effect Size, Midpoint, Overlap Area, Nonoverlap Area, Pilot Avg. Percentile Standing.]
                                        A            B             C = 2B    D = 1 - C
Effect size   Percentile                Area under   Area beyond   Overlap   Nonoverlap
(Cohen's d)   standing     Midpoint     curve        curve         area      area
0.17          56.6%        0.08         53.3%        46.7%         93.4%      6.6%
0.20          57.9%        0.10         54.0%        46.0%         92.0%      8.0%
0.50          69.1%        0.25         59.9%        40.1%         80.3%     19.7%
0.80          78.8%        0.40         65.5%        34.5%         68.9%     31.1%
1.96          97.5%        0.98         83.6%        16.4%         32.7%     67.3%
Effect size, Cohen's d value:          Tests    Controls
Sample size                              278       278
SAT math score avg.                      517       500
Standard deviation                       100       100
Average difference                        17
Output:
Pooled standard deviation              100.0
Cohen's d value                         0.17
Percentile standing of Tests avg.      56.6%
Effect size strength                 < Small
Effect size midpoint                    0.08
A: Area under curve                    53.3%
B: Area beyond curve                   46.7%
C = 2B: Overlap area                   93.4%
D = 1 - C: Nonoverlap area              6.6%
Slide 14
Cohen’s d Confidence Interval
The effect size standard deviation formula allows you to build confidence intervals around effect size values.
Effect size, Cohen's d value:          Tests    Controls
Sample size                              278       278
SAT math score avg.                      517       500
Standard deviation                       100       100
Average difference                        17
Output:
Pooled standard deviation              100.0
Cohen's d value                         0.17
Percentile standing of Tests avg.      56.6%
Effect size strength                 < Small
Effect size midpoint                    0.08
Area under curve                       53.3%
Area beyond curve                      46.7%
Overlap area                           93.4%
Nonoverlap area                         6.6%

Confidence interval (effect size standard deviation 0.08):
a) In standardized units:
   C.I.     Z value   Low    High
   95.0%    1.96      0.00   0.33
b) In regular units (SAT math points):
   C.I.     Z value   Low    High
   95.0%    1.96      0.04   33.3
The “n’s” just stand for the sample sizes.
Regarding this SAT prep course, stating that its 17-point math effect is small, with a 95% confidence interval of 0 to 33 math points, provides better information than simply stating it is statistically significant with a p-value of 5%.
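The effect size standard deviation used above is commonly approximated as sqrt((n1 + n2)/(n1·n2) + d²/(2(n1 + n2))) (a Hedges-Olkin-style formula); a sketch under that assumption:

```python
import math

def cohens_d_ci(d, n1, n2, z=1.96):
    # Approximate standard deviation of Cohen's d (Hedges & Olkin style)
    sd = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * sd, d + z * sd

low, high = cohens_d_ci(0.17, 278, 278)
print(f"95% CI in standardized units: [{low:.2f}, {high:.2f}]")
# Multiply by the pooled standard deviation (100 SAT points) for regular units
print(f"95% CI in SAT math points: [{low * 100:.0f}, {high * 100:.0f}]")
```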
Slide 15
Gamma Index (staying with SAT example)
This is a nonparametric alternative to Cohen’s d for when the variables are not normally distributed.
I recalculated the Gamma Index so it is consistent in sign with Cohen’s d: a positive effect size (Z value) now denotes the Tests values being higher than the Controls’.
Gamma Index data:
Tests   Controls
 617      402
 401      594
 517      415
 406      610
 550      400
 461      600
 602      360
 424      651
 687      420
 522      561
 442      426
 666      566
 427      500

Median       517      500
Difference    17

Output, effect size, Gamma Index:
# of Tests < Control median      6
Proportion                   46.2%
Z value (left tail)          -0.10

Adjusted effect size, Gamma Index:
# of Tests > Control median      7
Proportion                   53.8%
Z value (right tail)          0.10
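The adjusted Gamma Index computation: take the proportion of Test scores above the Control median and map it through the inverse normal CDF. A sketch with the slide's data, using only the standard library:

```python
from statistics import NormalDist, median

tests = [617, 401, 517, 406, 550, 461, 602, 424, 687, 522, 442, 666, 427]
controls = [402, 594, 415, 610, 400, 600, 360, 651, 420, 561, 426, 566, 500]

control_median = median(controls)                  # 500
above = sum(t > control_median for t in tests)     # 7 of 13 test scores
proportion = above / len(tests)                    # 53.8%
gamma = NormalDist().inv_cdf(proportion)           # z value, right tail
print(f"Gamma Index: {gamma:.2f}")
```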
Slide 16
Dealing with the causation issue
Causation is elusive to prove. We’ll cover a couple of methods that get you somewhat closer. Causation ultimately depends on your logical flow; the stats can support the logic much more than the other way around.
Slide 17
Causation Part I: Granger Causality
Slide 18
Granger Causality
Testing whether A Granger-causes B (A is the independent variable being tested):

Base case model (autoregressive):      Test model (multivariate):
  Y = B                                  Y = B
  X1 = lag B                             X1 = lag B
                                         X2 = lag A

1. Fit each linear regression.
2. Square the residuals of each model.
3. Hypothesis testing (unpaired t-test): do the two samples of squared residuals come from the same population?
Slide 19
Granger Causality: the whole picture

A Granger-causes B:                        B Granger-causes A:
  Base case (autoregressive):                Base case (autoregressive):
    Y = B, X1 = lag B                          Y = A, X1 = lag A
  Test model (multivariate):                 Test model (multivariate):
    Y = B, X1 = lag B, X2 = lag A              Y = A, X1 = lag A, X2 = lag B

For each direction: fit the linear regressions, square the residuals, then run an unpaired t-test on whether the two samples of squared residuals come from the same population.

Granger causality: compare the unpaired t-test p-values to find out whether A causes B more than B causes A.
This method still does not fully demonstrate causality: the rooster’s crow “Granger-causes” sunrise.
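The procedure can be sketched on synthetic data where A genuinely drives B. This follows the deck's unpaired-t-on-squared-residuals formulation (a standard Granger test would instead run an F-test on the same two regressions); numpy and scipy are assumed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 300
a = rng.normal(size=n)
b = np.zeros(n)
for t in range(1, n):
    # B depends on its own lag AND on lagged A: A Granger-causes B
    b[t] = 0.3 * b[t - 1] + 0.8 * a[t - 1] + rng.normal()

def squared_residuals(y, *regressors):
    # OLS fit with an intercept, returning the squared residuals
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return (y - X @ beta) ** 2

base = squared_residuals(b[1:], b[:-1])            # autoregressive: lag B only
test = squared_residuals(b[1:], b[:-1], a[:-1])    # multivariate: add lag A

# Do the two samples of squared residuals come from the same population?
t_stat, p = stats.ttest_ind(base, test)
print(f"t = {t_stat:.2f}, p-value = {p:.4f}")
```

Adding lagged A materially shrinks the residuals, so the t-test rejects at conventional levels.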
Slide 20
Causation Part II: Path Analysis
We’ll illustrate Path Analysis through an example inspired by a book.
The author’s theory is that human capital causes the homeownership rate to decline.
Slide 21
Path Analysis: Direct and Indirect Effects
The correlation of the independent variable with the dependent variable can be decomposed into its Direct Effect and its Indirect Effect. The Indirect Effect is derived from the intermediary variables sitting between the independent and dependent variables. The Causal Effect is the sum of the two Effects and should equal the correlation:

Correlation = Causal Effect = Direct Effect + Indirect Effect
Slide 22
The Path Analysis Diagram
The Path Analysis Diagram defines our hypothesis. Human Capital has an effect on:
• Housing Affordability (-), as highly educated wage earners bid up the prices of homes;
• Demographics/Youth (-); and
• Unemployment (-), as Human Capital lowers unemployment.
In turn, those intermediary variables have an effect on the Homeownership rate:
• Housing Affordability (+): if homes are more affordable, homeownership goes up;
• Demographics/Youth (% of population between 20 and 29) (-), as younger people starting out can ill afford homes; and
• Unemployment (-), as the unemployed lack the income to buy homes.
Independent variable -> Intermediary variables -> Dependent variable:

Human Capital --(-)--> Housing affordability --(+)--> Homeownership
Human Capital --(-)--> Demographic (youth)   --(-)--> Homeownership
Human Capital --(-)--> Unemployment          --(-)--> Homeownership
Slide 23
The Actual Correlations
We embedded the correlations within the diagram. We also added a correlation directly from Human Capital to Homeownership. Most correlation signs support the hypothesis; the exception is Unemployment.
Correlations:
Human Capital --(-0.181)--> Housing affordability --(0.573)-->  Homeownership
Human Capital --(-0.064)--> Youth                 --(-0.249)--> Homeownership
Human Capital --(-0.250)--> Unemployment          --(0.064)-->  Homeownership
Human Capital --(-0.176)--> Homeownership (direct correlation)
Correlation matrix using standardized variables:
                        Human cap.  Housing aff.  Young pop.  Unemploy.  Homeownership
Human capital               1         -0.181        -0.064      -0.250      -0.176
Housing affordability    -0.181          1          -0.102       0.301       0.573
Young population         -0.064       -0.102           1        -0.167      -0.249
Unemployment rate        -0.250        0.301        -0.167         1         0.064
Homeownership rate       -0.176        0.573        -0.249       0.064         1
Slide 24
Path Analysis
With standardized variables, within a single bivariate relationship the correlation is equal to the slope:

Correlation = COVAR(X, Y) / (sX × sY)
Slope = COVAR(X, Y) / VAR(X)
If both s = 1, then Correlation = Slope.
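A quick numerical check of this identity on synthetic data (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
y = 0.6 * x + 0.8 * rng.normal(size=1000)

# Standardize both variables to mean 0, standard deviation 1
xs = (x - x.mean()) / x.std()
ys = (y - y.mean()) / y.std()

correlation = np.corrcoef(xs, ys)[0, 1]
slope = np.cov(xs, ys, bias=True)[0, 1] / xs.var()   # VAR(xs) = 1
print(np.isclose(correlation, slope))  # True
```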
Slide 25
The Path Coefficients
SUMMARY OUTPUT
Regression statistics:
Multiple R           0.633
R Square             0.401
Adjusted R Square    0.375
Standard Error       0.785
Observations           111

                        Coefficient   Std. Error   t Stat    P-value
Human capital             -0.130        0.078      -1.665     0.099
Housing affordability      0.581        0.079       7.334     0.000
Young population          -0.228        0.077      -2.976     0.004
Unemployment              -0.181        0.081      -2.227     0.028
Given that the variables are standardized, the bivariate correlations on the left side of the diagram already serve as path coefficients. We calculate the remaining path coefficients (the links into Homeownership) with a regression model whose dependent variable is the Homeownership rate.
Path coefficients:
Human Capital --(-0.181)--> Housing affordability --(0.581)-->  Homeownership
Human Capital --(-0.064)--> Youth                 --(-0.228)--> Homeownership
Human Capital --(-0.250)--> Unemployment          --(-0.181)--> Homeownership
Human Capital --(-0.130)--> Homeownership (direct path)
Slide 26
Human Capital Direct and Indirect Effects
Human Capital’s causal effect (-0.176) on Homeownership equals its correlation.
Decomposing correlations into indirect and direct effects:

Human Capital indirect effect on Homeownership               A        B       A × B
Human Cap. -> Housing affordability -> Homeownership      -0.181    0.581    -0.105
Human Cap. -> Youth -> Homeownership                      -0.064   -0.228     0.015
Human Cap. -> Unemployment -> Homeownership               -0.250   -0.181     0.045
Total indirect effect                                                        -0.045

Human Capital direct effect on Homeownership                                 -0.130

Human Capital causal effect on Homeownership:
a) Indirect effect    -0.045
b) Direct effect      -0.130
Total causal effect   -0.176
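With standardized variables, the path coefficients are the solution of the normal equations R_xx · β = r_xy built from the correlation matrix on slide 23, and the direct + indirect decomposition recovers the raw correlation exactly. A numpy check using the deck's numbers:

```python
import numpy as np

# Correlation matrix of the predictors (standardized variables), slide 23:
# order: human capital, housing affordability, young population, unemployment
r_xx = np.array([
    [ 1.000, -0.181, -0.064, -0.250],
    [-0.181,  1.000, -0.102,  0.301],
    [-0.064, -0.102,  1.000, -0.167],
    [-0.250,  0.301, -0.167,  1.000],
])
r_xy = np.array([-0.176, 0.573, -0.249, 0.064])  # correlations with homeownership

# Path coefficients = standardized regression coefficients
beta = np.linalg.solve(r_xx, r_xy)
print(np.round(beta, 2))            # close to [-0.13, 0.58, -0.23, -0.18]

# Human Capital: direct effect plus effects routed through the intermediaries
direct = beta[0]
indirect = sum(r_xx[0, j] * beta[j] for j in (1, 2, 3))
print(round(direct + indirect, 3))  # -0.176, the raw correlation
```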
Slide 27
Take-Aways
• Testing many hypotheses -> Adjust the p-value threshold
• Dealing with non-normal data in hypothesis testing -> Mann-Whitney test
• Evaluating statistical significance -> Effect size: Cohen’s d value, Gamma Index
• Tackling causality -> Granger Causality, Path Analysis