DATA CONFUSION
How to confuse yourself and others with Data Analysis
AGENDA FOR TODAY’S TALK
• Good Graphs – Bad Graphs
• The Law of Averages
• PTBD Analysis
• Enumerative & Analytical Problems
• PARC Analysis
• Wrong Methods of Analysis
“There are three kinds of lies:
Lies, damned lies and statistics”
Attributed to Benjamin Disraeli by Mark Twain
GOOD GRAPHS AND BAD GRAPHS
DATA RELEVANCE
• Graphs are only as good as the data they display
• No amount of creativity can produce good graphs from dubious data
DATA CONTENT
• Don’t produce graphs from very small amounts of data
• The human brain can grasp 1, 2 or 3 numbers without a graph
RULES FOR PRODUCING GOOD GRAPHS
• KEEP IT SIMPLE AND STUPID – Jesse Ventura
• Tell the truth – don’t distort the data
GOOD GRAPHS
• Portray information without distortion
• Contain no distracting elements
– No false third dimensions, irrelevant decoration, or colour (chartjunk)
• Use an appropriate scale
• Label axes and tick marks properly, including measurement units
• Have a descriptive title and/or caption and legend
• Have a low ink-to-information ratio
[Figure: four versions of the line chart "Temperature (degC) of Air and Subject during one day" (x-axis: Time of Day, 6 am to 6 am; y-axis: Temperature (degC); series: Air and Subject), drawn with y-axis ranges of 15 to 40, 0 to 40 and 0 to 100. Slide labels: BAD GRAPH, GOOD GRAPH, BAD GRAPH, EVEN BETTER GRAPH.]
[Figure: the same data for groups A, B, C, D and E shown three ways: a chart with a 0 to 18 axis, "Boxplot of A, B, C, D, E" and "Dotplot of A, B, C, D, E" (Data axis from 11.5 to 17.0). Slide labels: BAD GRAPH, GOOD GRAPH, GOOD GRAPH. Project: more chewchat data.MPJ; Worksheet: Worksheet 1; 12/04/2006; Graham Errington]
GRAPHS THAT CONFUSE
[Figure: three charts titled MONTHLY REJECTS showing No. REJECTS by MONTH (Jan to Dec) for the years 2001 to 2005, with y-axes of 0 to 9000, 0% to 100% and 0 to 35000.]
CHART JUNK
[Figure: further charts of MONTHLY REJECTS (No. REJECTS by MONTH, Jan to Dec, for 2001 to 2005), including a version with a 0 to 8000 axis and a data label printed on every one of the 60 monthly values, a version with a 0 to 35000 axis and the months listed down the side, and a version with a 0% to 100% axis.]
GRAPHS THAT TELL A STORY
[Figure: "Time Series Plot of No. REJECTS": No. REJECTS (3000 to 8000) by YEAR and MONTH, 2001 to 2005. Project: Untitled; Worksheet: Worksheet 3; 04/04/2006; Graham Errington]
[Figure: "I-MR Chart of No. REJECTS by YEAR". Individual Value chart: X-bar = 5926, UCL = 8406, LCL = 3445. Moving Range chart: MR-bar = 933, UCL = 3048, LCL = 0. Several points are flagged with test numbers 1 and 2. Project: Data for ChewChat 13 Apr 2006.MPJ; Worksheet: Worksheet 3; 04/04/2006; Graham Errington]
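The limits on the I-MR chart can be approximately reproduced from its two centre lines. The sketch below assumes the usual individuals-chart constants for a moving range of span 2 (2.66 and 3.267); the function name `imr_limits` is illustrative and is not taken from the slides. Because the centre lines printed on the chart are rounded, the results agree with the chart only to within a unit or two.

```python
# Standard constants for an individuals / moving-range chart (moving range of span 2).
E2 = 2.66    # 3 / d2, where d2 = 1.128
D4 = 3.267   # upper-limit factor for the moving-range chart


def imr_limits(x_bar, mr_bar):
    """Return I-MR control limits computed from the two centre lines."""
    return {
        "I_UCL": x_bar + E2 * mr_bar,
        "I_LCL": x_bar - E2 * mr_bar,
        "MR_UCL": D4 * mr_bar,
        "MR_LCL": 0.0,
    }


# Centre lines as printed on the chart: X-bar = 5926, MR-bar = 933.
limits = imr_limits(5926, 933)
for name, value in limits.items():
    print(f"{name}: {value:.0f}")
```

Points outside these limits are signals worth investigating; points inside them are most likely routine variation.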
HISTOGRAMS
• No meaningless gaps
• Reasonable choice of bins
• Easy to choose or adjust bins
• Good aspect ratio
• Meaningful labels on axes
• Appropriate labels on bin tick marks
[Figure: two histograms of the same data. The first, titled "Histogram", has x-axis "Bin" with unrounded bin labels (3505, 4757.8..., 6010.7..., 7263.5...) and y-axis "Frequency" (0 to 20). The second, "Histogram of C20", has x-axis "Data" (4000 to 8000) and y-axis "Frequency" (0 to 14). Project: DATA FOR CHEWCHAT 13 APR 2006.MPJ; Worksheet: Worksheet 1; 07/04/2006; Graham Errington]
TRENDING RANDOM VARIATION
“Upward trend”
“Downturn”
“Rebound”
“Setback”
“Turnaround”
“Downward trend”
THE LAW OF AVERAGES
“If I sit in a freezer and plunge my head into a pan of boiling chip fat. . . . .
on average, I’m quite comfortable.”
SHEWHART’S RULES FOR PRESENTATION OF DATA
• Rule One
– Data should always be presented in a way that preserves the evidence in the data
• Rule Two
– When an average, standard deviation or histogram is used to summarize data, the user should not be misled into taking action they would not take if the data were presented in a time series
USING THE WRONG METHODS
Descriptive Statistics: A, B, C, D
Variable   N    Mean  StDev  CoefVar  Minimum  Maximum
A         20  11.950  0.102     0.85    11.83    12.08
B         20  11.950  0.100     0.84    11.85    12.25
C         20  11.950  0.102     0.86    11.75    12.15
D         20  11.950  0.100     0.84    11.81    12.14
Process:      A      B      C      D
 1        11.85  11.85  11.75  12.14
 2        11.83  11.86  11.95  12.01
 3        11.87  11.87  11.80  11.88
 4        11.84  11.87  11.94  12.07
 5        11.85  11.88  11.95  11.95
 6        11.86  11.89  12.00  11.87
 7        11.85  11.89  12.05  12.06
 8        11.85  11.90  11.85  11.94
 9        11.84  11.92  11.94  11.84
10        11.86  11.91  11.85  12.05
11        12.05  11.93  12.05  11.93
12        12.06  11.93  11.85  11.83
13        12.03  11.95  12.05  12.04
14        12.02  11.97  11.95  11.92
15        12.03  11.96  11.95  11.82
16        12.04  11.99  11.95  12.03
17        12.06  12.00  11.85  11.91
18        12.06  12.00  12.10  11.81
19        12.04  12.16  12.00  12.01
20        12.08  12.25  12.15  11.81
NO SIGNIFICANT DIFFERENCE HERE!
One-way ANOVA: A, B, C, D

Source  DF      SS      MS     F      P
Factor   3  0.0000  0.0000  0.00  1.000
Error   76  0.7746  0.0102
Total   79  0.7746

S = 0.1010   R-Sq = 0.00%   R-Sq(adj) = 0.00%

                          Individual 95% CIs For Mean Based on Pooled StDev
Level   N    Mean  StDev  --------+---------+---------+---------+-
A      20  11.950  0.102  (-----------------*-----------------)
B      20  11.950  0.100  (-----------------*-----------------)
C      20  11.950  0.102  (-----------------*-----------------)
D      20  11.950  0.100  (-----------------*-----------------)
                          --------+---------+---------+---------+-
                             11.925    11.950    11.975    12.000

Pooled StDev = 0.101
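The summary statistics and the ANOVA result can be checked directly from the Process table. The sketch below uses only the Python standard library and computes the one-way ANOVA F statistic by hand. The values are the two-decimal figures from the table, so the results may differ from the Minitab output in the last digit; the variable names are illustrative.

```python
from statistics import mean, stdev

# Values copied from the "Process: A B C D" table, in sample order 1 to 20.
data = {
    "A": [11.85, 11.83, 11.87, 11.84, 11.85, 11.86, 11.85, 11.85, 11.84, 11.86,
          12.05, 12.06, 12.03, 12.02, 12.03, 12.04, 12.06, 12.06, 12.04, 12.08],
    "B": [11.85, 11.86, 11.87, 11.87, 11.88, 11.89, 11.89, 11.90, 11.92, 11.91,
          11.93, 11.93, 11.95, 11.97, 11.96, 11.99, 12.00, 12.00, 12.16, 12.25],
    "C": [11.75, 11.95, 11.80, 11.94, 11.95, 12.00, 12.05, 11.85, 11.94, 11.85,
          12.05, 11.85, 12.05, 11.95, 11.95, 11.95, 11.85, 12.10, 12.00, 12.15],
    "D": [12.14, 12.01, 11.88, 12.07, 11.95, 11.87, 12.06, 11.94, 11.84, 12.05,
          11.93, 11.83, 12.04, 11.92, 11.82, 12.03, 11.91, 11.81, 12.01, 11.81],
}

# Descriptive statistics: these ignore the order of the samples entirely.
for name, values in data.items():
    print(f"{name}: n={len(values)} mean={mean(values):.2f} sd={stdev(values):.2f}")

# One-way ANOVA by hand: between-group and within-group sums of squares.
all_values = [x for values in data.values() for x in values]
grand_mean = mean(all_values)
k = len(data)               # number of groups
n_total = len(all_values)   # total number of observations

ss_factor = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in data.values())
ss_error = sum((x - mean(v)) ** 2 for v in data.values() for x in v)

f_stat = (ss_factor / (k - 1)) / (ss_error / (n_total - k))
print(f"F = {f_stat:.2f} on {k - 1} and {n_total - k} degrees of freedom")
```

Neither calculation uses the sample number, so any pattern over time is invisible to both.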
NO DIFFERENCE?!?
FOUR PROCESSES WITH SAME MEAN AND SD
Mean = 11.95, SD = .10
[Figure: four time series panels (A, B, C, D), each plotting the measured value (11.8 to 12.2) against Sample (1 to 20). Project: ENUMERATIVE VS ANALYTICAL STUDIES.MPJ; Worksheet: Worksheet 1; 05/04/2006; Graham Errington]
ALWAYS CARRY OUT PTBD ANALYSIS
PLOT THE B….. DOTS!
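A minimal sketch of PTBD analysis on the four processes, assuming nothing beyond the Python standard library: it prints a crude text run chart for each process and compares the mean of samples 1 to 10 with the mean of samples 11 to 20. The helper `text_run_chart` and its plotting range of 11.7 to 12.3 are illustrative choices, not part of the original talk, and a real plot would normally be drawn with a charting tool.

```python
from statistics import mean

# Values copied from the "Process: A B C D" table, in sample order 1 to 20.
data = {
    "A": [11.85, 11.83, 11.87, 11.84, 11.85, 11.86, 11.85, 11.85, 11.84, 11.86,
          12.05, 12.06, 12.03, 12.02, 12.03, 12.04, 12.06, 12.06, 12.04, 12.08],
    "B": [11.85, 11.86, 11.87, 11.87, 11.88, 11.89, 11.89, 11.90, 11.92, 11.91,
          11.93, 11.93, 11.95, 11.97, 11.96, 11.99, 12.00, 12.00, 12.16, 12.25],
    "C": [11.75, 11.95, 11.80, 11.94, 11.95, 12.00, 12.05, 11.85, 11.94, 11.85,
          12.05, 11.85, 12.05, 11.95, 11.95, 11.95, 11.85, 12.10, 12.00, 12.15],
    "D": [12.14, 12.01, 11.88, 12.07, 11.95, 11.87, 12.06, 11.94, 11.84, 12.05,
          11.93, 11.83, 12.04, 11.92, 11.82, 12.03, 11.91, 11.81, 12.01, 11.81],
}


def text_run_chart(values, low=11.7, high=12.3, width=40):
    """Return one text line per sample, with a '*' positioned by its value."""
    lines = []
    for i, v in enumerate(values, start=1):
        col = round((v - low) / (high - low) * (width - 1))
        lines.append(f"{i:2d} |" + " " * col + "*")
    return lines


for name, values in data.items():
    first, second = values[:10], values[10:]
    print(f"\nProcess {name}: mean of samples 1-10 = {mean(first):.3f}, "
          f"mean of samples 11-20 = {mean(second):.3f}")
    print("\n".join(text_run_chart(values)))
```

Read top to bottom, the charts show behaviour over time that the shared mean and standard deviation conceal, for example the step change in process A after sample 10 and the steady drift in process B.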
TYPES OF STATISTICAL STUDIES
• Descriptive
• Enumerative
• Analytic
DESCRIPTIVE STUDY
• Count all fish in barrel
• Count number of goldfish
• Proportion of goldfish applies to the fish population in this barrel and no other barrels of fish
ENUMERATIVE STUDY
• Take a sample of fish from the barrel, and count the number of goldfish in the sample
• Point estimate of the proportion of goldfish in the barrel population
• Many statistical procedures do this
• Cannot make any inference about any other barrels of fish
ANALYTICAL STUDY
• Will we get the same proportion of goldfish in the future as we got in the past?
• An analytical study allows prediction within limits
Fish Packing Process over Time
ANALYTICAL STUDY
• Proportion of goldfish is stable over time
• Fish packing process is predictable within limits
• We can expect, on average, 4 goldfish per barrel, but as many as 10 and as few as 0 in any single barrel
[Figure: "C Chart of No goldfish per Barrel": Sample Count (0 to 10) by Week No.; C-bar = 4, UCL = 10, LCL = 0. Project: ENUMERATIVE VS ANALYTICAL STUDIES.MPJ; Worksheet: Fish in Barrel; 05/04/2006; Graham Errington]
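The "as many as 10 and as few as 0" limits follow from the standard c-chart formula, which places the limits three standard deviations either side of the average count and takes the standard deviation of a count to be the square root of its mean. A minimal sketch, with the illustrative function name `c_chart_limits`:

```python
import math


def c_chart_limits(c_bar):
    """Return (LCL, UCL) for a c chart with average count c_bar."""
    sigma = math.sqrt(c_bar)            # counts: variance is taken equal to the mean
    ucl = c_bar + 3 * sigma
    lcl = max(0.0, c_bar - 3 * sigma)   # a count cannot fall below zero
    return lcl, ucl


lcl, ucl = c_chart_limits(4)            # average of 4 goldfish per barrel
print(f"LCL = {lcl:.0f}, centre line = 4, UCL = {ucl:.0f}")
```

The prediction only holds while the packing process stays stable, which is what the chart is there to check.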
ENUMERATIVE vs ANALYTICAL METHODS
• Enumerative methods
– seek to provide numeric summaries, confidence intervals, etc.
– use significance tests, ANOVA, descriptive stats, etc.
– assume single, stable population
• Analytical methods
– seek to understand the system under study
– use primarily graphical tools such as run charts, control charts, histograms, box plots, etc.
– in the real world, most problems are analytical
“Analysis of variance, t-tests, confidence intervals, and other statistical techniques taught in books,….., are inappropriate because they provide no basis for prediction and because they bury the information contained in the order of production.”
W.E. Deming, Out of the Crisis
Traditional statistical methods have their place, but are widely abused in the real world. When this is the case, statistics do more to cloud the issue than to enlighten.
PARC ANALYSIS
Practical Accumulated Records Compilation
Passive Analysis (by) Regression Correlations
Planning After Research Completed
Profound Analysis Relying (on) Computers
note inverse relationship with
Continuous Recording (of) Administrative Procedures
Constant Repetition (of) Anecdotal Perceptions
PLANNING A PROCESS IMPROVEMENT STUDY
• Why collect the data?
• What statistical methods for analysis?
• What data will be collected?
• How much data do we need?
• How will the data be measured?
• How good is the measurement system?
• When and where will data be collected?
• Who will collect the data?
• Remember:
GARBAGE IN – GARBAGE OUT
WHAT’S SIGNIFICANT?
Two-sample T for C1 vs C2
N Mean StDev SE Mean
A 5 13.652 0.487 0.22
B 5 14.369 0.646 0.29
Difference = mu (C1) - mu (C2)
Estimate for difference: -0.716615
95% CI for difference: (-1.551531, 0.118301)
T-Test of difference = 0 (vs not =): T-Value = -1.98 P-Value = 0.083 DF = 8
Both use Pooled StDev = 0.5725
Two-sample T for C3 vs C4
N Mean StDev SE Mean
A 200 13.510 0.501 0.035
B 200 13.667 0.492 0.035
Difference = mu (C3) - mu (C4)
Estimate for difference: -0.157292
95% CI for difference: (-0.254935, -0.059649)
T-Test of difference = 0 (vs not =): T-Value = -3.17 P-Value = 0.002 DF = 398
Both use Pooled StDev = 0.4967
Mean A = 13.7, Mean B = 14.4
Not significant?
Mean A = 13.5, Mean B = 13.7
Significant?
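The two results can be approximately reproduced from the summary statistics alone. The sketch below computes the pooled two-sample t statistic and also expresses each difference in units of the pooled standard deviation. The function name `pooled_t` is illustrative, and because the inputs are the rounded values shown in the output, the results may differ from Minitab's in the last digit.

```python
import math


def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Return (difference, pooled SD, t statistic, degrees of freedom)."""
    df = n1 + n2 - 2
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    std_error = pooled_sd * math.sqrt(1 / n1 + 1 / n2)
    diff = mean1 - mean2
    return diff, pooled_sd, diff / std_error, df


# Summary statistics copied from the two Minitab outputs.
small = pooled_t(5, 13.652, 0.487, 5, 14.369, 0.646)
large = pooled_t(200, 13.510, 0.501, 200, 13.667, 0.492)

for label, (diff, sp, t, df) in (("n = 5", small), ("n = 200", large)):
    print(f"{label}: difference = {diff:.3f}, pooled SD = {sp:.3f}, "
          f"t = {t:.2f}, df = {df}, difference / SD = {diff / sp:.2f}")
```

The larger difference comes from the smaller samples and is not significant at the 5% level, while the much smaller difference is significant because each group has 200 observations. Statistical significance depends on sample size as well as on the size of the effect.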
WHAT SHOULD I DO WITH OUTLIERS?
• Data point far away from the rest of the data
• Don’t remove outliers to make data “look good”
• Do you know why it is different?
• If you do, remove it. If you don’t, leave it in
• Could have a big impact on the analysis
• Re-run the analysis without the outlier, and compare the results
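A minimal sketch of the "re-run and compare" step. The measurements are hypothetical, invented purely for illustration, with the last value standing in for a suspected outlier; `summarize` is an illustrative helper.

```python
from statistics import mean, median, stdev

# Hypothetical measurements (not from the talk); the last value is the suspected outlier.
values = [12.1, 11.9, 12.0, 12.2, 11.8, 12.0, 12.1, 15.0]


def summarize(xs):
    """Return the basic summaries to compare with and without the outlier."""
    return {"n": len(xs), "mean": mean(xs), "median": median(xs), "sd": stdev(xs)}


with_outlier = summarize(values)
without_outlier = summarize(values[:-1])

for key in ("n", "mean", "median", "sd"):
    print(f"{key:>6}: with = {with_outlier[key]:.3f}, without = {without_outlier[key]:.3f}")
```

If the two sets of results lead to the same conclusion, the outlier matters little. If they differ, report both and explain why.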
“REGRESSION” WITH EXCEL
• Usually means drawing an X-Y plot, fitting a straight line and coming up with an R² value.
• As long as R² is high, everything’s hunky-dory.
WRONG!
“REGRESSION” WITH EXCEL
[Figure: scatter plot "Defects vs Cure Time" with fitted line y = 0.1913x - 5.5192, R² = 0.5079. x-axis: Cure Time (s), 20 to 50; y-axis: No. of Defects, -2 to 6.]
Relationship is clearly not linear, and should not be presented as such
“REGRESSION” WITH EXCEL
• Regression model checking – in Excel?
• Residual plots:
– Normally distributed
– Random pattern when plotted vs fitted values
[Figure: three residual plots, labelled "OK", "Variance not homogeneous" and "Model incorrect".]
PITFALLS OF REGRESSION ANALYSIS
• Non-Linear Relationships
• Influential Points
• Extrapolating
• Lurking Variables
• Summary Data
• Assuming Causation
• THAT’S (WITH REASONABLE PROBABILITY) THE END FOLKS!
And remember,
• With statistics, you never have to say you’re certain!
• THANK YOU FOR YOUR ATTENTION
• ARE THERE ANY QUESTIONS?
• GOOD LUCK!!