
DATA CONFUSION

How to confuse yourself and others with Data Analysis

AGENDA FOR TODAY’S TALK

• Good Graphs – Bad Graphs

• The Law of Averages

• PTBD Analysis

• Enumerative & Analytical Problems

• PARC Analysis

• Wrong Methods of Analysis

“There are three kinds of lies: lies, damned lies and statistics”

Attributed to Benjamin Disraeli by Mark Twain

GOOD GRAPHS AND BAD GRAPHS

DATA RELEVANCE

• Graphs are only as good as the data they display

• No amount of creativity can produce good graphs from dubious data

DATA CONTENT

• Don’t produce graphs from very small amounts of data

• The human brain can grasp 1, 2 or 3 numbers without a graph

RULES FOR PRODUCING GOOD GRAPHS

• KEEP IT SIMPLE AND STUPID – Jesse Ventura

• Tell the truth – don’t distort the data

GOOD GRAPHS

• Portray information without distortion

• Contain no distracting elements

– No false third dimensions, irrelevant decoration, or colour (chartjunk)

• Use an appropriate scale

• Label axes and tick marks properly, including measurement units

• Have a descriptive title and/or caption and legend

• Have a low ink-to-information ratio

[Figure: Temperature (degC) of Air and Subject during one day. The same two series, Air and Subject, are plotted as Temperature (degC) against Time of Day (6 am, Noon, 6 pm, Midnight, 6 am) in four versions with temperature scales of 15 to 40, 0 to 40, 0 to 100 and 15 to 40. The slides caption the versions BAD GRAPH, GOOD GRAPH, BAD GRAPH and EVEN BETTER GRAPH.]

[Figure: the same data for groups A, B, C, D and E drawn three ways: a chart with a 0 to 18 scale (BAD GRAPH), a Boxplot of A, B, C, D, E (GOOD GRAPH) and a Dotplot of A, B, C, D, E (GOOD GRAPH). The boxplot and dotplot use a Data scale of roughly 11.5 to 17.]

Project: more chewchat data.MPJ; Worksheet: Worksheet 1; 12/04/2006; Graham Errington

GRAPHS THAT CONFUSE

[Figure: MONTHLY REJECTS, No. REJECTS against MONTH (Jan to Dec) for the years 2001 to 2005, drawn three ways: with a 0 to 9000 scale, as a 0% to 100% chart, and with a 0 to 35000 scale.]

CHART JUNK


[Figure: further versions of the MONTHLY REJECTS chart (No. REJECTS against MONTH, 2001 to 2005), including one with every data label printed, one with a 0 to 35000 scale and one with a 0% to 100% scale.]

Data labels printed on the chart (No. REJECTS, one row per year in legend order, months Jan to Dec):

Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

2001 5000 4567 4476 4725 5104 5157 5186 5186 5214 5811 5940 5940

2002 3505 5321 4205 4601 5452 5458 7174 6913 4205 4734 4605 5157

2003 6852 7732 5214 5163 4725 5301 5801 6261 6009 5872 5693 4476

2004 4503 5432 5452 4247 4745 3633 6261 6009 5872 5693 3976 4432

2005 6666 7890 5811 4989 4526 6413 7174 7243 6009 4870 4989 4526

GRAPHS THAT TELL A STORY

[Figure: Time Series Plot of No. REJECTS against YEAR and MONTH (ticks every second month from Jan 2001 to Nov 2005), with a 3000 to 8000 scale.]

Project: Untitled; Worksheet: Worksheet 3; 04/04/2006; Graham Errington

[Figure: I-MR Chart of No. REJECTS by YEAR, 2001 to 2005. Individual Value panel: X-bar = 5926, UCL = 8406, LCL = 3445. Moving Range panel: MR-bar = 933, UCL = 3048, LCL = 0. Several points carry test flags 1 and 2.]

Project: Data for ChewChat 13 Apr 2006.MPJ; Worksheet: Worksheet 3; 04/04/2006; Graham Errington
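The limits printed on the I-MR chart can be reproduced by hand. The following is a minimal sketch, not the software's own routine. It assumes that the last twelve data labels on the MONTHLY REJECTS chart are the 2005 values in month order. That assignment is an inference from the extracted labels, although it does reproduce the chart's X-bar of 5926 and MR-bar of 933. The constants d2 = 1.128 and D4 = 3.267 are the usual ones for moving ranges of two points, and the function name is illustrative.

```python
from statistics import mean

# Assumed 2005 monthly rejects, Jan..Dec (inferred from the chart's data labels).
rejects_2005 = [6666, 7890, 5811, 4989, 4526, 6413,
                7174, 7243, 6009, 4870, 4989, 4526]

D2 = 1.128   # bias-correction constant for moving ranges of two points
D4 = 3.267   # upper-limit factor for the moving-range chart (n = 2)


def imr_limits(values):
    """Return centre lines and control limits for an I-MR chart."""
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    x_bar = mean(values)
    mr_bar = mean(moving_ranges)
    sigma_hat = mr_bar / D2            # short-term estimate of sigma
    return {
        "x_bar": x_bar,
        "x_ucl": x_bar + 3 * sigma_hat,
        "x_lcl": x_bar - 3 * sigma_hat,
        "mr_bar": mr_bar,
        "mr_ucl": D4 * mr_bar,
        "mr_lcl": 0.0,
    }


limits = imr_limits(rejects_2005)
for name, value in limits.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions the individuals limits agree with the chart to the nearest unit, and the moving-range upper limit lands within a unit or so of the chart's 3048. The point of the exercise is that the limits come from point-to-point variation, not from the overall standard deviation.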

HISTOGRAMS

• No meaningless gaps

• Reasonable Choice of bins

• Easy to choose or adjust bins

• Good aspect ratio

• Meaningful labels on axes

• Appropriate labels on bin tick marks

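[Figure: two histograms of the same data. The first, titled "Histogram", plots Frequency (0 to 20) against Bin, with bin labels such as 3505, 4757.8571, 6010.7142 and 7263.5714. The second, "Histogram of C20", plots Frequency (0 to 14) against Data (4000 to 8000).]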

Project: DATA FOR CHEWCHAT 13 APR 2006.MPJ; Worksheet: Worksheet 1; 07/04/2006; Graham Errington

TRENDING RANDOM VARIATION

“Upward trend”

“Downturn”

“Rebound”

“Setback”

“Turnaround”

“Downward trend”

THE LAW OF AVERAGES

“If I sit in a freezer and plunge my head into a pan of boiling chip fat. . . . .

on average, I’m quite comfortable.”

SHEWHART’S RULES FOR PRESENTATION OF DATA

• Rule One

– Data should always be presented in a way that preserves the evidence in the data

• Rule Two

– When an average, standard deviation or histogram is used to summarize data, the user should not be misled into taking action they would not take if the data were presented in a time series

USING THE WRONG METHODS

Descriptive Statistics: A, B, C, D

Variable N Mean StDev CoefVar Minimum Maximum

A 20 11.950 0.102 0.85 11.83 12.08

B 20 11.950 0.100 0.84 11.85 12.25

C 20 11.950 0.102 0.86 11.75 12.15

D 20 11.950 0.100 0.84 11.81 12.14

Process: A B C D

1 11.85 11.85 11.75 12.14

2 11.83 11.86 11.95 12.01

3 11.87 11.87 11.8 11.88

4 11.84 11.87 11.94 12.07

5 11.85 11.88 11.95 11.95

6 11.86 11.89 12 11.87

7 11.85 11.89 12.05 12.06

8 11.85 11.9 11.85 11.94

9 11.84 11.92 11.94 11.84

10 11.86 11.91 11.85 12.05

11 12.05 11.93 12.05 11.93

12 12.06 11.93 11.85 11.83

13 12.03 11.95 12.05 12.04

14 12.02 11.97 11.95 11.92

15 12.03 11.96 11.95 11.82

16 12.04 11.99 11.95 12.03

17 12.06 12 11.85 11.91

18 12.06 12 12.1 11.81

19 12.04 12.16 12 12.01

20 12.08 12.25 12.15 11.81

NO SIGNIFICANT DIFFERENCE HERE!

One-way ANOVA: A, B, C, D

Source DF SS MS F P

Factor 3 0.0000 0.0000 0.00 1.000

Error 76 0.7746 0.0102

Total 79 0.7746

S = 0.1010 R-Sq = 0.00% R-Sq(adj) = 0.00%

Individual 95% CIs For Mean Based on Pooled StDev

Level N Mean StDev --------+---------+---------+---------+-

A 20 11.950 0.102 (-----------------*-----------------)

B 20 11.950 0.100 (-----------------*-----------------)

C 20 11.950 0.102 (-----------------*-----------------)

D 20 11.950 0.100 (-----------------*-----------------)

--------+---------+---------+---------+-

11.925 11.950 11.975 12.000

Pooled StDev = 0.101
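The ANOVA result follows directly from the group summaries, which is why it cannot see anything the summaries do not. A minimal sketch, assuming only the N, Mean and StDev columns of the Descriptive Statistics table (function and variable names are illustrative):

```python
from math import sqrt

# (n, mean, standard deviation) copied from the Descriptive Statistics table.
groups = {
    "A": (20, 11.950, 0.102),
    "B": (20, 11.950, 0.100),
    "C": (20, 11.950, 0.102),
    "D": (20, 11.950, 0.100),
}


def one_way_anova(summary):
    """One-way ANOVA computed from per-group n, mean and SD only."""
    n_total = sum(n for n, _, _ in summary.values())
    grand_mean = sum(n * m for n, m, _ in summary.values()) / n_total
    # Between-group sum of squares: zero when all group means coincide.
    ss_factor = sum(n * (m - grand_mean) ** 2 for n, m, _ in summary.values())
    # Within-group sum of squares, rebuilt from the group SDs.
    ss_error = sum((n - 1) * sd ** 2 for n, _, sd in summary.values())
    df_factor = len(summary) - 1
    df_error = n_total - len(summary)
    ms_error = ss_error / df_error
    f_stat = (ss_factor / df_factor) / ms_error
    return ss_factor, ss_error, f_stat, sqrt(ms_error)


ss_factor, ss_error, f_stat, pooled_sd = one_way_anova(groups)
print(f"SS factor = {ss_factor:.4f}, SS error = {ss_error:.4f}")
print(f"F = {f_stat:.2f}, pooled SD = {pooled_sd:.3f}")
```

Because the four means are identical, the factor sum of squares and F are zero whatever order the readings arrived in. The error sum of squares differs slightly from the printed 0.7746 only because the SDs in the table are rounded.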

NO DIFFERENCE?!?

[Figure: time series plots of processes A, B, C and D in four panels, each with a value scale of 11.8 to 12.2 plotted against Sample (2 to 20).]

Project: ENUMERATIVE VS ANALYTICAL STUDIES.MPJ; Worksheet: Worksheet 1; 05/04/2006; Graham Errington

FOUR PROCESSES WITH SAME MEAN AND SD

Mean = 11.95, SD = 0.10

ALWAYS CARRY OUT PTBD ANALYSIS

PLOT THE B….. DOTS!
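Where a plot is not to hand, even a crude order-sensitive number shows what the mean and SD hide. The sketch below uses the twenty readings per process from the data table. The mean moving range and the half-by-half means are illustrative choices made for this sketch, not statistics used in the talk.

```python
from statistics import mean, stdev

# Readings 1..20 for each process, copied from the data table.
processes = {
    "A": [11.85, 11.83, 11.87, 11.84, 11.85, 11.86, 11.85, 11.85, 11.84, 11.86,
          12.05, 12.06, 12.03, 12.02, 12.03, 12.04, 12.06, 12.06, 12.04, 12.08],
    "B": [11.85, 11.86, 11.87, 11.87, 11.88, 11.89, 11.89, 11.90, 11.92, 11.91,
          11.93, 11.93, 11.95, 11.97, 11.96, 11.99, 12.00, 12.00, 12.16, 12.25],
    "C": [11.75, 11.95, 11.80, 11.94, 11.95, 12.00, 12.05, 11.85, 11.94, 11.85,
          12.05, 11.85, 12.05, 11.95, 11.95, 11.95, 11.85, 12.10, 12.00, 12.15],
    "D": [12.14, 12.01, 11.88, 12.07, 11.95, 11.87, 12.06, 11.94, 11.84, 12.05,
          11.93, 11.83, 12.04, 11.92, 11.82, 12.03, 11.91, 11.81, 12.01, 11.81],
}


def mean_moving_range(values):
    """Average absolute change between consecutive readings (depends on order)."""
    return mean(abs(b - a) for a, b in zip(values, values[1:]))


for name, values in processes.items():
    print(f"{name}: mean={mean(values):.3f} sd={stdev(values):.3f} "
          f"first half={mean(values[:10]):.3f} "
          f"second half={mean(values[10:]):.3f} "
          f"mean moving range={mean_moving_range(values):.3f}")
```

The means and SDs agree to about two decimal places, while the order-sensitive columns differ sharply: process A steps up halfway through, and process D swings far more from one reading to the next than A does. A time series plot makes the same point faster.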

TYPES OF STATISTICAL STUDIES

• Descriptive

• Enumerative

• Analytic

DESCRIPTIVE STUDY

• Count all fish in barrel

• Count number of goldfish

• Proportion of goldfish applies to the fish population in this barrel and no other barrels of fish

ENUMERATIVE STUDY

• Take a sample of fish from the barrel, and count the number of goldfish in the sample

• Point estimate of the proportion of goldfish in the barrel population

• Many statistical procedures do this

• Cannot make any inference about any other barrels of fish

ANALYTICAL STUDY

• Will we get the same proportion of goldfish in the future as we got in the past?

• An analytical study allows prediction within limits

Fish Packing Process over Time

ANALYTICAL STUDY

• Proportion of goldfish is stable over time

• Fish packing process is predictable within limits

• We can expect, on average, 4 goldfish per barrel, but as many as 10 and as few as 0 in any single barrel

[Figure: C Chart of No goldfish per Barrel. Sample Count (0 to 10) against Week No. (ticks 1 to 19), with C-bar = 4, UCL = 10 and LCL = 0.]

Project: ENUMERATIVE VS ANALYTICAL STUDIES.MPJ; Worksheet: Fish in Barrel; 05/04/2006; Graham Errington
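The "as many as 10 and as few as 0" statement comes from the standard c chart limits, which sit three square roots of the mean count either side of the centre line. A minimal sketch, assuming counts that behave like a Poisson variable (the function name is illustrative):

```python
from math import sqrt


def c_chart_limits(c_bar):
    """Return (LCL, centre line, UCL) for a c chart with mean count c_bar."""
    half_width = 3 * sqrt(c_bar)
    lcl = max(0.0, c_bar - half_width)   # a count cannot fall below zero
    return lcl, c_bar, c_bar + half_width


lcl, centre, ucl = c_chart_limits(4)
print(f"LCL = {lcl:.0f}, centre = {centre}, UCL = {ucl:.0f}")
```

With a mean of 4 goldfish per barrel this gives 4 plus or minus 6, so the lower limit is truncated at zero and the upper limit is 10, matching the chart.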

ENUMERATIVE vs ANALYTICAL METHODS

• Enumerative methods

– seek to provide numeric summaries, confidence intervals, etc.

– use significance tests, ANOVA, descriptive stats, etc.

– assume single, stable population

• Analytical methods

– seek to understand the system under study

– use primarily graphical tools such as run charts, control charts, histograms, box plots, etc.

– in the real world, most problems are analytical

“Analysis of variance, t-tests, confidence intervals, and other statistical techniques taught in books,….., are inappropriate because they provide no basis for prediction and because they bury the information contained in the order of production.”

W.E. Deming, Out of the Crisis

Traditional statistical methods have their place, but are widely abused in the real world. When this is the case, statistics do more to cloud the issue than to enlighten.

PARC ANALYSIS

• Practical Accumulated Records Compilation

• Passive Analysis (by) Regression Correlations

• Planning After Research Completed

• Profound Analysis Relying (on) Computers

Note inverse relationship with:

• Continuous Recording (of) Administrative Procedures

• Constant Repetition (of) Anecdotal Perceptions

PLANNING A PROCESS IMPROVEMENT STUDY

• Why collect the data?

• What statistical methods for analysis?

• What data will be collected?

• How much data do we need?

• How will the data be measured?

• How good is the measurement system?

• When and where will data be collected?

• Who will collect the data?

• Remember:

GARBAGE IN – GARBAGE OUT

WHAT’S SIGNIFICANT?

Two-sample T for C1 vs C2

N Mean StDev SE Mean

A 5 13.652 0.487 0.22

B 5 14.369 0.646 0.29

Difference = mu (C1) - mu (C2)

Estimate for difference: -0.716615

95% CI for difference: (-1.551531, 0.118301)

T-Test of difference = 0 (vs not =): T-Value = -1.98 P-Value = 0.083 DF = 8

Both use Pooled StDev = 0.5725

Two-sample T for C3 vs C4

N Mean StDev SE Mean

A 200 13.510 0.501 0.035

B 200 13.667 0.492 0.035

Difference = mu (C3) - mu (C4)

Estimate for difference: -0.157292

95% CI for difference: (-0.254935, -0.059649)

T-Test of difference = 0 (vs not =): T-Value = -3.17 P-Value = 0.002 DF = 398

Both use Pooled StDev = 0.4967

Mean A = 13.7, Mean B = 14.4

Not significant?

Mean A = 13.5, Mean B = 13.7

Significant?
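The two outputs differ mainly in sample size, and the t statistic shows why. A minimal sketch that recomputes the pooled t statistic from the printed N, Mean and StDev values (the function name is illustrative; p-values are left out because they need a t-distribution routine that the standard library does not provide):

```python
from math import sqrt


def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Pooled two-sample t statistic from summary statistics."""
    df = n1 + n2 - 2
    pooled_sd = sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    std_error = pooled_sd * sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / std_error, std_error, df


small = pooled_t(5, 13.652, 0.487, 5, 14.369, 0.646)
large = pooled_t(200, 13.510, 0.501, 200, 13.667, 0.492)

for label, (t, se, df) in (("n = 5", small), ("n = 200", large)):
    print(f"{label}: t = {t:.2f}, standard error = {se:.3f}, DF = {df}")
```

The standard deviations are much the same in both cases, but the standard error of the difference shrinks with the square root of the sample size. A difference of about 0.16 with 200 readings per group therefore gives a larger t than a difference of about 0.72 with 5 per group. Whether either difference matters in practice is a separate question from whether it is statistically significant.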

WHAT SHOULD I DO WITH OUTLIERS?

• Data point far away from the rest of the data

• Don’t remove outliers to make data “look good”

• Do you know why it is different?

• If you do, remove it. If you don’t, leave it in

• Could have a big impact on the analysis

• Re-run the analysis without the outlier, and compare results
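The last step is easy to make routine. A minimal sketch with hypothetical measurements (the data and the choice of suspect point are invented for illustration and do not come from the talk):

```python
from statistics import mean, stdev

# Hypothetical measurements; the last value is the suspected outlier.
data = [13.2, 13.9, 13.5, 14.1, 13.7, 13.4, 17.8]
suspect_index = len(data) - 1

# Same data with the suspect point left out.
without = data[:suspect_index] + data[suspect_index + 1:]


def summarise(values):
    """Basic summary used to compare the two runs."""
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}


with_outlier = summarise(data)
without_outlier = summarise(without)

for label, s in (("with outlier", with_outlier),
                 ("without outlier", without_outlier)):
    print(f"{label}: n={s['n']} mean={s['mean']:.2f} sd={s['sd']:.2f}")
```

If the two summaries lead to the same conclusion, the outlier matters little. If they do not, the result hinges on one point, and that should be reported and not hidden.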

“REGRESSION” WITH EXCEL

• Usually means drawing an X-Y plot, fitting a straight line and coming up with an R² value.

• As long as R² is high, everything’s hunky-dory.

WRONG!

“REGRESSION” WITH EXCEL

[Figure: Defects vs Cure Time. No. of Defects (-2 to 6) against Cure Time in s (20 to 50), with fitted line y = 0.1913x - 5.5192 and R² = 0.5079.]

Relationship is clearly not linear, and should not be presented as such

“REGRESSION” WITH EXCEL

• Regression model checking – in Excel?

• Residual plots:

– Normally distributed

– Random pattern when plotted vs fitted values

[Figure: three residual plots, captioned "OK", "Variance not homogeneous" and "Model incorrect".]
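A high R² and a wrong model can coexist, and the residuals are what give it away. A minimal sketch using synthetic, deliberately curved data (y = x squared, invented for illustration and not taken from the talk), with an ordinary least-squares line fitted by hand:

```python
from statistics import mean

# Synthetic, deliberately curved data: y = x squared (not from the talk).
x = list(range(1, 11))
y = [xi ** 2 for xi in x]


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


slope, intercept = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((yi - mean(y)) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(f"y = {slope:.2f}x + {intercept:.2f}, R^2 = {r_squared:.3f}")
# A crude text residual plot: look at the signs against the fitted values.
for fi, ri in zip(fitted, residuals):
    print(f"fitted {fi:7.2f}  residual {ri:+6.2f}")
```

Here R² is well above 0.9, yet the residuals are positive at both ends and negative in the middle. That systematic pattern is the "Model incorrect" case: a straight line is the wrong description however good the R² looks.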

PITFALLS OF REGRESSION ANALYSIS

• Non-Linear Relationships

• Influential Points

• Extrapolating

• Lurking Variables

• Summary Data

• Assuming Causation

• THAT’S (WITH REASONABLE PROBABILITY) THE END FOLKS!

And remember,

• With statistics, you never have to say you’re certain!

• THANK YOU FOR YOUR ATTENTION

• ARE THERE ANY QUESTIONS?

• GOOD LUCK!!