Organizing and Displaying Data


Page 1: Organizing and Displaying Data. Data Files Data is almost always stored in a format where: ROWS are cases or individuals and COLUMNS are variables.


Data Files

Data is almost always stored in a format where:

ROWS are cases or individuals

and

COLUMNS are variables


ID EJEC SYSVOL DIAVOL OCCLU STEN TIME OUTCOME AGE SMOKE BETA CHOL SURG
390 72 36 131 0 0 143 0 49 2 2 59 0
279 52 74 155 37 63 143 0 54 2 2 68 1
391 62 52 137 33 47 16 2 56 2 2 52 0
201 50 165 329 33 30 143 0 42 2 2 39 0
202 50 47 95 0 100 143 0 46 2 2 74 1
69 27 124 170 77 23 143 0 57 2 2 NA 2
310 60 86 215 7 50 40 0 51 2 2 58 0
392 72 37 132 40 10 9 5 56 2 2 75 0
311 60 65 163 0 40 142 0 45 2 2 72 0
393 63 52 140 0 10 142 0 46 2 2 90 0
70 29 117 164 50 0 142 0 48 2 2 72 0
203 48 69 133 0 27 142 0 54 2 2 NA 0
394 59 54 133 30 13 142 0 39 2 1 NA 0
204 50 67 135 37 63 141 0 49 2 2 86 2
280 53 65 138 0 33 140 0 58 2 1 49 0
55 17 184 221 57 13 5 1 50 2 2 70 2
79 37 88 140 37 47 118 5 58 2 2 NA 0
205 45 106 193 33 43 140 0 47 1 1 38 1
206 43 85 150 0 50 23 5 51 2 2 61 0
312 60 59 149 7 37 139 0 43 2 1 56 0
80 38 103 168 47 43 100 1 55 2 2 62 1
281 57 53 124 0 57 140 0 58 2 1 93 0
207 44 68 121 27 60 139 0 55 2 2 63 1
282 51 53 109 0 77 139 0 41 2 2 45 4
396 63 58 157 0 73 139 0 51 2 2 60 0
208 49 81 157 13 13 139 0 49 2 2 60 0
209 48 58 112 0 0 72 1 56 2 2 57 0
283 58 71 167 27 0 138 0 45 2 1 46 0
210 42 92 159 0 0 139 0 57 2 2 58 0
397 68 50 156 0 100 138 0 51 2 1 NA 0
211 43 146 259 47 33 3 1 56 2 2 70 0
398 67 43 130 0 70 138 0 49 2 2 NA 3
284 52 70 146 0 23 137 0 47 1 2 NA 0
399 63 73 195 27 0 136 0 36 1 1 61 0
285 54 62 133 33 23 137 0 38 2 2 NA 0
71 37 93 148 47 0 137 0 59 2 2 NA 0
286 51 65 133 43 7 136 0 54 2 2 NA 0
212 42 95 163 40 10 109 3 57 2 2 NA 4
400 66 49 144 10 50 65 1 52 2 2 55 0
287 54 66 145 7 40 136 0 47 2 2 62 0
81 39 144 237 13 87 136 0 39 2 2 56 3
813 63 52 141 0 47 43 3 48 2 2 NA 0
68 30 219 314 33 45 76 1 53 1 2 NA 0
288 59 39 94 0 0 135 0 47 1 2 63 0
407 67 39 117 0 73 53 1 57 2 2 62 2

Complete Data Table on Male Heart Attack Patients


Portion of the Data Table on Male Heart Attack Patients

ID EJEC SYSVOL DIAVOL OCCLU STEN TIME OUTCOME AGE SMOKE BETA CHOL SURG
390 72 36 131 0 0 143 0 49 2 2 59 0
279 52 74 155 37 63 143 0 54 2 2 68 1
391 62 52 137 33 47 16 2 56 2 2 52 0
201 50 165 329 33 30 143 0 42 2 2 39 0
202 50 47 95 0 100 143 0 46 2 2 74 1
69 27 124 170 77 23 143 0 57 2 2 NA 2
310 60 86 215 7 50 40 0 51 2 2 58 0
392 72 37 132 40 10 9 5 56 2 2 75 0
311 60 65 163 0 40 142 0 45 2 2 72 0
393 63 52 140 0 10 142 0 46 2 2 90 0

Row 1: variable values for subject #390

Column 3: systolic volume (SYSVOL) for the first 10 subjects
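The rows-as-cases, columns-as-variables layout can be sketched in a few lines of Python. This is only an illustration: the four rows are copied from the table above, while the names `COLUMNS`, `parse_value` and `patients` are made up for the example, and treating NA as a missing value is an assumption about how the file marks missing data.

```python
# Column names follow the table header; the rows are copied from the data table.
COLUMNS = ["ID", "EJEC", "SYSVOL", "DIAVOL", "OCCLU", "STEN", "TIME",
           "OUTCOME", "AGE", "SMOKE", "BETA", "CHOL", "SURG"]

RAW_ROWS = """\
390 72 36 131 0 0 143 0 49 2 2 59 0
279 52 74 155 37 63 143 0 54 2 2 68 1
391 62 52 137 33 47 16 2 56 2 2 52 0
69 27 124 170 77 23 143 0 57 2 2 NA 2
"""


def parse_value(token):
    """Turn a table cell into an int, or None for the marker NA (assumed missing)."""
    return None if token == "NA" else int(token)


# Each row (one case) becomes a dict keyed by variable name.
patients = [dict(zip(COLUMNS, map(parse_value, line.split())))
            for line in RAW_ROWS.splitlines()]

row_1 = patients[0]                        # every variable for subject 390
sysvol = [p["SYSVOL"] for p in patients]   # one variable across all cases

print(row_1)
print(sysvol)
```

Reading across gives everything known about one patient; reading down gives one variable for every patient.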


Variables

ID: A patient identifier instead of a name.

EJEC: Ejection fraction, % of blood ejected from left ventricle in one beat.

SYSVOL: End-systolic volume, a measure of the size of the heart.

DIAVOL: End-diastolic volume.

OCCLU: Occlusion score (% of myocardium of the left ventricle supplied by arteries that are totally blocked).


Variables

STEN: Stenosis score (% supplied by arteries that are significantly narrowed but not completely blocked).

TIME: Time in months from when patient was admitted until OUTCOME.

OUTCOME: Coded variable

0 = alive at last follow up

1 = sudden cardiac death

2 = death within 30 days of heart attack

3 = death from heart failure

4 = death during / after coronary surgery

5 = non-cardiac death


Variables

SMOKE: Coded variable

1 = patient continued to smoke

2 = patient did not continue smoking

BETA: Coded variable

1 = patient took beta blockers

2 = patient did not take beta blockers

AGE: Patient’s age at admission (years)

CHOL: Blood cholesterol (mmoles/litre)

SURG: Coded variable

0 = no surgery

1 = surgery as part of a trial

2 = surgery for symptoms within 1 year

3 = surgery for symptoms within 1 to 5 years

4 = surgery for symptoms after 5 years
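Coded variables are stored as numbers but mean categories. A minimal sketch of turning the codes back into labels is shown below; the labels come from the variable definitions above, while `CODEBOOK` and `decode` are hypothetical names introduced only for this example.

```python
# Code-to-label lookup built from the variable definitions on the slides.
CODEBOOK = {
    "OUTCOME": {
        0: "alive at last follow up",
        1: "sudden cardiac death",
        2: "death within 30 days of heart attack",
        3: "death from heart failure",
        4: "death during / after coronary surgery",
        5: "non-cardiac death",
    },
    "SMOKE": {1: "continued to smoke", 2: "did not continue smoking"},
    "BETA": {1: "took beta blockers", 2: "did not take beta blockers"},
    "SURG": {
        0: "no surgery",
        1: "surgery as part of a trial",
        2: "surgery for symptoms within 1 year",
        3: "surgery for symptoms within 1 to 5 years",
        4: "surgery for symptoms after 5 years",
    },
}


def decode(record):
    """Return a copy of record with coded variables replaced by their labels.

    Variables without an entry in CODEBOOK (e.g. AGE) are passed through unchanged.
    """
    return {name: CODEBOOK.get(name, {}).get(value, value)
            for name, value in record.items()}


# Subject 279 from the data table (only some variables shown).
subject_279 = {"ID": 279, "OUTCOME": 0, "AGE": 54, "SMOKE": 2, "BETA": 2, "SURG": 1}
decoded = decode(subject_279)
print(decoded)
```

The numeric codes carry no arithmetic meaning, so it is usually safer to summarize the labels (counts, percentages) than to average the codes.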


In this presentation we will look at…

Tools to help us:

• explore: search for important features / messages

• communicate: report the important features / messages

Two types of tools:

• visual summaries: plots, graphs, charts, etc.

• numerical summaries: center, spread, percentages, frequencies, etc.


Types of Variables

A quantitative or numeric variable measures or counts something.

e.g. height of a student, number of sisters

A qualitative or categorical or nominal variable defines group membership.

e.g. gender, ethnicity


Quantitative/Numeric Variables

Continuous variables have no gaps between possible values (measurements).

e.g. weight, temperature

Discrete variables have gaps between possible values (counts).

e.g. number of brothers

Variables with few repeated values are treated as continuous.

Variables with many repeated values are treated as discrete.
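One rough way to check "few" versus "many" repeated values is to measure what share of the observations have a value that occurs more than once. The sketch below does that; the 0.5 cut-off and the function names are assumptions made for illustration, not a rule from the slides, and the `n_brothers` values are invented.

```python
from collections import Counter


def repeated_share(values):
    """Fraction of observations whose value occurs more than once in the sample."""
    counts = Counter(values)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(values)


def suggest_treatment(values, threshold=0.5):
    """Suggest 'discrete' when at least `threshold` of the values are repeats.

    The threshold is an arbitrary illustrative choice.
    """
    return "discrete" if repeated_share(values) >= threshold else "continuous"


sysvol = [36, 74, 52, 165, 47, 124, 86, 37, 65, 52]   # SYSVOL, first 10 subjects
n_brothers = [0, 1, 1, 2, 0, 3, 1, 0, 2, 1]           # invented counts

print(repeated_share(sysvol), suggest_treatment(sysvol))
print(repeated_share(n_brothers), suggest_treatment(n_brothers))
```

In practice the decision is a judgment call that also depends on what displays and summaries are planned.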


Qualitative (Categorical) Variables

A categorical or nominal variable is one that has no order.

e.g. ethnicity, gender

An ordinal variable is one where the categories can be ordered.

e.g. income group (low, middle, high); age group (young, old)

Likert scale, e.g. (1 = strongly disagree, …, 5 = strongly agree)


Types of Variables

Quantitative (measurements and counts)

• Continuous (few repeated values)

• Discrete (many repeated values)

Qualitative (define groups)

• Categorical/Nominal (no idea of order)

• Ordinal (fall in natural order)


Heart Attack Data in JMP


Variables

Classify each variable according to its type.

N = nominal, O = ordinal, C = continuous/discrete

ID: N

EJEC: C

SYSVOL: C

DIAVOL: C

OCCLU: C


Variables

Classify each variable by type (C or O or N)

STEN: C

TIME: C

OUTCOME: N


Variables

Classify each variable by type (C or O or N)

SMOKE: N

BETA: N

AGE: C

CHOL: C

SURG: N (or O?)


Data Types in JMP


Reporting Findings in Tables

1. Don’t try to do too much in the table. Model tables on those in published research.

2. Use white space effectively.

3. Make sure tables and text refer to each other; however, you do not need to restate everything in the table as text. If you interpret one or two key findings in a table, the reader should be able to handle the rest.

4. Use some aspect of the data to order and group the rows/columns in the table, e.g. size or chronology, or to show similarity or invite comparisons.


Reporting Findings in Tables

Example: Exercise 3, Grove

Comparisons between the Intervention group and the Control group are the focus here. The P column contains p-values from an appropriate test comparing the two groups on the given variables.


Reporting Findings in Tables

5. If appropriate, frame the table with summary statistics in rows and columns to provide a standard of comparison.

6. It is useful to round numbers in a table to one or two decimal places.


Example 2

The three tables below show six-monthly circulation figures for six weekly magazines in New Zealand.

Table 1: Circulation of Weekly Magazines

Period                 New Idea  Listener  Woman’s Day  Woman’s Weekly  Time    TV Guide
Jan 1 to Jun 30, 1999  67,070    90,521    165,914      126,640         38,136  241,356
Jul 1 to Dec 31, 1998  63,444    90,018    162,182      126,486         38,236  248,786
Jan 1 to Jun 30, 1998  59,039    92,786    175,002      129,920         38,635  258,806

We want to compare circulation figures between magazines.

It is easier to make circulation comparisons when the circulation data are in columns.


Example 1

Table 2: Circulation of Weekly Magazines

Magazine        Jan 1 to Jun 30, 1999  Jul 1 to Dec 31, 1998  Jan 1 to Jun 30, 1998
New Idea        67,070                 63,444                 59,039
Listener        90,521                 90,018                 92,786
Woman’s Day     165,914                162,182                175,002
Woman’s Weekly  126,640                126,486                129,920
Time            38,136                 38,236                 38,635
TV Guide        241,356                248,786                258,806

Numbers need to be rounded.

Magazines need to be ordered by circulation.


Example 1

Table 3: Circulation of Weekly Magazines (in thousands)

Magazine        Jan 1 to Jun 30, 1998  Jul 1 to Dec 31, 1998  Jan 1 to Jun 30, 1999  Average
TV Guide        259                    249                    241                    250
Woman’s Day     175                    162                    166                    168
Woman’s Weekly  130                    126                    127                    128
Listener        93                     90                     91                     91
New Idea        59                     63                     67                     63
Time            39                     38                     38                     38

Row averages allow comparisons between the most recent circulation data and the average for the magazine.


Example 1

Table 3: Circulation of Weekly Magazines (in thousands)

Magazine        Jan 1 to Jun 30, 1998  Jul 1 to Dec 31, 1998  Jan 1 to Jun 30, 1999  Average
TV Guide        259                    249                    241                    250
Woman’s Day     175                    162                    166                    168
Woman’s Weekly  130                    126                    127                    128
Listener        93                     90                     91                     91
New Idea        59                     63                     67                     63
Time            39                     38                     38                     38
Average         126                    122                    122

Column averages allow comparisons between the circulation data and the average for the 6 magazines for the time period.


Example 1

Table 3: Circulation of Weekly Magazines (in thousands)

Verbal Summary: During 1998 and the first half of 1999, the TV Guide had the highest circulation among weekly magazines in New Zealand.
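The steps that turn Table 1 into Table 3 (round to thousands, order by size, add row and column averages) can be sketched as follows. The circulation figures are taken from the tables above; the variable and function names are illustrative only, and averaging the unrounded figures before rounding is an assumption about how the slide's averages were obtained.

```python
PERIODS = ("Jan-Jun 1998", "Jul-Dec 1998", "Jan-Jun 1999")

# Six-monthly circulation per magazine, in the same order as PERIODS.
CIRCULATION = {
    "New Idea": (59_039, 63_444, 67_070),
    "Listener": (92_786, 90_018, 90_521),
    "Woman's Day": (175_002, 162_182, 165_914),
    "Woman's Weekly": (129_920, 126_486, 126_640),
    "Time": (38_635, 38_236, 38_136),
    "TV Guide": (258_806, 248_786, 241_356),
}


def in_thousands(x):
    """Round a raw figure to the nearest thousand."""
    return round(x / 1000)


# Row averages, computed on the unrounded figures.
averages = {m: sum(f) / len(f) for m, f in CIRCULATION.items()}

# Order magazines from largest to smallest average circulation.
ordered = sorted(CIRCULATION, key=averages.get, reverse=True)
rows = [(m, [in_thousands(x) for x in CIRCULATION[m]], in_thousands(averages[m]))
        for m in ordered]

# Column averages: mean over the six magazines for each period.
column_averages = [
    in_thousands(sum(f[i] for f in CIRCULATION.values()) / len(CIRCULATION))
    for i in range(len(PERIODS))
]

for magazine, figures, average in rows:
    print(f"{magazine:<15}" + "".join(f"{x:>6}" for x in figures) + f"{average:>6}")
print(f"{'Average':<15}" + "".join(f"{x:>6}" for x in column_averages))
```

Sorting on the unrounded averages avoids ties that rounding could introduce.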


Univariate Analyses

Variable type dictates how we display and summarize the distribution.

For nominal or ordinal data the notion of distribution is typically the percentage of observations falling into each of the categories or ordered levels.

For numeric data, distribution refers to the shape of the distribution, its central tendency or “average”, and its variability or spread.


Displays for Numeric Variables

• Stem-and-Leaf Plots (simple, but outdated)

• Histograms & Smooth Density Estimates

• Quantile and Outlier Boxplots


17.4 Australia

20.1 Austria

10.1 Czechoslovakia

13.0 Denmark

13.1 W. Germany

21.1 Greece

10.3 Israel

10.4 Japan

10.5 Norway

14.6 Poland

15.7 Switzerland

18.6 United States

19.9 Belgium

12.5 Bulgaria

11.6 Finland

20.0 France

5.4 Hong Kong

17.1 Hungary

26.8 Kuwait

11.3 Netherlands

25.6 Portugal

12.6 Singapore

12.1 N. Ireland

12.0 Scotland

15.8 Canada

12.0 E. Germany

15.3 Ireland

20.1 New Zealand

9.8 Sweden

10.1 England & Wales

Data for 1983, 1984 or 1985 depending on the country (prior to reunification of Germany)

Traffic Death-Rates (per 100,000 population) for 30 Countries


Traffic Death-Rates (per 100,000 population) for 30 Countries

Units: 17 | 4 = 17.4 deaths per 100,000

5 | 4
6 |
7 |
8 |
9 | 8
10 | 1 1 3 4 5
11 | 3 6
12 | 0 0 1 5 6
13 | 0 1
14 | 6
15 | 3 7 8
16 |
17 | 1 4
18 | 6
19 | 9
20 | 0 1 1
21 | 1
22 |
23 |
24 |
25 | 6
26 | 8

Collapse to 12 stems

Units: 1 | 7 = 17 deaths per 100,000

0 | 5
0 |
0 |
1 | 0 0 0 0 0 1 1
1 | 2 2 2 2 3 3 3
1 | 5 5
1 | 6 6 7 7
1 | 9
2 | 0 0 0 0 1
2 |
2 |
2 | 6 7


Stem-and-Leaf plot from JMP

Traffic Death-Rates (per 100,000) for 30 Countries

Stem  Leaf      Count
2     67        2
2
2
2     00001     5
1     9         1
1     6677      4
1     55        2
1     22223333  8
1     0000011   7
0
0
0     5         1

0|5 represents 5
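A stem-and-leaf plot is simple enough to build by hand or in a few lines of code. The sketch below reproduces the 22-stem version (units 17 | 4 = 17.4) from the 30 traffic death rates listed earlier; the function name is made up, and the output is plain text rather than JMP's layout.

```python
from collections import defaultdict

# Traffic death rates (per 100,000 population) for the 30 countries.
RATES = [17.4, 20.1, 10.1, 13.0, 13.1, 21.1, 10.3, 10.4, 10.5, 14.6,
         15.7, 18.6, 19.9, 12.5, 11.6, 20.0, 5.4, 17.1, 26.8, 11.3,
         25.6, 12.6, 12.1, 12.0, 15.8, 12.0, 15.3, 20.1, 9.8, 10.1]


def stem_and_leaf(values):
    """Return text lines such as '17 | 1 4' (stem = whole part, leaf = first decimal)."""
    leaves = defaultdict(list)
    for v in values:
        tenths = round(v * 10)            # 17.4 -> 174, sidesteps float noise
        stem, leaf = divmod(tenths, 10)
        leaves[stem].append(leaf)
    first, last = min(leaves), max(leaves)
    lines = []
    for stem in range(first, last + 1):   # keep empty stems so gaps stay visible
        row = " ".join(str(leaf) for leaf in sorted(leaves.get(stem, [])))
        lines.append(f"{stem:>2} | {row}".rstrip())
    return lines


lines = stem_and_leaf(RATES)
print("\n".join(lines))
```

Keeping the empty stems matters: they are what makes gaps and isolated values visible in the display.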


Histograms

Divide range of data into equal width class intervals and use the number or percentage of observations in each class interval to determine the height of a bar centered over each interval.

Traffic Death-Rates (per 100,000)

Class Interval  %
5 – 10          6.7
10 – 15         50.0
15 – 20         23.3
20 – 25         13.3
25 – 30         6.7
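The class-interval percentages above can be recomputed from the raw rates. The sketch assumes each interval includes its lower end and excludes its upper end (so 20.0 falls in 20 to 25); the slides do not state the boundary convention, and the function name is illustrative.

```python
# Traffic death rates (per 100,000 population) for the 30 countries.
RATES = [17.4, 20.1, 10.1, 13.0, 13.1, 21.1, 10.3, 10.4, 10.5, 14.6,
         15.7, 18.6, 19.9, 12.5, 11.6, 20.0, 5.4, 17.1, 26.8, 11.3,
         25.6, 12.6, 12.1, 12.0, 15.8, 12.0, 15.3, 20.1, 9.8, 10.1]


def class_percentages(values, start=5, width=5, n_classes=5):
    """Percentage of values in each equal-width class [low, high)."""
    counts = [0] * n_classes
    for v in values:
        counts[int((v - start) // width)] += 1
    n = len(values)
    return [(start + i * width, start + (i + 1) * width, round(100 * c / n, 1))
            for i, c in enumerate(counts)]


table = class_percentages(RATES)
for low, high, pct in table:
    # One '#' per 2 percentage points gives a rough sideways histogram.
    print(f"{low:>2} - {high:<2} {pct:5.1f}%  " + "#" * round(pct / 2))
```

The bar heights are just these percentages, so changing `width` or `start` changes the picture; it is worth trying more than one choice.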


Histograms

An appropriate histogram should have 5 to 15 intervals.

Histograms are used when the sample size is moderate to large. Use n of about 50 or more as a guide.

Strengths of histograms:

• Show the shape of the distribution.

• Show gaps, outliers, clusters, groupings.


Histograms – Example 2

Birth weights (g) of infants born to smoking and nonsmoking mothers

Weights for Infants Born to Smokers

2557 2594 2600 2663 2665 2769 2769 2782 2821 2906 2920 2948 2948 2977 2977 2992
3005 3033 3042 3076 3076 3090 3132 3147 3203 3260 3303 3317 3321 3331 3374 3430
3444 3629 3637 3643 3651 3651 3756 3856 3884 3940 4238
709 1135 1790 1818 1885 1928 1928 1936 2084 2084 2125 2126 2187 2211 2225 2296
2296 2353 2367 2381 2381 2410 2410 2414 2424 2466 2466 2466 2495 2495

Weights for Infants Born to Nonsmokers

2523 2551 2622 2637 2637 2722 2733 2750 2750 2778 2807 2835 2835 2836 2863 2877
2877 2920 2920 2920 2977 2977 3062 3062 3062 3080 3090 3090 3100 3104 3175 3175
3203 3203 3225 3225 3232 3232 3234 3274 3274 3317 3317 3374 3402 3416 3459 3460
3473 3475 3487 3544 3600 3614 3614 3629 3651 3651 3699 3728 3770 3770 3770 3790
3799 3827 3860 3860 3884 3912 3941 3941 3969 3983 3997 3997 4054 4054 4111 4153
4167 4174 4593 4990
1021 1330 1474 1588 1588 1701 1729 1893 1899 1928 1970 2055 2055 2082 2100 2187
2240 2240 2282 2301 2325 2353 2381 2395 2438 2442 2450 2495 2495


Histograms – Example 2

We would like to compare the birth weights of infants born to mothers who smoked during pregnancy with those of infants whose mothers did not.

What distributional differences, if any, do you see?


Histograms – Example 3

No outliers or gaps. Two broad groupings (one group of days with little or no sun and another group of days with between 4 and 13 hours of sun).

[Histogram: Auckland sunshine hours, January to April, 2000. Horizontal axis: Daily sunshine (hours), 0 to 15. Vertical axis: Frequency, 0 to 15.]


Distributional Properties - Shape

(a) Unimodal

(b) Bimodal

(c) Trimodal

(d) Symmetric

(e) Positively or right skewed (long upper tail)

(f) Negatively or left skewed (long lower tail)

(g) Symmetric

(h) Bimodal with gap

(i) Exponential shape


Distribution Properties - Outliers

• Outliers: mistakes or something interesting/unusual.

(k) Outliers


Distributional Properties - Modality

• Existence of more than one peak: modality (unimodal, bimodal, etc.).

(b) Bimodal

(c) Trimodal


Distributional Properties - Skewness

• Shape of the distribution: symmetry, skewness.

(d) Symmetric

(e) Positively or right skewed (long upper tail)

(f) Negatively or left skewed (long lower tail)

Normal distribution


Distributional Properties – Central Tendency and Variability/Spread

• Central values and spread: What is the central value? How spread out are the values about the center?

A majority of infants have birth weights within 500g of what is typical.

Typical birth weight of infants born to nonsmokers is approx. 3000g.


Interpreting Stem-and-Leaf Plots and Histograms

• Be suspicious of abrupt changes

(j) Spike in pattern

Spike


Histograms – Example 4

# of Cigarettes Smoked Per Day by WSU smokers

How would you characterize this distribution?


Interpreting Stem-and-Leaf Plots and Histograms

• Be suspicious of abrupt changes

(l) Truncation plus outlier


Features to look for in histograms and stem-and-leaf plots

• Outliers

• Existence of more than one peak

• Shape of the distribution

• Central values and spread

• Be suspicious of abrupt changes

Example shapes: (a) Unimodal, (b) Bimodal, (c) Trimodal, (d) Symmetric, (e) Positively or right skewed (long upper tail), (f) Negatively or left skewed (long lower tail), (g) Symmetric, (h) Bimodal with gap, (i) Exponential shape, Normal distribution


Quantile and Outlier Boxplot


Quantile and Outlier Boxplots

Birth weights of babies born to smoking mothers

[Boxplot annotations: Q1, Med (median), Q3; x = sample mean; Outlier]

Width of box represents the IQR, the interquartile range, which is the range of the middle 50% of the data.
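The numbers behind such a boxplot can be sketched with Python's `statistics` module, using the smokers' birth weights as transcribed earlier. Two assumptions are worth flagging: quartile conventions differ between packages, so these quartiles need not match JMP's exactly, and the 1.5 × IQR rule for flagging outliers is a common convention that the slides do not state.

```python
import statistics

# Birth weights (g) of infants born to smokers, as transcribed above.
SMOKER_WEIGHTS = [
    2557, 2594, 2600, 2663, 2665, 2769, 2769, 2782, 2821, 2906, 2920, 2948,
    2948, 2977, 2977, 2992, 3005, 3033, 3042, 3076, 3076, 3090, 3132, 3147,
    3203, 3260, 3303, 3317, 3321, 3331, 3374, 3430, 3444, 3629, 3637, 3643,
    3651, 3651, 3756, 3856, 3884, 3940, 4238, 709, 1135, 1790, 1818, 1885,
    1928, 1928, 1936, 2084, 2084, 2125, 2126, 2187, 2211, 2225, 2296, 2296,
    2353, 2367, 2381, 2381, 2410, 2410, 2414, 2424, 2466, 2466, 2466, 2495,
    2495,
]


def boxplot_summary(values, whisker=1.5):
    """Quartiles, IQR, mean, and points beyond whisker * IQR from the box."""
    q1, median, q3 = statistics.quantiles(values, n=4)   # default 'exclusive' method
    iqr = q3 - q1
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
            "mean": statistics.mean(values), "outliers": outliers}


summary = boxplot_summary(SMOKER_WEIGHTS)
for name, value in summary.items():
    print(name, value)
```

With these values the only point flagged is the 709 g infant, consistent with the single outlier marked on the plot.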


Quantile and Outlier Boxplots

Boxplots are useful for comparing a numeric response variable across populations.


Quantile and Outlier Boxplots

• Individual box plots can show outliers and skewness.

[Figure: right skewed data and the boxplot it gives]


Quantile and Outlier Boxplots

• A wide box plot with short whiskers could be coming from a bimodal distribution or a very short tailed distribution.

[Figure: a bimodal distribution and a short-tailed distribution both give a boxplot with short whiskers]


Quantile and Outlier Boxplots

All three populations in this study have right skewed mean NFCS scores, with extreme outliers. The Baseline group seems to have the lowest scores.


Which plot do I use?

Choose plots that best display the features you see in the data.

Generally look at several to see most important features of your data.

Simple Plots for Continuous Variables


Types of Variables

Quantitative (measurements and counts)

• Continuous (few repeated values): stem-and-leaf plots, histograms, boxplots

• Discrete (many repeated values)

Qualitative (define groups)

• Categorical (no idea of order)

• Ordinal (fall in natural order)


Types of Variables

Quantitative (measurements and counts)

• Continuous (few repeated values)

• Discrete or Ordinal (many repeated values)

Qualitative (define groups)

• Categorical (no idea of order)

• Ordinal (fall in natural order)


Repeated and Grouped Data

Repeated Data (i.e. Discrete Variables)

e.g. Years of Education

# of Children

Display Tools: Frequency table, bar graph


Frequency Table

Grove, Exercise 6:

The results of Katsma and Souza’s (2000) study are presented in tables on pg. 36. They contain both the nurses’ opinions regarding a patient’s self-reported pain assessment on a 10-pt. ordinal scale and what the nurses actually reported in the patient’s chart. There were two classifications of patients: smiling and grimacing.


Frequency Table

Frequency Table for Nurse’s Opinion of Patient’s Self-Reported Pain Score (smiling patients)

Pain Assessment Scale (xi)  Frequency (fi)  Percentage (fi/n) x 100  Cumulative %
0                           7               8.1                      8.1
1                           7               8.1                      16.2
2                           5               5.8                      22.0
3                           8               9.4                      31.4
4                           10              11.6                     43.0
5                           11              12.8                     55.8
6                           5               5.8                      61.6
7                           2               2.3                      63.9
8                           31              36.1                     100.0
9                           0               0.0                      100.0
10                          0               0.0                      100.0
                            n = 86          100.0


Frequency Table

Has the columns:

• value xi: each distinct value in the sample

• frequency fi: how often each value occurs

• percentage (fi /n) x 100: percentage of sample with that value

• cumulative percentage: percentage of sample with value xi or less
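The four columns can be built directly from the frequencies. The sketch below uses the pain-score frequencies for the smiling patients; `frequency_table` is a made-up helper name. Because it works from the exact counts, the last digit of a few percentages may differ slightly from the rounded figures printed on the slide.

```python
# Pain score (x_i) -> number of nurses giving that score (f_i), smiling patients.
PAIN_FREQUENCIES = {0: 7, 1: 7, 2: 5, 3: 8, 4: 10, 5: 11,
                    6: 5, 7: 2, 8: 31, 9: 0, 10: 0}


def frequency_table(freqs):
    """Return n and rows of (value, frequency, percentage, cumulative percentage)."""
    n = sum(freqs.values())
    rows, running = [], 0
    for value in sorted(freqs):
        f = freqs[value]
        running += f                          # count of observations <= value
        rows.append((value, f,
                     round(100 * f / n, 1),
                     round(100 * running / n, 1)))
    return n, rows


n, rows = frequency_table(PAIN_FREQUENCIES)
print(f"n = {n}")
for value, f, pct, cum in rows:
    print(f"{value:>2} {f:>3} {pct:>5.1f} {cum:>6.1f}")
```

Accumulating the counts and dividing once, rather than adding up rounded percentages, keeps rounding error out of the cumulative column.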


31 of the 86 nurses had the same opinion as the patient.

31 / 86 × 100% ≈ 36% of the nurses agreed with the patient’s score.

100.0% of the nurses felt the pain score was at or below the patient’s score.


Bar Graph

Similar to histogram (for continuous data), except bars / rectangles are not necessarily joined up.


Data Entered into JMP (with frequencies)


Frequency Tables & Bar Graphs

Computing

Frequency tables are produced from raw data in JMP under Analyze > Distribution.

Be sure to tell JMP that the frequencies have been entered and should be interpreted as such.


Bar Graph and Frequency Table in JMP

Page 66: Organizing and Displaying Data. Data Files Data is almost always stored in a format where: ROWS are cases or individuals and COLUMNS are variables.

Types of Variables

Quantitative (measurements and counts)
    Continuous (few repeated values)
    Discrete or Ordinal (many repeated values)

Qualitative (define groups)
    Categorical (no idea of order)
    Ordinal (fall in natural order)

Page 67

Qualitative/Categorical/Nominal Variables

Display Tools: Frequency table, bar graph

Frequency Table

Used in exactly the same way as for discrete variables.

Page 68

Qualitative Variables: Bar Graph

Order categories by size (i.e. by frequency, unless there is some very compelling reason for some other ordering).

Figure: Bar Graph for the variable SURG. Vertical axis: Percentage (0% to 60%). Horizontal axis: SURG (0 to 4).

SURG:
0: No surgery
1: Surgery as part of trial
2: Surgery for symptoms in 1 year
3: Surgery for symptoms within 1 to 5 years
4: Surgery for symptoms after 5 years

Page 69

Categorical/Nominal Variables: Frequency Table

Frequency Table for the variable SURG

Category                          SURG   Frequency   Percentage   Cumulative percentage
No surgery performed              0      409         66.4          66.4
Surg. as part of trial            1       89         14.4          80.8
Surg. for sympt. within 1 year    2       72         11.7          92.5
Surg. for sympt. 1 to 5 years     3       29          4.7          97.2
Surg. for sympt. > 5 years        4       17          2.8         100.0
Total                                    616        100.0
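As a rough illustration of ordering categories by frequency before drawing a bar graph, here is a small Python sketch using the SURG counts from the table above. It prints a text bar graph rather than reproducing the JMP output; the helper name `ordered_by_frequency` and the scale of one `#` per 2% are arbitrary choices made for this example.

```python
# Counts taken from the SURG frequency table above.
surg_counts = {
    "Surg. for sympt. 1 to 5 years": 29,
    "No surgery performed": 409,
    "Surg. for sympt. > 5 years": 17,
    "Surg. as part of trial": 89,
    "Surg. for sympt. within 1 year": 72,
}


def ordered_by_frequency(counts):
    """Return (label, count) pairs sorted from most to least frequent."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


total = sum(surg_counts.values())
ordered = ordered_by_frequency(surg_counts)

for label, count in ordered:
    pct = 100 * count / total
    bar = "#" * round(pct / 2)  # one '#' per 2 percentage points
    print(f"{label:<32} {bar} {pct:.1f}%")
```

For SURG the frequency ordering happens to coincide with the coded order 0 to 4, so both orderings give the same graph.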

Page 70

Heart Attack Data in JMP

Page 71

Bar Graph for Surgery Variable in JMP

Computing

Frequency tables are produced from raw data in JMP under Analyze > Distribution.

Notice that there is no frequency column in this data table. That is because the data were entered so that each row represents one subject in the study.
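The two data layouts (one row per subject versus one row per value with a frequency column) lead to the same frequency table, provided the frequency column is used as a weight. The Python sketch below illustrates this outside JMP; the values are made up for illustration and are not the study data.

```python
from collections import Counter

# Layout 1: one row per subject (illustrative values, not the study data).
raw_rows = [0, 1, 0, 0, 1, 2, 0, 0, 0, 4]

# Layout 2: one row per distinct value, plus a frequency column.
freq_rows = [(0, 6), (1, 2), (2, 1), (4, 1)]

# Counting the raw rows directly gives the frequency of each value.
counts_from_raw = Counter(raw_rows)

# With a frequency column, each row must be weighted by its frequency.
counts_from_freq = Counter()
for value, freq in freq_rows:
    counts_from_freq[value] += freq

# Ignoring the frequency column treats every row as a single case.
counts_ignoring_freq = Counter(value for value, _ in freq_rows)

print(counts_from_raw == counts_from_freq)      # same table from both layouts
print(counts_from_raw == counts_ignoring_freq)  # wrong if weights are ignored
```

The last comparison shows what goes wrong when software is not told that a column holds frequencies: every value appears to occur exactly once.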

Page 72

The Big Mac Index

In 1986 The Economist started to compare prices of Big Macs between countries (converted to US dollars).

This provides a measure of whether a country's currency is undervalued or overvalued compared to the United States dollar.
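One common way to turn the comparison into a number is to express each country's Big Mac price (already converted to US dollars) relative to the US price. The Python sketch below assumes that definition; the function name `valuation_percent`, the country labels and all prices are hypothetical and are not taken from the slides or from The Economist.

```python
def valuation_percent(price_usd, us_price_usd):
    """Percent by which a price in $US is above (+) or below (-) the US price."""
    return 100 * (price_usd / us_price_usd - 1)


# Hypothetical prices in US dollars, for illustration only.
us_price = 2.50
prices = {"Country A": 3.00, "Country B": 2.00}

for country, price in prices.items():
    pct = valuation_percent(price, us_price)
    status = "overvalued" if pct > 0 else "undervalued"
    print(f"{country}: {pct:+.1f}% ({status} relative to the US dollar)")
```

Under this reading, a Big Mac that costs more in dollars than it does in the United States suggests an overvalued currency, and one that costs less suggests an undervalued currency.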

Page 73

The Big Mac Index

Figure: bar graph of the price of Big Macs ($US) by country. Vertical axis: Price ($US), 0 to 4. Horizontal axis: Country (Israel, Japan, France, Taiwan, Singapore, New Zealand, Hong Kong), with the USA labeled.

Page 74

More General Use of Bar Graphs

• Excellent for relating labels to relative importance or relative size.

Figure: bar graph of the price of Big Macs ($US) by country. Vertical axis: Price ($US), 0 to 4. Horizontal axis: Country.

Page 75

More General Use of Bar Graphs

• Can be used to display a quantitative variable other than frequency (e.g. time, amount of money).

Figure: bar graph of the price of Big Macs ($US) by country. Vertical axis: Price ($US), 0 to 4. Horizontal axis: Country.

Page 76

More General Use of Bar Graphs

• Where possible, order items by size.

Figure: bar graph of the price of Big Macs ($US) by country. Vertical axis: Price ($US), 0 to 4. Horizontal axis: Country.

Page 77

Other Forms of Graphs

• Pie chart (for displaying the “measurement” of each object as a proportion of the total).

• Segmented bar graph (same purpose as the pie chart).

Percentages of the World's Gold Production

Country      1983   1985   1987   1989   1991
S. Africa    48.6   43.8   36.2   30.8   28.7
U.S.          4.4    5.0    9.3   13.4   13.9
USSR         19.1   17.7   16.7   14.4   11.5
Australia     2.2    3.8    6.7   10.3   11.2
Canada        5.3    5.7    7.0    8.0    8.3
China         4.1    4.0    4.3    4.0    5.7
Rest         16.3   20.2   19.7   19.0   20.8
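Both a pie chart and a segmented bar graph show each category as a share of the total, so the same arithmetic underlies both. The Python sketch below uses the 1991 column of the table above; the function name `segments` is an illustrative choice, and the shares are rescaled by their sum because the rounded column does not add up to exactly 100.

```python
# 1991 column of the gold production table above (percentages).
gold_1991 = {
    "S. Africa": 28.7,
    "U.S.": 13.9,
    "USSR": 11.5,
    "Australia": 11.2,
    "Canada": 8.3,
    "China": 5.7,
    "Rest": 20.8,
}


def segments(shares):
    """Return (label, start %, end %, pie angle in degrees) for each category."""
    total = sum(shares.values())  # rescale so the segments fill exactly 100%
    rows = []
    start = 0.0
    for label, share in shares.items():
        width = 100 * share / total
        rows.append((label, start, start + width, 360 * share / total))
        start += width
    return rows


for label, start, end, angle in segments(gold_1991):
    print(f"{label:<10} bar segment {start:5.1f}% to {end:5.1f}%   pie angle {angle:5.1f} deg")
```

The start and end values are the boundaries of each piece of a segmented bar, and the angle is the size of the matching pie slice.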

Page 78

Other Forms of Graphs

• Pie chart (for displaying the “measurement” of each object as a proportion of the total).

• Segmented bar graph (same purpose as the pie chart).

Figure: the gold production percentages shown in three ways: (a) Bar graph (vertical axis: Percentage, 0% to 30%), (b) Pie chart, (c) Segmented bar (0% to 100%). Categories: S. Africa, U.S., USSR, Austr., Can., China, Rest. Pie chart labels: S. Africa 29%, U.S. 14%, USSR 11%, Austr. 11%, Can. 8%, China 6%, Rest 21%.

Page 79

Choosing between Types of Graphs

• Bar graphs better at presenting relative sizes.

• Pie charts do not communicate information as well.

• Perspective pie charts are disastrous!

• Avoid using perspective bar graphs.

Figure: six groups (A to F) with shares of 22%, 13%, 23%, 7%, 25% and 10%, drawn as two pie charts and as a bar graph (vertical axis: Percentage, 0% to 25%; horizontal axis: Group).

Page 80

Some Principles of Graphical Excellence

• A well-designed presentation of interesting data. A matter of substance, of statistics, and of design.

• Communicates complex ideas with clarity, precision and efficiency.

• Gives the viewer the greatest number of ideas in the shortest possible time.

• Tells the truth about the data.

The Visual Display of Quantitative Information, E. R. Tufte

Page 81

Graphical Displays for Data on a Single Variable

Quantitative/numeric - continuous: Histogram, box plot, stem-and-leaf plot

Discrete or Ordinal: Frequency table, bar graph

Qualitative, Categorical or Nominal: Frequency table, bar graph, pie chart, or mosaic plot.