Chapter 3 Summarizing Data. Graphical Methods - 1 Variable After data collected, sorted into...

Chapter 3

Summarizing Data

Graphical Methods - 1 Variable

• After data are collected, they are sorted into categories/ranges of values so that each individual observation falls in exactly one category/range

– Numeric Responses: Break “range” of values into non-overlapping bins and count number of units in each bin

– Categorical Responses: List all possible categories (with “Other” if needed), and count numbers of units in each

• Pie Chart: Displays percent in each category/range

• Bar Chart: Displays frequency/percent per category

• Histogram: Displays frequency/percent per “range”

Constructing Pie Charts

• Select a small number of categories (say 5 or 6 at most) to avoid many narrow “slivers”

• If possible, arrange categories in ascending or descending order for categorical variables

Philly Monthly Rainfall 1825-1869 (1/100 inches)

[Figure: pie chart of Monthly Philly Rainfall 1825-1869 (1/100 in), one slice per category 1-11 in the table below]

Category  Range     Count
1         <100         17
2         100-199      78
3         200-299     132
4         300-399     115
5         400-499      86
6         500-599      55
7         600-699      27
8         700-799      17
9         800-899       6
10        900-999       3
11        >1000         4

Constructing Bar Charts

• Put frequencies on one axis (typically vertical, unless many categories) and categories on other

• Draw rectangles over categories with height=frequency

• Leave spaces between categories

Constructing Histograms

• Used for numeric variables, so need Class Intervals

– Let Range = Largest - Smallest Measurement

– Break range into (say) 5-20 intervals depending on sample size

– Make the width of the subintervals a convenient unit, and make “break points” so that no observations fall on them

– Obtain Class Frequencies, the number in each subinterval

– Obtain Relative Frequencies, proportion in each subinterval

• Construct Histogram

– Draw bars over each subinterval with height representing class frequency or relative frequency (shape will be the same)

– Leave no space between bars to imply adjacency of class intervals
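The step from class frequencies to relative frequencies can be sketched in a few lines. This is a minimal illustration using the class counts from the Philadelphia rainfall table earlier in the chapter; the function name `relative_frequencies` and the dictionary layout are assumptions made for this example.

```python
# Class frequencies copied from the rainfall table (units: 1/100 inch).
class_counts = {
    "<100": 17, "100-199": 78, "200-299": 132, "300-399": 115,
    "400-499": 86, "500-599": 55, "600-699": 27, "700-799": 17,
    "800-899": 6, "900-999": 3, ">1000": 4,
}


def relative_frequencies(counts):
    """Return the proportion of observations in each class interval."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


rel_freq = relative_frequencies(class_counts)
total_months = sum(class_counts.values())

# Print class label, class frequency, and relative frequency.
for label, count in class_counts.items():
    print(f"{label:>8}  {count:4d}  {rel_freq[label]:.3f}")
```

Because every relative frequency is the class frequency divided by the same total, a histogram drawn from either column has the same shape.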

Histogram

[Figure: histogram of rain100 (monthly rainfall, 1/100 inches); vertical axis: Frequency, 0 to 140; horizontal axis: bins labeled 100, 200, ..., 1200, and More]

Interpreting Histograms

• Probability: Heights of bars over the class intervals are proportional to the “chances” an individual chosen at random would fall in the interval

• Unimodal: A histogram with a single major peak

• Bimodal: Histogram with two distinct peaks (often evidence of two distinct groups of units)

• Uniform: Interval heights are approximately equal

• Symmetric: Right and Left portions are same shape

• Right-Skewed: Right-hand side extends further

• Left-Skewed: Left-hand side extends further

Stem-and-Leaf Plots

• Simple, crude approach to obtaining shape of distribution without losing individual measurements to class intervals. Procedure:

– Split each measurement into 2 sets of digits (stem and leaf)

– List stems from smallest to largest

– List the corresponding leaves beside each stem, ordered from smallest to largest

– If too cramped/narrow, break stems into two groups: low with leaves 0-4 and high with leaves 5-9

– When numbers have many digits, trim off right-most (less significant) digits. Leaves should always be a single digit.
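The procedure above can be sketched as follows. This is a minimal illustration, assuming non-negative whole-number measurements with the last digit used as the leaf; the function name `stem_and_leaf` is hypothetical, and the data are the 13 excess-mortality percentages from the France heat wave example in this chapter.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted one-digit leaves."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(int(v), 10)  # last digit is the leaf
        plot[stem].append(leaf)
    # Keep empty stems so gaps in the distribution remain visible.
    return {s: plot.get(s, []) for s in range(min(plot), max(plot) + 1)}


# Excess deaths (%) for the 13 French cities.
pct_change = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]
plot = stem_and_leaf(pct_change)

for stem, leaves in plot.items():
    print(f"{stem:>2} | {''.join(str(d) for d in leaves)}")
```

Listing the empty stems between 9 and 14 makes the isolated largest value stand out, which a plot with only the occupied stems would hide.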

Time Series Plots

• Many datasets represent a single variable measured on a single unit at different time points

• When measurements are made at equally spaced time points, goal is often to describe temporal variation

• Annual measurements can reveal long-term trends

• Sub-annual (weekly, monthly, quarterly) measurements can reveal long-term trends as well as seasonal fluctuations

• Plots generally have measurement on vertical axis and time period on horizontal.

• Some plots include bars around points to represent fluctuations within that time period

Philly Rainfall 1/1825-12/1869

[Figure: time series plot; horizontal axis: Month; vertical axis: Rainfall (1/100th inches), 0 to 2000]

Numerical Descriptive Measures

• Numeric summaries of a set of measurements

• Measures of Central Tendency describe the “location” or center of a set of measurements

• Measures of Variability describe the “spread” or dispersion of a set of measurements

• Parameters: Numeric descriptive measures based on Populations of measurements

• Statistics: Numeric descriptive measures based on Samples of measurements

Measures of Central Tendency - I

• Mode: Most often occurring outcome (typically only of interest for variables taking on only “discrete” values)

• Median: Middle value when measurements ordered from smallest to largest

• Mean: Sum of all measurements, divided by total number of measurements (equal distribution of total)

Population (N elements): $\mu = \frac{\sum_{i=1}^{N} y_i}{N}$

Sample (n elements): $\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$

In practice, we only observe the sample, and use $\bar{y}$ to estimate $\mu$

Example - Philadelphia Rainfall

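N = 540 Months (Treating as Population)

Mean: $\mu = \frac{\sum_{i=1}^{540} y_i}{540} = \frac{198547}{540} = 367.68$

Ordered Amounts: $y_{(270)} = 339$, $y_{(271)} = 341$

Median: $M = \frac{y_{(270)} + y_{(271)}}{2} = \frac{339 + 341}{2} = 340$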

Note: The mean is higher than median as a few very large amounts were observed.

Measures of Central Tendency - II

• Outlier: Individual measurement(s) falling far away from others. Can have large effect on mean, not median

• Trimmed Mean (TM): Mean that is based on center measurements (deleting extreme measurements).

• Mode: For continuous (smooth) distributions, mode is value corresponding to the peak of the frequency curve

• Skewness: Shape of the distribution:

– Mound-Shaped Distributions: Mode ≈ Median ≈ Mean ≈ TM

– Right-Skewed Distributions: Mode < Median < TM < Mean

– Left-Skewed Distributions: Mean < TM < Median < Mode
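The ordering for a right-skewed set can be checked numerically. The sketch below uses the 13 excess-mortality percentages from the France heat wave example in this chapter, which include one very large value. The helper `trimmed_mean` and its parameter `k` (how many measurements to delete from each end) are assumptions for this illustration, since the slides do not fix how much to trim.

```python
from statistics import mean, median


def trimmed_mean(values, k):
    """Mean after deleting the k smallest and k largest measurements."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k trims away every measurement")
    return mean(ordered[k:len(ordered) - k])


# Excess deaths (%) for the 13 French cities; 142 is the extreme value.
pct_change = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]

avg = mean(pct_change)             # sum is 752, so 752 / 13
med = median(pct_change)           # 7th of 13 ordered values
tm = trimmed_mean(pct_change, 1)   # drops 4 and 142

print(f"median = {med}, trimmed mean = {tm:.2f}, mean = {avg:.2f}")
```

Here Median < TM < Mean, matching the right-skewed pattern: deleting the extremes pulls the trimmed mean back toward the median.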

Measures of Variability - I

• Variability: Magnitude of dispersion in data.

• Range: Difference between largest and smallest measurements in a set.

• pth-Percentile: Value that has at most p% of measurements below it, and at most (100-p)% above it (0<p<100)

– Lower Quartile = 25th Percentile (Q1)

– Median = 50th Percentile (Q2)

– Upper Quartile = 75th Percentile (Q3)

• Interquartile Range: Difference between the upper and lower quartiles (measures the amount of spread in the middle 50% of ordered measurements). IQR = Q3-Q1
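A short sketch of the range, quartiles, and IQR follows, using the 13 excess-mortality percentages from the France heat wave example in this chapter. It relies on `statistics.quantiles` from the Python standard library with `method="inclusive"`. Software packages use several slightly different percentile conventions, so quartiles from another tool may differ a little from these.

```python
from statistics import quantiles

# Excess deaths (%) for the 13 French cities.
pct_change = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]

# n=4 cuts the ordered data into quarters and returns Q1, Q2 (median), Q3.
q1, q2, q3 = quantiles(pct_change, n=4, method="inclusive")

iqr = q3 - q1                                   # spread of the middle 50%
data_range = max(pct_change) - min(pct_change)  # largest minus smallest

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}")
print(f"IQR = {iqr}, range = {data_range}")
```

The range is driven entirely by the two most extreme measurements, while the IQR ignores them, which is why the two can differ so much for skewed data.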

Measures of Variability - II

• Deviation: Distance between an individual measurement and the group mean: $y_i - \bar{y}$

• Variance: “Average” squared deviation

• Standard Deviation: Square root of variance (data’s units)

Population (N elements): Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}$   Std. Dev.: $\sigma = \sqrt{\sigma^2}$

Sample (n elements): Variance: $s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$   Std. Dev.: $s = \sqrt{s^2}$
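The only difference between the population and sample formulas is the divisor (N versus n-1). The sketch below makes that concrete; the function names are assumptions for this illustration, and the data are the seven mean math scores from the LSD example in this chapter, used here only as a convenient small set.

```python
import math


def population_variance(values):
    """Average squared deviation from the mean, dividing by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((y - mu) ** 2 for y in values) / n


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements")
    y_bar = sum(values) / n
    return sum((y - y_bar) ** 2 for y in values) / (n - 1)


scores = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]

sigma2 = population_variance(scores)
s2 = sample_variance(scores)

print(f"population: variance = {sigma2:.2f}, SD = {math.sqrt(sigma2):.2f}")
print(f"sample:     variance = {s2:.2f}, SD = {math.sqrt(s2):.2f}")
```

For the same data the sample variance is always the larger of the two, by the factor n/(n-1), and the gap shrinks as n grows.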

Empirical rule (measurements with mound-shaped histogram)

Approximately 68% of measurements lie within 1 SD of mean

Approximately 95% of measurements lie within 2 SD of mean

Virtually all of measurements lie within 3 SD of mean

Example - Philadelphia Rainfall (Population)

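25th Percentile: 232.75

75th Percentile: 468

Inter-Quartile Range: IQR = 468 - 232.75 = 235.25

$\sum_{i=1}^{540} (y_i - \mu)^2 = 19822752$

$\sigma^2 = \frac{19822752}{540} = 36708.8$

$\sigma = \sqrt{36708.8} = 191.6$

$\mu \pm \sigma = 367.7 \pm 191.6 = (176.1, 559.3)$

$\mu \pm 2\sigma = 367.7 \pm 383.2 = (0^*, 750.9)$

* The computed lower limit, 367.7 - 383.2 = -15.5, is shown as 0.

Note: 383 (71%) Months lie within 1$\sigma$ of $\mu$ and 518 (96%) within 2$\sigma$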
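The arithmetic in this example can be reproduced from the three totals given in the slides: N = 540, the sum of the amounts (198547), and the sum of squared deviations (19822752). The sketch below is only a check of those figures; the variable and function names are assumptions for this illustration.

```python
import math

N = 540                   # months, treated as the population
sum_y = 198547            # total rainfall over all months (1/100 inch)
sum_sq_dev = 19822752     # sum of squared deviations from the mean

mu = sum_y / N
sigma2 = sum_sq_dev / N   # population variance divides by N
sigma = math.sqrt(sigma2)


def interval(center, half_width):
    """Return (center - half_width, center + half_width)."""
    return (center - half_width, center + half_width)


one_sd = interval(mu, sigma)
two_sd = interval(mu, 2 * sigma)

# Observed shares from the slide, for comparison with 68% and 95%.
share_1sd = 383 / N
share_2sd = 518 / N

print(f"mu = {mu:.1f}, sigma^2 = {sigma2:.1f}, sigma = {sigma:.1f}")
print(f"within 1 SD: ({one_sd[0]:.1f}, {one_sd[1]:.1f}), share = {share_1sd:.0%}")
print(f"within 2 SD: ({two_sd[0]:.1f}, {two_sd[1]:.1f}), share = {share_2sd:.0%}")
```

The lower 2 SD limit comes out negative, which rainfall cannot be. The observed shares are close to the empirical rule values even though the distribution is somewhat right-skewed.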

Boxplots

• Graph highlighting spread of set of measurements, highlighting quartiles and outliers.

• Constructing a boxplot:

– Draw box with top at Q3, bottom at Q1, and line crossing at median (Q2). Height of box is IQR = Q3 - Q1

– Compute “lower inner fence” = Q1-1.5(IQR) = LIF

– Compute “upper inner fence” = Q3+1.5(IQR) = UIF

– Compute “lower outer fence” = Q1-3.0(IQR) = LOF

– Compute “upper outer fence” = Q3+3.0(IQR) = UOF

– Draw line from Q3 to the largest y value that does not exceed UIF. Place ‘*’ for any y values between UIF and UOF, ‘o’ for any above UOF

– Draw line from Q1 to the smallest y value that is not below LIF. Place ‘*’ for any y values between LIF and LOF, ‘o’ for any below LOF

BoxPlot

[Figure: boxplot of monthly Philadelphia rainfall; axis from 0 to 2000]

UIF = 468+1.5(235.25) = 820.875    UOF = 468+3(235.25) = 1173.75
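The fence calculations can be sketched directly from the two quartiles. The example uses Q1 = 232.75 and Q3 = 468 from the Philadelphia rainfall example, so IQR = 468 - 232.75 = 235.25. The function names and the dictionary keys are assumptions for this illustration.

```python
def boxplot_fences(q1, q3):
    """Return the IQR and the inner/outer fences for a boxplot."""
    iqr = q3 - q1
    return {
        "IQR": iqr,
        "LIF": q1 - 1.5 * iqr,  # lower inner fence
        "UIF": q3 + 1.5 * iqr,  # upper inner fence
        "LOF": q1 - 3.0 * iqr,  # lower outer fence
        "UOF": q3 + 3.0 * iqr,  # upper outer fence
    }


def plot_symbol(y, fences):
    """'o' beyond an outer fence, '*' between inner and outer, '' otherwise."""
    if y < fences["LOF"] or y > fences["UOF"]:
        return "o"
    if y < fences["LIF"] or y > fences["UIF"]:
        return "*"
    return ""


fences = boxplot_fences(232.75, 468.0)
for name, value in fences.items():
    print(f"{name} = {value}")
```

Both lower fences are negative for these data, so no month can be flagged on the low side; only unusually wet months can appear as '*' or 'o'.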

Summarizing Data of More than One Variable

• Contingency Table: Cross-tabulation of units based on measurements of two qualitative variables simultaneously

• Stacked Bar Graph: Bar chart with one variable represented on the horizontal axis, second variable as subcategories within bars

• Cluster Bar Graph: Bar chart with one variable forming “major groupings” on horizontal axis, second variable used to make side-by-side comparisons within major groupings (displays all combinations in factorial expt)

• Scatterplot: Plot with quantitative variables y and x plotted against each other for each unit

• Side-by-Side Boxplot: Compares distributions by groups

Example - Ginkgo and Acetazolamide for Acute Mountain Syndrome Among Himalayan Trekkers

Contingency Table (Counts)

Treatment  AMS  No AMS  Total
Placebo     40      79    119
Acet        14     104    118
Ginkgo      43      81    124
Acc+Gi      18     108    126
Total      115     372    487

Percent Outcome by Treatment

Treatment    AMS  No AMS  Total
Placebo    33.61   66.39    100
Acet       11.86   88.14    100
Ginkgo     34.68   65.32    100
Acc+Gi     14.29   85.71    100
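The percent table is obtained from the counts by dividing each cell by its row (treatment) total. A minimal sketch of that step, with the counts copied from the contingency table; the nested-dictionary layout and the name `row_percents` are assumptions for this illustration.

```python
# Counts of trekkers by treatment (rows) and outcome (columns).
counts = {
    "Placebo": {"AMS": 40, "No AMS": 79},
    "Acet":    {"AMS": 14, "No AMS": 104},
    "Ginkgo":  {"AMS": 43, "No AMS": 81},
    "Acc+Gi":  {"AMS": 18, "No AMS": 108},
}


def row_percents(table):
    """Convert each row of counts to percentages of that row's total."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: 100 * n / total for col, n in cells.items()}
    return result


percents = row_percents(counts)
grand_total = sum(sum(cells.values()) for cells in counts.values())

for row, cells in percents.items():
    print(f"{row:<8} {cells['AMS']:6.2f} {cells['No AMS']:6.2f}")
```

Conditioning on treatment puts groups of different sizes (118 to 126 trekkers) on a common scale, so the AMS percentages can be compared directly.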

[Figure: Stacked Bar Graph of AMS Incidence (Percent); horizontal axis: Treatment (Placebo, Acet, Ginkgo, Acc+Gi); vertical axis: 0% to 100%; segments: AMS, No AMS]

[Figure: Cluster Bar Graph of AMS Incidence (Counts); horizontal axis: Treatment (Placebo, Acet, Ginkgo, Acc+Gi); vertical axis: Frequency, 0 to 120; bars: AMS, No AMS]

[Figure: 3-D Barchart of Incidence of AMS; axes: Treatment (Placebo, Acet, Ginkgo, Acc+Gi), Outcome (AMS, No AMS), Percent within Treatment (0.00 to 100.00)]

Scatterplots

• Identify the explanatory and response variables of interest, and label them as x and y

• Obtain a set of individuals and observe the pair (xi , yi) for each individual. There will be n pairs.

• Statistical convention has the response variable (y) placed on the vertical (up/down) axis and the explanatory variable (x) placed on the horizontal (left/right) axis. (Note: economists reverse axes in price/quantity demand plots)

• Plot the n pairs of points (x,y) on the graph

France August, 2003 Heat Wave Deaths

• Individuals: 13 cities in France

• Response: Excess Deaths (%) Aug 1-19, 2003 vs 1999-2002

• Explanatory Variable: Change in Mean Temp in period (C)

• Data:

City        Dth03  Dth9902  %chng (y)  Degchg (x)
Lille         200    192.3          4         4.0
Marseilles    571    456.8         25         4.3
Grenoble      148    115.6         28         6.3
Rennes        156    114.7         36         5.6
Toulouse      315    231.6         36         6.6
Bordeaux      318    222.4         43         6.2
Strasbourg    253    167.5         51         5.9
Nice          341    222.9         53         4.3
Poitiers      184    102.8         79         7.3
Lyon          447    248.3         80         6.8
Le Mans       204    112.1         82         7.0
Dijon         168     87.0         93         7.4
Paris        1854    766.1        142         6.7
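The %chng column appears to be the 2003 death count expressed as a percent change from the 1999-2002 figure. That formula is an inference from the table rather than something the slide states, and the sketch below checks it against a few rows. The function name `pct_excess` is hypothetical.

```python
def pct_excess(observed, baseline):
    """Percent change of the observed count relative to the baseline."""
    return 100 * (observed - baseline) / baseline


# (Dth03, Dth9902, %chng as tabulated) for a few of the 13 cities.
rows = {
    "Marseilles": (571, 456.8, 25),
    "Nice": (341, 222.9, 53),
    "Dijon": (168, 87.0, 93),
    "Paris": (1854, 766.1, 142),
}

computed = {
    city: pct_excess(dth03, dth9902)
    for city, (dth03, dth9902, _) in rows.items()
}

for city, value in computed.items():
    print(f"{city:<11} computed = {value:6.1f}  tabulated = {rows[city][2]}")
```

Rounded to the nearest whole percent, the computed values agree with the tabulated ones for these rows.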

France August, 2003 Heat Wave Deaths

[Figure: scatterplot, 2003 France Heat Wave Mortality; horizontal axis: Change in Mean Temp (Celsius), 3 to 8; vertical axis: Excess Mortality (%), 0 to 160; one point labeled "Possible Outlier"]

Example - Pharmacodynamics of LSD

• Response (y) - Math score (mean among 5 volunteers)

• Explanatory (x) - LSD tissue concentration (mean of 5 volunteers)

• Raw Data and scatterplot of Score vs LSD concentration:

Score (y)  LSD Conc (x)
78.93      1.17
58.20      2.97
67.47      3.26
37.47      4.69
45.65      5.83
32.92      6.00
29.97      6.41

[Figure: scatterplot; horizontal axis: LSD_CONC, 1 to 7; vertical axis: SCORE, 20 to 80]

Source: Wagner, et al (1968)

Manufacturer Production/Cost Relation

Month   Prod   Cost    Month   Prod   Cost    Month   Prod   Cost
 1     46.75  92.64     17    36.54  91.56     33    32.26  66.71
 2     42.18  88.81     18    37.03  84.12     34    30.97  64.37
 3     41.86  86.44     19    36.60  81.22     35    28.20  56.09
 4     43.29  88.80     20    37.58  83.35     36    24.58  50.25
 5     42.12  86.38     21    36.48  82.29     37    20.25  43.65
 6     41.78  89.87     22    38.25  80.92     38    17.09  38.01
 7     41.47  88.53     23    37.26  76.92     39    14.35  31.40
 8     42.21  91.11     24    38.59  78.35     40    13.11  29.45
 9     41.03  81.22     25    40.89  74.57     41     9.50  29.02
10     39.84  83.72     26    37.66  71.60     42     9.74  19.05
11     39.15  84.54     27    38.79  65.64     43     9.34  20.36
12     39.20  85.66     28    38.78  62.09     44     7.51  17.68
13     39.52  85.87     29    36.70  61.66     45     8.35  19.23
14     38.05  85.23     30    35.10  77.14     46     6.25  14.92
15     39.16  87.75     31    33.75  75.47     47     5.45  11.44
16     38.59  92.62     32    34.29  70.37     48     3.79  12.69

X= Amount Produced Y= Total Cost n=48 months (not in order)

Manufacturer Production/Cost Relation

[Figure: scatterplot, Production (x) / Cost (y) Relation; horizontal axis: Total Production, 0 to 50; vertical axis: Total Cost, 0 to 100]
