CHAPTER 3 Displaying and Describing Categorical Data.

73
CHAPTER 3 Displaying and Describing Categorical Data

Transcript of CHAPTER 3 Displaying and Describing Categorical Data.

Page 1: CHAPTER 3 Displaying and Describing Categorical Data.

CHAPTER 3

Displaying and Describing

Categorical Data

Page 2: CHAPTER 3 Displaying and Describing Categorical Data.

Objectives

• Frequency Table• Relative Frequency Table• Distribution• Area Principle• Bar Chart Pie Chart• Contingency Table• Marginal Distribution• Conditional Distribution• Independence• Segmented Bar Chart• Simpson’s Paradox

Page 3: CHAPTER 3 Displaying and Describing Categorical Data.

The Three Rules of Data Analysis

• The three rules of data analysis won’t be difficult to remember:

1. Make a picture—things may be revealed that are not obvious in the raw data. These will be things to think about.

2. Make a picture—important features of and patterns in the data will show up. You may also see things that you did not expect.

3. Make a picture—the best way to tell others about your data is with a well-chosen picture.

Page 4: CHAPTER 3 Displaying and Describing Categorical Data.

Frequency Table• What is a frequency Table? A

frequency table is an organization of raw data in tabular form, using classes (or intervals) and frequencies.

• What is a frequency count? The frequency or the frequency count for a data value is the number of times the value occurs in the data set.

Page 5: CHAPTER 3 Displaying and Describing Categorical Data.

Categorical Frequency Tables

• NOTE: Later we will consider qualitative frequency tables.

• What is a categorical frequency table? A categorical frequency table represents data that can be placed in specific categories, such as gender, hair color, political affiliation etc.

Page 6: CHAPTER 3 Displaying and Describing Categorical Data.

A frequency table is a tabular summary of data showing the frequency (or number) of items in each of several non-overlapping categories.

The objective is to provide insights about the data that cannot be quickly obtained by looking only at the original data.

Frequency Table

Page 7: CHAPTER 3 Displaying and Describing Categorical Data.

Frequency Tables:

• We can “organize” the data by counting the number of data values in each category of interest.

• We can organize these counts into a frequency table, which records the totals and the category names.

Page 8: CHAPTER 3 Displaying and Describing Categorical Data.

Categorical Frequency Table

• Example: The blood types of 25 blood donors are given below. Summarize the data using a frequency distribution.

AB B A O B O B O A O B O B B B A O AB AB O A B AB O A

Page 9: CHAPTER 3 Displaying and Describing Categorical Data.

Categorical Frequency Table for the Blood Types

Note: The classes for the distribution are the blood types.

Page 10: CHAPTER 3 Displaying and Describing Categorical Data.

Your Turn

• Guests staying at Marada Inn were asked to rate the quality of their accommodations as being excellent (E),above average (AA), average (A), below average (BA), or poor (P). The ratings provided by a sample of 20 guests:

• BA, AA, AA, A, AA, A, AA, A, AA, BA, P, E, AA, A, AA, AA, BA, P, AA, A .

• Make a frequency table.

Page 11: CHAPTER 3 Displaying and Describing Categorical Data.

Categorical Frequency Distribution

Rating Counts

Poor (P) 2

Below Average (BA) 3

Average (A) 5

Above Average (AA) 9

Excellent (E) 1

Total 20

Page 12: CHAPTER 3 Displaying and Describing Categorical Data.

Relative Frequency Tables:• A relative frequency table is similar, but gives the relative

frequency, a decimal or percentage (instead of counts) for each category.

Page 13: CHAPTER 3 Displaying and Describing Categorical Data.

The relative frequency of a class is the fraction or proportion of the total number of data items belonging to the class.

A relative frequency table is a tabular summary of a set of data showing the relative frequency for each class.

Relative Frequency Table

Page 14: CHAPTER 3 Displaying and Describing Categorical Data.

Percent Frequency Table

The percent frequency of a class is the relative frequency multiplied by 100.

A percent frequency table is a tabular summary of a set of data showing the percent frequency for each class.

Page 15: CHAPTER 3 Displaying and Describing Categorical Data.

Relative & Percent Categorical Frequency Table

• Using the frequency table below, from the Marada Inn problem, create a relative and percent frequency table.

• Add two additional columns labeled relative frequency and percent frequency.

Rating Counts

Poor (P) 2

Below Average (BA) 3

Average (A) 5

Above Average (AA) 9

Excellent (E) 1

Total 20

Page 16: CHAPTER 3 Displaying and Describing Categorical Data.

Relative & Percent Categorical Frequency Table

Rating Frequency Rel. Freq. %Freq

Poor 2 2/20 = .10 10%

Below avg. 3 3/20 = .15 15%

Avg. 5 5/20 = .25 25%

Above avg. 9 9/20 = .45 45%

Excellent 1 1/20 = .05 5%

Total: 20 1 100%

Page 17: CHAPTER 3 Displaying and Describing Categorical Data.

Frequency Tables:

• All three types of tables show how cases are distributed across the categories.

• They describe the distribution of a categorical variable because they name the possible categories and tell how frequently each occurs.

Page 19: CHAPTER 3 Displaying and Describing Categorical Data.

Misleading Statistics

• Now that we have the frequency table, we are ready to make a picture or a graph of the data.

• Misleading graphs

• Scale• Pictographs

Page 20: CHAPTER 3 Displaying and Describing Categorical Data.

Misleading Statistics - Scale

• Adjusting the scale of a graph is a common way to mislead (or lie) with statistics.

• Example:

Page 21: CHAPTER 3 Displaying and Describing Categorical Data.

Misleading Scale

Page 22: CHAPTER 3 Displaying and Describing Categorical Data.

Misleading Statistics

• The best data displays observe a fundamental principle of graphing data called the area principle.

• The area principle says that the area occupied by a part of the graph should correspond to the magnitude of the value it represents.

• Violations of the area principle are another common way of misleading with statistics.

Page 23: CHAPTER 3 Displaying and Describing Categorical Data.

What’s Wrong With This Picture?

• You might think that

a good way to show

the Titanic data is

with this display:

Page 24: CHAPTER 3 Displaying and Describing Categorical Data.

What’s Wrong With This Picture?

• The ship display makes it look like most of the people on the Titanic were crew members, with a few passengers along for the ride.

• When we look at each ship, we see the area taken up by the ship, instead of the length of the ship.

• The ship display violates the area principle: • The area occupied by a part of the

graph should correspond to the magnitude of the value it represents. Slide 3 - 24

Page 25: CHAPTER 3 Displaying and Describing Categorical Data.

Double the length, width, and height of a cube, and the volume increases

by a factor of eight

Area Principle - Pictographs

Page 26: CHAPTER 3 Displaying and Describing Categorical Data.

Area Principle Pictographs

Page 27: CHAPTER 3 Displaying and Describing Categorical Data.

GRAPHING CATEGORICAL OR QUALITATIVE DATA

Page 28: CHAPTER 3 Displaying and Describing Categorical Data.

Ways to Graph Categorical Data

Because the variable is categorical, the data in the graph can be ordered any way we want (alphabetical, by increasing value, by year, by personal preference, etc.).

1. Bar Charts – Each category is represented by a bar.

2. Pie Charts - The slices must represent the parts of one whole.

Page 29: CHAPTER 3 Displaying and Describing Categorical Data.

Bar Charts

• A bar chart displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison.

• A bar chart stays true to the area principle.

• Thus, a better display for the ship data is:

Page 30: CHAPTER 3 Displaying and Describing Categorical Data.

Bar Charts (cont.)

• A relative frequency bar chart displays the relative proportion of counts for each category.

• A relative frequency bar chart also stays true to the area principle.

• Replacing counts with percentages in the ship data:

Slide 3 - 30

Page 31: CHAPTER 3 Displaying and Describing Categorical Data.

Bar Charts

A bar chart is a graphical device for depicting qualitative data.

On one axis (usually the horizontal axis), we specify the labels that are used for each of the categories. A frequency, relative frequency, or percent frequency scale can be used for the other axis (usually the vertical axis).

Using a bar of fixed width drawn above each class label, we extend the height appropriately.

The bars are separated to emphasize the fact that each class is a separate category.

Page 32: CHAPTER 3 Displaying and Describing Categorical Data.

Bar Charts

• Either counts (frequency bar chart) or proportions (relative frequency bar chart) may be shown on the y-axis. This will not change the shape or relationships of the graph.

• Make sure all graphs have a descriptive title and that the axes are labeled (this is true for all graphs in AP Stats).

Page 33: CHAPTER 3 Displaying and Describing Categorical Data.

Pie Charts

• When you are interested in parts of the whole, a pie chart might be your display of choice.

• Pie charts show the whole group of cases as a circle.

• They slice the circle into pieces whose size is proportional to the fraction of the whole in each category.

Slide 3 - 33

Page 34: CHAPTER 3 Displaying and Describing Categorical Data.

Pie Chart The pie chart is a commonly used graphical device for presenting relative frequency distributions for qualitative data.

First draw a circle; then subdivide the circle into sectors that correspond in area to the relative frequency for each category.

Since there are 360 degrees in a circle, a category with a relative frequency of .25 would consume .25(360) = 90 degrees of the circle.

Page 35: CHAPTER 3 Displaying and Describing Categorical Data.

Relations between Two Categorical Variables

• Examples:• Is gender or race related to political

preference? • What type of music can make people

relax? • Will different packaging of the same

product attract people with different social-economic background?

• A contingency table or two-way table is a way to display the data from two categorical variables. A sort of Venn Diagram which shows how a population splits according to two factors.

Page 36: CHAPTER 3 Displaying and Describing Categorical Data.

Contingency Tables• A contingency table allows us to look at two categorical

variables together. • It shows how individuals are distributed along each variable,

contingent on the value of the other variable.• Example: we can examine the class of ticket and whether a

person survived the Titanic:

Page 37: CHAPTER 3 Displaying and Describing Categorical Data.

Contingency Tables (cont.)• The margins of the table, both on the right and on the

bottom, give totals and the frequency distributions for each of the variables.

• Each frequency distribution is called a marginal distribution of its respective variable.• The marginal distribution of Survival is:

Page 38: CHAPTER 3 Displaying and Describing Categorical Data.

Contingency Tables (cont.)• Each cell of the table gives the count for a combination

of values of the two values.• For example, the second cell in the crew column

tells us that 673 crew members died when the Titanic sunk.

Page 39: CHAPTER 3 Displaying and Describing Categorical Data.

Conditional Distributions

• A conditional distribution shows the distribution of one variable for just the individuals who satisfy some condition on another variable.• The following is the conditional distribution of ticket

Class, conditional on having survived:

Page 40: CHAPTER 3 Displaying and Describing Categorical Data.

Conditional Distributions (cont.)

• The following is the conditional distribution of ticket Class, conditional on having perished:

Page 41: CHAPTER 3 Displaying and Describing Categorical Data.

Conditional Distributions (cont.)

• The conditional distributions tell us that there is a difference in class for those who survived and those who perished.

• This is better shown with pie charts of the two distributions:

Page 42: CHAPTER 3 Displaying and Describing Categorical Data.

Conditional Distributions (cont.)

• We see that the distribution of Class for the survivors is different from that of the nonsurvivors.

• This leads us to believe that Class and Survival are associated, that they are not independent.

• The variables would be considered independent when the distribution of one variable in a contingency table is the same for all categories of the other variable.

Page 43: CHAPTER 3 Displaying and Describing Categorical Data.

Segmented Bar Charts

• A segmented bar chart displays the same information as a pie chart, but in the form of bars instead of circles.

• Each bar is treated as the “whole” and is divided proportionally into segments corresponding o the percentage in each group.

• Here is the segmented bar chart for ticket Class by Survival status:

Slide 3 - 43

Page 44: CHAPTER 3 Displaying and Describing Categorical Data.

Example: Income level vs. Job Satisfaction

Income Job Satisfaction Row Total

1 2 3 4 < 30K 20 24 80 82

206 30K-50K 22 38 104 125

289 50K-80K 13 28 81 113

235 > 80K 7 18 54 92

171 C. Total 62 108 319 412

901

• This is a Contingency table with Income Level as the Row Variable and Job Satisfaction as the Column Variable.

• The distributions of income to job satisfaction or job satisfaction to income are called Conditional Distributions.

• The distributions of income alone and job satisfaction alone are called Marginal Distributions.

• Relationships between categorical variables are described by calculating appropriate percents from the counts given in each cell.

Conditionaldistribution Marginal

distribution

Tabletotal

Page 45: CHAPTER 3 Displaying and Describing Categorical Data.

Example:

• A Statistics class reports the following data on sex and eye color for students in the class:

1. What percent of females are brown-eyed?

2. What percent of brown-eyed students are female?

3. What percent of students are brown-eyed females?

4. What’s the distribution of eye color?

5. What’s the conditional distribution of eye color for the males?

6. Compare the percent who are female among the blue-eyed students to the percent of all students who are female?

7. Does it seem that eye color and sex are independent? Explain.

Eye Color

Sex

Blue Brown Green/Hazel/Other Total

Males 6 20 6 32

Females 4 16 12 32

Total 10 36 18 64

Page 46: CHAPTER 3 Displaying and Describing Categorical Data.

Solution: Eye Color

Sex

Blue Brown Green/Hazel/Other Total

Males 6 20 6 32

Females 4 16 12 32

Total 10 36 18 64

1. What percent of females are brown-eyed? 16/32 = .5 or 50%

2. What percent of brown-eyed students are female? 16/36 = .444 or 44.4%

3. What percent of students are brown-eyed females? 16/64 = .25 or 25%

4. What’s the distribution of eye color? 10/64 = .156 or 15.6% Blue, 36/64 = .563 or 56.3% Brown, 18/64 =.281 or 28.1% Green/Hazel/Other

5. What’s the conditional distribution of eye color for the males? 6/32 = .188 or 18.8% Blue, 20/32 = .625 or 62.5% Brown, 6/32 = .188 or 18.8% Green/Hazel/Other

6. Compare the percent who are female among the blue-eyed students to the percent of all students who are female? 4/10 = .4 or 40% of the blue-eyed students are female, while 32/64 = .5 or 50% of all students are female.

7. Does it seem that eye color and sex are independent? Explain. Since blue-eyed students appear less likely to be female, it seems that Sex and Eye Color may not be independent. (But the numbers are small.)

Page 47: CHAPTER 3 Displaying and Describing Categorical Data.

SIMPSON’S PARADOX

Page 48: CHAPTER 3 Displaying and Describing Categorical Data.

Simpson’s Paradox• Discovered by E. H. Simpson in 1951.• Occurs when averaging different samples of different

sizes• Two groups from one sample are compared to two similar

groups from another sample• One sample’s success rate for both

groups is higher than the success rates for the other sample

Not E. H. Simpson

Page 49: CHAPTER 3 Displaying and Describing Categorical Data.

Simpson’s Paradox

• However, when both groups’ respective success rates are combined, the sample with the lower success rate ends up with the better overall proportion of successes. Thus, the paradox.

• One sample usually has a considerably smaller number of members than the other groups

• Simpson’s Paradox does not occur in populations with similar amounts

Page 50: CHAPTER 3 Displaying and Describing Categorical Data.

What is Simpson’s Paradox?

• Simpson’s Paradox occurs when an association between two variables is reversed upon observing a third variable.

Page 51: CHAPTER 3 Displaying and Describing Categorical Data.

• Simpson’s paradox lurking variable creates a reversal in the direction of an association (“confounding”)

• To uncover Simpson’s Paradox, divide data into subgroups based on the lurking variable

Simpson’s Paradox

Page 52: CHAPTER 3 Displaying and Describing Categorical Data.

Recent Cleveland Indians season records2003—68-94, 42.0% winning percentage

2004—80-82, 49.4% winning percentage

Two-season record: 148-176, 45.7% win percentage

Recent Minnesota Twins season records2003—90-72, 55.6% win percentage

2004—92-70, 56.8% win percentage

Two-season record: 182-142, 56.2% win percentage

Notice that the Twins had a higher percentage in both 2003 and 2004, as well as in the two-year period. Not Simpson’s Paradox.

Page 53: CHAPTER 3 Displaying and Describing Categorical Data.

Recent Cleveland Indians season records2003—68-94, 42.0% winning percentage

2004—80-82, 49.4% winning percentage

Two-season record: 148-176, 45.7% win percentage

Recent Minnesota Twins season records2003—90-72, 55.6% win percentage

2004—92-70, 56.8% win percentage

Two-season record: 182-142, 56.2% win percentage

Notice that the Twins had a higher percentage in both 2003 and 2004, as well as in the two-year period. Not Simpson’s Paradox.

Page 54: CHAPTER 3 Displaying and Describing Categorical Data.

Simpson’s Paradox at workRonnie Belliard

2002—61/289, .211 of his at-bats were hits

2003—124/447, .277 of his at-bats were hits

Two-season average: 185/736, hits .2514 of the time

Casey Blake

2002—4/20, .200 of his at-bats were hits

2003—143/557, .257 of his at-bats were hits

Two-season average: 147/577, hits .2548 of the time

The two season batting avg. for Belliard was lower than Blake’s, but divided into separate seasons, Belliard’s had a higher batting avg. both seasons.

This is Simpson’s Paradox.

Page 55: CHAPTER 3 Displaying and Describing Categorical Data.

Consider college acceptance rates by sex

Discrimination? (Simpson’s Paradox)

Accepted Notaccepted

Total

Men 198 162 360

Women 88 112 200

Total 286 274 560

198 of 360 (55%) of men accepted 88 of 200 (44%) of women accepted Is there a sex bias?

Page 56: CHAPTER 3 Displaying and Describing Categorical Data.

Discrimination? (Simpson’s Paradox)

• Or is there a lurking variable that explains the association?

• To evaluate this, split applications according to the lurking variable "major applied to”• Business School (240 applicants) • Art School (320 applicants)

Page 57: CHAPTER 3 Displaying and Describing Categorical Data.

Discrimination? (Simpson’s Paradox)

18 of 120 men (15%) of men were accepted to B-school24 of 120 (20%) of women were accepted to B-schoolA higher percentage of women were accepted

BUSINESS SCHOOL

Accepted Notaccepted

Total

Men 18 102 120

Women 24 96 120

Total 42 198 240

Page 58: CHAPTER 3 Displaying and Describing Categorical Data.

Discrimination (Simpson’s Paradox)

ART SCHOOL

180 of 240 men (75%) of men were accepted64 of 80 (80%) of women were accepted A higher percentage of women were accepted.

Accepted Notaccepted

Total

Men 180 60 240

Women 64 16 80

Total 244 76 320

Page 59: CHAPTER 3 Displaying and Describing Categorical Data.

• Within each school, a higher percentage of women were accepted than men.

• No discrimination against women• Possible discrimination against men• This is an example of Simpson’s Paradox.

• When the lurking variable (School applied to) was ignored, the data suggest discrimination against women.

• When the School applied to was considered, the association is reversed.

Discrimination? (Simpson’s Paradox)

Page 60: CHAPTER 3 Displaying and Describing Categorical Data.

Colin R. Blyth’s example of Simpson’s Paradox

• A doctor was planning to try a new treatment on patients mostly local (C) and a few in Chicago (C’). A statistician advised him to use a table of random numbers and as each C patient became available, assign him to the new treatment with probability .91, leave him to the standard treatment with probability .09; and the same for C’ patient with probability .01 and .99 respectively. When the doctor returned with the data the statistician told him that the new treatment was obviously a very bad one, and criticized him for having continued trying it on so many patients.

Treatment Standard New

Dead 5950 9005

Alive 5050 1095

(46%) (11%)

Page 61: CHAPTER 3 Displaying and Describing Categorical Data.

• The doctor replied that he continued because the new treatment was obviously a very good one, having nearly doubled the recovery rate in both cities.

C patient only C’ patient only

Treatment Standard New Standard New

Dead 950 9000 5000 5

Alive 50 1000 5000 95

5% 10% 50% 95%

Page 62: CHAPTER 3 Displaying and Describing Categorical Data.
Page 63: CHAPTER 3 Displaying and Describing Categorical Data.

Smokers’ Example

• In England a study was conducted to examine the survival rates of smokers and non-smokers. The result implied a significant positive correlation between smoking & survival rates because only 24% of smokers died as compared to 31% of non-smokers. When the data were broken down by age group in a contingency table, it was found that there were more older people in the non-smoker group. Thus age played a very significant role in the outcome but since it was overlooked the researchers were left with deceiving results. (Appleton & French, 1996).

Page 64: CHAPTER 3 Displaying and Describing Categorical Data.
Page 65: CHAPTER 3 Displaying and Describing Categorical Data.

The Paradox

What’s true for the parts isn’t true for the whole.

Page 66: CHAPTER 3 Displaying and Describing Categorical Data.

CONCLUSION!!!!

Simpson’s paradox is a rare phenomenon! It does not occur often! Thus statisticians must be trained academically & ethically well enough to make sure that if it has occurred they will detect and correct it. This is where practice, critical thinking skills, and repetition come into play!

Page 67: CHAPTER 3 Displaying and Describing Categorical Data.

What Can Go Wrong?

• Don’t violate the area principle.

• While some people might like the pie chart on the left better, it is harder to compare fractions of the whole, which a well-done pie chart does.

Page 68: CHAPTER 3 Displaying and Describing Categorical Data.

What Can Go Wrong? (cont.)

• Keep it honest—make sure your display shows what it says it shows.

• This plot of the percentage of high-school students who engage in specified dangerous behaviors has a problem. Can you see it?

Page 69: CHAPTER 3 Displaying and Describing Categorical Data.

What Can Go Wrong? (cont.)

• Don’t confuse similar-sounding percentages—pay particular attention to the wording of the context.

• Don’t forget to look at the variables separately too—examine the marginal distributions, since it is important to know how many cases are in each category.

Page 70: CHAPTER 3 Displaying and Describing Categorical Data.

What Can Go Wrong? (cont.)

• Be sure to use enough individuals!

• Do not make a report like “We found that 66.67% of

the rats improved their performance with training.

The other rat died.”

Page 71: CHAPTER 3 Displaying and Describing Categorical Data.

What Can Go Wrong? (cont.)

• Don’t overstate your case—don’t claim something you can’t.

• Don’t use unfair or silly averages—this could lead to Simpson’s Paradox, so be careful when you average one variable across different levels of a second variable.

Page 72: CHAPTER 3 Displaying and Describing Categorical Data.

What have we learned?

• We can summarize categorical data by counting the number of cases in each category (expressing these as counts or percents).

• We can display the distribution in a bar chart or pie chart.

• And, we can examine two-way tables called contingency tables, examining marginal and/or conditional distributions of the variables.

Page 73: CHAPTER 3 Displaying and Describing Categorical Data.

Assignment

• Exercises pg. 37 – 43: #5, 7, 11, 16, 21, 23, 27 -31 odd, 37

• Read Ch-4, pg. 44 - 71